From f53eb59ec37ca769e8c064c105a6e5dd6d75d298 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Felix=20F=C3=B6rtsch?=
Date: Fri, 13 Mar 2026 21:09:13 +0100
Subject: [PATCH] add greenfield design spec for Vorleser rebuild

Co-Authored-By: Claude Opus 4.6
---
 .../2026-03-13-vorleser-greenfield-design.md  | 283 ++++++++++++++++++
 1 file changed, 283 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-03-13-vorleser-greenfield-design.md

diff --git a/docs/superpowers/specs/2026-03-13-vorleser-greenfield-design.md b/docs/superpowers/specs/2026-03-13-vorleser-greenfield-design.md
new file mode 100644
index 0000000..7eca2ba
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-13-vorleser-greenfield-design.md
@@ -0,0 +1,283 @@
+# Vorleser Greenfield Design
+
+**Date:** 2026-03-13
+**Status:** Draft
+
+## Overview
+
+Vorleser is a macOS + iOS app that turns EPUB and plain text files into spoken audio using on-device AI text-to-speech. The user imports a book, sees the text, taps any word to start listening from that position, and the app remembers where they left off.
+
+Quality is the top priority — if the voice isn't pleasant to listen to, nothing else matters.
+
+## Technical Stack
+
+- **TTS model:** Kokoro-82M v1.0 via MLX Swift
+- **Phonemization:** MisakiSwift (pure Swift port of Kokoro's official G2P library, misaki)
+- **Runtime:** MLX Swift (Apple's ML framework, dynamic shapes, no CoreML bucket pain)
+- **Platforms:** iOS + macOS from day one
+- **Persistence:** SwiftData
+- **EPUB parsing:** ZIPFoundation + SwiftSoup
+- **Project generation:** XcodeGen
+
+No GPL dependencies. No C libraries. Pure Swift throughout.
+
+**App size note:** Kokoro-82M weights are ~330MB. This is bundled in the app for v1. If App Store review flags the size, on-demand resources or a first-launch download can be added later without architectural changes.
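+
+The stack above could map to a `Package.swift` along these lines (a sketch only — the version pins and the MisakiSwift package URL are illustrative assumptions, not decisions):
+
+```swift
+// swift-tools-version:5.9
+// Sketch of VorleserKit's manifest. Versions and the MisakiSwift URL are
+// placeholders to be confirmed; the real repositories for mlx-swift,
+// ZIPFoundation, and SwiftSoup are as shown.
+import PackageDescription
+
+let package = Package(
+    name: "VorleserKit",
+    platforms: [.iOS(.v17), .macOS(.v14)],  // SwiftData requires iOS 17 / macOS 14
+    products: [
+        .library(name: "VorleserKit", targets: ["VorleserKit"])
+    ],
+    dependencies: [
+        .package(url: "https://github.com/ml-explore/mlx-swift", from: "0.18.0"),
+        .package(url: "https://github.com/weichsel/ZIPFoundation", from: "0.9.0"),
+        .package(url: "https://github.com/scinfu/SwiftSoup", from: "2.7.0"),
+        // Placeholder URL — confirm the actual MisakiSwift repository.
+        .package(url: "https://github.com/example/MisakiSwift", from: "1.0.0"),
+    ],
+    targets: [
+        .target(
+            name: "VorleserKit",
+            dependencies: [
+                .product(name: "MLX", package: "mlx-swift"),
+                "ZIPFoundation",
+                "SwiftSoup",
+                .product(name: "MisakiSwift", package: "MisakiSwift"),
+            ]
+        ),
+        .testTarget(name: "VorleserKitTests", dependencies: ["VorleserKit"]),
+    ]
+)
+```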
+
+## Architecture
+
+### Package Structure
+
+```
+Vorleser/
+├── VorleserKit/          # Swift Package — the core library
+│   ├── Sources/
+│   │   ├── VorleserKit/  # Public API, orchestration, shared types (CharacterOffset, SentenceSegmenter)
+│   │   ├── BookParser/   # EPUB + plain text parsing
+│   │   ├── Synthesizer/  # Kokoro MLX + MisakiSwift integration
+│   │   ├── AudioEngine/  # Playback, buffering, position tracking
+│   │   └── Storage/      # SwiftData models, reading state
+│   ├── Tests/
+│   └── Package.swift
+├── Vorleser-iOS/         # Thin iOS app shell
+├── Vorleser-macOS/       # Thin macOS app shell
+└── project.yml           # XcodeGen
+```
+
+VorleserKit is the product. The app shells are SwiftUI wrappers. The library is testable and drivable without UI:
+
+```swift
+let kit = VorleserKit()
+let book = try kit.open(file: "1984.epub")
+let session = try await kit.play(book: book, from: .character(15030))
+```
+
+### Dependencies (all via SPM)
+
+- **MisakiSwift** — text → phonemes
+- **mlx-swift** — Kokoro inference
+- **ZIPFoundation** — EPUB extraction
+- **SwiftSoup** — HTML → text
+
+## Module Design
+
+### Shared Types (VorleserKit module)
+
+Types used across multiple modules live in the top-level VorleserKit module.
+
+```swift
+/// A position in a book, measured in characters from the start.
+public typealias CharacterOffset = Int
+```
+
+### BookParser
+
+Turns files into a uniform in-memory representation.
+
+**Supported formats:**
+- **EPUB** — unzip → parse OPF spine → extract XHTML chapters → SwiftSoup to plain text
+- **Plain text** — split on double newlines into chapters, or treat as single chapter
+
+**Core types:**
+
+```swift
+public struct Book {
+    let id: UUID
+    let title: String
+    let author: String?
+    let chapters: [Chapter]
+
+    /// Computed lazily on first access. Sentence segmentation is separate from parsing —
+    /// parsing extracts chapter text, segmentation splits it for playback and navigation.
+    lazy var sentences: [Sentence]
+    func sentenceContaining(offset: CharacterOffset) -> Int  // sentence index
+    func chapterAndLocalOffset(for offset: CharacterOffset) -> (Int, Int)
+}
+
+public struct Chapter {
+    let index: Int
+    let title: String
+    let text: String  // plain text, whitespace-normalized
+}
+```
+
+**Character addressing:** Every character has a global offset across all chapters. `Book` provides mapping between global character offset ↔ (chapter index, local offset). A single integer identifies any position in the book.
+
+**Parsing is eager** — the entire book is parsed on open. EPUBs are typically <1MB of text, so this is fast and avoids lazy loading complexity.
+
+**Re-parsing:** Books are re-parsed from their source file each time they are opened. The parsed `Book` is an in-memory struct, not cached. Since parsing is fast (<100ms for typical EPUBs), this avoids stale-cache issues and keeps Storage simple.
+
+**Error handling:** Malformed EPUBs (missing spine, DRM-encrypted content) cause `BookParser` to throw a descriptive error — the import fails and the user sees the reason. Individual chapters with unparseable XHTML are included with empty text and a title indicating the parse failure, so the book structure is preserved even if some chapters are broken.
+
+### Sentence Segmentation
+
+Sentence splitting is a shared concern used by AudioEngine (to resolve character offsets and navigate sentences) and the UI (to highlight the active sentence). It lives in the top-level VorleserKit module alongside shared types.
+
+```swift
+public struct SentenceSegmenter {
+    /// Splits text into sentences with their character ranges.
+    static func segment(_ text: String) -> [Sentence]
+}
+
+public struct Sentence {
+    let text: String
+    let range: Range<Int>  // character range within the source text
+}
+```
+
+**Implementation:** Uses the NaturalLanguage framework's `NLTokenizer` with the `.sentence` unit. This handles abbreviations ("Dr.", "U.S.A."), decimal numbers, and other edge cases via Apple's linguistic models. No custom parsing.
+
+### Synthesizer
+
+Wraps MisakiSwift + Kokoro MLX into a single interface. Accepts a single sentence and returns its audio.
+
+**Pipeline:**
+
+```
+sentence text → MisakiSwift (G2P) → phonemes → Kokoro MLX → PCM audio (24kHz float32)
+```
+
+**Core interface:**
+
+```swift
+public class Synthesizer {
+    init(voice: VoicePack) async throws
+    func synthesize(text: String) async throws -> [Float]  // PCM samples at 24kHz
+}
+```
+
+The caller (AudioEngine) is responsible for sentence segmentation. Synthesizer receives sentence-length text and returns raw `[Float]` PCM at 24kHz. AudioEngine wraps this into `AVAudioPCMBuffer` for playback.
+
+**No internal chunking.** The Synthesizer trusts that it receives sentence-length input. If the input happens to be longer than one sentence, the model will still process it — quality may degrade for very long inputs, but there is no internal splitting or crossfade logic. Keeping this simple avoids duplicating the sentence segmentation that AudioEngine already performs.
+
+**Voice packs:** Curated set of 2-3 voices shipped as bundled resources.
+
+```swift
+public struct VoicePack {
+    let name: String      // e.g. "af_bella"
+    let language: String  // e.g. "en-us"
+
+    // Loaded from bundle at runtime
+    static func bundled() -> [VoicePack]
+}
+```
+
+**Model loading:** Kokoro weights + MisakiSwift dictionaries are bundled in the app. No download step.
+
+**Error handling:** If `init` fails (model cannot be loaded, out of memory on smaller devices), it throws with a descriptive error surfaced to the user. If `synthesize` fails for a specific sentence (MisakiSwift cannot phonemize the text, e.g. non-Latin scripts, mathematical notation), it throws — AudioEngine catches this, skips the sentence, and advances to the next one. The user sees a brief indication that a sentence was skipped.
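+
+The contract between segmentation, synthesis, and skip-on-failure can be sketched as glue code (hypothetical — `chapter`, `playbackQueue`, and `showSkippedSentenceIndicator()` are stand-ins for illustration, not part of the interfaces above):
+
+```swift
+// Synthesize one chapter sentence-by-sentence, skipping any sentence that
+// fails to phonemize, per the error-handling behavior described above.
+let synthesizer = try await Synthesizer(voice: VoicePack.bundled()[0])
+for sentence in SentenceSegmenter.segment(chapter.text) {
+    do {
+        let pcm = try await synthesizer.synthesize(text: sentence.text)
+        playbackQueue.enqueue(pcm)  // raw [Float] at 24kHz; AVAudioPCMBuffer wrapping happens downstream
+    } catch {
+        showSkippedSentenceIndicator()  // brief, non-blocking notice to the user
+    }
+}
+```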
+
+### AudioEngine
+
+Manages playback, buffering, and position tracking.
+
+**Core interface:**
+
+```swift
+public class AudioEngine {
+    func play(book: Book, from: CharacterOffset, using: Synthesizer) async throws
+    func pause()
+    func resume()
+    func stop()
+    func skipForward()   // jump to next sentence
+    func skipBackward()  // jump to previous sentence
+
+    var currentPosition: CharacterOffset { get }  // observable
+    var state: PlaybackState { get }  // .idle, .synthesizing, .playing, .paused
+}
+```
+
+**Playback flow:**
+
+AudioEngine uses the book's sentence index to iterate through sentences. Each sentence's text is passed to `Synthesizer.synthesize(text:)`.
+
+1. Resolve character offset to the enclosing sentence (via `Book`'s sentence index)
+2. Synthesize that sentence → PCM audio
+3. Play via `AVAudioEngine`
+4. While playing, synthesize the next sentence (one-ahead buffer)
+5. When current finishes, advance position, start next
+6. Update `currentPosition` as each sentence starts playing
+
+The one-ahead buffer is the only prefetching in v1. Deep pipeline streaming (multi-sentence lookahead, concurrent synthesis) is a later optimization.
+
+**skipForward/skipBackward:** Navigate the book's sentence index. Skip forward stops current playback and begins synthesis + playback of the next sentence. Skip backward does the same for the previous sentence.
+
+**Position tracking:** Sentence-level granularity. `currentPosition` updates to the start of the currently playing sentence. This is sufficient for the tap-to-resume use case — tapping a word snaps to the enclosing sentence anyway. Sub-sentence tracking (per-word timestamps) is not planned for v1.
+
+**Error handling:**
+- If `AVAudioEngine` fails to start (another app has exclusive audio, hardware unavailable): throw on `play()`, surface error to user.
+- If synthesis of the next sentence fails mid-playback: skip the failed sentence, advance to the one after. Log the failure.
+- Audio route changes (Bluetooth disconnect): `AVAudioEngine` handles this automatically — playback continues on the new default route.
+- iOS interruptions (phone call, Siri): playback pauses and stays paused — the user resumes manually. This is the standard iOS audiobook/podcast behavior.
+
+**Platform notes:**
+- iOS: `AVAudioSession` playback category, background audio mode, interruption handling as described above.
+- macOS: `AVAudioEngine` directly, no session management needed.
+
+### Storage
+
+Persists library and reading state via SwiftData.
+
+```swift
+@Model class StoredBook {
+    var bookID: UUID
+    var title: String
+    var author: String?
+    var sourceFileName: String  // filename of the copy in app documents
+    var dateAdded: Date
+    var lastPosition: Int       // global character offset
+    var lastRead: Date?
+    var voiceName: String?      // selected voice, nil = default
+}
+```
+
+**File storage:** Imported files are copied into the app's documents directory. `sourceFileName` references the copy, not the original.
+
+**Duplicate imports:** Importing the same file again creates a new copy and a new `StoredBook`. No deduplication — the user may want to track position separately for a re-read. The file list makes duplicates visible.
+
+**Missing files:** If the copied source file is missing when the user opens a book (e.g. deleted via Files app), the app shows an error and offers to re-import or remove the entry.
+
+**Reading position:** Updated on pause, stop, or app backgrounding. Just an integer.
+
+**Book deletion:** Removing a book deletes the `StoredBook` record and its copied file from app documents.
+
+**No iCloud sync in v1.** Schema supports it later.
+
+## App Shells
+
+Thin SwiftUI layers over VorleserKit.
+
+### Views
+
+- **LibraryView** — book list sorted by last read. Import button for EPUB/TXT. Swipe to delete. Tap → ReaderView.
+- **ReaderView** — scrollable text. Tap a word → play from there. Active sentence highlighting. Chapter navigation.
+- **PlaybackControls** — play/pause, skip sentence forward/back. Bottom of ReaderView.
+- **SettingsView** — voice selection with preview.
+
+### Platform Differences
+
+| | iOS | macOS |
+|---|---|---|
+| File import | `.fileImporter` sheet | `.fileImporter` or drag-and-drop |
+| Layout | Single column, tab navigation | Sidebar (library) + detail (reader) |
+| Text interaction | Tap word | Click word |
+| Audio session | AVAudioSession config | Not needed |
+
+### Tap-to-Play Interaction
+
+1. User taps a word in the text
+2. View resolves tap to character offset using a platform text view (`UITextView` on iOS, `NSTextView` on macOS) wrapped in SwiftUI. These views natively support hit-testing a point to a character index, via `closestPosition(to:)` on iOS and `characterIndexForInsertion(at:)` on macOS. The text view is styled to look like a reading view (no editing, no cursor).
+3. Calls `audioEngine.play(book:from:using:)` with that offset
+4. Engine snaps to the enclosing sentence boundary (via the book's sentence index), begins synthesis + playback
+5. View observes `currentPosition` and uses the book's sentence index to highlight the active sentence via attributed string ranges
+
+## What's Explicitly Out of Scope (v1)
+
+- Deep pipeline streaming (multi-sentence lookahead beyond one-ahead buffer)
+- iCloud sync
+- Playback speed control
+- PDF support
+- More than 2-3 curated voices
+- Localized UI (English only, though architecture supports it)
+- Background downloads or model updates
+- Per-word position tracking / word-level highlighting
+- Caching parsed book text (re-parse on each open)
+- Latency optimization (acceptable to wait for synthesis before first audio plays)