add greenfield design spec for Vorleser rebuild

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 21:09:13 +01:00
parent f852e3e97d
commit f53eb59ec3
1 changed files with 283 additions and 0 deletions
--- a/docs/superpowers/specs/2026-03-13-vorleser-greenfield-design.md
+++ b/docs/superpowers/specs/2026-03-13-vorleser-greenfield-design.md
@@ -0,0 +1,283 @@
+# Vorleser Greenfield Design
+
+**Date:** 2026-03-13
+**Status:** Draft
+
+## Overview
+
+Vorleser is a macOS + iOS app that turns EPUB and plain text files into spoken audio using on-device AI text-to-speech. The user imports a book, sees the text, taps any word to start listening from that position, and the app remembers where they left off.
+
+Quality is the top priority — if the voice isn't pleasant to listen to, nothing else matters.
+
+## Technical Stack
+
+- **TTS model:** Kokoro-82M v1.0 via MLX Swift
+- **Phonemization:** MisakiSwift (pure Swift port of Kokoro's official G2P library, misaki)
+- **Runtime:** MLX Swift (Apple's ML framework; supports dynamic input shapes, so no fixed-size Core ML input buckets)
+- **Platforms:** iOS + macOS from day one
+- **Persistence:** SwiftData
+- **EPUB parsing:** ZIPFoundation + SwiftSoup
+- **Project generation:** XcodeGen
+
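+No GPL dependencies and no hand-integrated C libraries: every dependency is consumed as a Swift package. (mlx-swift itself wraps MLX's C++ and Metal core, so the stack is Swift at the API level rather than literally pure Swift.)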
+
+**App size note:** Kokoro-82M weights are ~330MB. They are bundled in the app for v1. If App Store review flags the size, on-demand resources or a first-launch download can be added later without architectural changes.
+
+## Architecture
+
+### Package Structure
+
+```
+Vorleser/
+├── VorleserKit/                    # Swift Package — the core library
+│   ├── Sources/
+│   │   ├── VorleserKit/            # Public API, orchestration, shared types (CharacterOffset, SentenceSegmenter)
+│   │   ├── BookParser/             # EPUB + plain text parsing
+│   │   ├── Synthesizer/            # Kokoro MLX + MisakiSwift integration
+│   │   ├── AudioEngine/            # Playback, buffering, position tracking
+│   │   └── Storage/                # SwiftData models, reading state
+│   ├── Tests/
+│   └── Package.swift
+├── Vorleser-iOS/                   # Thin iOS app shell
+├── Vorleser-macOS/                 # Thin macOS app shell
+└── project.yml                     # XcodeGen
+```
+
+VorleserKit is the product. The app shells are SwiftUI wrappers. The library is testable and drivable without UI:
+
+```swift
+let kit = VorleserKit()
+let book = try kit.open(file: "1984.epub")
+let session = try await kit.play(book: book, from: 15030)  // global CharacterOffset (a plain Int)
+```
+
+### Dependencies (all via SPM)
+
+- **MisakiSwift** — text → phonemes
+- **mlx-swift** — Kokoro inference
+- **ZIPFoundation** — EPUB extraction
+- **SwiftSoup** — HTML → text
+
+## Module Design
+
+### Shared Types (VorleserKit module)
+
+Types used across multiple modules live in the top-level VorleserKit module.
+
+```swift
+/// A position in a book, measured in characters from the start.
+public typealias CharacterOffset = Int
+```
+
+### BookParser
+
+Turns files into a uniform in-memory representation.
+
+**Supported formats:**
+- **EPUB** — unzip → parse OPF spine → extract XHTML chapters → SwiftSoup to plain text
+- **Plain text** — split on double newlines into chapters, or treat as single chapter
+
+**Core types:**
+
+```swift
+public struct Book {
+	let id: UUID
+	let title: String
+	let author: String?
+	let chapters: [Chapter]
+
+	/// Built on first access. Sentence segmentation is separate from parsing:
+	/// parsing extracts chapter text, segmentation splits it for playback and navigation.
+	/// Note: `lazy var` in a struct makes every read mutating, so a `let book` cannot
+	/// use it; a class-backed cache or eager segmentation in `init` avoids that.
+	lazy var sentences: [Sentence]
+	func sentenceContaining(offset: CharacterOffset) -> Int  // sentence index
+	func chapterAndLocalOffset(for offset: CharacterOffset) -> (chapter: Int, localOffset: Int)
+}
+
+public struct Chapter {
+	let index: Int
+	let title: String
+	let text: String              // plain text, whitespace-normalized
+}
+```
+
+**Character addressing:** Every character has a global offset across all chapters. `Book` provides mapping between global character offset ↔ (chapter index, local offset). A single integer identifies any position in the book.
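+
+A minimal sketch of that mapping, assuming chapters are concatenated with no separator between them. `ChapterIndex` and its members are hypothetical names used for illustration, not part of the spec.
+
+```swift
+/// Hypothetical helper: maps global character offsets to (chapter, local offset).
+struct ChapterIndex {
+	/// starts[i] is the global offset of the first character of chapter i.
+	let starts: [Int]
+	let totalLength: Int
+
+	init(chapterLengths: [Int]) {
+		var starts: [Int] = []
+		var running = 0
+		for length in chapterLengths {
+			starts.append(running)
+			running += length
+		}
+		self.starts = starts
+		self.totalLength = running
+	}
+
+	/// Returns nil when the offset lies outside the book.
+	func locate(_ offset: Int) -> (chapter: Int, localOffset: Int)? {
+		guard offset >= 0, offset < totalLength else { return nil }
+		// Binary search for the last chapter whose start is <= offset.
+		var low = 0
+		var high = starts.count - 1
+		while low < high {
+			let mid = (low + high + 1) / 2
+			if starts[mid] <= offset { low = mid } else { high = mid - 1 }
+		}
+		return (low, offset - starts[low])
+	}
+
+	/// Inverse mapping; the caller is expected to pass a valid chapter index.
+	func globalOffset(chapter: Int, localOffset: Int) -> Int {
+		starts[chapter] + localOffset
+	}
+}
+```
+
+Because the search picks the last chapter whose start is at or before the offset, empty chapters (for example, ones whose XHTML failed to parse) never claim a position.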
+
+**Parsing is eager** — the entire book is parsed on open. EPUBs are typically <1MB of text, so this is fast and avoids lazy loading complexity.
+
+**Re-parsing:** Books are re-parsed from their source file each time they are opened. The parsed `Book` is an in-memory struct, not cached. Since parsing is fast (<100ms for typical EPUBs), this avoids stale-cache issues and keeps Storage simple.
+
+**Error handling:** Malformed EPUBs (missing spine, DRM-encrypted content) cause `BookParser` to throw a descriptive error — the import fails and the user sees the reason. Individual chapters with unparseable XHTML are included with empty text and a title indicating the parse failure, so the book structure is preserved even if some chapters are broken.
+
+### Sentence Segmentation
+
+Sentence splitting is a shared concern used by AudioEngine (to resolve character offsets and navigate sentences) and the UI (to highlight the active sentence). It lives in the top-level VorleserKit module alongside shared types.
+
+```swift
+public struct SentenceSegmenter {
+	/// Splits text into sentences with their character ranges.
+	static func segment(_ text: String) -> [Sentence]
+}
+
+public struct Sentence {
+	let text: String
+	let range: Range<CharacterOffset>  // character range within the source text
+}
+```
+
+**Implementation:** Uses `NLTokenizer` from Apple's NaturalLanguage framework with the `.sentence` unit. This handles abbreviations ("Dr.", "U.S.A."), decimal numbers, and other edge cases via Apple's linguistic models. No custom parsing.
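+
+A sketch of how the segmenter might look, assuming offsets count Swift `Character`s (the spec does not pin down the unit). `SketchSentence` and `segmentSentences` are hypothetical names; only `NLTokenizer` and `enumerateTokens(in:using:)` are real API.
+
+```swift
+import NaturalLanguage
+
+/// Hypothetical stand-in for the spec's `Sentence`, with Character-based offsets.
+struct SketchSentence {
+	let text: String
+	let range: Range<Int>
+}
+
+func segmentSentences(_ text: String) -> [SketchSentence] {
+	let tokenizer = NLTokenizer(unit: .sentence)
+	tokenizer.string = text
+
+	var result: [SketchSentence] = []
+	var cursor = text.startIndex   // String.Index reached so far
+	var offset = 0                 // Character offset matching `cursor`
+
+	tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { tokenRange, _ in
+		// Advance incrementally so the whole pass stays linear in the text length.
+		offset += text.distance(from: cursor, to: tokenRange.lowerBound)
+		let start = offset
+		offset += text.distance(from: tokenRange.lowerBound, to: tokenRange.upperBound)
+		cursor = tokenRange.upperBound
+		result.append(SketchSentence(text: String(text[tokenRange]), range: start..<offset))
+		return true   // keep enumerating
+	}
+	return result
+}
+```
+
+Token ranges may include trailing whitespace, so a caller that passes `text` on to synthesis may want to trim it first.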
+
+### Synthesizer
+
+Wraps MisakiSwift + Kokoro MLX into a single interface. Accepts a single sentence and returns its audio.
+
+**Pipeline:**
+
+```
+sentence text → MisakiSwift (G2P) → phonemes → Kokoro MLX → PCM audio (24kHz float32)
+```
+
+**Core interface:**
+
+```swift
+public class Synthesizer {
+	init(voice: VoicePack) async throws
+	func synthesize(text: String) async throws -> [Float]  // PCM samples at 24kHz
+}
+```
+
+The caller (AudioEngine) is responsible for sentence segmentation. Synthesizer receives sentence-length text and returns raw `[Float]` PCM at 24kHz. AudioEngine wraps this into `AVAudioPCMBuffer` for playback.
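+
+The wrapping step could look roughly like this. `makeBuffer` is a hypothetical helper name; the mono, non-interleaved float32 layout is an assumption based on the pipeline description above.
+
+```swift
+import AVFoundation
+
+/// Hypothetical helper: copies mono Float samples into an AVAudioPCMBuffer.
+/// Returns nil for empty input or if the format/buffer cannot be created.
+func makeBuffer(from samples: [Float], sampleRate: Double = 24_000) -> AVAudioPCMBuffer? {
+	guard !samples.isEmpty,
+		let format = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: sampleRate, channels: 1, interleaved: false),
+		let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: AVAudioFrameCount(samples.count)),
+		let channel = buffer.floatChannelData?[0]
+	else { return nil }
+
+	// Copy the samples into the single (mono) channel.
+	for (i, sample) in samples.enumerated() {
+		channel[i] = sample
+	}
+	// frameLength tells the player how many frames in the buffer are valid.
+	buffer.frameLength = AVAudioFrameCount(samples.count)
+	return buffer
+}
+```
+
+Connecting the player node to the mixer with `buffer.format` should let the mixer convert from 24kHz to the hardware sample rate, though that is worth verifying on both platforms.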
+
+**No internal chunking.** The Synthesizer trusts that it receives sentence-length input. If the input happens to be longer than one sentence, the model will still process it — quality may degrade for very long inputs, but there is no internal splitting or crossfade logic. Keeping this simple avoids duplicating the sentence segmentation that AudioEngine already performs.
+
+**Voice packs:** Curated set of 2-3 voices shipped as bundled resources.
+
+```swift
+public struct VoicePack {
+	let name: String              // e.g. "af_bella"
+	let language: String          // e.g. "en-us"
+
+	// Loaded from bundle at runtime
+	static func bundled() -> [VoicePack]
+}
+```
+
+**Model loading:** Kokoro weights + MisakiSwift dictionaries are bundled in the app. No download step.
+
+**Error handling:** If `init` fails (model cannot be loaded, out of memory on smaller devices), it throws with a descriptive error surfaced to the user. If `synthesize` fails for a specific sentence (MisakiSwift cannot phonemize the text, e.g. non-Latin scripts, mathematical notation), it throws — AudioEngine catches this, skips the sentence, and advances to the next one. The user sees a brief indication that a sentence was skipped.
+
+### AudioEngine
+
+Manages playback, buffering, and position tracking.
+
+**Core interface:**
+
+```swift
+public class AudioEngine {
+	func play(book: Book, from: CharacterOffset, using: Synthesizer) async throws
+	func pause()
+	func resume()
+	func stop()
+	func skipForward()            // jump to next sentence
+	func skipBackward()           // jump to previous sentence
+
+	var currentPosition: CharacterOffset { get }    // observable
+	var state: PlaybackState { get }    // .idle, .synthesizing, .playing, .paused
+}
+```
+
+**Playback flow:**
+
+AudioEngine uses the book's sentence index to iterate through sentences. Each sentence's text is passed to `Synthesizer.synthesize(text:)`.
+
+1. Resolve character offset to the enclosing sentence (via `Book`'s sentence index)
+2. Synthesize that sentence → PCM audio
+3. Play via `AVAudioEngine`
+4. While playing, synthesize the next sentence (one-ahead buffer)
+5. When current finishes, advance position, start next
+6. Update `currentPosition` as each sentence starts playing
+
+The one-ahead buffer is the only prefetching in v1. Deep pipeline streaming (multi-sentence lookahead, concurrent synthesis) is a later optimization.
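+
+The loop above, including the skip-on-failure behavior, can be sketched as follows. The two protocols and `playOneAhead` are hypothetical stand-ins for the real `Synthesizer` and `AVAudioEngine` plumbing; pause, stop, and cancellation are left out.
+
+```swift
+/// Hypothetical stand-ins for the spec's Synthesizer and audio output.
+protocol SentenceSynthesizing: Sendable {
+	func synthesize(text: String) async throws -> [Float]
+}
+protocol SamplePlaying: Sendable {
+	/// Returns once the samples have finished playing.
+	func play(_ samples: [Float]) async
+}
+
+/// Plays sentences in order, synthesizing sentence n+1 while sentence n plays.
+/// Sentences whose synthesis fails are skipped. Returns the indices that played.
+func playOneAhead<S: SentenceSynthesizing, P: SamplePlaying>(
+	sentences: [String],
+	from startIndex: Int,
+	synthesizer: S,
+	player: P,
+	onSentenceStart: (Int) -> Void = { _ in }
+) async -> [Int] {
+	var played: [Int] = []
+	guard sentences.indices.contains(startIndex) else { return played }
+
+	let first = sentences[startIndex]
+	var pending: Task<[Float], Error>? = Task { try await synthesizer.synthesize(text: first) }
+	var index = startIndex
+
+	while let current = pending {
+		// Wait for the current sentence's audio (or its failure).
+		let result = await current.result
+
+		// Start synthesizing the next sentence before playing this one.
+		let next = index + 1
+		if next < sentences.count {
+			let text = sentences[next]
+			pending = Task { try await synthesizer.synthesize(text: text) }
+		} else {
+			pending = nil
+		}
+
+		if case .success(let samples) = result {
+			onSentenceStart(index)   // e.g. update currentPosition
+			await player.play(samples)
+			played.append(index)
+		}
+		index = next
+	}
+	return played
+}
+```
+
+Only one synthesis task runs at a time, which matches the one-ahead limit: the next task is created only after the current one has finished.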
+
+**skipForward/skipBackward:** Navigate the book's sentence index. Skip forward stops current playback and begins synthesis+playback of the next sentence. Skip backward does the same for the previous sentence.
+
+**Position tracking:** Sentence-level granularity. `currentPosition` updates to the start of the currently playing sentence. This is sufficient for the tap-to-resume use case — tapping a word snaps to the enclosing sentence anyway. Sub-sentence tracking (per-word timestamps) is not planned for v1.
+
+**Error handling:**
+- If `AVAudioEngine` fails to start (another app has exclusive audio, hardware unavailable): throw on `play()`, surface error to user.
+- If synthesis of the next sentence fails mid-playback: skip the failed sentence, advance to the one after. Log the failure.
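+- Audio route changes (Bluetooth disconnect): `AVAudioEngine` can stop itself and post `AVAudioEngineConfigurationChange` when the output hardware changes, so AudioEngine observes that notification and restarts the engine; playback then continues on the new default route.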
+- iOS interruptions (phone call, Siri): playback pauses and stays paused — the user resumes manually. This is the standard iOS audiobook/podcast behavior.
+
+**Platform notes:**
+- iOS: `AVAudioSession` playback category, background audio mode, interruption handling as described above.
+- macOS: `AVAudioEngine` directly, no session management needed.
+
+### Storage
+
+Persists library and reading state via SwiftData.
+
+```swift
+@Model class StoredBook {
+	var bookID: UUID
+	var title: String
+	var author: String?
+	var sourceFileName: String    // filename of the copy in app documents
+	var dateAdded: Date
+	var lastPosition: Int         // global character offset
+	var lastRead: Date?
+	var voiceName: String?        // selected voice, nil = default
+}
+```
+
+**File storage:** Imported files are copied into the app's documents directory. `sourceFileName` references the copy, not the original.
+
+**Duplicate imports:** Importing the same file again creates a new copy and a new `StoredBook`. No deduplication — the user may want to track position separately for a re-read. The file list makes duplicates visible.
+
+**Missing files:** If the copied source file is missing when the user opens a book (e.g. deleted via Files app), the app shows an error and offers to re-import or remove the entry.
+
+**Reading position:** Updated on pause, stop, or app backgrounding. Just an integer.
+
+**Book deletion:** Removing a book deletes the `StoredBook` record and its copied file from app documents.
+
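+**No iCloud sync in v1.** The schema is small enough to adapt later; note that CloudKit-backed SwiftData expects every property to be optional or to have a default value.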
+
+## App Shells
+
+Thin SwiftUI layers over VorleserKit.
+
+### Views
+
+- **LibraryView** — book list sorted by last read. Import button for EPUB/TXT. Swipe to delete. Tap → ReaderView.
+- **ReaderView** — scrollable text. Tap a word → play from there. Active sentence highlighting. Chapter navigation.
+- **PlaybackControls** — play/pause, skip sentence forward/back. Bottom of ReaderView.
+- **SettingsView** — voice selection with preview.
+
+### Platform Differences
+
+| | iOS | macOS |
+|---|---|---|
+| File import | `.fileImporter` sheet | `.fileImporter` or drag-and-drop |
+| Layout | Single column, tab navigation | Sidebar (library) + detail (reader) |
+| Text interaction | Tap word | Click word |
+| Audio session | AVAudioSession config | Not needed |
+
+### Tap-to-Play Interaction
+
+1. User taps a word in the text
+2. View resolves tap to character offset using a platform text view (`UITextView` on iOS, `NSTextView` on macOS) wrapped in SwiftUI. These views natively support hit-testing to character index via `closestPosition(to:)` / `characterIndex(for:)`. The text view is styled to look like a reading view (no editing, no cursor).
+3. Calls `audioEngine.play(book:from:using:)` with that offset
+4. Engine snaps to enclosing sentence boundary (via the book's sentence index), begins synthesis + playback
+5. View observes `currentPosition` and uses the book's sentence index to highlight the active sentence via attributed string ranges
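+
+One detail the steps above leave open is the unit of the tapped index: `UITextView` and `NSTextView` report positions in UTF-16 code units, while `CharacterOffset` is described as a count of characters. Assuming `CharacterOffset` counts Swift `Character`s (an assumption, the spec does not say), a conversion along these lines would be needed in step 2. `characterOffset(forUTF16Offset:in:)` is a hypothetical helper name.
+
+```swift
+/// Hypothetical helper: converts a UTF-16 offset (as reported by the platform
+/// text views) into a count of Swift Characters from the start of `text`.
+/// Returns nil when the offset is out of bounds.
+func characterOffset(forUTF16Offset utf16Offset: Int, in text: String) -> Int? {
+	guard utf16Offset >= 0, utf16Offset <= text.utf16.count else { return nil }
+	let utf16Index = text.utf16.index(text.utf16.startIndex, offsetBy: utf16Offset)
+	// Counts the Characters that precede the tapped position.
+	return text.distance(from: text.startIndex, to: utf16Index)
+}
+```
+
+The two units only diverge for text containing characters outside the Basic Multilingual Plane or combining sequences (emoji, some accented forms). Defining `CharacterOffset` as a UTF-16 offset instead would make this conversion unnecessary.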
+
+## What's Explicitly Out of Scope (v1)
+
+- Deep pipeline streaming (multi-sentence lookahead beyond one-ahead buffer)
+- iCloud sync
+- Playback speed control
+- PDF support
+- More than 2-3 curated voices
+- Localized UI (English only, though architecture supports it)
+- Background downloads or model updates
+- Per-word position tracking / word-level highlighting
+- Caching parsed book text (re-parse on each open)
+- Latency optimization (acceptable to wait for synthesis before first audio plays)