
Vorleser Greenfield Design

Date: 2026-03-13 Status: Draft

Overview

Vorleser is a macOS + iOS app that turns EPUB and plain text files into spoken audio using on-device AI text-to-speech. The user imports a book, sees the text, taps any word to start listening from that position, and the app remembers where they left off.

Quality is the top priority — if the voice isn't pleasant to listen to, nothing else matters.

Technical Stack

  • TTS model: Kokoro-82M v1.0 via MLX Swift
  • Phonemization: MisakiSwift (pure Swift port of Kokoro's official G2P library, misaki)
  • Runtime: MLX Swift (Apple's ML framework, dynamic shapes, no CoreML bucket pain)
  • Platforms: iOS + macOS from day one
  • Persistence: SwiftData
  • EPUB parsing: ZIPFoundation + SwiftSoup
  • Project generation: XcodeGen

No GPL dependencies. No C libraries. Pure Swift throughout.

App size note: Kokoro-82M weights are ~600MB (safetensors format) plus ~14MB for voice embeddings (voices.npz). This is bundled in the app for v1. If App Store review flags the size, on-demand resources or a first-launch download can be added later without architectural changes.

Platform constraints: iOS 18.0+ / macOS 15.0+ (required by KokoroSwift and MisakiSwift). MLX does not work in the iOS Simulator — real device required for TTS testing. Swift 6.2 toolchain required.

Architecture

Package Structure

Vorleser/
├── VorleserKit/                    # Swift Package — the core library
│   ├── Sources/
│   │   ├── VorleserKit/            # Public API, orchestration, shared types (CharacterOffset, SentenceSegmenter)
│   │   ├── BookParser/             # EPUB + plain text parsing
│   │   ├── Synthesizer/            # Kokoro MLX + MisakiSwift integration
│   │   ├── AudioEngine/            # Playback, buffering, position tracking
│   │   └── Storage/                # SwiftData models, reading state
│   ├── Tests/
│   └── Package.swift
├── Vorleser-iOS/                   # Thin iOS app shell
├── Vorleser-macOS/                 # Thin macOS app shell
└── project.yml                     # XcodeGen

VorleserKit is the product. The app shells are SwiftUI wrappers. The library is testable and drivable without UI:

let kit = VorleserKit()
let book = try kit.open(file: "1984.epub")
let session = try await kit.play(book: book, from: 15_030)

Dependencies (all via SPM)

  • MisakiSwift — text → phonemes
  • mlx-swift — Kokoro inference
  • ZIPFoundation — EPUB extraction
  • SwiftSoup — HTML → text

Module Design

Shared Types (VorleserKit module)

Types used across multiple modules live in the top-level VorleserKit module.

/// A position in a book, measured in characters from the start.
public typealias CharacterOffset = Int

BookParser

Turns files into a uniform in-memory representation.

Supported formats:

  • EPUB — unzip → parse OPF spine → extract XHTML chapters → SwiftSoup to plain text
  • Plain text — split on double newlines into chapters, or treat as single chapter
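The EPUB step above can be sketched for a single chapter. This is a minimal illustration, assuming SwiftSoup's `parse(_:)`/`text()` API; the helper name is hypothetical, and whitespace normalization is folded in to match `Chapter.text`'s contract.

```swift
import Foundation
import SwiftSoup

/// Convert one XHTML chapter body into whitespace-normalized plain text.
/// Errors bubble up to the import flow, per the error-handling rules below.
func plainText(fromXHTML xhtml: String) throws -> String {
    let document = try SwiftSoup.parse(xhtml)
    // text() collapses tags and inline markup into a single text run.
    let raw = try document.text()
    // Normalize whitespace runs, matching the "whitespace-normalized" contract.
    return raw
        .components(separatedBy: .whitespacesAndNewlines)
        .filter { !$0.isEmpty }
        .joined(separator: " ")
}
```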

Core types:

public struct Book {
	let id: UUID
	let title: String
	let author: String?
	let chapters: [Chapter]

	/// Computed lazily on first access. Sentence segmentation is separate from parsing —
	/// parsing extracts chapter text, segmentation splits it for playback and navigation.
	lazy var sentences: [Sentence]
	func sentenceContaining(offset: CharacterOffset) -> Int  // sentence index
	func chapterAndLocalOffset(for offset: CharacterOffset) -> (Int, Int)
}

public struct Chapter {
	let index: Int
	let title: String
	let text: String              // plain text, whitespace-normalized
}

Character addressing: Every character has a global offset across all chapters. Book provides mapping between global character offset ↔ (chapter index, local offset). A single integer identifies any position in the book.
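The mapping can be sketched as a prefix-sum index over chapter lengths. This is an illustrative standalone type (not the actual `Book` implementation), assuming at least one chapter and an in-range offset:

```swift
/// Maps a global character offset to (chapter index, local offset) and back,
/// given the chapters' text lengths in spine order.
struct OffsetIndex {
    private let starts: [Int]   // global offset at which each chapter begins

    init(chapterLengths: [Int]) {
        var starts: [Int] = []
        var running = 0
        for length in chapterLengths {
            starts.append(running)
            running += length
        }
        self.starts = starts
    }

    func chapterAndLocalOffset(for global: Int) -> (chapter: Int, local: Int) {
        // Binary search for the last chapter starting at or before `global`.
        var low = 0, high = starts.count - 1
        while low < high {
            let mid = (low + high + 1) / 2
            if starts[mid] <= global { low = mid } else { high = mid - 1 }
        }
        return (low, global - starts[low])
    }

    func globalOffset(chapter: Int, local: Int) -> Int {
        starts[chapter] + local
    }
}
```

The same index can back `sentenceContaining(offset:)` by searching sentence ranges instead of chapter starts.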

Parsing is eager — the entire book is parsed on open. EPUBs are typically <1MB of text, so this is fast and avoids lazy loading complexity.

Re-parsing: Books are re-parsed from their source file each time they are opened. The parsed Book is an in-memory struct, not cached. Since parsing is fast (<100ms for typical EPUBs), this avoids stale-cache issues and keeps Storage simple.

Error handling: Malformed EPUBs (missing spine, DRM-encrypted content) cause BookParser to throw a descriptive error — the import fails and the user sees the reason. Individual chapters with unparseable XHTML are included with empty text and a title indicating the parse failure, so the book structure is preserved even if some chapters are broken.

Sentence Segmentation

Sentence splitting is a shared concern used by AudioEngine (to resolve character offsets and navigate sentences) and the UI (to highlight the active sentence). It lives in the top-level VorleserKit module alongside shared types.

public struct SentenceSegmenter {
	/// Splits text into sentences with their character ranges.
	static func segment(_ text: String) -> [Sentence]
}

public struct Sentence {
	let text: String
	let range: Range<CharacterOffset>  // character range within the source text
}

Implementation: Uses Foundation's NLTokenizer with .sentence unit. This handles abbreviations ("Dr.", "U.S.A."), decimal numbers, and other edge cases via Apple's linguistic models. No custom parsing.
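A minimal sketch of `segment(_:)` using `NLTokenizer` (Apple platforms only). The tokenizer reports `String.Index` ranges, which are converted to `Character`-based offsets to match `CharacterOffset`; the local type definitions mirror the declarations above.

```swift
import NaturalLanguage

typealias CharacterOffset = Int

struct Sentence {
    let text: String
    let range: Range<CharacterOffset>  // character range within the source text
}

struct SentenceSegmenter {
    static func segment(_ text: String) -> [Sentence] {
        let tokenizer = NLTokenizer(unit: .sentence)
        tokenizer.string = text
        var result: [Sentence] = []
        tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
            let start = text.distance(from: text.startIndex, to: range.lowerBound)
            let end = text.distance(from: text.startIndex, to: range.upperBound)
            result.append(Sentence(text: String(text[range]), range: start..<end))
            return true   // keep enumerating
        }
        return result
    }
}
```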

Synthesizer

Wraps MisakiSwift + Kokoro MLX into a single interface. Accepts a single sentence and returns its audio.

Pipeline:

sentence text → MisakiSwift (G2P) → phonemes → Kokoro MLX → PCM audio (24kHz float32)

Core interface:

public class Synthesizer {
	init(voice: VoicePack) async throws
	func synthesize(text: String) async throws -> [Float]  // PCM samples at 24kHz
}

The caller (AudioEngine) is responsible for sentence segmentation. Synthesizer receives sentence-length text and returns raw [Float] PCM at 24kHz. AudioEngine wraps this into AVAudioPCMBuffer for playback.
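The wrapping step might look like the following sketch (the helper name is assumed, not part of the AudioEngine interface). The standard format is deinterleaved float32, so the samples copy directly into channel 0:

```swift
import AVFoundation

/// Wrap Synthesizer output (24kHz mono float32 PCM) for AVAudioEngine playback.
func makeBuffer(from samples: [Float]) -> AVAudioPCMBuffer? {
    guard let format = AVAudioFormat(standardFormatWithSampleRate: 24_000, channels: 1),
          let buffer = AVAudioPCMBuffer(pcmFormat: format,
                                        frameCapacity: AVAudioFrameCount(samples.count))
    else { return nil }
    buffer.frameLength = AVAudioFrameCount(samples.count)
    samples.withUnsafeBufferPointer { src in
        buffer.floatChannelData![0].update(from: src.baseAddress!, count: samples.count)
    }
    return buffer
}
```

The resulting buffer would be handed to an `AVAudioPlayerNode` via `scheduleBuffer(_:)`.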

No internal chunking. The Synthesizer trusts that it receives sentence-length input. If the input happens to be longer than one sentence, the model will still process it — quality may degrade for very long inputs, but there is no internal splitting or crossfade logic. Keeping this simple avoids duplicating the sentence segmentation that AudioEngine already performs.

Voice packs: Curated set of 2-3 voices shipped as bundled resources.

public struct VoicePack {
	let name: String              // e.g. "af_bella"
	let language: String          // e.g. "en-us"

	// Loaded from bundle at runtime
	static func bundled() -> [VoicePack]
}

Model loading: Kokoro weights + MisakiSwift dictionaries are bundled in the app. No download step.

Error handling: If init fails (model cannot be loaded, out of memory on smaller devices), it throws with a descriptive error surfaced to the user. If synthesize fails for a specific sentence (MisakiSwift cannot phonemize the text, e.g. non-Latin scripts, mathematical notation), it throws — AudioEngine catches this, skips the sentence, and advances to the next one. The user sees a brief indication that a sentence was skipped.

AudioEngine

Manages playback, buffering, and position tracking.

Core interface:

public class AudioEngine {
	func play(book: Book, from: CharacterOffset, using: Synthesizer) async throws
	func pause()
	func resume()
	func stop()
	func skipForward()            // jump to next sentence
	func skipBackward()           // jump to previous sentence

	var currentPosition: CharacterOffset { get }    // observable
	var state: PlaybackState { get }    // .idle, .synthesizing, .playing, .paused
}

Playback flow:

AudioEngine uses the book's sentence index to iterate through sentences. Each sentence's text is passed to Synthesizer.synthesize(text:).

  1. Resolve character offset to the enclosing sentence (via Book's sentence index)
  2. Synthesize that sentence → PCM audio
  3. Play via AVAudioEngine
  4. While playing, synthesize the next sentence (one-ahead buffer)
  5. When current finishes, advance position, start next
  6. Update currentPosition as each sentence starts playing

The one-ahead buffer is the only prefetching in v1. Deep pipeline streaming (multi-sentence lookahead, concurrent synthesis) is a later optimization.
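The one-ahead loop can be sketched with synthesis and playback as injected closures, so the control flow stands alone; `synthesize` and `play` stand in for `Synthesizer.synthesize(text:)` and the AVAudioEngine scheduling, and a synthesis failure yields `nil` so the sentence is skipped as described under error handling:

```swift
/// One-ahead playback: while the current sentence plays, the next synthesizes.
func playSequence(
    _ sentences: [String],
    synthesize: @escaping @Sendable (String) async -> [Float]?,
    play: ([Float]) async -> Void
) async {
    guard !sentences.isEmpty else { return }
    var pending = Task { await synthesize(sentences[0]) }
    for index in sentences.indices {
        let current = await pending.value
        // Kick off the next synthesis before playing the current audio.
        if index + 1 < sentences.count {
            let nextText = sentences[index + 1]
            pending = Task { await synthesize(nextText) }
        }
        if let audio = current {
            await play(audio)   // nil → sentence skipped, loop advances
        }
    }
}
```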

skipForward/skipBackward: Navigate the book's sentence index. Skip forward stops current playback and begins synthesis+playback of the next sentence. Skip backward does the same for the previous sentence.

Position tracking: Sentence-level granularity. currentPosition updates to the start of the currently playing sentence. This is sufficient for the tap-to-resume use case — tapping a word snaps to the enclosing sentence anyway. Sub-sentence tracking (per-word timestamps) is not planned for v1.

Error handling:

  • If AVAudioEngine fails to start (another app has exclusive audio, hardware unavailable): throw on play(), surface error to user.
  • If synthesis of the next sentence fails mid-playback: skip the failed sentence, advance to the one after. Log the failure.
  • Audio route changes (Bluetooth disconnect): AVAudioEngine handles this automatically — playback continues on the new default route.
  • iOS interruptions (phone call, Siri): playback pauses and stays paused — the user resumes manually. This is the standard iOS audiobook/podcast behavior.
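The pause-and-stay-paused policy maps to a standard interruption observer on iOS; a sketch, with `pausePlayback` standing in for `AudioEngine.pause()`:

```swift
import AVFoundation

/// Pause on interruption begin; .ended is deliberately ignored so the user
/// resumes manually, per the podcast/audiobook convention.
func observeInterruptions(pausePlayback: @escaping () -> Void) -> NSObjectProtocol {
    NotificationCenter.default.addObserver(
        forName: AVAudioSession.interruptionNotification,
        object: AVAudioSession.sharedInstance(),
        queue: .main
    ) { note in
        guard let raw = note.userInfo?[AVAudioSessionInterruptionTypeKey] as? UInt,
              AVAudioSession.InterruptionType(rawValue: raw) == .began
        else { return }
        pausePlayback()
    }
}
```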

Platform notes:

  • iOS: AVAudioSession playback category, background audio mode, interruption handling as described above.
  • macOS: AVAudioEngine directly, no session management needed.

Storage

Persists library and reading state via SwiftData.

@Model class StoredBook {
	var bookID: UUID
	var title: String
	var author: String?
	var sourceFileName: String    // filename of the copy in app documents
	var dateAdded: Date
	var lastPosition: Int         // global character offset
	var lastRead: Date?
	var voiceName: String?        // selected voice, nil = default
}

File storage: Imported files are copied into the app's documents directory. sourceFileName references the copy, not the original.

Duplicate imports: Importing the same file again creates a new copy and a new StoredBook. No deduplication — the user may want to track position separately for a re-read. The file list makes duplicates visible.

Missing files: If the copied source file is missing when the user opens a book (e.g. deleted via Files app), the app shows an error and offers to re-import or remove the entry.

Reading position: Updated on pause, stop, or app backgrounding. Just an integer.
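The write might look like the following sketch, assuming the caller has a `ModelContext` and the engine's current offset at hand. The helper name is hypothetical, and `StoredBook` is trimmed here to the fields the write touches (see the full model above):

```swift
import Foundation
import SwiftData

@Model final class StoredBook {
    var bookID: UUID
    var lastPosition: Int
    var lastRead: Date?
    init(bookID: UUID, lastPosition: Int = 0) {
        self.bookID = bookID
        self.lastPosition = lastPosition
    }
}

/// Persist the reading position on pause, stop, or app backgrounding.
func savePosition(_ offset: Int, forBook bookID: UUID, in context: ModelContext) throws {
    var descriptor = FetchDescriptor<StoredBook>(
        predicate: #Predicate { $0.bookID == bookID }
    )
    descriptor.fetchLimit = 1
    guard let stored = try context.fetch(descriptor).first else { return }
    stored.lastPosition = offset
    stored.lastRead = .now
    try context.save()
}
```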

Book deletion: Removing a book deletes the StoredBook record and its copied file from app documents.

No iCloud sync in v1. Schema supports it later.

App Shells

Thin SwiftUI layers over VorleserKit.

Views

  • LibraryView — book list sorted by last read. Import button for EPUB/TXT. Swipe to delete. Tap → ReaderView.
  • ReaderView — scrollable text. Tap a word → play from there. Active sentence highlighting. Chapter navigation.
  • PlaybackControls — play/pause, skip sentence forward/back. Bottom of ReaderView.
  • SettingsView — voice selection with preview.

Platform Differences

|                  | iOS                           | macOS                               |
| ---------------- | ----------------------------- | ----------------------------------- |
| File import      | .fileImporter sheet           | .fileImporter or drag-and-drop      |
| Layout           | Single column, tab navigation | Sidebar (library) + detail (reader) |
| Text interaction | Tap word                      | Click word                          |
| Audio session    | AVAudioSession config         | Not needed                          |

Tap-to-Play Interaction

  1. User taps a word in the text
  2. View resolves tap to character offset using a platform text view (UITextView on iOS, NSTextView on macOS) wrapped in SwiftUI. These views natively support hit-testing to character index via closestPosition(to:) / characterIndex(for:). The text view is styled to look like a reading view (no editing, no cursor).
  3. Calls audioEngine.play(book:from:using:) with that offset
  4. Engine snaps to enclosing sentence boundary (via the book's sentence index), begins synthesis + playback
  5. View observes currentPosition and uses the book's sentence index to highlight the active sentence via attributed string ranges
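Step 2 on iOS can be sketched as follows; `chapterStart` (the chapter's global start offset, obtained from `Book`) and the helper name are illustrative:

```swift
import UIKit

/// Resolve a tap point inside a UITextView to a global character offset.
func characterOffset(at point: CGPoint, in textView: UITextView,
                     chapterStart: Int) -> Int? {
    // closestPosition(to:) snaps the tap to the nearest text position.
    guard let position = textView.closestPosition(to: point) else { return nil }
    let local = textView.offset(from: textView.beginningOfDocument, to: position)
    return chapterStart + local
}
```

One caveat: `UITextInput` offsets count UTF-16 code units, so if `CharacterOffset` counts `Character`s, a conversion step is needed for text containing emoji or other multi-unit characters.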

What's Explicitly Out of Scope (v1)

  • Deep pipeline streaming (multi-sentence lookahead beyond one-ahead buffer)
  • iCloud sync
  • Playback speed control
  • PDF support
  • More than 2-3 curated voices
  • Localized UI (English only, though architecture supports it)
  • Background downloads or model updates
  • Per-word position tracking / word-level highlighting
  • Caching parsed book text (re-parse on each open)
  • Latency optimization (acceptable to wait for synthesis before first audio plays)