How this blog actually works
Posted on: 2026-05-15
tags: meta, python, tts

This blog runs on a static site generator I wrote myself. There were a handful of perfectly good off-the-shelf options I could have picked, but I had one feature in mind that none of them did well: I wanted posts to be read aloud, with the page visibly following along: the current word highlighted, the current sentence highlighted, and the audio pausing politely whenever it reached a code block so the listener could actually look at it. That single requirement ended up shaping most of the architecture, and it's the bit I'm most proud of.

The build pipeline

The build is driven by a plain make. Three things happen, in order:

  1. SCSS in sass/ is compiled into static/css/ with the Dart sass CLI.
  2. Markdown in content/ is parsed into HTML, and also into a separate plain-text stream that gets fed to espeak-ng.
  3. Static assets are copied into output/ next to the rendered HTML.

The Python entry point is python3 -m site_generator, which exposes a handful of flags that the Makefile wires together. The most useful one day-to-day is --serve, which starts a dev server with file watching and browser live-reload.

Markdown to HTML

Mistune does the heavy lifting here. I subclass its HTMLRenderer to do two slightly unusual things. The first is wrapping every word and sentence in its own <span>, so the TTS layer has something to highlight later:

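import re

def text(self, text):
    # Words first, so this pass never sees the sentence <span> markup.
    text = re.sub(r'(\b\w+\b)', r'<span class="word">\1</span>', text)
    # Then each run of text ending in ., ! or ? becomes a sentence.
    text = re.sub(r'([^.!?]+[.!?])', r'<span class="sentence">\1</span>', text)
    return text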
Codeblock 0
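
To make the effect concrete, here is the same pair of substitutions pulled out into a standalone function and run on two short strings. The name wrap_words_and_sentences is mine, purely for illustration; it is not part of the renderer. The second string shows how a decimal point gets treated as a sentence end.

```python
import re

def wrap_words_and_sentences(text: str) -> str:
    """Standalone copy of the two substitutions from the renderer's text()."""
    text = re.sub(r'(\b\w+\b)', r'<span class="word">\1</span>', text)
    text = re.sub(r'([^.!?]+[.!?])', r'<span class="sentence">\1</span>', text)
    return text

simple = wrap_words_and_sentences("Hi there.")
print(simple)

# The "." inside 3.14 ends a sentence too, so this produces two sentence spans.
decimal = wrap_words_and_sentences("Pi is 3.14.")
print(decimal.count('class="sentence"'))
```

Text with no closing punctuation at all never matches the second pattern, so it keeps its word spans but gets no sentence span.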

The second is stamping each fenced code block with an id="code_block_N" and inserting a hidden "Continue TTS" button right after it. The id is what the JavaScript later toggles a .cur class onto so the code block lights up when the audio reaches it. The hidden button is what the listener clicks to resume playback after a paused code block.
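
A minimal sketch of that stamping step, written as a plain function rather than the real renderer method. Only the id format and the button label come from the description above; the function name, the button's class, its data attribute and the use of the hidden attribute are guesses.

```python
import html

def render_code_block(code: str, n: int) -> str:
    """Return a <pre> stamped with id="code_block_N" plus a hidden resume button."""
    escaped = html.escape(code)
    return (
        f'<pre id="code_block_{n}"><code>{escaped}</code></pre>\n'
        # Hypothetical markup: the real button's attributes are not shown in the post.
        f'<button class="tts-continue" data-block="{n}" hidden>Continue TTS</button>\n'
    )

print(render_code_block("if a < b: pass", 0))
```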

For syntax colors I lean on Pygments. When a fenced code block declares a language, the renderer hands the code to Pygments and uses its HTML formatter; otherwise it falls back to plain <pre><code>. The theme is a single static CSS file at static/css/pygments.css, regenerated from a Python one-liner whenever I want to try a different look.

Markdown to speech

A second renderer, PlainTextRenderer, walks the same Markdown tree and produces a flat string of speakable text with SSML markers embedded. Code blocks get replaced with a short stand-in announcement, surrounded by named marks:

def block_code(self, token, state):
    n = self._code_block_counter
    code_block = (
        f"\n<mark name=\"_nospeak_{n}_start\"/>\n"
        f"<mark name=\"code_block_{n}_start\"/>\n"
        f"<prosody pitch=\"high\">\n"
        f"See codeblock {n}.\n"
        f"</prosody>\n"
        f"<mark name=\"code_block_{n}_end\"/>\n"
        f"<mark name=\"_nospeak_{n}_end\"/>\n"
    )
    self._code_block_counter += 1
    return code_block
Codeblock 1

I deliberately don't try to read the code itself aloud, because that's a miserable listening experience for anything more complex than a single line. Instead the audio says "see codeblock N" in a slightly higher pitch, and the position of those named marks in the audio stream is what later lets the JavaScript pause at exactly the right moment.

The C helper

The SSML stream is fed to a tiny C program in espeaklib/ that links against libespeak-ng and libjson-c. It reads from a temp file, drives espeak through its synth callback, and emits a single JSON blob describing exactly when each word, sentence and named mark begins in the audio stream:

typedef struct {
    int index;
    int position;
    int length;
    int audio_pos;
} WordTiming;
Codeblock 2

That JSON gets embedded directly into the rendered post as a <script id="audio-data" type="application/json"> tag, and the WAV file is written next to the HTML. No streaming, no separate fetch, no fragile network synchronization.
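
The embedding step could look something like the sketch below. The shape of the timing dictionary is assumed (the post only shows the C struct), and embed_audio_data is a hypothetical name. The one real subtlety is that a literal "</script>" anywhere inside the JSON would end the tag early, so "</" is escaped before the payload goes into the page.

```python
import json

def embed_audio_data(timings: dict) -> str:
    """Serialize timings into an inline JSON <script> tag."""
    payload = json.dumps(timings, separators=(",", ":"))
    # "<\/" is still valid JSON for "</", but it cannot close the script element.
    payload = payload.replace("</", "<\\/")
    return f'<script id="audio-data" type="application/json">{payload}</script>'

# Assumed shape, mirroring the WordTiming fields; the values are made up.
example = {"words": [{"index": 0, "position": 0, "length": 4, "audio_pos": 120}]}
tag = embed_audio_data(example)
print(tag)
```

On the browser side, JSON.parse on the tag's textContent reverses the escape without any extra handling.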

The browser side

static/js/tts.js runs a requestAnimationFrame loop that compares the <audio> element's currentTime against the timings list and figures out which word and sentence should be lit. Code-block marks get special treatment: they drive precise setTimeout calls (not the rAF loop) that pause the audio just before the next sentence kicks in, show the "Continue" button for that block, and keep the code block's <pre> outlined until the listener clicks back in.
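
The per-frame lookup amounts to "find the last timing entry that started at or before currentTime". The real code is JavaScript in tts.js; this is the same idea sketched in Python to match the other examples, and it is only one plausible way to do it. It assumes the start times are sorted ascending and expressed in seconds, which the post doesn't actually state.

```python
from bisect import bisect_right
from typing import Optional, Sequence

def active_index(start_times: Sequence[float], current_time: float) -> Optional[int]:
    """Index of the last entry starting at or before current_time, else None."""
    i = bisect_right(start_times, current_time) - 1
    return i if i >= 0 else None

word_starts = [0.00, 0.35, 0.80, 1.40]  # made-up start times, in seconds
print(active_index(word_starts, 0.90))
```

A binary search keeps the cost per frame logarithmic in the number of words, which matters less for a short post than for a long one.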

Why setTimeout rather than the same rAF loop? Frame latency. On a busy tab the rAF callback can run a full frame (about 16 ms at 60 Hz) or more after the timing crossover, which is enough for the next sentence to start audibly before the pause catches it. Scheduling the pause precisely against audioEl.currentTime and accounting for output latency makes the pause feel intentional instead of late.

The dev server

python3 -m site_generator --serve starts a dev server with:

  • watchfiles watching content/, templates/, sass/, static/ and site_generator/.
  • A small SSE endpoint at /__livereload that the browser subscribes to.
  • A response interceptor that injects a tiny client script into every served HTML page so the tab reloads itself when the server publishes a reload message.
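
The last two bullets are small enough to sketch with the standard library. Everything named here is an assumption: sse_message, inject_livereload and the placeholder client script are illustrative, and only the "reload" payload comes from the post.

```python
# Placeholder for the injected client; the real script is not reproduced here.
LIVERELOAD_SNIPPET = "<script>/* live-reload client goes here */</script>"

def sse_message(data: str) -> bytes:
    """Frame one Server-Sent Events message: a data line plus a blank line."""
    return f"data: {data}\n\n".encode("utf-8")

def inject_livereload(html: str) -> str:
    """Insert the client script before the last </body>, or append it if absent."""
    head, sep, tail = html.rpartition("</body>")
    if not sep:
        return html + LIVERELOAD_SNIPPET
    return head + LIVERELOAD_SNIPPET + sep + tail

print(sse_message("reload"))
print(inject_livereload("<html><body><p>hi</p></body></html>"))
```

Injecting at serve time rather than at build time keeps the files in output/ free of dev-only markup.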

The dispatcher is dumb on purpose: Markdown edits rebuild the affected post and the tag/index pages; template edits re-render every post (with TTS skipped, since espeak takes seconds-to-minutes per post); SCSS edits recompile to CSS; everything else under static/ just re-copies. Changes to the Python source still need a manual restart.
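
That dispatch logic is roughly the following. This is a sketch under assumptions, not the real dispatcher: the action names are invented labels, the example file names are made up, and paths are taken to be relative to the project root.

```python
from pathlib import PurePosixPath

def classify_change(path: str) -> str:
    """Map a changed file to a rebuild action (action names are hypothetical)."""
    parts = PurePosixPath(path).parts
    top = parts[0] if parts else ""
    if top == "content" and path.endswith(".md"):
        return "rebuild_post_and_indexes"
    if top == "templates":
        return "rerender_all_posts_without_tts"
    if top == "sass":
        return "recompile_css"
    if top == "static":
        return "recopy_asset"
    if top == "site_generator":
        return "needs_manual_restart"
    return "ignore"

for changed in ["content/how-it-works.md", "sass/main.scss", "site_generator/cli.py"]:
    print(changed, "->", classify_change(changed))
```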

// Subscribe to the dev server's SSE endpoint and reload on request.
const es = new EventSource('/__livereload');
es.onmessage = function (e) {
    if (e.data === 'reload') {
        window.location.reload();
    }
};
Codeblock 3

That snippet is the entire reload mechanism on the client. The SSE endpoint holds one connection open per tab and the rebuild loop just shouts "reload" into it.

Things I'd still like

A short list of follow-ups that haven't earned a spot in the build yet:

  • Cleaner sentence splitting. The regex above doesn't understand abbreviations, decimal numbers, or quotation marks; it just splits at the next period, exclamation mark, or question mark.
  • A playback speed control wired to the audio element so I can re-listen at 1.4x without browser shortcuts.
  • An incremental TTS step so I don't regenerate audio for a post whose prose hasn't actually changed.
  • An RSS feed.

That last one is probably the next thing I'll add.