This blog runs on a static site generator I wrote myself. There were a handful of perfectly good off-the-shelf options I could have picked, but I had one feature in mind that none of them did well: I wanted posts to be read aloud, with the page visibly following along — the current word highlighted, the current sentence highlighted, and the audio pausing politely whenever it reached a code block so the listener could actually look at it. That single requirement ended up shaping most of the architecture, and it's the bit I'm most proud of.
The build is driven by a plain make. Three things happen, in order:
1. sass/ is compiled into static/css/ with the Dart sass CLI.
2. content/ is parsed into HTML, and also into a separate plain-text stream that gets fed to espeak-ng.
3. The resulting audio is written into output/ next to the rendered HTML.
The Python entry point is python3 -m site_generator, which exposes a handful of flags that the Makefile wires together. The most useful one day-to-day is --serve, which starts a dev server with file watching and browser live-reload.
Mistune does the heavy lifting here. I subclass its HTMLRenderer to do two slightly unusual things. The first is wrapping every word and sentence in its own <span>, so the TTS layer has something to highlight later:
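def text(self, text):
    # Word pass first: each run of word characters gets its own span.
    text = re.sub(r'(\b\w+\b)', r'<span class="word">\1</span>', text)
    # Sentence pass second: everything up to the next ., ! or ? becomes one span.
    text = re.sub(r'([^.!?]+[.!?])', r'<span class="sentence">\1</span>', text)
    return text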
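To see what those two passes produce, here is a minimal standalone sketch using the same two regexes outside of Mistune. The function name wrap_words_and_sentences and the sample strings are made up for illustration; only the patterns come from the renderer above.

```python
import re

# Same patterns as the renderer method, compiled once.
WORD_RE = re.compile(r'(\b\w+\b)')
SENTENCE_RE = re.compile(r'([^.!?]+[.!?])')


def wrap_words_and_sentences(text: str) -> str:
    """Apply the word pass, then the sentence pass, to one run of text."""
    text = WORD_RE.sub(r'<span class="word">\1</span>', text)
    text = SENTENCE_RE.sub(r'<span class="sentence">\1</span>', text)
    return text


html = wrap_words_and_sentences("Hello world. Bye!")
# Three words, two sentences.
print(html.count('class="word"'), html.count('class="sentence"'))

# Text with no closing ., ! or ? gets word spans but no sentence span.
print(wrap_words_and_sentences("No terminator here"))
```

Two properties of this approach are worth knowing: the order matters (running the sentence pass first would let the word pass wrap the words "span" and "class" inside the markup), and a run of text that never reaches a terminator is left without a sentence span.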
The second is stamping each fenced code block with an id="code_block_N" and inserting a hidden "Continue TTS" button right after it. The id is what the JavaScript later toggles a .cur class onto so the code block lights up when the audio reaches it. The hidden button is what the listener clicks to resume playback after a paused code block.
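A rough sketch of what that stamping could look like as a plain function. The id format and the button label come from the description above; the function name render_code_block, the tts-continue class, and the data-for attribute are hypothetical, and the real renderer may structure its markup differently.

```python
from html import escape


def render_code_block(code: str, n: int) -> str:
    """Return a <pre> stamped with code_block_N, followed by a hidden button."""
    block_id = f"code_block_{n}"
    return (
        f'<pre id="{block_id}"><code>{escape(code)}</code></pre>\n'
        # The "hidden" attribute keeps the button out of sight until the
        # script reveals it. Class and data-for names are illustrative only.
        f'<button class="tts-continue" data-for="{block_id}" hidden>'
        f'Continue TTS</button>\n'
    )


print(render_code_block("if a < b: pass", 0))
```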
For syntax colors I lean on Pygments. When a fenced code block declares a language, the renderer hands the code to Pygments and uses its HTML formatter; otherwise it falls back to plain <pre><code>. The theme is a single static CSS file at static/css/pygments.css, regenerated from a Python one-liner whenever I want to try a different look.
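The highlight-or-fall-back decision can be sketched roughly like this. The function name highlight_code is made up, and treating an unknown language tag the same as a missing one is an assumption rather than something the renderer is known to do. The Pygments calls themselves (highlight, get_lexer_by_name, HtmlFormatter, ClassNotFound) are the library's public API.

```python
from html import escape
from typing import Optional

try:  # Pygments is third-party; without it the sketch always falls back.
    from pygments import highlight
    from pygments.formatters import HtmlFormatter
    from pygments.lexers import get_lexer_by_name
    from pygments.util import ClassNotFound
except ImportError:
    highlight = None


def highlight_code(code: str, lang: Optional[str] = None) -> str:
    """Colorize with Pygments when a known language is declared."""
    plain = f"<pre><code>{escape(code)}</code></pre>\n"
    if not lang or highlight is None:
        return plain
    try:
        lexer = get_lexer_by_name(lang)
    except ClassNotFound:
        return plain  # assumption: unknown tag behaves like no tag
    return highlight(code, lexer, HtmlFormatter())


print(highlight_code("x = 1", "python"))
print(highlight_code("x < 1"))
```

As for the theme file, a one-liner in the spirit of the one mentioned above might be python3 -c "from pygments.formatters import HtmlFormatter; print(HtmlFormatter(style='monokai').get_style_defs('.highlight'))" redirected into static/css/pygments.css. The style name and the .highlight selector are placeholders here.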
A second renderer, PlainTextRenderer, walks the same Markdown tree and produces a flat string of speakable text with SSML markers embedded. Code blocks get replaced with a short stand-in announcement, surrounded by named marks:
def block_code(self, token, state):
    n = self._code_block_counter
    code_block = (
        f"\n<mark name=\"_nospeak_{n}_start\"/>\n"
        f"<mark name=\"code_block_{n}_start\"/>\n"
        f"<prosody pitch=\"high\">\n"
        f"See codeblock {n}.\n"
        f"</prosody>\n"
        f"<mark name=\"code_block_{n}_end\"/>\n"
        f"<mark name=\"_nospeak_{n}_end\"/>\n"
    )
    self._code_block_counter += 1
    return code_block
I deliberately don't try to read the code itself aloud, because that's a miserable listening experience for anything more complex than a single line. Instead the audio says "see codeblock N" in a slightly higher pitch, and the position of those named marks in the audio stream is what later lets the JavaScript pause at exactly the right moment.
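Because the mark names reuse the code_block_N pattern from the HTML ids, a mark can be mapped back to the element it belongs to with plain string handling. The sketch below shows that mapping. Both helper names are hypothetical, and it assumes the HTML and plain-text renderers number code blocks identically.

```python
import re

# Matches the self-closing marks in the form block_code emits them.
MARK_RE = re.compile(r'<mark name="([^"]+)"/>')


def split_mark(name: str) -> tuple:
    """Split 'code_block_3_start' into ('code_block_3', 'start')."""
    target, _, edge = name.rpartition("_")
    return target, edge


def code_block_marks(ssml: str) -> list:
    """List (element_id, edge) pairs for code-block marks, in stream order."""
    pairs = [split_mark(name) for name in MARK_RE.findall(ssml)]
    return [(t, e) for t, e in pairs if t.startswith("code_block_")]


sample = (
    '\n<mark name="_nospeak_0_start"/>\n'
    '<mark name="code_block_0_start"/>\n'
    '<prosody pitch="high">\nSee codeblock 0.\n</prosody>\n'
    '<mark name="code_block_0_end"/>\n'
    '<mark name="_nospeak_0_end"/>\n'
)
print(code_block_marks(sample))
```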
The SSML stream is fed to a tiny C program in espeaklib/ that links against libespeak-ng and libjson-c. It reads from a temp file, drives espeak through its synth callback, and emits a single JSON blob describing exactly when each word, sentence and named mark begins in the audio stream:
typedef struct {
    int index;
    int position;
    int length;
    int audio_pos;
} WordTiming;
That JSON gets embedded directly into the rendered post as a <script id="audio-data" type="application/json"> tag, and the WAV file is written next to the HTML. No streaming, no separate fetch, no fragile network synchronization.
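One detail worth getting right when inlining JSON into a page is that a literal "</script>" inside any string would end the tag early. A minimal sketch of the embedding step, assuming a hypothetical helper audio_data_tag and a made-up top-level "words" key (the field names mirror the WordTiming struct above):

```python
import json


def audio_data_tag(timings: dict) -> str:
    """Serialize timings into an inline application/json script tag."""
    payload = json.dumps(timings, separators=(",", ":"))
    # "<\/" is a valid JSON escape for "</", so the data round-trips
    # while the HTML parser never sees a premature closing tag.
    payload = payload.replace("</", "<\\/")
    return (
        '<script id="audio-data" type="application/json">'
        f"{payload}</script>"
    )


example = {
    "words": [{"index": 0, "position": 0, "length": 5, "audio_pos": 0}],
    "note": "</script>",  # worst-case content, to exercise the escaping
}
print(audio_data_tag(example))
```

On the client, JSON.parse on the element's textContent would recover the original structure, since the escaped slash decodes back to a plain one.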
static/js/tts.js runs a requestAnimationFrame loop that compares the <audio> element's currentTime against the timings list and figures out which word and sentence should be lit. Code-block marks get special treatment: they drive precise setTimeout calls (not the rAF loop) that pause the audio just before the next sentence kicks in, show the "Continue" button for that block, and keep the code block's <pre> outlined until the listener clicks back in.
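The per-frame lookup boils down to "find the last entry that has already started". The real loop lives in JavaScript; the sketch below shows the same idea in Python with a binary search. It assumes the start times are sorted and expressed in milliseconds (currentTime is in seconds, hence the conversion), and the function name current_index is invented for the example.

```python
from bisect import bisect_right
from typing import Sequence


def current_index(starts_ms: Sequence[float], now_ms: float) -> int:
    """Index of the last entry whose start is <= now_ms, or -1 if none."""
    return bisect_right(starts_ms, now_ms) - 1


word_starts_ms = [0, 320, 610]   # hypothetical audio_pos values
current_time_s = 0.4             # what audioEl.currentTime would report
print(current_index(word_starts_ms, current_time_s * 1000))
```

The same function works for sentences by passing the sentence start times instead.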
Why setTimeout rather than the same rAF loop? Frame latency. On a busy tab the rAF callback can run a full 16ms or more after the timing crossover, which is enough for the next sentence to start audibly before the pause catches it. Scheduling the pause precisely against audioEl.currentTime and accounting for output latency makes the pause feel intentional instead of late.
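The arithmetic behind that scheduling is small enough to sketch. The version below is an illustration under assumptions, not the code in tts.js: it takes the mark position in milliseconds, the current playback position in seconds, and a lead_ms margin that stands in for however output latency ends up being accounted for. The playback_rate parameter is an extra added for completeness.

```python
def pause_delay_ms(mark_ms: float, current_time_s: float,
                   lead_ms: float = 0.0, playback_rate: float = 1.0) -> float:
    """Wall-clock delay until the pause should fire, never negative."""
    # Media time left until the mark, converted to wall-clock time.
    remaining_ms = (mark_ms - current_time_s * 1000.0) / playback_rate
    # Fire slightly early by lead_ms; clamp so a late call pauses at once.
    return max(0.0, remaining_ms - lead_ms)


# Mark at 5.0 s, playback at 4.5 s, fire 40 ms early.
print(pause_delay_ms(5000, 4.5, lead_ms=40))
```

The returned value is what would be handed to setTimeout. Recomputing it whenever the listener seeks or changes speed keeps the timer honest.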
python3 -m site_generator --serve starts a dev server with:
- watchfiles watching content/, templates/, sass/, static/ and site_generator/.
- An SSE endpoint at /__livereload that the browser subscribes to.
- A rebuild loop that finishes by sending a reload message.
The dispatcher is dumb on purpose: Markdown edits rebuild the affected post and the tag/index pages; template edits re-render every post (with TTS skipped, since espeak takes seconds-to-minutes per post); SCSS edits recompile to CSS; everything else under static/ just re-copies. Changes to the Python source still need a manual restart.
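The dispatch rules above fit in a single lookup on the first path component. A minimal sketch, assuming paths arrive relative to the project root with forward slashes; the function name classify_change and the action labels are invented for the example.

```python
from pathlib import PurePosixPath


def classify_change(path: str) -> str:
    """Map a changed file to the rebuild action described above."""
    parts = PurePosixPath(path).parts
    top = parts[0] if parts else ""
    if top == "content" and path.endswith(".md"):
        return "rebuild_post_and_indexes"
    if top == "templates":
        return "rerender_all_posts_without_tts"
    if top == "sass":
        return "recompile_css"
    if top == "static":
        return "copy_static"
    if top == "site_generator":
        return "needs_manual_restart"
    return "ignore"


for changed in ("content/hello.md", "templates/post.html", "sass/main.scss"):
    print(changed, "->", classify_change(changed))
```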
// es is the EventSource subscribed to the /__livereload endpoint.
const es = new EventSource('/__livereload');
es.onmessage = function (e) {
  if (e.data === 'reload') {
    window.location.reload();
  }
};
That handful of lines is the entire reload mechanism on the client. The SSE endpoint hangs open per tab and the rebuild loop just shouts "reload" into it.
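The server half of that conversation can be sketched with nothing but the standard library. This is an illustration of the pattern, not the generator's actual code: ReloadHub and sse_event are hypothetical names, and each open tab is modelled as a queue that its HTTP handler would drain and write to the response.

```python
import queue
import threading


def sse_event(data: str) -> bytes:
    """Encode one Server-Sent Events message; a blank line ends the event."""
    lines = data.split("\n")
    return ("".join(f"data: {line}\n" for line in lines) + "\n").encode("utf-8")


class ReloadHub:
    """Fan one message out to every subscribed tab."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._subscribers = set()

    def subscribe(self) -> queue.Queue:
        q = queue.Queue()
        with self._lock:
            self._subscribers.add(q)
        return q

    def unsubscribe(self, q: queue.Queue) -> None:
        with self._lock:
            self._subscribers.discard(q)

    def broadcast(self, data: str = "reload") -> None:
        frame = sse_event(data)
        with self._lock:
            targets = list(self._subscribers)
        for q in targets:
            q.put(frame)


hub = ReloadHub()
tab = hub.subscribe()
hub.broadcast()            # what the rebuild loop would call when done
print(tab.get_nowait())
```

The browser strips the "data: " prefix and the trailing newline, which is why the client-side check against the bare string 'reload' works.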
A short list of follow-ups that haven't earned a spot in the build yet:
- [...] .!?.
That last one is probably the next thing I'll add.