Apr 25 · in engineering
by Ryan Welch · 9 Min Read
My recent posts on Kokoro and Qwen3 covered TTS models that produce narration with word-level timestamps. As part of a larger captioning project, I now need to split that stream of timestamped words into subtitle pages. A subtitle page is a short chunk of text that appears on screen for a few seconds and then disappears. The segmentation problem is deciding which words belong together on the same page and where to break between pages. It sounds simple, but doing it well takes a mix of timing data, linguistic rules, and machine learning.
The input to subtitle segmentation is a list of words with timestamps showing when each word started and ended. Where these timestamps come from depends on your pipeline. ASR/STT (automatic speech recognition) is the most common source, where you transcribe existing audio and get word-level timing as part of the output. TTS engines can also produce this data during synthesis. Either way, the segmentation problem is the same. Given timestamped words, you want to group them into subtitle pages that appear and disappear in sync with the audio.
For this article I’m using the following example sentence:
Golden hour hit different. We stood there, wordless, watching the day quietly surrender to night.
Each word arrives with a startMs and endMs. The gaps between consecutive words reflect natural pauses in the speech. The job is deciding where one subtitle page ends and the next begins.
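Concretely, the snippets below share these two shapes. The type names are mine and the timestamps are illustrative, not real engine output, but the structure is what ASR and TTS pipelines give you:

```ts
type TimedWord = {
  text: string;
  startMs: number;
  endMs: number;
};

type SubtitlePage = {
  words: TimedWord[];
  startMs: number;
  endMs: number;
};

// Illustrative timings for the example sentence. Note the longer
// gaps after "different." and "wordless,"; those matter later.
const words: TimedWord[] = [
  { text: "Golden", startMs: 0, endMs: 320 },
  { text: "hour", startMs: 340, endMs: 560 },
  { text: "hit", startMs: 580, endMs: 760 },
  { text: "different.", startMs: 780, endMs: 1200 },
  { text: "We", startMs: 1900, endMs: 2040 },
  { text: "stood", startMs: 2060, endMs: 2320 },
  { text: "there,", startMs: 2340, endMs: 2600 },
  { text: "wordless,", startMs: 2640, endMs: 3100 },
  { text: "watching", startMs: 3600, endMs: 3940 },
  { text: "the", startMs: 3950, endMs: 4040 },
  { text: "day", startMs: 4060, endMs: 4280 },
  { text: "quietly", startMs: 4320, endMs: 4700 },
  { text: "surrender", startMs: 4720, endMs: 5180 },
  { text: "to", startMs: 5200, endMs: 5300 },
  { text: "night.", startMs: 5320, endMs: 5700 },
];
```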
I’ll start with a few naive heuristics, then layer in timing and syntax, and finish with the constraints that make production subtitles harder than a toy example.
Subtitle reading is time-pressured. The text appears, you read it, and it disappears. You can’t scroll back. If a page breaks at the wrong moment, the reader’s eye lands on a fragment like “We” and has to wait for the next page to understand what it belongs to. By the time the next page appears, some viewers have already looked away. Others have to re-read to reconstruct the sentence and fall behind.
The simplest approach is to slice the word array into fixed-size chunks.
```ts
function chunkByCount(words: TimedWord[], n: number): SubtitlePage[] {
  const pages: SubtitlePage[] = [];
  for (let i = 0; i < words.length; i += n) {
    const chunk = words.slice(i, i + n);
    pages.push({
      words: chunk,
      startMs: chunk[0]!.startMs,
      endMs: chunk[chunk.length - 1]!.endMs,
    });
  }
  return pages;
}
```

Every break is arbitrary, landing wherever the count happens to fall. "We" gets split from "stood" across two pages. The reader sees "We" and has to wait for the next page to understand the subject.
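To make that concrete, here's what chunkByCount produces with n = 5 on the example words:

```ts
const pages = chunkByCount(words, 5);
// Page 1: "Golden hour hit different. We"
// Page 2: "stood there, wordless, watching the"
// Page 3: "day quietly surrender to night."
```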
One step up is to flush the current page when a word ends with ., !, or ?, instead of waiting for the word count to fill up.
```ts
function endsSentence(word: string): boolean {
  return /[.!?]["')\]]*$/.test(word);
}
```

This immediately creates a problem. "Dr." triggers the sentence-ending check because the period looks like a full stop to a regex. "St." in "St. Pauls Lane" does the same. Periods pull double duty in English (sentence termination and abbreviation markers) and a regex can't tell the difference without knowing the word.
The fix is a set of known abbreviations to skip:
```ts
const ABBREVIATIONS = new Set([
  "mr", "mrs", "ms", "dr", "prof", "st", "ave", "blvd", "rd",
  "vs", "etc", "approx", "dept", "jan", "feb", "mar" /* ... */,
]);
```
```ts
function endsSentence(word: string): boolean {
  if (!/[.!?]["')\]]*$/.test(word)) return false;
  const stem = word.replace(/[.!?]["')\]]*$/, "").toLowerCase();
  return !ABBREVIATIONS.has(stem);
}
```

Abbreviation lists are always finite and language-specific. They also can't handle context: "May" could be a month, a name, or a modal verb. And the list needs extending for specialised content, since domain-specific abbreviations (medical, legal, technical) won't be in a general-purpose set.
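Here's how the fixed version behaves; the false positive on "Corp." is the finite-list problem in action:

```ts
endsSentence("different."); // true  (genuine sentence end)
endsSentence("Dr.");        // false (known abbreviation)
endsSentence("Approx.");    // false (the list is matched case-insensitively)
endsSentence("Corp.");      // true  (not in the list, so it slips through)
```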
“Golden hour hit different.” now breaks correctly at the period. But the timing is still off. The speaker pauses for half a second after “wordless,” and the captions have no idea. Pages appear and disappear based on word count and punctuation alone. During long silences the previous caption sits frozen on screen, then the next page snaps in when speech resumes. The text is correct, but it doesn’t follow the audio.
The timestamp data includes gaps between consecutive words. If the gap exceeds a threshold, that’s a natural break point where the speaker paused.
```ts
const gap = words[i].startMs - words[i - 1].endMs;
if (gap > maxPauseMs) {
  pages.push(currentPage);
  currentPage = [];
}
```

The captions now breathe with the speaker. The 500ms gap after "wordless," triggers a page break, so there's no more frozen text during silence. The sentence boundary after "different." naturally produces a long gap too, so it breaks there as well.
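Putting the count, punctuation, and pause rules together gives a loop like this. The paginate name and the flush helper are mine; the types and endsSentence come from earlier:

```ts
function paginate(
  words: TimedWord[],
  wordsPerPage: number,
  maxPauseMs: number,
): SubtitlePage[] {
  const pages: SubtitlePage[] = [];
  let current: TimedWord[] = [];

  const flush = () => {
    if (current.length === 0) return;
    pages.push({
      words: current,
      startMs: current[0]!.startMs,
      endMs: current[current.length - 1]!.endMs,
    });
    current = [];
  };

  for (let i = 0; i < words.length; i++) {
    const word = words[i]!;
    // A long pause means the previous page should end before this word.
    if (i > 0 && word.startMs - words[i - 1]!.endMs > maxPauseMs) flush();
    current.push(word);
    // Sentence ends and full pages close the page after this word.
    if (endsSentence(word.text) || current.length >= wordsPerPage) flush();
  }
  flush();
  return pages;
}
```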
The createTikTokStyleCaptions helper in @remotion/captions does roughly this: time-gap-based page grouping with no linguistic awareness. It's a reasonable starting point if you just need something working.
Pause detection still leaves one awkward split: watching the day quietly surrender / to night. It respects timing and the 5-word cap, but it cuts through a prepositional phrase. The more natural subtitle would be watching the day / quietly surrender to night.
Word lists can’t solve this. You need phrase boundary information, meaning some knowledge of which words form a unit that should not be split. In a real system this comes from an NLP parser. Here’s a sketch of what that looks like:
```ts
import { parsePhrases } from "mock-nlp-lib";

// getPhraseBoundaries returns a set of word indices after which
// a phrase boundary exists. For our sentence it identifies:
//   index 10 → end of "watching the day" (noun phrase)
//   index 11 → end of "quietly" (adverb modifying the verb phrase)
// so we can break after index 10 without splitting a phrase.
async function getPhraseBoundaries(words: TimedWord[]): Promise<Set<number>> {
  const text = words.map((w) => w.text).join(" ");
  const phrases = await parsePhrases(text);

  // phrases is an array of token spans: [{ start: 0, end: 3 }, ...]
  // We want the index of the last word in each phrase.
  const boundaries = new Set<number>();
  for (const phrase of phrases) {
    boundaries.add(phrase.end - 1);
  }
  return boundaries;
}

// Then use it in the paging loop:
const phraseBreakAfter = await getPhraseBoundaries(words);

if (
  gap > maxPauseMs ||
  endsSentence(word.text) ||
  currentPage.length >= wordsPerPage ||
  // In production you'd only honour a phrase boundary when the page is
  // already long enough; breaking at every one over-segments.
  phraseBreakAfter.has(i)
) {
  pages.push(currentPage);
  currentPage = [];
}
```

The key detail is that parsePhrases returns spans based on the parser's understanding of sentence structure (noun phrases, verb phrases, prepositional phrases) rather than hardcoded indices. In this example it would identify "watching the day" as a noun phrase (span ending at index 10) and "quietly surrender to night" as a verb phrase, so the break falls at the right place.
For this sentence, that produces the split you actually want. The timing rules still apply first; the phrase boundary adds one more valid break point inside the final chunk.
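Traced end to end on the example timings, the combined rules page out roughly like this:

```ts
// 1. "Golden hour hit different."    (sentence end)
// 2. "We stood there, wordless,"     (500ms pause after "wordless,")
// 3. "watching the day"              (phrase boundary after index 10)
// 4. "quietly surrender to night."   (end of input)
```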
The four techniques above cover a lot of ground for English with clean punctuation, but they’re still only the first layer. Once you leave the demo and start generating real subtitles, two more constraints show up immediately: the page has to be readable on screen, and the break has to hold up across messier language.
Finding a linguistically good break point is not enough on its own. The break also has to make sense with the audio timing and fit within subtitle layout rules.
A subtitle page can contain multiple lines. The demos above show single-line pages for simplicity, but production subtitles usually allow two lines per page with a maximum character count per line (Netflix specifies 42 characters; the BBC uses 37). The line break within a page follows the same basic rule as the page break itself: do not split a phrase if you can avoid it. Subtitle Edit, an open-source subtitle authoring tool, applies many of these Netflix/BBC rules in its line-breaking logic, and that code is worth reading even if you don't use the tool directly.
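A minimal sketch of the line-level step, assuming a greedy search for the most balanced two-line split under the character limit (a real implementation would also penalise splits that cut phrases, per the rule above):

```ts
function breakIntoLines(words: string[], maxCharsPerLine = 42): string[] {
  const text = words.join(" ");
  if (text.length <= maxCharsPerLine) return [text];

  // Try every split point and keep the most balanced pair of lines
  // that both fit within the limit.
  let best: string[] | null = null;
  let bestImbalance = Infinity;
  for (let i = 1; i < words.length; i++) {
    const line1 = words.slice(0, i).join(" ");
    const line2 = words.slice(i).join(" ");
    if (line1.length > maxCharsPerLine || line2.length > maxCharsPerLine) {
      continue;
    }
    const imbalance = Math.abs(line1.length - line2.length);
    if (imbalance < bestImbalance) {
      best = [line1, line2];
      bestImbalance = imbalance;
    }
  }
  // No valid two-line split exists: the caller has to re-page.
  return best ?? [text];
}
```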
Reading speed is part of segmentation, not something you fix afterwards. Real systems enforce maximum lines, maximum characters per line, minimum event duration, and some target reading rate in characters per second or words per minute. Netflix specifies a minimum of 5/6 of a second per subtitle event. The BBC targets 160-180 words per minute, about 0.33 seconds per word. A page that appears and disappears in 200ms is unreadable no matter how grammatically perfect the break is.
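As a sketch, a check like this lets the paginator reject pages that fail the duration or reading-rate floors and merge or extend them. The 833ms floor is Netflix's 5/6 of a second; the characters-per-second ceiling is an illustrative placeholder, not a number from any spec:

```ts
function isReadable(page: SubtitlePage, maxCps = 17): boolean {
  const durationMs = page.endMs - page.startMs;
  if (durationMs < 833) return false; // Netflix's 5/6-second minimum
  // Reading rate in characters per second across the whole page.
  const chars = page.words.map((w) => w.text).join(" ").length;
  return chars / (durationMs / 1000) <= maxCps;
}
```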
This is where the tradeoff becomes real. Pauses happen where the speaker breathes, not necessarily where clauses end. In practice, timing usually defines where a split is even feasible, and the linguistic rules help you choose the least awkward option inside that window.
Word lists and regexes also run out of steam pretty quickly. Text without terminal punctuation, which is common in dialogue and social captions, never triggers sentence breaks. Abbreviation lists miss domain-specific terms. And none of these techniques actually understand grammar.
If you want stronger boundaries, NLP tools can help. A dependency parser such as spaCy can identify clause structure directly, so you’d never split a verb from its object or a preposition from its complement. Named Entity Recognition can tag “St. Pauls” as a location without relying on an abbreviation list.
wtpsplit takes a different approach. It uses small transformer models trained specifically for text segmentation, including subtitle segmentation, across 85 languages. It can handle sentence boundary detection and subtitle-specific line breaking, which makes it useful as a pre-processing step before the simpler heuristics.
Language support makes the problem even less uniform. Chinese and Japanese need word segmentation before you can even talk about phrase boundaries. Arabic and Hebrew bring right-to-left rendering issues. Agglutinative languages like Finnish, Turkish, and Hungarian pack more meaning into single words, so a word-count limit behaves very differently there than it does in English.
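For Chinese and Japanese, at least the word-segmentation step has a built-in starting point in modern JavaScript runtimes: Intl.Segmenter does dictionary-based word splitting, though it yields words, not phrase structure:

```ts
// Segment a Japanese sentence ("the time to read subtitles is limited")
// into word-like units, a prerequisite for any per-word rule.
const segmenter = new Intl.Segmenter("ja", { granularity: "word" });
const jaWords = [...segmenter.segment("字幕を読む時間は限られている")]
  .filter((s) => s.isWordLike)
  .map((s) => s.segment);
// Word boundaries now exist, but phrase structure still doesn't.
// Exact segmentation varies with the runtime's ICU version.
```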
Most real systems end up hybrid. They use hard rules for non-negotiables like line length and minimum duration, timing data to decide where a split is plausible, and linguistic rules, parsers, or ML models to rank the plausible break points by naturalness.
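A sketch of that shape: hard constraints filter the candidate break points, and a soft score ranks whatever survives. Every weight below is invented for illustration:

```ts
// Hypothetical scoring of a candidate break after word i. Hard
// constraints (line length, minimum duration) are checked elsewhere;
// this only ranks the breaks that pass them.
function scoreBreak(
  i: number,
  words: TimedWord[],
  phraseBreakAfter: Set<number>,
): number {
  let score = 0;
  const next = words[i + 1];
  if (next) {
    // Longer pauses make better breaks, capped so one huge gap
    // doesn't dominate everything else.
    score += Math.min(next.startMs - words[i]!.endMs, 1000) / 100;
  }
  if (endsSentence(words[i]!.text)) score += 10; // sentence ends are strongest
  if (phraseBreakAfter.has(i)) score += 5;       // phrase ends are next best
  return score;
}
```

The paginator then breaks at the highest-scoring candidate inside each feasible window.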
That’s the part that makes subtitle segmentation harder than it looks. A simple heuristic gets you something on screen quickly, but the last stretch is about balancing timing, readability, and syntax without letting any one of them dominate the others.
Get in touch
Have a question? Spotted a mistake? Or just want to say thank you?
Send me an email at hello@ryanwelch.co.uk - seriously! I love hearing from you.