Apr 15
Getting word timestamps out of kokoro-js
by Ryan Welch · 5 Min Read
Kokoro-82M ↗ is a compact TTS model that runs entirely in-process via kokoro-js ↗ and the ONNX runtime. It’s fast and produces decent-quality audio, but the public API returns audio and nothing else.
I needed word-level timestamps for caption animations and didn’t want to add a forced alignment step on top of an already local model. The timing data already exists inside the model, but kokoro-js throws it away, so the practical option is to intercept it before the library discards it.
Note
This approach depends on private internals rather than a public, stable API.
Kokoro is a neural TTS model. Like most sequence models, it works in phoneme space: text gets converted to phonemes first (the smallest units of sound; “cat” is /k ae t/), then the model synthesizes audio from those phonemes.
During synthesis, the model outputs a tensor called pred_dur alongside the audio waveform. Each value is the predicted duration of one phoneme token, in half-frame units. At 24kHz, a 600-sample frame is 25ms, so a half-frame is 12.5ms. That means you can convert each pred_dur value to seconds with value / 80. The model also prepends a BOS (begin-of-sequence) token to the input, so pred_dur[0] is overhead: skip it, and pred_dur[1..N] maps to the phoneme characters.
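As a quick sanity check, the unit conversion can be sketched like this (the function name and the sample values are mine, for illustration only):

```typescript
// Convert captured pred_dur values (half-frames at 24 kHz) to seconds.
// A frame is 600 samples = 25 ms, so a half-frame is 12.5 ms = 1/80 s.
const HALF_FRAMES_PER_SECOND = 80;

function phonemeDurationsSeconds(predDur: number[]): number[] {
  // predDur[0] corresponds to the prepended BOS token, so skip it.
  return predDur
    .slice(1)
    .map((halfFrames) => halfFrames / HALF_FRAMES_PER_SECOND);
}

// Made-up values: 8 half-frames → 0.1 s, 4 half-frames → 0.05 s
phonemeDurationsSeconds([3, 8, 4]); // → [0.1, 0.05]
```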
kokoro-js receives this from the ONNX runtime but only destructures waveform from the result. pred_dur gets discarded.
kokoro-js synthesizes through a method called model._call. Temporarily replacing it with a wrapper lets us capture pred_dur before the library discards it:
```typescript
const capture = { predDur: null as number[] | null };

const originalCall = (tts as any).model._call.bind((tts as any).model);
(tts as any).model._call = async (modelInputs: Record<string, unknown>) => {
  const result = await originalCall(modelInputs);
  if (result.pred_dur) {
    capture.predDur = Array.from(
      result.pred_dur.data as Float32Array | Int32Array,
    );
  }
  return result;
};
```
```typescript
try {
  // Audio comes from the stream chunk — this call is purely to trigger
  // _call and capture pred_dur
  await (tts as any).generate_from_ids(input_ids, {
    voice: config.voice,
    speed: config.speed,
  });
} finally {
  (tts as any).model._call = originalCall;
}
```

kokoro-js streams sentence by sentence, so this runs once per sentence chunk. The audio for each chunk comes from the stream itself; generate_from_ids is a second forward pass purely to trigger model._call and capture pred_dur. That adds some latency per sentence. The finally block ensures the original method is always restored.
If you’d rather not rely on the override at all, the change needed in the library itself is small: generate_from_ids already has pred_dur in scope, it just needs to be returned or exposed on the instance. Forking and patching that one spot is straightforward if you want something more stable.
Note
If a future release renames or restructures these internals, the capture silently stops working, so pin your kokoro-js version if you use this.
I plan to open a PR for exposing pred_dur. If it gets merged, I’ll update this post to use the public API instead of the override.
With pred_dur in hand, accumulating the values gives a start and end time for each phoneme character in the phoneme string. The string uses spaces to separate words, so "Hello world" might phonemize to "həloʊ wɜːld". Split on spaces and you have per-word phoneme groups. Sum the durations within each group and you have the word’s time range.
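To make the grouping concrete, here is a minimal sketch of the split-and-sum step (the function name is mine, and it assumes durations have already been converted to per-character values with the BOS entry dropped):

```typescript
// Per-word time ranges from a phoneme string and per-character durations.
// Assumes durations[i] is the duration of the i-th character of
// phonemeString, spaces included.
function wordRanges(
  phonemeString: string,
  durations: number[],
): { start: number; end: number }[] {
  const ranges: { start: number; end: number }[] = [];
  let t = 0;
  let i = 0;
  for (const group of phonemeString.split(" ")) {
    const start = t;
    for (const _ of group) {
      t += durations[i++] ?? 0;
    }
    ranges.push({ start, end: t });
    t += durations[i++] ?? 0; // the space between words also has a duration
  }
  return ranges;
}
```

The cumulative time `t` never resets, so each word's range already sits at the right absolute position within the chunk.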
The complication is that phonemization doesn’t always produce the same word count as the original text. Abbreviations and numbers expand: "Dr. Smith" becomes "dɒktə smɪθ" (still two groups here, but "Dr." expanded its phonemes), and "on 3rd Street" might phonemize to more words than three. For the article, it’s easier to show the word-level step directly and leave character expansion for later. This version aligns words by position. If a source word has no matching phoneme group, it falls back to a zero-duration range at the current end time:
```typescript
type WordAlignment = {
  word: string;
  start: number;
  end: number;
};
```
```typescript
function buildWordAlignmentData(
  originalText: string,
  phonemeString: string,
  predDur: number[],
  chunkOffsetSeconds: number,
): WordAlignment[] {
  const phonemeChars = Array.from(phonemeString);
  const durOffset = 1; // skip BOS token

  // First pass: turn each phoneme character into a [start, end] range.
  let currentTime = chunkOffsetSeconds;
  const phonemeStartTimes: number[] = [];
  const phonemeEndTimes: number[] = [];

  for (let i = 0; i < phonemeChars.length; i++) {
    const durIdx = i + durOffset;
    const durValue = durIdx < predDur.length ? predDur[durIdx]! : 0;
    const durationSeconds = durValue / 80; // half-frames to seconds
    phonemeStartTimes.push(currentTime);
    phonemeEndTimes.push(currentTime + durationSeconds);
    currentTime += durationSeconds;
  }

  const phonemeWords = phonemeString.split(" ");
  const originalWords = originalText.trim().split(/\s+/);

  // Second pass: map each original word to the corresponding phoneme group.
  const words: WordAlignment[] = [];
  let phonemeCharIdx = 0;

  for (let wordIdx = 0; wordIdx < originalWords.length; wordIdx++) {
    const originalWord = originalWords[wordIdx]!;
    const phonemeWordChars = Array.from(phonemeWords[wordIdx] ?? "");

    let wordStart: number;
    let wordEnd: number;

    if (
      phonemeWordChars.length === 0 ||
      phonemeCharIdx >= phonemeChars.length
    ) {
      // No phoneme mapping for this word, so keep it at the current end time.
      wordStart = currentTime;
      wordEnd = currentTime;
    } else {
      wordStart = phonemeStartTimes[phonemeCharIdx] ?? currentTime;
      const lastIdx = phonemeCharIdx + phonemeWordChars.length - 1;
      wordEnd =
        phonemeEndTimes[Math.min(lastIdx, phonemeEndTimes.length - 1)] ??
        currentTime;
      phonemeCharIdx += phonemeWordChars.length;
    }

    // Skip the space between phoneme words
    if (
      phonemeCharIdx < phonemeChars.length &&
      phonemeChars[phonemeCharIdx] === " "
    ) {
      phonemeCharIdx++;
    }

    words.push({ word: originalWord, start: wordStart, end: wordEnd });
  }

  return words;
}
```

If you need character-level output after this, you can expand each word’s [start, end] range across its original characters. That is still an approximation, because some sounds take longer than others, but for highlighting whole words in sync with playback it is usually accurate enough.
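One way to do that expansion is to divide the word's range evenly across its characters. This is a uniform approximation of my own, not the model's per-phoneme timing:

```typescript
type CharAlignment = { char: string; start: number; end: number };

// Spread a word's [start, end] range evenly across its characters.
function expandWordToChars(
  word: string,
  start: number,
  end: number,
): CharAlignment[] {
  const chars = Array.from(word);
  const step = chars.length > 0 ? (end - start) / chars.length : 0;
  return chars.map((char, i) => ({
    char,
    start: start + i * step,
    end: start + (i + 1) * step,
  }));
}
```

For karaoke-style per-character highlighting this tends to look fine; the eye is far more sensitive to word boundaries than to timing within a word.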
buildWordAlignmentData runs once per sentence chunk. The chunkOffsetSeconds argument is totalSamplesProducedSoFar / sampleRate, which shifts all timestamps for that chunk to the correct position in the final audio. Chunks are then merged by concatenating the word arrays.
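The offset bookkeeping is simple enough to isolate. A sketch, with a helper name of my own, that turns a running sample count into the chunkOffsetSeconds value for each chunk:

```typescript
// One offset per chunk: total samples produced before it, in seconds.
// Kokoro outputs 24 kHz audio, hence the default sample rate.
function chunkOffsets(sampleCounts: number[], sampleRate = 24000): number[] {
  const offsets: number[] = [];
  let total = 0;
  for (const count of sampleCounts) {
    offsets.push(total / sampleRate);
    total += count;
  }
  return offsets;
}

chunkOffsets([24000, 12000, 48000]); // → [0, 1, 1.5]
```

Each offset is then passed to buildWordAlignmentData for the corresponding chunk, and the resulting word arrays concatenate into one timeline.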
pred_dur is a prediction made during synthesis, not a measurement of the actual audio. Timing errors are small but accumulate over longer texts. For word-level caption highlighting it holds up well. For precise subtitle alignment on long-form content, running a dedicated forced aligner against the finished audio will be more accurate. Qwen3-ForcedAligner ↗ is worth looking at if that’s your use case.
Note
The next post will cover using Qwen for TTS and getting alignment from the forced aligner.
Last edited Apr 17