Apr 15 · in engineering
by Ryan Welch · 6 Min Read
Kokoro-82M is a compact TTS model that runs entirely in-process via Python or JavaScript. It is fast and produces good audio for its size. I needed word-level timestamps for caption animations and did not want to add a forced alignment step on top of an already local model.
In Python, Kokoro exposes timing-related data directly. In JavaScript, the public ONNX streaming API gives you each chunk's text and audio, but not the model's native alignment output. So there are two practical paths: use Python when you want the model's timing data directly, or in JavaScript measure each streamed chunk and distribute that duration across its original characters.
Results first! Here are a few short sample clips with matching timings from the Python and JavaScript approaches below. The JavaScript demo uses the chunk-duration approximation explained later, so you can compare it directly with the Python version, which uses Kokoro’s native alignment output.
In my testing, the female voices were generally stronger than the male ones. The sample below uses a male voice that sounds a bit rougher. The Kokoro team also calls out voice quality differences in their VOICES.md, where they rank voices by training quality. Overall, Kokoro’s voice selection is still fairly limited.
Note
For sentence- or phrase-level highlighting, the difference is still hard to notice in practice, even though Python has the more accurate word timings. If the two clips differ slightly in intonation, that is from the model itself. Kokoro can vary a bit from run to run.
The simplest path is Python, because Kokoro exposes token objects with timing fields during synthesis. The example below turns those token spans into word-level timestamps and writes both the audio and a small JSON transcript.
The core of the script looks like this:
```python
from pathlib import Path
import json

import numpy as np
import soundfile as sf
from kokoro import KPipeline

TEXT = (
    "All around the light is golden and liquid and heavy, like it's just "
    "beginning on its second glass of wine."
)
VOICE = "af_heart"
SAMPLE_RATE = 24_000


def append_words(words, tokens, chunk_offset_seconds):
    current_text = []
    current_start = None
    current_end = None
    last_end = chunk_offset_seconds

    for token in tokens or []:
        text = getattr(token, "text", "")
        if not text:
            continue

        start_ts = getattr(token, "start_ts", None)
        end_ts = getattr(token, "end_ts", None)

        if current_start is None and start_ts is not None:
            current_start = chunk_offset_seconds + float(start_ts)
        if end_ts is not None:
            current_end = chunk_offset_seconds + float(end_ts)

        current_text.append(text)

        if getattr(token, "whitespace", ""):
            word_text = "".join(current_text).strip()
            if word_text:
                start = current_start if current_start is not None else last_end
                end = current_end if current_end is not None else start
                words.append(
                    {
                        "text": word_text,
                        "startMs": round(start * 1000),
                        "endMs": round(end * 1000),
                    }
                )
                last_end = end

            current_text = []
            current_start = None
            current_end = None


pipeline = KPipeline(lang_code="a", device="cpu")
results = list(pipeline(TEXT, voice=VOICE, speed=1, split_pattern=r"\n+"))
```

The important part is result.tokens. Each token can include text, start_ts, end_ts, and whitespace information, which is enough to rebuild word spans from the model output itself. If you are not running in a browser, this is a fast way to generate speech with alignment data. Still, Kokoro is especially well suited to the browser and edge devices, where Python may not be an option.
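To make the token-to-word logic concrete, here is a minimal, self-contained sketch that runs the same accumulation against hand-made stand-in tokens. The SimpleNamespace objects and their timing values are made up for illustration; they only mimic the fields Kokoro's real tokens carry:

```python
from types import SimpleNamespace

# Stand-in tokens mimicking Kokoro's token fields (text, start_ts, end_ts,
# whitespace). The timing values are invented for illustration.
tokens = [
    SimpleNamespace(text="Hello", start_ts=0.00, end_ts=0.35, whitespace=" "),
    SimpleNamespace(text="world", start_ts=0.40, end_ts=0.80, whitespace=" "),
]

words = []
current_text, current_start, current_end = [], None, None
for token in tokens:
    if current_start is None and token.start_ts is not None:
        current_start = float(token.start_ts)
    if token.end_ts is not None:
        current_end = float(token.end_ts)
    current_text.append(token.text)
    if token.whitespace:  # a word boundary: flush the accumulated span
        words.append({
            "text": "".join(current_text).strip(),
            "startMs": round(current_start * 1000),
            "endMs": round(current_end * 1000),
        })
        current_text, current_start, current_end = [], None, None

print(words)
# [{'text': 'Hello', 'startMs': 0, 'endMs': 350},
#  {'text': 'world', 'startMs': 400, 'endMs': 800}]
```

The whitespace field is what marks word boundaries: tokens accumulate until one carries trailing whitespace, at which point the span flushes as a single word.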
On the JavaScript side, kokoro-js is still appealing because it is lightweight and runs comfortably on CPU. That is a big part of what makes this model useful.
The limitation is that the public ONNX path, and therefore kokoro-js, does not expose the model's native alignment output. That is a property of the ONNX export itself, which does not include the required outputs, not just a gap in the JS library.
We can still approximate the alignment data. Each item yielded by tts.stream(...) gives you chunk.text, chunk.audio.audio, and chunk.audio.sampling_rate. That is enough to derive the chunk duration from chunk.audio.audio.length / sampleRate and the chunk start offset from totalSamplesProducedSoFar / sampleRate. From there, you can distribute that chunk’s duration across its original characters, then merge those character spans back into words.
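As a sanity check on that arithmetic, here is a tiny sketch with made-up sample counts (the sample counts are hypothetical; 24 kHz is Kokoro's output rate):

```python
SAMPLE_RATE = 24_000  # Kokoro outputs 24 kHz audio

total_samples = 120_000  # samples already emitted by earlier chunks (hypothetical)
chunk_samples = 48_000   # length of the current chunk's audio array (hypothetical)

chunk_offset_seconds = total_samples / SAMPLE_RATE    # 5.0 s into the stream
chunk_duration_seconds = chunk_samples / SAMPLE_RATE  # this chunk lasts 2.0 s

# Distribute the chunk's duration uniformly across its characters:
text = "like it's just beginning"
char_duration = chunk_duration_seconds / len(text)  # 2.0 / 24 s per character
```

Every character in the chunk is assumed to take the same slice of time, which is the core simplification of this approach.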
Here is an example in JavaScript:
```javascript
import { KokoroTTS, TextSplitterStream } from "kokoro-js";

const SAMPLE_TEXT =
  "All around the light is golden and liquid and heavy, like it's just beginning on its second glass of wine.";
const VOICE = "af_heart";

function buildWordAlignmentData(
  text,
  chunkDurationSeconds,
  chunkOffsetSeconds,
) {
  const chars = Array.from(text);
  const charDuration =
    chars.length > 0 ? chunkDurationSeconds / chars.length : 0;
  const words = [];
  let currentText = "";
  let currentStartMs = null;
  let currentEndMs = null;

  chars.forEach((char, index) => {
    const startSeconds = chunkOffsetSeconds + index * charDuration;
    const endSeconds = startSeconds + charDuration;
    const startMs = Math.round(startSeconds * 1000);
    const endMs = Math.round(endSeconds * 1000);

    if (/\s/.test(char)) {
      if (currentText) {
        words.push({
          text: currentText,
          startMs: currentStartMs,
          endMs: currentEndMs,
        });
        currentText = "";
        currentStartMs = null;
        currentEndMs = null;
      }
      return;
    }

    if (currentStartMs === null) {
      currentStartMs = startMs;
    }

    currentText += char;
    currentEndMs = endMs;
  });

  if (currentText) {
    words.push({
      text: currentText,
      startMs: currentStartMs,
      endMs: currentEndMs,
    });
  }

  return words;
}

const tts = await KokoroTTS.from_pretrained(
  "onnx-community/Kokoro-82M-v1.0-ONNX",
  { dtype: "q8", device: "cpu" },
);

const splitter = new TextSplitterStream();
const stream = tts.stream(splitter, { voice: VOICE });

splitter.push(SAMPLE_TEXT);
splitter.close();

let totalSamples = 0;
const chunks = [];

for await (const chunk of stream) {
  const sampleRate = chunk.audio.sampling_rate;
  const chunkSamples = chunk.audio.audio.length;
  const chunkDurationSeconds = chunkSamples / sampleRate;
  const chunkOffsetSeconds = totalSamples / sampleRate;

  chunks.push({
    text: chunk.text,
    startMs: Math.round(chunkOffsetSeconds * 1000),
    endMs: Math.round((chunkOffsetSeconds + chunkDurationSeconds) * 1000),
    words: buildWordAlignmentData(
      chunk.text,
      chunkDurationSeconds,
      chunkOffsetSeconds,
    ),
  });

  totalSamples += chunkSamples;
}

console.log(
  JSON.stringify(
    {
      text: SAMPLE_TEXT,
      words: chunks.flatMap((chunk) => chunk.words),
    },
    null,
    2,
  ),
);
```

The timing data comes from the chunk audio length, and the text comes from chunk.text. There is no hidden alignment API involved.
Note
If you later need higher-quality timings in JS, you will need a different path: either a model export that exposes duration output to JavaScript, or a real forced aligner after synthesis.
This fallback works better than it sounds because kokoro-js streams sentence by sentence by default. You are usually distributing time across one sentence, not an entire paragraph.
That keeps the error bounded. It can still be wrong on long sentences, especially when punctuation adds pauses or one word is spoken much more slowly than another. But for short sentence chunks, it is usually close enough for rough transcript highlighting or captions.
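For captions, the word spans from either approach can then be merged into short phrase-level cues. Here is a minimal sketch; the greedy grouping rule, the character limit, and the example timings are my own illustrative choices, not part of Kokoro:

```python
def group_into_captions(words, max_chars=30):
    """Greedily merge word spans into caption cues of at most max_chars characters."""
    captions, current = [], []

    def flush():
        captions.append({
            "text": " ".join(w["text"] for w in current),
            "startMs": current[0]["startMs"],
            "endMs": current[-1]["endMs"],
        })

    for word in words:
        # Length of the cue if we append this word (words joined by spaces).
        candidate_len = sum(len(w["text"]) for w in current) + len(current) + len(word["text"])
        if current and candidate_len > max_chars:
            flush()
            current = []
        current.append(word)
    if current:
        flush()
    return captions

# Hypothetical word timings in the shape produced above.
words = [
    {"text": "All", "startMs": 0, "endMs": 180},
    {"text": "around", "startMs": 180, "endMs": 520},
    {"text": "the", "startMs": 520, "endMs": 640},
    {"text": "light", "startMs": 640, "endMs": 980},
    {"text": "is", "startMs": 980, "endMs": 1100},
    {"text": "golden", "startMs": 1100, "endMs": 1500},
]

captions = group_into_captions(words, max_chars=15)
print(captions)
# [{'text': 'All around the', 'startMs': 0, 'endMs': 640},
#  {'text': 'light is golden', 'startMs': 640, 'endMs': 1500}]
```

Because each cue inherits the first word's start and the last word's end, small per-word errors from the approximation tend to wash out at the phrase level.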
Kokoro is still a useful local TTS option, especially when you care about size and portability. If you can stay in Python, you can get timing data directly from synthesis without needing a forced aligner. If you need JavaScript for the browser or the edge, the current public ONNX path does not expose that alignment output, so chunk-duration approximation is the best alternative.
When you need more trustworthy alignment, a forced aligner is the next step. Qwen3 is a higher-quality, more expressive TTS, and pairing it with a forced aligner yields alignment data. The next article covers that Qwen3 workflow.
Get in touch
Have a question? Spotted a mistake? Or just want to say thank you?
Send me an email at hello@ryanwelch.co.uk - seriously! I love hearing from you.