Apr 19 · in engineering
by Ryan Welch · 4 Min Read
I have been looking for a local TTS workflow that gives me both better speech quality and usable alignment data, mainly as an alternative to ElevenLabs’ timestamped TTS API. My previous attempt used Kokoro, which is very lightweight and can expose timing data during synthesis. That part worked, but I was not that happy with the output quality. It is great for the model size, but the intonation sometimes sounded off, and it occasionally produced random garbled noises.
So I wanted to try a stronger model. Qwen3-TTS was released in January 2026, and the launch post and official GitHub repository show why it got attention so quickly: better speech quality, custom voice support, and useful prompting controls for voice style. The catch is that it does not expose timing data during synthesis. If you want word-level timestamps, the recommended approach is to synthesize first and then run Qwen3-ForcedAligner against the finished audio. The aligner is a separate model, which is the point: it measures the actual waveform after synthesis rather than relying on predicted durations.
Qwen3-TTS sounds good out of the box, and the larger models are especially flexible because you can prompt for a more specific speaking style. The Qwen3 launch post has some excellent examples and prompting ideas. This demo is a basic sample generated with the smaller 0.6B model, paired with word timings from Qwen3-ForcedAligner.
As mentioned above, with Qwen3 you need to run the forced aligner after synthesis to extract timing data.
The full flow looks like this:

1. Synthesize the text with Qwen3-TTS and save the audio to a WAV file.
2. Run Qwen3-ForcedAligner over the finished audio plus the original transcript.
3. Collect the word-level [start, end] spans from the aligner’s output.

For the examples, I’m using uv because it makes small scripts like this easy to manage with inline dependencies. The examples also focus on running Qwen3 on Apple Silicon macOS.
The easiest setup I found was a small Python script that uses MLX-backed models from mlx-audio and runs through uv. If you want the reference implementation for the standard PyTorch path, the official Qwen3-TTS repository has that. The dependency block lives inside the script, so there is no separate environment setup to document.
```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "mlx-audio",
#     "soundfile",
#     "numpy",
# ]
# ///

import json
from pathlib import Path

import numpy as np
import soundfile as sf
from mlx_audio.stt import load as load_aligner
from mlx_audio.tts.utils import load_model as load_tts_model

text = "The practical way to get timestamps from Qwen is to align the finished audio."
language = "English"
voice = "Chelsie"
wav_out = Path("/tmp/qwen-output.wav")
json_out = Path("/tmp/qwen-alignment.json")

# Step 1: synthesize the text and write the audio out as a WAV file.
tts_model = load_tts_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")
results = list(tts_model.generate(text=text, voice=voice, language=language))

sample_rate = results[0].sampling_rate
audio = np.array(results[0].audio)
sf.write(str(wav_out), audio, sample_rate)

# Step 2: run the forced aligner over the finished audio, using
# the exact transcript from step 1, to get word-level timings.
aligner = load_aligner("mlx-community/Qwen3-ForcedAligner-0.6B-8bit")
result = aligner.generate(audio=str(wav_out), text=text, language=language)

words = [
    {
        "text": item.text,
        "start": float(item.start_time),
        "end": float(item.end_time),
    }
    for item in result
]

json_out.write_text(json.dumps(words))
```

For non-Apple-Silicon machines, the same idea works with a PyTorch-based script instead. The important part is keeping synthesis and alignment as two explicit steps, regardless of backend.
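Once you have the word spans, they are easy to turn into downstream formats. As a quick sanity check, here is a small sketch that converts spans with the same shape as the alignment JSON into SRT captions, one word per cue (the sample spans are inline so the sketch is self-contained; with the real pipeline you would load /tmp/qwen-alignment.json instead):

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds in the SRT timestamp style HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words: list[dict]) -> str:
    """One SRT cue per word span, numbered from 1."""
    cues = []
    for i, w in enumerate(words, start=1):
        cues.append(
            f"{i}\n{to_srt_time(w['start'])} --> {to_srt_time(w['end'])}\n{w['text']}\n"
        )
    return "\n".join(cues)

# Inline sample spans standing in for the alignment JSON.
words = [
    {"text": "The", "start": 0.0, "end": 0.18},
    {"text": "practical", "start": 0.18, "end": 0.71},
]
print(words_to_srt(words))
```

One-word-per-cue is the simplest mapping; grouping a few words per cue is just a matter of merging adjacent spans before formatting.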
Note
The aligner needs the exact transcript used for synthesis. If you normalize punctuation, expand numbers, or otherwise change the text between the TTS step and the alignment step, the word spans can drift or fail.
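A simple way to avoid that drift is to do any text cleanup once, up front, and then pass the identical string to both stages. A minimal sketch (the normalization rules here are just placeholders, not what Qwen requires):

```python
import re

def normalize(text: str) -> str:
    # Placeholder cleanup rules: expand "&" and collapse whitespace runs.
    # Real number expansion or punctuation handling would also live here.
    text = text.replace("&", " and ")
    return re.sub(r"\s+", " ", text).strip()

# Normalize once, then feed the same string to both stages:
#   tts_model.generate(text=script, ...)        # synthesis
#   aligner.generate(audio=..., text=script)    # alignment
script = normalize("Synthesis  &  alignment must see the same text.")
print(script)  # -> "Synthesis and alignment must see the same text."
```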
Here I’m using the 0.6B model because it is a good balance of quality and size. You can also use the larger model, but it increases the hardware requirements.
If you want better quality from a local model, Qwen3 is a good choice. The speech quality is better than what I was getting from Kokoro, and the forced aligner gives you timestamps based on the finished waveform rather than synthesis-time duration predictions.
However, it really wants hardware acceleration to feel usable. It is slower than kokoro-js, the first run needs to download the model weights, and there is an extra inference pass because alignment is a separate step.
That makes Qwen3 a good fit when output quality matters more than startup cost or raw speed: narration, generated clips, lessons, dubbing, or anything else where you care about how the voice sounds and want more trustworthy word timing.
I would still use kokoro-js when I want something quick and simple on the edge, especially on a phone or in the browser. It is a much lighter fit for that environment, and even though the voice quality is worse, the synthesis is still good enough in most cases.
Get in touch
Have a question? Spotted a mistake? Or just want to say thank you?
Send me an email at hello@ryanwelch.co.uk - seriously! I love hearing from you.