Extracting word-level timestamps from kokoro-js, which discards the phoneme duration data that the ONNX model produces internally.