CaptionPal integrates a machine learning model that detects human speech in an audio track, with an accuracy of ~82%. The final synchronization is nonetheless highly accurate, because individual detection errors are averaged out over the length of the video.
The model was trained on approximately 3 hours of English audio from two television series. The dataset is properly balanced between speech and non-speech sequences.
Thanks to this model, CaptionPal knows approximately when there is human speech in a video and when there is not. Although the model was trained on English audio, it will likely work for other languages as well, assuming that human speech shares similar acoustic characteristics regardless of the language. This remains to be verified.
Synchronization is then performed by aligning the detected speech with the subtitles. This is done with a quick brute-force search over combinations of subtitle delay and framerate to find the best match.
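The brute-force search described above can be sketched as follows. This is a minimal illustration, not CaptionPal's actual implementation: the function names, the candidate grids of delays and framerate ratios, and the overlap-based scoring are all assumptions made for the example.

```python
from itertools import product

def overlap_score(speech, subs):
    """Total seconds during which subtitle intervals overlap detected speech."""
    total = 0.0
    for s_start, s_end in subs:
        for p_start, p_end in speech:
            total += max(0.0, min(s_end, p_end) - max(s_start, p_start))
    return total

def best_alignment(speech, subs, delays, fps_ratios):
    """Brute-force search over (delay, framerate ratio) combinations.

    speech: list of (start, end) times in seconds where speech was detected
    subs:   list of (start, end) times in seconds from the subtitle file
    Returns the (delay, ratio) pair maximizing overlap with detected speech.
    """
    best, best_score = None, float("-inf")
    for delay, ratio in product(delays, fps_ratios):
        # Rescale subtitle timestamps (framerate mismatch), then shift (delay).
        shifted = [(s * ratio + delay, e * ratio + delay) for s, e in subs]
        score = overlap_score(speech, shifted)
        if score > best_score:
            best_score, best = score, (delay, ratio)
    return best

# Toy example: the subtitles lag the detected speech by exactly 2 seconds.
speech = [(10.0, 12.0), (20.0, 23.0)]
subs = [(8.0, 10.0), (18.0, 21.0)]
delay, ratio = best_alignment(
    speech, subs,
    delays=[d * 0.5 for d in range(-8, 9)],          # -4 s to +4 s in 0.5 s steps
    fps_ratios=[1.0, 25 / 23.976, 23.976 / 25],      # common framerate conversions
)
# → delay = 2.0, ratio = 1.0
```

Because the search space is small (a few dozen delays times a handful of framerate ratios), exhaustively scoring every combination is fast enough in practice.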
Several improvements are planned on the roadmap:
This application was inspired by a few sources. Credit where it is due:
This application would not be possible without the constant hard work of the people writing the subtitles. CaptionPal fetches TV-series subtitles from https://www.addic7ed.com; please consider making a donation to them. If you want to support CaptionPal, consider donating to a charity instead, or star the project on GitLab. And if you can code, contributions are welcome!