How Automatic Video Subtitles and Translation Work in Ray’s FFmpeg Commander Toolbox




[Screenshot: FFmpeg Commander subtitle and translation dialog showing Whisper transcription settings]

Ever watched a video in a foreign language and wished the subtitles would just appear by themselves? That is exactly what the subtitle and translation feature in FFmpeg Commander does — and it does it entirely on your computer, with no internet subscription, no uploading your footage to a cloud service, and no technical knowledge required.

In this post we will walk you through what actually happens under the hood, from the moment you click Transcribe to the moment your finished video has subtitles burned in and ready to watch.


The Big Picture: Four Steps to Subtitled Video

The whole process happens in four clean stages:

  1. Audio is extracted from your video
  2. A speech recognition model listens to the audio and writes down everything it hears
  3. A translation model converts the text into your chosen language
  4. The subtitles are burned into the video using FFmpeg

Each step feeds directly into the next. You just choose your language, pick your quality level, and press go.
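Concretely, the first stage is a single FFmpeg invocation. The sketch below shows how such a command could be assembled; the flags (-vn to drop video, -ar 16000 and -ac 1 for 16 kHz mono) are a common choice because Whisper models expect 16 kHz mono audio, but the helper function and file names are illustrative assumptions, not FFmpeg Commander's actual code.

```python
# Illustrative sketch of stage 1 (audio extraction). The helper name
# and file names are assumptions; the flags are standard FFmpeg options.

def extract_audio_cmd(video_path: str, audio_path: str) -> list[str]:
    """Build an ffmpeg command that pulls the audio track out of a video
    as 16 kHz mono WAV, the input format Whisper models expect."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-vn",            # drop the video stream
        "-ar", "16000",   # 16 kHz sample rate
        "-ac", "1",       # mono
        audio_path,
    ]

cmd = extract_audio_cmd("input.mp4", "audio.wav")
print(" ".join(cmd))
```

The resulting WAV file is what gets handed to the speech recognition stage described next.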


Step 1 — Listening to the Audio: Meet Whisper

The speech recognition engine powering FFmpeg Commander is called Whisper, an open-source AI model built by OpenAI and released to the public. Whisper was trained on hundreds of thousands of hours of real human speech in dozens of languages, which makes it remarkably good at understanding natural conversation — accents, background noise, overlapping voices and all.

Whisper does not just transcribe words. It also timestamps every segment of speech, noting exactly when each phrase starts and ends. Those timestamps are what make your subtitles sync up with the video.

Choosing the Right Model Size

Whisper comes in several sizes, and the size you pick is a trade-off between speed and accuracy:

  • Tiny / Base — Very fast, works on almost any computer. Best for clear, slow speech with minimal background noise. Can miss short phrases and quiet voices.
  • Small — A good everyday balance. Catches most normal conversation without being too slow.
  • Medium — Noticeably better at difficult audio: strong accents, fast speech, brief exchanges at the start of a scene.
  • Large / Large-v3 — The most accurate model available. Handles short utterances, overlapping speakers, and challenging accents far better than smaller models. Slower on CPU but worth it when quality matters.

Tip: If you notice that subtitles are missing for short conversations — especially near the beginning of a video — switching from Small to Large will often solve it. Larger models are much better at detecting brief bursts of speech.

Balanced vs Accurate Mode

FFmpeg Commander also lets you choose between Balanced and Accurate processing modes.

  • Balanced uses a compressed version of the model (called int8 quantization) that runs faster but makes slightly more mistakes, particularly with quiet or short speech.
  • Accurate runs the model at full precision, giving you the best possible transcript. It is slower on CPU but is the right choice when you need every word to be captured correctly.
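As a sketch of what this choice means in code: the faster-whisper engine mentioned later in this post takes a compute_type argument, and a small mapping like the one below captures the trade-off. The mode-to-precision mapping is an assumption for illustration, not FFmpeg Commander's published configuration.

```python
# Hypothetical mapping from the UI's quality modes to faster-whisper
# compute types. "int8" is the quantized (Balanced) path; full precision
# (Accurate) would typically be float32 on CPU or float16 on GPU.

def compute_type_for(mode: str, device: str) -> str:
    if mode == "Balanced":
        return "int8"  # quantized: faster, slightly less accurate
    if mode == "Accurate":
        return "float16" if device == "cuda" else "float32"
    raise ValueError(f"unknown mode: {mode}")

print(compute_type_for("Balanced", "cpu"))   # → int8
print(compute_type_for("Accurate", "cuda"))  # → float16
```

The returned string would then be passed to the engine when the model is loaded.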

CPU vs GPU — Why Hardware Matters for Speed

Transcribing speech with an AI model is computationally demanding. The device you choose inside FFmpeg Commander makes a dramatic difference in how long you wait.

CPU Mode

CPU mode works on every computer with no additional setup. It is the universal fallback. The downside is speed — transcribing a 10-minute video with the Large model on a modern CPU can take anywhere from 5 to 20 minutes depending on your processor. Perfectly usable, just not fast.

NVIDIA GPU — CUDA Acceleration

If your Windows machine has an NVIDIA graphics card, FFmpeg Commander can install a CUDA-accelerated version of the transcription engine with a single click using the built-in GPU Installer. CUDA offloads the heavy matrix calculations from your CPU onto your GPU, which is purpose-built for exactly this kind of parallel number crunching.

The real-world difference is significant. A transcription job that takes 15 minutes on CPU can complete in 1 to 2 minutes on a mid-range NVIDIA GPU. High-end cards like the RTX 3080 or 4090 can process audio many times faster than real time, finishing a 10-minute video in under a minute.

FFmpeg Commander handles the entire CUDA setup for you. It downloads and installs the correct version of PyTorch and the faster-whisper engine into a self-contained environment so nothing interferes with the rest of your system. You just click Install GPU Addon, wait a few minutes, and from that point on every transcription runs at full GPU speed.
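A minimal auto-detect sketch, assuming the GPU Addon installs PyTorch as described above. The function name is hypothetical; the logic simply prefers CUDA when a working install reports a usable GPU and falls back to CPU otherwise.

```python
# Illustrative device auto-detection: try CUDA, fall back to CPU.
# The function name is an assumption for this sketch.

def pick_device() -> str:
    try:
        import torch  # installed by the GPU Addon in CUDA setups
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass  # no PyTorch present: CPU is the universal fallback
    return "cpu"

print(pick_device())
```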

Recommended for: Anyone who transcribes frequently, works with long videos, or simply does not want to wait. Any NVIDIA GTX 1000 series or newer will see a major improvement.

Apple VideoToolbox — Mac Hardware Acceleration

On macOS, FFmpeg Commander uses Apple VideoToolbox and the Metal GPU framework, which are built directly into every Mac. There is nothing to install — Apple Silicon Macs (M1, M2, M3, M4) in particular are exceptionally fast at this kind of workload because their Neural Engine is designed for exactly this type of AI inference.

An M2 MacBook Pro running the Large model can transcribe a 10-minute video in roughly 2 to 3 minutes — competitive with a dedicated NVIDIA GPU, while consuming a fraction of the power. VideoToolbox is enabled automatically on Mac with no configuration needed.

Speed Comparison at a Glance

Typical transcription time for the Large model on a 10-minute video:

  • CPU (modern desktop): 5 to 20 minutes. No setup required; works out of the box.
  • NVIDIA GPU (CUDA): 1 to 3 minutes. One-click GPU Addon install.
  • Apple Silicon (VideoToolbox): 2 to 4 minutes. No setup required; automatic on Mac.

Step 2 — What Is an SRT File?

Once Whisper has finished listening, the result is saved as an SRT file — which stands for SubRip Text. It is one of the most widely supported subtitle formats in the world, and it is surprisingly simple. Here is what one looks like:

1
00:00:03,200 --> 00:00:05,800
Hello, welcome to our show.

2
00:00:06,500 --> 00:00:09,100
Today we are talking about something exciting.

Each entry has three parts:

  1. A sequence number
  2. A timecode showing when the subtitle appears and disappears (hours:minutes:seconds,milliseconds)
  3. The subtitle text itself

That is it. Plain text. Any video player, editor, or streaming platform can read it. SRT files are also what services like YouTube and Vimeo accept when you upload your own captions.

FFmpeg Commander saves this file alongside your video automatically so you always have a standalone subtitle file you can edit, share, or reuse.
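Because the format is so simple, generating it takes only a few lines. The sketch below writes entries in the exact HH:MM:SS,mmm layout shown above; the (start, end, text) tuples are an illustrative input shape, not Whisper's actual output objects.

```python
# A tiny SRT writer, to show how simple the format really is.

def srt_timestamp(seconds: float) -> str:
    """Convert seconds into the HH:MM:SS,mmm form SRT uses."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render (start, end, text) tuples as numbered SRT entries."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)

# Reproduces the first example entry shown above:
print(to_srt([(3.2, 5.8, "Hello, welcome to our show.")]))
```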


Step 3 — Translation: Meet ArgosTranslate

If you want subtitles in a language other than the one spoken in the video, FFmpeg Commander passes the SRT text through a second AI model called ArgosTranslate.

ArgosTranslate is an open-source translation engine that runs entirely on your machine — no API keys, no sending your script to a third-party server. It works by downloading a language pack for the specific pair of languages you need. For example, if you want English speech translated into Polish subtitles, it downloads the English-to-Polish language pack (typically around 100–500 MB depending on the language).

Language packs are downloaded once and reused forever. The next time you translate to the same language, it happens instantly with no additional download. If you click Transcribe without downloading the pack first, FFmpeg Commander detects the missing pack and downloads it automatically before proceeding — nothing breaks, it just takes a moment longer on the first run.
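The caching behaviour amounts to a check-before-download: look for the pack on disk, and only fetch it when missing. The cache directory layout and file naming in this sketch are assumptions for illustration, not ArgosTranslate's guaranteed on-disk format.

```python
# Illustrative "download once, reuse forever" pack check. The cache
# layout and the .argosmodel file naming are assumptions.

from pathlib import Path
import tempfile

def pack_is_cached(cache_dir: Path, src: str, dst: str) -> bool:
    """True if the language pack for src -> dst is already on disk."""
    return (cache_dir / f"{src}_{dst}.argosmodel").exists()

cache = Path(tempfile.mkdtemp())
print(pack_is_cached(cache, "en", "pl"))   # → False (nothing downloaded yet)
(cache / "en_pl.argosmodel").touch()       # simulate a completed pack download
print(pack_is_cached(cache, "en", "pl"))   # → True (reused on every later run)
```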

How the Timing Stays Accurate After Translation

One challenge with translated subtitles is that the translated text is often a different length than the original. A sentence that takes two seconds to say in English might translate to a much longer phrase in German.

FFmpeg Commander handles this automatically with a smart timing system:

  • Word-level timestamps — The speech recognition model notes exactly when each individual word was spoken. Subtitles are trimmed to disappear when the last word finishes, not when a long silence ends.
  • Reading speed floor — Even if speech was very fast, the subtitle stays on screen long enough for a normal reader to finish reading the translated text.
  • Gap enforcement — A small breathing gap is always preserved between consecutive subtitles so the screen never feels cluttered.
  • Hard maximum — No subtitle lingers on screen for more than 6 seconds, even if the original segment was unusually long.
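The four rules above combine naturally into one small function. The sketch below is an illustration of the logic, not FFmpeg Commander's actual code: the 14 characters-per-second reading speed and 0.2-second gap are assumed defaults, while the 6-second cap matches the rule stated above.

```python
# Illustrative implementation of the subtitle timing rules. The cps
# and min_gap defaults are assumptions; max_dur is the 6-second cap.

def adjust_timing(start, last_word_end, next_start, text,
                  cps=14.0, min_gap=0.2, max_dur=6.0):
    """Return the display end time for one translated subtitle."""
    end = last_word_end                        # word-level trim: end with the last word
    end = max(end, start + len(text) / cps)    # reading-speed floor for the translation
    end = min(end, start + max_dur)            # hard maximum on screen time
    if next_start is not None:
        end = min(end, next_start - min_gap)   # breathing gap before the next cue
    return end

# Fast half-second utterance, but a longer translated sentence:
# the subtitle is held on screen long enough to read it.
print(adjust_timing(0.0, 0.5, 10.0, "Ein sehr langer übersetzter Satz."))
```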

Step 4 — Burning Subtitles Into the Video

The final step is handled by FFmpeg, the same powerful open-source video engine that drives the rest of FFmpeg Commander. It reads the SRT file and burns the subtitle text directly into the video frames — this is called hardcoding or burning in subtitles.

Hardcoded subtitles are permanently part of the video. They show up on any device, any player, any platform — no separate file needed, no settings to configure. Perfect for sharing on social media, messaging apps, or anywhere you cannot guarantee the viewer has a subtitle-capable player.

You can customise the look of your subtitles before burning — font, size, colour, outline, and position on screen are all adjustable inside FFmpeg Commander.
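Under the hood this maps to FFmpeg's subtitles video filter, whose force_style option accepts ASS style fields such as FontName and FontSize. A sketch of how such a command could be built; the helper and the style defaults are illustrative, not FFmpeg Commander's actual settings.

```python
# Illustrative burn-in command using FFmpeg's "subtitles" filter.
# force_style takes ASS style fields; the defaults here are assumptions.

def burn_in_cmd(video, srt, out, font="Arial", size=24):
    style = f"FontName={font},FontSize={size}"
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-vf", f"subtitles={srt}:force_style='{style}'",
        "-c:a", "copy",   # audio passes through untouched
        out,
    ]

print(burn_in_cmd("input.mp4", "subs.srt", "output.mp4"))
```

Because the subtitles are rendered into the frames themselves, the output plays with captions everywhere, with no separate file to carry along.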


Everything Runs on Your Computer

It is worth emphasising: none of this requires an internet connection after the initial model download. Your video footage never leaves your machine. There are no usage limits and no privacy concerns about uploading sensitive recordings to a cloud service.

The AI models run locally using your CPU, NVIDIA GPU, or Apple Silicon. The first time you use a new language or model size, FFmpeg Commander downloads it automatically in the background. After that, everything is instant.


Quick Reference: The Full Workflow

  1. Audio extraction: audio is pulled from the video file (FFmpeg)
  2. Speech recognition: audio is transcribed to timestamped text (Whisper AI)
  3. SRT generation: timestamped text is saved as a subtitle file (FFmpeg Commander)
  4. Translation (optional): subtitle text is translated into your chosen language (ArgosTranslate)
  5. Subtitle burn-in: subtitles are permanently embedded into the video (FFmpeg)

Supported Translation Languages

FFmpeg Commander currently supports subtitle translation into the following languages:

  • English
  • Polish
  • Spanish
  • French
  • German
  • Portuguese
  • Chinese (Simplified)
  • Japanese
  • Korean
  • Italian
  • Dutch

Additional languages are planned for future updates.


Getting Started

To use the subtitle and translation feature in FFmpeg Commander:

  1. Open your video in FFmpeg Commander
  2. Go to the Subtitles / Transcription section
  3. Choose your model size (Large is recommended for best results)
  4. Select CUDA or let the app auto-detect your GPU for faster processing
  5. Select your target language if you want translation
  6. Click Transcribe — the app handles everything else automatically

The first run will download any required models and language packs. Every run after that uses the locally cached versions and completes much faster.


This entry was posted in video editing. Bookmark the permalink.
