
OpenAI Whisper on the Raspberry Pi 4 – Live transcription
Estimated reading time: 8 minutes
At MaibornWolff, we don't just develop software for smart devices, we also work with new technologies. I wanted to know whether live transcription with OpenAI Whisper would also work on a small single-board computer, so I tested it out. In this article, I'll show you what the language model can do so far – and where its limits lie.
What exactly is OpenAI Whisper?
Whisper is an AI-based, automatic speech recognition system from OpenAI that has been trained on an extensive audio data set from the internet. Whisper is not limited to a specific application or language, but can be used universally. Whether it's converting spoken words into written text, translating between different languages, or identifying which language is being spoken, Whisper excels at multitasking.
Unlike traditional models, which are limited to simple linear sequences of words, Whisper has an impressive ability: it can recognise complex dependencies between words and sentences. How does it do this? It's simple: Whisper is based on transformers, an innovative architecture for neural networks. These transformers enable the model to understand complex relationships and the context between sentences.
Whisper has a wide range of applications. Wherever human language plays a role, there is potential for Whisper to be used – whether in voice-based assistance systems or in embedded devices that can be controlled naturally through speech. Another major advantage of Whisper is that the AI technology is capable of understanding and responding to multiple languages. This makes it possible to use Whisper in a multilingual environment, which is particularly important if an application is to be used worldwide.
Motivation for live transcription on the Raspberry Pi 4
Whisper was developed specifically for transcribing audio files. On a powerful computer, such as a modern laptop, such a file can be converted into text relatively quickly. On a less powerful computer, such as the Raspberry Pi, this process may take longer, but the result remains the same: an accurate transcription.
However, when we talk about live transcription, we mean a system that responds to voice input in real time and delivers results quickly. The time factor therefore plays a much greater role here than when transcribing a simple audio file.
Instead of performing live transcription on a powerful computer, I deliberately decided to test the whole thing on a small single-board computer with limited computing power. I found the question of whether this could be done exciting and it motivated me to start the experiment with the Raspberry Pi 4. From our team's point of view, a successful implementation could lay the foundation for numerous applications of Whisper on embedded devices. The appeal for me was to explore the limits of the technology on a less powerful but still versatile device. At this point, you might ask why I didn't go for the Raspberry Pi 5 right away. The answer is simple: it wasn't available at the time.
The hardware I used
- Raspberry Pi 4 Model B, 8GB RAM
- Official Raspberry Pi power supply
- USB microphone

The path to live transcription and how it works
So, how did I go about it? First, I did a lot of research into what was already available in the open-source community. I found numerous projects and even complete re-implementations related to Whisper, some of which even had approaches to live transcription. I then tried them out and tested them through trial and error. Many of them either didn't work at all or were very slow. Eventually, I came across the ‘Whisper-Ctranslate2’ project, which – at first glance – already included a promising live transcription feature.
Whisper-Ctranslate2 is a project that integrates the ‘Faster-Whisper’ project into a command line programme. Faster-Whisper, in turn, is a re-implementation of OpenAI's Whisper and is up to four times faster with the same accuracy and even requires less memory. Efficiency can be further increased through 8-bit quantisation, both on the CPU (central processing unit) and on the GPU (graphics processing unit). Therefore, it seemed sensible to me to continue with Faster-Whisper at this point.
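For orientation, here is a minimal example of how Faster-Whisper can be called from Python with 8-bit quantisation on the CPU. The model name and the audio file are placeholders; this is not the exact code I ran on the Raspberry Pi.

```python
from faster_whisper import WhisperModel

# Load the smallest model on the CPU with 8-bit quantisation
model = WhisperModel("tiny", device="cpu", compute_type="int8")

# Transcribe an audio file; segments is a generator of transcribed segments
segments, info = model.transcribe("audio.wav")

print(f"Detected language: {info.language}")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```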
However, live transcription still did not work satisfactorily in Whisper-Ctranslate2. Based on tests and observation of the behaviour, I suspected that this was not due to the transcription speed. So I took a closer look at the code and thought about how I could get it to work.
The code for live transcription works as follows: the microphone is continuously read in blocks of 30 ms. Each 30 ms block is then passed to a function that decides, based on that single block, whether someone is speaking. To do this, the root mean square (RMS) of the block is calculated and its frequency content is evaluated. The RMS value can be thought of as the average volume, which must exceed a threshold defined by the programme. The frequency must lie within a defined range that roughly corresponds to the frequency band of a human voice.
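A minimal sketch of such a check could look like this, assuming a 16 kHz sample rate; the threshold and the frequency band are illustrative values, not the ones used in Whisper-Ctranslate2.

```python
import numpy as np

SAMPLE_RATE = 16000                 # assumption: audio is captured at 16 kHz
RMS_THRESHOLD = 0.01                # assumed "average volume" threshold
VOICE_FREQ_RANGE = (100, 1000)      # assumed rough frequency band of a voice in Hz

def is_speech(block: np.ndarray) -> bool:
    """Decide whether a single 30 ms audio block contains speech."""
    # "Average volume": root mean square of the samples
    rms = np.sqrt(np.mean(block ** 2))
    if rms < RMS_THRESHOLD:
        return False

    # Dominant frequency of the block via an FFT (ignoring the DC component)
    spectrum = np.abs(np.fft.rfft(block))
    spectrum[0] = 0.0
    freqs = np.fft.rfftfreq(len(block), d=1.0 / SAMPLE_RATE)
    dominant = freqs[np.argmax(spectrum)]

    # A voice is recognised only if the dominant frequency lies in the voice band
    return VOICE_FREQ_RANGE[0] <= dominant <= VOICE_FREQ_RANGE[1]
```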
If both conditions are met, a voice is recognised and the programme starts recording. The recording is saved block by block in a temporary buffer. After each block, however, the programme checks whether the person is still speaking. If no voice is recognised for a number of consecutive blocks specified by the programme, the recording is stopped.
Several blocks are required so that short pauses or breaths are not interpreted as the end of the recording. A completed recording is then handed over to a second thread via a queue and transcribed in parallel. Meanwhile, the first thread continues to evaluate blocks and is ready to start a new recording at any time. Greatly simplified, the process works like this.
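The following sketch illustrates the two threads; it is my simplified illustration, not the project's code. read_block and transcribe are placeholders for the microphone input and the Faster-Whisper call, is_speech is the check sketched above, and MAX_SILENT_BLOCKS is an assumed count of silent blocks that ends a recording.

```python
import queue
import threading

MAX_SILENT_BLOCKS = 30              # assumed: ~0.9 s of silence ends a recording

recordings = queue.Queue()          # FIFO: recordings are transcribed in spoken order

def record_loop():
    """Producer: evaluates 30 ms blocks and assembles complete recordings."""
    while True:
        blocks, silent = [], 0
        while True:
            block = read_block()            # placeholder: next 30 ms from the microphone
            if is_speech(block):            # speech check from the sketch above
                blocks.append(block)
                silent = 0
            elif blocks:                    # already recording: count silent blocks
                blocks.append(block)
                silent += 1
                if silent >= MAX_SILENT_BLOCKS:
                    break                   # end of the utterance
        recordings.put(blocks)              # hand the finished recording to the worker

def transcribe_loop():
    """Consumer: transcribes finished recordings while recording continues."""
    while True:
        audio_blocks = recordings.get()     # blocks until a recording is available
        print(transcribe(audio_blocks))     # placeholder for the Faster-Whisper call

threading.Thread(target=record_loop, daemon=True).start()
transcribe_loop()
```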
3 obstacles – 3 solutions
Once I understood the principle behind the code, I was able to identify three problems:
- The live transcription responded sluggishly
- When sentences were spoken faster than the Raspberry Pi 4 could transcribe them, they were not output in the correct order
- The first word of an utterance was often swallowed during live transcription
The first problem was just a configuration issue. After the speaker fell silent, the programme waited through too many blocks without detected speech before ending the recording. I reduced this wait to just under 1 second (roughly 30 of the 30 ms blocks), which made the live transcription feel noticeably more responsive.
The second problem was a data structure issue. The recordings were simply pushed into a buffer and taken out again according to the stack principle (last in, first out).
This worked as long as there was only one recording in the buffer. However, if a recording was still waiting in the buffer when another was added, the newer recording was transcribed first instead of the one that had arrived first. The solution was to use a queue that works according to the FIFO principle (first in, first out): new recordings are added at one end and removed at the other, so the oldest recording is always transcribed first.
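A small, self-contained illustration of the difference (not the project code): with the stack, the most recent recording comes out first; with the FIFO queue, the recordings keep their spoken order.

```python
import queue

recordings = ["recording 1", "recording 2", "recording 3"]

# Stack behaviour (the original buffer): the most recent recording comes out first
stack = []
for r in recordings:
    stack.append(r)
print([stack.pop() for _ in recordings])   # ['recording 3', 'recording 2', 'recording 1']

# FIFO queue (the fix): recordings come out in the order they were spoken
fifo = queue.Queue()
for r in recordings:
    fifo.put(r)
print([fifo.get() for _ in recordings])    # ['recording 1', 'recording 2', 'recording 3']
```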
The third problem was the biggest challenge of the three. My assumption was that the speech detection (the RMS and frequency check) was slightly imprecise and only recognised the voice in the middle of the first word. The Whisper model then receives only a half-spoken word and, because it cannot make sense of the fragment, simply cuts the word off.
I then asked myself how I could solve this and came up with the idea of pre-storing a certain number of blocks even while no speech is detected. Incoming blocks are continuously written to a fixed-size queue, i.e. only a certain number of blocks can be held at any time. Specifically, I used a double-ended queue (deque), which allows elements to be added and removed at both ends. In principle, you can think of the deque as a tube: blocks are pushed in at one side, and when the tube is full, the oldest block falls out at the other side and is discarded.
When a voice is recognised, these pre-stored blocks are simply placed at the beginning of the recording. After a bit of trial and error, I found that 15 blocks, i.e. 450 ms, achieved the desired effect. So my assumption was correct, and my third problem was solved: words were finally captured in their entirety and no longer swallowed during live transcription.
In the Python code, the change centres on the ‘PreBufferedBlockSize’ pre-buffer.
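The following is a simplified sketch of that idea rather than the exact excerpt from my code. read_block and is_speech are the placeholders from the earlier sketches, and the values correspond to the 15 blocks, i.e. 450 ms, mentioned above.

```python
from collections import deque

PreBufferedBlockSize = 15                  # 15 blocks * 30 ms = 450 ms of pre-buffer

# Fixed-size deque: once it is full, the oldest block is discarded automatically
pre_buffer = deque(maxlen=PreBufferedBlockSize)

recording = []
is_recording = False

while True:
    block = read_block()                   # placeholder: next 30 ms from the microphone

    if not is_recording:
        # No speech yet: keep only the most recent 450 ms in the pre-buffer
        pre_buffer.append(block)
        if is_speech(block):               # speech check from the earlier sketch
            # Start the recording with the pre-buffered blocks so that the
            # beginning of the first word is no longer cut off
            recording = list(pre_buffer)
            is_recording = True
    else:
        recording.append(block)
        # ... silence detection and hand-off to the transcription queue as before
```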
Which models are suitable for the Raspberry Pi 4?
The runtime of Whisper, i.e. the actual transcription, is strongly influenced by the choice of model. There are five model sizes in total, each offering a different trade-off between speed and accuracy: the smallest model works the fastest but is the least accurate. The lack of accuracy can then show up, for example, as missing or incorrect words.
A brief runtime analysis showed me which models are suitable for the Raspberry Pi 4. Since a live scenario can never be completely reproduced, I used a randomly selected English-language podcast (length 100 seconds) for my analysis. The transcription was measured three times per model using the command line programme ‘time’ and averaged to compensate for inaccuracies. The transcription was performed on the CPU and the Raspberry Pi 4 was run in headless mode without a GUI to avoid possible interference.
Results of the runtime analysis:
Model | Runtime | Relative to real time
---|---|---
Tiny | 35s | 2.85x
Base | 53s | 1.88x
Small | 2m 16s | 0.73x
Medium | 5m 52s | 0.28x
Large-v2 | 10m 51s | 0.15x
The last column is the length of the audio divided by the transcription time; values below 1x mean the transcription runs slower than real time.
On the Raspberry Pi 4, the runtime exceeds real time starting with the Small model. Therefore, only the Tiny and Base models are suitable for live transcription in this case. The Tiny model offers faster live transcription at the expense of accuracy. The Base model is more accurate, but live transcription already feels noticeably slower.
Conclusion and outlook
For home users, the runtime and accuracy of the Tiny or Base models may be satisfactory. The larger models offer higher accuracy, but the Raspberry Pi 4 does not have enough power to transcribe quickly enough for a live application. In terms of potential commercial end-user solutions, live transcription on a Raspberry Pi 4 is therefore unlikely to be an option.
However, in my opinion, there are a number of approaches that could be taken at this point:
- On the software side, transcription with Faster-Whisper is already more efficient than with OpenAI Whisper and will certainly be developed further. On top of that, it would make sense to simply switch to a Raspberry Pi 5, which is about 2–3 times more powerful.
- Another approach would be to test GPU-based transcription. However, GPU-based transcription is limited to CUDA-enabled GPUs from Nvidia. Nvidia's Jetson Nano could be a good option here and should significantly reduce transcription time.
- If live transcription is to be used specifically for voice control in an embedded device, it would make sense to use a wake word engine to start voice recording with a keyword/phrase instead of RMS and frequency. This should significantly improve stability.
In my next experiment, I will investigate one of these approaches in more detail.