Our AI Transcription Service for Telegram eliminates the need for manual note-taking and tedious playback. By combining OpenAI’s industry-leading Whisper ASR model with Groq’s LPU hardware acceleration, we process one minute of audio in under 3 seconds. Whether it’s a compressed OGG file from a chat or a long meeting recording, our bot identifies 97 languages, handles heavy accents, and punctuates the text automatically. With a strict 5-minute data deletion policy and seamless export to Google Docs, we provide a secure, high-speed solution for professionals who need to act on information as fast as it’s spoken.
Transcription converts recorded speech into text. The task sounds simple, but several layers of technology sit behind it. A microphone captures the sound, algorithms split the signal into short segments and identify phonemes, and the model combines those phonemes into words, using the surrounding context to choose between words that sound alike. The end result is a text document that can be searched, edited, and analyzed.
Transcription used to be done entirely by hand, with a person typing out every hour of audio step by step. Today, neural networks learn to recognise patterns in speech from very large volumes of recorded audio, without a human in the loop. The difference is like that between making watches by hand, with all the labour and delay involved, and making them on a modern assembly line, where everything happens quickly and simply.
There are three essential components of the process of recognising audio:
OpenAI's Whisper is a machine learning model for speech recognition, trained on roughly 680,000 hours of recorded audio. It not only captures the spoken words but also adds punctuation and uses context to resolve ambiguous words. Groq's LPU (Language Processing Unit) chip runs the model on hardware built specifically for neural network inference rather than on a general-purpose processor. According to OpenAI's 2023 figures, Whisper transcribes English with about 96% accuracy and most major languages with comparable accuracy (about 94%) when the recording is clean. With background noise, accuracy typically drops by a further 5 to 8 percentage points, but it remains significantly higher than that of Google's or Azure's transcription services.
Telegram records voice messages as OGG Opus files, a compressed audio format with a low bitrate of 16 to 32 kbps. The audio may sound acceptable through a speaker, but compression artifacts make recognition harder: accuracy can drop by up to 15% compared with the original, uncompressed recording.
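To put those bitrates in perspective, the short calculation below estimates how large a voice message is per minute and how much audio fits under the 20 MB bot download limit mentioned later in this article. It is a rough sketch that assumes a constant bitrate and ignores OGG container overhead; the function names are illustrative.

```python
def opus_size_bytes(duration_s: float, bitrate_kbps: float) -> int:
    """Approximate audio payload size, assuming a constant bitrate."""
    return int(duration_s * bitrate_kbps * 1000 / 8)


def max_minutes_within(limit_bytes: int, bitrate_kbps: float) -> float:
    """How many minutes of audio fit into limit_bytes at the given bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return limit_bytes / bytes_per_second / 60


LIMIT = 20 * 1024 * 1024  # the 20 MB download limit, in bytes

for kbps in (16, 32):
    per_minute = opus_size_bytes(60, kbps)
    minutes = max_minutes_within(LIMIT, kbps)
    print(f"{kbps} kbps: ~{per_minute / 1000:.0f} kB per minute, "
          f"~{minutes:.0f} minutes within 20 MB")
```

At these bitrates a typical voice note is far below the limit, so the size check mostly matters for long recordings sent as ordinary audio files.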
Because Whisper is designed to cope with distorted and compressed recordings, it handles these files far better than most other systems on the market. Groq's processing also applies volume normalization, noise suppression, and amplification of quiet passages, which limits the accuracy loss to 2% to 3% relative to high-quality WAV files.
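The article does not say how normalization and noise suppression are implemented. As one possible illustration, the sketch below builds an `ffmpeg` command that converts a voice note to 16 kHz mono WAV and applies ffmpeg's `afftdn` (denoise) and `loudnorm` (loudness normalization) filters. The choice of ffmpeg, the filter chain, and the function name are assumptions, not a description of the actual pipeline; the command is only constructed here, not executed.

```python
from pathlib import Path
from typing import List


def build_preprocess_command(src: str, sample_rate: int = 16000) -> List[str]:
    """Return an ffmpeg argument list that cleans up and converts src to WAV.

    Assumed settings: mono, 16 kHz, denoise then loudness-normalize.
    """
    dst = Path(src).with_suffix(".wav")
    return [
        "ffmpeg", "-y",              # overwrite the output file if it exists
        "-i", src,                   # input, e.g. the OGG Opus voice note
        "-af", "afftdn,loudnorm",    # noise reduction, then loudness normalization
        "-ac", "1",                  # downmix to a single channel
        "-ar", str(sample_rate),     # resample to the target rate
        str(dst),
    ]


print(" ".join(build_preprocess_command("voice.oga")))
```

To run the conversion, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed.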
File handling goes through the Telegram Bot API. The bot downloads the OGG file using its unique file identifier, converts it to WAV if necessary, sends it to Groq for transcription, and returns the text to the user in the chat. The whole process typically takes 8 to 12 seconds for a one-minute file and does not require the user to register.
Transcription is now used across many different industries.
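The download step can be sketched as two URLs: a `getFile` call that resolves the file identifier to a server-side path, and a second URL that serves the file itself. The helper names below are illustrative, and the token and identifiers in the example are placeholders; no network request is made.

```python
from urllib.parse import quote, urlencode

API_ROOT = "https://api.telegram.org"


def get_file_url(token: str, file_id: str) -> str:
    """URL of the getFile call, which returns a JSON object containing file_path."""
    return f"{API_ROOT}/bot{token}/getFile?{urlencode({'file_id': file_id})}"


def download_url(token: str, file_path: str) -> str:
    """URL that serves the file contents for a file_path returned by getFile."""
    return f"{API_ROOT}/file/bot{token}/{quote(file_path)}"


# Placeholder values for illustration only.
print(get_file_url("123:ABC", "AwACAg"))
print(download_url("123:ABC", "voice/file_0.oga"))
```

A worker would call the first URL, read `result.file_path` from the JSON response, and then fetch the second URL to obtain the OGG bytes.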
In business, transcribing meetings frees up the time and attention otherwise spent on taking notes. For example, if a manager spends 6 hours each week in meetings and one-third of what is discussed is lost, about 2 of those hours (roughly 33%) are effectively wasted. With transcription, that loss is avoided, and discussions are documented and can be analyzed later. Task creation in a customer relationship management (CRM) system can also be automated. Studies attributed to McKinsey report revenue increases of 12% to 18% from the use of such tools.
In education, transcripts give students access to recorded lectures and seminar discussions. According to the Ebbinghaus forgetting curve, students can forget up to 90% of new material if they do not review it. With transcripts, students can review individual topics as well as all the material covered during a given period, which can double learning efficiency.
For personal use, many people capture their ideas and thoughts by voice. How quickly can a dictated thought land in Google Docs or Trello? In less than 30 seconds. Beyond any cost savings, dictation saves time: a journalist can cut writing time by up to 40% by dictating first drafts rather than typing them.
Telegram is a developer-friendly platform with an open Bot API: anyone can create a bot through BotFather and start building. Unlike other messaging platforms (e.g., WhatsApp, Viber), where API access is limited in scope, Telegram makes access easy. Recorded audio files remain on Telegram's servers until the user deletes them, each has a unique file ID, and a bot can download them by calling the getFile method and then following the direct link built from its result.
Telegram does not charge for downloading audio files or cap traffic, but a bot can only download files of up to 20 MB. A bot can receive updates in two ways: by polling the server periodically, or by registering a webhook so that Telegram pushes updates automatically. We chose webhooks: whenever someone sends a voice note, the update is placed in our Redis queue and picked up by one of our workers, which sends the audio to Groq for transcription. The Telegram API also supports rich formatting in text messages, including Markdown, HTML, buttons, and inline keyboards. Once a voice note has been transcribed, the bot shows buttons directly in the chat to copy the transcription, send it by email, or save it to Google Docs.
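The webhook-to-queue step described above can be sketched as follows. The field names (`message`, `voice`, `file_id`, `file_size`, `duration`) follow the Telegram Bot API update format; the job layout and function names are hypothetical, and a standard-library `queue.Queue` stands in for the Redis queue so the example runs without external services.

```python
import json
from queue import Queue
from typing import Optional

MAX_DOWNLOAD_BYTES = 20 * 1024 * 1024  # the 20 MB bot download limit


def extract_voice_job(update: dict) -> Optional[dict]:
    """Turn a webhook update into a transcription job, or None if it has no voice note."""
    message = update.get("message") or {}
    voice = message.get("voice")
    if voice is None:
        return None
    return {
        "chat_id": message["chat"]["id"],
        "message_id": message["message_id"],
        "file_id": voice["file_id"],
        "duration": voice.get("duration"),
        # Flag oversized files so a worker can ask the user to compress or split.
        "too_large": voice.get("file_size", 0) > MAX_DOWNLOAD_BYTES,
    }


def handle_webhook(raw_body: bytes, jobs: Queue) -> bool:
    """Parse the webhook body and enqueue a job; returns True if something was queued."""
    job = extract_voice_job(json.loads(raw_body))
    if job is None:
        return False
    jobs.put(json.dumps(job))  # stand-in for pushing onto a Redis list
    return True


# Example update, trimmed to the fields used here.
sample = json.dumps({
    "update_id": 1,
    "message": {
        "message_id": 42,
        "chat": {"id": 1001},
        "voice": {"file_id": "AwACAg", "duration": 60, "file_size": 180000},
    },
}).encode()

jobs: Queue = Queue()
print(handle_webhook(sample, jobs), jobs.qsize())
```

Keeping the handler this small means the webhook can answer Telegram quickly while the slower download and transcription work happens in the workers.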
When you send a voice note, the bot captures the metadata needed for transcription, downloads the audio, converts the file if the transcription provider (Groq) requires a different format, and sends it to Groq with the chosen parameters (model, language, and return type). As soon as the transcription is complete, the bot returns the text to the chat as a reply.
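The parameters mentioned above (model, language, and return type) might be assembled as in the sketch below. It assumes Groq exposes an OpenAI-compatible transcription endpoint at the URL shown and accepts the model name `whisper-large-v3` and the form fields `model`, `language`, and `response_format`; check these against Groq's current API reference before relying on them. The function only builds a description of the request and does not send anything.

```python
from typing import Dict, Optional

# Assumed endpoint; verify against Groq's API documentation.
GROQ_TRANSCRIPTIONS_URL = "https://api.groq.com/openai/v1/audio/transcriptions"


def build_transcription_request(
    api_key: str,
    filename: str,
    model: str = "whisper-large-v3",   # assumed model identifier
    language: Optional[str] = None,    # e.g. "en"; omit to let the model detect it
    response_format: str = "json",     # the "return type" of the transcription
) -> Dict[str, object]:
    """Describe the multipart POST a worker would send for one audio file."""
    fields = {"model": model, "response_format": response_format}
    if language:
        fields["language"] = language
    return {
        "url": GROQ_TRANSCRIPTIONS_URL,
        "headers": {"Authorization": f"Bearer {api_key}"},
        "data": fields,          # ordinary form fields
        "file_field": "file",    # name of the multipart field carrying the audio
        "filename": filename,
    }


request = build_transcription_request("YOUR_API_KEY", "voice.wav", language="en")
print(request["data"])
```

With an HTTP client such as `requests`, the description maps onto `requests.post(url, headers=headers, data=data, files={"file": open(filename, "rb")})`.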
On average, it takes approximately six seconds to process one minute of audio: two seconds for download, three seconds for recognition, and one second to deliver the completed text back to the user. If the length of the audio file exceeds five minutes, a typing indicator will appear in the chat window to indicate that processing is occurring and that the user will receive the completed text within a short period of time. If an error occurs, the bot will automatically re-attempt the request. If an error occurs due to corrupted audio, the bot will ask the user to re-record the audio and send it again to the bot for transcription.
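The retry behaviour described above can be sketched as a small wrapper: transient failures are retried with a growing delay, while corrupted audio is reported immediately so the bot can ask the user to re-record. The number of attempts, the delays, and the `CorruptedAudioError` type are assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CorruptedAudioError(Exception):
    """Hypothetical error for audio that cannot be decoded; retrying will not help."""


def with_retries(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except CorruptedAudioError:
            raise  # caller should ask the user to re-record instead of retrying
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("attempts must be at least 1")


# Example: a task that fails twice before succeeding.
state = {"calls": 0}


def flaky_transcription() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary network problem")
    return "transcribed text"


print(with_retries(flaky_transcription, sleep=lambda s: None))
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real waiting.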
Telegram provides a unique opportunity for users to communicate with each other using voice messages. The voice messages can be sent using any mobile device or computer (cross-platform capable). Telegram does not impose strict limitations on the duration of voice messages. Voice messages sent via Telegram can be as long as two hours in length, while many competing messaging platforms limit voice messages to just a few minutes.
Telegram also offers strong privacy options. Voice messages sent in a Secret Chat are end-to-end encrypted; voice messages in regular chats are not. Regular voice messages remain in Telegram's cloud until the user deletes them.
Telegram does not charge for use of its public Bot API or for delivering voice messages. The main constraint is rate limiting: up to 5,000 requests per hour from a single bot.
Groq accelerates Whisper in hardware, running the neural network's operations on its own LPU chip. A GPU is essentially a collection of many processors that must coordinate with one another to execute tasks in parallel; the LPU instead moves data through a fixed sequence of steps (like a hockey puck gliding across the ice), which removes the need for coordination and makes execution extremely fast.
According to Groq's published specifications, the chip (originally called the Tensor Streaming Processor, TSP) has 230 MB of on-chip SRAM and delivers up to 188 trillion FP16 operations per second (188 TFLOPS), or up to 750 TOPS at INT8. The NVIDIA A100 reaches a higher FP16 figure (approximately 312 TFLOPS), but it is also priced substantially higher. Groq also offers a latency of 18 to 22 milliseconds, which makes the TSP well suited to real-time applications.
Groq has offered an official public API since February 2024. During the first six months of public access, the API processed approximately 2 billion tokens. Groq's text generation runs at 300 to 500 tokens per second, about 10 to 15 times faster than OpenAI's GPT-4.
Groq Whisper takes an average of 2.8 seconds to process 1 minute of audio, making it about 4x faster than a GPU and about 20x faster than a standard CPU.
Whisper is a transformer model with both an encoder and a decoder. It was trained on 680,000 hours of audio recordings in many different languages, including over 117,000 hours of non-English speech. It supports 97 languages, and the large-v3 model is the best tuned for complex noise and heavy accents.
The main features of the Whisper ASR model include:
For example, standard Whisper running on an i9 processor takes 45 to 60 seconds to process 1 minute of audio, and on a high-end NVIDIA RTX 4090 it takes 8 to 12 seconds. On the Groq LPU, the same audio takes only 2.8 seconds. For the user, this is practically instantaneous: results arrive within 6 to 10 seconds, including the time to download and transmit. If you process 100 one-minute recordings a day, that saves roughly 70 to 95 minutes compared with the CPU.
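The daily saving follows directly from the per-minute processing times quoted above. The short calculation below assumes 100 recordings of one minute each and compares only recognition time, not download or delivery.

```python
def daily_saving_minutes(recordings: int, baseline_s: float,
                         accelerated_s: float = 2.8) -> float:
    """Minutes saved per day when each recording takes accelerated_s instead of baseline_s."""
    return recordings * (baseline_s - accelerated_s) / 60


# Processing times per one-minute recording, taken from the text above.
for label, low, high in (("CPU (i9)", 45, 60), ("GPU (RTX 4090)", 8, 12)):
    saved_low = daily_saving_minutes(100, low)
    saved_high = daily_saving_minutes(100, high)
    print(f"vs {label}: {saved_low:.0f} to {saved_high:.0f} minutes saved per day")
```

The CPU comparison gives about 70 to 95 minutes a day, while the gap to a high-end GPU is closer to 9 to 15 minutes.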
Additionally, batch uploads of up to 10 files at once are permitted for processing through Groq Whisper. This allows for the quick processing of very large archives. For example, processing 200 recordings through standard methods may take approximately two hours to upload and process; whereas Groq Whisper processes the same 200 recordings in approximately 15 minutes.
The table below compares the speed, accuracy, language coverage, and price of each service against Groq Whisper.
| Service | Speed (60s audio) | Accuracy: Clean | Accuracy: Noisy | Languages | Price per hour |
|---|---|---|---|---|---|
| Groq Whisper | 2.8 seconds | 94% | 89% | 97 | — |
| OpenAI Whisper API | 12 seconds | 94% | 88% | 97 | $0.36 |
| Google Speech-to-Text | 8 seconds | 89% | 81% | 125 | $0.24 |
| Azure Cognitive Services | 10 seconds | 87% | 79% | 90 | $1.00 |
Google is affordable, but it requires OAuth and a complicated Google Cloud setup, which makes it awkward for one-off jobs. Azure is the most expensive and the least accurate service in this comparison. The OpenAI Whisper API is very accurate, but it is still more expensive and slower than Groq, which is both faster and more accessible.
Our bot captures voice recordings sent via Telegram and routes them to Groq Whisper for rapid delivery to you in written form without any hassle or registration process.
Record a voice message in Telegram, whether in a private chat, a group, or a channel, and simply forward it to the bot.
The bot reads the file identifier, downloads the OGG file through the Telegram API, converts it to WAV if required, and sends it to the Groq API.
The bot then receives the transcription from Groq and replies with the text in the same chat.
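The reply step can be sketched as the JSON body of a `sendMessage` call. The fields `chat_id`, `text`, `reply_parameters`, and `reply_markup` with an `inline_keyboard` are part of the Telegram Bot API; the button labels, the `callback_data` values, and the function name are hypothetical. Telegram limits a text message to 4096 characters, so this sketch simply truncates, whereas a real bot would more likely split long transcripts across several messages.

```python
import json
from typing import Dict

TELEGRAM_TEXT_LIMIT = 4096  # maximum length of a single text message


def build_reply(chat_id: int, reply_to: int, transcript: str) -> Dict[str, object]:
    """Build a sendMessage payload that replies to the voice note with export buttons."""
    keyboard = [[
        {"text": "Send to email", "callback_data": "export:email"},
        {"text": "Save to Google Docs", "callback_data": "export:gdocs"},
    ]]
    return {
        "chat_id": chat_id,
        "text": transcript[:TELEGRAM_TEXT_LIMIT],       # truncated for simplicity
        "reply_parameters": {"message_id": reply_to},   # reply to the original voice note
        "reply_markup": {"inline_keyboard": keyboard},
    }


payload = build_reply(1001, 42, "Hello, this is the transcribed text.")
print(json.dumps(payload, indent=2))
```

The payload would be POSTed as JSON to the bot's `sendMessage` endpoint; button presses then arrive as callback queries carrying the chosen `callback_data`.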
Your identity is safe with us. We do not keep your files for longer than five minutes, and we never ask for personally identifying information. If your audio file is larger than 20 MB, the bot will ask you to compress or split the recording. If you are a business and need a dedicated server and fast access to Telegram CDNs, we can provide this.
Your transcription can be ready:
Keep in mind that microphone quality also affects recognition accuracy: built-in phone microphones yield 92-94% accuracy, headsets 96-98%. Other factors include accent (native: 95%; heavy: 88-90%), ambient noise (indoors: about 80%; outdoors: about 89%; on a subway line: about 82%), and speech rate (the faster you speak, the lower the accuracy).
The formats that will be provided are as follows:
Steps you will need to take:
Fast and accurate: Groq's primary focus is speed. A one-minute audio file is recognized in approximately 2.8 seconds, about 21 times faster than real time. Transcribing the same file manually takes 3-4 minutes or more, depending on your typing speed, so Groq is roughly 65 to 85 times faster than typing.
Accuracy is about 94% for most major languages when there is no background noise. Common sources of error include proper names (both first and last names), slang and regional dialect, and specialized technical terminology.
To improve our transcripts, we added a database of 5,000 terms from the crypto, marketing, and IT sectors, which raised accuracy to 96%.
The following procedures are put in place to ensure your data is handled securely:
Companies can use our on-premises solutions which provide complete data isolation between the company and Groq, and also require execution of NDAs.
All audio files are automatically deleted 5 minutes after processing. The history of audio files and transcripts is kept solely by Telegram. We and our partners are committed to full compliance with GDPR.
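A 5-minute retention rule like the one above could be enforced with a periodic cleanup pass. The sketch below keeps a mapping from file name to the time processing finished and removes every entry older than the retention window; the in-memory dictionary and the function name are assumptions for illustration, and a real deployment would also delete the underlying files.

```python
import time
from typing import Dict, List, Optional

RETENTION_SECONDS = 5 * 60  # the 5-minute deletion policy


def purge_expired(processed_at: Dict[str, float],
                  now: Optional[float] = None,
                  retention: float = RETENTION_SECONDS) -> List[str]:
    """Remove entries whose processing finished at least `retention` seconds ago.

    Returns the names that were removed so the caller can delete the files too.
    """
    current = time.time() if now is None else now
    expired = [name for name, finished in processed_at.items()
               if current - finished >= retention]
    for name in expired:
        del processed_at[name]
    return expired


# Example with fixed timestamps (seconds) instead of the wall clock.
store = {"old.oga": 0.0, "recent.oga": 200.0}
print(purge_expired(store, now=300.0), store)
```

Running the pass every few seconds keeps the actual lifetime of a file close to the stated five minutes.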
