

Voice assistants are evolving at a staggering pace. Companies that previously spent months searching for and training operators are now deploying bots over a single weekend. Such solutions save thousands of hours of routine work. We’ve been through this ourselves — from indexing blockchain nodes and automating crypto-arbitrage to building an entire ecosystem of products that operate almost without programmers. Now, I’m ready to share how you can build your own voice assistant from scratch: which technologies to choose, how to avoid going broke, and how to truly save time.
“We spent two years indexing nodes and developing sentiment models to provide answers based on real-time blockchain data rather than simple web searches. GPT and other LLM models do not work directly with the blockchain, which makes our approach unique in crypto-analytics.”
A voice assistant is a program that listens to you: it recognizes your speech through ASR, understands the request via NLP, and responds with a voice via TTS. It performs functions ranging from simple reminders to complex analytics.
A voice bot is a specialized version. It is tailored to a specific task: taking orders, technical support, or booking. It operates based on scripts or a combination of scripts and NLP.
The difference is simple. An assistant acting as a bank employee is capable, surprisingly enough, of answering a wide variety of requests: checking balances, transferring money, or providing loan consultations. A bot for the same bank might only accept loan applications. And that’s it.

At ASCN.AI, our voice agents automate news parsing, token analysis, and report generation. A client asks a question via voice on Telegram — the agent extracts data from on-chain nodes, news aggregators, and social media, providing a structured response within 10 seconds. This saves an analyst up to 40 man-hours of manual labor per month.
NLP (Natural Language Processing) — the algorithms that let the system understand human language. GPT-4, for example, is built on transformers: its attention mechanism analyzes all the words in a query simultaneously rather than sequentially, which is why it can handle multi-step queries and generate relevant answers.
ASR (Automatic Speech Recognition) — converting sound to text. Popular systems include OpenAI Whisper, Google Cloud Speech-to-Text, and Yandex SpeechKit. Whisper was trained on 680,000 hours of audio in 97 languages; its recognition accuracy exceeds 90% even in noisy environments (OpenAI, 2022).
TTS (Text-To-Speech) — converting text to speech. The latest models (ElevenLabs, Google Cloud TTS, Amazon Polly) reproduce human-like voices with proper intonation and pauses. One of the current trends — voice cloning from a 10-minute recording — allows for the creation of personalized assistants.
Voice Automation combines all components: ASR listens, NLP makes decisions, and TTS responds — all without operator involvement. In a call center, for example, such a bot takes on the customer's problem, finds the answer using GPT, speaks it back to the customer, and saves the data in a CRM.
| Technology | Task | Example Systems | Accuracy / Quality |
|---|---|---|---|
| ASR | Speech Recognition | Whisper, Google STT, Yandex SpeechKit | 90–95% accuracy |
| NLP | Text Understanding | GPT-4, Claude, LLaMA | Context up to 128k tokens |
| TTS | Speech Synthesis | ElevenLabs, Google TTS, Amazon Polly | Naturalness 4.5/5 |
| Voice Automation | Full Processing Cycle | ASCN.AI, n8n + GPT | Reduction in time up to 70% |
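The full cycle in the last row of the table can be sketched as a simple composition of three stages. The stub functions below (my illustration, not production code) stand in for real ASR, NLP, and TTS calls — the point is the shape of the pipeline, not the implementations:

```python
def asr(audio_bytes: bytes) -> str:
    """Stub ASR: pretend we transcribed the audio."""
    return "why did the token price increase"

def nlp(user_text: str) -> str:
    """Stub NLP: pretend GPT generated an answer."""
    return f"Answering: {user_text}"

def tts(text: str) -> bytes:
    """Stub TTS: pretend we synthesized speech."""
    return text.encode("utf-8")

def voice_pipeline(audio_bytes: bytes) -> bytes:
    # ASR listens -> NLP decides -> TTS responds
    return tts(nlp(asr(audio_bytes)))
```

Swap each stub for a real API call and you have the exact architecture described in the rest of this article.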
As a technological foundation, ASCN.AI utilizes several tools: Whisper for recognition, GPT-4 for generating responses, and ElevenLabs for voicing the results. A Telegram user asks: “Why did the XYZ token price increase?” — the agent gathers information, pulls data from on-chain nodes, scans Telegram channels, and forms a voice response based on a set of queries in just 30 seconds. This replaces 20–30 minutes of manual searching.
A voice assistant consists of three modules — ASR, NLP, and TTS — each performing its task in sequence.
The entire process typically takes between 5 and 15 seconds, depending on API load and query complexity.
The workflow in practice: a query like “Give me the sentiment for Bitcoin over the last 24 hours” is processed in 10 seconds instead of the 30–40 minutes of an analyst's manual work.
GPT (Generative Pre-trained Transformer) is a language model trained on billions of texts. It is capable of creating coherent text while understanding the context of the query.
In voice bots, GPT performs several functions: it interprets the recognized text, keeps the dialogue context across messages, and generates the reply that TTS will then voice. A follow-up like “And over the last week?”, for example, is resolved against the earlier messages in the session.
Modern TTS systems based on neural networks (WaveNet, Tacotron, VITS) produce speech that is almost indistinguishable from a human. Text undergoes phonemic breakdown, is analyzed for intonation parameters, and the sound wave is synthesized.
| Platform | Quality | Synthesis Speed | Languages | Cost | Voice Cloning |
|---|---|---|---|---|---|
| ElevenLabs | 4.7/5 | 1-2 sec per sentence | 29 languages | $5–99/mo | Yes, via 10 min recording |
| Google Cloud TTS | 4.3/5 | 1–3 sec | 40+ | $4 per 1M characters | No |
| Amazon Polly | 4.0/5 | 2–4 sec | 30+ | $4 per 1M characters | No |
| Yandex SpeechKit | 4.2/5 (Russian) | 1–2 sec | 3 | ₽80 per hour of audio | No |
The choice depends on the project. For a project in Russian, Yandex SpeechKit or ElevenLabs is suitable. For multilingual tasks, Google Cloud TTS or ElevenLabs is preferred. If voice personalization is required, ElevenLabs with its cloning feature is the only option. At ASCN.AI, we use ElevenLabs to voice analytical reports with natural intonation.
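The selection logic in the previous paragraph can be written down as a tiny, deliberately simplified decision rule (my illustration only — a real project would also weigh latency, pricing tiers, and data-residency requirements):

```python
def pick_tts_provider(language: str, needs_cloning: bool) -> str:
    """Rough decision rule from the comparison table above."""
    if needs_cloning:
        return "ElevenLabs"          # the only compared option with voice cloning
    if language == "ru":
        return "Yandex SpeechKit"    # strongest Russian-language quality
    return "Google Cloud TTS"        # solid multilingual default
```

For instance, `pick_tts_provider("ru", False)` returns `"Yandex SpeechKit"`, while any project that needs a cloned voice lands on ElevenLabs regardless of language.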
Before creating an assistant, you must address three tasks: speech recognition (ASR), response generation (NLP), and speech synthesis (TTS). You can use separate tools or ready-made stacks.
The basic options: either wire up a separate API for each step (for example, Whisper for ASR, GPT-4 for NLP, ElevenLabs for TTS), or take a single-vendor stack such as the OpenAI ecosystem, which covers all three tasks through one SDK.
In the case of ASCN.AI, the average query processing time is 10 seconds, with a cost of approximately $0.05–$0.08 per question.
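As a sanity check on that figure, here is a back-of-the-envelope cost model. The per-unit prices are illustrative (roughly the 2024 list prices for Whisper, GPT-4 Turbo, and OpenAI TTS) and will drift over time, so treat the numbers as an estimation method, not a quote:

```python
WHISPER_PER_MIN = 0.006      # $ per minute of audio (illustrative)
GPT4T_IN_PER_1K = 0.01       # $ per 1K input tokens (illustrative)
GPT4T_OUT_PER_1K = 0.03      # $ per 1K output tokens (illustrative)
TTS_PER_1K_CHARS = 0.015     # $ per 1K synthesized characters (illustrative)

def query_cost(audio_min: float, tokens_in: int,
               tokens_out: int, reply_chars: int) -> float:
    """Estimate the API cost of one voice query, in dollars."""
    return round(
        audio_min * WHISPER_PER_MIN
        + tokens_in / 1000 * GPT4T_IN_PER_1K
        + tokens_out / 1000 * GPT4T_OUT_PER_1K
        + reply_chars / 1000 * TTS_PER_1K_CHARS,
        3,
    )

# A typical voice query: 1 minute of audio, ~500 prompt tokens,
# ~300 completion tokens, ~1200 characters voiced back
print(query_cost(1.0, 500, 300, 1200))  # → 0.038
```

A typical query lands in the few-cent range, which is consistent with the $0.05–$0.08 figure once retries and longer prompts are factored in.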
The process can be structured as a pipeline: Audio → ASR → Text → GPT → Response → TTS → Audio → User.
A simple Python example using the OpenAI Python SDK (version 1.x — older `openai.Audio.*` calls no longer work):

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

# Speech recognition (Whisper)
with open("user_voice.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
user_text = transcript.text

# Response generation (GPT)
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "You are a voice assistant."},
        {"role": "user", "content": user_text},
    ],
)
assistant_text = response.choices[0].message.content

# Speech synthesis (OpenAI TTS)
tts_response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=assistant_text,
)
tts_response.write_to_file("assistant_voice.mp3")
```
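Since the article mentions request limits and delays under high API load, it is worth wrapping each external call in a retry. Below is a generic exponential-backoff helper — my own sketch, not part of any SDK:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure wait base_delay, 2x, 4x, ... then retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Usage with any step of the pipeline, e.g.:
# transcript = with_retries(lambda: client.audio.transcriptions.create(
#     model="whisper-1", file=audio_file))
```

In production you would catch only the SDK's retryable exceptions (timeouts, rate limits) rather than bare `Exception`, but the structure stays the same.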
For those not interested in programming, there are no-code services like ASCN.AI or n8n, where a ready-made workflow of blocks (Telegram, Whisper ASR, GPT-4, ElevenLabs TTS) can be assembled in 20–30 minutes without a single line of code. If you do want to code it yourself, all you need is a small Telegram bot that taps into GPT. First, create the bot via @BotFather, then install the libraries:
pip install python-telegram-bot openai elevenlabs
My minimal code (written against python-telegram-bot v20+, openai v1+, and the current elevenlabs SDK — adjust the calls if you pin older versions):

```python
from openai import OpenAI
from elevenlabs.client import ElevenLabs
from telegram import Update
from telegram.ext import Application, ContextTypes, MessageHandler, filters

openai_client = OpenAI(api_key="YOUR_OPENAI_KEY")
eleven_client = ElevenLabs(api_key="YOUR_ELEVENLABS_KEY")
TELEGRAM_TOKEN = "YOUR_TELEGRAM_TOKEN"

async def handle_voice(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # Download the incoming voice message
    voice_file = await update.message.voice.get_file()
    await voice_file.download_to_drive("user_voice.ogg")

    # Speech recognition (Whisper)
    with open("user_voice.ogg", "rb") as audio:
        transcript = openai_client.audio.transcriptions.create(
            model="whisper-1", file=audio
        )

    # Response generation (GPT)
    response = openai_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are a crypto market analyst. Respond briefly."},
            {"role": "user", "content": transcript.text},
        ],
    )
    assistant_text = response.choices[0].message.content

    # Speech synthesis (ElevenLabs); use the ID of any voice
    # from your ElevenLabs voice library
    audio_stream = eleven_client.text_to_speech.convert(
        voice_id="YOUR_VOICE_ID",
        text=assistant_text,
        model_id="eleven_multilingual_v2",
    )
    with open("assistant_voice.mp3", "wb") as f:
        for chunk in audio_stream:
            f.write(chunk)

    with open("assistant_voice.mp3", "rb") as voice_out:
        await update.message.reply_voice(voice=voice_out)

app = Application.builder().token(TELEGRAM_TOKEN).build()
app.add_handler(MessageHandler(filters.VOICE, handle_voice))
app.run_polling()
```
Each request is processed in 10–15 seconds — instead of searching for hours, traders simply ask and get an answer.
Voice automation is already transforming routine tasks — automating up to 70% of repetitive duties.
An IT company received up to five hundred calls per day, 60% of which were repetitive questions. A voice bot based on GPT, Whisper, and ElevenLabs reduced the average handling time from 5 minutes to 1 minute, lowered operator load by 65%, and saved 120 hours per month.
Traders spend 30–60 minutes a day analyzing news and on-chain data. Our voice agent in Telegram collects and voices summaries for tokens in 30 seconds, saving up to 40 hours a month on analytics. According to averaged client reports after implementation, profitability increased by 15–20%.
More about the ASCN.AI case during the Falcon Finance crash
A restaurant chain had about 200 calls per day. The application of a voice bot implemented using Yandex SpeechKit and GPT-4 required no human intervention in 90% of calls, reduced order loss from 15% to 2%, and increased revenue by 8%.
Investments pay off in 2–6 months — particularly noticeable in customer support, trade, analytics, and HR.
At ASCN.AI, a session can consist of up to 20 messages — maintaining a coherent context in the dialogue.
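Keeping a dialogue coherent mostly comes down to carrying recent messages into each request. A minimal sliding-window session (capped at 20 messages, matching the figure above — this is my own sketch of the idea, not ASCN.AI's implementation) might look like:

```python
from collections import deque

MAX_MESSAGES = 20  # session cap from the text above

class Session:
    def __init__(self, system_prompt: str):
        self.system = {"role": "system", "content": system_prompt}
        self.history = deque(maxlen=MAX_MESSAGES)  # oldest messages drop off

    def add(self, role: str, content: str) -> None:
        self.history.append({"role": role, "content": content})

    def messages(self) -> list:
        # System prompt always first, then the recent window
        return [self.system, *self.history]

s = Session("You are a crypto market analyst.")
s.add("user", "Why did the XYZ token price increase?")
s.add("assistant", "Listings and whale activity drove it up.")
print(len(s.messages()))  # → 3
```

Pass `s.messages()` as the `messages` argument of each chat-completion call and the model sees the whole recent conversation; the `deque` guarantees the context never grows past the cap.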
At ASCN.AI, the cost of a request is about $0.06; processing takes 10 seconds. Prompt optimization allowed us to save a quarter of the budget without sacrificing quality.
It takes only 2–4 hours to develop a basic prototype using the OpenAI API. A production version with integration, logging, and security takes 1–2 weeks. On no-code platforms (ASCN.AI, n8n), you can launch in 1–2 days without any programming.
For no-code solutions — no. For complex scenarios and custom integrations, basic knowledge of Python is useful. However, no-code builders allow you to assemble a powerful assistant visually, without code.
ASR loses accuracy in noise — down to 70–80%, whereas in quiet environments it reaches 95%. GPT sometimes generates inaccurate responses — so-called hallucinations. TTS does not always perfectly convey complex intonation. There are also request limits and potential delays during high API load.
Use HTTPS, delete audio after processing, anonymize logs, keep keys secure, comply with personal data laws, and use on-premise systems for highly sensitive information.
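Two of those rules — deleting audio after processing and anonymizing logs — are easy to enforce in code. An illustrative sketch (the phone-number regex is deliberately crude; real PII scrubbing needs a proper library):

```python
import os
import re

# Masks phone-number-like digit runs; illustrative, not exhaustive
PHONE_RE = re.compile(r"\+?\d[\d\-\s]{7,}\d")

def anonymize(log_line: str) -> str:
    """Mask phone-number-like sequences before a line reaches the logs."""
    return PHONE_RE.sub("[redacted]", log_line)

def process_audio(path: str) -> None:
    try:
        pass  # ... run ASR -> GPT -> TTS here ...
    finally:
        if os.path.exists(path):
            os.remove(path)  # delete the audio as soon as processing ends

print(anonymize("callback for +1 415 555 0199 tomorrow"))  # → callback for [redacted] tomorrow
```

The `try/finally` guarantees the recording is removed even if a pipeline step throws, which is exactly the "delete audio after processing" rule above.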
The information in this article is for general informational purposes and does not replace investment, legal, or security advice. The use of AI assistants requires a conscious approach and an understanding of the functions of specific platforms.