VocaText - Free Audio to Text Transcription Online

How does it work?

Upload your audio file

Drag and drop or click to select an audio file from your computer. We accept MP3, WAV, M4A, OGG and FLAC formats up to 25 MB. Pro ✦ users can upload files of up to 200 MB and process up to 10 files in parallel.

Start the transcription

Click "Transcribe" and our AI engine analyses your audio to extract text with remarkable accuracy.

Get your text

Copy the result to your clipboard or download it as a .txt file. It's that simple.

Why convert audio to text?

Save time - Manually transcribing a one-hour recording typically takes three to five hours of focused work. An automated transcription delivers a comparable draft in minutes, freeing you to focus on what matters: analysing, editing and sharing.

Make your content searchable - A text version of your audio is indexable by search engines and full-text search tools. Find a quote, a name or a specific topic in seconds, even across hundreds of recordings.

Improve accessibility - A written transcription opens your content to deaf and hard-of-hearing audiences, to non-native speakers and to anyone who simply prefers reading over listening.

Repurpose your content - Turn a podcast episode into a blog post, a meeting into minutes, an interview into an article. One audio file can feed many text-based deliverables.

Who is VocaText for?

Audio-to-text transcription is useful far beyond a single profession. Here are the people and teams who get the most value from VocaText.

Students - Lecture recordings, group discussions, oral revision sessions. Searchable lecture notes turn a one-hour class into a study aid you can scan in minutes.

Journalists and reporters - Interviews, press conferences, off-the-record briefings, field recordings. A clean transcript makes it ten times faster to pull a quote, fact-check a claim or write the article itself.

Doctors and healthcare professionals - Dictated patient notes, consultation summaries, medical reports. Save hours of typing and let yourself focus on the patient instead of the keyboard.

Lawyers and legal professionals - Client meetings, deposition recordings, hearing notes. A written record is essential, and starting from a near-complete draft beats starting from a blank page.

Podcasters and content creators - Episode transcripts boost discoverability on Google and feed show notes, blog posts, social-media quotes and SEO-friendly archives.

Business teams - Meeting minutes, brainstorming sessions, customer calls. A shared transcript turns ephemeral conversations into a knowledge base the whole team can search.

Researchers - Qualitative interviews, focus groups, ethnographic fieldwork. Transcription is often the bottleneck of academic research; automation removes it.

Writers and authors - Voice notes, dictated drafts, ideas captured on the move. Speak when inspiration hits, edit the text later at your desk.

Manual vs automated transcription

Should you transcribe by hand or let an AI do it? The trade-offs become clearer once you put numbers on them. The table below compares the two approaches for a typical one-hour recording.

Aspect	Manual transcription	Automated transcription (VocaText)
Time for one hour of audio	3 to 5 hours	1 to 5 minutes
Cost	60 to 150 EUR (professional rate)	Free
Accuracy on clean audio	95 to 99 percent	92 to 97 percent
Effort required	Sustained focus, hours of typing	Drop the file, click Transcribe
Privacy	Audio shared with a human transcriber	File processed and deleted immediately
Best suited for	Court evidence, legally binding records	Drafts, content production, internal notes, research

Most people just run an automated transcription first and clean up the handful of errors that actually matter. You get most of the manual-level quality in minutes instead of hours - and without paying a cent.

Do a lot more with VocaText Pro ✦ - $7.99/month, cancel in one click.

How does audio transcription actually work?

Behind a single click, several stages run in sequence to turn your audio recording into readable text.

1. Audio decoding - The uploaded file is decoded from its format (MP3, WAV, M4A, OGG, FLAC) into a raw audio signal that the recognition engine can process.

2. Voice activity detection - The signal is scanned to separate speech from silence, music and background noise. Only the segments that actually contain a voice are sent to the next stage.

3. Language detection - A short sample is analysed to identify the spoken language automatically, so you do not need to declare it manually.

4. Speech recognition - A deep-learning model converts the acoustic features into phonemes, then into words, weighing each candidate against a language model that knows which sequences of words are plausible.

5. Text formatting - Punctuation, capitalisation and basic paragraphing are applied so the final transcription reads as natural prose rather than a continuous block of words.
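
To make stage 2 more concrete, here is a minimal sketch of one classic way to detect voice activity: measure the energy of short frames and keep the ones that rise above a threshold. It is an illustration only, not VocaText's actual engine; the function name, the 30 ms frame length and the 0.02 threshold are assumptions chosen for the example, and production systems generally rely on trained models rather than a fixed threshold.

```python
from typing import List, Optional, Tuple


def detect_speech(samples: List[float], sample_rate: int,
                  frame_ms: int = 30,
                  threshold: float = 0.02) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans whose RMS energy reaches `threshold`.

    `samples` is mono audio scaled to the range -1.0 .. 1.0.
    The frame length and threshold are illustrative assumptions.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    spans: List[Tuple[float, float]] = []
    start: Optional[int] = None  # sample index where the current span began

    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms >= threshold and start is None:
            start = i                                   # a loud span begins
        elif rms < threshold and start is not None:
            spans.append((start / sample_rate, i / sample_rate))
            start = None                                # the span ends

    if start is not None:                               # audio ended mid-span
        spans.append((start / sample_rate, len(samples) / sample_rate))
    return spans
```

An energy threshold cannot tell a voice from music or a slamming door, which is one reason real pipelines use more sophisticated detectors for this stage.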

Transcription vs transcoding: don't mix them up

These two words sound alike and are often confused, yet they describe very different operations.

Transcription turns spoken words into written text. The input is audio (a voice, an interview, a meeting) and the output is a text file. This is exactly what VocaText does.

Transcoding turns one audio format into another - for example converting a WAV file to MP3, or an M4A to OGG. The input is audio and the output is also audio, just encoded differently. No text is produced.

In short: transcription changes the medium (from sound to text), while transcoding only changes the wrapper (from one audio format to another).

If you have searched for "audio script", "voice script", "script de voix" or "retranscription", that is exactly what transcription means - different vocabulary, same operation.

Supported formats

VocaText supports all common audio formats. Whether you have a meeting recording, a voice memo or a podcast episode, we can transcribe it.

🎵 MP3 - MPEG Audio Layer 3

🔊 WAV - Waveform Audio

🎧 M4A - MPEG-4 Audio

📀 OGG - Ogg Vorbis

💿 FLAC - Free Lossless Audio

A brief history of the supported audio formats

Each of the formats VocaText accepts comes from a specific era and was designed to solve a particular problem. Knowing where they come from helps you pick the right one for your recordings.

WAV (1991) - Co-developed by Microsoft and IBM, WAV is one of the oldest digital audio formats still in everyday use. It stores audio as raw, uncompressed data, which makes files large but preserves the original signal exactly. It remains a reference for studio recording and archival.

MP3 (1993) - Developed largely at the Fraunhofer Institute and standardised by the Moving Picture Experts Group (MPEG), MP3 popularised lossy audio compression. By discarding sound information that the human ear barely perceives, it shrinks files by an order of magnitude. MP3 became the de-facto format for podcasts and voice memos.

OGG Vorbis (1994–2000) - Developed by the Xiph.Org Foundation as a free, open and patent-unencumbered alternative to MP3. Vorbis audio, stored in the Ogg container, often delivers better quality than MP3 at the same bitrate and is widely used in open-source software and games.

FLAC (2001) - Free Lossless Audio Codec. Like WAV, it preserves the audio signal exactly, but it compresses the data by 30 to 60 percent. FLAC is the favourite of audiophiles and anyone who wants smaller archives without losing a single sample.

M4A (early 2000s) - A container based on the MPEG-4 standard, popularised by Apple with iTunes and the iPhone. It usually carries audio encoded with AAC, a lossy codec that improves on MP3 in efficiency and quality. Voice memos recorded on iOS devices are typically M4A files.

Supported languages

VocaText automatically detects the spoken language from the first seconds of your recording - you do not need to declare it. Accuracy varies depending on how well-represented the language is in the underlying speech-recognition model.

Highest accuracy

English, French, Spanish, German, Italian, Portuguese, Dutch. These languages benefit from the largest amount of training data and reach the most reliable transcription quality.

Supported with possible variations

Dozens of additional languages are recognised, including regional and minority languages. Accuracy on rarer languages depends heavily on the clarity of the recording and the speaker's accent.

Mixing two languages in the same file (code-switching) usually produces partial results - the engine commits to a single language per segment. For best results on multilingual content, transcribe one language at a time.

How to record a high-quality audio file

The accuracy of any transcription depends, first and foremost, on the quality of the audio you feed it. Even the best AI cannot recover what the microphone never captured. Here is what you need and how to do it well.

What devices do you need?

A smartphone is enough for most needs. Modern phones embed surprisingly capable microphones and ship with a built-in voice memo app (Voice Memos on iOS, Recorder on Android). For a one-on-one interview or a personal note, that is all you need.

A computer with a USB or XLR microphone raises the quality noticeably. A USB condenser microphone (around 60 to 150 EUR) is plug-and-play and produces broadcast-grade voice recordings. Pair it with free software like Audacity or your operating system's built-in recorder.

A dedicated handheld recorder (Zoom, Tascam, Sony) is the right choice for field recording, group meetings or anything that needs to capture several voices in a room without being tethered to a computer.

Pro tips for cleaner recordings

Choose a quiet room - Close windows and doors, turn off fans, air conditioning and notifications. Background hum is the number-one cause of misrecognised words.

Mind reverberation - Empty rooms with hard walls echo. Soft surfaces (curtains, carpets, bookshelves, even a duvet draped near you while recording at home) absorb reflections and dramatically improve clarity.

Keep the microphone close - Aim for 15 to 20 centimetres from the speaker's mouth, slightly off-axis to avoid plosives (the popping sound on hard P and B). Closer means more voice and less room.

Speak clearly, not loudly - Articulation matters more than volume. Avoid eating, chewing gum or covering your mouth while talking.

Record one speaker at a time when you can - Overlapping voices are hard for any speech engine to separate. In a meeting, encourage participants to wait their turn before speaking.

Pick the right format - For voice, MP3 at 128 kbps or M4A at 96 kbps is more than enough and keeps files small. Reach for WAV or FLAC only when you plan to edit the audio afterwards.

Do a 10-second test - Record a short sample, listen back, adjust position and gain, then start the real recording. This single habit prevents most disappointments.

Best practices by recording scenario

The general recording tips above cover most situations, but each scenario has its own pitfalls. Here is a practical playbook for the most common contexts.

One-on-one interview

Sit close enough that a single microphone reaches both speakers - typically 30 to 60 centimetres apart, with the mic equidistant. Avoid talking over each other; long, alternating turns are far easier to transcribe than rapid back-and-forth.

Small meeting (3 to 6 people)

A single omnidirectional mic in the centre of a small table works for groups up to four. Beyond that, prefer a dedicated conference microphone or have each speaker use their phone as a personal recorder; you can then transcribe each track and merge the texts.

Conference room or large group

Distance from the microphone is the killer of accuracy. If you cannot mic each speaker individually, accept that the transcription will be partial and prioritise capturing the people who speak the most. Repeat key decisions out loud, close to the mic, so they make it cleanly into the record.

Field recording (outdoors)

Wind, traffic and reverberation are your enemies. Use a windscreen (foam or fluffy "dead cat") on the microphone, get as close as politely possible to the speaker and monitor through headphones to catch problems while you can still fix them.

Personal dictation

Speak in complete sentences with brief pauses, and do not say punctuation out loud - VocaText adds it automatically and "comma" or "full stop" would end up in the text. Keep the phone 15 to 20 centimetres from your mouth, slightly to the side.

Phone or video call

Record locally if you can; cloud-recorded calls are often heavily compressed and lose detail. On Zoom, Teams or Meet, enable "record to this computer" rather than the cloud option, then upload the resulting file. Recording laws vary by country and state, and many require consent from all parties, so obtain it before you record.

Limits of the tool

File size - Each upload is capped at 25 MB. For a typical voice MP3 at 128 kbps, this corresponds to roughly 25 minutes of audio. Re-encoding to a lower bitrate (64 or 96 kbps) is enough to fit longer recordings within the limit.
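
The arithmetic behind these figures is easy to check. The sketch below is a back-of-the-envelope estimate, not an official calculator: it assumes a constant bitrate, takes 1 MB to mean 1,000,000 bytes, and ignores metadata and container overhead. The function name is made up for the example.

```python
def max_minutes(size_limit_mb: float, bitrate_kbps: float) -> float:
    """Longest constant-bitrate recording, in minutes, that fits the limit.

    Assumes 1 MB = 1,000,000 bytes and ignores tags and container overhead.
    """
    bits_available = size_limit_mb * 1_000_000 * 8
    seconds = bits_available / (bitrate_kbps * 1000)
    return seconds / 60


if __name__ == "__main__":
    for kbps in (128, 96, 64):
        print(f"{kbps} kbps: about {max_minutes(25, kbps):.0f} minutes in 25 MB")
```

Under those assumptions, 25 MB holds about 26 minutes at 128 kbps, 35 minutes at 96 kbps and 52 minutes at 64 kbps.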

Recording duration - There is no hard duration limit, but beyond one hour of audio we cannot guarantee a flawless result. Long recordings tend to drift in volume, accumulate background events and stretch the recognition engine. For best accuracy on lengthy material, split it into segments of 30 to 60 minutes and transcribe them one by one.

Audio quality - Heavy background noise, very low volume, strong reverberation, overlapping speakers or thick accents on a noisy line can all reduce accuracy. The cleaner the input, the better the output - see the recording tips above.

Languages - VocaText auto-detects the language. Common languages (English, French, Spanish, German, Italian, Portuguese, Dutch) reach the highest accuracy. Rare languages or strong dialects may produce more errors.

Accuracy and benchmarks

No transcription engine is 100 percent accurate, and we believe in setting honest expectations. Accuracy in speech recognition is usually measured by the Word Error Rate (WER) - the proportion of words that are wrong, missing or inserted compared to a perfect human transcription. The figures below are realistic ranges based on real-world recordings.

Recording condition	Typical accuracy	Word Error Rate (WER)
Studio-quality, single native speaker	95 to 98 percent	2 to 5 percent
Quiet office, single speaker, close mic	92 to 96 percent	4 to 8 percent
Meeting room, multiple speakers	85 to 92 percent	8 to 15 percent
Phone-call quality audio	80 to 88 percent	12 to 20 percent
Strong accent or non-standard dialect	75 to 90 percent	10 to 25 percent
Heavy background noise or reverberation	60 to 80 percent	20 to 40 percent
Very low volume or overlapping speakers	Below 70 percent	Above 30 percent

These numbers are not promises - they are calibration. If your audio sits in the upper rows of the table, you can probably ship the transcription with light proofreading. If it sits in the lower rows, plan for a manual editing pass or, better, re-record under cleaner conditions.
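
If you want to measure the Word Error Rate of your own transcripts, the definition above translates into a word-level edit distance. The following is a minimal sketch: the function name is an assumption, and the text normalisation (lower-casing and splitting on whitespace) is deliberately crude, whereas published benchmarks usually normalise punctuation and numbers as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Compare a hand-corrected excerpt (the reference) with the raw automated output (the hypothesis): one wrong word in a ten-word reference gives a WER of 0.1, or 10 percent.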

Privacy & Security

Your audio files are processed securely and are not kept on our servers: processing happens in real time and each file is deleted immediately after transcription.

We do not share any data with third parties. Your content remains yours, always.

Why choose VocaText?

Free & no sign-up - No need to create an account. Drop your file and get your transcription instantly.

Cutting-edge AI - Our speech recognition engine uses the latest advances in artificial intelligence for maximum accuracy.

Multi-language - VocaText automatically detects the spoken language and adapts to give you the best possible result.

Frequently Asked Questions

Is VocaText really free?

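Yes, the standard version of VocaText is free: no sign-up and no hidden fees, and you can transcribe audio files of up to 25 MB as many times as you like. The optional Pro ✦ plan ($7.99/month) raises the limit to 200 MB per file and processes up to 10 files in parallel.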

How do I turn a voice message into text?

Save or export the voice message from your messaging app, then drop the file here to get the text. Voice notes from WhatsApp, Messenger, Telegram, Signal and Instagram, as well as SMS/MMS voice clips, are typically stored as .opus, .m4a or .mp3 files. M4A and MP3 files can be uploaded as they are; if an .opus file is rejected, convert it to one of the accepted formats (MP3, WAV, M4A, OGG, FLAC) first. On WhatsApp, open the voice message, tap the share icon and save it to your device or forward it to yourself; the steps are similar on the other apps. Once you have the audio file, upload it and copy your transcription in seconds.

What languages are supported?

VocaText automatically detects the spoken language and supports dozens of languages, including English, French, Spanish, German, Portuguese, Italian, Dutch, Japanese, Chinese and many more.

What is the maximum file size?

You can upload audio files up to 25 MB. For larger files, we recommend splitting them into smaller parts before transcribing, or using Pro ✦ for files up to 200 MB.

Are my audio files stored on your servers?

No. Your files are processed in real time and deleted immediately after transcription. We do not store any audio files or transcriptions on our servers.

How accurate is the transcription?

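VocaText uses a cutting-edge AI engine that typically reaches 92 to 98 percent accuracy on clean, close-miked recordings of a single speaker. The quality of the result depends on the clarity of the source audio: heavy background noise or reverberation can bring accuracy down to 60 to 80 percent.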

Why is my transcription full of errors?

Almost always a quality-of-input problem rather than a model problem. Listen to your file: is there hiss, traffic, an air-conditioner running? Are speakers far from the microphone? Is the volume so low you have to strain to hear? Re-record in a quieter setting, with the microphone closer, and the transcription will improve dramatically.

Why did my upload fail?

Three usual causes: the file exceeds 25 MB, the format is not in the accepted list (MP3, WAV, M4A, OGG, FLAC), or the network connection dropped mid-upload. Check the file size first, then verify the extension matches one of the supported formats. If your files are regularly above 25 MB, Pro ✦ unlocks files up to 200 MB.
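
The first two causes can be checked on your own machine before you upload. The sketch below is a hypothetical pre-flight check built only from the limits stated on this page; the function and constant names are assumptions, 1 MB is taken as 1,000,000 bytes, and the real server-side validation may differ.

```python
import os
from pathlib import Path
from typing import List

ACCEPTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg", ".flac"}
FREE_LIMIT_MB = 25    # standard upload cap stated on this page
PRO_LIMIT_MB = 200    # Pro upload cap stated on this page


def check_upload(filename: str, size_bytes: int, pro: bool = False) -> List[str]:
    """Return a list of problems; an empty list means the file looks fine."""
    problems: List[str] = []

    ext = Path(filename).suffix.lower()
    if ext not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")

    limit_mb = PRO_LIMIT_MB if pro else FREE_LIMIT_MB
    if size_bytes > limit_mb * 1_000_000:   # assumes 1 MB = 10**6 bytes
        size_mb = size_bytes / 1_000_000
        problems.append(f"file is {size_mb:.1f} MB, limit is {limit_mb} MB")

    return problems


def check_file(path: str, pro: bool = False) -> List[str]:
    """Convenience wrapper that reads the size of a file on disk."""
    return check_upload(path, os.path.getsize(path), pro=pro)
```

A check like this only looks at the extension and the size, so it cannot catch a file whose contents do not match its name.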

Why are multiple speakers merged together with no labels?

VocaText currently outputs a single continuous transcription without speaker diarisation (the technique that labels who said what). For multi-speaker recordings where attribution matters, add speaker names manually as you proofread, or split the audio into per-speaker tracks before uploading if your setup allows it - Pro ✦ batches up to 10 tracks at once.

Why was the wrong language detected?

Language detection looks at the first few seconds of audio. If the recording starts with a foreign-language greeting, a non-speech sound or a long silence, the engine may guess wrong. Trim the silence at the beginning of the file or start the recording with a clear sentence in the target language.

Why does the transcription stop before the end of my audio?

Either the file size hit the 25 MB cap and was truncated on upload, or there was an unrecoverable corruption mid-file. Re-export the audio at a slightly lower bitrate to bring it under the cap, open it in a tool like Audacity to confirm the audio data extends to the end, or upgrade to Pro ✦ for files up to 200 MB.

Why does my MP3 show as an unsupported format?

The .mp3 extension does not always reflect the actual encoding inside the file. Some apps export files labelled .mp3 that hold data in another encoding, or the file extension was simply renamed by hand. Re-encode the file as a true MP3 with a tool like Audacity and try again.
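
You can often tell what a file really contains by looking at its first few bytes rather than its name. Here is a minimal sketch based on the well-known signatures of the five supported formats. The function names are assumptions, and the check is a heuristic: it says nothing about how VocaText itself identifies files, and an MPEG frame-sync match can also come from related formats.

```python
def sniff_audio_format(header: bytes) -> str:
    """Guess the audio format from the first 12 bytes of a file."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"fLaC":
        return "flac"
    if header[:4] == b"OggS":
        return "ogg"
    if header[4:8] == b"ftyp":
        return "m4a/mp4"          # MPEG-4 family; M4A is one member
    if header[:3] == b"ID3":
        return "mp3"              # ID3v2 tag, most often found on MP3 files
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"              # MPEG audio frame sync (heuristic only)
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the start of a file on disk and guess its format."""
    with open(path, "rb") as handle:
        return sniff_audio_format(handle.read(12))
```

If the result disagrees with the file extension, re-encoding the file is usually the quickest fix.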

How can I improve accuracy on long recordings?

Split the audio into 30-to-60-minute segments and transcribe each one separately. Long files accumulate small drifts in audio quality and stretch the recognition engine's context window, so segmented transcription consistently yields cleaner results. Then assemble the texts in order - or use Pro ✦ to process up to 10 segments in parallel.
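
To plan the cuts, you can work out evenly sized segments in advance. The sketch below only computes the time boundaries; the actual cutting is done in an audio editor such as Audacity. The function name and the 45-minute default are assumptions, and in practice you would nudge each boundary to the nearest pause so that no word is split in two.

```python
import math
from typing import List, Tuple


def segment_bounds(total_s: float,
                   max_segment_s: float = 45 * 60) -> List[Tuple[float, float]]:
    """Split a recording into equal parts no longer than `max_segment_s`.

    Returns (start_s, end_s) pairs in seconds. Equal parts avoid ending
    with one very short leftover segment.
    """
    if total_s <= 0 or max_segment_s <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_s / max_segment_s)
    length = total_s / count
    return [(i * length, min((i + 1) * length, total_s)) for i in range(count)]
```

For example, a recording of two and a half hours (9,000 seconds) with a 60-minute maximum is split into three 50-minute segments.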

Glossary of audio and transcription terms

A short reference for the technical terms that appear elsewhere on this page.

ASR (Automatic Speech Recognition): The general field - and the technology - that converts spoken language into written text. VocaText uses a state-of-the-art ASR model under the hood.
Bitrate: The amount of data used per second of audio, usually expressed in kilobits per second (kbps). Higher bitrates mean better quality and larger files. For voice, 96 to 128 kbps is plenty; for music, 192 to 320 kbps.
Codec: Short for "coder-decoder" - the algorithm that compresses audio when recording and decompresses it when playing back. MP3, AAC, Vorbis and FLAC are all codecs; WAV is a file format that usually holds uncompressed (PCM) audio.
Container format: The wrapper file that holds compressed audio data along with metadata. M4A and OGG are containers; the actual audio inside them is encoded with codecs (AAC, Vorbis and so on).
Diarisation (speaker diarisation): The task of identifying who is speaking when in a multi-speaker recording, so the transcript can be labelled "Speaker A", "Speaker B" and so on. VocaText does not yet perform diarisation.
Language model: A statistical model that knows which sequences of words are plausible in a given language. The speech recogniser uses it to choose between similar-sounding word candidates ("their" vs "there").
Lossless compression: Compression that reduces file size without losing any audio data. The original signal can be perfectly reconstructed. FLAC is the most popular lossless format.
Lossy compression: Compression that achieves smaller files by discarding audio information the human ear barely perceives. The original cannot be perfectly recovered. MP3, AAC and Vorbis are all lossy.
Phoneme: The smallest unit of sound that distinguishes one word from another in a language - like the difference between "p" and "b" in "pat" and "bat". ASR models recognise phonemes before assembling them into words.
Sampling rate: How many times per second the audio signal is measured during recording, expressed in hertz (Hz) or kilohertz (kHz). 16 kHz is enough for voice; 44.1 or 48 kHz is the standard for music and video.
Transcript (also: script, retranscription): Three names for the same thing: the written text produced from an audio recording. "Transcript" is the standard technical term, "script" or "audio script" is the everyday word people often type into search engines, and "retranscription" is the formal French term. VocaText produces a transcript from any supported audio file.
Voice activity detection (VAD): An early step of most speech-recognition pipelines: scanning a recording to flag the segments that contain speech and ignore silence, music or noise.
Word Error Rate (WER): The standard metric for transcription accuracy: the number of incorrect, missing or inserted words divided by the total number of words in a perfect human reference. A WER of 5 percent means roughly 95 percent of words are correct.

About VocaText

The service was created to make accurate audio-to-text transcription accessible to everyone, without sign-up, subscription or hidden fees.

Audio files uploaded for transcription are processed in real time and deleted immediately afterwards - they are never stored long-term. Personal data is handled in compliance with the GDPR.

For any question, feedback or partnership request, please use the contact form.

Transform your voice into text
