How Professional Transcriptionists Actually Clean Transcripts

Every guide to cleaning transcripts describes what to fix. Almost none describe the order to fix it in, the tools to have open while doing it, or why the sequence matters as much as the work itself.

This is that guide.

What follows is the actual workflow experienced transcriptionists use when cleaning AI-generated transcripts. It is not the polished version for a training document. It is the operational reality: what to do first, what to skip until later, and what most transcriptionists who bill below the rate they should command are getting wrong.

Before You Start: The 10-Minute Setup That Saves 40 Minutes Later

The single most consistent mistake beginners make is opening the audio and transcript simultaneously and starting a combined listen-and-fix pass. This feels like efficiency. It produces worse accuracy and worse formatting, and it takes longer than doing the two separately.

Before touching the transcript, do this:

1. Read the job brief completely. Not skim it. Read it. The client's verbatim mode, speaker names, glossary terms, timestamp requirements, and formatting rules all live in the brief. Missing one requirement means re-doing work after a rejection.

2. Open the client's style guide side-by-side with the transcript. If you are working on a Rev job, have the Rev style guide open. If you have a client with specific rules, have those rules open. You will reference them during the formatting pass — having them accessible means not breaking your flow to search for them.

3. Build a quick glossary. Scan the job brief for names, company names, product names, technical terms. Search for any you do not recognize. Spend 5 minutes understanding how key terms should be spelled before you encounter them in the transcript. Finding out mid-listen-pass that a term you transcribed 12 times is spelled differently costs more time than the 5-minute upfront search.

4. Check the audio length against the transcript word count. A rough benchmark: one minute of clear audio produces approximately 130-150 words, so a 30-minute recording should yield roughly 3,900-4,500 words. If that recording has a 2,000-word transcript, something went wrong: either the audio has significant silence, the AI skipped segments, or the verbatim mode collapsed filler words incorrectly. Catch this before starting, not after spending 45 minutes on a document with missing content.
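The word-count check in step 4 can be sketched as a small calculation. This is a minimal sketch, assuming the 130-150 words-per-minute benchmark above; the function name and the 25% tolerance are illustrative assumptions, not part of any transcription tool.

```python
def word_count_check(audio_minutes, transcript_words,
                     wpm_low=130, wpm_high=150, tolerance=0.25):
    """Compare a transcript's word count with the range expected from audio length.

    wpm_low and wpm_high come from the rough benchmark for clear audio.
    tolerance is an assumed margin (25%) for pauses and unusually slow or fast speakers.
    """
    expected_low = audio_minutes * wpm_low
    expected_high = audio_minutes * wpm_high
    floor = expected_low * (1 - tolerance)
    ceiling = expected_high * (1 + tolerance)
    if transcript_words < floor:
        status = "too short"
    elif transcript_words > ceiling:
        status = "too long"
    else:
        status = "ok"
    return {"expected_low": expected_low,
            "expected_high": expected_high,
            "status": status}


# The example from step 4: a 30-minute recording with a 2,000-word transcript.
print(word_count_check(audio_minutes=30, transcript_words=2000))
```

A "too short" result does not say which of the three causes applies. It only tells you to look for gaps before investing time in the document.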

Pass 1: The Structural Pass (No Audio)

The first pass through the transcript does not involve audio. It is done entirely by reading.

What you are looking for:

Missing sections (gaps in timestamps that suggest skipped audio)
Speaker label inconsistencies (the same speaker labeled differently in different places)
Obvious formatting violations (timestamps in wrong format, tags in wrong format)
Paragraph break logic (new speaker = new paragraph, always; topic shift = new paragraph, usually)
Macro-level issues that would require restructuring the document

What you are NOT doing in this pass:

Correcting individual word errors
Listening to audio
Fixing punctuation
Verifying proper nouns

Why this order: Structural problems require restructuring — adding or removing paragraphs, updating speaker labels globally, reformatting sections. Doing this pass first means you are not correcting individual words in a section that you later restructure. Word-level corrections in a structurally correct document are faster and less likely to be re-done.

Time benchmark: 4-8 minutes per 30 minutes of transcript, done at reading pace.
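Two of the structural checks, timestamp gaps and speaker label inconsistencies, lend themselves to a quick script before the read-through. The sketch below assumes a hypothetical line format of "[HH:MM:SS] Speaker Name: text" and an arbitrary 90-second gap threshold; both would need adjusting to the actual transcript.

```python
import re
from collections import Counter

# Assumed (hypothetical) line format: "[HH:MM:SS] Speaker Name: text"
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s+([^:]+):\s*(.*)$")


def structural_scan(lines, max_gap_seconds=90):
    """Return (gaps, label_counts) for a list of transcript lines.

    gaps: (previous_seconds, current_seconds) pairs where consecutive
          timestamps are more than max_gap_seconds apart.
    label_counts: how often each speaker label appears, so rare variants stand out.
    """
    gaps = []
    labels = Counter()
    previous = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # lines without a timestamp are ignored in this sketch
        hours, minutes, seconds, speaker, _text = match.groups()
        current = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        if previous is not None and current - previous > max_gap_seconds:
            gaps.append((previous, current))
        previous = current
        labels[speaker.strip()] += 1
    return gaps, labels


sample = [
    "[00:00:05] Interviewer: Thanks for joining us.",
    "[00:00:12] Dr. Patel: Happy to be here.",
    "[00:04:40] interviewer: Let's talk about the rollout.",
    "[00:04:55] Dr. Patel: Sure.",
]
gaps, labels = structural_scan(sample)
print(gaps)    # [(12, 280)]
print(labels)
```

In the sample, the jump from 12 to 280 seconds is the kind of gap worth flagging for the listen pass, and the label count shows "Interviewer" and "interviewer" as separate entries that need normalizing.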

Pass 2: The Global Find-and-Replace Pass

Before the listen pass, do a mechanical cleanup using find-and-replace. This is work you can do in 3-5 minutes that catches errors a listen pass would take 20 minutes to catch one at a time.

Standard find-and-replace checks:

Speaker label format normalization (find every variation, replace with the canonical format)
Recurring word-level fixes for terms that appear frequently in the content type (for example, "gonna" in clean verbatim jobs should be "going to", so search for it)
Timestamp format normalization (if using a tool that auto-generates timestamps, verify they are all zero-padded to the correct format)
Tag format normalization ([Inaudible] → [inaudible], [INAUDIBLE] → [inaudible])
Double space removal
Trailing space removal

Why this order: Find-and-replace catches consistent errors that are invisible in a read-through because your brain normalizes them. A speaker label that is wrong in 12 places looks "right" when you have seen 40 correct versions of it. A find-and-replace does not normalize. It finds every instance.

Time benchmark: 3-5 minutes per job, regardless of length.
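For transcriptionists who work in plain text, the checks above can be sketched as a handful of regular-expression substitutions. This is an illustration, not a replacement for an editor's find-and-replace: the speaker-variant mapping is hypothetical and would be built per job from the structural pass.

```python
import re

# Hypothetical mapping of speaker-label variants to the canonical form for one job.
SPEAKER_VARIANTS = {
    "interviewer:": "Interviewer:",
    "INTERVIEWER:": "Interviewer:",
}


def mechanical_cleanup(text, clean_verbatim=True):
    """Apply the mechanical find-and-replace checks to a transcript string."""
    # Speaker labels: replace each known variant at the start of a line.
    for variant, canonical in SPEAKER_VARIANTS.items():
        text = re.sub(rf"(?m)^{re.escape(variant)}", canonical, text)
    # Tags: [Inaudible] and [INAUDIBLE] both become [inaudible].
    text = re.sub(r"\[inaudible\]", "[inaudible]", text, flags=re.IGNORECASE)
    if clean_verbatim:
        # Lowercase whole-word matches only; capitalised "Gonna" is left for manual review.
        text = re.sub(r"\bgonna\b", "going to", text)
    # Collapse runs of spaces, then strip trailing whitespace on every line.
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r"(?m)[ \t]+$", "", text)
    return text


raw = "interviewer: We're gonna  start now. \nGuest: [INAUDIBLE] okay.  "
print(mechanical_cleanup(raw))
```

Timestamp normalization is left out because the target format depends on the client's brief.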

Pass 3: The Proper Noun Pass (No Audio)

Before listening to any audio, do a targeted scan for proper nouns using the glossary you built in setup.

Search for the expected proper nouns. Verify they appear consistently and correctly throughout the document. Flag anywhere the proper noun appears in a phonetically similar but wrong form.

This works because AI substitution errors for proper nouns tend to be phonetically consistent. If "Kubernetes" was mis-transcribed as "cue bernetes" in minute 4, it was likely mis-transcribed similarly in minutes 12, 27, and 43. Finding all instances first and batch-correcting them is faster than catching each one individually during a listen pass.

For proper nouns that were mis-transcribed in ways you cannot identify from reading (because you are not sure what the correct version is), mark them as uncertain during this pass and resolve them during the listen pass.

Time benchmark: 5-10 minutes per job, depending on vocabulary density.
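One way to surface near-miss spellings without reading for them is to compare short runs of words against each glossary term. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. String similarity is only a rough stand-in for phonetic similarity, and the 0.8 threshold and two-word window are assumptions, so expect it to over-flag and treat the output as a review list.

```python
import re
from difflib import SequenceMatcher


def find_suspect_variants(text, glossary, threshold=0.8, max_words=2):
    """Flag word runs that resemble a glossary term but are not spelled like it.

    Each run of 1..max_words words is joined without spaces, lowercased,
    and compared with each glossary term.
    """
    words = re.findall(r"[A-Za-z']+", text)
    suspects = []
    for term in glossary:
        target = term.lower()
        for size in range(1, max_words + 1):
            for start in range(len(words) - size + 1):
                run = words[start:start + size]
                if any(word.lower() == target for word in run):
                    continue  # the run already contains the correct spelling
                candidate = "".join(run).lower()
                score = SequenceMatcher(None, candidate, target).ratio()
                if score >= threshold:
                    suspects.append((" ".join(run), term, round(score, 2)))
    return suspects


text = "We moved to cue bernetes last year. Kubernetes made deploys easier."
for found, term, score in find_suspect_variants(text, ["Kubernetes"]):
    print(f"{found!r} may be {term!r} (similarity {score})")
```

Overlapping runs can be reported more than once (here both "bernetes" and "cue bernetes"), which is acceptable for a list that a person reviews before batch-correcting.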

Pass 4: The Listen Pass (Targeted, Not Full)

This is the expensive pass. It is the one where time gets lost.

The mistake most transcriptionists make is treating the listen pass as a full document review — playing the audio from beginning to end while following along in the transcript and correcting everything encountered.

This approach is slow because it is linear. It means spending the same time on segments where the transcript is perfect as on segments where errors exist. It also means making decisions about formatting, word choice, and style while simultaneously tracking audio — cognitively expensive and error-prone.

A more efficient approach:

First, categorize what needs listening. During the structural and proper noun passes, you should have flagged segments that need audio verification. Go to those first. They are the high-value segments. Segments that read cleanly in the structural pass and contain no flagged proper nouns are likely accurate — they may not need audio verification at all.

Listen at 1.25x or 1.5x speed on clean segments. For segments with good audio quality and no flagged issues, you are verifying, not deciphering. Faster playback means more coverage in the same time.

Listen at normal speed on problem segments. Background noise, cross-talk, technical vocabulary, fast speech — these need full-speed review. Do not try to save time here. The errors in these segments cost the most when they reach clients.

Flag-as-you-go, fix-at-the-end. During the listen pass, flag uncertain words and errors. Do not stop audio playback to fix them immediately unless the fix is a one-keystroke correction. Stopping playback for a multi-word correction costs you the context of what was said before and after. Flag, continue, fix during a final edit sweep.

Time benchmark: 0.5x–0.8x the audio length for a targeted listen pass on a well-structured AI transcript. Full-length listen passes (1:1 or worse) indicate either very poor AI accuracy or an inefficient listen-pass strategy.
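The benchmark follows from the playback speeds. A minimal worked example, assuming a hypothetical 30-minute job in which 6 minutes were flagged as problem segments and everything else is verified at 1.5x:

```python
def listen_pass_minutes(audio_minutes, problem_minutes, clean_speed=1.5):
    """Estimate listening time: problem audio at 1x, the rest at clean_speed."""
    clean_minutes = audio_minutes - problem_minutes
    return problem_minutes + clean_minutes / clean_speed


# Hypothetical job: 30 minutes of audio, 6 of them flagged as problem segments.
total = listen_pass_minutes(audio_minutes=30, problem_minutes=6)
print(f"{total:.0f} min, {total / 30:.2f}x audio length")  # 22 min, 0.73x audio length
```

This counts playback time only and assumes every clean segment is still listened to. Skipping clean segments that need no audio verification is what pushes the figure toward the 0.5x end of the range.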

Pass 5: The Formatting and Style Pass

The listen pass is complete. Every content decision has been made. Now apply formatting.

What belongs in the formatting pass:

Number formatting (apply the style guide's rule, for example spell out one through ten and use digits for 11 and above, and check every number)
Punctuation review (particularly for full verbatim jobs — em dashes, ellipses, comma placement)
Style guide compliance check (compare each rule against the document systematically)
Final speaker label verification
Timestamp final check

What most transcriptionists do instead: They mix formatting decisions into the listen pass, which means they are making formatting decisions while tracking audio content. This produces worse outcomes on both. Formatting errors are more likely because the attention is divided. Accuracy errors are more likely because formatting decisions interrupt audio tracking.

The pass order is the workflow. Structure first. Content second. Formatting last.
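Part of the number-formatting check can be turned into a search. The sketch below assumes the example rule from the list above (spell out one through ten) and a hypothetical "[HH:MM:SS] Speaker N:" line format. It only finds digits that may need spelling out and leaves the decision to the transcriptionist.

```python
import re

# Standalone integers 1-10 that are not part of a timestamp, decimal,
# percentage, or a "Speaker 1:" style label.
SMALL_DIGIT = re.compile(r"(?<![\d:.,$])\b(10|[1-9])\b(?![:%]|[.,]?\d)")


def find_small_digits(lines):
    """Return (line_number, digit) pairs for digits the example rule says to spell out."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in SMALL_DIGIT.finditer(line):
            hits.append((number, match.group(1)))
    return hits


sample = [
    "[00:01:15] Speaker 1: We hired 3 people and 12 contractors in 2021.",
    "[00:01:22] Speaker 2: About 10.5 percent of them stayed.",
]
print(find_small_digits(sample))  # [(1, '3')]
```

Exceptions in a real style guide (addresses, measurements, numbers that open a sentence) are not handled here and still need a manual check.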

The Time Math

A proficient transcriptionist working this way should hit approximately:

Audio Length	Setup	Structural Pass	Find-Replace	Proper Noun Pass	Listen Pass	Formatting Pass	Total
30 min	8 min	5 min	3 min	5 min	20 min	8 min	~49 min
60 min	10 min	9 min	4 min	8 min	38 min	12 min	~81 min
90 min	12 min	13 min	5 min	10 min	55 min	16 min	~111 min

These numbers are for good-quality AI transcripts (95%+ accuracy) on clean-to-moderate audio. They are also the numbers for transcriptionists who have internalized this pass order. Transcriptionists doing a combined listen-and-fix pass on the same content typically spend 40-60% more time to produce comparable quality.
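The totals and the combined-pass comparison can be reproduced from the table. The per-pass minutes below are copied from the rows above; the 40% and 60% multipliers come from the estimate in the previous paragraph, and the function names are illustrative.

```python
# Minutes per pass, copied from the table above, in this order:
# setup, structural, find-replace, proper noun, listen, formatting
PASS_MINUTES = {
    30: [8, 5, 3, 5, 20, 8],
    60: [10, 9, 4, 8, 38, 12],
    90: [12, 13, 5, 10, 55, 16],
}


def totals(pass_minutes):
    """Sum the per-pass minutes for each audio length."""
    return {audio: sum(passes) for audio, passes in pass_minutes.items()}


def combined_pass_range(total, low=0.40, high=0.60):
    """Estimated minutes for a combined listen-and-fix workflow (40-60% more time)."""
    return round(total * (1 + low)), round(total * (1 + high))


for audio, total in totals(PASS_MINUTES).items():
    slow_low, slow_high = combined_pass_range(total)
    print(f"{audio} min audio: {total} min staged, "
          f"{slow_low}-{slow_high} min combined")
```

For the 30-minute row this gives 49 minutes staged against roughly 69-78 minutes for a combined pass.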

What Changes When You Have Formatting Rules Pre-Applied

The formatting pass is the one that varies most by workflow. Transcriptionists who are applying style rules from scratch on every job — reading the style guide, checking the document, updating manually — spend significantly more time on this pass than those who have the rules documented and structured before starting.

The difference in the formatting pass when style rules are pre-structured: you are verifying compliance, not applying rules. Verification is faster than application because you know what you are looking for and you are checking, not deciding.

The setup matters. The pass order matters. The time you save by having structure before you start is time you do not spend redoing work later.
