Speech analytics interviews test your ability to build systems that extract insights from conversations. Whether you're interviewing at Gong, Chorus, CallMiner, or a startup, you'll face questions spanning ASR, NLP, speaker diarization, sentiment analysis, and system design.
This guide covers 50+ real interview questions collected from engineers who've interviewed at top speech analytics companies. Each question includes a detailed answer, difficulty rating, and tips on what interviewers are looking for.
Gong/Chorus typical process: (1) Phone screen with recruiter, (2) Technical phone screen (45 min coding + theory), (3) Onsite: 4-5 rounds covering system design, ML theory, coding, and behavioral. Total time: 3-4 weeks.
ASR & Transcription Basics
WER (Word Error Rate) measures the proportion of words transcribed incorrectly: substitutions, deletions, and insertions divided by the number of words in the reference. It's the standard metric for English ASR.
CER (Character Error Rate) measures errors at the character level. It's better for:
- Languages without clear word boundaries (Chinese, Japanese)
- Evaluating punctuation and capitalization
- Assessing partial word errors (e.g., "running" → "runnin'")
Use WER for: English speech recognition benchmarks
Use CER for: Asian languages, detailed error analysis, or when spelling matters
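Both metrics reduce to Levenshtein edit distance over different units (words for WER, characters for CER). A minimal sketch in pure Python; production code would typically use a library such as jiwer:

```python
def edit_distance(ref, hyp):
    """Minimum insertions + deletions + substitutions to turn ref into hyp."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def wer(reference, hypothesis):
    # Word-level: edit distance over token lists, normalized by reference length
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # Character-level: same distance over character sequences
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("the cat sat", "the cat sit")` is one substitution over three reference words, i.e. about 0.33, while the CER for the same pair is one character error over eleven characters.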
Systematic approach:
- Analyze error patterns: Are errors random or systematic? (e.g., all finance terms wrong)
- Check audio quality: SNR, codec artifacts, sample rate mismatch
- Domain mismatch: Model trained on broadcast news but customer calls have different vocabulary
- Speaker characteristics: Accents, speaking rate, disfluencies more common in real calls
- Environment: Background noise, crosstalk, echo in call centers
Solutions:
- Fine-tune on customer call data (even 10-50 hours helps significantly)
- Add audio preprocessing (noise reduction, echo cancellation)
- Use domain-specific language models
- Implement confidence scoring to flag low-quality segments
Reasons to choose Kaldi:
- Streaming capability: Kaldi supports real-time, incremental decoding. Whisper is offline-only.
- CPU-efficient: Kaldi runs well on CPU. Whisper requires GPU for reasonable latency.
- Cost at scale: Processing millions of calls/day on CPU is cheaper than GPU infrastructure.
- Customization: Easier to plug in custom language models, pronunciation dictionaries.
- Proven stability: Kaldi has been in production for 10+ years at major companies.
However, Whisper wins on:
- Out-of-box accuracy (especially with accents, noise)
- Multilingual support (99 languages vs separate Kaldi models)
- Faster development time (no recipe engineering)
Answer shows: Understanding of production trade-offs beyond just accuracy.
Speaker Diarization
Speaker diarization is the process of partitioning audio into segments by speaker identity, answering "who spoke when?" without necessarily knowing who the speakers are.
Critical for analytics because:
- Agent vs Customer separation: Call centers need to analyze agent behavior separately
- Talk-time ratios: Sales coaching requires knowing how much each person spoke
- Turn-taking analysis: Detect interruptions, monologues, engagement patterns
- Sentiment attribution: "Who was frustrated?" requires knowing who said what
- Compliance: Legal/regulatory requires speaker-attributed transcripts
Without diarization: You just have a wall of text with no context about who said what.
System Architecture:
Audio Input → VAD → Speaker Embedding → Clustering → Resegmentation → Output
                                            ↑
                                  Speaker DB (optional)
Key Components:
1. Voice Activity Detection (VAD):
- Use WebRTC VAD or Silero VAD (fast, accurate)
- Reduces computation by 40-60% (skip silence)
2. Speaker Embedding Extraction:
- Options: x-vectors (Kaldi), ECAPA-TDNN (pyannote.audio)
- Trade-off: x-vectors faster on CPU, ECAPA more accurate
- Extract embeddings every 1-2 seconds with overlap
3. Clustering:
- Agglomerative hierarchical clustering (standard)
- Challenge: Unknown number of speakers
- Solution: Use PLDA scoring + threshold tuning
4. Optimization for Scale (100K calls/day):
- Batch processing: Group calls, process in parallel
- Model size: Use smaller embedding model if accuracy permits
- Caching: For known speakers (agents), cache embeddings
- Infrastructure: CPU-based pipeline (cheaper at scale)
Typical Performance:
- DER: 5-10% on call center audio (good)
- Latency: 0.1-0.3x real-time on CPU
- Cost: ~$0.001/minute (vs $0.02+ for cloud APIs)
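The clustering step can be sketched as a greedy agglomerative pass over cosine similarities between cluster centroids. This is an illustrative simplification (no PLDA scoring), and the 0.5 distance threshold is an assumed value you'd tune on held-out data; the speaker count falls out of the threshold rather than being fixed up front:

```python
import numpy as np

def cluster_embeddings(embeddings, threshold=0.5):
    """Greedy agglomerative clustering: repeatedly merge the closest pair of
    clusters (cosine distance between centroids) until no pair is closer than
    `threshold`. Returns one cluster label per embedding."""
    X = np.asarray(embeddings, dtype=float)
    clusters = [[i] for i in range(len(X))]

    def centroid(members):
        v = X[members].mean(axis=0)
        return v / (np.linalg.norm(v) + 1e-10)  # unit-normalize for cosine

    while len(clusters) > 1:
        cents = np.stack([centroid(c) for c in clusters])
        sims = cents @ cents.T                  # pairwise cosine similarity
        np.fill_diagonal(sims, -np.inf)
        i, j = np.unravel_index(np.argmax(sims), sims.shape)
        if 1.0 - sims[i, j] > threshold:        # closest pair still too far apart
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(X)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels
```

With two well-separated groups of toy embeddings, the function discovers two clusters without being told the speaker count.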
Problem: Traditional diarization assumes one speaker at a time. Real conversations have overlaps (interruptions, backchannel responses).
Solutions:
1. EEND (End-to-End Neural Diarization):
- Trained to output multiple speakers per frame
- Can detect overlaps directly
- Cons: Requires lots of training data, fixed max speakers
2. Post-processing detection:
- After initial diarization, detect potential overlap regions
- Look for high energy in "silence" between speakers
- Re-analyze those segments with overlap-aware models
3. Multi-channel audio:
- If you have separate microphones, use beamforming
- Separate sources before diarization
Practical approach for call centers:
- Accept 2-5% error rate from overlaps (usually acceptable)
- Focus on clean turn boundaries (90% of speech)
- Flag overlap regions for manual review if critical
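The post-processing idea in option 2 starts with locating candidate overlap regions. A minimal sketch (a hypothetical helper, not from any specific library) that flags time ranges where segments from different speakers intersect:

```python
def find_overlaps(segments):
    """Given [(start, end, speaker), ...], return time ranges where segments
    from different speakers intersect. These are the regions to re-analyze
    with an overlap-aware model or flag for manual review."""
    overlaps = []
    ordered = sorted(segments, key=lambda s: s[0])
    for i, (s1, e1, spk1) in enumerate(ordered):
        for s2, e2, spk2 in ordered[i + 1:]:
            if s2 >= e1:           # later segments start after this one ends
                break
            if spk1 != spk2:
                overlaps.append((max(s1, s2), min(e1, e2)))
    return overlaps
```

For example, `find_overlaps([(0, 5, 'A'), (4, 8, 'B'), (8, 10, 'A')])` flags the interval (4, 5) where the two speakers talked over each other.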
NLP & Text Analytics
Approach 1: Rule-Based (Fast, Explainable):
- Look for patterns: "I'll [verb]", "Let's schedule", "Action item:", "TODO:"
- Extract entities: dates, times, people, deliverables
- Works well for structured calls with consistent language
Approach 2: NER + Dependency Parsing:
- Train NER model to tag: ACTION, ASSIGNEE, DEADLINE
- Use dependency parsing to link entities
- More robust to variation than regex
Approach 3: LLM-Based (2026 Standard):
- Use GPT-4 or Claude with prompt engineering
- Provide few-shot examples of action items
- Ask for structured JSON output
Prompt: "Extract action items from this sales call.
         Return JSON with: {task, assignee, deadline, priority}"
Transcript: [...]
Output: [
  {"task": "Send pricing proposal", "assignee": "Sarah",
   "deadline": "2026-01-20", "priority": "high"}
]
Production considerations:
- LLMs are expensive ($0.01-0.10/call)
- Hybrid: Use rules for 80% of cases, LLM for complex ones
- Always show confidence scores to users
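Approach 1 can be sketched with a handful of regex patterns. The patterns below are illustrative, not exhaustive; a real system would add many more plus entity extraction for assignees and deadlines:

```python
import re

# Illustrative commitment patterns only; a production rule set would be
# far larger and tuned to the call domain.
ACTION_PATTERNS = [
    r"\bI'll\s+(\w+(?:\s+\w+){0,5})",
    r"\blet's\s+schedule\s+(\w+(?:\s+\w+){0,5})",
    r"\baction item:\s*(.+)",
    r"\btodo:\s*(.+)",
]

def extract_action_items(transcript):
    """Return candidate action-item phrases found line by line."""
    items = []
    for line in transcript.splitlines():
        for pattern in ACTION_PATTERNS:
            for match in re.finditer(pattern, line, flags=re.IGNORECASE):
                items.append(match.group(1).strip())
    return items
```

This is fast and explainable, which is exactly why the hybrid setup routes the easy 80% of cases here before spending LLM budget on the rest.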
Step 1: Preprocessing
- Transcribe calls (ASR)
- Clean: remove filler words, agent scripts (boilerplate)
- Focus on customer utterances (more signal)
Step 2: Topic Modeling Approaches
Traditional: LDA (Latent Dirichlet Allocation)
- Pros: Interpretable, fast, proven
- Cons: Requires manual topic count tuning
- Best for: Stable domains, periodic analysis
Modern: BERTopic
- Uses BERT embeddings + UMAP + HDBSCAN
- Automatically determines topic count
- Better coherence than LDA
- Best for: Dynamic domains, one-time analysis
Step 3: Making It Actionable
Bad outcome: "Topic 7 has words: refund, policy, return, unhappy"
Actionable outcome: "Refund Policy Confusion (23% of calls, up 8% vs last month)"
How to get there:
- Label topics meaningfully: Use LLM to generate human-readable topic names
- Track over time: Topic prevalence trends (which issues growing?)
- Link to metrics: CSAT, resolution time by topic
- Alert on anomalies: "Shipping delays mentions up 50% today"
- Sample calls per topic: Let managers listen to examples
System Design:
- Batch processing: Daily overnight job
- Incremental updates: New calls assigned to existing topics
- Quarterly re-training: Discover new emerging topics
- Dashboard: Show top topics, trends, drill-down to calls
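The "alert on anomalies" idea can be sketched as a relative-change check on topic prevalence between two periods. The 50% growth threshold and minimum-volume cutoff below are assumed values you'd tune:

```python
def topic_alerts(counts_today, counts_baseline, threshold=0.5, min_calls=20):
    """Flag topics whose share of calls grew by more than `threshold`
    (relative change) versus a baseline period."""
    total_today = sum(counts_today.values())
    total_base = sum(counts_baseline.values())
    alerts = []
    for topic, count in counts_today.items():
        if count < min_calls:
            continue                      # ignore noisy low-volume topics
        share_today = count / total_today
        share_base = counts_baseline.get(topic, 0) / total_base if total_base else 0
        if share_base == 0:
            alerts.append((topic, float('inf')))   # brand-new topic
        elif (share_today - share_base) / share_base > threshold:
            alerts.append((topic, (share_today - share_base) / share_base))
    return alerts
```

Comparing shares rather than raw counts keeps the alert robust to day-to-day swings in overall call volume.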
Sentiment Analysis
Challenges unique to speech:
1. ASR Errors Impact Sentiment:
- "I'm not happy" → "I'm happy" (transcription error completely flips sentiment)
- Misspelled sentiment words ("grate" vs "great")
2. Missing Prosody (Tone):
- "That's great." (flat tone = sarcasm, missed in text)
- Excitement vs anger: same words, different meaning
- Solution: Use acoustic features (pitch, energy) alongside text
3. Disfluencies:
- "Um, well, I guess it's, uh, okay maybe?" is negative despite "okay"
- Hesitation patterns indicate uncertainty/dissatisfaction
4. Context Dependency:
- "I'll have to think about it" (rejection in sales context)
- Requires understanding of speaker role (agent vs customer)
Best Practices:
- Multi-modal: Combine text (BERT sentiment) + audio (prosody features)
- Speaker-aware: Analyze customer sentiment separately from agent
- Utterance-level: Don't average sentiment over full call
- Confidence scores: Flag uncertain predictions for review
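As a cheap text-side proxy for the disfluency signal, a filler-word ratio can be computed per utterance and fed to the sentiment model as an extra uncertainty feature. The filler list here is illustrative, and words like "well" or "maybe" will produce some false positives:

```python
# Illustrative filler/hedge lexicon; tune per domain.
FILLERS = {"um", "uh", "er", "hmm", "well", "like", "maybe", "guess"}

def hesitation_score(utterance):
    """Fraction of tokens that are filler/hedge words; higher values suggest
    uncertainty even when the literal words sound positive."""
    tokens = utterance.lower().replace(",", " ").replace("?", " ").split()
    if not tokens:
        return 0.0
    fillers = sum(1 for t in tokens if t in FILLERS)
    return fillers / len(tokens)
```

On the example above, "Um, well, I guess it's, uh, okay maybe?" scores far higher than a confident "This is great", which is exactly the signal a text-only sentiment model misses.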
Requirements:
- Low latency: Sentiment visible within 5-10 seconds
- Accuracy: Good enough for intervention decisions
- Scale: 1000 concurrent calls
Architecture:
Call Audio Stream
        ↓
Streaming ASR (Kaldi/Conformer-RNN-T)
        ↓
Utterance Buffer (5-10s windows)
        ↓
Sentiment Model (DistilBERT fine-tuned)
        ↓
WebSocket → Dashboard
        ↓
Alert System (if negative sentiment detected)
Key Design Decisions:
1. ASR Choice:
- Must be streaming (Whisper won't work)
- Options: Kaldi, Conformer-RNN-T, Deepgram API
- Trade-off: Build vs buy (Kaldi = cheaper at scale, API = faster to market)
2. Sentiment Model:
- DistilBERT (~60% faster than BERT, retains ~97% of its accuracy)
- Fine-tuned on call center data (critical!)
- Inference: <50ms on CPU
3. Windowing Strategy:
- Analyze 5-10 second chunks (balance latency vs context)
- Moving average: Smooth out noise
- Spike detection: Alert when sentiment drops suddenly
4. Alert Logic:
- Escalate if: Sustained negative sentiment (>30s) OR sudden drop
- Provide context: "Customer said 'this is unacceptable' at 2:34"
- Suggested action: "Offer supervisor escalation"
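The windowing and alert logic can be sketched as a short rolling buffer with two triggers. The window length, thresholds, and sentiment scores in [-1, 1] are all assumptions to tune:

```python
from collections import deque

class SentimentAlerter:
    """Track per-window sentiment scores in [-1, 1] and fire on either
    sustained negativity or a sudden drop."""

    def __init__(self, window_s=10, sustained_s=30,
                 neg_threshold=-0.3, drop_threshold=0.6):
        self.windows_needed = sustained_s // window_s  # consecutive windows
        self.neg_threshold = neg_threshold
        self.drop_threshold = drop_threshold
        self.recent = deque(maxlen=self.windows_needed)
        self.prev = None

    def update(self, score):
        """Feed one window's average sentiment; return an alert reason or None."""
        alert = None
        if self.prev is not None and self.prev - score >= self.drop_threshold:
            alert = "sudden_drop"
        self.recent.append(score)
        if (len(self.recent) == self.windows_needed
                and all(s <= self.neg_threshold for s in self.recent)):
            alert = alert or "sustained_negative"
        self.prev = score
        return alert
```

Keeping the state per call is tiny (a few floats), which is what makes 1000 concurrent calls feasible on modest hardware; in the architecture above this state would live in Redis.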
Infrastructure:
- ASR: GPU instances (or streaming API)
- Sentiment: CPU inference (batched)
- WebSockets: Real-time dashboard updates
- Redis: Store current call states
System Design
High-Level Architecture:
Audio Ingestion → Processing Pipeline → Analytics → Storage → API/UI
       ↓                  ↓                 ↓          ↓         ↓
   Zoom/Meet       ASR+Diarization     NLP Models      DB     Dashboard
Detailed Components:
1. Audio Ingestion:
- Integrations: Zoom, Google Meet, Webex APIs
- Recording capture: Auto-join meetings as bot
- Queue: Kafka for async processing
- Storage: S3 for raw audio (lifecycle policy: delete after 90 days)
2. Processing Pipeline (Airflow/Temporal):
- Task 1: Audio preprocessing (format conversion, enhancement)
- Task 2: ASR (Whisper on GPU)
- Task 3: Speaker diarization (pyannote.audio)
- Task 4: Merge ASR + diarization outputs
- Task 5: NLP analytics (parallel jobs)
3. NLP Analytics (Parallel Processing):
- Sentiment analysis per speaker
- Topic extraction
- Action items detection
- Question identification
- Talk-time ratios, interruptions, speaking rate
- Keyword/competitor mentions
4. Storage Layer:
- PostgreSQL: Structured data (users, meetings, metadata)
- Elasticsearch: Full-text search on transcripts
- Vector DB (Pinecone/Weaviate): Semantic search ("find calls about pricing objections")
- Redis: Caching, session data
5. API Layer (FastAPI):
- REST API for CRUD operations
- GraphQL for complex queries
- WebSockets for real-time features
- Rate limiting, authentication (OAuth)
6. Frontend (React):
- Dashboard: Call library, analytics charts
- Call player: Transcript + audio synchronized
- Search: Semantic + keyword search
- Insights: AI-generated summaries, coaching tips
Scale Considerations:
- Processing: 10K meetings/day = ~20K hours audio/month
- ASR cost: Whisper at $0.02/min = $24K/month (optimize with batching)
- Storage: 1 hour audio β 50 MB, transcript β 100 KB
- Compute: Auto-scaling GPU pools for ASR, CPU for NLP
Key Optimizations:
- Batch process non-urgent calls (overnight)
- Priority queue for "urgent" calls (live meetings)
- Caching: Pre-compute common analytics views
- CDN: Serve audio files from edge locations
ML Theory
Dataset Preparation:
- Collect 5K-10K labeled call transcripts (customer utterances)
- Labels: Positive, Neutral, Negative (or 5-point scale)
- Balance classes (over/undersample if needed)
- Split: 80% train, 10% validation, 10% test
Preprocessing:
- Tokenize with BERT tokenizer (WordPiece)
- Max length: 512 tokens (truncate longer utterances)
- Handle ASR errors: Keep as-is (model learns robustness)
Model Architecture:
- Base: `bert-base-uncased` (110M params)
- Add classification head: Linear(768 → 3 classes)
- Alternative: Use DistilBERT (~60% faster, retains ~97% of BERT's accuracy)
Training:
# Fine-tune with the Hugging Face transformers library
from torch.optim import AdamW
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=3
)

# Typical fine-tuning hyperparameters
learning_rate = 2e-5  # small LR to avoid catastrophic forgetting
epochs = 3            # 3-5 epochs is usually enough
batch_size = 16
optimizer = AdamW(model.parameters(), lr=learning_rate)
Evaluation:
- Metrics: Accuracy, F1 (macro), confusion matrix
- Error analysis: Which sentiment mistakes are critical?
- Calibration: Check confidence scores align with accuracy
Deployment:
- Export to ONNX for faster inference
- Inference: <100ms on CPU for single utterance
- Monitor: Track accuracy on production data, retrain quarterly
Coding Challenges
def calculate_talk_time(diarization_output):
"""
Calculate talk-time percentage per speaker.
Args:
diarization_output: List of tuples [(start, end, speaker), ...]
e.g., [(0.0, 5.2, 'SPEAKER_00'), (5.2, 12.8, 'SPEAKER_01'), ...]
Returns:
Dict of {speaker: talk_time_percentage}
"""
speaker_durations = {}
total_duration = 0
for start, end, speaker in diarization_output:
duration = end - start
speaker_durations[speaker] = speaker_durations.get(speaker, 0) + duration
total_duration += duration
# Calculate percentages
talk_time_pct = {
speaker: (duration / total_duration) * 100
for speaker, duration in speaker_durations.items()
}
return talk_time_pct
# Example
diarization = [
(0.0, 5.2, 'SPEAKER_00'),
(5.2, 12.8, 'SPEAKER_01'),
(12.8, 18.5, 'SPEAKER_00'),
(18.5, 25.0, 'SPEAKER_01')
]
result = calculate_talk_time(diarization)
# {'SPEAKER_00': 43.6, 'SPEAKER_01': 56.4}
Follow-up questions interviewers might ask:
- How would you handle overlapping speech? (Add overlap handling logic)
- What if there are gaps (silence)? (Track silence separately)
- How to optimize for large datasets? (Use NumPy for vectorization)
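For the NumPy follow-up, one possible vectorized version replaces the Python loop with array operations (it assumes parallel start/end/speaker arrays rather than the tuple list):

```python
import numpy as np

def calculate_talk_time_np(starts, ends, speakers):
    """Vectorized talk-time percentages: all durations in one subtraction,
    per-speaker sums via np.add.at on integer speaker codes."""
    starts = np.asarray(starts, dtype=float)
    ends = np.asarray(ends, dtype=float)
    durations = ends - starts
    # Map speaker labels to integer codes, then scatter-add durations
    uniq, codes = np.unique(speakers, return_inverse=True)
    sums = np.zeros(len(uniq))
    np.add.at(sums, codes, durations)
    pct = sums / durations.sum() * 100
    return dict(zip(uniq.tolist(), pct.tolist()))
```

On millions of segments this avoids per-tuple Python overhead entirely; the result matches the loop-based version above.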
Behavioral Questions
Situation: "At [Company], we were deploying real-time sentiment analysis for live customer calls. Initial model was BERT-large with 95% accuracy but 300ms latency."
Task: "Product required <100ms latency for real-time agent coaching. I needed to reduce latency by 3x without sacrificing too much accuracy."
Action:
- "Benchmarked alternatives: DistilBERT, TinyBERT, lightweight CNNs"
- "DistilBERT gave 92% accuracy at 50ms (6x faster)"
- "Quantized to INT8, reducing latency to 35ms"
- "Implemented confidence thresholding: only show predictions >0.85 confidence"
- "A/B tested with customer success team"
Result: "Deployed DistilBERT-INT8. Achieved 40ms latency, 91% accuracy on production data. CS team reported 30% faster resolution times. Small accuracy drop (95% → 91%) was acceptable given 7x latency improvement."
Why this answer works: Shows quantitative thinking, practical trade-offs, validation with stakeholders.
For technical questions: Always explain your reasoning. Interviewers care more about how you think than memorized answers. Start with clarifying questions, state assumptions, then walk through your approach systematically.
Company-Specific Focus Areas
Gong interviews emphasize:
- System design for scale (millions of calls)
- Real-time analytics challenges
- Product thinking (what insights matter to sales teams?)
Chorus/CallMiner focus on:
- ASR accuracy improvements
- Contact center domain knowledge
- Compliance and quality monitoring
Otter/Fireflies ask about:
- Consumer product thinking
- Whisper optimization and fine-tuning
- Cross-platform integration (Zoom, Meet, Teams)
Preparation Checklist
- [ ] Review ASR basics (WER, model types, evaluation)
- [ ] Understand speaker diarization deeply (pyannote.audio docs)
- [ ] Practice NLP tasks (sentiment, NER, topic modeling)
- [ ] Design 2-3 speech analytics systems from scratch
- [ ] Code implementations of common tasks (talk-time, keyword extraction)
- [ ] Prepare behavioral stories using STAR framework
- [ ] Research the specific company's tech stack and product