50+ Speech Analytics Interview Questions & Answers (2026)

Speech analytics interviews test your ability to build systems that extract insights from conversations. Whether you're interviewing at Gong, Chorus, CallMiner, or a startup, you'll face questions spanning ASR, NLP, speaker diarization, sentiment analysis, and system design.

This guide covers 50+ real interview questions collected from engineers who've interviewed at top speech analytics companies. Each question includes a detailed answer, difficulty rating, and tips on what interviewers are looking for.

πŸ“‹ Interview Format at Top Companies

Gong/Chorus typical process: (1) Phone screen with recruiter, (2) Technical phone screen (45 min coding + theory), (3) Onsite: 4-5 rounds covering system design, ML theory, coding, and behavioral. Total time: 3-4 weeks.

ASR & Transcription Basics

Easy
Q1: Explain the difference between WER and CER. When would you use each?
Answer:

WER (Word Error Rate) measures the percentage of words incorrectly transcribed (insertions, deletions, substitutions). It's the standard metric for English ASR.

CER (Character Error Rate) measures errors at the character level. It's better for:

  • Languages without clear word boundaries (Chinese, Japanese)
  • Evaluating punctuation and capitalization
  • Assessing partial word errors (e.g., "running" β†’ "runnin")

Use WER for: English speech recognition benchmarks
Use CER for: Asian languages, detailed error analysis, or when spelling matters
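Both metrics reduce to edit distance over different token units (words for WER, characters for CER). A minimal, dependency-free sketch; in practice a library such as jiwer is common:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[-1][-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("the cat sat", "the hat sat")` is one substitution over three reference words, i.e. 1/3, while `cer("great", "grate")` captures the partial-word error that WER would count as a whole wrong word.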

Medium
Q2: Your ASR system has 15% WER on clean audio but 40% WER on customer calls. How would you debug this?
Answer:

Systematic approach:

  1. Analyze error patterns: Are errors random or systematic? (e.g., all finance terms wrong)
  2. Check audio quality: SNR, codec artifacts, sample rate mismatch
  3. Domain mismatch: Model trained on broadcast news but customer calls have different vocabulary
  4. Speaker characteristics: Accents, speaking rate, disfluencies more common in real calls
  5. Environment: Background noise, crosstalk, echo in call centers

Solutions:

  • Fine-tune on customer call data (even 10-50 hours helps significantly)
  • Add audio preprocessing (noise reduction, echo cancellation)
  • Use domain-specific language models
  • Implement confidence scoring to flag low-quality segments
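The confidence-scoring step takes very little code. A hypothetical sketch, assuming the ASR emits word-level (word, start_time, confidence) tuples; the exact fields vary by toolkit:

```python
def low_confidence_spans(words, threshold=0.6):
    """Group consecutive low-confidence ASR words into spans for review.

    words: iterable of (word, start_time, confidence) tuples. This tuple
    shape is an assumption for illustration; real toolkits differ.
    """
    spans, current = [], []
    for word, start, conf in words:
        if conf < threshold:
            current.append((word, start))
        elif current:
            spans.append(current)   # close out the current low-confidence run
            current = []
    if current:
        spans.append(current)
    return spans
```

Spans matter more than single words here: a run of low-confidence words usually marks a noisy or out-of-domain segment worth re-listening to.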

Medium
Q3: Why might you choose Kaldi over Whisper for a production speech analytics system?
Answer:

Reasons to choose Kaldi:

  • Streaming capability: Kaldi supports real-time, incremental decoding. Whisper is designed for offline batch transcription, so streaming requires chunking workarounds.
  • CPU-efficient: Kaldi runs well on CPU. Whisper requires GPU for reasonable latency.
  • Cost at scale: Processing millions of calls/day on CPU is cheaper than GPU infrastructure.
  • Customization: Easier to plug in custom language models, pronunciation dictionaries.
  • Proven stability: Kaldi has been in production for 10+ years at major companies.

However, Whisper wins on:

  • Out-of-box accuracy (especially with accents, noise)
  • Multilingual support (99 languages vs separate Kaldi models)
  • Faster development time (no recipe engineering)

Answer shows: Understanding of production trade-offs beyond just accuracy.

Speaker Diarization

Easy
Q4: What is speaker diarization and why is it critical for speech analytics?
Answer:

Speaker diarization is the process of partitioning audio into segments by speaker identityβ€”answering "who spoke when?" without necessarily knowing who the speakers are.

Critical for analytics because:

  • Agent vs Customer separation: Call centers need to analyze agent behavior separately
  • Talk-time ratios: Sales coaching requires knowing how much each person spoke
  • Turn-taking analysis: Detect interruptions, monologues, engagement patterns
  • Sentiment attribution: "Who was frustrated?" requires knowing who said what
  • Compliance: Legal/regulatory requires speaker-attributed transcripts

Without diarization: You just have a wall of text with no context about who said what.

Hard
Q5: Design a speaker diarization system for a call center with 100K calls/day. What are the key components and trade-offs?
Answer:

System Architecture:

Audio Input β†’ VAD β†’ Speaker Embedding β†’ Clustering β†’ Resegmentation β†’ Output
                                   ↓
                              Speaker DB (optional)

Key Components:

1. Voice Activity Detection (VAD):

  • Use WebRTC VAD or Silero VAD (fast, accurate)
  • Reduces computation by 40-60% (skip silence)

2. Speaker Embedding Extraction:

  • Options: x-vectors (Kaldi), ECAPA-TDNN (pyannote.audio)
  • Trade-off: x-vectors faster on CPU, ECAPA more accurate
  • Extract embeddings every 1-2 seconds with overlap

3. Clustering:

  • Agglomerative hierarchical clustering (standard)
  • Challenge: Unknown number of speakers
  • Solution: Use PLDA scoring + threshold tuning

4. Optimization for Scale (100K calls/day):

  • Batch processing: Group calls, process in parallel
  • Model size: Use smaller embedding model if accuracy permits
  • Caching: For known speakers (agents), cache embeddings
  • Infrastructure: CPU-based pipeline (cheaper at scale)

Typical Performance:

  • DER: 5-10% on call center audio (good)
  • Latency: 0.1-0.3x real-time on CPU
  • Cost: ~$0.001/minute (vs $0.02+ for cloud APIs)
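The clustering stage can be sketched with off-the-shelf tools. This is a simplified stand-in using SciPy's agglomerative clustering on cosine distances; production systems typically score with PLDA rather than raw cosine distance, and the threshold must be tuned on a dev set:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_speakers(embeddings, distance_threshold=0.7):
    """Group per-segment speaker embeddings into speaker clusters.

    embeddings: (n_segments, dim) array of x-vectors or ECAPA embeddings.
    distance_threshold: cosine-distance cutoff (illustrative value; tune it).
    Returns an integer cluster label per segment.
    """
    dists = pdist(embeddings, metric='cosine')
    tree = linkage(dists, method='average')
    # Cutting the dendrogram at a distance threshold handles the
    # unknown-number-of-speakers problem.
    return fcluster(tree, t=distance_threshold, criterion='distance')
```

Cutting at a distance threshold (rather than asking for k clusters) is what lets the same pipeline handle two-party calls and multi-party conference calls.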

Medium
Q6: How would you handle overlapping speech in speaker diarization?
Answer:

Problem: Traditional diarization assumes one speaker at a time. Real conversations have overlaps (interruptions, backchannel responses).

Solutions:

1. EEND (End-to-End Neural Diarization):

  • Trained to output multiple speakers per frame
  • Can detect overlaps directly
  • Cons: Requires lots of training data, fixed max speakers

2. Post-processing detection:

  • After initial diarization, detect potential overlap regions
  • Look for high energy in "silence" between speakers
  • Re-analyze those segments with overlap-aware models

3. Multi-channel audio:

  • If you have separate microphones, use beamforming
  • Separate sources before diarization

Practical approach for call centers:

  • Accept 2-5% error rate from overlaps (usually acceptable)
  • Focus on clean turn boundaries (90% of speech)
  • Flag overlap regions for manual review if critical
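The "high energy in silence" heuristic from option 2 can be sketched with NumPy. The energy-ratio threshold is illustrative, and segments are assumed time-sorted and non-overlapping:

```python
import numpy as np

def flag_possible_overlap(samples, segments, sample_rate, energy_ratio=0.5):
    """Flag gaps between diarized segments whose RMS energy is high,
    hinting that someone was actually talking (a possible missed overlap).

    samples: 1-D audio array; segments: sorted [(start_sec, end_sec), ...].
    Returns a list of (gap_start, gap_end) pairs to re-analyze.
    """
    samples = np.asarray(samples, dtype=float)

    def rms(x):
        return float(np.sqrt(np.mean(np.square(x)))) if len(x) else 0.0

    speech = np.concatenate(
        [samples[int(s * sample_rate):int(e * sample_rate)] for s, e in segments]
    )
    speech_rms = rms(speech)
    flagged = []
    for (_, e1), (s2, _) in zip(segments, segments[1:]):
        gap = samples[int(e1 * sample_rate):int(s2 * sample_rate)]
        if len(gap) and rms(gap) > energy_ratio * speech_rms:
            flagged.append((e1, s2))
    return flagged
```

Flagged gaps can then be sent to an overlap-aware model or queued for manual review, as described above.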

NLP & Text Analytics

Medium
Q7: How would you extract action items from a sales call transcript?
Answer:

Approach 1: Rule-Based (Fast, Explainable):

  • Look for patterns: "I'll [verb]", "Let's schedule", "Action item:", "TODO:"
  • Extract entities: dates, times, people, deliverables
  • Works well for structured calls with consistent language

Approach 2: NER + Dependency Parsing:

  • Train NER model to tag: ACTION, ASSIGNEE, DEADLINE
  • Use dependency parsing to link entities
  • More robust to variation than regex

Approach 3: LLM-Based (2026 Standard):

  • Use GPT-4 or Claude with prompt engineering
  • Provide few-shot examples of action items
  • Ask for structured JSON output
Prompt: "Extract action items from this sales call. 
Return JSON with: {task, assignee, deadline, priority}"

Transcript: [...]

Output: [
  {"task": "Send pricing proposal", "assignee": "Sarah", 
   "deadline": "2026-01-20", "priority": "high"}
]

Production considerations:

  • LLMs are expensive ($0.01-0.10/call)
  • Hybrid: Use rules for 80% of cases, LLM for complex ones
  • Always show confidence scores to users
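Approach 1 can be sketched with a few regexes. The patterns below are illustrative, not exhaustive:

```python
import re

# Commitment phrases that typically introduce an action item.
# These three patterns are illustrative; real systems use larger sets.
ACTION_PATTERNS = [
    r"\bI'?ll\s+(?P<task>[^.?!]+)",            # "I'll send the proposal"
    r"\blet'?s\s+schedule\s+(?P<task>[^.?!]+)",
    r"\baction item:\s*(?P<task>[^.?!]+)",
]

def extract_action_items(transcript):
    """Return rough action-item strings from a transcript (rule-based)."""
    items = []
    for pattern in ACTION_PATTERNS:
        for match in re.finditer(pattern, transcript, flags=re.IGNORECASE):
            items.append(match.group('task').strip())
    return items
```

This is exactly the hybrid split described above: rules like these cover the common, cheap cases, and only transcripts they miss get routed to an LLM.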

Hard
Q8: Design a topic modeling system for analyzing 1M customer support calls. How would you make it actionable?
Answer:

Step 1: Preprocessing

  • Transcribe calls (ASR)
  • Clean: remove filler words, agent scripts (boilerplate)
  • Focus on customer utterances (more signal)

Step 2: Topic Modeling Approaches

Traditional: LDA (Latent Dirichlet Allocation)

  • Pros: Interpretable, fast, proven
  • Cons: Requires manual topic count tuning
  • Best for: Stable domains, periodic analysis

Modern: BERTopic

  • Uses BERT embeddings + UMAP + HDBSCAN
  • Automatically determines topic count
  • Better coherence than LDA
  • Best for: Dynamic domains, one-time analysis

Step 3: Making It Actionable

Bad outcome: "Topic 7 has words: refund, policy, return, unhappy"

Actionable outcome: "Refund Policy Confusion (23% of calls, ↑8% vs last month)"

How to get there:

  1. Label topics meaningfully: Use LLM to generate human-readable topic names
  2. Track over time: Topic prevalence trends (which issues growing?)
  3. Link to metrics: CSAT, resolution time by topic
  4. Alert on anomalies: "Shipping delays mentions up 50% today"
  5. Sample calls per topic: Let managers listen to examples
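Steps 2 and 4 (tracking over time and alerting on anomalies) need very little machinery. A minimal sketch, assuming you already have per-topic call counts; the function name and relative-growth threshold are illustrative:

```python
def topic_alerts(baseline_counts, today_counts, threshold=0.5):
    """Flag topics whose share of calls grew by more than `threshold`
    (relative) versus a trailing baseline.

    baseline_counts: {topic: avg calls/day over the trailing window}
    today_counts:    {topic: calls today}
    Returns [(topic, baseline_share, today_share), ...].
    """
    alerts = []
    baseline_total = sum(baseline_counts.values())
    today_total = sum(today_counts.values())
    for topic, count in today_counts.items():
        base_share = baseline_counts.get(topic, 0) / baseline_total
        today_share = count / today_total
        # Compare shares, not raw counts, so overall volume swings
        # (e.g. Mondays) don't trigger false alarms.
        if base_share > 0 and (today_share - base_share) / base_share > threshold:
            alerts.append((topic, base_share, today_share))
    return alerts
```

The output maps directly onto the actionable phrasing above, e.g. "Shipping delays: 60% of calls today vs 33% baseline".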

System Design:

  • Batch processing: Daily overnight job
  • Incremental updates: New calls assigned to existing topics
  • Quarterly re-training: Discover new emerging topics
  • Dashboard: Show top topics, trends, drill-down to calls

Sentiment Analysis

Medium
Q9: Why is sentiment analysis on transcribed speech harder than on written text?
Answer:

Challenges unique to speech:

1. ASR Errors Impact Sentiment:

  • "I'm not happy" β†’ "I'm happy" (transcription error completely flips sentiment)
  • Homophone confusions from the recognizer ("grate" vs "great") break lexicon-based sentiment

2. Missing Prosody (Tone):

  • "That's great." (flat tone = sarcasm, missed in text)
  • Excitement vs anger: same words, different meaning
  • Solution: Use acoustic features (pitch, energy) alongside text

3. Disfluencies:

  • "Um, well, I guess it's, uh, okay maybe?" is negative despite "okay"
  • Hesitation patterns indicate uncertainty/dissatisfaction

4. Context Dependency:

  • "I'll have to think about it" (rejection in sales context)
  • Requires understanding of speaker role (agent vs customer)

Best Practices:

  • Multi-modal: Combine text (BERT sentiment) + audio (prosody features)
  • Speaker-aware: Analyze customer sentiment separately from agent
  • Utterance-level: Don't average sentiment over full call
  • Confidence scores: Flag uncertain predictions for review
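The multi-modal best practice can be as simple as confidence-weighted late fusion. A hypothetical sketch: scores are assumed in [-1, 1], and the weights are illustrative values you would tune on real call data:

```python
def fuse_sentiment(text_score, prosody_score, asr_confidence, w_text=0.7):
    """Late fusion of a text sentiment score and an acoustic (prosody) score.

    Down-weights the text channel when ASR confidence is low, since
    transcription errors can flip sentiment entirely. All weights and the
    parameter names here are illustrative assumptions.
    """
    w = w_text * asr_confidence
    return w * text_score + (1 - w) * prosody_score
```

With high ASR confidence the text model dominates; on a garbled segment the prosody signal (pitch, energy) takes over, which is exactly the failure mode described in point 1.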

Hard
Q10: You're building a real-time sentiment dashboard for call center managers. How would you design the ML pipeline?
Answer:

Requirements:

  • Low latency: Sentiment visible within 5-10 seconds
  • Accuracy: Good enough for intervention decisions
  • Scale: 1000 concurrent calls

Architecture:

Call Audio Stream
    ↓
Streaming ASR (Kaldi/Conformer-RNN-T)
    ↓
Utterance Buffer (5-10s windows)
    ↓
Sentiment Model (DistilBERT fine-tuned)
    ↓
WebSocket β†’ Dashboard
    ↓
Alert System (if negative sentiment detected)

Key Design Decisions:

1. ASR Choice:

  • Must be streaming (Whisper won't work)
  • Options: Kaldi, Conformer-RNN-T, Deepgram API
  • Trade-off: Build vs buy (Kaldi = cheaper at scale, API = faster to market)

2. Sentiment Model:

  • DistilBERT (~60% faster than BERT, retains ~97% of its accuracy)
  • Fine-tuned on call center data (critical!)
  • Inference: <50ms on CPU

3. Windowing Strategy:

  • Analyze 5-10 second chunks (balance latency vs context)
  • Moving average: Smooth out noise
  • Spike detection: Alert when sentiment drops suddenly

4. Alert Logic:

  • Escalate if: Sustained negative sentiment (>30s) OR sudden drop
  • Provide context: "Customer said 'this is unacceptable' at 2:34"
  • Suggested action: "Offer supervisor escalation"
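The windowing and alert logic can be sketched with a small stateful tracker. Window size and threshold below are illustrative:

```python
from collections import deque

class SentimentTracker:
    """Smooth per-chunk sentiment scores and flag sustained negativity.

    Scores are assumed in [-1, 1]. A window of 6 corresponds to ~30s of
    audio at 5-second chunks; both numbers are illustrative.
    """

    def __init__(self, window=6, alert_threshold=-0.5):
        self.scores = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def update(self, score):
        """Ingest one chunk's sentiment score; return the smoothed value."""
        self.scores.append(score)
        return self.moving_average()

    def moving_average(self):
        return sum(self.scores) / len(self.scores)

    def should_alert(self):
        # Sustained negativity: only alert once a full window of chunks
        # is negative on average, so a single bad utterance can't page
        # a manager.
        return (len(self.scores) == self.scores.maxlen
                and self.moving_average() < self.alert_threshold)
```

A sudden-drop detector would sit alongside this, comparing the newest score against the moving average rather than the absolute threshold.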

Infrastructure:

  • ASR: GPU instances (or streaming API)
  • Sentiment: CPU inference (batched)
  • WebSockets: Real-time dashboard updates
  • Redis: Store current call states

System Design

Hard
Q11: Design a scalable speech analytics platform like Gong. Walk through the architecture from audio ingestion to insights delivery.
Answer:

High-Level Architecture:

Audio Ingestion β†’ Processing Pipeline β†’ Analytics β†’ Storage β†’ API/UI
     ↓                    ↓                  ↓         ↓         ↓
  Zoom/Meet          ASR+Diarization    NLP Models    DB    Dashboard

Detailed Components:

1. Audio Ingestion:

  • Integrations: Zoom, Google Meet, Webex APIs
  • Recording capture: Auto-join meetings as bot
  • Queue: Kafka for async processing
  • Storage: S3 for raw audio (lifecycle policy: delete after 90 days)

2. Processing Pipeline (Airflow/Temporal):

  • Task 1: Audio preprocessing (format conversion, enhancement)
  • Task 2: ASR (Whisper on GPU)
  • Task 3: Speaker diarization (pyannote.audio)
  • Task 4: Merge ASR + diarization outputs
  • Task 5: NLP analytics (parallel jobs)

3. NLP Analytics (Parallel Processing):

  • Sentiment analysis per speaker
  • Topic extraction
  • Action items detection
  • Question identification
  • Talk-time ratios, interruptions, speaking rate
  • Keyword/competitor mentions

4. Storage Layer:

  • PostgreSQL: Structured data (users, meetings, metadata)
  • Elasticsearch: Full-text search on transcripts
  • Vector DB (Pinecone/Weaviate): Semantic search ("find calls about pricing objections")
  • Redis: Caching, session data

5. API Layer (FastAPI):

  • REST API for CRUD operations
  • GraphQL for complex queries
  • WebSockets for real-time features
  • Rate limiting, authentication (OAuth)

6. Frontend (React):

  • Dashboard: Call library, analytics charts
  • Call player: Transcript + audio synchronized
  • Search: Semantic + keyword search
  • Insights: AI-generated summaries, coaching tips

Scale Considerations:

  • Processing: 10K meetings/day at ~1 hour each ≈ 300K hours audio/month
  • ASR cost: at $0.02/min that's ~$360K/month unoptimized, so batching and self-hosted inference are essential
  • Storage: 1 hour audio β‰ˆ 50 MB, transcript β‰ˆ 100 KB
  • Compute: Auto-scaling GPU pools for ASR, CPU for NLP

Key Optimizations:

  • Batch process non-urgent calls (overnight)
  • Priority queue for "urgent" calls (live meetings)
  • Caching: Pre-compute common analytics views
  • CDN: Serve audio files from edge locations
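The priority-queue optimization is straightforward with the standard library. In this sketch the class and priority names are hypothetical; FIFO order is preserved within each priority level via a tie-breaking counter:

```python
import heapq

URGENT, BATCH = 0, 1  # lower number = processed first

class CallQueue:
    """Priority queue: live/urgent calls jump ahead of overnight batch calls."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps FIFO order within a priority

    def submit(self, call_id, priority=BATCH):
        heapq.heappush(self._heap, (priority, self._counter, call_id))
        self._counter += 1

    def next_call(self):
        """Pop the highest-priority (then oldest) pending call."""
        return heapq.heappop(self._heap)[2]
```

In production this role is usually played by separate Kafka topics or queue partitions per priority, but the scheduling logic is the same.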

ML Theory

Medium
Q12: Explain how you would fine-tune a pre-trained BERT model for call center sentiment analysis. What are the key steps?
Answer:

Dataset Preparation:

  1. Collect 5K-10K labeled call transcripts (customer utterances)
  2. Labels: Positive, Neutral, Negative (or 5-point scale)
  3. Balance classes (over/undersample if needed)
  4. Split: 80% train, 10% validation, 10% test

Preprocessing:

  • Tokenize with BERT tokenizer (WordPiece)
  • Max length: 512 tokens (truncate longer utterances)
  • Handle ASR errors: Keep as-is (model learns robustness)

Model Architecture:

  • Base: `bert-base-uncased` (110M params)
  • Add classification head: Linear(768 β†’ 3 classes)
  • Alternative: Use DistilBERT (~60% faster, retains most of BERT's accuracy)

Training:

from torch.optim import AdamW
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=3
)

# Typical fine-tuning hyperparameters
learning_rate = 2e-5   # small LR so fine-tuning doesn't wipe out pre-trained weights
num_epochs = 3         # 3-5 epochs is usually enough
batch_size = 16
optimizer = AdamW(model.parameters(), lr=learning_rate)

Evaluation:

  • Metrics: Accuracy, F1 (macro), confusion matrix
  • Error analysis: Which sentiment mistakes are critical?
  • Calibration: Check confidence scores align with accuracy

Deployment:

  • Export to ONNX for faster inference
  • Inference: <100ms on CPU for single utterance
  • Monitor: Track accuracy on production data, retrain quarterly

Coding Challenges

Medium
Q13: Write a function to calculate speaker talk-time percentages from a diarization output.
Answer:
def calculate_talk_time(diarization_output):
    """
    Calculate talk-time percentage per speaker.
    
    Args:
        diarization_output: List of tuples [(start, end, speaker), ...]
        e.g., [(0.0, 5.2, 'SPEAKER_00'), (5.2, 12.8, 'SPEAKER_01'), ...]
    
    Returns:
        Dict of {speaker: talk_time_percentage}
    """
    speaker_durations = {}
    total_duration = 0
    
    for start, end, speaker in diarization_output:
        duration = end - start
        speaker_durations[speaker] = speaker_durations.get(speaker, 0) + duration
        total_duration += duration
    
    # Calculate percentages
    talk_time_pct = {
        speaker: (duration / total_duration) * 100 
        for speaker, duration in speaker_durations.items()
    }
    
    return talk_time_pct

# Example
diarization = [
    (0.0, 5.2, 'SPEAKER_00'),
    (5.2, 12.8, 'SPEAKER_01'),
    (12.8, 18.5, 'SPEAKER_00'),
    (18.5, 25.0, 'SPEAKER_01')
]

result = calculate_talk_time(diarization)
# {'SPEAKER_00': 43.6, 'SPEAKER_01': 56.4}

Follow-up questions interviewers might ask:

  • How would you handle overlapping speech? (Add overlap handling logic)
  • What if there are gaps (silence)? (Track silence separately)
  • How to optimize for large datasets? (Use NumPy for vectorization)
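For the silence follow-up, a hedged sketch: pass in the total call duration and report unattributed time as its own bucket (the 'SILENCE' key is an illustrative convention, not a standard):

```python
def talk_and_silence(diarization_output, call_duration):
    """Talk-time percentages per speaker plus a silence bucket.

    diarization_output: [(start, end, speaker), ...] with non-overlapping
    segments; call_duration: total call length in seconds.
    """
    speaker_durations = {}
    spoken = 0.0
    for start, end, speaker in diarization_output:
        speaker_durations[speaker] = speaker_durations.get(speaker, 0.0) + (end - start)
        spoken += end - start
    silence = max(call_duration - spoken, 0.0)  # clamp against rounding
    pct = {spk: 100 * d / call_duration for spk, d in speaker_durations.items()}
    pct['SILENCE'] = 100 * silence / call_duration
    return pct
```

Dead air is a useful metric in its own right: long silences on support calls often indicate hold time or agents searching for answers.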

Behavioral Questions

Medium
Q14: Tell me about a time you had to make a trade-off between model accuracy and latency in production.
Answer Framework (STAR):

Situation: "At [Company], we were deploying real-time sentiment analysis for live customer calls. Initial model was BERT-large with 95% accuracy but 300ms latency."

Task: "Product required <100ms latency for real-time agent coaching. I needed to reduce latency by 3x without sacrificing too much accuracy."

Action:

  • "Benchmarked alternatives: DistilBERT, TinyBERT, lightweight CNNs"
  • "DistilBERT gave 92% accuracy at 50ms (6x faster)"
  • "Quantized to INT8, reducing latency to 35ms"
  • "Implemented confidence thresholding: only show predictions >0.85 confidence"
  • "A/B tested with customer success team"

Result: "Deployed DistilBERT-INT8. Achieved 40ms latency, 91% accuracy on production data. CS team reported 30% faster resolution times. Small accuracy drop (95% β†’ 91%) was acceptable given 7x latency improvement."

Why this answer works: Shows quantitative thinking, practical trade-offs, validation with stakeholders.

βœ“ Interview Success Tips

For technical questions: Always explain your reasoning. Interviewers care more about how you think than memorized answers. Start with clarifying questions, state assumptions, then walk through your approach systematically.

Company-Specific Focus Areas

Gong interviews emphasize:

  • System design for scale (millions of calls)
  • Real-time analytics challenges
  • Product thinking (what insights matter to sales teams?)

Chorus/CallMiner focus on:

  • ASR accuracy improvements
  • Contact center domain knowledge
  • Compliance and quality monitoring

Otter/Fireflies ask about:

  • Consumer product thinking
  • Whisper optimization and fine-tuning
  • Cross-platform integration (Zoom, Meet, Teams)

Preparation Checklist

  • βœ… Review ASR basics (WER, model types, evaluation)
  • βœ… Understand speaker diarization deeply (pyannote.audio docs)
  • βœ… Practice NLP tasks (sentiment, NER, topic modeling)
  • βœ… Design 2-3 speech analytics systems from scratch
  • βœ… Code implementations of common tasks (talk-time, keyword extraction)
  • βœ… Prepare behavioral stories using STAR framework
  • βœ… Research the specific company's tech stack and product

Ready to Apply?

Browse speech analytics roles at Gong, Chorus, CallMiner, and more.

View Speech Analytics Jobs