Speech analytics interviews test your ability to build systems that extract insights from conversations. Whether you're interviewing at Gong, Chorus, CallMiner, or a startup, you'll face questions spanning ASR, NLP, speaker diarization, sentiment analysis, and system design.
This guide covers 50+ real interview questions collected from engineers who've interviewed at top speech analytics companies. Each question includes a detailed answer, difficulty rating, and tips on what interviewers are looking for.
Gong/Chorus typical process: (1) Phone screen with recruiter, (2) Technical phone screen (45 min coding + theory), (3) Onsite: 4-5 rounds covering system design, ML theory, coding, and behavioral. Total time: 3-4 weeks.
ASR & Transcription Basics
WER (Word Error Rate) measures the proportion of words transcribed incorrectly: substitutions, deletions, and insertions divided by the number of words in the reference. It's the standard metric for English ASR.
CER (Character Error Rate) measures errors at the character level. It's better for:
- Languages without clear word boundaries (Chinese, Japanese)
- Evaluating punctuation and capitalization
- Assessing partial word errors (e.g., "running" → "runnin'")
Use WER for: English speech recognition benchmarks
Use CER for: Asian languages, detailed error analysis, or when spelling matters
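Both metrics reduce to Levenshtein edit distance over different units (words for WER, characters for CER). A minimal sketch in pure Python; production code would typically use a library such as jiwer:

```python
def edit_distance(ref, hyp):
    """Minimum insertions + deletions + substitutions to turn ref into hyp."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def wer(reference, hypothesis):
    # Word-level: edit distance over token lists, normalized by reference length
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # Character-level: same distance over character sequences
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("the cat sat", "the cat sit")` is one substitution over three reference words, i.e. about 0.33, while the CER for the same pair is one character error over eleven characters.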
Systematic approach:
- Analyze error patterns: Are errors random or systematic? (e.g., all finance terms wrong)
- Check audio quality: SNR, codec artifacts, sample rate mismatch
- Domain mismatch: Model trained on broadcast news but customer calls have different vocabulary
- Speaker characteristics: Accents, speaking rate, disfluencies more common in real calls
- Environment: Background noise, crosstalk, echo in call centers
Solutions:
- Fine-tune on customer call data (even 10-50 hours helps significantly)
- Add audio preprocessing (noise reduction, echo cancellation)
- Use domain-specific language models
- Implement confidence scoring to flag low-quality segments
Reasons to choose Kaldi:
- Streaming capability: Kaldi supports real-time, incremental decoding. Whisper is offline-only.
- CPU-efficient: Kaldi runs well on CPU. Whisper requires GPU for reasonable latency.
- Cost at scale: Processing millions of calls/day on CPU is cheaper than GPU infrastructure.
- Customization: Easier to plug in custom language models, pronunciation dictionaries.
- Proven stability: Kaldi has been in production for 10+ years at major companies.
However, Whisper wins on:
- Out-of-box accuracy (especially with accents, noise)
- Multilingual support (99 languages vs separate Kaldi models)
- Faster development time (no recipe engineering)
Answer shows: Understanding of production trade-offs beyond just accuracy.
Speaker Diarization
Speaker diarization is the process of partitioning audio into segments by speaker identity, answering "who spoke when?" without necessarily knowing who the speakers are.
Critical for analytics because:
- Agent vs Customer separation: Call centers need to analyze agent behavior separately
- Talk-time ratios: Sales coaching requires knowing how much each person spoke
- Turn-taking analysis: Detect interruptions, monologues, engagement patterns
- Sentiment attribution: "Who was frustrated?" requires knowing who said what
- Compliance: Legal/regulatory requires speaker-attributed transcripts
Without diarization: You just have a wall of text with no context about who said what.
System Architecture:
Audio Input → VAD → Speaker Embedding → Clustering → Resegmentation → Output
                                            ↑
                                  Speaker DB (optional)
Key Components:
1. Voice Activity Detection (VAD):
- Use WebRTC VAD or Silero VAD (fast, accurate)
- Reduces computation by 40-60% (skip silence)
2. Speaker Embedding Extraction:
- Options: x-vectors (Kaldi), ECAPA-TDNN (pyannote.audio)
- Trade-off: x-vectors faster on CPU, ECAPA more accurate
- Extract embeddings every 1-2 seconds with overlap
3. Clustering:
- Agglomerative hierarchical clustering (standard)
- Challenge: Unknown number of speakers
- Solution: Use PLDA scoring + threshold tuning
4. Optimization for Scale (100K calls/day):
- Batch processing: Group calls, process in parallel
- Model size: Use smaller embedding model if accuracy permits
- Caching: For known speakers (agents), cache embeddings
- Infrastructure: CPU-based pipeline (cheaper at scale)
Typical Performance:
- DER: 5-10% on call center audio (good)
- Latency: 0.1-0.3x real-time on CPU
- Cost: ~$0.001/minute (vs $0.02+ for cloud APIs)
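The clustering step can be sketched as a greedy agglomerative pass over cosine similarities between cluster centroids. This is an illustrative simplification (no PLDA scoring), and the 0.5 distance threshold is an assumed value you'd tune on held-out data; the speaker count falls out of the threshold rather than being fixed up front:

```python
import numpy as np

def cluster_embeddings(embeddings, threshold=0.5):
    """Greedy agglomerative clustering: repeatedly merge the closest pair of
    clusters (cosine distance between centroids) until no pair is closer than
    `threshold`. Returns one cluster label per embedding."""
    X = np.asarray(embeddings, dtype=float)
    clusters = [[i] for i in range(len(X))]

    def centroid(members):
        v = X[members].mean(axis=0)
        return v / (np.linalg.norm(v) + 1e-10)  # unit-normalize for cosine

    while len(clusters) > 1:
        cents = np.stack([centroid(c) for c in clusters])
        sims = cents @ cents.T                  # pairwise cosine similarity
        np.fill_diagonal(sims, -np.inf)
        i, j = np.unravel_index(np.argmax(sims), sims.shape)
        if 1.0 - sims[i, j] > threshold:        # closest pair still too far apart
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(X)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels
```

With two well-separated groups of toy embeddings, the function discovers two clusters without being told the speaker count.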
Problem: Traditional diarization assumes one speaker at a time. Real conversations have overlaps (interruptions, backchannel responses).
Solutions:
1. EEND (End-to-End Neural Diarization):
- Trained to output multiple speakers per frame
- Can detect overlaps directly
- Cons: Requires lots of training data, fixed max speakers
2. Post-processing detection:
- After initial diarization, detect potential overlap regions
- Look for high energy in "silence" between speakers
- Re-analyze those segments with overlap-aware models
3. Multi-channel audio:
- If you have separate microphones, use beamforming
- Separate sources before diarization
Practical approach for call centers:
- Accept 2-5% error rate from overlaps (usually acceptable)
- Focus on clean turn boundaries (90% of speech)
- Flag overlap regions for manual review if critical
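The post-processing idea in option 2 starts with locating candidate overlap regions. A minimal sketch (a hypothetical helper, not from any specific library) that flags time ranges where segments from different speakers intersect:

```python
def find_overlaps(segments):
    """Given [(start, end, speaker), ...], return time ranges where segments
    from different speakers intersect. These are the regions to re-analyze
    with an overlap-aware model or flag for manual review."""
    overlaps = []
    ordered = sorted(segments, key=lambda s: s[0])
    for i, (s1, e1, spk1) in enumerate(ordered):
        for s2, e2, spk2 in ordered[i + 1:]:
            if s2 >= e1:           # later segments start after this one ends
                break
            if spk1 != spk2:
                overlaps.append((max(s1, s2), min(e1, e2)))
    return overlaps
```

For example, `find_overlaps([(0, 5, 'A'), (4, 8, 'B'), (8, 10, 'A')])` flags the interval (4, 5) where the two speakers talked over each other.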
NLP & Text Analytics
Approach 1: Rule-Based (Fast, Explainable):
- Look for patterns: "I'll [verb]", "Let's schedule", "Action item:", "TODO:"
- Extract entities: dates, times, people, deliverables
- Works well for structured calls with consistent language
Approach 2: NER + Dependency Parsing:
- Train NER model to tag: ACTION, ASSIGNEE, DEADLINE
- Use dependency parsing to link entities
- More robust to variation than regex
Approach 3: LLM-Based (2026 Standard):
- Use GPT-4 or Claude with prompt engineering
- Provide few-shot examples of action items
- Ask for structured JSON output
Prompt: "Extract action items from this sales call.
         Return JSON with: {task, assignee, deadline, priority}"
Transcript: [...]
Output: [
  {"task": "Send pricing proposal", "assignee": "Sarah",
   "deadline": "2026-01-20", "priority": "high"}
]
Production considerations:
- LLMs are expensive ($0.01-0.10/call)
- Hybrid: Use rules for 80% of cases, LLM for complex ones
- Always show confidence scores to users
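Approach 1 can be sketched with a handful of regex patterns. The patterns below are illustrative, not exhaustive; a real system would add many more plus entity extraction for assignees and deadlines:

```python
import re

# Illustrative commitment patterns only; a production rule set would be
# far larger and tuned to the call domain.
ACTION_PATTERNS = [
    r"\bI'll\s+(\w+(?:\s+\w+){0,5})",
    r"\blet's\s+schedule\s+(\w+(?:\s+\w+){0,5})",
    r"\baction item:\s*(.+)",
    r"\btodo:\s*(.+)",
]

def extract_action_items(transcript):
    """Return candidate action-item phrases found line by line."""
    items = []
    for line in transcript.splitlines():
        for pattern in ACTION_PATTERNS:
            for match in re.finditer(pattern, line, flags=re.IGNORECASE):
                items.append(match.group(1).strip())
    return items
```

This is fast and explainable, which is exactly why the hybrid setup routes the easy 80% of cases here before spending LLM budget on the rest.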
Step 1: Preprocessing
- Transcribe calls (ASR)
- Clean: remove filler words, agent scripts (boilerplate)
- Focus on customer utterances (more signal)
Step 2: Topic Modeling Approaches
Traditional: LDA (Latent Dirichlet Allocation)
- Pros: Interpretable, fast, proven
- Cons: Requires manual topic count tuning
- Best for: Stable domains, periodic analysis
Modern: BERTopic
- Uses BERT embeddings + UMAP + HDBSCAN
- Automatically determines topic count
- Better coherence than LDA
- Best for: Dynamic domains, one-time analysis
Step 3: Making It Actionable
Bad outcome: "Topic 7 has words: refund, policy, return, unhappy"
Actionable outcome: "Refund Policy Confusion (23% of calls, up 8% vs last month)"
How to get there:
- Label topics meaningfully: Use LLM to generate human-readable topic names
- Track over time: Topic prevalence trends (which issues growing?)
- Link to metrics: CSAT, resolution time by topic
- Alert on anomalies: "Shipping delays mentions up 50% today"
- Sample calls per topic: Let managers listen to examples
System Design:
- Batch processing: Daily overnight job
- Incremental updates: New calls assigned to existing topics
- Quarterly re-training: Discover new emerging topics
- Dashboard: Show top topics, trends, drill-down to calls
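The "alert on anomalies" idea can be sketched as a relative-change check on topic prevalence between two periods. The 50% growth threshold and minimum-volume cutoff below are assumed values you'd tune:

```python
def topic_alerts(counts_today, counts_baseline, threshold=0.5, min_calls=20):
    """Flag topics whose share of calls grew by more than `threshold`
    (relative change) versus a baseline period."""
    total_today = sum(counts_today.values())
    total_base = sum(counts_baseline.values())
    alerts = []
    for topic, count in counts_today.items():
        if count < min_calls:
            continue                      # ignore noisy low-volume topics
        share_today = count / total_today
        share_base = counts_baseline.get(topic, 0) / total_base if total_base else 0
        if share_base == 0:
            alerts.append((topic, float('inf')))   # brand-new topic
        elif (share_today - share_base) / share_base > threshold:
            alerts.append((topic, (share_today - share_base) / share_base))
    return alerts
```

Comparing shares rather than raw counts keeps the alert robust to day-to-day swings in overall call volume.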
Sentiment Analysis
Challenges unique to speech:
1. ASR Errors Impact Sentiment:
- "I'm not happy" → "I'm happy" (transcription error completely flips sentiment)
- Misspelled sentiment words ("grate" vs "great")
2. Missing Prosody (Tone):
- "That's great." (flat tone = sarcasm, missed in text)
- Excitement vs anger: same words, different meaning
- Solution: Use acoustic features (pitch, energy) alongside text
3. Disfluencies:
- "Um, well, I guess it's, uh, okay maybe?" is negative despite "okay"
- Hesitation patterns indicate uncertainty/dissatisfaction
4. Context Dependency:
- "I'll have to think about it" (rejection in sales context)
- Requires understanding of speaker role (agent vs customer)
Best Practices:
- Multi-modal: Combine text (BERT sentiment) + audio (prosody features)
- Speaker-aware: Analyze customer sentiment separately from agent
- Utterance-level: Don't average sentiment over full call
- Confidence scores: Flag uncertain predictions for review
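As a cheap text-side proxy for the disfluency signal, a filler-word ratio can be computed per utterance and fed to the sentiment model as an extra uncertainty feature. The filler list here is illustrative, and words like "well" or "maybe" will produce some false positives:

```python
# Illustrative filler/hedge lexicon; tune per domain.
FILLERS = {"um", "uh", "er", "hmm", "well", "like", "maybe", "guess"}

def hesitation_score(utterance):
    """Fraction of tokens that are filler/hedge words; higher values suggest
    uncertainty even when the literal words sound positive."""
    tokens = utterance.lower().replace(",", " ").replace("?", " ").split()
    if not tokens:
        return 0.0
    fillers = sum(1 for t in tokens if t in FILLERS)
    return fillers / len(tokens)
```

On the example above, "Um, well, I guess it's, uh, okay maybe?" scores far higher than a confident "This is great", which is exactly the signal a text-only sentiment model misses.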
Requirements:
- Low latency: Sentiment visible within 5-10 seconds
- Accuracy: Good enough for intervention decisions
- Scale: 1000 concurrent calls
Architecture:
Call Audio Stream
        ↓
Streaming ASR (Kaldi/Conformer-RNN-T)
        ↓
Utterance Buffer (5-10s windows)
        ↓
Sentiment Model (DistilBERT fine-tuned)
        ↓
WebSocket → Dashboard
        ↓
Alert System (if negative sentiment detected)
Key Design Decisions:
1. ASR Choice:
- Must be streaming (Whisper won't work)
- Options: Kaldi, Conformer-RNN-T, Deepgram API
- Trade-off: Build vs buy (Kaldi = cheaper at scale, API = faster to market)
2. Sentiment Model:
- DistilBERT (~60% faster than BERT, retains ~97% of its accuracy)
- Fine-tuned on call center data (critical!)
- Inference: <50ms on CPU
3. Windowing Strategy:
- Analyze 5-10 second chunks (balance latency vs context)
- Moving average: Smooth out noise
- Spike detection: Alert when sentiment drops suddenly
4. Alert Logic:
- Escalate if: Sustained negative sentiment (>30s) OR sudden drop
- Provide context: "Customer said 'this is unacceptable' at 2:34"
- Suggested action: "Offer supervisor escalation"
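The windowing and alert logic can be sketched as a short rolling buffer with two triggers. The window length, thresholds, and sentiment scores in [-1, 1] are all assumptions to tune:

```python
from collections import deque

class SentimentAlerter:
    """Track per-window sentiment scores in [-1, 1] and fire on either
    sustained negativity or a sudden drop."""

    def __init__(self, window_s=10, sustained_s=30,
                 neg_threshold=-0.3, drop_threshold=0.6):
        self.windows_needed = sustained_s // window_s  # consecutive windows
        self.neg_threshold = neg_threshold
        self.drop_threshold = drop_threshold
        self.recent = deque(maxlen=self.windows_needed)
        self.prev = None

    def update(self, score):
        """Feed one window's average sentiment; return an alert reason or None."""
        alert = None
        if self.prev is not None and self.prev - score >= self.drop_threshold:
            alert = "sudden_drop"
        self.recent.append(score)
        if (len(self.recent) == self.windows_needed
                and all(s <= self.neg_threshold for s in self.recent)):
            alert = alert or "sustained_negative"
        self.prev = score
        return alert
```

Keeping the state per call is tiny (a few floats), which is what makes 1000 concurrent calls feasible on modest hardware; in the architecture above this state would live in Redis.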
Infrastructure:
- ASR: GPU instances (or streaming API)
- Sentiment: CPU inference (batched)
- WebSockets: Real-time dashboard updates
- Redis: Store current call states
System Design
High-Level Architecture:
Audio Ingestion → Processing Pipeline → Analytics → Storage → API/UI
       ↓                  ↓                 ↓          ↓         ↓
   Zoom/Meet       ASR+Diarization     NLP Models      DB     Dashboard
Detailed Components:
1. Audio Ingestion:
- Integrations: Zoom, Google Meet, Webex APIs
- Recording capture: Auto-join meetings as bot
- Queue: Kafka for async processing
- Storage: S3 for raw audio (lifecycle policy: delete after 90 days)
2. Processing Pipeline (Airflow/Temporal):
- Task 1: Audio preprocessing (format conversion, enhancement)
- Task 2: ASR (Whisper on GPU)
- Task 3: Speaker diarization (pyannote.audio)
- Task 4: Merge ASR + diarization outputs
- Task 5: NLP analytics (parallel jobs)
3. NLP Analytics (Parallel Processing):
- Sentiment analysis per speaker
- Topic extraction
- Action items detection
- Question identification
- Talk-time ratios, interruptions, speaking rate
- Keyword/competitor mentions
4. Storage Layer:
- PostgreSQL: Structured data (users, meetings, metadata)
- Elasticsearch: Full-text search on transcripts
- Vector DB (Pinecone/Weaviate): Semantic search ("find calls about pricing objections")
- Redis: Caching, session data
5. API Layer (FastAPI):
- REST API for CRUD operations
- GraphQL for complex queries
- WebSockets for real-time features
- Rate limiting, authentication (OAuth)
6. Frontend (React):
- Dashboard: Call library, analytics charts
- Call player: Transcript + audio synchronized
- Search: Semantic + keyword search
- Insights: AI-generated summaries, coaching tips
Scale Considerations:
- Processing: 10K meetings/day = ~20K hours audio/month
- ASR cost: Whisper at $0.02/min = $24K/month (optimize with batching)
- Storage: 1 hour audio β 50 MB, transcript β 100 KB
- Compute: Auto-scaling GPU pools for ASR, CPU for NLP
Key Optimizations:
- Batch process non-urgent calls (overnight)
- Priority queue for "urgent" calls (live meetings)
- Caching: Pre-compute common analytics views
- CDN: Serve audio files from edge locations
ML Theory
Dataset Preparation:
- Collect 5K-10K labeled call transcripts (customer utterances)
- Labels: Positive, Neutral, Negative (or 5-point scale)
- Balance classes (over/undersample if needed)
- Split: 80% train, 10% validation, 10% test
Preprocessing:
- Tokenize with BERT tokenizer (WordPiece)
- Max length: 512 tokens (truncate longer utterances)
- Handle ASR errors: Keep as-is (model learns robustness)
Model Architecture:
- Base: `bert-base-uncased` (110M params)
- Add classification head: Linear(768 → 3 classes)
- Alternative: Use DistilBERT (~60% faster, retains ~97% of BERT's accuracy)
Training:
# Fine-tune with the Hugging Face transformers library
from torch.optim import AdamW
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=3
)

# Typical fine-tuning hyperparameters
learning_rate = 2e-5  # small LR to avoid catastrophic forgetting
epochs = 3            # 3-5 epochs is usually enough
batch_size = 16
optimizer = AdamW(model.parameters(), lr=learning_rate)
Evaluation:
- Metrics: Accuracy, F1 (macro), confusion matrix
- Error analysis: Which sentiment mistakes are critical?
- Calibration: Check confidence scores align with accuracy
Deployment:
- Export to ONNX for faster inference
- Inference: <100ms on CPU for single utterance
- Monitor: Track accuracy on production data, retrain quarterly
Coding Challenges
def calculate_talk_time(diarization_output):
"""
Calculate talk-time percentage per speaker.
Args:
diarization_output: List of tuples [(start, end, speaker), ...]
e.g., [(0.0, 5.2, 'SPEAKER_00'), (5.2, 12.8, 'SPEAKER_01'), ...]
Returns:
Dict of {speaker: talk_time_percentage}
"""
speaker_durations = {}
total_duration = 0
for start, end, speaker in diarization_output:
duration = end - start
speaker_durations[speaker] = speaker_durations.get(speaker, 0) + duration
total_duration += duration
# Calculate percentages
talk_time_pct = {
speaker: (duration / total_duration) * 100
for speaker, duration in speaker_durations.items()
}
return talk_time_pct
# Example
diarization = [
(0.0, 5.2, 'SPEAKER_00'),
(5.2, 12.8, 'SPEAKER_01'),
(12.8, 18.5, 'SPEAKER_00'),
(18.5, 25.0, 'SPEAKER_01')
]
result = calculate_talk_time(diarization)
# {'SPEAKER_00': 43.6, 'SPEAKER_01': 56.4}
Follow-up questions interviewers might ask:
- How would you handle overlapping speech? (Add overlap handling logic)
- What if there are gaps (silence)? (Track silence separately)
- How to optimize for large datasets? (Use NumPy for vectorization)
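For the NumPy follow-up, one possible vectorized version replaces the Python loop with array operations (it assumes parallel start/end/speaker arrays rather than the tuple list):

```python
import numpy as np

def calculate_talk_time_np(starts, ends, speakers):
    """Vectorized talk-time percentages: all durations in one subtraction,
    per-speaker sums via np.add.at on integer speaker codes."""
    starts = np.asarray(starts, dtype=float)
    ends = np.asarray(ends, dtype=float)
    durations = ends - starts
    # Map speaker labels to integer codes, then scatter-add durations
    uniq, codes = np.unique(speakers, return_inverse=True)
    sums = np.zeros(len(uniq))
    np.add.at(sums, codes, durations)
    pct = sums / durations.sum() * 100
    return dict(zip(uniq.tolist(), pct.tolist()))
```

On millions of segments this avoids per-tuple Python overhead entirely; the result matches the loop-based version above.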
Behavioral Questions
Situation: "At [Company], we were deploying real-time sentiment analysis for live customer calls. Initial model was BERT-large with 95% accuracy but 300ms latency."
Task: "Product required <100ms latency for real-time agent coaching. I needed to reduce latency by 3x without sacrificing too much accuracy."
Action:
- "Benchmarked alternatives: DistilBERT, TinyBERT, lightweight CNNs"
- "DistilBERT gave 92% accuracy at 50ms (6x faster)"
- "Quantized to INT8, reducing latency to 35ms"
- "Implemented confidence thresholding: only show predictions >0.85 confidence"
- "A/B tested with customer success team"
Result: "Deployed DistilBERT-INT8. Achieved 40ms latency, 91% accuracy on production data. CS team reported 30% faster resolution times. Small accuracy drop (95% → 91%) was acceptable given 7x latency improvement."
Why this answer works: Shows quantitative thinking, practical trade-offs, validation with stakeholders.
For technical questions: Always explain your reasoning. Interviewers care more about how you think than memorized answers. Start with clarifying questions, state assumptions, then walk through your approach systematically.
Company-Specific Focus Areas
Gong interviews emphasize:
- System design for scale (millions of calls)
- Real-time analytics challenges
- Product thinking (what insights matter to sales teams?)
Chorus/CallMiner focus on:
- ASR accuracy improvements
- Contact center domain knowledge
- Compliance and quality monitoring
Otter/Fireflies ask about:
- Consumer product thinking
- Whisper optimization and fine-tuning
- Cross-platform integration (Zoom, Meet, Teams)
Preparation Checklist
- [ ] Review ASR basics (WER, model types, evaluation)
- [ ] Understand speaker diarization deeply (pyannote.audio docs)
- [ ] Practice NLP tasks (sentiment, NER, topic modeling)
- [ ] Design 2-3 speech analytics systems from scratch
- [ ] Code implementations of common tasks (talk-time, keyword extraction)
- [ ] Prepare behavioral stories using STAR framework
- [ ] Research the specific company's tech stack and product