# Stereo Call Transcription with Diarization
When you record calls in stereo format—with each speaker on a separate audio channel—you can achieve 100% accurate speaker identification. This guide shows you how to transcribe stereo recordings, identify speakers automatically, and extract structured insights from your calls.
## Why Stereo Diarization?
| Aspect | Standard Diarization | Stereo Diarization |
|---|---|---|
| Speaker identification | AI-based detection | Channel-based (L/R) |
| Accuracy | Good | Perfect (100%) |
| Performance | Normal | Faster |
| Speaker labels | SPEAKER 1, SPEAKER 2... | SPEAKER_L, SPEAKER_R |
| Best for | Mono audio, meetings | Call center recordings |
### Ideal for Call Centers
Most PBX systems (FreeSWITCH, Asterisk) can record calls in stereo with each party on a separate channel. This eliminates any guesswork in speaker identification.
## Prerequisites
- Stereo audio file: MP3, WAV, or other supported format with speakers on separate channels
- SipPulse AI API key: get yours at sippulse.ai
- Pro model access: `pulse-precision-pro`
## Stereo Diarization Model
The `pulse-precision-pro` model supports both stereo and mono diarization:

| Model | Speed | Accuracy | Best For |
|---|---|---|---|
| `pulse-precision-pro` | Optimal | Highest | Quality-critical stereo call transcriptions |
### Pro Model Features
The `pulse-precision-pro` model includes advanced features:
- Stereo diarization: 100% accurate channel-based speaker identification
- VAD preset: use `vad_preset=telephony` for optimized 8kHz narrow-band audio
- Highest accuracy: best Word Error Rate (WER) for call center analytics
## Step 1: Prepare Your Audio
For stereo diarization to work correctly, your audio must have:
- Left channel (L): One speaker (e.g., the customer)
- Right channel (R): Other speaker (e.g., the agent)
### Recording Configuration
Most PBX systems support stereo recording:
- FreeSWITCH: use `RECORD_STEREO=true` in your dialplan
- Asterisk: configure `MixMonitor` with the `D` option for stereo
### Channel Consistency
Ensure consistent channel assignment across recordings. Document whether customers are always on the left or right channel for accurate analysis.
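As a quick pre-flight check, you can verify locally that a recording really has two channels before uploading it. Below is a minimal sketch using only Python's standard `wave` module, so it covers WAV files only; the helper name is ours:

```python
import wave

def assert_stereo(path: str) -> None:
    """Raise if the WAV recording does not have exactly two channels."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
    if channels != 2:
        raise ValueError(
            f"{path} has {channels} channel(s); stereo diarization needs 2"
        )
```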
## Step 2: Transcribe with Stereo Diarization
Use the `/v1/asr/transcribe` endpoint with `response_format=stereo_diarization` and one of the Pro models.
```bash
curl -X POST 'https://api.sippulse.ai/v1/asr/transcribe' \
  -H "api-key: $SIPPULSE_API_KEY" \
  -F 'file=@call-recording.mp3' \
  -F 'model=pulse-precision-pro' \
  -F 'response_format=stereo_diarization' \
  -F 'language=en' \
  -F 'vad_preset=telephony'
```

```typescript
import fs from "fs";
import path from "path";

async function transcribeStereoCall(
  filePath: string
): Promise<StereoTranscription> {
  // Node 18+ ships a native FormData/Blob that the built-in fetch accepts.
  const form = new FormData();
  form.append(
    "file",
    new Blob([fs.readFileSync(filePath)]),
    path.basename(filePath)
  );
  form.append("model", "pulse-precision-pro");
  form.append("response_format", "stereo_diarization");
  form.append("language", "en");
  form.append("vad_preset", "telephony"); // Optimized for phone calls

  const response = await fetch("https://api.sippulse.ai/v1/asr/transcribe", {
    method: "POST",
    headers: {
      "api-key": process.env.SIPPULSE_API_KEY!,
    },
    body: form,
  });

  if (!response.ok) {
    throw new Error(`API error: ${response.status}`);
  }

  return response.json();
}

interface StereoTranscription {
  text: string;
  segments: Array<{
    speaker: "SPEAKER_L" | "SPEAKER_R";
    text: string;
    initial_time: number;
    end_time: number;
  }>;
  words: Array<{
    word: string;
    speaker: "SPEAKER_L" | "SPEAKER_R";
    start: number;
    end: number;
  }>;
}
```

```python
import os
import requests
def transcribe_stereo_call(file_path: str) -> dict:
    """
    Transcribe a stereo call recording with speaker diarization.

    Args:
        file_path: Path to the stereo audio file

    Returns:
        Transcription with speaker-labeled segments and words
    """
    with open(file_path, "rb") as audio_file:
        response = requests.post(
            "https://api.sippulse.ai/v1/asr/transcribe",
            headers={"api-key": os.getenv("SIPPULSE_API_KEY")},
            files={"file": audio_file},
            data={
                "model": "pulse-precision-pro",
                "response_format": "stereo_diarization",
                "language": "en",
                "vad_preset": "telephony",  # Optimized for phone calls
            },
        )
    response.raise_for_status()
    return response.json()
```

## Step 3: Understand the Response
The stereo diarization response includes three main components:
### Response Structure

```json
{
  "text": "00:02-00:05 | SPEAKER L:\nHello, how can I help you today?\n\n00:05-00:08 | SPEAKER R:\nHi, I'm calling about my account...",
  "segments": [
    {
      "speaker": "SPEAKER_L",
      "text": "Hello, how can I help you today?",
      "initial_time": 2.1,
      "end_time": 5.3
    },
    {
      "speaker": "SPEAKER_R",
      "text": "Hi, I'm calling about my account...",
      "initial_time": 5.5,
      "end_time": 8.2
    }
  ],
  "words": [
    {
      "word": "Hello,",
      "speaker": "SPEAKER_L",
      "start": 2.1,
      "end": 2.5
    },
    {
      "word": "how",
      "speaker": "SPEAKER_L",
      "start": 2.5,
      "end": 2.7
    }
  ]
}
```

### Key Fields
| Field | Description |
|---|---|
| `text` | Formatted transcript with timestamps and speaker labels |
| `segments` | Array of speech segments with speaker, text, and timing |
| `words` | Word-level timestamps with speaker attribution |
| `speaker` | `SPEAKER_L` (left channel) or `SPEAKER_R` (right channel) |
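Because the channel determines the speaker, mapping the raw labels to business roles is a simple lookup in your application. Below is a minimal sketch with a hypothetical helper, assuming the agent is always on the left channel (swap the mapping if your PBX assigns channels the other way):

```python
# Assumed channel layout: agent on the left, customer on the right.
ROLE_BY_SPEAKER = {"SPEAKER_L": "Agent", "SPEAKER_R": "Customer"}

def format_transcript(transcription: dict) -> str:
    """Render segments as 'start-end Role: text' lines."""
    lines = []
    for segment in transcription["segments"]:
        role = ROLE_BY_SPEAKER.get(segment["speaker"], segment["speaker"])
        lines.append(
            f"{segment['initial_time']:.1f}-{segment['end_time']:.1f}s "
            f"{role}: {segment['text']}"
        )
    return "\n".join(lines)
```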
## Step 4: Analyze with Structured Analysis
After transcription, use Structured Analysis to extract insights from the conversation.
### Setting Up Your Analysis
1. Navigate to Structured Analysis in the SipPulse AI dashboard
2. Create a new analysis or select an existing preset like "Conversation Analysis"
3. Configure your schema with the fields you want to extract
4. Copy the Analysis ID using the copy button next to the analysis name
The Conversation Analysis template extracts:
- Total questions and response rate
- Whether the call achieved its goal (sale, resolution, etc.)
- Customer interest level (0-1)
- Main objections and if they were resolved
- Service tone and empathy level
- Overall success score (0-1)
- Recommendations for improvement
### Execute Analysis via API
Use the copied Analysis ID to execute the analysis programmatically:
```typescript
interface ConversationAnalysis {
  total_questions: number;
  questions_answered: string[];
  response_rate: number;
  sale_completed: boolean;
  client_interest: number;
  main_objections: string[];
  objections_resolved: boolean;
  service_tone: string;
  empathy_level: number;
  overall_score: number;
  recommendations: string[];
  next_steps: string;
}

async function analyzeConversation(
  analysisId: string,
  transcriptionText: string
): Promise<ConversationAnalysis> {
  const response = await fetch(
    `https://api.sippulse.ai/v1/structured-analyses/${analysisId}/execute`,
    {
      method: "POST",
      headers: {
        "api-key": process.env.SIPPULSE_API_KEY!,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        content: transcriptionText,
      }),
    }
  );

  if (!response.ok) {
    throw new Error(`API error: ${response.status}`);
  }

  const result = await response.json();
  return result.content;
}
```

```python
import os
import requests
from typing import TypedDict
class ConversationAnalysis(TypedDict):
    total_questions: int
    questions_answered: list[str]
    response_rate: float
    sale_completed: bool
    client_interest: float
    main_objections: list[str]
    objections_resolved: bool
    service_tone: str
    empathy_level: float
    overall_score: float
    recommendations: list[str]
    next_steps: str

def analyze_conversation(
    analysis_id: str,
    transcription_text: str,
) -> ConversationAnalysis:
    """
    Analyze a call transcription using Structured Analysis.

    Args:
        analysis_id: ID of the Conversation Analysis preset
        transcription_text: The formatted transcription text

    Returns:
        Structured analysis results
    """
    response = requests.post(
        f"https://api.sippulse.ai/v1/structured-analyses/{analysis_id}/execute",
        headers={
            "api-key": os.getenv("SIPPULSE_API_KEY"),
            "Content-Type": "application/json",
        },
        json={"content": transcription_text},
    )
    response.raise_for_status()
    return response.json()["content"]
```

## Complete Example: End-to-End Pipeline
Here's a complete example that transcribes a stereo call and analyzes it:
```typescript
import fs from "fs";
import path from "path";

async function processCallRecording(audioPath: string, analysisId: string) {
  // Step 1: Transcribe with stereo diarization
  console.log("Transcribing audio...");
  const form = new FormData(); // Native FormData/Blob (Node 18+)
  form.append(
    "file",
    new Blob([fs.readFileSync(audioPath)]),
    path.basename(audioPath)
  );
  form.append("model", "pulse-precision-pro");
  form.append("response_format", "stereo_diarization");
  form.append("language", "en");
  form.append("vad_preset", "telephony");

  const transcribeResponse = await fetch(
    "https://api.sippulse.ai/v1/asr/transcribe",
    {
      method: "POST",
      headers: { "api-key": process.env.SIPPULSE_API_KEY! },
      body: form,
    }
  );
  if (!transcribeResponse.ok) {
    throw new Error(`Transcription failed: ${transcribeResponse.status}`);
  }
  const transcription = await transcribeResponse.json();
  console.log(`Transcribed ${transcription.segments.length} segments`);

  // Step 2: Analyze the conversation
  console.log("Analyzing conversation...");
  const analyzeResponse = await fetch(
    `https://api.sippulse.ai/v1/structured-analyses/${analysisId}/execute`,
    {
      method: "POST",
      headers: {
        "api-key": process.env.SIPPULSE_API_KEY!,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ content: transcription.text }),
    }
  );
  if (!analyzeResponse.ok) {
    throw new Error(`Analysis failed: ${analyzeResponse.status}`);
  }
  const analysis = await analyzeResponse.json();

  // Step 3: Return combined results
  return {
    transcription: {
      text: transcription.text,
      segments: transcription.segments,
      speakerCount: 2,
    },
    analysis: analysis.content,
  };
}

// Usage - get analysisId from the dashboard by clicking the copy button
const result = await processCallRecording(
  "./recordings/support-call.mp3",
  "sa_abc123def456" // Your Analysis ID from the dashboard
);
console.log("Overall Score:", result.analysis.overall_score);
console.log("Customer Interest:", result.analysis.client_interest);
console.log("Recommendations:", result.analysis.recommendations);
```

```python
import os
import requests
def process_call_recording(audio_path: str, analysis_id: str) -> dict:
    """
    Complete pipeline: transcribe stereo call and analyze it.

    Args:
        audio_path: Path to the stereo audio file
        analysis_id: ID of the Conversation Analysis preset (copy from dashboard)

    Returns:
        Combined transcription and analysis results
    """
    api_key = os.getenv("SIPPULSE_API_KEY")

    # Step 1: Transcribe with stereo diarization
    print("Transcribing audio...")
    with open(audio_path, "rb") as audio_file:
        transcribe_response = requests.post(
            "https://api.sippulse.ai/v1/asr/transcribe",
            headers={"api-key": api_key},
            files={"file": audio_file},
            data={
                "model": "pulse-precision-pro",
                "response_format": "stereo_diarization",
                "language": "en",
                "vad_preset": "telephony",
            },
        )
    transcribe_response.raise_for_status()
    transcription = transcribe_response.json()
    print(f"Transcribed {len(transcription['segments'])} segments")

    # Step 2: Analyze the conversation
    print("Analyzing conversation...")
    analyze_response = requests.post(
        f"https://api.sippulse.ai/v1/structured-analyses/{analysis_id}/execute",
        headers={
            "api-key": api_key,
            "Content-Type": "application/json",
        },
        json={"content": transcription["text"]},
    )
    analyze_response.raise_for_status()
    analysis = analyze_response.json()

    # Step 3: Return combined results
    return {
        "transcription": {
            "text": transcription["text"],
            "segments": transcription["segments"],
            "speaker_count": 2,
        },
        "analysis": analysis["content"],
    }

if __name__ == "__main__":
    # Get analysis_id from the dashboard by clicking the copy button
    result = process_call_recording(
        "./recordings/support-call.mp3",
        "sa_abc123def456",  # Your Analysis ID from the dashboard
    )
    print(f"Overall Score: {result['analysis']['overall_score']}")
    print(f"Customer Interest: {result['analysis']['client_interest']}")
    print(f"Recommendations: {result['analysis']['recommendations']}")
```

## Example Output
Here's what the complete response looks like for a support call:
```json
{
  "transcription": {
    "text": "00:00-00:03 | SPEAKER L:\nThank you for calling TechSupport, my name is Sarah. How can I help you today?\n\n00:03-00:09 | SPEAKER R:\nHi Sarah, I'm having trouble logging into my account. It keeps saying my password is incorrect, but I'm sure I'm typing it right.\n\n00:09-00:15 | SPEAKER L:\nI'm sorry to hear that. Let me help you with that. Can I have your email address associated with the account?\n\n00:15-00:18 | SPEAKER R:\nSure, it's john.smith@email.com.\n\n00:18-00:25 | SPEAKER L:\nThank you, John. I can see your account here. It looks like there were several failed login attempts, so the account was temporarily locked for security.\n\n00:25-00:28 | SPEAKER R:\nOh, that explains it. How can I unlock it?\n\n00:28-00:38 | SPEAKER L:\nI can unlock it for you right now. I'll also send a password reset link to your email. You should receive it within the next few minutes. Is there anything else I can help you with?\n\n00:38-00:42 | SPEAKER R:\nNo, that's all I needed. Thank you so much for your help, Sarah!\n\n00:42-00:45 | SPEAKER L:\nYou're welcome, John! Have a great day!",
    "segments": [
      {
        "speaker": "SPEAKER_L",
        "text": "Thank you for calling TechSupport, my name is Sarah. How can I help you today?",
        "initial_time": 0.0,
        "end_time": 3.2
      },
      {
        "speaker": "SPEAKER_R",
        "text": "Hi Sarah, I'm having trouble logging into my account. It keeps saying my password is incorrect, but I'm sure I'm typing it right.",
        "initial_time": 3.5,
        "end_time": 9.1
      },
      {
        "speaker": "SPEAKER_L",
        "text": "I'm sorry to hear that. Let me help you with that. Can I have your email address associated with the account?",
        "initial_time": 9.4,
        "end_time": 15.0
      },
      {
        "speaker": "SPEAKER_R",
        "text": "Sure, it's john.smith@email.com.",
        "initial_time": 15.2,
        "end_time": 18.0
      },
      {
        "speaker": "SPEAKER_L",
        "text": "Thank you, John. I can see your account here. It looks like there were several failed login attempts, so the account was temporarily locked for security.",
        "initial_time": 18.3,
        "end_time": 25.5
      },
      {
        "speaker": "SPEAKER_R",
        "text": "Oh, that explains it. How can I unlock it?",
        "initial_time": 25.8,
        "end_time": 28.2
      },
      {
        "speaker": "SPEAKER_L",
        "text": "I can unlock it for you right now. I'll also send a password reset link to your email. You should receive it within the next few minutes. Is there anything else I can help you with?",
        "initial_time": 28.5,
        "end_time": 38.0
      },
      {
        "speaker": "SPEAKER_R",
        "text": "No, that's all I needed. Thank you so much for your help, Sarah!",
        "initial_time": 38.3,
        "end_time": 42.0
      },
      {
        "speaker": "SPEAKER_L",
        "text": "You're welcome, John! Have a great day!",
        "initial_time": 42.2,
        "end_time": 45.0
      }
    ],
    "speaker_count": 2
  },
  "analysis": {
    "total_questions": 3,
    "questions_answered": [
      "How can I help you today?",
      "Can I have your email address?",
      "Is there anything else I can help you with?"
    ],
    "response_rate": 1.0,
    "sale_completed": false,
    "client_interest": 0.75,
    "main_objections": [],
    "objections_resolved": true,
    "service_tone": "professional, friendly, and empathetic",
    "empathy_level": 0.9,
    "overall_score": 0.92,
    "recommendations": [
      "Consider proactively offering account security tips",
      "Could mention estimated time for password reset email"
    ],
    "next_steps": "Customer will receive password reset email and regain account access"
  }
}
```

## Best Practices
### Audio Quality
- Sample rate: 16kHz or higher for best results (8kHz telephony audio is also supported)
- Bit depth: 16-bit minimum
- Format: MP3 or WAV both work well
### Channel Assignment
- Be consistent: always assign the same party to the same channel
- Document it: note whether agents are on L or R in your configuration
- Map speakers: in your application, map `SPEAKER_L`/`SPEAKER_R` to meaningful labels (Agent/Customer)
### Choosing the Right Approach
| Scenario | Recommended Model | Response Format |
|---|---|---|
| Stereo call recordings | `pulse-precision-pro` | `stereo_diarization` |
| Mono recordings with multiple speakers | `pulse-precision-pro` | `diarization` |
### Performance Tips
- Use `vad_preset=telephony`: optimized for phone call audio characteristics
- Batch processing: for large volumes, process files in parallel (see the sketch after this list)
- Combine with anonymization: add `anonymize=true` to remove PII automatically
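For the batch-processing tip, here is a minimal sketch using the standard library's thread pool; it reuses the `process_call_recording` function from the complete example above, and the worker count is an assumption to adjust for your rate limits:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(
    paths: list[str], analysis_id: str, workers: int = 4
) -> list[dict]:
    """Transcribe and analyze several recordings concurrently.

    The work is I/O-bound (HTTP requests), so threads suffice; tune
    `workers` to stay within your account's rate limits.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(process_call_recording, path, analysis_id)
            for path in paths
        ]
        return [future.result() for future in futures]
```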
## Next Steps
- Structured Analysis - Create custom analysis schemas
- Speech-to-Text Models - Explore all STT options
- Advanced Call Analysis - Multi-step processing pipeline
- Request Tracking - Monitor API usage
