
Stereo Call Transcription with Diarization

When you record calls in stereo format—with each speaker on a separate audio channel—you can achieve 100% accurate speaker identification. This guide shows you how to transcribe stereo recordings, identify speakers automatically, and extract structured insights from your calls.

Why Stereo Diarization?

| Aspect | Standard Diarization | Stereo Diarization |
| --- | --- | --- |
| Speaker identification | AI-based detection | Channel-based (L/R) |
| Accuracy | Good | Perfect (100%) |
| Performance | Normal | Faster |
| Speaker labels | SPEAKER 1, SPEAKER 2... | SPEAKER_L, SPEAKER_R |
| Best for | Mono audio, meetings | Call center recordings |

Ideal for Call Centers

Most PBX systems (FreeSWITCH, Asterisk) can record calls in stereo with each party on a separate channel. This eliminates any guesswork in speaker identification.

Prerequisites

  • Stereo audio file: MP3, WAV, or other supported format with speakers on separate channels
  • SipPulse AI API key: Get yours at sippulse.ai
  • Pro model access: pulse-precision-pro

Stereo Diarization Model

The pulse-precision-pro model supports both stereo and mono diarization:

| Model | Speed | Accuracy | Best For |
| --- | --- | --- | --- |
| pulse-precision-pro | Optimal | Highest | Quality-critical stereo call transcriptions |

Pro Model Features

The pulse-precision-pro model includes advanced features:

  • Stereo diarization: 100% accurate channel-based speaker identification
  • VAD preset: Use vad_preset=telephony, optimized for 8kHz narrow-band telephone audio
  • Highest accuracy: Best Word Error Rate (WER) for call center analytics

Step 1: Prepare Your Audio

For stereo diarization to work correctly, your audio must have:

  • Left channel (L): One speaker (e.g., the customer)
  • Right channel (R): Other speaker (e.g., the agent)

Recording Configuration

Most PBX systems support stereo recording:

  • FreeSWITCH: Use RECORD_STEREO=true in your dialplan
  • Asterisk: Configure MixMonitor with the D option for stereo

Channel Consistency

Ensure consistent channel assignment across recordings. Document whether customers are always on the left or right channel for accurate analysis.
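
Before uploading, you can sanity-check that a recording actually contains two channels. Here is a minimal sketch using Python's standard-library wave module (it assumes a WAV file; compressed formats such as MP3 would need a decoder):

python
import wave

def assert_stereo(path: str) -> None:
    """Raise if the WAV file is not a two-channel recording."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        if channels != 2:
            raise ValueError(
                f"{path} has {channels} channel(s); stereo diarization needs 2"
            )

assert_stereo("./recordings/support-call.wav")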

Step 2: Transcribe with Stereo Diarization

Use the /v1/asr/transcribe endpoint with response_format=stereo_diarization and the pulse-precision-pro model.

bash
curl -X POST 'https://api.sippulse.ai/v1/asr/transcribe' \
  -H "api-key: $SIPPULSE_API_KEY" \
  -F 'file=@call-recording.mp3' \
  -F 'model=pulse-precision-pro' \
  -F 'response_format=stereo_diarization' \
  -F 'language=en' \
  -F 'vad_preset=telephony'
typescript
import fs from "fs";
import path from "path";

async function transcribeStereoCall(
  filePath: string
): Promise<StereoTranscription> {
  // Use the built-in FormData and Blob (Node 18+); the `form-data` npm
  // package is not compatible with the built-in fetch.
  const form = new FormData();
  form.append(
    "file",
    new Blob([fs.readFileSync(filePath)]),
    path.basename(filePath)
  );
  form.append("model", "pulse-precision-pro");
  form.append("response_format", "stereo_diarization");
  form.append("language", "en");
  form.append("vad_preset", "telephony"); // Optimized for phone calls

  const response = await fetch("https://api.sippulse.ai/v1/asr/transcribe", {
    method: "POST",
    headers: {
      "api-key": process.env.SIPPULSE_API_KEY!,
    },
    body: form,
  });

  if (!response.ok) {
    throw new Error(`API error: ${response.status}`);
  }

  return response.json();
}

interface StereoTranscription {
  text: string;
  segments: Array<{
    speaker: "SPEAKER_L" | "SPEAKER_R";
    text: string;
    initial_time: number;
    end_time: number;
  }>;
  words: Array<{
    word: string;
    speaker: "SPEAKER_L" | "SPEAKER_R";
    start: number;
    end: number;
  }>;
}
python
import os
import requests

def transcribe_stereo_call(file_path: str) -> dict:
    """
    Transcribe a stereo call recording with speaker diarization.

    Args:
        file_path: Path to the stereo audio file

    Returns:
        Transcription with speaker-labeled segments and words
    """
    with open(file_path, "rb") as audio_file:
        response = requests.post(
            "https://api.sippulse.ai/v1/asr/transcribe",
            headers={"api-key": os.getenv("SIPPULSE_API_KEY")},
            files={"file": audio_file},
            data={
                "model": "pulse-precision-pro",
                "response_format": "stereo_diarization",
                "language": "en",
                "vad_preset": "telephony",  # Optimized for phone calls
            },
        )

    response.raise_for_status()
    return response.json()

Step 3: Understand the Response

The stereo diarization response includes three main components:

Response Structure

json
{
  "text": "00:02-00:05 | SPEAKER L:\nHello, how can I help you today?\n\n00:05-00:08 | SPEAKER R:\nHi, I'm calling about my account...",
  "segments": [
    {
      "speaker": "SPEAKER_L",
      "text": "Hello, how can I help you today?",
      "initial_time": 2.1,
      "end_time": 5.3
    },
    {
      "speaker": "SPEAKER_R",
      "text": "Hi, I'm calling about my account...",
      "initial_time": 5.5,
      "end_time": 8.2
    }
  ],
  "words": [
    {
      "word": "Hello,",
      "speaker": "SPEAKER_L",
      "start": 2.1,
      "end": 2.5
    },
    {
      "word": "how",
      "speaker": "SPEAKER_L",
      "start": 2.5,
      "end": 2.7
    }
  ]
}

Key Fields

| Field | Description |
| --- | --- |
| text | Formatted transcript with timestamps and speaker labels |
| segments | Array of speech segments with speaker, text, and timing |
| words | Word-level timestamps with speaker attribution |
| speaker | SPEAKER_L (left channel) or SPEAKER_R (right channel) |
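
In your application, you will typically map the channel labels to business roles. A small sketch (it assumes agents are always recorded on the left channel; adjust the mapping to match your PBX configuration):

python
# Map channel labels to business roles. The L = agent assignment is an
# assumption; use whatever convention your PBX enforces.
ROLE_BY_SPEAKER = {"SPEAKER_L": "Agent", "SPEAKER_R": "Customer"}

def print_transcript(transcription: dict) -> None:
    """Print a readable, role-labeled transcript from the API response."""
    for segment in transcription["segments"]:
        role = ROLE_BY_SPEAKER.get(segment["speaker"], segment["speaker"])
        print(f"[{segment['initial_time']:6.1f}s-{segment['end_time']:6.1f}s] "
              f"{role}: {segment['text']}")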

Step 4: Analyze with Structured Analysis

After transcription, use Structured Analysis to extract insights from the conversation.

Setting Up Your Analysis

  1. Navigate to Structured Analysis in the SipPulse AI dashboard
  2. Create a new analysis or select an existing preset like "Conversation Analysis"
  3. Configure your schema with the fields you want to extract
  4. Copy the Analysis ID using the copy button next to the analysis name

The Conversation Analysis template extracts:

  • Total questions and response rate
  • Whether the call achieved its goal (sale, resolution, etc.)
  • Customer interest level (0-1)
  • Main objections and if they were resolved
  • Service tone and empathy level
  • Overall success score (0-1)
  • Recommendations for improvement

Execute Analysis via API

Use the copied Analysis ID to execute the analysis programmatically:

typescript
interface ConversationAnalysis {
  total_questions: number;
  questions_answered: string[];
  response_rate: number;
  sale_completed: boolean;
  client_interest: number;
  main_objections: string[];
  objections_resolved: boolean;
  service_tone: string;
  empathy_level: number;
  overall_score: number;
  recommendations: string[];
  next_steps: string;
}

async function analyzeConversation(
  analysisId: string,
  transcriptionText: string
): Promise<ConversationAnalysis> {
  const response = await fetch(
    `https://api.sippulse.ai/v1/structured-analyses/${analysisId}/execute`,
    {
      method: "POST",
      headers: {
        "api-key": process.env.SIPPULSE_API_KEY!,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        content: transcriptionText,
      }),
    }
  );

  if (!response.ok) {
    throw new Error(`API error: ${response.status}`);
  }

  const result = await response.json();
  return result.content;
}
python
import os
import requests
from typing import TypedDict

class ConversationAnalysis(TypedDict):
    total_questions: int
    questions_answered: list[str]
    response_rate: float
    sale_completed: bool
    client_interest: float
    main_objections: list[str]
    objections_resolved: bool
    service_tone: str
    empathy_level: float
    overall_score: float
    recommendations: list[str]
    next_steps: str

def analyze_conversation(
    analysis_id: str,
    transcription_text: str
) -> ConversationAnalysis:
    """
    Analyze a call transcription using Structured Analysis.

    Args:
        analysis_id: ID of the Conversation Analysis preset
        transcription_text: The formatted transcription text

    Returns:
        Structured analysis results
    """
    response = requests.post(
        f"https://api.sippulse.ai/v1/structured-analyses/{analysis_id}/execute",
        headers={
            "api-key": os.getenv("SIPPULSE_API_KEY"),
            "Content-Type": "application/json",
        },
        json={"content": transcription_text},
    )

    response.raise_for_status()
    return response.json()["content"]

Complete Example: End-to-End Pipeline

Here's a complete example that transcribes a stereo call and analyzes it:

typescript
import fs from "fs";
import path from "path";

async function processCallRecording(audioPath: string, analysisId: string) {
  // Step 1: Transcribe with stereo diarization
  console.log("Transcribing audio...");
  // Built-in FormData/Blob (Node 18+); the `form-data` npm package does not
  // work with the built-in fetch.
  const form = new FormData();
  form.append(
    "file",
    new Blob([fs.readFileSync(audioPath)]),
    path.basename(audioPath)
  );
  form.append("model", "pulse-precision-pro");
  form.append("response_format", "stereo_diarization");
  form.append("language", "en");
  form.append("vad_preset", "telephony");

  const transcribeResponse = await fetch(
    "https://api.sippulse.ai/v1/asr/transcribe",
    {
      method: "POST",
      headers: { "api-key": process.env.SIPPULSE_API_KEY! },
      body: form,
    }
  );

  if (!transcribeResponse.ok) {
    throw new Error(`Transcription error: ${transcribeResponse.status}`);
  }

  const transcription = await transcribeResponse.json();
  console.log(`Transcribed ${transcription.segments.length} segments`);

  // Step 2: Analyze the conversation
  console.log("Analyzing conversation...");
  const analyzeResponse = await fetch(
    `https://api.sippulse.ai/v1/structured-analyses/${analysisId}/execute`,
    {
      method: "POST",
      headers: {
        "api-key": process.env.SIPPULSE_API_KEY!,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ content: transcription.text }),
    }
  );

  if (!analyzeResponse.ok) {
    throw new Error(`Analysis error: ${analyzeResponse.status}`);
  }

  const analysis = await analyzeResponse.json();

  // Step 3: Return combined results
  return {
    transcription: {
      text: transcription.text,
      segments: transcription.segments,
      speakerCount: 2,
    },
    analysis: analysis.content,
  };
}

// Usage - get analysisId from the dashboard by clicking the copy button
const result = await processCallRecording(
  "./recordings/support-call.mp3",
  "sa_abc123def456" // Your Analysis ID from the dashboard
);

console.log("Overall Score:", result.analysis.overall_score);
console.log("Customer Interest:", result.analysis.client_interest);
console.log("Recommendations:", result.analysis.recommendations);
python
import os
import requests

def process_call_recording(audio_path: str, analysis_id: str) -> dict:
    """
    Complete pipeline: transcribe stereo call and analyze it.

    Args:
        audio_path: Path to the stereo audio file
        analysis_id: ID of the Conversation Analysis preset (copy from dashboard)

    Returns:
        Combined transcription and analysis results
    """
    api_key = os.getenv("SIPPULSE_API_KEY")

    # Step 1: Transcribe with stereo diarization
    print("Transcribing audio...")
    with open(audio_path, "rb") as audio_file:
        transcribe_response = requests.post(
            "https://api.sippulse.ai/v1/asr/transcribe",
            headers={"api-key": api_key},
            files={"file": audio_file},
            data={
                "model": "pulse-precision-pro",
                "response_format": "stereo_diarization",
                "language": "en",
                "vad_preset": "telephony",
            },
        )

    transcribe_response.raise_for_status()
    transcription = transcribe_response.json()
    print(f"Transcribed {len(transcription['segments'])} segments")

    # Step 2: Analyze the conversation
    print("Analyzing conversation...")
    analyze_response = requests.post(
        f"https://api.sippulse.ai/v1/structured-analyses/{analysis_id}/execute",
        headers={
            "api-key": api_key,
            "Content-Type": "application/json",
        },
        json={"content": transcription["text"]},
    )

    analyze_response.raise_for_status()
    analysis = analyze_response.json()

    # Step 3: Return combined results
    return {
        "transcription": {
            "text": transcription["text"],
            "segments": transcription["segments"],
            "speaker_count": 2,
        },
        "analysis": analysis["content"],
    }


if __name__ == "__main__":
    # Get analysis_id from the dashboard by clicking the copy button
    result = process_call_recording(
        "./recordings/support-call.mp3",
        "sa_abc123def456"  # Your Analysis ID from the dashboard
    )

    print(f"Overall Score: {result['analysis']['overall_score']}")
    print(f"Customer Interest: {result['analysis']['client_interest']}")
    print(f"Recommendations: {result['analysis']['recommendations']}")

Example Output

Here's what the complete response looks like for a support call:

json
{
  "transcription": {
    "text": "00:00-00:03 | SPEAKER L:\nThank you for calling TechSupport, my name is Sarah. How can I help you today?\n\n00:03-00:09 | SPEAKER R:\nHi Sarah, I'm having trouble logging into my account. It keeps saying my password is incorrect, but I'm sure I'm typing it right.\n\n00:09-00:15 | SPEAKER L:\nI'm sorry to hear that. Let me help you with that. Can I have your email address associated with the account?\n\n00:15-00:18 | SPEAKER R:\nSure, it's john.smith@email.com.\n\n00:18-00:25 | SPEAKER L:\nThank you, John. I can see your account here. It looks like there were several failed login attempts, so the account was temporarily locked for security.\n\n00:25-00:28 | SPEAKER R:\nOh, that explains it. How can I unlock it?\n\n00:28-00:38 | SPEAKER L:\nI can unlock it for you right now. I'll also send a password reset link to your email. You should receive it within the next few minutes. Is there anything else I can help you with?\n\n00:38-00:42 | SPEAKER R:\nNo, that's all I needed. Thank you so much for your help, Sarah!\n\n00:42-00:45 | SPEAKER L:\nYou're welcome, John! Have a great day!",
    "segments": [
      {
        "speaker": "SPEAKER_L",
        "text": "Thank you for calling TechSupport, my name is Sarah. How can I help you today?",
        "initial_time": 0.0,
        "end_time": 3.2
      },
      {
        "speaker": "SPEAKER_R",
        "text": "Hi Sarah, I'm having trouble logging into my account. It keeps saying my password is incorrect, but I'm sure I'm typing it right.",
        "initial_time": 3.5,
        "end_time": 9.1
      },
      {
        "speaker": "SPEAKER_L",
        "text": "I'm sorry to hear that. Let me help you with that. Can I have your email address associated with the account?",
        "initial_time": 9.4,
        "end_time": 15.0
      },
      {
        "speaker": "SPEAKER_R",
        "text": "Sure, it's john.smith@email.com.",
        "initial_time": 15.2,
        "end_time": 18.0
      },
      {
        "speaker": "SPEAKER_L",
        "text": "Thank you, John. I can see your account here. It looks like there were several failed login attempts, so the account was temporarily locked for security.",
        "initial_time": 18.3,
        "end_time": 25.5
      },
      {
        "speaker": "SPEAKER_R",
        "text": "Oh, that explains it. How can I unlock it?",
        "initial_time": 25.8,
        "end_time": 28.2
      },
      {
        "speaker": "SPEAKER_L",
        "text": "I can unlock it for you right now. I'll also send a password reset link to your email. You should receive it within the next few minutes. Is there anything else I can help you with?",
        "initial_time": 28.5,
        "end_time": 38.0
      },
      {
        "speaker": "SPEAKER_R",
        "text": "No, that's all I needed. Thank you so much for your help, Sarah!",
        "initial_time": 38.3,
        "end_time": 42.0
      },
      {
        "speaker": "SPEAKER_L",
        "text": "You're welcome, John! Have a great day!",
        "initial_time": 42.2,
        "end_time": 45.0
      }
    ],
    "speaker_count": 2
  },
  "analysis": {
    "total_questions": 3,
    "questions_answered": [
      "How can I help you today?",
      "Can I have your email address?",
      "Is there anything else I can help you with?"
    ],
    "response_rate": 1.0,
    "sale_completed": false,
    "client_interest": 0.75,
    "main_objections": [],
    "objections_resolved": true,
    "service_tone": "professional, friendly, and empathetic",
    "empathy_level": 0.9,
    "technical_knowledge": 0.85,
    "overall_score": 0.92,
    "recommendations": [
      "Consider proactively offering account security tips",
      "Could mention estimated time for password reset email"
    ],
    "next_steps": "Customer will receive password reset email and regain account access"
  }
}

Best Practices

Audio Quality

  • Sample rate: 16kHz or higher for best results (8kHz telephony audio is also supported)
  • Bit depth: 16-bit minimum
  • Format: MP3 or WAV both work well
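
If a recording doesn't meet these specs, you can resample it before upload. A sketch that shells out to ffmpeg (it assumes ffmpeg is installed; -ar sets the sample rate, -ac keeps two channels, -sample_fmt s16 gives 16-bit samples):

python
import subprocess

def resample_to_16k_stereo(src: str, dst: str) -> None:
    """Convert a recording to 16 kHz, 16-bit stereo WAV via ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "2",
         "-sample_fmt", "s16", dst],
        check=True,
    )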

Channel Assignment

  • Be consistent: Always assign the same party to the same channel
  • Document it: Note whether agents are on L or R in your configuration
  • Map speakers: In your application, map SPEAKER_L/SPEAKER_R to meaningful labels (Agent/Customer)

Choosing the Right Approach

| Scenario | Recommended Model | Response Format |
| --- | --- | --- |
| Stereo call recordings | pulse-precision-pro | stereo_diarization |
| Mono recordings with multiple speakers | pulse-precision-pro | diarization |
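
If your pipeline receives a mix of mono and stereo recordings, you can select the response format from the channel count. A WAV-only sketch, reusing the wave check from Step 1:

python
import wave

def pick_response_format(path: str) -> str:
    """Return the appropriate response_format for a WAV recording."""
    with wave.open(path, "rb") as wav:
        return "stereo_diarization" if wav.getnchannels() == 2 else "diarization"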

Performance Tips

  • Use vad_preset=telephony: Optimized for phone call audio characteristics
  • Batch processing: For large volumes, process files in parallel (see the sketch after this list)
  • Combine with anonymization: Add anonymize=true to remove PII automatically
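
Since each call is an independent HTTP round trip, large batches parallelize naturally. A minimal sketch with concurrent.futures, reusing the process_call_recording helper from the complete example (worker count and error handling are illustrative):

python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_batch(audio_paths: list[str], analysis_id: str) -> dict[str, dict]:
    """Process recordings concurrently; the work is I/O-bound, so threads suffice."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {
            pool.submit(process_call_recording, path, analysis_id): path
            for path in audio_paths
        }
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep the batch going on individual failures
                print(f"{path} failed: {exc}")
    return results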

Next Steps