SipPulse AI - Audio to Text Transcription

Audio Transcription Interface

The SipPulse AI platform offers advanced capabilities for transcribing audio to text. This section documents how to use the transcription tool to convert audio files into text.

Accessing the Transcription Interface

  1. Navigating to the Tool:

    • Access the left sidebar menu
    • Click on "Playground" to expand the options
    • Select "Speech to Text" to open the transcription interface
  2. Main Interface Components:

    • Upload Area: Central zone to drag and drop audio files
    • Size Limitation: Maximum of 25MB per file
    • Control Buttons: "Transcribe" to start processing
    • Results Area: Space where the transcription will be displayed after processing

Transcription Configuration

  1. Model and Format Settings:

    • Model: Select "pulse-precision" for accurate transcription
    • Output Format: Option to define the format of the resulting text
    • Language: Leave as "Auto detect" for automatic language detection, or select a specific language
    • Prompt: Optional field to provide additional context to the transcription model
  2. Advanced Features:

    • Anonymization: When activated, replaces sensitive information with placeholders

      • Protects personal data, preventing its exposure in the transcribed text
      • Incurs an additional cost based on the number of characters processed
    • Insights: Additional analyses of the transcribed content

      • Text Summarization: Transforms long transcriptions into concise summaries
      • Topic Detection: Identifies the main themes addressed in the audio
  3. Presets and Saved Configurations:

    • Ability to save frequently used configurations as presets
    • "Save as preset" button to store current settings
    • "No preset" menu to select previously saved configurations
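
Anonymized output can also be inspected programmatically. The sketch below is a minimal illustration, assuming placeholders take the bracketed upper-case form that appears in the sample API response on this page (for example [CUSTOMER_NAME]); the full set of placeholder names is not documented here, and the helper name count_placeholders is hypothetical.

```python
import re
from collections import Counter

# Assumption: placeholders look like [UPPER_CASE_NAME], as in the sample response.
PLACEHOLDER = re.compile(r"\[([A-Z][A-Z_]*)\]")


def count_placeholders(text: str) -> Counter:
    """Count how often each anonymization placeholder occurs in `text`."""
    return Counter(PLACEHOLDER.findall(text))


# Excerpt taken from the sample response shown on this page.
sample = (
    "Good morning, my name is [AGENT_NAME] and I'm calling on behalf of "
    "[COMPANY_NAME]. May I please speak with [CUSTOMER_NAME]? Great. "
    "Well, [CUSTOMER_NAME], it's a pleasure speaking with you today."
)

counts = count_placeholders(sample)
print(dict(counts))
```

A count like this can help confirm that anonymization was active for a given request before the transcript is stored or shared.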

Transcription Process

  1. File Preparation:

    • Verify that the audio file is in a compatible format
    • Ensure the file size does not exceed 25MB
  2. File Upload:

    • Drag and drop the file into the indicated area, or
    • Click on the area to open the file selector
  3. Parameter Configuration:

    • Adjust model settings, language, and other options as needed
    • Activate additional features if necessary (anonymization, analyses)
  4. Executing the Transcription:

    • Click the "Transcribe" button to begin processing
    • The system will process the audio and display progress
  5. Review and Export:

    • After completion, review the transcribed text in the results area
    • Use available options to export or copy the content
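
The file preparation step above can be checked before uploading. This is a minimal sketch, assuming "25MB" means 25 * 1024 * 1024 bytes (the source does not say whether the limit is binary or decimal); the function name fits_upload_limit is hypothetical.

```python
import os
import tempfile

# Assumption: the 25MB limit is interpreted as binary megabytes.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def fits_upload_limit(path: str, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True when the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit


# Demonstration with a 1 KiB temporary file standing in for real audio.
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
    tmp.write(b"\x00" * 1024)
    sample_path = tmp.name

within_default_limit = fits_upload_limit(sample_path)
within_tiny_limit = fits_upload_limit(sample_path, limit=512)
os.remove(sample_path)

print(within_default_limit, within_tiny_limit)
```

If the real limit turns out to be decimal (25,000,000 bytes), only the constant needs to change.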

API Integration

SipPulse AI provides a RESTful API for integrating speech-to-text capabilities directly into your applications. Below are examples of how to use the API in different programming languages.

API Parameters

  • model: Specifies the transcription model (e.g., pulse-precision)
  • response_format: Determines the structure of the response (e.g., diarization for speaker identification)
  • api-key: Your SipPulse API authentication key, sent as an HTTP request header (model and response_format are sent as query parameters)
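
As a rough illustration of how these parameters fit together, the sketch below assembles the request URL and headers using only the standard library. The endpoint and parameter values are those used in the examples on this page; the helper names build_transcribe_url and build_headers are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from the examples on this page.
BASE_URL = "https://api.sippulse.ai/asr/transcribe"


def build_transcribe_url(model: str = "pulse-precision",
                         response_format: str = "diarization") -> str:
    """Append model and response_format to the endpoint as query parameters."""
    query = urlencode({"model": model, "response_format": response_format})
    return f"{BASE_URL}?{query}"


def build_headers(api_key: str) -> dict:
    """The API key goes in the 'api-key' header, not in the query string."""
    return {"accept": "application/json", "api-key": api_key}


url = build_transcribe_url()
headers = build_headers("example-key")  # placeholder value, not a real key
print(url)
```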

Python Example

python
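import os

import requests

# Model and response format are passed as query parameters.
url = 'https://api.sippulse.ai/asr/transcribe'
params = {'model': 'pulse-precision', 'response_format': 'diarization'}
headers = {
    'accept': 'application/json',
    # Read the key from the SIPPULSE_API_KEY environment variable.
    'api-key': os.environ['SIPPULSE_API_KEY'],
}

# Send the audio as a multipart/form-data upload in the 'file' field.
with open('audio-sample.mp3', 'rb') as f:
    files = {
        'file': ('audio-sample.mp3', f, 'audio/mpeg'),
    }
    response = requests.post(url, params=params, headers=headers, files=files)

response.raise_for_status()  # raise an exception on 4xx/5xx responses
print(response.text)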

Node.js Example

javascript
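// Requires Node.js 18 or later (global fetch, FormData, and Blob).
const fs = require('node:fs');

const url = new URL('https://api.sippulse.ai/asr/transcribe');
url.search = new URLSearchParams({
  model: 'pulse-precision',
  response_format: 'diarization',
}).toString();

// Attach the audio file as the multipart 'file' field.
const body = new FormData();
const audio = new Blob([fs.readFileSync('audio-sample.mp3')], { type: 'audio/mpeg' });
body.append('file', audio, 'audio-sample.mp3');

const fetchOptions = {
  method: 'POST',
  headers: {
    'accept': 'application/json',
    // Read the key from the SIPPULSE_API_KEY environment variable.
    'api-key': process.env.SIPPULSE_API_KEY,
  },
  body,
};

fetch(url, fetchOptions)
  .then(response => response.json())
  .then(data => console.log(data))
  .catch(error => console.error("Error:", error));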

cURL Example

bash
# Double quotes let the shell expand $SIPPULSE_API_KEY; the @ prefix uploads the file's contents.
curl -X 'POST' \
  'https://api.sippulse.ai/asr/transcribe?model=pulse-precision&response_format=diarization' \
  -H 'accept: application/json' \
  -H "api-key: $SIPPULSE_API_KEY" \
  -F 'file=@audio-sample.mp3;type=audio/mpeg'

Sample API Response

Below is an example of the API response when using the diarization format, with personal information anonymized:

json
{
  "segments": [
    {
      "end_time": 9.98,
      "initial_time": 9.54,
      "speaker": "NOT IDENTF",
      "text": "Hello."
    },
    {
      "end_time": 11.84,
      "initial_time": 9.98,
      "speaker": "SPEAKER_00",
      "text": "Hi there,"
    },
    {
      "end_time": 12.48,
      "initial_time": 12.0,
      "speaker": "SPEAKER_01",
      "text": "go ahead."
    },
    {
      "end_time": 31.38,
      "initial_time": 12.7,
      "speaker": "SPEAKER_00",
      "text": "Good morning, my name is [AGENT_NAME] and I'm calling on behalf of [COMPANY_NAME]. May I please speak with [CUSTOMER_NAME]? Great. Well, [CUSTOMER_NAME], it's a pleasure speaking with you today. Just to let you know this call is being recorded for security purposes. I'm sorry, could you repeat that?"
    },
    {
      "end_time": 36.68,
      "initial_time": 32.68,
      "speaker": "SPEAKER_01",
      "text": "You're recording this call, but I don't have any accounts with [COMPANY_NAME]."
    },
    {
      "end_time": 55.34,
      "initial_time": 37.54,
      "speaker": "SPEAKER_01",
      "text": "My accounts are up to date. I spoke with the manager there yesterday. He advised me not to share information on these calls. If there's an issue with my account, please tell me what it is."
    },
    {
      "end_time": 66.36,
      "initial_time": 56.3,
      "speaker": "SPEAKER_00",
      "text": "I understand. This is regarding a business matter we have with you, but before I can provide additional information, I would need to verify some details. Would that be possible?"
    },
    {
      "end_time": 69.88,
      "initial_time": 67.52,
      "speaker": "SPEAKER_01",
      "text": "No, I'm not going to verify anything, I don't know what this is about."
    },
    {
      "end_time": 82.18,
      "initial_time": 70.94,
      "speaker": "SPEAKER_00",
      "text": "I understand. If you'd like, you can contact us directly at [PHONE_NUMBER] or visit your local branch, okay?"
    },
    {
      "end_time": 93.6,
      "initial_time": 84.88,
      "speaker": "SPEAKER_01",
      "text": "Thank you. My branch is very close, I'll stop by and see what's happening with my account."
    },
    {
      "end_time": 105.24,
      "initial_time": 94.24,
      "speaker": "SPEAKER_00",
      "text": "I understand. Thank you for your time. I wish you a wonderful day."
    },
    {
      "end_time": 106.6,
      "initial_time": 106.06,
      "speaker": "SPEAKER_01",
      "text": "Thanks."
    }
  ],
  "text": "00:09-00:09 | NOT IDENTIFIED:\nHello.\n\n00:09-00:11 | SPEAKER 00:\nHi there,\n\n00:12-00:12 | SPEAKER 01:\ngo ahead.\n\n00:12-00:31 | SPEAKER 00:\nGood morning, my name is [AGENT_NAME] and I'm calling on behalf of [COMPANY_NAME]. May I please speak with [CUSTOMER_NAME]? Great. Well, [CUSTOMER_NAME], it's a pleasure speaking with you today. Just to let you know this call is being recorded for security purposes. I'm sorry, could you repeat that?\n\n00:32-00:36 | SPEAKER 01:\nYou're recording this call, but I don't have any accounts with [COMPANY_NAME].\n\n00:37-00:55 | SPEAKER 01:\nMy accounts are up to date. I spoke with the manager there yesterday. He advised me not to share information on these calls. If there's an issue with my account, please tell me what it is.\n\n00:56-01:06 | SPEAKER 00:\nI understand. This is regarding a business matter we have with you, but before I can provide additional information, I would need to verify some details. Would that be possible?\n\n01:07-01:09 | SPEAKER 01:\nNo, I'm not going to verify anything, I don't know what this is about.\n\n01:10-01:22 | SPEAKER 00:\nI understand. If you'd like, you can contact us directly at [PHONE_NUMBER] or visit your local branch, okay?\n\n01:24-01:33 | SPEAKER 01:\nThank you. My branch is very close, I'll stop by and see what's happening with my account.\n\n01:34-01:45 | SPEAKER 00:\nI understand. Thank you for your time. I wish you a wonderful day.\n\n01:46-01:46 | SPEAKER 01:\nThanks.",
  "usage": {
    "cost": 0.0705675264,
    "currency": "BRL",
    "cost_details": [
      {
        "type": "speech-to-text",
        "unit": "minute",
        "amount": {
          "value": 2
        },
        "total_price": {
          "value": 0.0705675264
        },
        "unit_price": {
          "value": 0.0352837632
        }
      }
    ],
    "performance": {
      "delay": 98,
      "execution_time": 20343,
      "relative_execution_time": 5.316030084058399,
      "relative_execution_time_unit": "seconds_per_seconds"
    }
  }
}
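
In the sample above, the top-level cost appears to equal the billed amount multiplied by the unit price (2 minutes × 0.0352837632 = 0.0705675264 BRL). The sketch below re-derives that figure from the cost_details entries. It assumes this relationship holds for every entry, which the source does not state explicitly; recompute_cost is a hypothetical helper.

```python
import math

# The "usage" object copied from the sample response (performance omitted).
usage = {
    "cost": 0.0705675264,
    "currency": "BRL",
    "cost_details": [
        {
            "type": "speech-to-text",
            "unit": "minute",
            "amount": {"value": 2},
            "total_price": {"value": 0.0705675264},
            "unit_price": {"value": 0.0352837632},
        }
    ],
}


def recompute_cost(usage: dict) -> float:
    """Sum amount * unit_price over all cost_details entries."""
    return sum(
        item["amount"]["value"] * item["unit_price"]["value"]
        for item in usage["cost_details"]
    )


recomputed = recompute_cost(usage)
# Compare with a tolerance because the values are floating-point numbers.
matches_reported = math.isclose(recomputed, usage["cost"], rel_tol=1e-9)
print(recomputed, usage["currency"], matches_reported)
```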

Response Structure Explained

The API response provides detailed information about the transcribed audio:

  1. Segments: Array of individual speech segments with:

    • initial_time/end_time: Timestamps in seconds
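    • speaker: Speaker identification (SPEAKER_00, SPEAKER_01, etc.); the sample also contains the label NOT IDENTF, which is rendered as NOT IDENTIFIED in the text field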
    • text: Transcribed content for that segment
  2. Text: Formatted transcript with timestamps and speaker identification

  3. Usage: Detailed usage metrics including:

    • Cost: Total cost for the transcription
    • Cost details: Breakdown of charges by service type
    • Performance: Processing time metrics
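
To show how the segments array might be consumed, the following sketch totals speaking time per speaker label. It uses the first three segments from the sample response as input; the function name talk_time_by_speaker is hypothetical and not part of the API.

```python
from collections import defaultdict


def talk_time_by_speaker(segments: list) -> dict:
    """Sum (end_time - initial_time) in seconds for each speaker label."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment["speaker"]] += segment["end_time"] - segment["initial_time"]
    # Round to two decimals, matching the precision of the sample timestamps.
    return {speaker: round(seconds, 2) for speaker, seconds in totals.items()}


# First three segments of the sample response on this page.
segments = [
    {"end_time": 9.98, "initial_time": 9.54, "speaker": "NOT IDENTF", "text": "Hello."},
    {"end_time": 11.84, "initial_time": 9.98, "speaker": "SPEAKER_00", "text": "Hi there,"},
    {"end_time": 12.48, "initial_time": 12.0, "speaker": "SPEAKER_01", "text": "go ahead."},
]

totals = talk_time_by_speaker(segments)
print(totals)
```

With a full response parsed from JSON, the same function can be called on the value of its "segments" key.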

Usage Considerations

  1. Transcription Quality:

    • Transcription accuracy depends on the quality of the original audio
    • Audio with background noise, overlapping voices, or poor recording quality may affect results
  2. Cost Optimization:

    • Additional features such as anonymization and analyses increase processing cost
    • Only use the necessary features for each specific use case
  3. Supported Languages:

    • The system supports multiple languages, with an automatic detection option
    • For best results in specific languages, manually select the correct language
  4. API Rate Limits:

    • Be aware of API rate limits for your subscription tier
    • Implement appropriate error handling for API responses
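
One common way to handle rate limits is to retry with exponential backoff. The sketch below is a generic illustration, not SipPulse-specific behavior: it assumes the API signals rate limiting with HTTP status 429, which the source does not confirm, and it takes a hypothetical send callable that returns a (status, body) pair so the retry logic can be tried without network access.

```python
import time


def call_with_backoff(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it succeeds, backing off after each 429 response.

    `send` must return a (status_code, body) tuple. Assumption: 429 means
    "rate limited"; check the API documentation for the actual status code.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status == 429 and attempt < max_attempts - 1:
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
            continue
        if status >= 400:
            raise RuntimeError(f"request failed with HTTP {status}")
        return body


# Simulated responses: two rate-limit replies, then a success.
responses = iter([(429, None), (429, None), (200, {"text": "ok"})])
delays = []  # record the waits instead of actually sleeping

result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

In real use, send would wrap the HTTP request and return the response's status code and parsed body, and the default sleep would be left in place.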

By using the SipPulse AI transcription tool, you can efficiently convert audio content to text, facilitating analysis, documentation, and further processing of the content, either through the web interface or programmatically via the API.