Whisper

Whisper: Multilingual speech recognition model, robust, versatile, open-source.

API for Whisper

OpenAI's Whisper model offers robust, multilingual speech-to-text capabilities, trained on diverse data, free for commercial use under the MIT license.



Basic Information

Model Name: Whisper

Developer/Creator: OpenAI

Release Date: September 2022 (original series), December 2022 (large-v2), and November 2023 (large-v3)

Model Type: Sequence-to-sequence ASR (automatic speech recognition) and speech translation model

Versions:

Size     Parameters   Relative speed
tiny     39 M         ~32x
base     74 M         ~16x
small    244 M        ~6x
medium   769 M        ~2x
large    1550 M       1x

Description

The Whisper models are intended primarily for AI research on model robustness, generalization, and bias, but they are also effective as a practical solution for English speech recognition. Using Whisper to transcribe recordings made without consent, or to inform high-risk decisions, is strongly discouraged due to potential inaccuracies and ethical concerns.

Key Features:
  • Multilingual capabilities: shows strong results in roughly 10 languages, but has limited evaluation for other tasks such as voice activity detection and speaker classification.
  • Robust to diverse accents and noisy environments.
  • Can be used for tasks such as speech transcription, translation, and generating subtitles (see the sketch after this list).
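Both the translation and subtitle use cases map onto the transcribe call of the open-source whisper Python package. The snippet below is a minimal sketch, assuming the package is installed (pip install openai-whisper) and that "interview_fr.mp3" is a placeholder for a local audio file; task="translate" produces English output, and the per-segment timestamps can be turned into subtitles:

    import whisper

    # any multilingual checkpoint works; the English-only "*.en" models cannot translate
    model = whisper.load_model("small")

    # task="transcribe" keeps the source language; task="translate" outputs English text
    result = model.transcribe("interview_fr.mp3", task="translate")
    print(result["text"])

    # each segment carries start/end timestamps, usable for SRT-style subtitles
    for seg in result["segments"]:
        print(f"{seg['start']:.2f} --> {seg['end']:.2f} {seg['text'].strip()}")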
Intended Use:

Intended for developers and researchers interested in incorporating speech-to-text capabilities into applications, supporting accessibility features, or conducting linguistic research.

Technical Details

Architecture:

The model uses an encoder-decoder (sequence-to-sequence) Transformer architecture, trained end-to-end on large-scale, weakly supervised audio-transcript data collected from the web.
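As an illustration of this encoder-decoder pipeline, the open-source whisper package exposes the individual stages (log-Mel spectrogram, language detection, token decoding). The sketch below mirrors the example in the project README; "audio.mp3" is a placeholder file name and the "base" checkpoint is an arbitrary choice:

    import whisper

    model = whisper.load_model("base")

    # load the audio and pad/trim it to the 30-second window the encoder expects
    audio = whisper.load_audio("audio.mp3")
    audio = whisper.pad_or_trim(audio)

    # compute the log-Mel spectrogram consumed by the Transformer encoder
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # the decoder first predicts the spoken language...
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # ...then autoregressively generates the transcript tokens
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)
    print(result.text)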

Training Data:

The models are trained using 680,000 hours of audio and corresponding transcripts from the internet, with 65% being English audio and transcripts, 18% non-English audio with English transcripts, and 17% non-English audio with matching non-English transcripts, covering 98 languages in total.

Performance Metrics:

Research indicates that these models outperform many existing ASR systems. They show enhanced robustness to accents, background noise, and technical language, and provide zero-shot translation from multiple languages into English with nearly state-of-the-art accuracy in both speech recognition and translation.

Performance varies across languages, degrading in particular for low-resource or less commonly studied languages, and accuracy also varies across accents, dialects, and demographic groups. The models may also generate repetitive text, a tendency that can be partly mitigated with beam search and temperature scheduling, as shown in the sketch below.
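In the open-source package, these mitigations are exposed as decoding parameters of transcribe. The snippet below is a sketch rather than a recommended configuration: the temperature tuple is a fallback schedule tried in order when low-temperature decoding trips the quality heuristics, and "audio.mp3" is a placeholder file:

    import whisper

    model = whisper.load_model("medium")

    result = model.transcribe(
        "audio.mp3",
        beam_size=5,                                 # beam search instead of greedy decoding
        temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback temperatures for degenerate decodes
        condition_on_previous_text=False,            # helps avoid repetition loops on long audio
    )
    print(result["text"])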

Knowledge cutoff:

The audio and text data used for training do not include information beyond mid-2022.

Usage

Code Samples/SDK:
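As a starting point, the open-source whisper Python package (pip install openai-whisper) transcribes a file in a few lines; the sketch below assumes a local file named "audio.mp3" and uses the "base" checkpoint (see the size table above for the speed/accuracy trade-off):

    import whisper

    # smaller checkpoints trade accuracy for speed; larger ones do the reverse
    model = whisper.load_model("base")

    result = model.transcribe("audio.mp3")
    print(result["text"])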

Tutorials: Speech-to-text Multimodal Experience in NodeJS

File Size

The maximum file size is limited to 2 GB.
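For recordings above this limit, a common workaround is to split the audio into smaller chunks before transcription. The sketch below uses the pydub library, which is an assumption rather than part of Whisper or the API; "long_recording.mp3" and the 10-minute chunk length are placeholder choices:

    from pydub import AudioSegment

    # pydub needs ffmpeg available on the system for mp3 decoding
    audio = AudioSegment.from_file("long_recording.mp3")

    chunk_ms = 10 * 60 * 1000  # 10-minute chunks (placeholder value)
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk = audio[start:start + chunk_ms]
        chunk.export(f"chunk_{i:03d}.mp3", format="mp3")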

Support and Community

Community Resources:

AIML API Discord

Support Channels:

Issues and contributions can be made directly through the GitHub repository.

Ethical Considerations

  • Ethical Guidelines: OpenAI provides guidance on responsible usage, emphasizing privacy and ethical use of AI technologies.
  • Bias Mitigation: Continuous efforts to reduce biases in speech recognition accuracy across different languages and accents.

Licensing

  • License Type: Released under the MIT license, allowing for commercial and non-commercial use.

