Whisper

Whisper: Multilingual speech recognition model, robust, versatile, open-source.

API for Whisper

OpenAI's Whisper model offers robust, multilingual speech-to-text capabilities, trained on diverse data, free for commercial use under the MIT license.



Basic Information

Model Name: Whisper

Developer/Creator: OpenAI

Release Date: September 2022 (original series), December 2022 (large-v2), and November 2023 (large-v3)

Model Type: Sequence-to-sequence ASR (automatic speech recognition) and speech translation model

Versions:

Size     Parameters   Relative speed
tiny     39 M         ~32x
base     74 M         ~16x
small    244 M        ~6x
medium   769 M        ~2x
large    1550 M       1x

Description

The Whisper models are intended primarily for AI research on model robustness, generalization, and bias, but they are also effective as a practical solution for English speech recognition. Using Whisper to transcribe recordings made without consent, or to inform high-risk decisions, is strongly discouraged due to potential inaccuracies and ethical concerns.

Key Features:
  • Multilingual capabilities: shows strong results in roughly 10 languages, but has limited evaluation for other tasks such as voice activity detection and speaker classification.
  • Robust to diverse accents and noisy environments.
  • Can be used for tasks such as speech transcription, translation, and generating subtitles (see the sketch after this list).
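Both the translation and subtitle use cases map onto the transcribe call of the open-source whisper Python package. The snippet below is a minimal sketch, assuming the package is installed (pip install openai-whisper) and that "interview_fr.mp3" is a placeholder for a local audio file; task="translate" produces English output, and the per-segment timestamps can be turned into subtitles:

    import whisper

    # any multilingual checkpoint works; the English-only "*.en" models cannot translate
    model = whisper.load_model("small")

    # task="transcribe" keeps the source language; task="translate" outputs English text
    result = model.transcribe("interview_fr.mp3", task="translate")
    print(result["text"])

    # each segment carries start/end timestamps, usable for SRT-style subtitles
    for seg in result["segments"]:
        print(f"{seg['start']:.2f} --> {seg['end']:.2f} {seg['text'].strip()}")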
Intended Use:

Intended for developers and researchers interested in incorporating speech-to-text capabilities into applications, supporting accessibility features, or conducting linguistic research.

Technical Details

Architecture:

The model uses an encoder-decoder (sequence-to-sequence) Transformer architecture, trained end-to-end on large-scale, weakly supervised audio-transcript data collected from the web.
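As an illustration of this encoder-decoder pipeline, the open-source whisper package exposes the individual stages (log-Mel spectrogram, language detection, token decoding). The sketch below mirrors the example in the project README; "audio.mp3" is a placeholder file name and the "base" checkpoint is an arbitrary choice:

    import whisper

    model = whisper.load_model("base")

    # load the audio and pad/trim it to the 30-second window the encoder expects
    audio = whisper.load_audio("audio.mp3")
    audio = whisper.pad_or_trim(audio)

    # compute the log-Mel spectrogram consumed by the Transformer encoder
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # the decoder first predicts the spoken language...
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # ...then autoregressively generates the transcript tokens
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)
    print(result.text)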

Training Data:

The models are trained using 680,000 hours of audio and corresponding transcripts from the internet, with 65% being English audio and transcripts, 18% non-English audio with English transcripts, and 17% non-English audio with matching non-English transcripts, covering 98 languages in total.

Performance Metrics:

Research indicates that these models outperform many existing ASR systems. They show enhanced robustness to accents, background noise, and technical language, and provide zero-shot translation from multiple languages into English with nearly state-of-the-art accuracy in both speech recognition and translation.

Performance varies across languages, degrading in particular for low-resource or less commonly studied languages, and accuracy also varies across accents, dialects, and demographic groups. The models may also generate repetitive text, a tendency that can be partly mitigated with beam search and temperature scheduling, as shown in the sketch below.
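In the open-source package, these mitigations are exposed as decoding parameters of transcribe. The snippet below is a sketch rather than a recommended configuration: the temperature tuple is a fallback schedule tried in order when low-temperature decoding trips the quality heuristics, and "audio.mp3" is a placeholder file:

    import whisper

    model = whisper.load_model("medium")

    result = model.transcribe(
        "audio.mp3",
        beam_size=5,                                 # beam search instead of greedy decoding
        temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback temperatures for degenerate decodes
        condition_on_previous_text=False,            # helps avoid repetition loops on long audio
    )
    print(result["text"])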

Knowledge cutoff:

The audio and text data used for training do not include information beyond mid-2022.

Usage

Code Samples/SDK:
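As a starting point, the open-source whisper Python package (pip install openai-whisper) transcribes a file in a few lines; the sketch below assumes a local file named "audio.mp3" and uses the "base" checkpoint (see the size table above for the speed/accuracy trade-off):

    import whisper

    # smaller checkpoints trade accuracy for speed; larger ones do the reverse
    model = whisper.load_model("base")

    result = model.transcribe("audio.mp3")
    print(result["text"])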

Tutorials: Speech-to-text Multimodal Experience in NodeJS

File Size

The maximum file size is limited to 2 GB.
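For recordings above this limit, a common workaround is to split the audio into smaller chunks before transcription. The sketch below uses the pydub library, which is an assumption rather than part of Whisper or the API; "long_recording.mp3" and the 10-minute chunk length are placeholder choices:

    from pydub import AudioSegment

    # pydub needs ffmpeg available on the system for mp3 decoding
    audio = AudioSegment.from_file("long_recording.mp3")

    chunk_ms = 10 * 60 * 1000  # 10-minute chunks (placeholder value)
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk = audio[start:start + chunk_ms]
        chunk.export(f"chunk_{i:03d}.mp3", format="mp3")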

Support and Community

Community Resources:

AIML API Discord

Support Channels:

Issues and contributions can be made directly through the GitHub repository.

Ethical Considerations

  • Ethical Guidelines: OpenAI provides guidance on responsible usage, emphasizing privacy and ethical use of AI technologies.
  • Bias Mitigation: Continuous efforts to reduce biases in speech recognition accuracy across different languages and accents.

Licensing

  • License Type: Released under the MIT license, allowing for commercial and non-commercial use.

