Furkan Gözükara

Massive OpenAI Spring Update GPT-4o - Amazing New Features - All 22 Videos - RTX Super Res Upscaled

OpenAI's Spring Update introduces GPT-4o, OpenAI's new flagship model that can reason across audio, vision, and text in real time, and makes more capabilities and advanced tools available for free in ChatGPT.

GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction: it accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.

Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. Voice Mode works as a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information: it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter or singing, or express emotion.
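
To make the information loss in that three-stage pipeline concrete, here is a minimal sketch in Python. It is not OpenAI's implementation: the `AudioClip` type and the three stage functions are hypothetical stand-ins invented for illustration, and the "models" are trivial stubs. The only point it shows is that once audio is reduced to a transcript, tone and background sound never reach the text model.

```python
from dataclasses import dataclass


@dataclass
class AudioClip:
    """Hypothetical stand-in for raw audio: words plus non-verbal signals."""
    words: str
    tone: str = "neutral"
    background: str = "none"


def transcribe(clip: AudioClip) -> str:
    # Stage 1 (speech-to-text stub): only the words survive.
    return clip.words


def text_model(prompt: str) -> str:
    # Stage 2 (stub for GPT-3.5 / GPT-4): text in, text out.
    return f"You said: {prompt}"


def synthesize(text: str) -> AudioClip:
    # Stage 3 (text-to-speech stub): speech is produced with default tone.
    return AudioClip(words=text)


def voice_mode_pipeline(clip: AudioClip) -> AudioClip:
    # Chain the three stages; tone and background are dropped at stage 1.
    return synthesize(text_model(transcribe(clip)))


question = AudioClip("I'm fine", tone="sarcastic", background="dog barking")
reply = voice_mode_pipeline(question)
print(reply)
```

The reply comes back with the default tone and no background, no matter what the input clip carried. A single end-to-end model avoids this bottleneck because nothing forces the audio through a text-only intermediate step.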

With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.

Model evaluations
As measured on traditional benchmarks, GPT-4o achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence, while setting new high watermarks on multilingual, audio, and vision capabilities.

Improved Reasoning - GPT-4o sets a new high score of 88.7% on 0-shot CoT MMLU (general knowledge questions). All these evals were gathered with our new simple evals library. In addition, on the traditional 5-shot no-CoT MMLU, GPT-4o sets a new high score of 87.2%. (Note: Llama3 400b is still training.)
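
The two MMLU settings differ only in how the prompt is built. The sketch below is an assumption-laden illustration of that difference, not the actual templates used by the simple evals library: the function names, the instruction wording, and the `Answer:` layout are all hypothetical. "0-shot CoT" means no worked examples but an instruction to reason before answering; "5-shot no-CoT" means five solved examples and a bare answer slot.

```python
from typing import Sequence, Tuple

LETTERS = "ABCD"

# (question, choices, correct letter); a hypothetical example record
Example = Tuple[str, Sequence[str], str]


def format_question(question: str, choices: Sequence[str]) -> str:
    # Render a multiple-choice question with lettered options.
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    return "\n".join(lines)


def zero_shot_cot_prompt(question: str, choices: Sequence[str]) -> str:
    # No solved examples; the model is asked to reason before answering.
    return (
        format_question(question, choices)
        + "\nThink step by step, then give the final answer as a single letter."
    )


def few_shot_no_cot_prompt(
    examples: Sequence[Example], question: str, choices: Sequence[str]
) -> str:
    # Solved examples first, then the target question with an empty answer slot.
    blocks = [f"{format_question(q, c)}\nAnswer: {a}" for q, c, a in examples]
    blocks.append(f"{format_question(question, choices)}\nAnswer:")
    return "\n\n".join(blocks)


# Placeholder data, for illustration only.
shots = [(f"Example question {i}?", ["w", "x", "y", "z"], "A") for i in range(5)]
target = ("What is 2 + 2?", ["3", "4", "5", "6"])

print(zero_shot_cot_prompt(*target))
print()
print(few_shot_no_cot_prompt(shots, *target))
```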

Audio ASR performance - GPT-4o dramatically improves speech recognition performance over Whisper-v3 across all languages, particularly for lower-resourced languages.

Audio translation performance - GPT-4o sets a new state-of-the-art on speech translation and outperforms Whisper-v3 on the MLS benchmark.

M3Exam Zero-Shot Results
M3Exam - The M3Exam benchmark is both a multilingual and vision evaluation, consisting of multiple choice questions from other countries’ standardized tests that sometimes include figures and diagrams. GPT-4o is stronger than GPT-4 on this benchmark across all languages. (We omit vision results for Swahili and Javanese, as there are only 5 or fewer vision questions for these languages.)

Vision understanding evals - GPT-4o achieves state-of-the-art performance on visual perception benchmarks. All vision evals are 0-shot, with MMMU, MathVista, and ChartQA as 0-shot CoT.

Language tokenization
These 20 languages were chosen as representative of the new tokenizer's compression across different language families.
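
"Compression" here refers to how many fewer tokens the new tokenizer needs for the same text. A minimal sketch of that ratio follows; the function name is hypothetical and the token counts are made-up placeholders, not measurements for any language or tokenizer.

```python
def compression_ratio(old_token_count: int, new_token_count: int) -> float:
    # Ratio > 1 means the new tokenizer needs fewer tokens for the same text.
    if old_token_count <= 0 or new_token_count <= 0:
        raise ValueError("token counts must be positive")
    return old_token_count / new_token_count


# Placeholder counts for one sentence, chosen only to illustrate the formula.
ratio = compression_ratio(old_token_count=120, new_token_count=40)
print(f"{ratio:.1f}x fewer tokens")
```

Since API usage is billed and context is limited per token, a higher ratio for a given language means the same text costs less and fits more easily into the context window.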

Model safety and limitations
GPT-4o has safety built in by design across modalities, through techniques such as filtering training data and refining the model’s behavior through post-training. We have also created new safety systems to provide guardrails on voice outputs.

0:00 Introduction to OpenAI's massive Spring Update
0:13 Say hello to GPT-4o
1:35 Two GPT-4os interacting and singing
7:29 Realtime Translation with GPT-4o
8:36 Lullabies and Whispers with GPT-4o
9:40 Meeting AI with GPT-4o
11:35 Sarcasm with GPT-4o
12:06 Math problems with GPT-4o
15:17 Point and Learn Spanish with GPT-4o
15:56 Rock, Paper, Scissors with GPT-4o
17:23 Harmonizing with two GPT-4os
18:51 Interview Prep with GPT-4o
19:58 Dog meets GPT-4o
20:25 Be My Eyes Accessibility with GPT-4o
21:32 Happy Birthday with GPT-4o
22:23 Dad jokes with GPT-4o
23:07 Fast counting with GPT-4o
23:41 Live demo of GPT-4o realtime conversational speech
26:06 Live demo of GPT-4o voice variation
28:03 Live demo of GPT-4o vision capabilities
32:05 Live demo of GPT-4o coding assistant and desktop app
35:42 Live demo of GPT-4o realtime translation
37:08 Live demo of GPT-4o's vision capabilities
