Last month I joined the team at Daily. Daily builds video chat APIs that simplify adding video calls to any website or app. I had to get to know their APIs during the interview process, and, sure enough, it took me just a few minutes to get a call embedded on my website.
That was just for the take-home project, though. When prepping for any engineering interview, I practice answering the infamous question, “What happens when you type company-i-am-interviewing-at.com in your browser?” I practice because I’ve messed up the answer more than once.
There are many ways to answer this question, and there are lots of great resources out there (even other Dev.to posts!).
But visiting a Daily URL felt like a different experience. After I signed up for an account and generated my first room link, I pasted it into my browser and went directly to a video call. I realized that I actually didn’t know:
- What happens when you type a video chat URL in your browser?
- How do call participants trade audio and video?
- When Meryl Streep, Christine Baranski and Audra McDonald serenaded Stephen Sondheim for his 90th birthday, how did that actually work?
Since most of us are spending a lot more time on video calls in 2020, I became especially curious about these questions.
This post walks through what I’ve learned so far, at a high level. It covers peer-to-peer (P2P) video chats: calls in which participants’ browsers talk directly to each other. I won’t be getting into other kinds of video calls, getting into the weeds about how connecting to the internet works, or covering the many other things that I don’t know that I don’t know yet. Instead, like the answers to most questions, this is a springboard for more things to ask later. With that, let’s jump in:
URL stands for Uniform Resource Locator. A URL points us to the server, represented by the domain name, where the resource we’re looking for lives.
You typically see the letters HTTP at the start of a URL. That’s the language the browser communicates in: Hypertext Transfer Protocol. The browser uses HTTP to ask for the resource specified in the URL.
Before it can send that request, the browser needs to know where the server lives, so it starts the Domain Name System (DNS) lookup process to translate the domain name into an IP address.
That process alone deserves many a standalone post. For our purposes, we’ll fast forward to the part when a resource is found. Once the browser knows the location, it initiates a TCP connection with the server hosting the resource, and the information exchange begins.
So, at the highest level, a browser says, “Hi! Can I have this resource?” After a negotiation process, a server responds with, “Sure, here you go!”
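To make the parts of a URL concrete, here is a small sketch using the URL class that ships with browsers (and Node.js). The room link is made up for illustration: the hostname and room name are placeholders, not a real room.

```javascript
// Hypothetical room link; "your-team" and "hello" are made-up placeholders.
const roomUrl = new URL('https://your-team.daily.co/hello');

// The protocol is the language the browser speaks to the server.
console.log(roomUrl.protocol); // "https:"

// The hostname is the domain name that DNS translates into an IP address.
console.log(roomUrl.hostname); // "your-team.daily.co"

// The pathname names the resource we're asking that server for.
console.log(roomUrl.pathname); // "/hello"
```

You can paste this into a browser console and swap in any URL to see how the browser splits it up.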
Unlike a static web page (like the one you’re reading right now!), a video call trades real-time moving images and audio back and forth. Head to any Daily room URL, and two things will happen: 1) you’ll see a video chat interface, and 2) your browser will likely ask for permission to use your camera and microphone. Sign up for a Daily account and create your own room URL to see for yourself.
How do those two things happen?
Again, a URL points to a resource. Here, we can think of the resource as an HTML page. A Daily page runs some extra scripts that get in touch with a signaling server.
If you’ve ever used an online dating app or found an apartment on a site like Craigslist, you’re more familiar with how signaling servers work than you think. On Craigslist, you might see a post about a new apartment, including information like the neighborhood, rent, and how to reply to the poster. After you reply, you often wind up communicating directly with the person who made the post, without going through Craigslist at all. It’s similar in dating apps.
It's also like what happens in peer-to-peer (P2P) video calls. Once I’ve typed a URL in my browser and the server lets me into a chat, the server also hands me the IP addresses of the other people on the call. When I have their addresses, I know where to send out my audio and video streams (to each of their IPs), and can receive their audio/video back as well. If someone else joins the call after me, the server lets me and the others know, and we start sending/receiving media directly with the new caller.
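Here is a toy sketch of that bookkeeping, written as a plain in-memory class. Everything in it (the class name, the method, the caller names, and the documentation-range IP addresses) is hypothetical, and it is not how Daily's servers are implemented. Real signaling also exchanges more than addresses; this only models the "who is here, and who just joined" idea described above.

```javascript
// Toy model of a signaling server's bookkeeping (hypothetical, for illustration only).
class ToySignalingRoom {
  constructor() {
    // name -> { address, onPeerJoined }
    this.participants = new Map();
  }

  // Let a caller in: tell everyone already here about them,
  // and hand the new caller the list of people already on the call.
  join(name, address, onPeerJoined) {
    const existingPeers = [...this.participants].map(([peerName, peer]) => ({
      name: peerName,
      address: peer.address,
    }));
    for (const peer of this.participants.values()) {
      peer.onPeerJoined({ name, address });
    }
    this.participants.set(name, { address, onPeerJoined });
    return existingPeers;
  }
}

const room = new ToySignalingRoom();
const notifications = [];

// Ana joins an empty room, then Bo joins after her.
room.join('ana', '203.0.113.5', (peer) => notifications.push(`ana hears about ${peer.name}`));
const peersForBo = room.join('bo', '198.51.100.7', (peer) => notifications.push(`bo hears about ${peer.name}`));

console.log(peersForBo);    // [ { name: 'ana', address: '203.0.113.5' } ]
console.log(notifications); // [ 'ana hears about bo' ]
```

Once both callers have each other's details, the server steps out of the way, just like Craigslist does after the first reply.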
We know how to do all of this because our browsers all speak the same language: WebRTC.
WebRTC is the internet’s open, secure standard for exchanging video, voice, and generic data between browsers. Daily, like many other video chat applications, is built on top of it. WebRTC usually transmits data over UDP (User Datagram Protocol) and secures it with DTLS (Datagram Transport Layer Security).
WebRTC gives us the MediaDevices API. The MediaDevices API is the browser’s API for, well, accessing and working with our media devices, like cameras and mics. We can see this API in action by running navigator.mediaDevices.getUserMedia() in our console. Before we do, let's walk through each part of that command:
- navigator: holds information about our browser.
- .mediaDevices: again, this is the browser’s API for working with our cameras, mics, and other devices.
- .getUserMedia(): the method that prompts the user for permission to use their camera or microphone. We pass this method an object that specifies which streams we need: video, audio, or both.
Open your browser console on any website, and paste in the below gist to play with this. You should be prompted to allow camera access. When you grant permission, your camera light should turn on:
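A minimal sketch of a snippet you could paste, assuming you only want video. The helper name startCamera is my own; the guard at the top only exists so the snippet fails gently anywhere navigator.mediaDevices isn't available (for example, outside a browser).

```javascript
// Ask for video only; set audio to true to request the microphone as well.
const constraints = { video: true, audio: false };

// startCamera is a hypothetical helper name, not part of any API.
async function startCamera(mediaConstraints) {
  if (typeof navigator === 'undefined' || !navigator.mediaDevices) {
    console.log('navigator.mediaDevices is not available here; try a browser console.');
    return null;
  }
  try {
    // This is the call that triggers the permission prompt.
    const stream = await navigator.mediaDevices.getUserMedia(mediaConstraints);
    console.log('Got tracks:', stream.getTracks().map((track) => track.kind));
    return stream;
  } catch (err) {
    // For example, the user clicked "Block" or no camera is attached.
    console.log('Could not get media:', err.name);
    return null;
  }
}

startCamera(constraints);
```

When you're done, calling stop() on each of the stream's tracks should release the camera and turn the light back off.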
If, like me, you’ve got a green camera light, at this point you might be reminded of a certain Great Gatsby, looking out at a green light.
Unlike Gatsby, though, your dreams, or at least your ability to share your video stream, are within reach!
So far, we know that when we type a video chat URL into our browser, we connect to a signaling server to get the IP addresses of everybody else on the call. We know that the browser accesses our own camera via WebRTC’s MediaDevices API. What about sending our own stream out to the other callers, so they can see it?
There’s a WebRTC API for that, too.
We can attach our camera stream to another caller’s connection through the RTCPeerConnection API. The other callers then add our stream to a MediaStream object. We do the same thing when they send a stream our way. All of this data transmission happens over the User Datagram Protocol (UDP).
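As a rough sketch of both directions, here are two small helpers. The function names shareStream and watchRemoteStreams are my own; addTrack and ontrack are the RTCPeerConnection pieces they lean on. This skips the negotiation that has to happen through the signaling server before any media flows, so treat it as an outline rather than a working call.

```javascript
// Send side: attach every track of our local stream to a caller's connection.
function shareStream(peerConnection, localStream) {
  return localStream
    .getTracks()
    .map((track) => peerConnection.addTrack(track, localStream));
}

// Receive side: run onStream once for each new remote MediaStream.
// ontrack fires per track (audio, video), so we remember stream ids we've seen.
function watchRemoteStreams(peerConnection, onStream) {
  const seenStreamIds = new Set();
  peerConnection.ontrack = (event) => {
    const [remoteStream] = event.streams;
    if (remoteStream && !seenStreamIds.has(remoteStream.id)) {
      seenStreamIds.add(remoteStream.id);
      onStream(remoteStream);
    }
  };
}

// Only runs where RTCPeerConnection exists, such as a browser console.
if (typeof RTCPeerConnection !== 'undefined') {
  const connection = new RTCPeerConnection();
  watchRemoteStreams(connection, (stream) => console.log('Remote stream:', stream.id));
}
```

In a real page, the stream passed to shareStream would be the one returned by getUserMedia, and the callback given to watchRemoteStreams would typically attach the remote stream to a video element.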
So, what happens when you type a video chat URL in your browser?
At the highest level, your browser finds a resource that asks a signaling server if you can join the call. The server remembers details about the call, like who else is in the chat and where to find them (their IP addresses). When your browser connects to the server and you're let into the call, your browser shares your video and audio streams directly with other participants, and receives them back, through WebRTC APIs.
I mentioned this way back in the intro, but we’ve only covered how video chat works using WebRTC and peer-to-peer (P2P) calls, when browsers talk directly to each other. There’s a whole other acronym, and architecture, to explore: the Selective Forwarding Unit (SFU).
That’s not to mention a bunch of other questions, like, how does muting your mic or sharing your screen work? What about limiting who can join a call? I’m excited to learn the answers to these and then some at Daily. Follow along here on Dev.to, over on the Daily blog, or on GitHub.
If you’d like to pair on embedding video chat into your own app or website, or want to chat all things video APIs, give me a shout on Twitter @kimeejohnson.
You also often see an S after HTTP, but getting into HTTPS is a bit beyond the scope of this post. Quickly: the ‘S’ stands for secure, meaning the browser and server will negotiate a key to encrypt (and decrypt) the information they exchange (check out Julia Evans’ HTTPS breakdown for a visual).
If your camera light is a color other than green, insert your own art deco or yearning joke here.
WebRTC’s docs go into details and code samples.