How does Tesla's Autopilot work? How does the car recognize traffic signals, other vehicles, pedestrians, lanes, and so on? Basically, Tesla uses several cameras on the vehicle to understand what obstacles are around the car. They enable the car to detect traffic, pedestrians, road signs, lane markings, and anything else that might be in front of the vehicle. This information is used to help the car drive itself. Tesla uses its self-developed Full Self-Driving (FSD) computer to handle image processing and recognition.
This technology comes from a branch of artificial intelligence (AI) called computer vision. Computer vision is the science and technology of machines that see. As a scientific discipline, computer vision is concerned with the theory and technology for building artificial systems that obtain information from images and/or videos.
Then, how does computer vision recognize the objects in an image or video? It involves numerous tasks, including recognizing what objects are present, localizing the objects in 2D and 3D, determining the objects' and the scene's attributes, characterizing relationships between objects, and providing a semantic description of the scene.
In terms of data science, the task of computer vision is a kind of classification problem: deciding whether or not an object belongs to one of the classes in the target dataset. (It is the same kind of problem as spam detection!) The model can be quite complex because of the basic characteristics of image and video data. A classification model consists of an algorithm and a dataset for training the algorithm. We can use state-of-the-art algorithms, such as classification with neural networks.
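To make the "algorithm plus training dataset" idea concrete, here is a minimal sketch of a classifier. It is not what Tesla or any real vision system uses: it swaps the neural network for a much simpler nearest-centroid rule, and the tiny 2x2 "images", the labels, and the function names `train` and `classify` are all hypothetical choices made for illustration.

```python
from math import dist

def train(samples):
    """Compute one mean vector (centroid) per label from (pixels, label) pairs."""
    sums, counts = {}, {}
    for pixels, label in samples:
        acc = sums.setdefault(label, [0.0] * len(pixels))
        for i, value in enumerate(pixels):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(centroids, pixels):
    """Return the label whose centroid is closest to the given pixels."""
    return min(centroids, key=lambda label: dist(centroids[label], pixels))

# Hypothetical toy dataset: flattened 2x2 grayscale images (0 = dark, 255 = bright).
TRAINING_DATA = [
    ([250, 240, 245, 255], "bright"),
    ([230, 235, 250, 240], "bright"),
    ([10, 20, 5, 15], "dark"),
    ([30, 25, 20, 10], "dark"),
]

centroids = train(TRAINING_DATA)
print(classify(centroids, [200, 210, 220, 190]))  # closer to the "bright" centroid
print(classify(centroids, [40, 35, 50, 45]))      # closer to the "dark" centroid
```

The point of the sketch is the split of responsibilities: `train` is where the dataset shapes the model, and `classify` is where the trained model is applied to new data. A real system keeps that split but uses far larger images, many more classes, and a neural network in place of the centroid rule.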
Let's think about how the classification model works, focusing especially on the dataset.
Generally, a computer vision algorithm works with video and/or images. In the computer, an image can be represented as an n by m matrix based on its resolution. For example, the resolution of a full high definition image is 1920 x 1080, so it has about 2 million pixels, and we can put them into a matrix. If we store 4 bytes per value, the size of the matrix is around 8 MB. A video is a sequence of images, usually 30 frames (images) per second. This means that over 200 MB of data has to be processed to analyze 1 second of video. (This is a simplified example of image/video processing; it would be quite different in the real world.) Because of these difficulties, the dataset must be light and clear. The amount of training data also matters. There are countless kinds of objects in the world, and if the dataset tries to cover too many of them, a huge amount of computing power is needed to perform the computer vision analysis. It is impossible to put a supercomputer into a car.
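The arithmetic above can be checked with a short sketch. The number of bytes stored per value is an assumption, so it is left as a parameter and the result is printed for both 2 and 4 bytes; the function names are hypothetical, and sizes are reported in decimal megabytes (1 MB = 1,000,000 bytes).

```python
def frame_bytes(width, height, bytes_per_value):
    """Size in bytes of one uncompressed frame stored as a width x height matrix."""
    return width * height * bytes_per_value

def video_bytes_per_second(width, height, bytes_per_value, fps):
    """Size in bytes of one second of uncompressed video at the given frame rate."""
    return frame_bytes(width, height, bytes_per_value) * fps

WIDTH, HEIGHT, FPS = 1920, 1080, 30  # full HD at 30 frames per second

print(f"pixels per frame: {WIDTH * HEIGHT:,}")
for bytes_per_value in (2, 4):
    frame_mb = frame_bytes(WIDTH, HEIGHT, bytes_per_value) / 1e6
    second_mb = video_bytes_per_second(WIDTH, HEIGHT, bytes_per_value, FPS) / 1e6
    print(f"{bytes_per_value} bytes/value: {frame_mb:.1f} MB per frame, "
          f"{second_mb:.1f} MB per second")
```

With these inputs a frame has 2,073,600 pixels. At 2 bytes per value that is about 4.1 MB per frame and about 124 MB per second; at 4 bytes per value it is about 8.3 MB per frame and about 249 MB per second.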
The Microsoft COCO (Common Objects in Context, http://cocodataset.org/#home) dataset is a new large-scale dataset for detecting and segmenting objects found in everyday life in their natural environments. The COCO dataset has 91 object types and a total of 2.5 million labeled instances in 328k images.
The images can be sorted into a) iconic object images, b) iconic scene images, and c) non-iconic images. Most datasets for object recognition have focused on image classification, object bounding box localization, or semantic pixel-level segmentation with iconic images. The COCO dataset, however, focuses on segmenting individual object instances even when they appear in non-iconic images.
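To give a feel for what a labeled instance looks like, here is a small sketch that reads COCO-style annotations. The top-level keys `images`, `categories`, and `annotations`, and the `bbox` layout of `[x, y, width, height]`, follow the COCO annotation format. The file name, image size, and box values are made up for illustration, and the per-instance `segmentation` field is left out to keep the example short.

```python
import json
from collections import Counter

# Hypothetical annotation data in the COCO layout (not taken from the real dataset).
COCO_STYLE_JSON = """
{
  "images": [{"id": 1, "file_name": "street.jpg", "width": 640, "height": 480}],
  "categories": [
    {"id": 1, "name": "person", "supercategory": "person"},
    {"id": 3, "name": "car", "supercategory": "vehicle"}
  ],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [50, 120, 40, 110], "iscrowd": 0},
    {"id": 11, "image_id": 1, "category_id": 3, "bbox": [200, 200, 180, 90], "iscrowd": 0},
    {"id": 12, "image_id": 1, "category_id": 3, "bbox": [420, 210, 150, 80], "iscrowd": 0}
  ]
}
"""

def count_instances(dataset):
    """Count labeled instances per category name."""
    names = {c["id"]: c["name"] for c in dataset["categories"]}
    return Counter(names[a["category_id"]] for a in dataset["annotations"])

def bbox_corners(bbox):
    """Convert a COCO [x, y, width, height] box to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)

dataset = json.loads(COCO_STYLE_JSON)
print(count_instances(dataset))
for ann in dataset["annotations"]:
    print(ann["id"], bbox_corners(ann["bbox"]))
```

Each annotation is one object instance, so a single image can carry several annotations of the same category, as with the two cars here.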
Here is an example of object detection with the COCO dataset.
The human eye is one of the most sophisticated organs in the body. Many studies on computer vision are in progress around the world, and what computer vision can understand is improving quickly. With computer vision, I hope that someday we could have another eye for ourselves.