As self-driving cars become a reality, the data behind their safe operation has to be accurate.

That’s why worry was the first reaction to the news that labels for hundreds of pedestrians, cyclists, and traffic cones, among other objects, were missing from a widely used dataset for self-driving cars.

Of the 15,000 hand-checked images from the Udacity Dataset 2, 4,986 (33%) had issues.

Machine learning and algorithms

Machine learning has helped various industries evolve. Teaching algorithms to perform new tasks is central to making that process work smoothly and safely. These systems are only as good as the data they receive, though.

Self-Driving Car Dataset Missing Hundreds of Labels for Pedestrians, Bicycles, and More
The red boxes show the omitted or unlabeled pedestrians and cyclists, Source: Roboflow

Self-driving cars need a lot of data to train their algorithms. If a car cannot recognize a pedestrian walking by the side of the road, or a cyclist sharing the road with it, serious issues can arise.

Roboflow, a company that makes computer vision tools, discovered that a popular self-driving car dataset was riddled with mistakes. The Udacity Dataset 2 is used by thousands of students who are building an open-source self-driving car. It turns out that the dataset contains serious errors and omissions.

Roboflow hand-checked 15,000 images from the dataset and discovered that 33% of them had problems. There were thousands of unlabeled vehicles, hundreds of unlabeled pedestrians, and dozens of unlabeled cyclists. 

The red boxes show the missing labels, Source: GitHub

Other issues included duplicated bounding boxes, oversized bounding boxes, and phantom annotations.

To make matters worse, around 1.4% of the images were simply unlabeled, yet they contained cars, trucks, lights, and even pedestrians.
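Some of these problem types can be flagged automatically before anyone reviews images by hand. The following is a minimal sketch in Python, not Roboflow's actual method: the function name `audit_image`, the tuple layout for boxes, the 1920x1200 image size, and the 90% area threshold are all assumptions made for illustration. It catches images with no annotations, exact duplicate boxes, and boxes that cover almost the whole frame. It cannot detect an object that was never labeled in an otherwise annotated image; that still takes a human reviewer or a trained model.

```python
from collections import Counter

# Hypothetical annotation layout (an assumption, not the dataset's real format):
# each box is a tuple (label, x_min, y_min, x_max, y_max) in pixels.


def audit_image(width, height, boxes, max_area_fraction=0.9):
    """Return a list of human-readable issues found in one image's labels."""
    if not boxes:
        return ["no annotations"]

    issues = []

    # The same label and coordinates appearing more than once.
    counts = Counter(boxes)
    for box, n in counts.items():
        if n > 1:
            issues.append(f"duplicate box x{n}: {box}")

    image_area = width * height
    for label, x0, y0, x1, y1 in counts:
        # Boxes with zero or negative width or height.
        if x1 <= x0 or y1 <= y0:
            issues.append(f"degenerate box: {label}")
            continue
        # Boxes extending past the image borders.
        if x0 < 0 or y0 < 0 or x1 > width or y1 > height:
            issues.append(f"box outside image: {label}")
        # Boxes covering nearly the whole frame are suspicious.
        if (x1 - x0) * (y1 - y0) / image_area > max_area_fraction:
            issues.append(f"oversized box: {label}")
    return issues


# Made-up sample annotations, one entry per image.
sample = {
    "a.jpg": [("car", 10, 10, 110, 60), ("car", 10, 10, 110, 60)],
    "b.jpg": [],
    "c.jpg": [("truck", 0, 0, 1920, 1200)],
    "d.jpg": [("pedestrian", 100, 100, 140, 220)],
}

for name, boxes in sample.items():
    print(name, audit_image(1920, 1200, boxes))
```

In this sample, the first three images are flagged for a duplicate box, missing annotations, and an oversized box respectively, while the fourth passes. A clean result only means the labels that exist look plausible, not that every object in the image has one.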

This goes to show how carefully open-source datasets need to be monitored. Luckily, in this instance, Roboflow caught the issue and amended it as best it could. Thanks to Udacity’s permissive licensing, Roboflow was able to fix and re-release the dataset.