Tools that make synthetic data generation easy are fundamentally changing how machine learning work gets done. Iterating on and improving the dataset over the course of a project matters more to its success than iterating on the model architecture. That's why we are releasing zpy, an open source synthetic data toolkit: every developer should have the option of working with dynamic data rather than static data.
Software 2.0
We are undergoing a phase change in the way software programming works [1]. As we replace our collective software stack with deep learning systems, we are going to fundamentally change many of the core abstractions and workflows that have been part of software development for decades.
Figure 1: Machine learning introduces a new programming paradigm [2].
Unfortunately, many deep learning researchers are still stuck in the old software paradigm: they spend the majority of their time and effort designing and iterating on the algorithm (“Rules” in Figure 1) while using a static dataset like MNIST or ImageNet. Those of us who make machine learning work in the real world, though, have already come to the realization that the most important part of getting something to work is building a good dataset (“Data” and “Answers” in Figure 1). The data and the labels are where the majority of our time and effort should go.
Deep learning algorithms are all made of the same building blocks: layers of neurons arranged in clever patterns. The exact arrangement of those neurons, and the long list of accompanying tricks and widgets, has been described as alchemy [3]. Researchers spend enormous effort discovering the arrangements that work best, often keeping the dataset static so the arrangements can be compared quantitatively. In the real world, however, engineers often do the opposite: they figure out how to get better data while simply using whatever arrangement is popular at the time.
This creates a huge need for tools that make it simple to modify, adjust, and create training data, and the dynamic nature of synthetic data generation meets that need. Synthetic data makes it easy to change the annotation style, or to add a new label that can serve as an extra training loss for the model. It also makes it easy to generate more examples of a specific edge case that may be causing issues in production. Generating and iterating on synthetic data should be easy, and should be used in concert with adjustments to the model to achieve one's goals, as the sketch below illustrates.
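To make this concrete, here is a minimal Python sketch of the idea. Every name in it is hypothetical (this is not the zpy API): it only illustrates that because a synthetic dataset is the output of a parameterized program, adding an annotation type or oversampling an edge case is a config edit rather than a relabeling project.

import random
from dataclasses import dataclass, field

@dataclass
class SceneConfig:
    num_images: int = 100
    annotations: tuple = ("bounding_box",)  # e.g. add "segmentation" later
    edge_case_weights: dict = field(default_factory=dict)  # oversample rare cases

def sample_scene(weights):
    # Stand-in for a 3D scene sampler; weights bias sampling toward named edge cases.
    cases = ["nominal"] + list(weights)
    probs = [1.0] + [weights[c] for c in weights]
    return random.choices(cases, weights=probs, k=1)[0]

def generate_dataset(config):
    # Render each scene and emit exactly the labels the config asks for.
    for _ in range(config.num_images):
        scene = sample_scene(config.edge_case_weights)
        image = f"render({scene})"  # stand-in for a rendered frame
        labels = {kind: f"{kind}({scene})" for kind in config.annotations}
        yield image, labels

# Iterating the dataset becomes a config change, not a relabeling project:
config = SceneConfig(
    annotations=("bounding_box", "segmentation"),  # new annotation style
    edge_case_weights={"low_light": 3.0},          # 3x more of a known failure mode
)
dataset = list(generate_dataset(config))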
Open Source
“Free software” means software that respects users’ freedom and community. Roughly, it means that the users have the freedom to run, copy, distribute, study, change and improve the software. Thus, “free software” is a matter of liberty, not price. To understand the concept, you should think of “free” as in “free speech,” not as in “free beer”. [4]
People want to be able to shape and influence the tools they use, and the best way to empower them to do that is to build those tools out in the open. The future of data creation, and thus the future of software, will be built on open core tools created in part by the developer community.
The best argument for this type of development is the growing popularity of the open core model among software startups. Open core is built around the idea that the “core” of the software stack is open source and freely available online; startups that adopt this paradigm sustain themselves by selling additional services or features on top of that open core. This stands in contrast to the more popular SaaS business model, in which all software is proprietary and is effectively rented out to users.
Dynamic Data
Dynamic data is the future of training deep learning systems. Open source is the future of programming. That’s why we have decided to release our data development toolkit zpy [5] under an open source license. Now everything you need to generate and iterate synthetic data for computer vision is available for free.
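As a rough illustration of what that dynamic-data workflow can look like, here is a hedged sketch of a generate-train-evaluate loop. The generate, evaluate, and (commented-out) train functions are hypothetical placeholders, not zpy functions: the point is that the architecture stays fixed while measured failure modes feed back into the data generator.

def generate(num_images, weights):
    # Stand-in for a synthetic data generator parameterized by edge-case weights.
    return [("image", "label")] * num_images

def evaluate(model):
    # Stand-in: per-edge-case error rates, ideally measured on real validation data.
    return {"low_light": 0.4, "occlusion": 0.1}

weights = {}
model = None
for _ in range(3):
    data = generate(1000, weights)
    # model = train(model, data)  # the architecture stays fixed; only the data changes
    for case, error in evaluate(model).items():
        weights[case] = weights.get(case, 0.0) + error  # oversample measured failures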
But this is just the beginning of the phase shift we mentioned earlier. Your feedback, commits, and feature requests will be invaluable as we continue to build a more robust set of tools for generating synthetic data. Meanwhile, if you could use hands-on support with a particularly tricky problem, please reach out!
References
[1] Building the Software 2.0 Stack. Video lecture by Andrej Karpathy.
[2] Deep Learning with Python. Book by François Chollet.
[3] Machine Learning has Become Alchemy. Video lecture by Ali Rahimi.
[4] What is Free Software? Article by the GNU Project.
[5] zpy: an open source synthetic data toolkit.