Learning to code by creating open source documentation: a beneficial synergy

#opensource #tutorial #documentation

How Duolingo and ReCAPTCHA inspire us to improve open source documentation and the learning experience of coding tutorials

For most people, captchas are not exactly fun, but rather a necessary evil. Captcha stands for "completely automated public Turing test to tell computers and humans apart", and captchas are annoying little tasks that consist of reading and typing in scrambled words in order to identify yourself as a non-machine. I always hoped I would get an "easy one" whenever a captcha popped up, not the kind of illegible word that makes you start all over again and curse the inventor of this frustrating waste of time, especially if it is your 10th captcha in the space of 30 minutes.

Ironically, the idea behind captchas was to confound machines by setting tasks which they perform badly at, while being natural and easy for humans to solve, such as image recognition or reading handwriting. After a while, reCAPTCHA, "Captcha reloaded", as it were, emerged and suddenly you had to type in two words instead of one. Great, more madness!

Admittedly, I exaggerate a little, although there is no denying that I have a point. Of course, it had nothing to do with madness, and later I discovered that something had changed underneath the hood which no longer made captchas a waste of time, but transformed them into a powerful crowdsourcing tool.

Captcha reloaded

With more and more websites using captchas and an increasing number of people completing these little tasks every single day, the developers behind captchas came up with the smart idea of using the massive amount of manpower that goes into solving captchas more constructively. They founded a company called reCAPTCHA and redesigned captchas in such a way that the solving and authenticating process essentially stays the same, but with the difference that the answers given by users are now used to solve big data problems.

As the name implies, big data problems require huge amounts of data in order to be solved, and often the datasets have to be created by hand. Examples of such tasks are digitising handwritten books or labelling pictures for machine learning datasets. The method used to digitise handwriting involves making people read a handwritten word and type it in, and storing the result as the digitised version of the handwritten word, whereas picture recognition tasks simply ask you to click on images which contain certain objects.

Of course, there are several mechanisms to ensure that only correct answers are stored, but otherwise that's all it is. Incredibly simple and efficient. According to Luis von Ahn, one of the developers and founders of reCAPTCHA, they managed to digitise 2.5 million books per year. In fact, the process was so efficient that by 2011 the New York Times' archives and the books on Google Books had already been digitised in full.

You can check this out and more in Luis von Ahn's (very funny and inspiring) TED Talk about massive online collaboration. At this point, I would like to move on to another project he mentions in his talk which has become increasingly popular in recent years:

Duolingo

Co-founded by von Ahn, Duolingo is an online language learning platform which teaches you the language of your choice by letting you practise with short exercises every day. It is free to use and gives everyone the opportunity to learn a new language while, to quote the MIT Technology Review, "being an education that pays for itself".

How does Duolingo achieve this? Besides more "classical" options like language tests and obtaining certificates for them, Duolingo provides a very clever service: crowdsourced translation. Companies can submit texts, such as daily news, to Duolingo and request a translation into a specific language. Duolingo will then distribute the text to the students on the platform, who will work on it as part of their learning experience, before returning the translated text to the companies for a fee.

It is easy to see how the principles behind Duolingo's and reCAPTCHA's business models are very similar:

Intelligent crowdsourcing
Combined workflows
Clever usage of existing resources

Synergetic benefits

Let's take a closer look at the genius of these combined principles. Obviously, the problems of identifying a human user and of digitising books, along with solutions to each, already existed before reCAPTCHA. However, only by combining those two problems in order to solve them both simultaneously do you gain a truly elegant and efficient solution that creates added value - essentially out of nothing, simply by combining existing processes. The same goes for Duolingo, where required tasks such as translating the news are completed while serving as examples for language learners. Both the translation and the learning aspects existed before, but only by combining them do you create a synergy that is beneficial for both. Translations become cheaper as they are done through crowdsourcing, and language learners get some hands-on experience and can learn a language for free.

Both reCAPTCHA and Duolingo are very popular, which in my opinion is due to the clever principles or "synergetic benefits" they are built upon:

2 for 1 - solving two problems with one action
Win-win - both sides benefit from the synergy
Recycling - reusing existing processes and solutions

Background

You could also describe reCAPTCHA and Duolingo as "interactive knowledge transfer platforms", and if you think of more traditional ways of transferring knowledge, you can't get around libraries. In fact, I have been looking into "synergetic benefits" as part of my research on open source software development for the ETH Library Lab. The Library Lab aims to rethink and advance information infrastructures and information cycles for science, research and education, which is why I took a closer look at open source software and its increasing importance as a digital public infrastructure.

Here is some background information, so we are all on the same page: the number of users and repositories on GitHub, the most prominent code sharing platform, has been increasing almost exponentially over the last decade, reaching more than 31 million developers and over 96 million repositories according to the State of the Octoverse 2018. An increasing number of companies use or release open source software, and according to the 2015 Black Duck/North Bridge "Future of Open Source" survey, 78% of companies operate some or all of their systems on open source software. Moreover, the Stack Overflow developer survey 2019 states that 41% of the developers polled improved their skills or learnt new ones by contributing to open source software.

This increasing amount of open source software development can only be achieved with sufficient developers and maintainers working on it. Statistics show that as technology becomes increasingly present and important in our lives, the number of people wanting to learn coding has increased similarly to the demand for open source software. Surrounded by devices running and executing code all the time, the desire and, of course, the need to understand and design such systems is becoming more prevalent. Nowadays, the go-to place for code-learning is the internet. According to the aforementioned Stack Overflow survey, more than 60% of the developers have taken an online course or worked through online tutorials in order to acquire new coding skills.

All about code

With a growing amount of software being developed, two factors become increasingly important:

The ability to read, navigate and understand foreign code
Code documentation

As software projects become bigger in terms of lines of code and more and more people collaborate on projects, developers need to be increasingly able to understand other people's code. This skill is far from being trivial and takes practice, much like when you were first learning to read. There are some interesting articles and forum posts about techniques and methods on how to navigate through and read foreign code in order to grasp its structure and understand its functionality. Unfortunately, as most online courses and tutorials lack in-depth exercises and guidance on how to do so properly, the transition from online tutorials to real coding projects can be harsh. In my opinion, the integration of code reading tasks and exercises in coding tutorials would be beneficial, but more on this later.

On the topic of code documentation, the Open Source Survey states that the most common problem in open source is incomplete or confusing documentation. With an increasing amount of software becoming available thanks to the open source ideology, the concern has shifted from whether a certain software solution exists towards how usable that solution is. Poor documentation raises the problem of software becoming unusable or being used in a wrong or insecure way because it has not been understood correctly. With open source software becoming public infrastructure, it is important that software projects are well documented and maintained. This is particularly crucial for critical projects which a lot of people and companies rely on. Improving the quality and quantity of documentation also improves the maintainability of a project, as developers can join in and understand a project much more quickly. Unfortunately, with developers spending most of their time coding, debugging and solving issues, documentation is often the lowest priority and thus many projects are poorly documented, if at all.

The beneficial synergy

Let's wrap everything up and you will see how all these topics are related. As I am sure you will have noticed, we are dealing with two problems, learning to read code and improving open source code documentation, and ideally we would solve them simultaneously so that both sides benefit. As part of our work at the Lab, we came up with a system based on "synergetic benefits" that aims to solve these two problems, and I would like to share it with you as an example of how different work and knowledge flows can be combined in order to generate added value.

The system consists of a distribution platform, to which open source projects and online coding tutorials can connect. Similar to the crowdsourced translation service provided by Duolingo, open source projects can upload code to the platform and request crowdsourced documentation. The platform provides a standard API for online coding tutorials, through which they can pull code and integrate it in the learning experience as code reading exercises for their users. Once the code has been understood, people are asked to provide the documentation they would have needed in order to understand the given piece of code. Through an internal crowdsourced review process, the documentation is improved and then returned to the distribution platform, where it can be retrieved and integrated by the open source project.
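To make the workflow above more concrete, here is a minimal in-memory sketch in Python. It is only an illustration under stated assumptions, not the actual design of the platform: the class and method names (DistributionPlatform, submit_code, pull_exercise, submit_documentation, upvote, retrieve_documentation) are hypothetical, and the crowdsourced review is reduced to a simple vote threshold.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Draft:
    """One learner-written documentation draft plus its review votes."""
    author: str
    text: str
    votes: int = 0


@dataclass
class Snippet:
    """A piece of code an open source project wants documented."""
    snippet_id: int
    project: str
    code: str
    drafts: List[Draft] = field(default_factory=list)


class DistributionPlatform:
    """In-memory stand-in for the distribution platform (hypothetical API)."""

    def __init__(self, votes_needed: int = 2) -> None:
        self.votes_needed = votes_needed  # assumed review threshold
        self._snippets: Dict[int, Snippet] = {}

    def submit_code(self, project: str, code: str) -> int:
        """Open source project side: upload code and request documentation."""
        snippet_id = len(self._snippets) + 1
        self._snippets[snippet_id] = Snippet(snippet_id, project, code)
        return snippet_id

    def pull_exercise(self) -> Optional[Snippet]:
        """Tutorial side: fetch a snippet that has no accepted documentation."""
        for snippet in self._snippets.values():
            if self.retrieve_documentation(snippet.snippet_id) is None:
                return snippet
        return None

    def submit_documentation(self, snippet_id: int, author: str, text: str) -> None:
        """Learner side: hand in the documentation they would have needed."""
        self._snippets[snippet_id].drafts.append(Draft(author, text))

    def upvote(self, snippet_id: int, author: str) -> None:
        """Review step, simplified to one upvote for the given author's draft."""
        for draft in self._snippets[snippet_id].drafts:
            if draft.author == author:
                draft.votes += 1

    def retrieve_documentation(self, snippet_id: int) -> Optional[str]:
        """Project side: best draft that reached the vote threshold, if any."""
        drafts = self._snippets[snippet_id].drafts
        accepted = [d for d in drafts if d.votes >= self.votes_needed]
        if not accepted:
            return None
        return max(accepted, key=lambda d: d.votes).text


# Example round trip: project -> tutorial -> learners -> project
hub = DistributionPlatform(votes_needed=2)
sid = hub.submit_code("example-project", "def add(a, b):\n    return a + b")
exercise = hub.pull_exercise()
hub.submit_documentation(exercise.snippet_id, "learner_a", "Return the sum of a and b.")
hub.upvote(sid, "learner_a")
hub.upvote(sid, "learner_a")
print(hub.retrieve_documentation(sid))
```

A real platform would expose these operations through a web API and use a richer review process, but the round trip stays the same: code goes in from the project, comes out as an exercise, and returns as reviewed documentation.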

Setting up the system in the described way comes with the following direct benefits:

Code learners working on the open source code learn to read code and get used to working with other people's code.
By formulating the documentation for the given piece of code, they can test and consolidate their understanding of the code.
Undocumented open source projects will be provided with documentation, making them more accessible and usable.
Documentation will be improved by being written or extended by people who are not part of the project and thus have an outside view on the matter.
Documentation will become more user-friendly and especially more beginner-friendly.

The scientific community is seriously discussing the possibility of publishing the code used to obtain results alongside research papers in order to render results verifiable and speed up research processes. The system presented would indirectly benefit this, since less time would have to be spent on working through poorly documented or undocumented code and more time could be spent on actually working and experimenting with the code.

This raises the question as to why online learning tutorials would not directly pull open source code from GitHub repositories and then return it, and why you would need an extra intermediate platform for this. Neither open source projects nor coding tutorials are specialised for such interactions, so they would probably be on a small scale and relatively inefficient. An independent intermediary allows for multi-platform interactions, such that any coding platform out there has the possibility to integrate open source code into its learning experience. Moreover, the platform fulfils the task of sorting the received code in order to provide the choice of different levels of difficulty and to make sure online tutorials receive code that matches their topics (machine-learning code to machine-learning tutorials and Java code to a Java tutorial, for example). I see the biggest potential here in providing a standardised access point to make sure the knowledge exchange is made as easy as possible and to enable the approach to scale well.
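The sorting task described above could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the names CodeSample, difficulty and select_for_tutorial are hypothetical, and the difficulty heuristic (counting non-empty lines) is a deliberately crude placeholder for whatever measure a real platform would use.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CodeSample:
    """Code received from an open source project, tagged with a topic."""
    project: str
    topic: str  # e.g. "machine-learning" or "java"
    code: str


def difficulty(sample: CodeSample) -> str:
    """Placeholder heuristic: more non-empty lines means a harder exercise."""
    lines = [line for line in sample.code.splitlines() if line.strip()]
    if len(lines) <= 10:
        return "beginner"
    if len(lines) <= 40:
        return "intermediate"
    return "advanced"


def select_for_tutorial(samples: List[CodeSample], topic: str, level: str) -> List[CodeSample]:
    """Return only the samples that match a tutorial's topic and level."""
    return [s for s in samples if s.topic == topic and difficulty(s) == level]


# Example: a Java tutorial for beginners only receives the short Java sample
samples = [
    CodeSample("project-a", "java", "class A {}"),
    CodeSample("project-b", "machine-learning",
               "\n".join(f"x{i} = {i}" for i in range(20))),
]
for sample in select_for_tutorial(samples, topic="java", level="beginner"):
    print(sample.project)
```

Keeping this matching logic in the intermediary means each tutorial only has to state its topic and level, rather than inspect every repository itself.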

All about collaboration

This approach can only develop its full potential by building on the many great resources that already exist, which is why the system is entirely based on extending existing structures and collaborating with open source projects and online tutorials. This seems to be a promising field for classical institutions looking to extend their reach beyond traditional services and remain relevant in the future. I am convinced there are many other problems that can be solved efficiently through the use of "synergetic benefits" and I am looking forward to seeing what other people come up with.

In the spirit of collaboration, if you have remarks or ideas for improvements, constructive criticism or any other feedback, please get in touch (magnus.wuttke@librarylab.ethz.ch).