Principles in practice (2 Part Series)
How the shared principles behind cookie speed eating and choosing the right box allow us to make optimum decisions as well as leveraging the power of crowdsourced improvements to open source projects
When I was a child, my friends and I used to come up with challenges like eating a yoghurt hands-free or cookie speed-eating. We all had 30 seconds to think of a strategy and then the challenge started. Cookie crumbs flew left and right and the yoghurt rarely ended up where it was supposed to, landing instead in noses, eyes and even ears. The initial strategies we devised were rarely successful and so the real challenge was to constantly improve on them or invent even better ones. We did so by trial and error while attempting to achieve some success with the best strategy we'd come up with in the meantime. To take the cookie challenge as an example, given the size, amount and dryness of the cookies, it was common to alternate between taking small bites in quick succession and gobbling them down whole in order to determine which strategy was most effective.
In its general form, this approach is used widely to solve any problem with an as yet unknown optimum solution, not just by children playing around. This trial-and-error approach, seeking to maximise the gains over a defined period of time, is a common learning paradigm. It plays a central role in the machine learning discipline of reinforcement learning, and is generally referred to as exploration-exploitation.
Let us consider the following example to illustrate the general problem more closely: you have been invited to participate in a game show and are faced with ten boxes filled with different amounts of money. The game consists of ten rounds. During each one, you must choose one of the boxes to discover and receive its previously unknown contents. The now empty box is filled with the same amount as before and you get to play again. Your goal is evidently to obtain as much money as possible. Intuitively, given the constraint of having only ten rounds, we would seek to choose the combination of boxes which would yield the largest amount of money, but the problem is that we still do not know the amounts in each box. The only way to explore the contents of a box is to actually choose it and discover its contents. How do you know if the chosen box is the one with the highest amount of money in it? Only by knowing the contents of the other boxes. Once you have determined the desired box, you can exploit your knowledge by repeatedly choosing that box.
Technically speaking, you want to maximise the total gains over a finite amount of time by choosing the right actions from a vast action pool. Every new action (e.g. choose box no. 7) results in a well-defined but previously unknown outcome, its gains (e.g. the amount of money in box no. 7). You want to choose actions so that the sum of their gains is maximised, and we say that an action performs well if it results in higher gains than the majority of the remaining actions. The exploration-exploitation paradigm is a strategy that attempts to maximise the total gains. It performs exploration steps which harvest information about the available actions, i.e. the action pool, and exploitation steps which seek to maximise the gains using the information acquired:
- Exploration is defined as choosing unknown actions and thus determining their gains. You discover how well, or poorly an action performs, and you become increasingly able to assess the average gains resulting from actions. This allows you to estimate whether an action performs well or poorly compared to the overall action pool without knowing the gains resulting from every single action.
- Exploitation is defined as choosing the action that, in your view, performs the best.
Deciding when to explore and when to exploit is not an easy task, and is subject to research in reinforcement learning. A fundamental trade-off underlying exploration and exploitation exists, which is sometimes termed the exploration-exploitation dilemma. The more you explore, the less you exploit and vice versa. This is a result of the fact that you can either explore or exploit, but cannot do both simultaneously. An entire theory on how to best approach this in different settings exists, but, essentially, it all comes down to intuition: if you're lacking concrete information about the actions, you explore. Once you have enough information and think you have found the best or at least a well performing action, you exploit.
You have received another invitation to participate in the game show, but this time the rules are slightly different: in each round you can either open a new box or choose to increase the amount of money in a box which is already open. If you decide on the latter, a random amount is added to the money in that box. At the end of the ten rounds, you receive the box with the highest amount of money in it (of course only opened boxes are considered). In order to maximise the amount of money you receive at the end of the game, you want to find the box with the highest-value contents as quickly as possible in order to have a maximum timespan left to increase the amount of money in that box.
Generally speaking, you are no longer interested in maximising the total gains, but instead in maximising the individual gains resulting from your actions. This means that actions no longer have fixed gains, but you are able to randomly improve their performance. In addition to the a priori unknown gains resulting from the actions, you are unaware, beforehand, of how good the improvements will be. Therefore, let us define exploration-optimisation as a variant of exploration-exploitation:
- Exploration is defined as choosing a new action and determining its gains.
- Optimisation is defined as choosing a known action and improving on it.
Exploration-optimisation comes with the same trade-off as exploration-exploitation, since one cannot simultaneously explore and improve. The following section investigates the structure of a system which relies on an exploration and optimisation approach in order to achieve its goals. After applying the principle, we will make a connection between the actual system and the principle of exploration-optimisation.
In my last post, Learning to code by creating open source documentation, I introduced an approach describing how to use beneficial synergies to improve the creation of open source code documentation and the learning experience of online coding tutorials. Open source code is integrated into online coding tutorials as reading exercises so that the tutorial's users learn to read and understand code written by other people. In return, the learner's task is to provide documentation relating to the piece of code they have been working on, which can subsequently be integrated into the open source projects to enhance usability and user-friendliness.
Specifically, the intermediary system facilitating this beneficial synergy aims to create a set of different documentation versions for a given piece of software in order to increase the odds of producing accurate and understandable documentation. After obtaining several different versions of documentation for the respective software, we must determine which of these is best. However, it would be unrealistic to assume that these initial documentation versions are already as accurate and comprehensible as we would like them to be, so we seek to improve them further. It follows that we aim to improve existing documentation, simultaneously determining how well it documents the respective code.
The main goals are as follows:
- Creation of different versions of documentation for the given code
- Systematic improvement on the documentation available
- Rating the various documentation versions to determine the highest-performing solution
The intermediary system consists of two distinct parts which include mechanisms designed to achieve the above mentioned goals. The first part is a solution generator responsible for the documentation's initial creation. The second part, a feedback loop, is responsible for rating and improving on the different versions of the documentation. An in-depth look at these two core components of our system and the mechanisms they include is provided below.
Furthermore, we will assume that the Input step has already occurred, consisting of open source code being uploaded to the intermediary platform and tagged appropriately (e.g. programming language, dependencies, level of difficulty, etc.). Later on, the tags are used by the coding tutorials to navigate and filter the available code. Every time a piece of code is requested by a coding tutorial, the Distribution step takes place and the intermediary system has to decide whether to launch the solution generator or the feedback loop.
If the intermediary system decides to generate new solutions in order to expand the solution space, the distribution step constitutes the intermediary systems providing open source code to the requesting coding tutorial. In that case, the solution generator, consisting of the following steps, is launched:
- Integration step: online coding tutorials integrate the open source code in the form of reading and documentation exercises.
- Creation step: several learners work in parallel on the same pieces of code and create the required documentation to the best of their abilities.
- Collection step: the different versions of the user-generated documentation are returned to the intermediary system and stored as part of the solution space.
While the Integration and Creation steps take place remotely at the coding tutorial, the Collection step involves the coding tutorials as well as the intermediary system.
If, upon request, the intermediary system possesses enough different documentation versions, the feedback loop will be triggered and the Distribution step constitutes the provision of one of the existing documentation versions along with the code. As previously mentioned, the main purpose of the feedback loop is the rating and improvement of the documentation provided, and its structure appears as follows:
- Integration step: online coding tutorials integrate the open source code as well as the documentation provided within their learning experience in the form of exercises.
- Rating step: the learners are asked to study the code and rate the documentation provided with a performance score (e.g. from 0 to 100), expressing their opinion on how understandable and accurate the documentation is.
- Refinement step: the learners are asked to improve on the documentation according to their own needs.
- Recollection step: the performance score and the improved documentation are returned to the intermediary platform, where the corresponding documentation is assigned its performance score and is updated to the improved version.
- Discarding step: the system searches for documentation that has failed to increase its performance score over the last couple of iterations and discards that documentation.
The Rating and Refinement steps take place remotely at the coding tutorials, while the Recollection step involves both the tutorials and the intermediary system, and the Discarding step takes place internally within the intermediary system once all the information has been re-collected.
Note that we not only use the performance score as the final criterion to determine the best documentation, but also to track the evolution of a given documentation version and its improvements. Each time the documentation is improved on, the performance score is determined and its value sequence indicates whether the performance of the given documentation has increased or stagnated over time. When documentation stagnates, it is discarded and new initial documentation can be generated.
After this brief introduction to the mechanisms used by our intermediary system, let us take a step back in order to relate the abstract concept of exploration-optimisation to the concrete steps of the solution generator and the feedback loop.
By defining our set of different documentation versions as the action pool and the gains of an action as its performance score (describing how well that documentation fits the code), we can, in formal terms, describe our search for effective documentation as an exploration-optimisation problem:
- Exploration is implemented by the solution generator, and constitutes the creation of new solutions. This is required as an initialisation process and after poorly-performing solutions have been discarded.
- Optimisation is implemented by the feedback loop and primarily constitutes the rating of documentation verisions to determine their gains as well as of the improvement of said documentation. Part of the optimisation process is the decision to discard poor solutions in order to allow for increased exploration.
In theory, exploration-optimisation can be used to systematically improve on any kind of user-generated solution by defining the action pool appropriately, provided that mechanisms to determine the gains of your actions exist, as well as the means to alter these accordingly. Thus, exploration-optimisation provides a conceptual and abstract framework for building systems which aim to determine optimum solutions for a given problem in an iterative manner. The intermediary system I have described in this article is a concrete example of how to transition from the conceptual framework to a real system. I believe its mechanisms and functionalities can be extrapolated and applied to a vast range of similar problems. I look forward to seeing the solutions and approaches involving exploration-optimisation developed by others in future.
In the spirit of collaboration, if you have comments or ideas for improvements, constructive criticism or any other feedback, please get in touch (firstname.lastname@example.org).