This is a reposting of an article of mine that originally appeared on June 26, 2016 in the IBM Watson blog on IBM developerWorks. That blog was later removed because many of its articles were specific to products that are no longer available. However, this particular post addresses a more general topic that I believe still applies to more recent systems, so I am reposting it.
The content in this reposting is also updated slightly. You can see the original in the web archive.
Many Watson Developer Cloud services and other systems that employ machine learning approaches provide a numerical score for how confident the system is in some result. For example, the query API in the IBM Watson Discovery service provides a confidence for each search result it returns. Similarly, the IBM Watson Natural Language Classifier also provides a confidence score as do Watson services for language identification, speech processing, entity and relation detection, etc.
Some of these confidence scores may have some meaningful interpretation as a probability (e.g., a probability that some result is “correct” or “relevant”). Others may not. You cannot assume that all results with a very high confidence score are very good and all results with a very low confidence score are very bad. However, you can generally assume that results with a higher confidence score are more likely to be better.
One use of confidence scores is setting a threshold for when to act. For example, imagine a specialized authoring tool that monitors what the author is writing and interrupts the author with relevant information if and only if it is extremely confident that it has some information that is relevant. Such an application needs to be very precise and should thus have a very high confidence threshold. In contrast, imagine a more typical search application where a user types in a query in a search box and waits for search results to come back. In that type of application, users definitely want to see some results, so such a system may be better with a much lower threshold or no threshold at all.
The ideal approach to selecting a threshold is:
- Assign numerical reward to each possible outcome. For example, for the authoring tool described earlier, we might decide that interrupting with a highly relevant document is worth $0.07, that interrupting with a moderately relevant document is worth $0.02, that interrupting with a non-relevant document is worth -$0.15 (i.e., it has a very high cost), and that not interrupting is worth 0.
- Run a large set of queries for which you know what the relevant responses are. These should be queries that are not in the set you use to train the system.
- For each possible threshold (e.g., each number that is the confidence of at least one instance in your set), compute the net reward for the system at that threshold.
- Select the threshold that has the greatest net reward.
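The steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the outcome labels, the sample data, and the function names `net_reward` and `best_threshold` are all hypothetical, and the sketch assumes the system acts when the confidence is greater than or equal to the threshold.

```python
from typing import Dict, List, Tuple

# Rewards from the authoring-tool example; "none" means not interrupting.
REWARDS: Dict[str, float] = {
    "high": 0.07, "moderate": 0.02, "irrelevant": -0.15, "none": 0.0,
}

def net_reward(results: List[Tuple[float, str]], threshold: float,
               rewards: Dict[str, float]) -> float:
    """Total reward over a labeled set when acting only at or above the threshold."""
    return sum(rewards[label] if conf >= threshold else rewards["none"]
               for conf, label in results)

def best_threshold(results: List[Tuple[float, str]],
                   rewards: Dict[str, float]) -> Tuple[float, float]:
    """Return the (threshold, net reward) pair with the greatest net reward."""
    # Assumption: candidates are the observed confidences, plus infinity
    # to represent "never act".
    candidates = sorted({conf for conf, _ in results}) + [float("inf")]
    return max(((t, net_reward(results, t, rewards)) for t in candidates),
               key=lambda pair: pair[1])

# Made-up held-out results: (confidence, relevance label of the best document).
held_out = [
    (0.95, "high"), (0.90, "high"), (0.80, "moderate"), (0.70, "irrelevant"),
    (0.60, "high"), (0.40, "irrelevant"), (0.30, "irrelevant"),
]
threshold, reward = best_threshold(held_out, REWARDS)
print(threshold, round(reward, 2))
```

On this made-up data the sketch selects a threshold of 0.8: lowering it further would let through the non-relevant document at 0.70, whose cost outweighs the relevant document at 0.60.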
However, there are several reasons why attaining this ideal might be challenging. Assigning numerical rewards to outcomes is very hard and often data is not available to make these assignments in an informed way. Also, you may not always have a large set of requests for which you know what the relevant responses are. Below are more details on how to deal with these issues.
For some applications, it may be possible to compute meaningful rewards from observations of user behavior. For example, if you know that presenting a relevant result to a user leads 7% of the time to selling a product with an average profit of $22, then the reward for a relevant result is 7% of $22, or $1.54.
In some cases, the outcomes may not have a directly measurable cash benefit but may still result in something that is desirable. For example, for some application you may have logs that indicate what queries were asked, what responses the system gave, and whether the user used the application again on the next day. If you have an expert go through a random sample of those logs and mark up which responses are good and which are bad, you can compute how the quality of the responses correlates with the user returning. That can allow you to compute some expected impact on whether the user will return for each outcome, which you can use as a reward for that outcome. If there is a big negative impact from a bad response but no corresponding negative impact for not responding, then you would compute a large penalty for a bad response (relative to not responding) and as a result, you would wind up making the system very cautious (by setting a high threshold). If the negative impact of a bad response is only a little bit worse than that of no response, then you might have a very low threshold. If the negative impact of a bad response is not as bad as that of no response, then you would want a threshold of 0 so that you always respond, since in that application a bad response tends to be better than nothing.
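One simple way to turn such labeled logs into rewards is sketched below. It is only an illustration under strong assumptions: the outcome names and the function `rewards_from_logs` are hypothetical, the sample counts are made up, and the sketch treats the difference in next-day return rate relative to not responding as the reward, which takes the observed correlation at face value.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def rewards_from_logs(sample: Iterable[Tuple[str, bool]],
                      baseline: str = "no response") -> Dict[str, float]:
    """Reward per outcome = its next-day return rate minus the baseline's rate."""
    counts = defaultdict(lambda: [0, 0])  # outcome -> [returned, total]
    for outcome, returned in sample:
        counts[outcome][0] += int(returned)
        counts[outcome][1] += 1
    rates = {o: returned / total for o, (returned, total) in counts.items()}
    return {o: rate - rates[baseline] for o, rate in rates.items()}

# Made-up expert-labeled sample: (outcome, did the user return the next day?).
labeled_sample = (
    [("good", True)] * 8 + [("good", False)] * 2
    + [("bad", True)] * 2 + [("bad", False)] * 8
    + [("no response", True)] * 5 + [("no response", False)] * 5
)
log_rewards = rewards_from_logs(labeled_sample)
print({outcome: round(value, 2) for outcome, value in log_rewards.items()})
```

In this made-up sample a bad response lowers the return rate well below that of not responding, so the resulting rewards would push the system toward a high threshold.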
Many applications cannot compute outcome rewards from user data either because they do not have any user data yet (e.g., because they are under development) or because the user data they have does not provide information about the consequences of the system behavior (e.g., nothing that links requests and responses to sales or retention or anything else that you want). In that case, the long term goal should be to eventually start accumulating relevant user data and using it as described in the previous section. However, in the short term, it makes sense to try to make a best guess at what the rewards should be for different outcomes. One can try to guess the reward values directly, but that is extremely hard to do well; often when you ask experts to do that and they then see what the implications of those guesses are, they find that they are unhappy with those implications. Instead, experts should see as much as possible about what the consequences of different thresholds are so they can choose the set of consequences that seems best.
Here is one approach to letting experts see the consequences of different thresholds so that the system can be optimized to be consistent with the experts’ goals:
1. Run the system on a validation set (a large set of requests for which you know what the relevant responses are) and record which requests had correct responses and how confident those responses were.
2. For an assortment of thresholds (e.g., .04, .08, .12, .16, etc.), compute at that threshold:
   - What percentage of questions are correct at that threshold?
   - What percentage of questions are incorrect at that threshold?
   - What percentage of questions have no response at that threshold?
3. Organize these results into a table or present the results as a graph.
4. Present the table or graph to one or more people who are domain experts and ask which of these combinations would be most desirable.
5. Take the average (or negotiated consensus) most desirable threshold: that is the initial threshold for your system.
6. Compute a set of numerical rewards for outcomes that is consistent with the initial threshold (see details below).
7. Frequently update the threshold (as described below in “Keeping thresholds up to date”) using the rewards that you computed in Step 6 and the “Ideal method” described earlier.
8. Infrequently update the rewards by redoing Steps 1-6. This is done less frequently than updating the threshold because it requires manual judgement from domain experts. In contrast, if you have validation data labeled with which results are good and which results are bad, then you can fully automate the process of computing a threshold from the outcome rewards so it makes sense to rerun it on every substantial update to the system. We assume that the outcome rewards change infrequently because they reflect the goals for the system, not its behavior. In contrast we expect the threshold to change much more often because it is directly dependent on the behavior.
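The first three steps of this procedure (running a validation set, tallying outcomes at an assortment of thresholds, and laying them out as a table) might look like the following sketch. The function name `outcome_table` and the validation data are hypothetical, and the sketch assumes a response is given when the confidence is at or above the threshold.

```python
from typing import List, Tuple

def outcome_table(results: List[Tuple[float, bool]],
                  thresholds: List[float]) -> List[Tuple[float, float, float, float]]:
    """For each threshold, return (threshold, % right, % wrong, % no response)."""
    n = len(results)
    rows = []
    for t in thresholds:
        right = sum(1 for conf, ok in results if conf >= t and ok)
        wrong = sum(1 for conf, ok in results if conf >= t and not ok)
        rows.append((t, 100 * right / n, 100 * wrong / n,
                     100 * (n - right - wrong) / n))
    return rows

# Made-up validation results: (confidence, was the response correct?).
validation = [
    (0.02, False), (0.05, False), (0.06, True), (0.09, False), (0.10, True),
    (0.13, False), (0.15, True), (0.20, True), (0.30, True), (0.50, True),
]
table = outcome_table(validation, [0.04, 0.08, 0.12, 0.16])
print(f"{'Threshold':>9} {'Right':>7} {'Wrong':>7} {'None':>7}")
for t, right, wrong, none in table:
    print(f"{t:>9.2f} {right:>6.0f}% {wrong:>6.0f}% {none:>6.0f}%")
```

Each printed row is one combination that the domain experts can compare against the others when choosing the most desirable threshold.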
It can be very tempting to skip steps 6 and 7 because after step 5 you have a threshold that you like and there does not seem to be an urgent need to do anything more. However, these steps tend to be very important in the long run because keeping the threshold up to date is very important and it is rarely feasible to redo steps 1-5 often enough to keep the threshold where you want it.
One way to compute numerical rewards for outcomes that is consistent with a chosen threshold (step 6 above) is to arbitrarily fix the rewards for all but one outcome, compute the rewards for the remaining outcome that would make the system indifferent between the selected threshold and the ones above and below it, and then average those two. For example, consider the following table in a simple case where there are only three outcomes (in this example, if the confidence is above the threshold then a response is provided that is either right or wrong and if the confidence is not above the threshold then the response is ignored):
|Threshold|Percent Right|Percent Wrong|Percent Ignored|
|---|---|---|---|
|.04|60%|32%|8%|
|.08|58%|26%|16%|
If the experts select the “.08” row then we want to find rewards consistent with that being the best row. We arbitrarily assign a reward of 1 to being right and a reward of 0 to ignoring a request, so we need to compute what the (negative) reward should be for being wrong. The point at which we should be indifferent between thresholds .04 and .08 is the value r1 that satisfies
60%*1 + 32%*r1 + 8%*0 = 58%*1 + 26%*r1 + 16%*0
Solving this equation (6%*r1 = -2%) gives r1 = -0.333. Similarly, the point at which we should be indifferent between .08 and .12 is r2 = -0.167.
We compute the average of r1 and r2 to get the middle of the range of rewards that are consistent with this choice of threshold; in this case, the middle of the range is -0.25. Thus in this example we select the following set of rewards: 1 for being right, -0.25 for being wrong, and 0 for not responding.
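The indifference calculation can be written as a small function, sketched below. The function name `indifference_reward` is hypothetical. The .04 and .08 rows come from the equation above; the .12 row is a made-up placeholder chosen only so that it reproduces the r2 of -0.167 quoted above, and it should not be read as the original table's values.

```python
from typing import Tuple

Row = Tuple[float, float, float]  # (percent right, percent wrong, percent ignored)

def indifference_reward(row_a: Row, row_b: Row,
                        reward_right: float = 1.0,
                        reward_ignored: float = 0.0) -> float:
    """Reward for a wrong response at which rows a and b yield equal net reward."""
    right_a, wrong_a, ignored_a = row_a
    right_b, wrong_b, ignored_b = row_b
    # Solve right_a*R + wrong_a*r + ignored_a*I = right_b*R + wrong_b*r + ignored_b*I for r.
    numerator = ((right_b - right_a) * reward_right
                 + (ignored_b - ignored_a) * reward_ignored)
    return numerator / (wrong_a - wrong_b)

row_04: Row = (60, 32, 8)
row_08: Row = (58, 26, 16)
row_12: Row = (57, 20, 23)  # hypothetical placeholder, see the note above

r1 = indifference_reward(row_04, row_08)
r2 = indifference_reward(row_08, row_12)
reward_wrong = (r1 + r2) / 2
print(round(r1, 3), round(r2, 3), round(reward_wrong, 3))
```

Any reward for a wrong response between r1 and r2 makes the .08 row the best of the three; taking the midpoint is one reasonable way to pick a single value from that range.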
In a more complex example, you may have more kinds of outcomes. For example, a question answering system may use confidence scores to decide whether to “hedge” an answer (e.g., to say something like “I don’t really know the answer to your question but maybe the following is relevant”). Also, some systems may use confidence scores to decide whether to respond with a single answer or multiple answers (it may use the confidence in the single best ranked answer to make this decision or it may also consider the confidence scores of lower ranked answers). The same basic principles apply. First, list possible outcomes (e.g., system does not respond, system responds with a hedge and a single answer and that answer is correct, system responds with three answers and none of the answers is correct, etc.). Then produce examples of thresholds and see how often different outcomes occur at those thresholds. Then have domain experts decide which frequencies of outcomes are most desirable. Finally, find rewards for the outcomes that are consistent with the preferences of the experts.
One example of a tool for plotting graphs (instead of tables) for selecting a threshold is the IBM Watson Performance Evaluation Framework which you can access using sample data or bring your own data. The Performance Evaluation Framework does not fully address some complex cases such as having multiple thresholds for multiple outcomes, but you can download the code and adapt it to your specific needs. The graphs serve the same purpose as the outcome tables, but in a more comprehensive way: you can see the consequences of all possible threshold settings and use that information to choose an optimal threshold. Regardless of whether you use graphs or tables, you should determine a set of rewards that are consistent with this optimal threshold so you can automatically recompute the optimal threshold for those rewards every time you update the system.
This section explains how to enable experts to guess what an ideal behavior of the system should be so that a system can be configured to be consistent with that guess. However, it is still important to remember that guesses from experts are not as good as direct observations from large volumes of real user data. Do not forget that the approach described in the previous section (“Computing outcome rewards from user data”) is the preferred approach. If you do not have the data that would enable that approach, you should try to get that sort of data eventually and to use the approach described in this section only as a temporary measure.
Instead of determining rewards for outcomes, you can use some sort of rule to determine how you want the system to behave and then compute a threshold that is consistent with that behavior. Here are some examples of such rules:
- We want the system to act 60% of the time (and get as many correct results as possible).
- We want the system to get a correct result 85% of the times that it acts (and act as often as possible).
- We want the system to get a correct result to at least 30% of all requests (and get the highest possible percentage of the ones that it responds to correct).
- We want the system to get the maximum possible F1 score. The F1 score is defined to be 2*p*r/(p+r) where p is the fraction of requests for which you act that you get a good result and r is the fraction of all requests for which you get a good result.
For each of these rules, you can run the system on a pool of queries with known responses and compute an optimal threshold for that rule. In principle, rules like this are not as good as the approach described in earlier sections (because they do not guarantee an optimal total reward). However, sometimes domain experts prefer rules like these because it can be more intuitive to understand the consequences and thus easier to guess what a good value might be. These sorts of rules are (at least) much better than simply selecting an arbitrary threshold and keeping it fixed. Once you have a rule like this or a set of rewards as discussed earlier, you can update the threshold automatically on a continual basis as described in the next section.
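As a rough sketch, two of the rules above (a target fraction of correct results among the requests acted on, and the maximum F1 score) could be turned into thresholds as follows. The function names and the sample data are hypothetical, and the sketch assumes the system acts when the confidence is at or above the threshold and only considers observed confidences as candidate thresholds.

```python
from typing import List, Optional, Tuple

Results = List[Tuple[float, bool]]  # (confidence, was the result good?)

def precision_recall(results: Results, threshold: float) -> Tuple[float, float]:
    """p = good results / requests acted on; r = good results / all requests."""
    acted = [ok for conf, ok in results if conf >= threshold]
    good = sum(acted)
    p = good / len(acted) if acted else 0.0
    return p, good / len(results)

def threshold_for_precision(results: Results, target: float) -> Optional[float]:
    """Lowest observed threshold meeting the target, so the system acts as often as possible."""
    for t in sorted({conf for conf, _ in results}):
        if precision_recall(results, t)[0] >= target:
            return t
    return None  # no observed threshold reaches the target

def threshold_for_max_f1(results: Results) -> float:
    """Observed threshold with the highest F1 = 2*p*r/(p+r)."""
    def f1(t: float) -> float:
        p, r = precision_recall(results, t)
        return 2 * p * r / (p + r) if p + r > 0 else 0.0
    return max(sorted({conf for conf, _ in results}), key=f1)

# Made-up pool of queries with known outcomes.
pool = [
    (0.02, False), (0.05, False), (0.06, True), (0.09, False), (0.10, True),
    (0.13, False), (0.15, True), (0.20, True), (0.30, True), (0.50, True),
]
print(threshold_for_precision(pool, 0.85), threshold_for_max_f1(pool))
```

On this made-up pool the two rules give quite different thresholds (0.15 and 0.06), which illustrates why the choice of rule deserves as much attention as the threshold itself.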
In general, it is a bad idea to select a confidence threshold once and then assume that this threshold remains good for all time. Instead, you should update the confidence thresholds whenever you make a substantial change to the system such as adding more content or changing the configuration. You should have some sort of validation process that you run every time you make a substantial change anyway, to verify that you haven’t broken the system (e.g., for a search system, you might have a batch of sample queries that you run to validate that the system still works and produces results of comparable or better quality to the ones you were getting before). It is a good idea to embed the process for computing the confidence threshold into your standard validation process. As much as possible, you should automate both computing and updating the threshold to prevent user error and ensure that it gets done.
The first sample rule listed in the previous section (“we want the system to act 60% of the time…”) is different from the others in that you can compute an optimal threshold for this rule without knowing what the outcomes are (e.g., without knowing which responses are good and which ones are bad). For most purposes, it is a pretty bad rule: in general, you should want your system to answer more often when it is more accurate instead of answering a fixed amount regardless of how accurate it is. However, it does have the advantage that the threshold is easier to compute and easier to keep up to date. In fact, with this rule, you can continuously bring your threshold closer to optimal as you go, e.g., by updating the threshold every time you get a new response to be the threshold that would have caused you to respond to the desired percentage of all past queries, including the one you just responded to. For most other types of rules, you do need information about outcomes (e.g., whether a response was good or bad), which often requires expert labeling.
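A minimal sketch of this continuous update is shown below. The names `coverage_threshold` and `RunningThreshold` are hypothetical, the confidences are made up, and the sketch assumes the system responds when the confidence is at or above the threshold. It re-sorts all past confidences on every update, which is fine for illustration but not for high volumes.

```python
import math
from typing import List

def coverage_threshold(confidences: List[float], target_fraction: float) -> float:
    """Threshold at which roughly target_fraction of the confidences are at or above it."""
    ranked = sorted(confidences, reverse=True)
    k = math.ceil(target_fraction * len(ranked))  # number of responses to let through
    return ranked[k - 1] if k > 0 else float("inf")

class RunningThreshold:
    """Keeps every past confidence and recomputes the threshold after each one."""

    def __init__(self, target_fraction: float) -> None:
        self.target_fraction = target_fraction
        self.seen: List[float] = []

    def update(self, confidence: float) -> float:
        self.seen.append(confidence)
        return coverage_threshold(self.seen, self.target_fraction)

tracker = RunningThreshold(target_fraction=0.5)
for confidence in [0.9, 0.1, 0.5, 0.7, 0.3, 0.2]:
    current = tracker.update(confidence)
    print(confidence, "->", current)
```

Note that no labels appear anywhere in this sketch, which is exactly why this rule is easy to keep up to date and also why it cannot respond to changes in accuracy.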
Regardless of whether you have heuristic rules or outcome rewards, updating your threshold when you do a major update to your system is very important. For example, imagine that you make a change to your system that causes the confidence scores for all responses (good or bad) to be cut in half (and has no other effect). That change should not cause any change in the behavior of the system: the new confidence scores are just as useful for distinguishing between good responses and bad ones as the old ones were. If you automatically update your threshold using any of the heuristic rules discussed in the previous section or by computing an optimal threshold for a set of rewards, you will see that the system behavior is unchanged. However, if you don’t do that and you assume that the numerical threshold that was good for the previous version of your system is still good, then you will see a dramatic change in the behavior, and if the threshold was good before, it will probably be bad now. In practice you generally do not see anything as extreme as all the scores being cut in half, but there is often some amount of drift in scores in one direction or another (that may be conflated with other changes in behavior such as the system being more accurate). Real improvements to a system can easily be counteracted by unintended and meaningless drift in absolute scores if you keep the threshold fixed. In contrast, if you recompute the optimal threshold on each update, you prevent this from happening and wind up with a system that is more stable in its behavior and more steadily improving as the underlying capabilities improve.
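The halved-scores thought experiment can be checked with a few lines of Python. This is only a sketch: the data are made up, the helper names are hypothetical, and the rewards (1 for right, -0.25 for wrong, 0 for no response) are borrowed from the worked example earlier in this article.

```python
from typing import List, Tuple

Results = List[Tuple[float, bool]]  # (confidence, was the response good?)

def best_threshold(results: Results, right: float = 1.0, wrong: float = -0.25) -> float:
    """Observed threshold (or infinity, meaning never respond) with the greatest net reward."""
    candidates = sorted({conf for conf, _ in results}) + [float("inf")]
    def net(t: float) -> float:
        return sum(right if ok else wrong for conf, ok in results if conf >= t)
    return max(candidates, key=net)

def acted_on(results: Results, threshold: float) -> List[int]:
    """Indices of the requests that receive a response at this threshold."""
    return [i for i, (conf, _) in enumerate(results) if conf >= threshold]

old_scores = [
    (0.02, False), (0.05, False), (0.06, True), (0.09, False), (0.10, True),
    (0.13, False), (0.15, True), (0.20, True), (0.30, True), (0.50, True),
]
# The hypothetical system update: every confidence is cut in half.
new_scores = [(conf / 2, ok) for conf, ok in old_scores]

old_t = best_threshold(old_scores)
new_t = best_threshold(new_scores)
print(acted_on(old_scores, old_t) == acted_on(new_scores, new_t))  # recomputed threshold
print(len(acted_on(new_scores, old_t)), "of", len(new_scores))     # stale threshold
```

With the recomputed threshold the same requests get responses before and after the change, while the stale threshold silently drops several of them.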
All of the processes above assume that you have some validation data that you can use to compute thresholds. However, in some cases you have barely enough training data to train your system. One method for addressing this is cross-fold validation:
- Split your data up into some number of “folds” (typically 10 folds). For example, if you are working with a search system and you have 800 queries for which you know what the relevant responses are, you could split that into 10 folds with 80 queries each.
- For each fold, train the system with all of the data that is not in the fold and then apply that system to all of the data that is in the fold and compute an optimal threshold for the data in the fold.
- Take the average optimal threshold as the threshold.
- Finally, train the system with all of the data and deploy it using the optimal threshold that you computed in the previous step.
This is a lot more work than having separate validation data because you have to train the system many times. However, if you don’t have separate validation data, this is a reasonable alternative.
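The fold-by-fold procedure can be sketched as below. Everything system-specific is a hypothetical stand-in: `train` is assumed to take a list of examples and return a function that maps a held-out example to a (confidence, is_correct) pair, and `choose_threshold` is any of the threshold-selection methods discussed earlier. The toy `toy_train` and `lowest_correct` functions exist only to make the sketch runnable.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Scored = Tuple[float, bool]  # (confidence, was the response correct?)

def cross_fold_threshold(examples: Sequence,
                         train: Callable[[list], Callable[[object], Scored]],
                         choose_threshold: Callable[[List[Scored]], float],
                         n_folds: int = 10) -> float:
    """Average the per-fold optimal thresholds, training on the other folds each time."""
    # A strided split; shuffle the examples first if their order is meaningful.
    folds = [list(examples[i::n_folds]) for i in range(n_folds)]
    per_fold = []
    for i, held_out in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        score = train(training)  # hypothetical: returns example -> (confidence, is_correct)
        per_fold.append(choose_threshold([score(ex) for ex in held_out]))
    return mean(per_fold)

# Toy stand-ins so the sketch runs; a real system would train an actual model.
def toy_train(training: list) -> Callable[[object], Scored]:
    return lambda example: example  # the "model" just echoes the stored score and label

def lowest_correct(scored: List[Scored]) -> float:
    return min(conf for conf, ok in scored if ok)

toy_examples = [(i / 20, i >= 8) for i in range(20)]
toy_threshold = cross_fold_threshold(toy_examples, toy_train, lowest_correct, n_folds=4)
print(round(toy_threshold, 3))
```

One caveat with averaging: if the selection method can return "never respond" (for example, an infinite threshold) for some fold, the average is no longer meaningful, so such folds need separate handling.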
Many cognitive systems will not work at all without training data. However, some will work to some extent “out of the box” without any training data; for example, you can use the Retrieve portion of IBM Watson Retrieve and Rank with no training data at first and gather training data over time to train a Ranker and get better results. In general, when you have no training data, it is probably a good idea to not use a threshold at all and present all the results you have to the users (and make sure that you have users who are very tolerant of poor results at this phase of your project). Once you gather a little data, you can use this for training and validating by cross-folding as described in the previous section. When you have even more, you can split that data up into training and validation data.
If the net reward for not responding is worse than the reward for responding with a bad result, then obviously you do not want to use a threshold to decide whether to respond: instead you should always respond no matter what.
At the other extreme, you may have an application where the cost of allowing a bad result through is so high that the optimal threshold involves never responding to any request. If you are in this situation, then either your rewards are wrong or you should not be deploying your system until it is more effective at what it does. In some cases, you may want to adjust the rewards to reflect the fact that there is some benefit to having the system operate so that it can gather some data that you can use for more training and more improvements. Such an adjustment generally involves temporarily increasing the reward for both right and wrong responses (or decreasing the reward for not responding). For example, consider a system where rewards for right, wrong, and not responding should be 1, -20, and 0 in the long run (i.e., there is a huge cost to getting any wrong, so it should only respond when it is extremely certain); if the system is not very precise, the optimal behavior for these rewards may be to never respond. However, if the system is in preliminary alpha testing and really needs to accumulate training data, then maybe the rewards for right and wrong should get a temporary boost of +5 (making the rewards 6, -15, and 0). That makes the system more aggressive. If this allows the system to respond sometimes, it can start accumulating data with a threshold optimized for these adjusted rewards. Once it accumulates enough data, the rewards should be adjusted back to their long-term values. If the optimal behavior is still to never respond, then the system still needs improvement. However, with more training data, you may find that there is now a lower threshold that is optimal because the system is more precise in its high confidence responses.
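The effect of the temporary boost can be illustrated with a short sketch. The validation data and the helper name `best_threshold` are hypothetical; the reward values are the ones from the example above, and an infinite threshold stands for never responding.

```python
from typing import List, Tuple

Results = List[Tuple[float, bool]]  # (confidence, was the response right?)

def best_threshold(results: Results, right: float, wrong: float) -> float:
    """Observed threshold (or infinity, meaning never respond) with the greatest net reward."""
    candidates = sorted({conf for conf, _ in results}) + [float("inf")]
    def net(t: float) -> float:
        # Not responding is worth 0, so only responses contribute.
        return sum(right if ok else wrong for conf, ok in results if conf >= t)
    return max(candidates, key=net)

# Made-up results for an imprecise early system: its most confident response is wrong.
alpha_results = [
    (0.90, False), (0.85, True), (0.80, True), (0.75, True),
    (0.70, True), (0.65, True), (0.30, False), (0.20, False),
]

long_term = best_threshold(alpha_results, right=1, wrong=-20)
boosted = best_threshold(alpha_results, right=6, wrong=-15)
print(long_term, boosted)
```

With the long-term rewards the best choice on this data is never to respond, whereas the boosted rewards yield a finite threshold that lets the system start gathering data.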
For most cognitive systems, it is not possible to set a threshold that accepts some results and have that threshold be “perfect” in either sense: it cannot guarantee that you will never accept a bad result and it cannot guarantee that you will never reject a good one. Instead, a threshold can only adjust the frequency of different outcomes: it can make you provide bad results less often at the cost of occasionally rejecting good ones.
With some exceptions, you should not treat the confidence scores from a cognitive system as a “percentage” indicating degree of match or probability of match. For example, in Retrieve and Rank, a score of 0.8 does not indicate that the query matches the search result 80% (whatever that might mean) or that the search result has an 80% chance of being relevant to the query. For a deeper discussion of the complexities inherent in computing a percentage score for search results, see the article on this topic at the Apache Lucene site (wiki.apache.org/lucene-java/ScoresAsPercentages). If you really want a probability that some search result has some degree of relevance to your query, one reasonable method for computing that probability is as follows:
- Run a large number of queries that are representative of what your users ask and divide the range 0-1 into small intervals.
- For each such interval, examine all the results within that interval and decide whether they are relevant.
- Compute the fraction of those results that are relevant. This is the estimated probability that a result with a score within that interval is relevant.
For example, if you have many results with a score between 0.181 and 0.182 and 13% of them are relevant, then you can conclude that results with a score between 0.181 and 0.182 have a 13% chance of being relevant. For most applications, this is not worth the effort; instead it is sufficient to know that higher scores are better and to have some threshold for deciding how high a score is good enough.
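The interval-based estimate can be sketched as follows. The function name `relevance_by_interval` and the judged results are hypothetical, and the sketch uses ten equal-width intervals purely for readability; a real estimate needs far more judged results per interval to be trustworthy.

```python
from typing import Dict, List, Tuple

def relevance_by_interval(results: List[Tuple[float, bool]],
                          n_bins: int = 10) -> Dict[Tuple[float, float], float]:
    """Map each non-empty score interval (low, high) to its fraction of relevant results."""
    counts = [[0, 0] for _ in range(n_bins)]  # per interval: [relevant, total]
    for score, relevant in results:
        index = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last interval
        counts[index][0] += int(relevant)
        counts[index][1] += 1
    return {(i / n_bins, (i + 1) / n_bins): relevant / total
            for i, (relevant, total) in enumerate(counts) if total}

# Made-up judged results: (score, did a person judge the result relevant?).
judged = [
    (0.81, True), (0.85, True), (0.88, False), (0.83, True),
    (0.12, False), (0.15, False),
    (0.42, True), (0.47, False),
]
estimates = relevance_by_interval(judged)
for (low, high), fraction in sorted(estimates.items()):
    print(f"{low:.1f}-{high:.1f}: {fraction:.0%} relevant")
```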
Many people provided helpful input on drafts of this document. I would particularly like to thank John Prager, Jiri Navratil, Rishav Chakravarti, Anna Chaney, Anish Mathur, Seth Bachman, Scot Taylor, A.J. Morello, Chitra Venkatramani, Raimo Bakis, and Stephan Roorda for their contributions.