Gauging the performance of a model in Natural Language Processing is notoriously difficult. Facebook has launched Dynaboard, which ranks state-of-the-art language models such as BERT, RoBERTa, ALBERT, T5, and DeBERTa on four common NLP tasks:
- Natural Language Inference
- Question Answering
- Sentiment Analysis
- Hate Speech Detection
To evaluate models on these tasks, a new performance measure called Dynascore was created. It takes several metrics into consideration:
- Accuracy - the percentage of examples the model gets right
- Compute - to account for computation, the number of examples a model can process per second on its instance in the evaluation cloud
- Memory - memory usage averaged over the duration the model is running, with measurements taken every N seconds
- Robustness - how much a model's predictions change after perturbations are added to the examples
- Fairness - the original datasets are perturbed by, for instance, changing noun phrase gender (e.g., replacing "sister" with "brother", or "he" with "they") or substituting names with ones statistically predictive of another race or ethnicity. For Dynaboard scoring, a model is considered more "fair" if its predictions don't change after such a perturbation
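The robustness and fairness checks above share the same shape: perturb each example, re-run the model, and count how often the prediction stays the same. Here is a minimal sketch of the fairness variant; `toy_model`, the word-swap table, and the examples are all made up for illustration, not taken from Dynaboard itself.

```python
# Hypothetical word swaps for gender-based fairness perturbations.
GENDER_SWAPS = {"he": "they", "she": "they", "sister": "brother", "brother": "sister"}

def perturb(text: str) -> str:
    """Replace gendered tokens with swapped or neutral alternatives."""
    return " ".join(GENDER_SWAPS.get(tok, tok) for tok in text.split())

def fairness_score(model_predict, examples):
    """Fraction of examples whose prediction is unchanged after perturbation."""
    unchanged = sum(model_predict(x) == model_predict(perturb(x)) for x in examples)
    return unchanged / len(examples)

# Toy stand-in for a sentiment model: "positive" if the text contains "good".
toy_model = lambda text: "positive" if "good" in text else "negative"

examples = ["he had a good day", "my sister is tired"]
print(fairness_score(toy_model, examples))  # 1.0 - the toy model ignores gender terms
```

A robustness score works the same way, with the perturbation function adding typos or paraphrases instead of demographic swaps.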
Dynascore is calculated by assigning different weights to these metrics and combining them, with the weighting depending on the type of task. Previously, the tasks mentioned above, which form Dynabench, were evaluated statically; Dynaboard has made this process dynamic.
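To make the weighted combination concrete, here is a simplified sketch. The Dynascore paper describes a more principled aggregation than a plain weighted average, so treat this as an illustration of the idea only; all metric values and weights below are invented.

```python
def weighted_score(metrics: dict, weights: dict) -> float:
    """Weighted average of metric values; weights are normalized to sum to 1."""
    total = sum(weights.values())
    return sum(metrics[name] * w / total for name, w in weights.items())

# Hypothetical normalized metric values (higher is better) for one model.
metrics = {"accuracy": 0.91, "compute": 0.70, "memory": 0.80,
           "robustness": 0.85, "fairness": 0.95}

# Hypothetical weighting for a task that prioritizes accuracy.
weights = {"accuracy": 4, "compute": 1, "memory": 1,
           "robustness": 2, "fairness": 2}

print(round(weighted_score(metrics, weights), 3))  # 0.874
```

Changing the weights per task is what lets the same five metrics produce different leaderboard orderings for, say, question answering versus hate speech detection.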
The objectives achieved by Dynaboard are:
- Backwards Compatibility
- Forward Compatibility
- Prediction Costs
To learn more about Dynaboard, read the official Facebook blog post; for further implementation details, read the paper.