Analyzing correlations of Codeforces users and their ratings

By Caio Nardelli

A social network and platform for competitive programming, Codeforces is a popular place for both beginners and masters alike. Users can participate in regularly held contests to climb the ladder and increase their rating.

While not a regular participant myself, I have always been involved in competitive programming through my former university and the friends I made there.

Recently I was browsing the website and noticed that most high-rated users seemed to have had better results early in their participation than most average-rated users.

[Figure: Rating over time for high-rated user Benq]

Codeforces API

Wondering if this perception could be backed by data, I decided to see if there were any interesting statistics to be found through Codeforces' API, using Python and lovely scientific packages like pandas, numpy, scipy and matplotlib.

With user.ratedList I gathered a list of all rated users, and with user.rating I gathered the full rating-change history of each user. As a side note, I cached these responses locally to avoid repeating roughly 30k requests over and over again.
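The fetch-and-cache step could look roughly like the sketch below. It is not the repository's actual code: the names `cached_call`, `fetch_from_api` and `CACHE_DIR`, the injectable `fetch` parameter, and the choice to include the request parameters in the file name are all assumptions made for illustration. Only the API root and the method names user.ratedList and user.rating come from Codeforces itself.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path

API_ROOT = "https://codeforces.com/api"
CACHE_DIR = Path("cache")  # hypothetical location for cached responses


def fetch_from_api(method, params):
    """Call a Codeforces API method and return the decoded JSON body."""
    query = urllib.parse.urlencode(params)
    url = f"{API_ROOT}/{method}?{query}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def cached_call(method, params=None, fetch=fetch_from_api, cache_dir=CACHE_DIR):
    """Return the cached response for this call, requesting it only on a miss."""
    params = params or {}
    # Assumption: one file per (method, parameters) pair,
    # e.g. user.rating_handle=Benq.json
    suffix = "_".join(f"{key}={value}" for key, value in sorted(params.items()))
    name = f"{method}_{suffix}.json" if suffix else f"{method}.json"
    path = Path(cache_dir) / name
    if path.exists():
        return json.loads(path.read_text())
    data = fetch(method, params)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data))
    return data
```

With something like this, `cached_call("user.rating", {"handle": "Benq"})` would hit the network only the first time and read from disk afterwards.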

The code used for this can be found in the GitHub repository CaioIcy / codeforces-analysis.


Visualization

To see whether there was any merit to my initial hypothesis, I plotted every user who had ever reached the first division (max rating >= 1900), with the x-axis being the user's maximum rating and the y-axis being the number of contests they took part in before reaching div1 for the first time:

[Figure: Scatter plot, number of contests to reach div1 x max rating]
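The two values plotted for each user can be derived from a rating history along these lines. This is a minimal sketch: the function names are hypothetical, and it assumes each entry carries the `newRating` and `ratingUpdateTimeSeconds` fields of a Codeforces rating change, sorting by time rather than relying on the order of the response.

```python
DIV1_THRESHOLD = 1900  # first-division cutoff used in the article


def contests_to_reach_div1(rating_changes, threshold=DIV1_THRESHOLD):
    """Return how many rated contests it took to first reach `threshold`.

    Returns None if the user never reached the threshold.
    """
    ordered = sorted(rating_changes, key=lambda c: c["ratingUpdateTimeSeconds"])
    for count, change in enumerate(ordered, start=1):
        if change["newRating"] >= threshold:
            return count
    return None


def max_rating(rating_changes):
    """Highest rating the user held after any contest."""
    return max(change["newRating"] for change in rating_changes)
```

Users for whom `contests_to_reach_div1` returns None never reached div1 and would be left out of the plot.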

Interesting!

Glancing at the scatter plot, there seems to be a correlation. Out of curiosity, I searched for ways to measure how strongly two variables are correlated and landed on Spearman's rank correlation coefficient. Using scipy's spearmanr, the coefficient comes out at about -0.21. The negative sign means that users with a higher maximum rating tended to need fewer contests to reach div1, and the magnitude indicates an observable, albeit moderate, correlation, as seen in the table below:

[Table image: Reading the correlation coefficient. Excerpt from Nonparametric Statistics for Non-statisticians: A Step-by-Step Approach]
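To make the coefficient less of a black box, here is a dependency-free sketch of what Spearman's coefficient computes: the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. The article itself used scipy's spearmanr; the helper names and the sample numbers below are made up for illustration and are not the article's data.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up sample: higher max rating paired with fewer contests to reach div1.
sample_max_ratings = [1900, 2100, 2400, 3000]
sample_contests = [40, 25, 12, 5]
rho = spearman(sample_max_ratings, sample_contests)
```

Because the sample is perfectly monotone decreasing, `rho` comes out at -1; real data with a lot of scatter, like the plot above, lands much closer to zero.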

To have another perspective on the data, I split the users into buckets based on their maximum rating to be able to visualize the data in a less cluttered way than the scatter plot:

[Figure: Users grouped into buckets by maximum rating]
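The bucketing could be done roughly as follows. This is only a sketch under assumptions: the article does not state its bucket ranges or which summary it plotted, so the 200-point width, the median as the per-bucket summary, and the function names are all hypothetical.

```python
from collections import defaultdict
from statistics import median

BUCKET_WIDTH = 200  # hypothetical width; the article does not state its ranges


def bucket_label(rating, start=1900, width=BUCKET_WIDTH):
    """Label of the rating range containing `rating`, e.g. '1900-2099'."""
    low = start + ((rating - start) // width) * width
    return f"{low}-{low + width - 1}"


def median_contests_by_bucket(users):
    """Summarise (max_rating, contests_to_div1) pairs per rating bucket."""
    buckets = defaultdict(list)
    for max_rating, contests in users:
        buckets[bucket_label(max_rating)].append(contests)
    # Four-digit labels sort correctly as plain strings.
    return {label: median(values) for label, values in sorted(buckets.items())}
```

Each bucket then contributes a single bar or box to the plot instead of hundreds of overlapping points.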


Conclusion

Indeed, users in higher rating ranges usually had slightly better performances in their early contests than users in lower rating ranges, but why? Well, I don't know! There could be a myriad of reasons, and I would like to hear what you think!

And remember, if you blasted through your early contests on Codeforces, it does not mean that you will have a breezy time moving forward. Keep studying:

[Image: Relevant XKCD]

Thank you for reading!


References

Spearman's Rank-Order Correlation
How to Calculate Nonparametric Rank Correlation in Python
Nonparametric Statistics for Non-statisticians: A Step-by-Step Approach

Discussion

stringray (se7enwonder)

How did u cache the response?

Caio Nardelli (Author)

I saved each response locally as a JSON file named after the API method I was calling. A wrapper method either requested from the API and cached the result, or returned the cached response, as can be seen here: github.com/CaioIcy/codeforces-anal...