Analyzing correlations of Codeforces users and their ratings

A social network and platform for competitive programming, Codeforces is a popular place for both beginners and masters alike. Users can participate in regularly held contests to climb the ladder and increase their rating.

While not a regular participant myself, I have always been involved in competitive programming through my former university and the friends I made there.

Recently I was browsing the website and noticed that most high-rated users seemed to have had better results early in their participation than most average-rated users.

Codeforces API

Wondering if this perception could be backed by data, I decided to see if there were any interesting statistics to be found through Codeforces' API, using Python and lovely scientific packages like pandas, numpy, scipy and matplotlib.

With user.ratedList I gathered a list of all rated users, and with user.rating I gathered the full rating-change history of each user. As a side note, I cached these responses locally to avoid making the same 30k requests over and over again.
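As a rough illustration of that caching idea, here is a minimal sketch using only the standard library. The `user.ratedList` and `user.rating` method names come from the Codeforces API, but the helper names (`call_api`, `http_get_json`, `rating_history`, `rated_users`), the `cache` directory and the one-file-per-request layout are my own assumptions for this example, not necessarily what the original code does.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path

API_ROOT = "https://codeforces.com/api"


def http_get_json(url):
    """Download a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def call_api(method, params, cache_dir, fetch=http_get_json):
    """Return the `result` of an API call, reusing a cached copy on disk if one exists."""
    query = urllib.parse.urlencode(sorted(params.items()))
    # Hypothetical cache layout: one JSON file per (method, parameters) pair.
    file_name = f"{method}_{urllib.parse.quote(query, safe='')}.json"
    cache_file = Path(cache_dir) / file_name
    if cache_file.exists():
        return json.loads(cache_file.read_text())

    payload = fetch(f"{API_ROOT}/{method}?{query}")
    if payload.get("status") != "OK":
        raise RuntimeError(f"{method} failed: {payload.get('comment')}")

    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(payload["result"]))
    return payload["result"]


def rated_users(cache_dir="cache"):
    """All rated users; `activeOnly=false` is assumed here to include inactive ones."""
    return call_api("user.ratedList", {"activeOnly": "false"}, cache_dir)


def rating_history(handle, cache_dir="cache"):
    """Every rating change of a single user."""
    return call_api("user.rating", {"handle": handle}, cache_dir)
```

With something like this, a second run reads everything from disk instead of hitting the network. A real crawl would probably also want a pause between requests to stay within the API's rate limit.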

The code used for this can be found here:

Visualization

To understand if there was any merit to my initial hypothesis, I plotted every user who had ever reached the first division (max rating >= 1900), with the x-axis being a user's maximum rating and the y-axis being the number of contests they had participated in by the time they reached div1 for the first time:
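To make the two axes concrete, here is a small sketch of how one point of that plot could be derived from a single user's rating history. It assumes each rating change is a dict with the `newRating` and `ratingUpdateTimeSeconds` fields returned by user.rating; the function name `summarize_user` is hypothetical.

```python
DIV1_THRESHOLD = 1900  # from the article: max rating >= 1900


def summarize_user(history, threshold=DIV1_THRESHOLD):
    """Return (max_rating, contests_to_div1), or None if the user never reached div1."""
    # Sort defensively so that "first time" really means the earliest contest.
    ordered = sorted(history, key=lambda change: change["ratingUpdateTimeSeconds"])
    ratings = [change["newRating"] for change in ordered]
    for contests_played, rating in enumerate(ratings, start=1):
        if rating >= threshold:
            return max(ratings), contests_played
    return None


history = [
    {"ratingUpdateTimeSeconds": 1, "newRating": 1500},
    {"ratingUpdateTimeSeconds": 2, "newRating": 1750},
    {"ratingUpdateTimeSeconds": 3, "newRating": 1920},
    {"ratingUpdateTimeSeconds": 4, "newRating": 2100},
]
print(summarize_user(history))  # (2100, 3)
```

Each returned pair is one dot: the first value goes on the x-axis, the second on the y-axis. Users who never crossed the threshold return `None` and are left out.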

Interesting!

Glancing at the scatter plot, there seems to be a correlation. Out of curiosity I searched for ways to measure just how correlated two variables really are, and landed on Spearman's rank correlation coefficient. Scipy's spearmanr gives a coefficient of about `-0.21`. The negative sign means that a higher maximum rating tends to go with fewer contests needed to reach div1, so there is an observable correlation, albeit a modest one, as seen in the table below:
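For anyone curious about what that coefficient actually measures, here is a dependency-free sketch: Spearman's coefficient is the Pearson correlation computed on the ranks of the values, with tied values sharing their average rank. The helper names are mine, and the numbers in the usage example are made up; on real data this should agree with the statistic that `scipy.stats.spearmanr` reports.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of 1-based ranks start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (spread_x * spread_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up data: contests needed drop steadily as max rating grows.
max_ratings = [1900, 2100, 2500, 3000]
contests_needed = [40, 25, 12, 5]
print(spearman(max_ratings, contests_needed))  # -1.0
```

A perfectly decreasing relationship like this toy one gives `-1.0`; the real data, with all its scatter, lands much closer to zero.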

To have another perspective on the data, I split the users into buckets based on their maximum rating to be able to visualize the data in a less cluttered way than the scatter plot:
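The bucketing step could look roughly like the sketch below. The 100-point bucket width, the choice of the median as the summary and the function name `bucket_by_max_rating` are all assumptions made for illustration; the sample points are invented.

```python
from collections import defaultdict
from statistics import median


def bucket_by_max_rating(points, width=100):
    """Group (max_rating, contests_to_div1) pairs into rating ranges and summarize each."""
    buckets = defaultdict(list)
    for max_rating, contests in points:
        lower = (max_rating // width) * width  # lower bound of the bucket
        buckets[lower].append(contests)
    return {
        (lower, lower + width - 1): {
            "users": len(values),
            "median_contests": median(values),
        }
        for lower, values in sorted(buckets.items())
    }


# Invented sample points, one per user.
points = [(1910, 40), (1990, 30), (2050, 22), (2450, 10), (2590, 8)]
for rating_range, summary in bucket_by_max_rating(points).items():
    print(rating_range, summary)
```

Summarizing each bucket with a single number keeps the chart readable, at the cost of hiding how spread out the users inside a bucket are.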

Conclusion

Indeed, users in higher rating ranges usually performed slightly better in their early contests than users in lower rating ranges. But why? Well, I don't know! There could be a myriad of reasons, and I would like to hear what you think!

And remember, if you blasted through your early contests on Codeforces, it does not mean that you will have a breezy time moving forward. Keep studying: