As I've become more familiar with pandas, I've noticed that one of its most useful and versatile features is the value_counts() method. For that reason, a value_counts() guide may come in handy for rookie data scientists looking to get the most out of this invaluable tool.
To indulge in my current obsession with tennis, I will use a data set from the GitHub of Jeff Sackmann, founder of tennis database TennisAbstract, containing data from all the ATP matches that have taken place in 2023. (https://github.com/JeffSackmann/tennis_atp.git).
I can use value_counts() on a Series, or column, in a data frame to return its distribution of values. In other words, it outputs the count of each unique value in that column.
Here is an example:
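Since I can't reproduce the full data set here, the snippet below uses a hypothetical miniature stand-in with the same 'winner_name' column; in practice you would load the real file (atp_matches_2023.csv in Jeff Sackmann's tennis_atp repository) with pd.read_csv():

```python
import pandas as pd

# Toy stand-in for the ATP 2023 matches data set; the real data comes from
# atp_matches_2023.csv in Jeff Sackmann's tennis_atp GitHub repository.
matches = pd.DataFrame({
    "winner_name": [
        "Daniil Medvedev", "Carlos Alcaraz", "Daniil Medvedev",
        "Novak Djokovic", "Daniil Medvedev",
    ]
})

# value_counts() lists each unique winner with his match count,
# sorted from most wins to fewest.
print(matches["winner_name"].value_counts())
```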
I used value_counts() on the 'winner_name' Series to return a truncated listing of each winner's name and the number of ATP matches he won in 2023. The output also shows the length of the result, which gives us the number of unique winners (302). Here I can see that Daniil Medvedev won 68 matches, the most of any player on the ATP tour in 2023.
We can also use value_counts() on a Series with a smaller number of unique values and pass the parameter normalize=True to show the relative distribution (percentage) of each category.
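A minimal sketch with a toy 'surface' column (the real values come from the tennis_atp data set) shows how normalize=True swaps raw counts for proportions:

```python
import pandas as pd

# Hypothetical miniature 'surface' column standing in for the real data set.
matches = pd.DataFrame({
    "surface": ["Hard", "Hard", "Clay", "Grass", "Hard", "Clay", "Hard", "Hard"]
})

# normalize=True returns each surface's share of matches instead of raw counts;
# the proportions always sum to 1.
print(matches["surface"].value_counts(normalize=True))
```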
Here we see that about 56.5% of ATP matches in 2023 were played on a hard court. Knowing that proportion is a more robust metric than knowing only the raw count of hard-court matches. For example, among other factors such as durability and number of tournaments entered, the predominance of hard courts can help explain why a player like Daniil Medvedev, a hard-court specialist who notoriously struggles on clay and grass, won the most matches in 2023 despite not currently being considered the best player on the ATP tour.
The output for value_counts() on the Series 'surface' also shows us that there are some 'None' values for surface. This is potentially useful when it comes time to clean our data set, as we would need to change the 'None' values before dropping rows with null values (NaN).
Speaking of null values, we can set the dropna parameter within value_counts() to False to include the null values in our returned result.
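The sketch below uses a toy version of the 'w_ace' column (the winner's ace count in the real data set) with some missing values mixed in, to show how dropna=False keeps NaN as its own row in the result:

```python
import numpy as np
import pandas as pd

# Hypothetical 'w_ace' column (winner's ace count) with missing values.
matches = pd.DataFrame({
    "w_ace": [5, 12, np.nan, 5, 0, np.nan, 12, 5]
})

# By default value_counts() silently drops NaN; dropna=False keeps it
# as its own entry in the returned counts.
print(matches["w_ace"].value_counts(dropna=False))
```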
This displays the null values in the Series showing the number of aces hit by the winning player in each match. As I briefly touched on in the previous section, we may need to clean our data by removing the rows with null values in order to conduct further analysis.
One final feature of value_counts() that I will highlight is setting the parameter 'ascending' to True. This will display the counts of each value from lowest count to highest. While it may often be more useful to stick with the default 'ascending = False', being able to sort in either direction is a subtle added utility that can provide a different perspective. Like we did in the previous section, we can show this feature with the winning aces Series.
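Reusing a toy 'w_ace' column, a quick sketch of ascending=True, which flips the default order so the rarest values come first:

```python
import pandas as pd

# Hypothetical 'w_ace' column standing in for the real data set.
matches = pd.DataFrame({
    "w_ace": [5, 12, 5, 0, 12, 5]
})

# ascending=True sorts the counts from lowest to highest.
print(matches["w_ace"].value_counts(ascending=True))
```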
The value_counts() function in pandas offers several advantages that make it a valuable tool in a data analyst's arsenal. It provides a simple, concise overview of the distribution of unique values within a categorical column, which can lead to quick observations that facilitate further analysis. This is why data scientists find themselves using it so often when first investigating a data frame. The function also accepts parameters that add to its utility, such as displaying relative frequencies, surfacing the null values that may need to be removed from a Series, or sorting in either order. Furthermore, its inherent simplicity and readability are useful not only for maintaining consistency and reproducibility within a data pipeline but also for effectively conveying data insights to directors and stakeholders.
However, it's important to acknowledge the limitations of value_counts(). Although its simplicity is a great feature, it limits the function when it comes to more complex analysis. For example, for multidimensional analyses involving multiple columns, alternative approaches such as grouping and aggregating may be better suited. Also, value_counts() may require a separate call for each column, which can make our code somewhat redundant and our analysis less efficient.