Rogério Chaves

Posted on Sep 12, 2020

Is Neymar Better Than Average?

#datascience #bayesianstatistics #neymar #football

Do people on dev.to like data analysis posts notebook style? Well, let me know if you like it or not.

I've been reading the book Doing Bayesian Data Analysis, and there is an example there with baseball, which I'd like to try for football.

Basically, we want to estimate what's the probability of a player scoring in a match, and which part of that probability is influenced by their position on the field vs other factors. We'll focus on Neymar but I'm actually using the data of all players from UEFA Champions League.

First let's load our data. I collected data from UEFA Champions League for 2018 and 2019-2020, each row is a participation of a player in a match, with which position they were on this match, and how much goals they scored. Players that were in substitution bench during the match but never entered the field are not included

import pandas as pd
import numpy as np

df = pd.read_csv("uefa_player_matches.csv", parse_dates=["match_date"])
df = df.dropna()
df = df[df['position'].isin(["G", "D", "M", "F"])] # excluded "SUB" and "-" positions
position_map = {'G': 'Goalkeeper', 'D': 'Defender', 'M': 'Midfield', 'F': 'Forward'}
df['position'] = [ position_map[x] for x in df['position'] ]
df.head()

	match_id	match_date	player	player_id	team	team_id	position
0	158963	2019-06-25 13:00:00+00:00	Mattia Migani	80722.0	Tre Penne	700	Goalkeeper
1	158963	2019-06-25 13:00:00+00:00	Davide Cesarini	56095.0	Tre Penne	700	Defender
4	158963	2019-06-25 13:00:00+00:00	Mirko Palazzi	56099.0	Tre Penne	700	Defender
5	158963	2019-06-25 13:00:00+00:00	Nicola Gai	80723.0	Tre Penne	700	Midfield
7	158963	2019-06-25 13:00:00+00:00	Nicola Chiaruzzi	80724.0	Tre Penne	700	Midfield

Let's take a look on matches that Neymar participated

df[df['player'] == 'Neymar']

	match_id	match_date	player	player_id	team	team_id	position	goals
4360	240579	2019-11-26 20:00:00+00:00	Neymar	276.0	Paris Saint Germain	85	Forward	0
5012	240603	2019-12-11 20:00:00+00:00	Neymar	276.0	Paris Saint Germain	85	Midfield	1
5192	292852	2020-02-18 20:00:00+00:00	Neymar	276.0	Paris Saint Germain	85	Forward	1
5451	292853	2020-03-11 20:00:00+00:00	Neymar	276.0	Paris Saint Germain	85	Midfield	1
5610	570966	2020-08-12 19:00:00+00:00	Neymar	276.0	Paris Saint Germain	85	Forward	0
5698	589000	2020-08-18 19:00:00+00:00	Neymar	276.0	Paris Saint Germain	85	Forward	0
5731	591151	2020-08-23 19:00:00+00:00	Neymar	276.0	Paris Saint Germain	85	Forward	0
8419	39151	2018-09-18 19:00:00+00:00	Neymar	276.0	Paris Saint Germain	85	Forward	0
8930	39170	2018-10-03 16:55:00+00:00	Neymar	276.0	Paris Saint Germain	85	Forward	3
9504	39191	2018-10-24 19:00:00+00:00	Neymar	276.0	Paris Saint Germain	85	Forward	0
9740	39199	2018-11-06 20:00:00+00:00	Neymar	276.0	Paris Saint Germain	85	Forward	0
10421	39224	2018-11-28 20:00:00+00:00	Neymar	276.0	Paris Saint Germain	85	Forward	1
10657	39232	2018-12-11 20:00:00+00:00	Neymar	276.0	Paris Saint Germain	85	Forward	1

Just to help us understand our data a bit, what's the distribution of goals per match?

df['goals'].hist();

Woah a Poisson distribution as expected, I love those. Okay and which positions scores more?

df[['position', 'goals']].groupby('position').sum()

position	goals
Defender	135
Forward	587
Goalkeeper	3
Midfield	449

Alright zero surprises so far. Let's move on and build a Logistic Regression model, which will give us a probability of a goal given a player and their position.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X = df[['player_id', 'position']]
one_hot_encoder = OneHotEncoder()
one_hot_encoder.fit(X)
X = one_hot_encoder.transform(X)

y = [ 1 if y > 0 else 0 for y in df['goals'] ]

model = LogisticRegression()
model.fit(X, y)
model.score(X, y)

0.9107711354793067

Logistic regression can fit our data with 91% accuracy, seems pretty good. Now let's take a look on what's the probability of Neymar scoring a goal in some future match:

neymar_id = 276
X_example = one_hot_encoder.transform([[ neymar_id, "Forward" ]])
model.predict_proba(X_example)[0][1]

0.3835193670312905

38%, not bad, what if Neymar were a Midfield player instead of Forward?

X_example = one_hot_encoder.transform([[ neymar_id, "Midfield" ]])
model.predict_proba(X_example)[0][1]

0.2424111808266389

24%, is it good? Is it above average? I think so, but let's check what's the average of player in this positions scoring at all. For that we will take the average of probabilities of all players in that position

$P(goal|midfield) = \frac{\sum_{i = 1}^{\#players}{P(goal|players_i,midfield)}}{\#players}$

def prob_for_position(position):
    X_example = []
    for player_id in df[df["position"] == position]['player_id'].unique():
        X_example.append([ player_id, position ])
    X_example = one_hot_encoder.transform(X_example)
    ys_pred = [ y[1] for y in model.predict_proba(X_example) ]
    return ys_pred

forward_position_probs = prob_for_position("Forward")
print("Forward position mean", np.mean(forward_position_probs))
plt.hist(forward_position_probs, bins=50)
plt.show()

midfield_position_probs = prob_for_position("Midfield")
print("Midfield position mean", np.mean(midfield_position_probs))
plt.hist(midfield_position_probs, bins=50);

Forward position mean 0.165252941308293

Midfield position mean 0.08585727912713484

It's interesting to notice that the histograms above remember somewhat a beta distribution, closer to the left side, never smaller than zero, this will be important for us later

But answering our question, yes, Neymar is way above average, as an average forward would score with 16.5% chance while Neymar gets 38% on forward position. It seems that being Neymar does put a hell lot of weight in the probability to score in a match.

Remember: this doesn't really mean that Neymar the person alone is way better than everybody else, because many other variables such as "Playing in PSG with good team members" are aggregated in "being Neymar", as it's not really possible for us to deconfound this variable right now. On the other hand the position variable (forward, midfield) is pretty deconfounded as we measure across all players

So what would be the probability of Neymar scoring, regardless of position?

$P(goal|neymar) = \frac{\sum_{i = 1}^{\#positions}{P(goal|neymar,position_i)}}{\#positions}$

def prob_for_player_id(player_id):
    X_example = []
    for position in df['position'].unique():
        X_example.append([ player_id, position ])
    X_example = one_hot_encoder.transform(X_example)
    ys_pred = [ y[1] for y in model.predict_proba(X_example) ]
    return np.mean(ys_pred)

prob_for_player_id(neymar_id)

0.18960539993655143

What about Messi?

messi_id = 154
prob_for_player_id(messi_id)

0.21051765637224326

Seems like Messi has a bit higher probability of scoring than Neymar, 21% vs 19%

Those are nice point estimations, but you shouldn't take them as absolute truth, because our data is just a small sample from all matches by those two players (only 2 UEFA championships), as our input is incomplete, how much can we really trust the data we collected?

One alternative approach that helps us with that is building a Bayesian model, where we can estimate distributions over our parameters, as you will see next

Measuring uncertainty with Bayesian Analysis

For this next part I’ll use R with rjags because that’s what I learned from the book. I even tried to do the same with PyMC3 but no success so far.

Anyway, first, let’s load and transform the data just like we did on previous python notebook

df <- read.csv("uefa_player_matches.csv", header = TRUE)
df <- df[complete.cases(df), ] # drop na
df <- df[(df$position %in% c("G", "D", "M", "F")),]
position_names <- list("G" = "Goalkeeper", "D" = "Defender", "M" = "Midfield", "F" = "Forward")
df$position <- factor(apply(df['position'], 1, function(x) position_names[[x]]))
df$player_id <- factor(df$player_id)
head(df)

	match_id	match_date	player	player_id	team	team_id	position
1	158963	2019-06-25T13:00:00+00:00	Mattia Migani	80722	Tre Penne	700	Goalkeeper
2	158963	2019-06-25T13:00:00+00:00	Davide Cesarini	56095	Tre Penne	700	Defender
5	158963	2019-06-25T13:00:00+00:00	Mirko Palazzi	56099	Tre Penne	700	Defender
6	158963	2019-06-25T13:00:00+00:00	Nicola Gai	80723	Tre Penne	700	Midfield
8	158963	2019-06-25T13:00:00+00:00	Nicola Chiaruzzi	80724	Tre Penne	700	Midfield
9	158963	2019-06-25T13:00:00+00:00	Enrico Cibelli	80725	Tre Penne	700	Midfield

Let's just setup some variable we will need to build our model: y, which means weather there was a goal or not, positions and playerIds

y = apply(df['goals'], 1, function(x) if (x > 0) { 1 } else { 0 })
positions = as.numeric(df$position)
playerIds = as.numeric(df$player_id)
playerIdsUnique = unique(playerIds)

playersMostCommonPositions = c()
for (playerId in playerIdsUnique) {
  playerPositions = df[(playerIds == playerId),]
  sortedPositions = sort(table(as.numeric(playerPositions$position)), decreasing=TRUE)
  playersMostCommonPositions[playerId] <- as.numeric(names(sortedPositions)[1])
}

Now the most exciting part, let's build and compile our hierarchical model!

modelString = "
  model {
    for (positionId in 1:length(positions)) {
      meanPosition[positionId] ~ dunif(0, 1)
      variancePosition[positionId] ~ dunif(0, 1)
    }
    for (i in playerIdsUnique) {
      # Creating a beta distribution based on mean and variance: https://en.wikipedia.org/wiki/Beta_distribution#Mean_and_variance
      mean[i] = meanPosition[playersMostCommonPositions[i]]
      variance[i] = variancePosition[playersMostCommonPositions[i]] / 10
      v[i] = ((mean[i] * (1 - mean[i])) / variance[i]) - 1
      alpha[i] = mean[i] * v[i]
      beta[i] = (1 - mean[i]) * v[i]

      probPlayer[i] ~ dbeta(alpha[i], beta[i])
    }
    for (i in 1:length(y)) {
      y[i] ~ dbern(probPlayer[playerIds[i]])
    }
  }
"

writeLines(modelString, con="TEMPmodel.txt")
jagsModel = jags.model(
  "TEMPmodel.txt",
  data=list(
    y = y,
    positions = positions,
    playerIds = playerIds,
    playerIdsUnique = playerIdsUnique,
    playersMostCommonPositions = playersMostCommonPositions
  )
)

So, let me try to explain everything. First of all, our goal is to model the probability of a goal happening or not in a match, so this is clearly a bernoulli trial, which is about true/false results, this is defined by the y[i] ~ dbern part. The argument inside dbern is the probability for the given player, but how do we define that?

Player probability of scoring is a beta distribution, as we saw on the previous python notebook. We want to find the probability of scoring for each player, for example, a player with a 0.5 mean probability would probably score every other match. A beta distribution is defined by parameters alpha and beta, but because those are very hard to intuitively interpret, I used a formula to be able to define it based on mean and variance. But who defines this mean and variance then?

The player position! Each position (Forward, Midfield, Defender, Goalkeeper) will have a mean probability of goal that fits for all players, simply defined here by a uninformative uniform distribution between 0 and 1. So the players influence the mean position probability, and the position in turn affects individual players probability to score. This is the magic of Bayes hierarchical models!

update(jagsModel, n.iter=500) # burn-in
parameters = c("probPlayer", "meanPosition", "variancePosition")
codaSamples = coda.samples(jagsModel, variable.names=parameters, n.iter=5000)

Alright, so finally, let's plot our posteriors and start asking questions! What's the mean probability of a forward player to score in a match?

mcmcMat = as.matrix(codaSamples)

positionLevel = which(levels(df$position) == "Forward")
positionMeans = mcmcMat[,paste("meanPosition[", positionLevel, "]", sep="")]
plotPost(positionMeans, credMass=0.95, xlab="Forward mean HDI")

Between 0.159 and 0.196, which is a cool advantage of using Bayesian methods, it's common to look at distributions rather than point estimates, so we can have an idea of our uncertainty. What about Neymar, is he in this range? Or is he conclusively better than average?

neymarId = 276
neymarLevel = which(levels(df$player_id) == neymarId)
playerProbs = mcmcMat[,paste("probPlayer[", neymarLevel, "]", sep="")]
plotPost(playerProbs, credMass=0.95, xlab="Neymar HDI", compVal=0.196)

Well the mean is similar to what we got using Logistic Regression, 0.38 vs 0.328 here. But is he conclusively better than average? Almost but no!* The lower bound for Neymar is 0.148, which is inside the average range, so it seems like he is probably better, but variance is quite high, maybe it was just luck all this time? What about Messi?

* considering 95% HDI

messiId = 154
messiLevel = which(levels(df$player_id) == messiId)
playerProbs = mcmcMat[,paste("probPlayer[", messiLevel, "]", sep="")]
plotPost(playerProbs, credMass=0.95, xlab="Messi HDI", compVal=0.195)

Yes! Messi lower bound is 0.204, considering 95% Highest Density Interval he is conclusively better than average forward position players! It does not overlap with the upper bound of 0.196 for average forward players.

That was it, just a small bayesian exercise, I hope it was as fun to you as it was for me.

Disclaimer: I'm not an expert so I may not be interpreting this results correctly, please let me know if I don't.

If you also want to play with it, you can find the dataset and the notebooks on my github.

Thanks for reading!