Having watched Cricket as my favourite sport for years, I've recently come to wonder how well the outcomes of games could be predicted.

Especially in tournaments like the IPL, it seems difficult to predict the winner of each game since the difference in quality between sides often seems minimal on paper.

It would be interesting to see how effective machine learning could be at this task and which factors influence the outcomes of games.

## Top-Down vs Bottom-Up Approach

One way to predict the outcome of a game is to take the list of players from each of the two teams and predict each team's win percentage.

While this approach works well to predict the final outcome of the match, it fails to answer any other questions we may have about a game.

For this reason, I have decided to take the bottom-up approach of predicting the outcome of each ball in a game, which would ultimately predict the final outcome of the game. This way, we can answer many questions such as:

- Who will score the highest in a game?
- Who will take the most wickets?
- What will they score?
- What happens if we reverse the batting order?

## Predicting the outcome of balls

To accurately predict the outcome of each ball in a game, you would need to take into account two groups of data:

- Data of the current match (team score, batsman score, wickets fallen etc.)
- Pre-match historical data (batter and bowler skill)

We of course would need data from the current match so that the situation in which the ball is bowled is known.

To make predictions more accurate, the actual skill level of the bowler and batsman should be known, so that we can tell who is most likely to have the advantage. A player's skill level is estimated from their data in previous matches.

For simplicity in this project, run outs are counted as a bowler's wicket and extras are counted as the batsman's runs. This means each ball outcome can be one of:

- 0 runs
- 1 run
- 2 runs
- 3 runs
- 4 runs
- 6 runs
- Wicket
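As a rough sketch, the seven outcomes above can be encoded as integer classes. The function below mirrors the convention used in the dataset-building code later in this post, where a wicket is stored as class 5; the name `encode_outcome` and its arguments are hypothetical and only for illustration.

```python
def encode_outcome(runs_off_bat, is_wicket):
    """Map one delivery to a class: 0, 1, 2, 3, 4, 6 for runs, 5 for a wicket."""
    if is_wicket:
        return 5
    runs = min(int(runs_off_bat), 6)  # treat anything above six as a six
    return 4 if runs == 5 else runs   # fold the rare five into a four


print(encode_outcome(1, False))  # a single
print(encode_outcome(0, True))   # a wicket
```

Reusing 5 for the wicket class works only because five runs off the bat is folded into a four first.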

## The Dataset

The dataset I have is a collection of CSV files, one for each IPL match since 2008.

Each CSV file contains the ball-by-ball data for its match.

This collection of CSV files will be used to calculate certain statistics of the players and will be ultimately used as part of the dataset for the machine learning model to predict the outcomes of balls.

## Calculating Player Skill Ratings

### Batsmen Ratings

The two metrics commonly used to indicate the skill of a batsman are:

- Strike Rate (how many runs a batsman scores per 100 balls)
- Average (how many runs a batsman scores on average before getting out)

Adding these together can give a rating for a batsman, where a higher number would indicate a more skilled player.

However, there are some limitations when using this as a rating:

- Different batsmen excel in different situations in a game. For example, finishers are excellent at scoring quick runs near the end of an innings, but probably would not do as good a job when opening the innings. Strike Rates and Averages do not provide this insight into what role a batsman is best in.
- Strike Rates and Averages do not give an insight into how a batsman scores their runs. Two batsmen can have a strike rate of 135 and an average of 30, but one can achieve that by frequently running between the wickets while the other can achieve it by hitting boundaries every few balls. Predicting the outcome of a ball can be much more accurate if the type of batsman is considered.

With these things considered, I felt a single number was not enough to capture the skill of a batsman. After some thought, I decided that each batsman would have the following ratings:

- Explosivity Rating
  - Quantifies whether a batsman is a big hitter who likes to accumulate their runs from boundaries
  - Total no. of boundaries hit / Total no. of balls faced

- Running Rating
  - Quantifies whether a batsman likes to score their runs by running hard between the wickets
  - Total no. of balls where batsman ran for their runs / Total no. of balls faced

- Finisher Rating
  - Quantifies how often a batsman is involved in the finish of an innings
  - Total no. of not outs / Total no. of balls faced

- Consistency Rating
  - Measure of how consistently a batsman performs
  - Batsman's Average

- Quick Scorer Rating
  - Measure of how quickly a batsman gets their runs
  - Batsman's Strike Rate
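A minimal sketch of how these five raw (not yet standardised) ratings could be computed from a batsman's career totals. The function name, its arguments and the fallback for a batsman who has never been dismissed are assumptions made for illustration, not the script actually used here.

```python
def raw_batting_ratings(balls_faced, runs, boundaries, running_balls, not_outs, dismissals):
    """Return the five unstandardised batting ratings from career totals."""
    if balls_faced == 0:
        return None  # no batting record to rate
    return {
        "explosivity_rating": boundaries / balls_faced,
        "running_rating": running_balls / balls_faced,
        "finisher_rating": not_outs / balls_faced,
        # batting average: runs per dismissal (assumed fallback: total runs if never out)
        "consistency_rating": runs / dismissals if dismissals else float(runs),
        # strike rate: runs per 100 balls faced
        "quick_scorer_rating": 100 * runs / balls_faced,
    }


# made-up career totals for one batsman
example = raw_batting_ratings(
    balls_faced=100, runs=135, boundaries=15, running_balls=50, not_outs=2, dismissals=5
)
print(example)
```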

After each rating above is calculated for each player in the database, the ratings will be standardised to acquire the final rating.

The formula for standardising a data point is:

$$z = \frac{x - \mu}{\sigma}$$

$x$ is the data point in question.

$\mu$ is the mean of the whole dataset.

$\sigma$ is the standard deviation of the whole dataset.

$z$ is the resulting score. It shows how many standard deviations above/below the average the data point is.
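As a small worked example, here is standardisation (subtract the mean, divide by the standard deviation) applied to three made-up strike rates. The sketch uses the sample standard deviation from Python's `statistics` module, which matches the default behaviour of pandas' `Series.std()`; the helper name `zscores` is just for illustration.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardise values: subtract the mean, divide by the sample std dev."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(x - mu) / sigma for x in values]


strike_rates = [110.0, 125.0, 140.0]  # made-up strike rates for three batsmen
print(zscores(strike_rates))  # [-1.0, 0.0, 1.0]
```

Here the mean is 125 and the standard deviation is 15, so the three batsmen sit one standard deviation below, exactly at, and one standard deviation above the average.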

### Bowler Ratings

The metrics commonly used to indicate the skill of a bowler are:

- Economy (how many runs they concede per over)
- Strike Rate (balls bowled per wicket)
- Average (runs conceded per wicket)

I felt like these metrics were already good enough to indicate the type of bowler.

The economy measures how good a bowler is at conceding fewer runs, while the strike rate measures the wicket-taking ability of the bowler.

A low average would mean a bowler has both a good economy and a good strike rate.

I named the ratings:

- economy_rating
- wicket_taking_rating
- bowling_consistency_rating

These will be standardised using this formula:

$$z = -\frac{x - \mu}{\sigma}$$

$x$ is the data point in question.

$\mu$ is the mean of the whole dataset.

$\sigma$ is the standard deviation of the whole dataset.

$z$ is the resulting score. It shows how many standard deviations above/below the average the data point is.

This is almost the same as the formula used for batting, except for the negation at the front. Since lower economies, strike rates and averages are better, the scores are negated so that a higher rating would indicate a more skilled bowler.
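A quick sketch of the negated standardisation on three made-up economy rates, to show that the bowler with the lowest economy ends up with the highest rating. The helper name `bowling_rating` is hypothetical.

```python
from statistics import mean, stdev


def bowling_rating(values):
    """Negated z-score: lower raw values (better bowling) get higher ratings."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [-(x - mu) / sigma for x in values]


economies = [6.5, 8.0, 9.5]  # made-up economy rates for three bowlers
ratings = bowling_rating(economies)
print(ratings)
```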

I also decided to use another rating:

- specialist_rating
  - Used to measure whether a bowler is a specialist, part-timer or a non-bowler
  - Number of balls bowled / number of matches played

This rating will be standardised using the original standardisation formula.

### Experience

Right now, a tailender who has only faced 4 balls in their career and hit 16 runs would be given a high number in most of the batting ratings. This obviously should not be the case, as a tailender would not be able to keep up those numbers for an extended period of time. Similarly, a batsman may have bowled a few decent overs in their career and may end up being higher rated than some established bowlers.

Therefore, these ratings need to be adjusted to account for their experience as a batsman and a bowler.

The number of innings a player has batted in will determine the batting experience.

The number of balls a player has bowled will determine the bowling experience.

Both of these will be standardised (using the formula from before) and their values will be clipped from -1.5 to 1.5, so that the ratings don't just favour the players who have played the most.

Once they have been standardised, each player rating will be adjusted as follows:

- If the rating is a batting rating, add the batting experience to the rating. Otherwise, add the bowling experience.
- Re-standardise this new number
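The adjustment steps above can be sketched as follows. This is a minimal illustration on made-up numbers, assuming plain lists of values and the sample standard deviation; the function names are hypothetical.

```python
from statistics import mean, stdev


def zscores(values):
    mu, sigma = mean(values), stdev(values)
    return [(x - mu) / sigma for x in values]


def adjust_for_experience(ratings, experience, limit=1.5):
    """Add clipped, standardised experience to standardised ratings, then re-standardise."""
    exp_z = [max(-limit, min(limit, z)) for z in zscores(experience)]
    combined = [r + e for r, e in zip(ratings, exp_z)]
    return zscores(combined)


# made-up data: the first player is a tailender with one lucky innings
ratings = zscores([4.0, 1.4, 1.3, 1.2])  # standardised raw rating
innings_batted = [1, 120, 150, 90]       # batting experience
adjusted = adjust_for_experience(ratings, innings_batted)
print(adjusted)
```

In this toy example the tailender starts with the highest rating by far, but after the adjustment the most experienced batsman comes out on top.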

## Exploring the ratings

I wrote a script to go through each ball from the dataset, calculate the overall ratings of the players and save them to a CSV file. You can see a snippet of the data below.

According to these ratings...

The top 15 fastest scorers in the IPL have been:

- PN Mankad
- AD Russell
- V Sehwag
- AB de Villiers
- GJ Maxwell
- RR Pant
- CH Gayle
- KA Pollard
- HH Pandya
- SP Narine
- YK Pathan
- DA Warner
- SR Watson
- SA Yadav
- DA Miller

The top 15 most consistent batsmen:

- Iqbal Abdulla
- KL Rahul
- DA Warner
- AB de Villiers
- CH Gayle
- MS Dhoni
- DA Miller
- V Kohli
- RR Pant
- F du Plessis
- JC Buttler
- S Dhawan
- JP Duminy
- SPD Smith
- SK Raina

The top 15 most explosive batsmen (highest proportion of balls faced hit for boundaries):

- PN Mankad
- B Stanlake
- RS Sodhi
- V Sehwag
- SP Narine
- AD Russell
- CH Gayle
- GJ Maxwell
- SR Watson
- RR Pant
- AB de Villiers
- SA Yadav
- BB McCullum
- DR Smith
- DA Warner

The top 15 finishers:

- RA Jadeja
- HH Pandya
- DJ Bravo
- MS Dhoni
- DA Miller
- Harbhajan Singh
- JR Hazlewood
- YK Pathan
- IK Pathan
- Iqbal Abdulla
- Mukesh Choudhary
- P Sahu
- BA Bhatt
- K Upadhyay
- JE Taylor

Top 15 hardest runners (highest proportion of balls faced on which the batsman ran for their runs):

- CRD Fernando
- DP Vijaykumar
- NJ Rimmington
- RG More
- SPD Smith
- DA Miller
- RA Jadeja
- V Kohli
- AB de Villiers
- MS Dhoni
- MK Pandey
- AR Patel
- AT Rayudu
- KL Rahul
- SK Raina

Top 15 most economical bowlers:

- AD Russell
- SN Thakur
- GJ Maxwell
- SR Watson
- PP Chawla
- JA Morkel
- JJ Bumrah
- STR Binny
- DL Chahar
- HV Patel
- Rashid Khan
- Mohammed Siraj
- R Vinay Kumar
- KH Pandya
- K Rabada

Top 15 best wicket takers:

- Sandeep Sharma
- RP Singh
- MM Patel
- UT Yadav
- GJ Maxwell
- MG Johnson
- Rashid Khan
- AR Patel
- A Nehra
- SN Thakur
- MM Sharma
- AB Dinda
- JD Unadkat
- DL Chahar
- DJ Bravo

Top 15 most consistent bowlers:

- MM Patel
- Sandeep Sharma
- RP Singh
- A Nehra
- MG Johnson
- UT Yadav
- AR Patel
- AB Dinda
- Rashid Khan
- GJ Maxwell
- JH Kallis
- MM Sharma
- Harbhajan Singh
- R Bhatia
- SN Thakur

Obviously there are a few anomalies seen in each rating, with some bowlers ranking higher than specialist batsmen, despite correcting for experience. However, these anomalous players do not carry over to the other ratings and the ratings as a whole do make sense.

## Match Data

For each ball, along with player skill data, the context of the current match will also be considered.

The following pieces of information will be considered in each ball:

- Ball Number
- Batsman's Score
- Balls faced by the batsman
- Proportion of balls faced by the batsman that resulted in 0 runs
- Proportion of balls faced by the batsman that resulted in 1 run
- Proportion of balls faced by the batsman that resulted in 2 runs
- Proportion of balls faced by the batsman that resulted in 3 runs
- Proportion of balls faced by the batsman that resulted in 4 runs
- Proportion of balls faced by the batsman that resulted in 6 runs
- Runs conceded by the bowler
- Number of balls bowled by the bowler
- Number of wickets taken by the bowler
- Proportion of balls bowled by the bowler that resulted in 0 runs
- Proportion of balls bowled by the bowler that resulted in 1 run
- Proportion of balls bowled by the bowler that resulted in 2 runs
- Proportion of balls bowled by the bowler that resulted in 3 runs
- Proportion of balls bowled by the bowler that resulted in 4 runs
- Proportion of balls bowled by the bowler that resulted in 6 runs
- Proportion of balls bowled by the bowler that resulted in a wicket
- Chasing score (if applicable)
- Required run rate (if applicable)
- Innings score
- Innings wickets

All of these will be standardised as done before.

## Building the dataset

Now that the data to be used has been decided, it is time to process the ball-by-ball CSV files into the dataset to train on.

```
import pandas as pd
import numpy as np
import os
import pickle as pkl
# standardising formula
def zscore(col):
    mean = col.mean()
    std = col.std()
    return (col - mean) / std
df = pd.DataFrame() # will hold the final dataset at the end
player_db = pd.read_csv("player-db.csv") # need it for player ratings
```

The code below goes through each match and adds the relevant data to the dataframe.

```
for file in os.listdir("matches"):
    f = os.path.join("matches", file)
    match_df = pd.read_csv(f)
    match_df = match_df.fillna(0)
    # columns of all the match data needed
    ball_no = []
    striker = []
    bowler = []
    batsman_runs = []
    batsman_balls = []
    batsman_outcome_dists = {outcome: [] for outcome in [0, 1, 2, 3, 4, 6]}
    bowler_economy = []
    wicket_taking = []
    bowler_consistency = []
    bowler_wickets = []
    bowler_runs = []
    bowler_balls = []
    bowler_outcome_dists = {outcome: [] for outcome in range(0, 7)}
    innings_score = []
    innings_wickets = []
    chasing = []
    req_run_rate = []
    outcome = []
    batsmen = {}
    bowlers = {}
    score = 0
    wickets = 0
    chasing_score = 0
    prev_innings = 1
    for _, ball in match_df.iterrows():
        batter = ball["striker"]
        _bowler = ball["bowler"]
        striker.append(batter)
        bowler.append(_bowler)
        runs = int(ball["runs_off_bat"])
        runs = min(runs, 6)
        runs = 4 if runs == 5 else runs
        wides = int(ball["wides"])
        wicket = 1 if ball["wicket_type"] else 0
        innings = int(ball["innings"])
        if innings != prev_innings:
            # a new innings has started: the previous total becomes the target
            chasing_score = score
            score = 0
            wickets = 0
            prev_innings = innings
        if batter not in batsmen:
            # this will hold the number of balls faced for each outcome by a batsman
            batsmen[batter] = {
                0: 0,
                1: 0,
                2: 0,
                3: 0,
                4: 0,
                6: 0
            }
        if _bowler not in bowlers:
            # this will hold the number of balls bowled for each outcome by a bowler (5 means wicket)
            bowlers[_bowler] = {
                0: 0,
                1: 0,
                2: 0,
                3: 0,
                4: 0,
                5: 0,
                6: 0
            }
        ## Batting Data ##
        batsman_balls_faced = np.sum([batsmen[batter][i] for i in batsmen[batter]])
        # getting proportion of balls faced for each outcome
        batsman_dist = {_outcome: 0 if batsman_balls_faced == 0 else batsmen[batter][_outcome] / batsman_balls_faced for _outcome in batsmen[batter]}
        batsman_runs_scored = np.sum([_outcome * batsmen[batter][_outcome] for _outcome in batsmen[batter]])
        for _outcome in batsman_outcome_dists:
            batsman_outcome_dists[_outcome].append(batsman_dist[_outcome])
        batsman_runs.append(batsman_runs_scored)
        batsman_balls.append(batsman_balls_faced)
        ## Bowling Data ##
        bowler_balls_bowled = np.sum([bowlers[_bowler][i] for i in bowlers[_bowler]])
        bowler_runs_given = np.sum([_outcome * bowlers[_bowler][_outcome] if _outcome != 5 else 0 for _outcome in bowlers[_bowler]])
        bowler_wickets_taken = bowlers[_bowler][5]
        # getting proportion of balls bowled for each outcome
        bowler_dist = {_outcome: bowlers[_bowler][_outcome] / bowler_balls_bowled if bowler_balls_bowled != 0 else 0 for _outcome in bowlers[_bowler]}
        for _outcome in bowler_outcome_dists:
            bowler_outcome_dists[_outcome].append(bowler_dist[_outcome])
        bowler_runs.append(bowler_runs_given)
        bowler_wickets.append(bowler_wickets_taken)
        bowler_balls.append(bowler_balls_bowled)
        ## Innings Data ##
        innings_score.append(score)
        innings_wickets.append(wickets)
        chasing.append(chasing_score)
        ball_outcome = runs if wicket == 0 else 5
        outcome.append(ball_outcome)
        discrete_ball_no = ball["ball"]
        # convert over.ball (e.g. 3.4) to a ball count; the decimal part is scaled before rounding
        discrete_ball_no = int(discrete_ball_no) * 6 + round((discrete_ball_no % 1) * 10)
        ball_no.append(discrete_ball_no)
        rem_balls = 120 - (discrete_ball_no - 1) if discrete_ball_no <= 120 else int(discrete_ball_no - 120)
        _req_run_rate = max(0, (chasing_score - score) / rem_balls)
        req_run_rate.append(_req_run_rate)
        ## Update State ##
        batsmen[batter][runs] += 1
        bowlers[_bowler][runs] += 1
        # note: a wicket ball is counted here as well as in its runs bucket above
        bowlers[_bowler][5] += wicket
        score += runs
        wickets += wicket
    new_match_df = pd.DataFrame()
    new_match_df["striker"] = striker
    new_match_df["bowler"] = bowler
    new_match_df["batsman_runs"] = batsman_runs
    new_match_df["batsman_balls"] = batsman_balls
    for _outcome in batsman_outcome_dists:
        new_match_df[f"batsman_{_outcome}"] = batsman_outcome_dists[_outcome]
    new_match_df["bowler_runs"] = bowler_runs
    new_match_df["bowler_balls"] = bowler_balls
    new_match_df["bowler_wickets"] = bowler_wickets
    for _outcome in bowler_outcome_dists:
        new_match_df[f"bowler_{_outcome}"] = bowler_outcome_dists[_outcome]
    new_match_df["innings_score"] = innings_score
    new_match_df["innings_wickets"] = innings_wickets
    new_match_df["ball"] = ball_no
    new_match_df["chasing"] = chasing
    new_match_df["req_run_rate"] = req_run_rate
    new_match_df["outcome"] = outcome
    frames = [df, new_match_df]
    df = pd.concat(frames)
# df now contains the relevant match data from every single ball in the dataset
```
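Converting the `ball` column into a ball count is easy to get subtly wrong because of operator precedence and floating-point remainders. Below is a standalone sketch of one way to do it, assuming the column uses an over.ball format with overs numbered from 0 (so `3.4` is the fourth ball of the fourth over). The function name `ball_index` is hypothetical.

```python
def ball_index(ball):
    """Convert an over.ball value such as 3.4 into a 1-based ball count."""
    tenths = round(float(ball) * 10)  # 3.4 -> 34, sidestepping float remainder issues
    over, ball_in_over = divmod(tenths, 10)
    return over * 6 + ball_in_over


print(ball_index(0.1))   # first ball of the innings
print(ball_index(3.4))   # fourth ball of the fourth over
print(ball_index(19.6))  # last legal ball of a 20-over innings
```

Overs that contain wides or no-balls can have deliveries numbered past `.6`, so indices near the end of such an over may overlap with the start of the next one under this scheme.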

This now adds the player ratings to the collected data and saves the dataset.

```
batting_ratings = player_db[["player", "explosivity_rating", "consistency_rating", "finisher_rating", "quick_scorer_rating", "running_rating"]]
bowling_ratings = player_db[["player", "economy_rating", "wicket_taking_rating", "bowling_consistency_rating", "specialist_rating"]]
df = df.join(batting_ratings.set_index("player"), on="striker") # add batsmen's batting ratings to dataframe
df = df.join(bowling_ratings.set_index("player"), on="bowler") # add bowler's bowling ratings to dataframe
df = df.drop(["striker", "bowler"], axis=1) # remove the striker and bowler columns since they are not part of the features needed to predict the outcome of a ball
# get the means and standard deviations of each column
df_mean = df.drop(["outcome"], axis=1).mean()
df_std = df.drop(["outcome"], axis=1).std()
# save the mean and std of all the columns. this would be needed later for preprocessing when predicting new data.
with open("mean-std.bin", "wb") as f:
    pkl.dump({
        "mean": df_mean,
        "std": df_std
    }, f)
# standardise all the columns except the ratings and outcome
for col in df.columns:
    if col != "outcome" and "rating" not in col:
        df[col] = zscore(df[col])
# shuffle
df = df.sample(frac=1)
# split into a training set and a testing set
training_split = int(len(df) * 0.85) # take 85% to train
training_df = df[:training_split]
testing_df = df[training_split:]
# save the datasets
training_df.to_csv("balls-train.csv", index=False)
testing_df.to_csv("balls-test.csv", index=False)
```
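The saved means and standard deviations are what allow a new, unseen ball to be preprocessed the same way as the training data. Here is a minimal sketch of that step, using plain dictionaries with made-up numbers in place of the pandas Series stored in `mean-std.bin` (both support lookup by column name). The function name `standardise_row` is hypothetical.

```python
def standardise_row(row, stats):
    """Standardise the non-rating features of one new ball using saved training statistics."""
    out = {}
    for name, value in row.items():
        if "rating" in name:
            out[name] = value  # ratings were already standardised in the player database
        else:
            out[name] = (value - stats["mean"][name]) / stats["std"][name]
    return out


# made-up statistics standing in for the contents of mean-std.bin
stats = {"mean": {"innings_score": 70.0}, "std": {"innings_score": 40.0}}
row = {"innings_score": 110.0, "explosivity_rating": 1.2}
print(standardise_row(row, stats))  # {'innings_score': 1.0, 'explosivity_rating': 1.2}
```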

The dataset looks something like this.

## Training and testing the model

I have chosen to use a Random Forest Classifier to train on the data, which is really simple and quick to train using sklearn.

```
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import pickle as pkl
from random import choices
dataset = pd.read_csv("balls-train.csv")
x = dataset.drop(["outcome"], axis=1)
y = dataset["outcome"]
clf = RandomForestClassifier(max_depth=15, n_estimators=50)
clf.fit(x, y)
with open("clf.bin", "wb") as f:
    pkl.dump(clf, f)
```

Now to test this model, we could see its accuracy on the test set.

```
test = pd.read_csv("balls-test.csv")
predicted = clf.predict(test.drop(["outcome"], axis=1))
print (np.sum(predicted == test["outcome"]) / len(test))
```

Running this results in the following output:

```
0.4430872720835546
```

An accuracy of 44% does not seem too great. However, each ball can easily have 2 or 3 reasonable outcomes, so accuracy alone is not a good measure of how the model is performing.

Instead, I thought it would be better to look at the input data of each outcome. For each outcome, take the inputs that were predicted to have this outcome by the model and also take the inputs in the test set that were labelled with this outcome. If these two sets of inputs are similar, then the model has predicted this outcome well.

```
test = pd.read_csv("balls-test.csv")
predicted = test.drop(["outcome"], axis=1)
preds = clf.predict(predicted)
predicted["outcome"] = preds
for outcome in range(0, 7):
    print("Outcome", outcome)
    # get rows which were labelled with this outcome in the test data
    test_slice = test[test["outcome"] == outcome]
    # get rows which were predicted with this outcome by the model
    predicted_slice = predicted[predicted["outcome"] == outcome]
    # a vector of the average values of the inputs that were labelled with the outcome in the test set
    test_slice_mean = test_slice.mean()
    # a vector of the average values of the inputs that were predicted to have this outcome by the model
    predicted_slice_mean = predicted_slice.mean().fillna(0)
    # calculates the euclidean distance between the two vectors
    dist = ((predicted_slice_mean - test_slice_mean) ** 2).sum() ** 0.5
    # compares how many times this outcome appeared in the test set and how many times it got predicted
    print("Actual Count", len(test_slice), "Predicted Count", len(predicted_slice))
    print("Average Distance", dist)
```

Running this results in the following output...

```
Outcome 0
Actual Count 12041 Predicted Count 13324
Average Distance 1.1953108714619187
Outcome 1
Actual Count 12564 Predicted Count 20402
Average Distance 0.8442394082114827
Outcome 2
Actual Count 2055 Predicted Count 16
Average Distance 4.355749118524852
Outcome 3
Actual Count 112 Predicted Count 0
Average Distance 4.526265034977223
Outcome 4
Actual Count 3848 Predicted Count 74
Average Distance 2.6306944519746605
Outcome 5
Actual Count 1646 Predicted Count 20
Average Distance 3.586353351892623
Outcome 6
Actual Count 1628 Predicted Count 58
Average Distance 4.243057588990431
```

This shows that the model is performing poorly. Its distances are quite large for each outcome, except for 0 and 1, considering the magnitudes of the values in the input.

It has also not been able to match the proportions of outcomes in the test set, with the predicted counts being very different to the actual counts.
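To put a number on that mismatch, the counts printed above can be turned into proportions and compared. The sketch below uses total variation distance, which is a measure chosen for this illustration and not part of the original evaluation.

```python
# counts copied from the output above
actual = {0: 12041, 1: 12564, 2: 2055, 3: 112, 4: 3848, 5: 1646, 6: 1628}
predicted = {0: 13324, 1: 20402, 2: 16, 3: 0, 4: 74, 5: 20, 6: 58}


def shares(counts):
    """Convert raw counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


actual_share = shares(actual)
predicted_share = shares(predicted)
# total variation distance: half the summed absolute gap between the two distributions
tvd = 0.5 * sum(abs(actual_share[k] - predicted_share[k]) for k in actual)
for k in actual:
    print(k, f"{actual_share[k]:.1%}", f"{predicted_share[k]:.1%}")
print("total variation distance", round(tvd, 3))
```

Singles make up about 37% of the labelled test balls but about 60% of the predictions, which accounts for most of the gap.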

The model is clearly biased towards predicting 0 and 1 for each ball. This does make sense however, since those are the most common outcomes in games of Cricket.

The problem here is how the predictions are made. The model outputs a probability distribution of the outcomes for each ball. Currently, the outcome with the highest probability is taken as the predicted outcome. It would make more sense to randomly select an outcome, using the probability distribution to weight the random selection.

Here is the same code but using weighted random selection.

```
test = pd.read_csv("balls-test.csv")
predicted = clf.predict_proba(test.drop(["outcome"], axis=1))
preds = []
for weights in predicted:
    outcomes = [0, 1, 2, 3, 4, 5, 6]
    # weighted random selection
    p = choices(outcomes, weights=weights)
    preds.append(p[0])
predicted = test.drop(["outcome"], axis=1)
predicted["outcome"] = preds
for outcome in range(0, 7):
    print("Outcome", outcome)
    test_slice = test[test["outcome"] == outcome]
    predicted_slice = predicted[predicted["outcome"] == outcome]
    # a vector of the average values of the inputs that were labelled with the outcome in the test set
    test_slice_mean = test_slice.mean()
    # a vector of the average values of the inputs that were predicted to have this outcome by the model
    predicted_slice_mean = predicted_slice.mean().fillna(0)
    # calculates the euclidean distance between the two vectors
    dist = ((predicted_slice_mean - test_slice_mean) ** 2).sum() ** 0.5
    print("Actual Count", len(test_slice), "Predicted Count", len(predicted_slice))
    print("Average Distance", dist)
```
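Two details are worth keeping in mind when sampling like this. The columns returned by `predict_proba` follow the order of `clf.classes_`, so the list of outcome labels should come from there instead of being hard-coded, in case a class is ever missing from the training data. A seeded random generator also makes runs repeatable. A minimal sketch with made-up probabilities (the function name `sample_outcome` is hypothetical):

```python
from random import Random


def sample_outcome(classes, weights, rng):
    """Draw one outcome label, weighted by the model's predicted probabilities."""
    return rng.choices(list(classes), weights=list(weights), k=1)[0]


rng = Random(42)  # fixed seed so repeated runs give the same draws
classes = [0, 1, 2, 3, 4, 5, 6]  # in practice this would be clf.classes_
weights = [0.35, 0.37, 0.06, 0.01, 0.11, 0.05, 0.05]  # made-up probabilities for one ball
draws = [sample_outcome(classes, weights, rng) for _ in range(10_000)]
share_of_singles = draws.count(1) / len(draws)
print(share_of_singles)  # should land close to the 0.37 weight given to a single
```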

Running this results in the following...

```
Outcome 0
Actual Count 12041 Predicted Count 12072
Average Distance 0.06285163994309419
Outcome 1
Actual Count 12564 Predicted Count 12507
Average Distance 0.047105551432281484
Outcome 2
Actual Count 2055 Predicted Count 2089
Average Distance 0.13890305459997243
Outcome 3
Actual Count 112 Predicted Count 94
Average Distance 0.779151006384056
Outcome 4
Actual Count 3848 Predicted Count 3858
Average Distance 0.09621869495553977
Outcome 5
Actual Count 1646 Predicted Count 1676
Average Distance 0.2084278773037825
Outcome 6
Actual Count 1628 Predicted Count 1598
Average Distance 0.29624599526308754
```

Now the model can be seen to be performing very well. It matches the test set's proportion of outcomes and the distances between the inputs have reduced to a small range.

### Exploring each feature

I thought it would be intriguing to see what each feature of the dataset contributed to the outcome of a ball.

Shown below are box plots for each feature against each predicted outcome to help visualise the distribution of features against each output. This will help in showing how a feature contributes to the outcome.

[Box plots not reproduced here. There was one plot per feature, in this order: batsman's runs; batsman's balls; batsman's dot ball, single, double, three runs, four runs and six runs proportions; runs conceded by the bowler; number of balls bowled by the bowler; wickets taken by the bowler; bowler's dot ball, single, double, three runs, four runs, wicket delivery and six runs proportions; innings score; innings wickets; ball number of the innings; score to chase; required run rate; explosivity, consistency, finisher, quick scorer and running ratings; economy, wicket taking, bowling consistency and bowling specialist ratings.]

From these box plots, we can see which features have more of an impact on the outcome.

Features that produce similar looking boxplots for each outcome do not have much of an effect on the outcome of a ball. These features include:

- Batsman/Bowler ball outcome proportions
- Bowler ratings
- Number of wickets taken by a bowler
- Chasing score
- Required run rate
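One rough way to check whether box plots "look similar" without eyeballing them is to compare the quartiles of a feature for each outcome, since those are the numbers a box plot draws. The sketch below does this on made-up rows using only the standard library; the function name and the row layout (a list of dictionaries with an `outcome` key) are assumptions for illustration.

```python
from statistics import quantiles


def quartiles_by_outcome(rows, feature):
    """Group feature values by outcome and return (Q1, median, Q3) for each outcome."""
    groups = {}
    for row in rows:
        groups.setdefault(row["outcome"], []).append(row[feature])
    return {
        outcome: tuple(quantiles(values, n=4, method="inclusive"))
        for outcome, values in sorted(groups.items())
        if len(values) >= 2  # skip outcomes with too few points to summarise
    }


# made-up rows: dot balls at low innings scores, sixes at high innings scores
rows = [{"outcome": 0, "innings_score": s} for s in (1, 2, 3, 4, 5)]
rows += [{"outcome": 6, "innings_score": s} for s in (10, 20, 30, 40, 50)]
summary = quartiles_by_outcome(rows, "innings_score")
print(summary)
```

If the quartiles barely move from one outcome to the next, the feature is doing little to separate the outcomes; if they shift a lot, as in this toy data, it is.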

It was not surprising to see the chasing score and required run rate features to not have much of an effect on the outcome, as half the balls in the dataset wouldn't have had these features applied to them.

It also was not surprising to see the ball outcome proportion and no. of wickets taken features in this list too. This is because these can have the same values appear all throughout a cricket match, so they are bound to have a wide range of outcomes for the same values.

I was however surprised to see that bowler ratings did not have much of an impact on the outcome, while batting ratings did.

This can mean a few things:

- The outcome of a game comes more down to how strong a team's batting lineup is rather than their bowling.
- There must be a better way to quantify the skills of a bowler

After some thought, while I do think it's a mix of both, I feel it is mainly due to the first point.

I believe this is because, especially in a competition like the IPL, the quality of bowlers does not vary as much as the quality of batsmen (statistically speaking at least). It is common for bowlers and other non-specialist batsmen to bat in a game, whereas you would almost never find a specialist batsman bowling and only occasionally find a part-time bowler.

The way to quantify a bowler's skill could still be improved however, maybe considering some of the following:

- Average pace of the bowler
- Average degree of turn the bowler gets (for spinners)
- Adjusting their economy/wicket taking ratings to consider the contexts of the matches they bowled in

These pieces of data were outside the scope of the dataset I had, but would be interesting to implement to improve this project in the future.

Nevertheless, I am happy with how the model has trained, aligning itself closely to the test dataset.

## Part 2

To avoid making this post too long, I have decided to split it into two parts.

Part 2 will involve building the actual match simulator. I will use it to see how well it simulates real games and to answer questions about hypothetical game situations.

Thank you for reading!
