DEV Community

ashwins-code

Cricket Match Simulation using Machine Learning [Part 1]

Having watched Cricket as my favourite sport for years, I've recently come to wonder how well the outcomes of games could be predicted.

Especially in tournaments like the IPL, it seems difficult to predict the winner of each game since the difference in quality between sides often seems minimal on paper.

It would be interesting to see how effective machine learning could be at this task, and which factors influence the outcomes of games.

Top-Down vs Bottom-Up Approach

One way to predict the outcome of a game is to take in the list of players from both teams and predict each team's win percentage directly.

While this approach works well to predict the final outcome of the match, it fails to answer any other questions we may have about a game.

For this reason, I have decided to take the bottom-up approach of predicting the outcome of each ball in a game, which would ultimately predict the final outcome of the game. This way, we can answer many questions, such as:

  • Who will score the highest in a game?
  • Who will take the most wickets?
  • What will they score?
  • What happens if we reverse the batting order?

Predicting the outcome of balls

To accurately predict the outcome of each ball in a game, you would need to take into account two groups of data:

  • Data of the current match (team score, batsman score, wickets fallen etc.)
  • Pre-match historical data (batter and bowler skill)

We of course would need data from the current match so that the situation in which the ball is bowled is known.

To make predictions more accurate, the skill levels of the bowler and batsman should also be known, so that we can judge who is more likely to have the advantage. A player's skill level is estimated from their data in previous matches.

For simplicity in this project, run outs are counted as a bowler's wicket and extras are counted as the batsman's runs. This means each ball outcome can be one of:

  • 0 runs
  • 1 run
  • 2 runs
  • 3 runs
  • 4 runs
  • 6 runs
  • Wicket
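
As a small sketch, these seven outcomes can be encoded as class labels in the same way the dataset-building code further down does: runs are capped at 6, a rare 5 is folded into 4, and the otherwise unused label 5 stands for a wicket. The function name `encode_outcome` is hypothetical.

```python
WICKET = 5  # label for a wicket; free to use because 5 runs is folded into 4

def encode_outcome(runs_off_bat, is_wicket):
    """Map one ball to a class label in {0, 1, 2, 3, 4, 5 (wicket), 6}."""
    if is_wicket:
        return WICKET
    runs = min(int(runs_off_bat), 6)  # cap anything above 6
    return 4 if runs == 5 else runs   # treat a rare 5 as a 4

print([encode_outcome(r, False) for r in range(8)], encode_outcome(4, True))
```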

The Dataset

The dataset I have is a collection of CSV files, one for each IPL match from 2008 onwards.

Each CSV file holds the ball-by-ball data for one match.

[Image: sample of the ball-by-ball data in one match CSV]

This collection of CSV files will be used to calculate certain statistics of the players and will be ultimately used as part of the dataset for the machine learning model to predict the outcomes of balls.

Calculating Player Skill Ratings

Batsmen Ratings

The two metrics commonly used to indicate the skill of a batsman are:

  • Strike Rate (how many runs a batsman scores per 100 balls)
  • Average (how many runs a batsman scores on average before getting out)

Adding these together can give a rating for a batsman, where a higher number would indicate a more skilled player.

However, there are some limitations when using this as a rating:

  • Different batsmen excel in different situations in a game. For example, finishers are excellent at scoring quick runs near the end of an innings, but probably would not do as good a job when opening the innings. Strike Rates and Averages do not provide this insight into what role a batsman is best in.
  • Strike Rates and Averages do not give an insight into how a batsman scores their runs. Two batsmen can have a strike rate of 135 and an average of 30, but one can achieve that by frequently running between the wickets while the other can achieve it by hitting a boundary every few balls. Predicting the outcome of a ball can be much more accurate if the type of batsman is considered.

With these things considered, I felt a single number was not enough to capture the skill of a batsman. After some thought, I decided that each batsman would have the following ratings:

  • Explosivity Rating
    • Quantifies whether a batsman is a big hitter who likes to accumulate their runs from boundaries
    • Total no. of boundaries hit / Total no. of balls faced
  • Running Rating
    • Quantifies whether a batsman likes to score their runs by running hard between the wickets
    • Total no. of balls where batsman ran for their runs / Total no. of balls faced
  • Finisher Rating
    • Quantifies how often a batsman is involved in the finish of an innings
    • Total no. of not outs / Total no. of balls faced
  • Consistency Rating
    • Measure of how consistently a batsman performs
    • Batsman's Average
  • Quick Scorer Rating
    • Measure of how quickly a batsman gets their runs
    • Batsman's Strike Rate
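
The five raw ratings above can be sketched from a batsman's career record as follows. This is only an illustration of the formulas, not the script used for this post: the function name and inputs are hypothetical, a 4 or 6 is assumed to be a boundary, 1 to 3 runs are assumed to be run between the wickets, and a batsman who has never been dismissed is assumed to get their total runs as an average.

```python
def batting_ratings(runs_per_ball, dismissals, not_outs):
    """Raw (unstandardised) ratings from one batsman's career record.

    runs_per_ball: runs scored off each ball faced, e.g. [0, 1, 4, 6]
    dismissals:    number of times the batsman was out
    not_outs:      number of innings the batsman finished not out
    """
    balls = len(runs_per_ball)
    if balls == 0:
        raise ValueError("batsman has not faced a ball")
    runs = sum(runs_per_ball)
    boundaries = sum(1 for r in runs_per_ball if r in (4, 6))
    run_balls = sum(1 for r in runs_per_ball if r in (1, 2, 3))
    return {
        "explosivity": boundaries / balls,
        "running": run_balls / balls,
        "finisher": not_outs / balls,
        # assumption: fall back to total runs when there are no dismissals
        "consistency": runs / dismissals if dismissals else float(runs),
        "quick_scorer": 100 * runs / balls,
    }

print(batting_ratings([0, 1, 4, 6, 1, 0, 2, 4], dismissals=1, not_outs=1))
```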

After each rating above is calculated for each player in the database, the ratings will be standardised to acquire the final rating.

The formula for standardising a data point is:

$$z = \frac{x - \mu}{\sigma}$$

x is the data point in question.
μ is the mean of the whole dataset.
σ is the standard deviation of the whole dataset.
z is the resulting score. It shows how many standard deviations above or below the average the data point is.
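
As a quick worked example of the formula, the sketch below standardises three made-up batting averages. It uses the sample standard deviation (dividing by n - 1), which is also what pandas' `Series.std()` computes by default.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardise a list of numbers: subtract the mean, divide by the std."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1)
    return [(x - mu) / sigma for x in values]

averages = [20.0, 30.0, 40.0]  # made-up batting averages
print(zscores(averages))  # [-1.0, 0.0, 1.0]
```

Here the mean is 30 and the standard deviation is 10, so an average of 40 sits exactly one standard deviation above the mean.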

Bowler Ratings

The metrics commonly used to indicate the skill of a bowler are:

  • Economy (how many runs they concede per over)
  • Strike Rate (balls bowled per wicket)
  • Average (runs conceded per wicket)

I felt like these metrics were already good enough to indicate the type of bowler.

The economy measures how good a bowler is at conceding fewer runs, while the strike rate measures the wicket-taking ability of the bowler.

A low average means a bowler has both a good economy and a good strike rate.
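
The link between the three metrics can be made explicit. Assuming six-ball overs, the average is the product of the other two, up to a constant:

```latex
\text{Average}
  = \frac{\text{runs}}{\text{wickets}}
  = \frac{\text{runs}}{\text{overs}} \times \frac{\text{overs}}{\text{wickets}}
  = \text{Economy} \times \frac{\text{Strike Rate}}{6}
```

For example, a bowler with an economy of 7.5 and a strike rate of 20 has an average of 7.5 × 20 / 6 = 25.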

I named the ratings:

  • economy_rating
  • wicket_taking_rating
  • bowling_consistency_rating

These will be standardised using this formula:

$$z = -\frac{x - \mu}{\sigma}$$

x is the data point in question.
μ is the mean of the whole dataset.
σ is the standard deviation of the whole dataset.
z is the resulting score. It shows how many standard deviations above or below the average the data point is.

This is almost the same as the formula used for batting, except for the negation at the front. Since lower economies, strike rates and averages are better, the scores are negated so that a higher rating would indicate a more skilled bowler.

I also decided to use another rating:

  • specialist_rating
    • used to measure whether a bowler is a specialist, part-timer or a non-bowler
    • Number of balls bowled / number of matches played

This rating will be standardised using the original standardisation formula.
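
The four bowling ratings can be sketched in pandas as below. The career totals are made up for three hypothetical bowlers, and the column names of the input frame are assumptions; only the rating names come from this post. A bowler with no wickets would need special handling, which is left out here.

```python
import pandas as pd

# Made-up career totals for three hypothetical bowlers.
bowlers = pd.DataFrame(
    {
        "runs_conceded": [800, 450, 60],
        "balls_bowled": [600, 360, 36],
        "wickets": [40, 15, 1],
        "matches": [30, 20, 12],
    },
    index=["bowler_a", "bowler_b", "bowler_c"],
)

def zscore(col):
    return (col - col.mean()) / col.std()

raw = pd.DataFrame(
    {
        "economy": 6 * bowlers["runs_conceded"] / bowlers["balls_bowled"],
        "strike_rate": bowlers["balls_bowled"] / bowlers["wickets"],
        "average": bowlers["runs_conceded"] / bowlers["wickets"],
        "balls_per_match": bowlers["balls_bowled"] / bowlers["matches"],
    }
)

ratings = pd.DataFrame(
    {
        # lower is better for these three, so the z-score is negated
        "economy_rating": -zscore(raw["economy"]),
        "wicket_taking_rating": -zscore(raw["strike_rate"]),
        "bowling_consistency_rating": -zscore(raw["average"]),
        # more balls per match means more of a specialist, so no negation
        "specialist_rating": zscore(raw["balls_per_match"]),
    }
)
print(ratings.round(2))
```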

Experience

Right now, a tailender who has only faced 4 balls in their career and hit 16 runs would be given a high number in most of the batting ratings. This obviously should not be the case, as a tailender would not be able to keep up those numbers for an extended period of time. Similarly, a batsman may have bowled a few decent overs in their career and may end up being higher rated than some established bowlers.

Therefore, these ratings need to be adjusted to account for their experience as a batsman and a bowler.

The number of innings a player has batted in will determine the batting experience.

The number of balls a player has bowled will determine the bowling experience.

Both of these will be standardised (using the formula from before) and their values will be clipped from -1.5 to 1.5, so that the ratings don't just favour the players who have played the most.

Once they have been standardised, each player rating will be adjusted as follows:

  • If the rating is a batting rating, add the batting experience to the rating. Otherwise, add the bowling experience.
  • Re-standardise this new number
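
The adjustment steps above can be sketched as follows. The player names, ratings and innings counts are made up, and `adjust_for_experience` is a hypothetical helper name; the logic follows the description: standardise the experience, clip it to the range -1.5 to 1.5, add it to the rating, then re-standardise.

```python
import pandas as pd

def zscore(col):
    return (col - col.mean()) / col.std()

def adjust_for_experience(rating, experience, clip=1.5):
    """rating: already-standardised rating per player.
    experience: raw count per player (innings batted or balls bowled)."""
    exp_z = zscore(experience).clip(-clip, clip)
    return zscore(rating + exp_z)

# Made-up values: a tailender with a lucky cameo vs. established batsmen.
players = pd.DataFrame(
    {
        "quick_scorer_rating": [1.2, 1.0, -0.9, -1.3],
        "innings_batted": [2, 150, 120, 4],
    },
    index=["tailender", "established", "regular", "newcomer"],
)

adjusted = adjust_for_experience(
    players["quick_scorer_rating"], players["innings_batted"]
)
print(adjusted.sort_values(ascending=False).round(2))
```

Before the adjustment the tailender has the highest rating; afterwards the established batsman ranks first, which is the behaviour the adjustment is meant to produce.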

Exploring the ratings

I wrote a script to go through each ball in the dataset, calculate the overall ratings of the players and save them to a CSV file. You can see a snippet of the data below.

[Image: snippet of the player ratings CSV]

According to these ratings...

The top 15 fastest scorers in the IPL have been:

  • PN Mankad
  • AD Russell
  • V Sehwag
  • AB de Villiers
  • GJ Maxwell
  • RR Pant
  • CH Gayle
  • KA Pollard
  • HH Pandya
  • SP Narine
  • YK Pathan
  • DA Warner
  • SR Watson
  • SA Yadav
  • DA Miller

The top 15 most consistent batsmen:

  • Iqbal Abdulla
  • KL Rahul
  • DA Warner
  • AB de Villiers
  • CH Gayle
  • MS Dhoni
  • DA Miller
  • V Kohli
  • RR Pant
  • F du Plessis
  • JC Buttler
  • S Dhawan
  • JP Duminy
  • SPD Smith
  • SK Raina

The top 15 most explosive batsmen (highest proportion of balls faced that were hit for a boundary):

  • PN Mankad
  • B Stanlake
  • RS Sodhi
  • V Sehwag
  • SP Narine
  • AD Russell
  • CH Gayle
  • GJ Maxwell
  • SR Watson
  • RR Pant
  • AB de Villiers
  • SA Yadav
  • BB McCullum
  • DR Smith
  • DA Warner

The top 15 finishers:

  • RA Jadeja
  • HH Pandya
  • DJ Bravo
  • MS Dhoni
  • DA Miller
  • Harbhajan Singh
  • JR Hazlewood
  • YK Pathan
  • IK Pathan
  • Iqbal Abdulla
  • Mukesh Choudhary
  • P Sahu
  • BA Bhatt
  • K Upadhyay
  • JE Taylor

Top 15 hardest runners (highest proportion of balls faced where the runs were run between the wickets):

  • CRD Fernando
  • DP Vijaykumar
  • NJ Rimmington
  • RG More
  • SPD Smith
  • DA Miller
  • RA Jadeja
  • V Kohli
  • AB de Villiers
  • MS Dhoni
  • MK Pandey
  • AR Patel
  • AT Rayudu
  • KL Rahul
  • SK Raina

Top 15 most economical bowlers:

  • AD Russell
  • SN Thakur
  • GJ Maxwell
  • SR Watson
  • PP Chawla
  • JA Morkel
  • JJ Bumrah
  • STR Binny
  • DL Chahar
  • HV Patel
  • Rashid Khan
  • Mohammed Siraj
  • R Vinay Kumar
  • KH Pandya
  • K Rabada

Top 15 best wicket takers:

  • Sandeep Sharma
  • RP Singh
  • MM Patel
  • UT Yadav
  • GJ Maxwell
  • MG Johnson
  • Rashid Khan
  • AR Patel
  • A Nehra
  • SN Thakur
  • MM Sharma
  • AB Dinda
  • JD Unadkat
  • DL Chahar
  • DJ Bravo

Top 15 most consistent bowlers:

  • MM Patel
  • Sandeep Sharma
  • RP Singh
  • A Nehra
  • MG Johnson
  • UT Yadav
  • AR Patel
  • AB Dinda
  • Rashid Khan
  • GJ Maxwell
  • JH Kallis
  • MM Sharma
  • Harbhajan Singh
  • R Bhatia
  • SN Thakur

Obviously there are a few anomalies seen in each rating, with some bowlers ranking higher than specialist batsmen, despite correcting for experience. However, these anomalous players do not carry over to the other ratings and the ratings as a whole do make sense.

Match Data

For each ball, along with player skill data, the context of the current match will also be considered.

The following pieces of information will be considered in each ball:

  • Ball Number
  • Batsman's Score
  • Balls faced by the batsman
  • Proportion of balls faced by the batsman that resulted in 0 runs
  • Proportion of balls faced by the batsman that resulted in 1 run
  • Proportion of balls faced by the batsman that resulted in 2 runs
  • Proportion of balls faced by the batsman that resulted in 3 runs
  • Proportion of balls faced by the batsman that resulted in 4 runs
  • Proportion of balls faced by the batsman that resulted in 6 runs
  • Runs conceded by the bowler
  • Number of balls bowled by the bowler
  • Number of wickets taken by the bowler
  • Proportion of balls bowled by the bowler that resulted in 0 runs
  • Proportion of balls bowled by the bowler that resulted in 1 run
  • Proportion of balls bowled by the bowler that resulted in 2 runs
  • Proportion of balls bowled by the bowler that resulted in 3 runs
  • Proportion of balls bowled by the bowler that resulted in 4 runs
  • Proportion of balls bowled by the bowler that resulted in 6 runs
  • Proportion of balls bowled by the bowler that resulted in a wicket
  • Chasing score (if applicable)
  • Required run rate (if applicable)
  • Innings score
  • Innings wickets

All of these will be standardised as done before.

Building the dataset

Now that the data to be used has been decided, it is time to process the ball-by-ball CSV files into the dataset to train on.

import pandas as pd
import numpy as np
import os
import pickle as pkl

# standardising formula
def zscore(col):
    mean = col.mean()
    std = col.std()

    return (col - mean)  / std

df = pd.DataFrame() # will hold the final dataset at the end
player_db = pd.read_csv("player-db.csv") # need it for player ratings

The code below goes through each match and adds the relevant data to the dataframe.

for file in os.listdir("matches"):
    f = os.path.join("matches", file)
    match_df = pd.read_csv(f)
    match_df = match_df.fillna(0)

    # columns of all the match data needed

    ball_no = []
    striker = []
    bowler = []
    batsman_runs = []
    batsman_balls = []
    batsman_outcome_dists = { outcome : [] for outcome in [0,1,2,3,4,6]}
    bowler_economy = []
    wicket_taking = []
    bowler_consistency = []
    bowler_wickets = []
    bowler_runs = []
    bowler_balls = []
    bowler_outcome_dists = { outcome : [] for outcome in range(0, 7) }
    innings_score = []
    innings_wickets = []
    chasing =  []
    req_run_rate = []

    outcome = []

    batsmen = {}
    bowlers = {}
    score = 0
    wickets = 0
    chasing_score = 0

    prev_innings = 1

    for _, ball in match_df.iterrows():

            batter = ball["striker"]
            _bowler = ball["bowler"]

            striker.append(batter)
            bowler.append(_bowler)

            runs = int(ball["runs_off_bat"])
            runs = min(runs, 6)
            runs = 4 if runs == 5 else runs
            wides = int(ball["wides"])
            wicket = 1 if ball["wicket_type"] else 0
            innings = int(ball["innings"])

            if innings != prev_innings:
                chasing_score = score
                score = 0
                wickets = 0

            prev_innings = innings


            if batter not in batsmen:
                # this will hold the number of balls faced for each outcome by a batsman
                batsmen[batter] = {
                    0: 0,
                    1: 0,
                    2: 0,
                    3: 0,
                    4: 0,
                    6: 0
                }

            if _bowler not in bowlers:
                # this will hold the number of balls bowled for each outcome by a bowler (5 means wicket)

                bowlers[_bowler] = {
                    0: 0,
                    1: 0,
                    2: 0,
                    3: 0,
                    4: 0,
                    5: 0,
                    6: 0
                }


            ## Batting Data ##

            batsman_balls_faced = np.sum([batsmen[batter][i] for i in batsmen[batter]])
            batsman_dist = { _outcome : 0 if batsman_balls_faced == 0 else batsmen[batter][_outcome] / batsman_balls_faced for _outcome in batsmen[batter]} # getting proportion of balls faced for each outcome
            batsman_runs_scored = np.sum([_outcome * batsmen[batter][_outcome] for _outcome in batsmen[batter]])

            for _outcome in batsman_outcome_dists:
                batsman_outcome_dists[_outcome].append(batsman_dist[_outcome])

            batsman_runs.append(batsman_runs_scored)
            batsman_balls.append(batsman_balls_faced)

            ## Bowling Data ##

            bowler_balls_bowled = np.sum([bowlers[_bowler][i] for i in bowlers[_bowler]])
            bowler_runs_given = np.sum([_outcome * bowlers[_bowler][_outcome] if _outcome != 5 else 0 for _outcome in bowlers[_bowler]])
            bowler_wickets_taken = bowlers[_bowler][5]
            bowler_dist = { _outcome : bowlers[_bowler][_outcome] / bowler_balls_bowled if bowler_balls_bowled != 0 else 0 for _outcome in bowlers[_bowler] } # getting proportion of balls bowled for each outcome

            for _outcome in bowler_outcome_dists:
                bowler_outcome_dists[_outcome].append(bowler_dist[_outcome])

            bowler_runs.append(bowler_runs_given)
            bowler_wickets.append(bowler_wickets_taken)
            bowler_balls.append(bowler_balls_bowled)

            ## Innings Data

            innings_score.append(score)
            innings_wickets.append(wickets)
            chasing.append(chasing_score)

            ball_outcome = runs if wicket == 0 else 5

            outcome.append(ball_outcome)

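            discrete_ball_no = ball["ball"]
            # over.ball notation (e.g. 3.4) -> delivery count; scale the fractional part by 10 before rounding
            discrete_ball_no = int(discrete_ball_no) * 6 + round((discrete_ball_no % 1) * 10)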
            ball_no.append(discrete_ball_no)

            rem_balls = 120 - (discrete_ball_no - 1) if discrete_ball_no <= 120 else int(discrete_ball_no - 120)
            _req_run_rate = max(0, (chasing_score - score) / rem_balls)

            req_run_rate.append(_req_run_rate)

            ## Update State ##

            batsmen[batter][runs] += 1
            bowlers[_bowler][runs] += 1
            bowlers[_bowler][5] += wicket
            score += runs
            wickets += wicket

    new_match_df = pd.DataFrame()
    new_match_df["striker"] = striker
    new_match_df["bowler"] = bowler
    new_match_df["batsman_runs"] = batsman_runs
    new_match_df["batsman_balls"] = batsman_balls

    for _outcome in batsman_outcome_dists:
        new_match_df[f"batsman_{_outcome}"] = batsman_outcome_dists[_outcome]

    new_match_df["bowler_runs"] = bowler_runs
    new_match_df["bowler_balls"] = bowler_balls
    new_match_df["bowler_wickets"] = bowler_wickets

    for _outcome in bowler_outcome_dists:
        new_match_df[f"bowler_{_outcome}"] = bowler_outcome_dists[_outcome]

    new_match_df["innings_score"] = innings_score
    new_match_df["innings_wickets"] = innings_wickets
    new_match_df["ball"] = ball_no
    new_match_df["chasing"] = chasing
    new_match_df["req_run_rate"] = req_run_rate
    new_match_df["outcome"] = outcome

    frames = [df, new_match_df]
    df = pd.concat(frames)


# df now contains the relevant match data from every single ball in the dataset
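
One step in the loop that is easy to get wrong is turning the `ball` column's over.ball notation into a running delivery count. The sketch below shows the conversion on its own, assuming the first over is numbered 0 (so 0.1 is the first ball and 19.6 is the 120th); `ball_index` is a hypothetical helper name.

```python
def ball_index(over_ball):
    """Convert over.ball notation (e.g. 3.4) to a 1-based delivery count."""
    overs = int(over_ball)
    # scale the fractional part by 10 first, then round away float error
    ball_in_over = round((over_ball - overs) * 10)
    return overs * 6 + ball_in_over

for value in (0.1, 3.4, 19.6):
    print(value, "->", ball_index(value))
```

Extra deliveries in an over (for example 3.7 after a wide) map to a count one higher than the next over's first ball would otherwise have, so the count is only approximate when there are extras.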

This now adds the player ratings to the collected data and saves the dataset.

batting_ratings = player_db[["player", "explosivity_rating", "consistency_rating", "finisher_rating", "quick_scorer_rating", "running_rating"]]
bowling_ratings = player_db[["player", "economy_rating", "wicket_taking_rating", "bowling_consistency_rating", "specialist_rating"]]

df = df.join(batting_ratings.set_index("player"), on="striker") # add batsmen's batting ratings to dataframe
df = df.join(bowling_ratings.set_index("player"), on="bowler") # add bowler's bowling ratings to dataframe

df = df.drop(["striker", "bowler"], axis=1) # remove the striker and bowler columns since they are not part of the features needed to predict the outcome of a ball

# get the means and standard deviations of each column

df_mean = df.drop(["outcome"], axis=1).mean()
df_std = df.drop(["outcome"], axis=1).std()

# save the mean and std of all the columns. this would be needed later for preprocessing when predicting new data.

with open("mean-std.bin","wb") as f:
    pkl.dump({
        "mean": df_mean,
        "std": df_std
    }, f)

# standardise all the columns except the ratings and outcome

for col in df.columns:
    if col != "outcome" and "rating" not in col:
        df[col] = zscore(df[col])

# shuffle

df = df.sample(frac=1)

# split into a training set and a testing set 
training_split = int(len(df) * 0.85) # take 85% to train

training_df = df[:training_split]
testing_df = df[training_split:]

# save the datasets

training_df.to_csv("balls-train.csv", index=False)
testing_df.to_csv("balls-test.csv", index=False)

The dataset looks something like this.

[Image: sample rows of the final dataset]
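
The means and standard deviations saved to `mean-std.bin` are needed whenever new balls are fed to the model, since they must be scaled the same way as the training data. A minimal sketch of that step is below; the statistics and the new row are made-up values standing in for the pickled contents, and `standardise_new_rows` is a hypothetical helper name.

```python
import pandas as pd

def standardise_new_rows(rows, mean, std):
    """Standardise new feature rows with statistics saved from the dataset."""
    rows = rows.copy()
    for col in rows.columns:
        # rating columns are left untouched, as in the dataset-building code
        if "rating" not in col:
            rows[col] = (rows[col] - mean[col]) / std[col]
    return rows

# Made-up statistics standing in for the contents of mean-std.bin.
stats = {
    "mean": pd.Series({"innings_score": 75.0, "innings_wickets": 2.5}),
    "std": pd.Series({"innings_score": 50.0, "innings_wickets": 2.0}),
}

new_ball = pd.DataFrame(
    {"innings_score": [125], "innings_wickets": [4], "explosivity_rating": [1.3]}
)
scaled = standardise_new_rows(new_ball, stats["mean"], stats["std"])
print(scaled)
```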

Training and testing the model

I have chosen to use a Random Forest Classifier to train on the data, which is really simple and quick to train using sklearn.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import pickle as pkl
from random import choices

dataset = pd.read_csv("balls-train.csv")
x = dataset.drop(["outcome"], axis=1)
y = dataset["outcome"]

clf = RandomForestClassifier(max_depth=15, n_estimators=50)
clf.fit(x, y)

with open("clf.bin", "wb") as f:
    pkl.dump(clf, f)

To test this model, we can measure its accuracy on the test set.

test = pd.read_csv("balls-test.csv")
predicted = clf.predict(test.drop(["outcome"], axis=1))
print (np.sum(predicted == test["outcome"]) / len(test))

Running this results in the following output:

0.4430872720835546

An accuracy of 44% does not seem too great. However, each ball can easily have 2 or 3 reasonable outcomes, so using accuracy is not a good measure to capture how the model is performing.
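
For context, it helps to compare the 44% with a trivial baseline that always predicts the most common outcome. The sketch below computes that baseline from the test-set label counts reported in the per-outcome output later in this post.

```python
# Test-set label counts, taken from the per-outcome output in this post
# (label 5 is a wicket).
actual_counts = {0: 12041, 1: 12564, 2: 2055, 3: 112, 4: 3848, 5: 1646, 6: 1628}

total = sum(actual_counts.values())
majority_label = max(actual_counts, key=actual_counts.get)
baseline_accuracy = actual_counts[majority_label] / total

print(majority_label, total, round(baseline_accuracy, 4))
```

Always predicting a single gives roughly 37% accuracy, so the classifier's 44% is better than the trivial baseline but not by a wide margin.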

Instead, I thought it would be better to look at the input data for each outcome. For each outcome, take the inputs that the model predicted to have this outcome, and also take the inputs in the test set that were labelled with this outcome. If these two sets of inputs are similar, then the model has predicted this outcome well.

test = pd.read_csv("balls-test.csv")
predicted = test.drop(["outcome"], axis=1)
preds = clf.predict(predicted)
predicted["outcome"] = preds

for outcome in range(0, 7):
    print ("Outcome ", outcome)

    # get rows which were labelled with this outcome in the test data
    test_slice = test[test["outcome"] == outcome]

    # get rows which were predicted with this outcome by the model
    predicted_slice = predicted[predicted["outcome"] == outcome]

    # a vector of the average values of the inputs that were labelled with the outcome in the test set
    test_slice_mean = test_slice.mean() 

    # a vector of the average values of the inputs that were predicted to have this outcome by the model
    predicted_slice_mean = predicted_slice.mean().fillna(0) 

    # calculates the euclidean distance between the two vectors
    dist = ((predicted_slice_mean - test_slice_mean) ** 2).sum() ** 0.5

    print ("Actual Count", len(test_slice), "Predicted Count", len(predicted_slice)) # compared how many times this outcome appeared in the test set and how many times it got predicted
    print ("Average Distance", dist)

Running this results in the following output...

Outcome  0
Actual Count 12041 Predicted Count 13324
Average Distance 1.1953108714619187
Outcome  1
Actual Count 12564 Predicted Count 20402
Average Distance 0.8442394082114827
Outcome  2
Actual Count 2055 Predicted Count 16
Average Distance 4.355749118524852
Outcome  3
Actual Count 112 Predicted Count 0
Average Distance 4.526265034977223
Outcome  4
Actual Count 3848 Predicted Count 74
Average Distance 2.6306944519746605
Outcome  5
Actual Count 1646 Predicted Count 20
Average Distance 3.586353351892623
Outcome  6
Actual Count 1628 Predicted Count 58
Average Distance 4.243057588990431

This shows that the model is performing poorly. Its distances are quite large for each outcome, except for 0 and 1, considering the magnitudes of the values in the input.

It has also not been able to match the proportions of outcomes in the test set, with the predicted counts being very different to the actual counts.

The model is clearly biased towards predicting 0 and 1 for each ball. This does make sense, however, since those are the most common outcomes in games of Cricket.

The problem here is how the predictions are made. The model outputs a probability distribution of the outcomes for each ball. Currently, the outcome with the highest probability is taken as the predicted outcome. It would make more sense to randomly select an outcome, using the probability distribution to weight the random selection.

Here is the same code but using weighted random selection.

test = pd.read_csv("balls-test.csv")
predicted = clf.predict_proba(test.drop(["outcome"], axis=1))
preds = []

for weights in predicted:
    # weighted random selection; clf.classes_ lists the labels in the same
    # order as the columns returned by predict_proba
    p = choices(clf.classes_, weights=weights)

    preds.append(p[0])

predicted = test.drop(["outcome"], axis=1)
predicted["outcome"] = preds

for outcome in range(0, 7):
    print ("Outcome ", outcome)

    test_slice = test[test["outcome"] == outcome]
    predicted_slice = predicted[predicted["outcome"] == outcome]

    # a vector of the average values of the inputs that were labelled with the outcome in the test set
    test_slice_mean = test_slice.mean() 

    # a vector of the average values of the inputs that were predicted to have this outcome by the model
    predicted_slice_mean = predicted_slice.mean().fillna(0) 

    # calculates the euclidean distance between the two vectors
    dist = ((predicted_slice_mean - test_slice_mean) ** 2).sum() ** 0.5

    print ("Actual Count", len(test_slice), "Predicted Count", len(predicted_slice))
    print ("Average Distance", dist)

Running this results in the following...

Outcome  0
Actual Count 12041 Predicted Count 12072
Average Distance 0.06285163994309419
Outcome  1
Actual Count 12564 Predicted Count 12507
Average Distance 0.047105551432281484
Outcome  2
Actual Count 2055 Predicted Count 2089
Average Distance 0.13890305459997243
Outcome  3
Actual Count 112 Predicted Count 94
Average Distance 0.779151006384056
Outcome  4
Actual Count 3848 Predicted Count 3858
Average Distance 0.09621869495553977
Outcome  5
Actual Count 1646 Predicted Count 1676
Average Distance 0.2084278773037825
Outcome  6
Actual Count 1628 Predicted Count 1598
Average Distance 0.29624599526308754

Now the model can be seen to be performing very well. It matches the test set's proportion of outcomes and the distances between the inputs have reduced to a small range.
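
The weighted selection loop above draws one ball at a time. As an optional alternative, here is a vectorised sketch in NumPy that samples every row at once by comparing a uniform random number against each row's cumulative probabilities. The `probs` array is made up, and `classes` is a hypothetical stand-in for `clf.classes_`, which gives the labels in the same column order as `predict_proba`.

```python
import numpy as np

def sample_outcomes(probs, classes, seed=None):
    """Draw one outcome per row of probs, weighted by that row's probabilities."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    cumulative = probs.cumsum(axis=1)
    cumulative[:, -1] = 1.0  # guard against floating-point shortfall
    draws = rng.random(len(probs))
    # index of the first column whose cumulative probability exceeds the draw
    picked = (draws[:, None] < cumulative).argmax(axis=1)
    return np.asarray(classes)[picked]

probs = np.array([[0.5, 0.5, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
classes = np.array([0, 1, 5])  # stand-in for clf.classes_
sampled = sample_outcomes(probs, classes, seed=0)
print(sampled)
```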

Exploring each feature

I thought it would be intriguing to see what each feature of the dataset contributed to the outcome of a ball.

Shown below are box plots for each feature against each predicted outcome to help visualise the distribution of features against each output. This will help in showing how a feature contributes to the outcome.

[Box plots omitted. One plot was shown for each of the following features, in this order:]

  • Batsman's runs
  • Batsman's balls
  • Batsman's dot ball proportion
  • Batsman's single proportion
  • Batsman's double proportion
  • Batsman's three runs proportion
  • Batsman's four runs proportion
  • Batsman's six runs proportion
  • Runs conceded by the bowler
  • Number of balls bowled by the bowler
  • Wickets taken by the bowler
  • Bowler's dot ball proportion
  • Bowler's single proportion
  • Bowler's double proportion
  • Bowler's three runs proportion
  • Bowler's four runs proportion
  • Bowler's wicket delivery proportion
  • Bowler's six runs proportion
  • Innings score
  • Innings wickets
  • Ball number of the innings
  • Score to chase
  • Required run rate
  • Explosivity rating
  • Consistency rating
  • Finisher rating
  • Quick scorer rating
  • Running rating
  • Economy rating
  • Wicket taking rating
  • Bowling consistency rating
  • Bowling specialist rating
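
The quantities a box plot draws (the quartiles and median of a feature for each outcome) can also be computed directly, which makes the comparison between outcomes easy to check numerically. The sketch below uses a tiny made-up sample with one feature column; with the real data, the same `groupby` would be applied to each feature against the predicted outcome.

```python
import pandas as pd

# Tiny made-up sample: one feature plus the predicted outcome for each ball.
balls = pd.DataFrame(
    {
        "batsman_runs": [0, 4, 10, 35, 50, 2, 18, 60],
        "outcome": [0, 0, 0, 4, 4, 1, 1, 4],
    }
)

# One row per outcome, one column per quantile (0.25, 0.5, 0.75).
quartiles = (
    balls.groupby("outcome")["batsman_runs"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

If the rows of this table look alike across outcomes, the feature is not separating the outcomes, which is the same reading as box plots that look alike.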

From these box plots, we can see which features have more of an impact on the outcome.

Features that produce similar looking boxplots for each outcome do not have much of an effect on the outcome of a ball. These features include:

  • Batsman/Bowler ball outcome proportions
  • Bowler ratings
  • Number of wickets taken by a bowler
  • Chasing score
  • Required run rate

It was not surprising that the chasing score and required run rate features had little effect on the outcome, as half the balls in the dataset (those bowled in the first innings) have no target to chase.

It also was not surprising to see the ball outcome proportion and no. of wickets taken features in this list too. This is because these can have the same values appear all throughout a cricket match, so they are bound to have a wide range of outcomes for the same values.

I was however surprised to see that bowler ratings did not have much of an impact on the outcome, while batting ratings did.

This can mean a few things:

  • The outcome of a game comes more down to how strong a team's batting lineup is rather than their bowling.
  • There must be a better way to quantify the skills of a bowler

After some thought, while I do think it's a mix of both, I feel it is mainly due to the first point.

I believe this is because, especially in a competition like the IPL, the quality of bowlers does not vary as much as the quality of batsmen (statistically speaking at least). It is much more common for bowlers and other non-specialist batsmen to bat in a game, while you would almost never find a specialist batsman bowling and only occasionally find a part-time bowler.

The way to quantify a bowler's skill could still be improved however, maybe considering some of the following:

  • Average pace of the bowler
  • Average degree of turn the bowler gets (for spinners)
  • Adjusting their economy/wicket taking ratings to consider the contexts of the matches they bowled in

These pieces of data were outside the scope of the dataset I had, but would be interesting to implement to improve this project in the future.

Nevertheless, I am happy with how the model has trained, aligning itself closely to the test dataset.

Part 2

To avoid making this part too long, I have decided to split this up into two parts.

Part 2 will involve building the actual match simulator. I will use it to see how it does in simulating real games and to answer any question about any hypothetical game situations.

Thank you for reading!
