README of my project:
Machine-Learning-Baseball ⚾
Baseball
The movie Moneyball, which is based on a true story, shows how in-game baseball statistics can be collected and analyzed to provide accurate answers to specific questions. This relies on the fact that, over the course of a season, teams fall into patterns and react to factors in a repetitive manner, which ultimately affects their in-game performance. Essentially, MLB is one large complex system with feedback, stocks, and other system qualities; as a result, it can theoretically be understood.
Hypothesis
We theorized that there is a relationship between the statistics of a game and its outcome. As a result, the group focused on implementing a model that predicts the score of a particular game from the statistics of that game.
Model Overview
Both teams in a game are given individual ID values, which are turned into vectors. Relevant data such as the home and away teams, home runs, RBIs, and walks are all taken into account and passed through the layers. There's no need to reinvent the wheel here: a multitude of libraries enable a coder to implement machine learning theories efficiently. In this case we use a library called TFLearn, with documentation available at http://tflearn.org. The program outputs the home and away teams as well as their respective score predictions.
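To make the encoding step concrete, here is a minimal sketch in plain Python of how two teams and a few game statistics might be turned into a single input vector. The team list, the `encode_game` helper, and the use of one-hot vectors are assumptions for illustration; the project's notebook may encode team IDs differently, for example through a learned layer in TFLearn.

```python
# Hypothetical team list; the real project covers all 30 MLB teams.
TEAMS = ["BOS", "NYA", "TOR"]
TEAM_ID = {name: i for i, name in enumerate(TEAMS)}


def one_hot(team):
    """Return a vector with 1.0 at the team's ID and 0.0 elsewhere."""
    vec = [0.0] * len(TEAMS)
    vec[TEAM_ID[team]] = 1.0
    return vec


def encode_game(home, away, home_stats, away_stats):
    """Concatenate both team vectors with each side's statistics."""
    stats = [float(s) for s in home_stats] + [float(s) for s in away_stats]
    return one_hot(home) + one_hot(away) + stats


# Made-up statistics in the order (home runs, RBIs, walks).
x = encode_game("BOS", "NYA", home_stats=[2, 5, 3], away_stats=[1, 3, 4])
print(x)
```

The resulting list has one slot per team for each side plus one slot per statistic, which is the kind of fixed-length vector a network's input layer expects.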
Implementation
As mentioned earlier, the model was built using TFLearn, which is a high-level API for TensorFlow. The model's input data is the 2020/2021 baseball season statistics, scores, and matchups. The model learns which statistics are useful for deciding a score; it also recognizes the different teams by feeding each team's ID into a separate layer first. To train the model we used backpropagation with gradient descent, which TFLearn handled during the training process. An in-depth description of the model is given in the notebook that is provided with the report. We chose to train the model on the 2020 season, using that data to learn which statistics are important in a game, so that the trained model could then make predictions. We applied the model to the 2021 season: for each game we gave the model the statistics, scores, and teams in that game to base a prediction on, even though the statistics are not determined until after a game. We originally tried using each team's average statistics over all their previous games as input; however, these predictions were no better than a coin flip.
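TFLearn hides the training loop, so as an illustration of what gradient descent does, here is a sketch that fits a one-feature linear model by hand. It is not the project's network: the data points are synthetic (generated from runs = 2 * home_runs + 1), and the learning rate and epoch count are arbitrary choices.

```python
# Synthetic data: runs = 2 * home_runs + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train(xs, ys)
print(round(w, 3), round(b, 3))
```

In a neural network, backpropagation computes the same kind of gradients for the weights of every layer, and the update step is applied to all of them.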
Results
In the end the model predicted games quite well. Predicted scores were within 1 or 2 runs of the actual values, and the model picked the winner of a game about 90% of the time. The high accuracy of the model supports the hypothesis that baseball game statistics are highly correlated with the final score of the game. For example, the number of home runs a team hits in a game has a direct effect on how high its score is. Also, when a team gets more hits than its opponent, it has a higher probability of scoring more runs than the other team. On defense, if a team completes a substantial number of double plays in a game, it demonstrates that its defense is effective and that the other team will have a harder time scoring runs. The neural net was able to detect these subtle relations and make effective predictions of who wins the game and what the score is.
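The two figures quoted above (how close the predicted scores are, and how often the predicted winner is right) can be computed along these lines. The `evaluate` helper and the three sample games are hypothetical and exist only to show the arithmetic; they are not the project's results.

```python
def evaluate(games):
    """games: list of ((pred_home, pred_away), (actual_home, actual_away))."""
    abs_errors = []
    correct = 0
    for (pred_home, pred_away), (real_home, real_away) in games:
        abs_errors.append(abs(pred_home - real_home))
        abs_errors.append(abs(pred_away - real_away))
        # The winner is right when predicted and actual margins share a sign.
        if (pred_home - pred_away) * (real_home - real_away) > 0:
            correct += 1
    mean_abs_error = sum(abs_errors) / len(abs_errors)
    winner_accuracy = correct / len(games)
    return mean_abs_error, winner_accuracy


# Made-up predictions and outcomes.
sample = [((4.6, 2.1), (5, 2)), ((3.2, 3.9), (2, 4)), ((6.1, 5.0), (4, 7))]
mae, accuracy = evaluate(sample)
print(mae, accuracy)
```

Reporting both numbers matters because a model can be close on scores and still pick the wrong winner in tight games, as the third sample game shows.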
Code of my project:
The code is written in Python. Save it in a .py file to execute it.
import numpy as np


# Defined at module level so that __init__ can call it by name.
def removeQuotes(string):
    # strip one pair of matching single or double quotes, if present
    if (string.startswith('"') and string.endswith('"')) or (string.startswith("'") and string.endswith("'")):
        return string[1:-1]
    return string


class GameStats(object):
    def __init__(self, homeTeamNameIndex, homeTeamScoreIndex, homeTeamStatsIndex, visitorTeamNameIndex, visitorTeamScoreIndex, visitorTeamStatsIndex):
        # parse the text file
        self.statsFile = open("baseball2016.txt", "r")
        self.topArray = []   # team names; list index = matrix column
        self.sideArray = []  # team names; list index = matrix row
        # sc[row, col] holds up to 30 scores for one matchup; -1 marks an unused slot
        self.sc = np.full((30, 30, 30), -1, np.int32)
        self.am = np.zeros((30, 30), np.float32)  # average score per matchup
        self.gameList = []
        # indexes of the tokens that are kept from each line
        tokenIndex = [homeTeamNameIndex, homeTeamScoreIndex, visitorTeamNameIndex, visitorTeamScoreIndex] + list(homeTeamStatsIndex) + list(visitorTeamStatsIndex)
        for line in self.statsFile:
            token = line.split(',')  # tokenize the string
            attributes = dict()
            for i in range(len(token)):
                if i in tokenIndex:
                    attributes[i] = removeQuotes(token[i])
            self.addScore(attributes[homeTeamNameIndex], attributes[visitorTeamNameIndex], attributes[homeTeamScoreIndex], attributes[visitorTeamScoreIndex])
            # the visitor's stats come from visitorTeamStatsIndex
            self.addGame(attributes[homeTeamNameIndex], attributes[homeTeamScoreIndex], [attributes[i] for i in homeTeamStatsIndex], attributes[visitorTeamNameIndex], attributes[visitorTeamScoreIndex], [attributes[i] for i in visitorTeamStatsIndex])
        self.buildAvgMatrix()
        self.statsFile.close()

    def addGame(self, team1, score1, stats1, team2, score2, stats2):
        self.gameList.append([team1, score1, stats1, team2, score2, stats2])

    # give it two teams, the scores, and it will add it to the matrix
    def addScore(self, team1, team2, score1, score2):
        '''
        for a team in the top array, its index corresponds to the matrix column it is located in
        for a team in the side array, its index corresponds to the matrix row it is located in
        '''
        # team 1 score entry
        try:
            row = self.sideArray.index(team2)
        except ValueError:
            self.sideArray.append(team2)
            row = self.sideArray.index(team2)
        try:
            col = self.topArray.index(team1)
        except ValueError:
            self.topArray.append(team1)
            col = self.topArray.index(team1)
        temp = self.sc[row, col]
        counter = 0
        for e in temp:
            if e == -1:  # first unused slot
                temp[counter] = int(score1)
                break
            counter += 1
        self.sc[row, col] = temp
        # team 2 score entry
        try:
            row = self.sideArray.index(team1)
        except ValueError:
            self.sideArray.append(team1)
            row = self.sideArray.index(team1)
        try:
            col = self.topArray.index(team2)
        except ValueError:
            self.topArray.append(team2)
            col = self.topArray.index(team2)
        temp = self.sc[row, col]
        counter = 0
        for e in temp:
            if e == -1:  # first unused slot
                temp[counter] = int(score2)
                break
            counter += 1
        self.sc[row, col] = temp

    # returns the score(s) for match up
    def getScore(self, team1, team2, gameSelect=None):
        print(team1, team2)
        try:
            score1 = self.sc[self.sideArray.index(team2), self.topArray.index(team1)]
            score2 = self.sc[self.sideArray.index(team1), self.topArray.index(team2)]
            if gameSelect is None:
                print(team1, score1)
                print(team2, score2)
            else:
                print(team1, score1[gameSelect])
                print(team2, score2[gameSelect])
        except (ValueError, IndexError):
            print('Invalid input of teams')

    def getGameList(self):
        return self.gameList

    # constructs a matrix of the avg score in a matchup
    def buildAvgMatrix(self):
        for col in range(self.sc.shape[1]):
            for row in range(self.sc.shape[0]):
                tempScore = self.sc[row, col]
                avgScore = 0.0
                count = 0.0
                for j in tempScore:
                    if j != -1:
                        avgScore += j
                        count += 1
                    else:
                        break
                # -1 marks a matchup with no recorded games
                self.am[row, col] = avgScore / count if count else -1

    # get the value of the avg score for a match up
    def getAvgScore(self, team1, team2):
        try:
            score1 = self.am[self.sideArray.index(team2), self.topArray.index(team1)]
            score2 = self.am[self.sideArray.index(team1), self.topArray.index(team2)]
            print(team1, score1)
            print(team2, score2)
        except ValueError:
            print('Invalid input of teams')
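For comparison, the same matchup bookkeeping can be sketched with dictionaries instead of fixed 30x30 arrays. This is an alternative under assumptions, not the class above: `MatchupScores` and its method names are made up for this example, and it keeps every score rather than capping each matchup at 30 games.

```python
from collections import defaultdict


class MatchupScores:
    """Store the runs each team scored against each opponent."""

    def __init__(self):
        # (scoring_team, opponent) -> list of runs scored in each game
        self.scores = defaultdict(list)

    def add_score(self, team1, team2, score1, score2):
        self.scores[(team1, team2)].append(int(score1))
        self.scores[(team2, team1)].append(int(score2))

    def average(self, team, opponent):
        """Average runs by `team` against `opponent`, or None if they never met."""
        runs = self.scores.get((team, opponent))
        return sum(runs) / len(runs) if runs else None


table = MatchupScores()
table.add_score("BOS", "NYA", "5", "3")  # made-up games
table.add_score("BOS", "NYA", "1", "4")
print(table.average("BOS", "NYA"), table.average("NYA", "BOS"))
```

A dictionary avoids the separate row and column name lists and the -1 sentinel, at the cost of losing the dense matrix layout that NumPy operations can work on directly.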
Baseball Format guide
Field(s) Meaning
1 Date in the form "yyyymmdd"
2 Number of game:
"0" -- a single game
"1" -- the first game of a double (or triple) header
including separate admission doubleheaders
"2" -- the second game of a double (or triple) header
including separate admission doubleheaders
"3" -- the third game of a triple-header
"A" -- the first game of a double-header involving 3 teams
"B" -- the second game of a double-header involving 3 teams
3 Day of week ("Sun","Mon","Tue","Wed","Thu","Fri","Sat")
4-5 Visiting team and league
6 Visiting team game number
For this and the home team game number, ties are counted as
games and suspended games are counted from the starting
rather than the ending date.
7-8 Home team and league
9 Home team game number
10-11 Visiting and home team score (unquoted)
12 Length of game in outs (unquoted). A full 9-inning game would
have a 54 in this field. If the home team won without batting
in the bottom of the ninth, this field would contain a 51.
13 Day/night indicator ("D" or "N")
14 Completion information. If the game was completed at a
later date (either due to a suspension or an upheld protest)
this field will include:
"yyyymmdd,park,vs,hs,len" Where
yyyymmdd -- the date the game was completed
park -- the park ID where the game was completed
vs -- the visitor score at the time of interruption
hs -- the home score at the time of interruption
len -- the length of the game in outs at time of interruption
All the rest of the information in the record refers to the
entire game.
15 Forfeit information:
"V" -- the game was forfeited to the visiting team
"H" -- the game was forfeited to the home team
"T" -- the game was ruled a no-decision
16 Protest information:
"P" -- the game was protested by an unidentified team
"V" -- a disallowed protest was made by the visiting team
"H" -- a disallowed protest was made by the home team
"X" -- an upheld protest was made by the visiting team
"Y" -- an upheld protest was made by the home team
Note: two of these last four codes can appear in the field
(if both teams protested the game).
17 Park ID
18 Attendance (unquoted)
19 Time of game in minutes (unquoted)
20-21 Visiting and home line scores. For example:
"010000(10)0x"
Would indicate a game where the home team scored a run in
the second inning, ten in the seventh and didn't bat in the
bottom of the ninth.
22-38 Visiting team offensive statistics (unquoted) (in order):
at-bats
hits
doubles
triples
homeruns
RBI
sacrifice hits. This may include sacrifice flies for years
prior to 1954 when sacrifice flies were allowed.
sacrifice flies (since 1954)
hit-by-pitch
walks
intentional walks
strikeouts
stolen bases
caught stealing
grounded into double plays
awarded first on catcher's interference
left on base
39-43 Visiting team pitching statistics (unquoted)(in order):
pitchers used ( 1 means it was a complete game )
individual earned runs
team earned runs
wild pitches
balks
44-49 Visiting team defensive statistics (unquoted) (in order):
putouts. Note: prior to 1931, this may not equal 3 times
the number of innings pitched. Prior to that, no
putout was awarded when a runner was declared out for
being hit by a batted ball.
assists
errors
passed balls
double plays
triple plays
50-66 Home team offensive statistics
67-71 Home team pitching statistics
72-77 Home team defensive statistics
78-79 Home plate umpire ID and name
80-81 1B umpire ID and name
82-83 2B umpire ID and name
84-85 3B umpire ID and name
86-87 LF umpire ID and name
88-89 RF umpire ID and name
If any umpire positions were not filled for a particular game
the fields will be "","(none)".
90-91 Visiting team manager ID and name
92-93 Home team manager ID and name
94-95 Winning pitcher ID and name
96-97 Losing pitcher ID and name
98-99 Saving pitcher ID and name--"","(none)" if none awarded
100-101 Game Winning RBI batter ID and name--"","(none)" if none
awarded
102-103 Visiting starting pitcher ID and name
104-105 Home starting pitcher ID and name
106-132 Visiting starting players ID, name and defensive position,
listed in the order (1-9) they appeared in the batting order.
133-159 Home starting players ID, name and defensive position
listed in the order (1-9) they appeared in the batting order.
160 Additional information. This is a grab-bag of informational
items that might not warrant a field on their own. The field
is alpha-numeric. Some items are represented by tokens such as:
"HTBF" -- home team batted first.
Note: if "HTBF" is specified it would be possible to see
something like "01002000x" in the visitor's line score.
Changes in umpire positions during a game will also appear in
this field. These will be in the form:
umpchange,inning,umpPosition,umpid with the latter three
repeated for each umpire.
These changes occur with umpire injuries, late arrival of
umpires or changes from completion of suspended games. Details
of suspended games are in field 14.
161 Acquisition information:
"Y" -- we have the complete game
"N" -- we don't have any portion of the game
"D" -- the game was derived from box score and game story
"P" -- we have some portion of the game. We may be missing
innings at the beginning, middle and end of the game.
Missing fields will be NULL.
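The field numbers in this guide start at 1, while the `GameStats` class indexes the split tokens from 0, so field 7 (home team) is index 6, and so on. The sketch below shows that mapping with Python's `csv` module, which also keeps a quoted field such as field 14 in one piece even when it contains commas; a plain `split(',')` would break it into several tokens. The `parse_game` helper, the constant names, and the sample record are assumptions made up for illustration.

```python
import csv
import io

# 1-based field numbers from the guide, converted to 0-based indices.
VISITOR_TEAM = 4 - 1
HOME_TEAM = 7 - 1
VISITOR_SCORE = 10 - 1
HOME_SCORE = 11 - 1
VISITOR_OFFENSE = range(22 - 1, 38)  # fields 22-38
HOME_OFFENSE = range(50 - 1, 66)     # fields 50-66


def parse_game(line):
    """Parse one game-log record into a small dictionary."""
    # csv.reader removes the quotes and keeps commas inside quoted fields.
    row = next(csv.reader(io.StringIO(line)))
    return {
        "visitor": row[VISITOR_TEAM],
        "home": row[HOME_TEAM],
        "visitor_score": int(row[VISITOR_SCORE]),
        "home_score": int(row[HOME_SCORE]),
        "visitor_offense": [int(row[i]) for i in VISITOR_OFFENSE],
        "home_offense": [int(row[i]) for i in HOME_OFFENSE],
    }


# Synthetic 161-field record with made-up values; unused fields are "0".
fields = ["0"] * 161
fields[VISITOR_TEAM], fields[HOME_TEAM] = "NYA", "BOS"
fields[VISITOR_SCORE], fields[HOME_SCORE] = "3", "5"
fields[14 - 1] = "20160802,BOS07,1,2,27"  # field 14 contains commas
buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
sample_line = buffer.getvalue().strip()

game = parse_game(sample_line)
print(game["visitor"], game["visitor_score"], game["home"], game["home_score"])
```

The index lists passed to `GameStats` would follow the same convention, for example `range(49, 66)` for the home team's offensive statistics.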
I used the PyCharm IDE with a Python 3.9 interpreter (pythonProject) to make this project.
This was my gaming project. I don't know how to upload a video, so I am sorry for uploading my video here. If I have made any mistakes, I am sorry; please excuse me.
Thanks
Mandvi
Top comments (4)
Hey, I'm trying to better understand the ERA (Earned Run Average) statistic in baseball and its significance. What is considered a "good" ERA for pitchers at different levels of the game? For example, is a 3.00 ERA universally excellent, or does it vary between the MLB, college, or minor leagues?
Also, how much does a team’s defense impact a pitcher’s ERA? Can a good ERA always be attributed to strong pitching, or do fielding errors and other factors play a role?
Lastly, in today’s era of analytics, how does ERA compare to newer stats like FIP (Fielding Independent Pitching)?
Super cool! Where did you get the baseball data from?
Thank you for submitting your project, Mindvi! Can you please clarify how SashiDo.io was used here?
@sashido.io