Since I joined EDB, I figured it was time to learn how to install Postgres. I techinically did it when I installed Discourse. And at some point I seem to have typed brew intstall postgres
on my MacBook. But I'm talking installing it the hard way. I'm talking installing from the source code. Yeah!
I've always found that the only way to learn new technology is to have a reason to use it. Work-related requirements suffice, but it's even better to have a hobby-related reason. For example, creating a baseball database. If you want to do research on baseball history, a great place to start isProject Retrosheet which has detailed play-by-play files going back to 1916. So if you load this data into a database, you can do all sorts of queries to find out how often (or even whether) certain events have happened. For instance, has anyone ever hit two grand slams in an inning?1
Prerequisites
The instructions for installing Postgres and for setting up a Retrosheet database are incomplete for a very simple reason: prerequisites. People who have correctly set up their environment already have everything they needed. People who are getting started don't know what they are missing. Critically, it takes a lot of effort to figure out which of the many possible prerequisites are actually required. So what's needed is either someone to start from scratch and document everything they needed to install.
I got a PC for my new job and this is the first time in nearly a decade that I've used Windows in earnest. So I have some adjustments to make. I started documenting them over here, but I have a feeling this list will grow over time.2
One of the reasons I'm documenting these steps is that I'm going back and testing the installation process using a pristine Windows virtual machine. That way I can test what things are really needed and what is optional or unnecessary. Or, at least I could in theory and maybe will in practice.
Getting Retrosheet data
Retrosheet's data has an idiosyncratic format, so it helps to have software to parse it. So the next step is to download thecommand-line toolsmaintained by the Chadwick Baseball Bureau. Then I move them toC:\Users\[username]\bin
and add that to my PATH environment variable. That way I can use those executables without specifying the full path. More importantly, so can R scripts.
Since the instructions I'm following use R, the next step is todownload a recent version of R. While not strictly necessary, I also grabbedRStudio, which is a fantastic IDE that uses R Markdown. It's a form of literate programming that allows both a data-centric article and the script that generated the data to live in the same document.3
Next you gotta install a bunch of R libraries the script need. Start withdevtools
and then run the following commands from the RStudio console window:
install.packages("devtools")
install.packages('h2o')
install.packages('emayili')
install_github("BillPetti/bpettir")
install.packages("purrr")
I made a few tweaks to the script Bill Petti uses to parse and load Retrosheet data. One problem is that several packages must be loaded first:
require(bpettir)
require(future)
require(furrr)
require(purrr)
This can go at the top of the script where the other require
statements live for simplicity. I also changed "~/Desktop/retrosheet"
to "~/retrosheet"
and added setwd
commands to make sure the script is in the right directory for each input file it needs. For some reason, RStudio actually puts its output in~/Documents/retrosheet
. I think it uses asandbox for R that has ~/Documents
for a home directory.
Installing Postgres from source
I checked all the steps right up to populating Postgres database tables. While its certainly possible to use R to analyze the data, my goal was to have a reason to install Postgres. Somewhat annoyingly for the purposes of this post, the short versionworked almost perfectly using WSL 2.
The only problem I had was that I don't know the password for the Administrator account so I couldn't use su
. Thankfully, I found that I could use sudo su
instead. That confused me until I read some answers on Ask Ubuntu. Turns out su
asks for the password of the account you want to switch to and sudo
asks for the password of the current account.
Installing prerequisites from the command line
Since I wanted to install Postgres the hard way I decided to try installing it on Windows using the Microsoft toolchain. Since I'm not a big fan of GUI installers, I was pleased to find I could useChocolatey to install the prerequisites.4 After installing Chocolatey, open a PowerShell window (as an administrator) and run the following command:
choco install ActivePerl git openssl winflexbison python archiver -y
That should install all of the third-party tools you'll need to build Postgres. You might need to add C:\Perl64\bin
to your PATH
and runrefreshenv
. Then install the Visual Studio tools you'll need to build:
choco install visualstudio2019community windows-sdk-10.1 -y
(The -y
is optional and simply avoids prompting for confirmation by automatically saying "yes" to any questions the installer might ask. You normally want to just say yes.)
While you are at it, you can install all sorts of useful software from the command line:
choco install Emacs vim texlive
And even some of the software I installed by downloading a GUI installer:
choco install R.Studio wsl2
Installing MSVC Build Tools
There is one feature that I haven't figured out how to install via Chocolatey. It's the Visual Studio C++ build tools, which can be installed via the Visual Studio Installer. You can start it up from the Visual Studio 2019 menu system:
Tools => Get Tools and Features...
You can also just run it independently. Either way, you'll want to modify your version of Visual Studio:
Modify => Desktop development with C++ => MSVC v142 - VS 2019 C++ x64/x86 build tools
By the way, I'm including the version numbers I used, but it probably will work fine with newer (and perhaps older) versions. Another constant problem with installing software is that sometimes new releases break stuff that used to work. For instance, Ruby 3 came out recentlyand it breaks the way I build my site. This is a constant problem with guides about software and closely related to the prerequisites problem. Most of the time "install X" is helpful, but sometimes you need to say "install version Y of X" instead.
Building Postgres
At this point we're finally ready to start building Postgres! I'm going to show how to do that in PowerShell. But not just any PowerShell. For this you need to start up Developer PowerShell for VS 2019. Regular PowerShell doesn't work because the build tools aren't readily available.5 There's also a Developer Command Prompt, but I prefer PowerShell for a few Unix-like enhancements. The following commands, create a pg
directory for playing in, move into that directory, get the Postgres source code, extract the files and move to directory where we'll build:
mkdir pg
cd pg
wget https://ftp.postgresql.org/pub/source/v13.2/postgresql-13.2.tar.bz2 -OutFile postgresql-13.2.tar.bz2
arc unarchive .\postgresql-13.2.tar.bz2
cd .\postgresql-13.2\src\tools\msvc\
Now PowerShell is really handy, but notice how the wget
command requires an -OutFile
parameter. If you leave it off, the command reads the URL, shows some metadata and doesn't save the file. It's just an alias for Invoke-WebRequest -Uri
. It's like they get it, but they don't get it. I'm not sure what's the point of fetching files from the internet and immediately throwing them away.
Anyway, if you've set everything up right, you should be able to run the .\build
command. If you haven't installed all the right prerequisites or aren't running in the Developer PowerShell, you'll get an error like this:
PS C:\pg\tmp\postgresql-13.2\src\tools\msvc> .\build
Unable to determine Visual Studio version: The nmake command wasn't found. at C:/pg/tmp/postgresql-13.2/src/tools/msvc/Mkvcbuild.pm line 93.
Otherwise, the output will end something like:
FinalizeBuildStatus:
Deleting file ".\Release\euc2004_sjis2004\euc2004_sjis2004.tlog\unsuccessfulbuild".
Touching ".\Release\euc2004_sjis2004\euc2004_sjis2004.tlog\euc2004_sjis2004.lastbuildstate".
Done Building Project "C:\pg\postgresql-13.2\euc2004_sjis2004.vcxproj" (default targets).
Done Building Project "C:\pg\postgresql-13.2\pgsql.sln" (default targets).
Build succeeded.
0 Warning(s)
0 Error(s)
Time Elapsed 00:04:01.81
Installing Postgres
Once the build succeeds, it's time to install:
.\install \pg
By the way, it doesn't much matter where you install Postgres. It can help to put it somewhere you will remember later. The EDB installerputs it in C:\Program Files\PostgreSQL\13
, which probably is a better choice. Wherever you install it, you'll find a bin
directory that has all the commands you'll need to start the database. You'll need to create a data
directory to hold, well, all your data. While you could run using your own account, it's better to run it under another user account. (A good choice is"postgres".) Either way, make sure your data directories (but not the executables) are owned by the account that will run Postgres.
In the bin
directory, you can find the commands to initialize a database directory, start a Postgres server using that data storage and create a database called "retrosheet":
.\initdb.exe -D \pg\data
.\pg_ctl -D \pg\data -l logfile start
.\createdb.exe retrosheet
Then go ahead and make sure the database exists and is empty:
PS C:\pg\bin> .\psql retrosheet
psql (13.2)
WARNING: Console code page (437) differs from Windows code page (1252)
8-bit characters might not work correctly. See psql reference
page "Notes for Windows users" for details.
Type "help" for help.
retrosheet=# \d
Did not find any relations.
retrosheet=#
Writing Retrosheet data to Postgres
Poor retrosheet database. Let's give it family, shall we?
Now I have to admit something kinda embarrassing here. Building Postgres from source was surprisingly easy, but I made connecting to the database from R incredibly hard. My only excuse is that the last time I had to set up a database connection on my own was over a decade ago. Pretty much the only thing I remembered was that ODBC was involved somehow. When I mentioned how frustrating ODBC still is to install, a colleague kindly pointed out it's not necessary anymore.
What makes this especially embarrassing is that back in 2006, I wrotea Perl script to connect to arbitrary database serversusing the DBI module which is exeedingly similar to the R DBI package. Not only do you not need ODBC, DBI is a competitor.6 Connecting to our new database via R is as simple as installing two packages:
install.packages('DBI')
install.packages('RPostgres')
And using the appropriate connection string:
require(DBI)
retrosheet_db <- dbConnect(RPostgres::Postgres(),
dbname = 'retrosheet',
host='localhost',
port='5432',
user='postgres')
By default, you don't need to specify the postgres passwordwhen connecting via the same machine. If you do need a password, it's good practice to save it in an environment variable and use something like password=Sys.getenv('PG_PWD')
so you don't hardcode the secret. And with that, you can execute the dbWriteTable
anddbGetQuery
commands list in Bill Petti's instructions. It takes a while, but when everything is done, you can see two new tables have been created:
=retrosheet=# \d
List of relations
Schema | Name | Type | Owner
-------------+---------------+-------+----------
public | retro_pbp | table | postgres
public | retro_rosters | table | postgres
And if you are really lucky, they will have data:
retrosheet=# select count(*) from retro_pbp;
count
---------------
13792467
(1 row)
retrosheet=# select count(*) from retro_rosters;
count
------------
92727
(1 row)
Has anyone ever hit two grand slams in a single inning?
Now we can start doing some baseball research. For instance, if you know that a grand slam is the only way to get 4 RBIs in a single play,7it's not hard to find all the batters who have hit at least two grand slams in a game:
select game_id, bat_id, count(*)
from retro_pbp p
where rbi_ct = 4
group by game_id, bat_id
having count(*) > 1;
game_id | bat_id | count |
---|---|---|
BOS199905100 | garcn001 | 2 |
CLE196806240 | nortj101 | 2 |
CLE199808140 | hoilc001 | 2 |
LAN199904230 | tatif001 | 2 |
MIL200907270 | willj004 | 2 |
MIN196105090 | gentj101 | 2 |
PHA193605240 | lazzt101 | 2 |
PHA193907042 | taboj101 | 2 |
SFN196607030 | clont101 | 2 |
SLA194607270 | yorkr101 | 2 |
TEX199509040 | ventr001 | 2 |
TEX200307290 | muelb001 | 2 |
WS2197006260 | robif103 | 2 |
(13 rows) |
While you might not understand the significance of this report, I did. It meant I had not only managed to install Postgres, but also load it up with over a 100 years of baseball history. One particularbat_id
stands out as the player I'm looking for. Next, I queried for all the times when a batter has hit two grand slams in a single inning:
select game_id, inn_ct, bat_id, count(*)
from retro_pbp
where rbi_ct = 4
group by game_id, inn_ct, bat_id
having count(*) > 1;
game_id | inn_ct | bat_id | count |
---|---|---|---|
LAN199904230 | 3 | tatif001 | 2 |
(1 row) |
Now you can take that game_id
, plug it into the nearest search engine and discover St. Louis vs the Dodgers on April 23, 1999. There, in the third inning, Fernando Tatis slams twice. If you feel like it, you can watch a crude animationof how it happened. Think about the series of events that had to happen just so in order for Tatis to get up with the bases loaded twice.
Meanwhile, think of all the little things that needed to be initiated to get a working database server up and running. Thankfully software can be quite a bit easier to manipulate than a baseball game.
-
If you've watched as much baseball as I have, you know the answer. Don't spoil it for everyone else! ↩
-
The #1 revelation was that Windows uses the Control key for most of the stuff Mac uses the Command key for. To make life more complicated, Mac keyboards have an Option key that functions like Alt and it's in the same position where the Windows key is on a PC keyboard. Eventually I realized I just want to be able to use my thumb rather than my pinkie to copy and paste things. So making Alt into another Control key and making the Windows key the new Alt solved my problem. ↩
-
That said, it's a little much in R Studio when it also puts the output of the scripts in the window where the source is. Fortunately you can turn that off by going to Tools => Global Options => R Markdown => Basic => Show output line for all R Markdown documents. ↩
-
A big hat-tip to @JourneymanGeek who pointed this out to me. ↩
-
There is certainly some way of setting up environment variables to make this work, but this post is way too long already and I can't be bothered to figure it out. Just use the Developer PowerShell. ↩
-
Interestingly, DBI can connect with a database connected to ODBC. This fact made my searches misleading since it's entirely possible (but rarely desirable) to connect to Postgres that way. ↩
-
I later doublechecked I got the same result if I included only place that cwevent coded as a home run. ↩
Top comments (0)