Creating a List of Domestic Violence Victims
In this first work challenge you will use R and SQL to get a by-name-list of those who are domestic violence victims from an HMIS data pull (5.1).
Data Needed
The HMIS Data Pulls are simply a relational database which are broken into multiple CSVs. These CSVs will change in formatting, as stipulated by HUD. The current version of these CSVs is 5.1. For this work challenge the focus will be on two CSVs.
- Client.csv
- HealthAndDV.csv
The Client
file will contain one row per client and only one row. Each row will contain most of all the Client's demographic information.
The HealthAndDV
file will contain a row for every HUD Entry Assessment completed on the participant. Each entry will contain general health information as well the client's domestic violence information, which the client reported during the HUD Entry Assessment.
What ties the two files together is the PersonalID
column. This ID is meant to be the grand-mamma of all client IDs. It is 32 characters long and contain both numbers and letter:
B7YIOJIGF9CDP6FV7TANQXLMQRMBTVTB
(note, this ID is called the "Enterprise ID" in our HMIS software)
Both the Client
and HealthAndDV
contain the PersonalID
column. This often referred to as a key
when dealing with relational databases. This unique ID in both files will allow us to merge the two data sets together using something called joins.
Now, in the Client.csv the information is readable to us humans. There will be the FirstName
, LastName
, SSN
columns and many more. But in the HealthAndDV.csv the information is trickier. For this challenge we are going to focus on one column DomesticViolenceVictim
. When you open the data set you may notice that instead of "Yes" or "No" answers in the columns you will see "1" or "0". That's because computers understand the 1 and 0 much quicker than Yes or No.
This is an important side note for managing data. Make the databases easy for computers to understand and report generation will be much faster. You have to think, if using a 1 instead of Yes could save 0.5 seconds on a calculation, then when you have a dataset which contains 1000 records you just saved 500 seconds 8.3 seconds. Now, multiply by 1,700,000 records. Well, you get the picture.
Ok. Back to the problem at hand. Just know "1" is equal to "Yes" and "0" is equal to "No". So, for this challenge, we will want to find all the clients who have a "1" in the DomesticViolenceVictim
column
The Goal
We are going to merge the two data sets and to discover the following:
- A list of clients who are victims of domestic violence.
- A count of how many clients are fleeing domestic violence.
Really, the second piece of information is counting how many people are in the list of those who are victims.
To get this information we will need to do the following:
- Load the Client.csv and HealthAnd.csv
- Filter the HealthAndDV dataset to the most recent according to the column
DateCreated
- Join (merge) the dataframes where their
PersonalID
are the same - Filter the merged dataframe to those who've reported
1
inDomesticViolenceVictim
- Write this data to a file.
- Use a function to count how many participants are in this victim list.
The Resources
Below are the resources which should help you understand each step of the process.
Step 1
- R Programming A-Z -- Video 41 -- Loading and Importing Data in R
- R Programming A-Z -- Video 21 -- Functions in R
Step 2
- The Complete SQL Bootcamp -- All Videos in Section 5
- The Complete SQL Bootcamp -- All Videos in Section 6
- Working with Dates in SQLdf
Step 3
- The Complete SQL Bootcamp -- All Videos in Section 8
Step 4 --
Step 5 --
Step 6 --
- The Complete SQL Bootcamp -- All Videos in Section 5
- The Complete SQL Bootcamp -- All Videos in Section 6
Top comments (0)