DEV Community

Cover image for HMIS, R, and SQL -- Basics
Thomas Brittain
Thomas Brittain

Posted on • Originally published at ladvien.com

HMIS, R, and SQL -- Basics

Hacker Introduction

I'm a hacker.  If you find errors, please leave comments below.  If you have an opinion I'll hear it, but I'm often not likely to agree without some argument.

Joins (Merging Data)

Probably the best part of R and SQL is their ability to quickly combine data around a key.  For example, in HMIS CSVs the Client.csv contains a lot of demographic information and the Enrollment.csv contains a lot of assessment information.  This makes it difficult when needing a count of the total participants who are veterans and disabled, since the veteran information is in Client.csv and disability information is in the Enrollment.csv.  However, both R and SQL contain the join functions.  

Joins are a hughely expansive topic; I'm not going to try to cover all their quirks, but here's some videos I found helpful:

The two useful joins for HMIS data are LEFT JOIN and INNER JOIN.  The left join keeps all the data in the left table and data matching from the right table and the inner join keeps only data which matches.

Here's an example in the context of the Client.csv and Enrollment.csv:

Client.csv

PersonalID FirstName VeteranStatus
12345 Jane Yes
54321 Joe No

Enrollment.csv

PersonalID FirstName DisablingCondition
12345 Jane Yes
54321 Joe No
45321 Sven Yes

Here are the two join statements and their results for the data above

SELECT * 
   FROM client a 
   LEFT JOIN enrollment b ON a.Personal=b.PersonalID
Enter fullscreen mode Exit fullscreen mode

This join should result in the following:

PersonalID FirstName VeteranStatus DisablingCondition
12345 Jane Yes Yes
54321 Joe No No
45321 Sven NULL Yes

Notice Sven was kept, even though he had no entry the Client.csv.  After the join, since he had no 

And the inner join would look like this:

SELECT * 
       FROM client a 
       INNER JOIN enrollment b ON a.Personal=b.PersonalID
Enter fullscreen mode Exit fullscreen mode

This join should result in the following:

PersonalID FirstName VeteranStatus DisablingCondition
12345 Jane Yes Yes
54321 Joe No No

Counts

PersonalID <- sqldf("SELECT DISTINCT PersonalID FROM client")
Enter fullscreen mode Exit fullscreen mode

Method above creates a vector of all the PersonalIDs in the client data-frame, which came from the Client.csv.  The DISTINCT command takes only one ID if there are more than two which are identical.  In short, it create a de-duplicaed list of participants.

For example,

PersonalID  OtherData
12345 xxxxxxxxx
56839 xxxxxxxxx
12345 xxxxxxxxx
32453 xxxxxxxxx

Should result in the following,

PersonalID
12345
56839
32453

This is useful in creating a key vector, given other CSVs have a one-to-many relationship for the PersonalID.  For example,

The Enrollment.csv looks something like this

PersonalID ProjectEntryID  EntryDate
12345 34523 2016-12-01
56839 24523 2015-09-23
12345 23443 2014-01-10
32453 32454 2015-12-30

This reflects a client (i.e., 12345) entering a project twice, once on 2014-01-10 and the other 2016-12-01.

Count of Total Participants:

SELECT COUNT(PersonalID) as 'Total Participants' FROM client
Enter fullscreen mode Exit fullscreen mode

This query should give a on row output, counting the number of clients in the data-frame.

Total Participants
1 1609

However, if there are duplicate PersonalIDs it'll count each entry as an ID.  To get a count of unique clients in a data-frame add the DISTINCT command.

SELECT COUNT(DISTINCT(PersonalID)) as 'Unique Total Participants' FROM client
Enter fullscreen mode Exit fullscreen mode

Conditional Data

Often in HMIS data it is necessary to find a collection of participants which meet a specific requirement.  For example, "How many people in this data-set are disabled?"  This is where the WHERE statement helps a lot.

SELECT PersonlID FROM clientAndEnrollment WHERE disability = 'Yes'
Enter fullscreen mode Exit fullscreen mode

This statement will return a vector of all the PersonalID's of participants who stated they were disabled.  The total participant query could be used, but there is an alternative method.

SELECT SUM(CASE WHEN 
               disability = 'Yes' THEN 1 ELSE 0 
           END) as DisabledCount
Enter fullscreen mode Exit fullscreen mode

The above statement uses the CASE WHEN END statement, which I understand as SQL's version of the IF statement.  Here's C equivalent:

for(int i = 0; i < total_participants; i++)
    if(disability == true){
       disabilityCounter++;
    }
}
Enter fullscreen mode Exit fullscreen mode

BOOL!

Boolean operaters can be used to get more complex conditional data:

SELECT PersonalID FROM clientAndEnrollment 
       WHERE disability = 'Yes' 
       AND gender = 'Female'
Enter fullscreen mode Exit fullscreen mode

This statement will provide a vector of all the PersonalID's for clients who are disabled and female.

Ok, good stopping point for now.

Top comments (0)