Contingency tables in R
Introduction
Contingency tables are very useful when we need to condense a large amount of data into a smaller format, to make it easier to gain an overall view of our original data.
To create contingency tables in R we will be making use of the table() function. Since table() is so versatile, we can build contingency tables from something as simple as character strings, to something as complex as data frame objects.
Something that we have to keep in mind is the fact that as we add more data, or as the complexity of our data increases, our contingency table will grow more complex too.
As well as the creation of contingency tables, we will cover the calculation of marginal distributions, creating a frequency or percentage contingency table, and selecting fragments of our table. We will also learn three of the most popular techniques that are used to test for independence between variables, and how to calculate the measures of association between these same variables.
Here is the index for this tutorial:
Getting started - Loading data
Creating a table object
Contingency tables
- Selecting parts of a contingency table
- Frequencies in contingency tables
- Calculating marginal distributions
Testing for Independence
- The Chi-Square test
- Fisher's Exact test
- The G-Test
Measures of Association
- The Phi Coefficient
- Cramer's V and Tschuprow's T
Conclusion
Getting started - Loading data
Before we begin with contingency tables, I would like to briefly discuss creating table objects in R. This way, we will be able to truly appreciate the advantages that contingency tables have (when our goal is overall legibility) over traditional tables.
In order for us to be on the same page, I would propose using the following built-in R dataset: mtcars, although you can use a different one if you wish to.
Tip: To load a built-in R dataset, you can use data(), as follows: data(mtcars). To check if the data has been loaded correctly, simply type the name of the dataset, and you should see a preview on your console.
Once we have loaded our data, we can start building tables.
Creating a table object
Let us begin by taking a look at the number of gears that each car model has. To do this, we will use the $ syntax:
It's a bit difficult to read, isn't it? If we use table(), the information will be much more readable:
It looks a lot better this way, doesn't it? We can now tell, at a glance, that there are 15 cars with 3 gears, 12 cars with 4 gears and 5 cars with 5 gears.
Now let's take a look at the number of cylinders for each car. As previously, we will first show all the data at once, and then use table() to summarize it:
Simple enough, right? Let's take it one step further. What if we want to see the number of gears that each car has, in relation to the number of cylinders? Here is where contingency tables come in.
Contingency tables
Following the previous examples, we will create a contingency table showing the number of gears of each car, related to their number of cylinders. How do we do this? By once again, using table(), this time with two arguments: The first argument will correspond to the rows, and the second to the columns:
As you can see, each row corresponds to each of the gear values, which are 3,4 and 5. Each column corresponds to each of the cyl values, which are 4, 6 and 8. We can now easily know that there are 12 cars with 3 gears and 8 cylinders, or that there are no cars that have 4 gears and 8 cylinders (this can lead us to ask ourselves if perhaps this combination is mechanically incompatible?)
Selecting parts of a contingency table
Have you ever tried to select a table property by using the $ notation? Then you'll be familiar with the $ operator is invalid for atomic vectors error. Why does this happen? Well, objects of class table are essentially vectors (arrays) which means that we can't use the $ syntax. This syntax is only valid for recursive objects, and vectors are atomic objects, hence the error.
How do we interact with tables then? Here are several useful techniques to select and interact with tables:
str() - Examines and outputs the structure of a table object. The only drawback to this useful little function is that the output it provides us with is hard to interpret when you see it for the first time, which is why we are going to analyze what is going on with str() output.
Let's break down that nasty output, to be able to understand what's going on:
In the first line, str() is telling us that we have an object of type table: 'table', which is a matrix of integers, with three rows and three columns: int [1:3, 1:3], and finally shows us all the values contained in our matrix, grouped by columns 1 8 2 2 4...
The strange-looking attr(, "dimnames")* code fragment in the second line is there because R stores the names of the columns and rows in an attribute called dimnames (dimension names). It also tells us that this dimnames attribute is a list of two components: = List of 2.
The ..$ : in the beginning of the fourth line indicates the beginning of a component, which is a character vector containing three values: chr [1:3], and ends with a list of the three values contained in the vector: "3" "4" "5", which you may have noticed, are the names of the three rows in our table.
The fifth row is exactly the same as the fourth, save for the values contained in the vector, which are "4" "6" "8" and correspond with the names of the three columns of our table.
length() - Returns the length of the table object.
table[x,] - Returns the xᵗʰ row:
table[1:x,] - Returns the first x rows of the table:
table[1:x,1:y] - Returns the first x rows of the first y columns:
table["3",] - Returns the row named 3:
table[,"8"] - Returns the column named 8:
Frequencies in contingency tables
Sometimes, we would prefer to display frequencies. By using prop.table() together with table(), we can create a contingency table of frequencies:
Would you like to display percentages instead of frequencies? It's as simple as multiplying by 100, like this:
Calculating marginal distributions
Here are several useful functions for calculating marginal distributions in contingency tables:
addmargins() - Will calculate the total values of both columns and rows:
rowSums() - Will calculate the total values of each row:
colSums() - Will calculate the total values of each column:
Three-way contingency tables
So far we've worked with a two-way contingency table, but, what if we need to work with three variables? Is it possible? The answer is yes, it is. Let's create a three-way table based on the cyl and carb and gear variables:
As you can see, what is really going on is that we have three separate two-way contingency tables, one for each of the gear values. This third gear variable is known as a layer variable, or as a control variable.
Hint: Instead of using table(), we could use ftable(), for a more concise view of our data.
As you can see, at the leftmost of our three-way contingency table, we have our cyl rows, followed by the carb rows, and on top, we have our gear columns. Is it possible to create contingency tables with more variables? The answer is yes, but the complexity will increase as we add more and more variables. Here is an example of a four-way contingency table:
As you can see, it grows more complex as we increase the number of variables, and loses legibility in the process.
Testing for independence
We may want to test the independence of the row and column variables of our contingency table. To achieve this, there are several tests we can perform on our table:
The Chi-Square test
One of the simplest ways to test for independence is to use the Chi-Square test function: chisq.test()
The important value of this test is the p-value, which, as you can see, equals 0.001214. Since p-value is below the significance level (which is 5%, or 0.05), it looks like the variables are not independent. However…
You may have noticed the warning. This is because the value of at least one of the expected values in the contingency table is below 5.
Important: please bear in mind that expected values are not the values that appear in each cell, they are the result of calculating (row total*column total)/overall total). You can have low cell values in your table, yet have perfectly fine expected values, if you have a large enough sample.
Therefore, we shouldn't trust the result of this test, and should look for a different way to test for independence: in this case, we should use the Fisher test instead of the Chi-Square test. Before diving into the Fisher test, I'd like to introduce:
The Monte Carlo simulation
What happens if the expected values of our contingency table are below 5, but we can't use a different test and have to stick with the Chi-Square test? In these cases, the Monte Carlo simulation is a great alternative. It allows us to perform the Chi-Square test with simulated values, that attempt to make up for the low expected values of our table.To achieve this, all we need to do is set the simulate.p.value to true when calling chisq.test():
As you can see, the warning no longer appears, and the test has been performed with simulated p-value based on 2000 replicates, which is the default.
The Fisher Exact test
The general rule of thumb is, when at least one expected value is lower than five, use the Fisher test rather than the Chi-Square test. There are some who think that it may perhaps be time to retire this rule of 'no expected values lower than five' and use some other method to calculate whether or not a sample size is too small for the Chi-Square test.
Tip: You can find a great post talking about small numbers in Chi-square tests here.
Regardless of the debate, our mtcars sample is definitely too small for the Chi-Square test. Therefore, we will use the Fisher test. To perform this test in R, we can use the fisher.test() function:
Once again, the p-value is below the significance level (0.05). Therefore, we can conclude that the variables are not independent.
The G-test
There is yet another method we can use to test for independence, and that is the G-test. Once again, just like the Chi-Square test, it is only reliable when used with large samples. To be able to use the G-Test function in R, we must first import the DescTools package, and then use the GTest() function:
Tip: Remember, for small samples, use the Fisher test. For large samples, use the Chi-Square test or the G-Test
Measures of association
We now know how to test for independence between contingency table variables. However, how can we know how strong this independence is? By calculating association measures.
Phi Coefficient
The simplest way to calculate the strength of association in a 2x2 contingency table is by using the Phi Coefficient. This measure is related to the Chi-Square test for a two-way contingency table in the following way:
Where:
- ϕ is the Phi Coefficient.
- x² is the chi-squared statistic from the Chi-Square test.
- n is the total number of observations.
The range of the Phi Coefficient is [-1, 1], and its interpretation is the following:
- 0 means there is no association between variables. They are independent.
- 1 means a perfect positive association. This is when most of the data falls along the diagonal cells.
- -1 means a perfect negative relationship. This is when most of the data falls away from the diagonal cells.
Since we need a 2x2 table to perform this test, we will calculate the Phi Coefficient for the following contingency table:
To calculate the Phi Coefficient in R, we can use the phi() function in the psych package:
Since the result is negative, we know that we have a negative relationship. Since the result is close to 0, we know that the strength of the relationship is very weak, which translates to there being next to no relationship.
Tip: Remember that the Phi Coefficient can only be used for 2x2 contingency tables. For larger tables, we should use Cramer's V or Tschuprow's T.
Cramer's V and Tschuprow's T
Both Cramer's V and Tschuprow's T are closely related to the Chi-Square test. Because the latter is dependent on the size of the sample, it's a poor reflection of the strength of association. Here is where Cramer's V and Tschuprow's T come in:
Where:
- φ is the Phi Coefficient.
- x² is the chi-squared statistic from the Chi-Square test.
- n is the total number of observations.
- k is the number of columns.
- r is the number of rows.
And here is Tschuprow's T:
Where:
- x² is the chi-squared statistic from the Chi-Square test.
- n is the total number of observations.
- k is the number of rows.
- c is the number of columns.
Both Cramer's V and Tschuprow's T, produce a value in the [0,1] range. The interpretation of this value is the following:
0 means there is no association between variables. They are independent.
1>x>0 (any number larger than 0, smaller than 1) means that there exists a relationship of association. The only problem with both Cramer's V and Tschuprow's T is that we have no way of knowing whether the obtained value represents a weak, intermediate or strong association.
1 means perfect dependency in the case of Tschuprow's T, but not in the case of Cramer's V, where it only means that there exists a high dependency.
To calculate Cramer's V in R, we need to use the CramerV() function from the DescTools package (since we already imported DescTools previously, for the G-Test, we don't need to do so again):
Since the value is larger than 0, we know that the variables are associated. However, we can't know whether or not that association is weak or strong…
To calculate Tschuprow's T in R, we can use the TschuprowT() function, once again from the DescTools package:
Bias correction
From A bias-correction for Cramér's V and Tschuprow's T: It's important to know that both Cramer's V and Tschuprow's T are considerably biased, and tend to overestimate the strength of association.
How do we avoid this? By applying a bias correction. Here is an example of Cramer's V with a bias correction applied in R:
And an example of Tschuprow's T with an applied bias correction in R:
Conclusion
And that's all for today, folks. We covered everything from the concept of a contingency table to independence tests and calculations of association measures. There are some widely used tests, like the Cochran-Mantel-Haenszel test for conditional independence or the Breslow-Day test for homogeneous association that we didn't mention, perhaps we could talk about them in another tutorial.
As always, I hope you enjoyed this tutorial and/or found it useful. Don't forget to let me know your thoughts in the comments below!
P.S. I am a full stack web developer with an interest in R. I initially wrote most of this material as part of my own learning process, and decided to publish it in case it helped others who are also learning R. Comments, tips and advice are welcome!
Discussion