MennahTullah Mabrouk

Posted on Jul 29, 2023

Analyzing RNA-seq data with DESeq2 (Part 2)

#bioinformatic #rna #deseq2 #analyzing

As I continue learning, I encountered several issues while using the previous dataset, RNA-seq of human multiple myeloma patients myeloid-derived suppressor cells (M-MDSC). Unfortunately, I was unable to prepare it for DESeq2 analysis due to unexpected errors, which resulted in wasted time and frustration. Despite my strong desire to utilize the dataset, I couldn’t achieve the desired results.
During my search for solutions, I came across an excellent Differential expression with DEseq2 Tutorial, which provided valuable insights into the DESeq2 analysis. However, since I still didn’t know how to properly prepare the (M-MDSC) dataset, I decided to work with the dataset used in the tutorial. This way, I could apply the knowledge I had gained from the tutorial to a dataset with known steps and expectations.
We will cover data preprocessing, differential expression analysis, data visualization with histograms and heatmaps, and the use of Principal Component Analysis (PCA) for sample relationship visualization. For those interested in the entire workflow, I have documented it on my GitHub feel free to explore.

Starting with

Sets up the necessary data and prepares it in the required format to perform differential expression analysis using the DESeq2 package. We transformed the data into a suitable matrix format for further analysis with DESeq2.
“Basically, here we upload our data and prepare rawcounts”

library(DESeq2)
library(ggplot2)
library(pheatmap

# Read in the raw read counts
rawCounts <- read.delim("<http://genomedata.org/gen-viz-workshop/intro_to_deseq2/tutorial/E-GEOD-50760-raw-counts.tsv>")
head(rawCounts)

# Read in the sample mappings
sampleData <- read.delim("<http://genomedata.org/gen-viz-workshop/intro_to_deseq2/tutorial/E-GEOD-50760-experiment-design.tsv>")
head(sampleData)

#For rawCounts
# Convert count data to a matrix of appropriate form that DEseq2 can read
geneID <- rawCounts$Gene.ID
sampleIndex <- grepl("SRR\\\\d+", colnames(rawCounts))
rawCounts <- as.matrix(rawCounts[, sampleIndex])
rownames(rawCounts) <- geneID
head(rawCounts)

Console Snip

> head(rawCounts)
                SRR975551 SRR975552 SRR975553 SRR975554 SRR975555 SRR975556
ENSG00000000003      6617      1352      1492      3390      1464      1251
ENSG00000000005        69         1        20        23        12         4
ENSG00000000419      2798       714       510      1140      1667       322
ENSG00000000457       486       629       398       239       383       290
ENSG00000000460       466       342        73       227       193        35
ENSG00000000938        75        95       158       107       135        75
                SRR975557 SRR975558 SRR975559 SRR975560 SRR975561 SRR975562
ENSG00000000003       207      1333      2126      1799      1362      3435
ENSG00000000005        20         2         3         6        10        15
ENSG00000000419       273       621      1031       677       480      1194
ENSG00000000457       164       452       172       229       264       297
ENSG00000000460        38       184       174        68        46       173
ENSG00000000938       236       254       121       107        94        90

Continue Data Preprocessing and Preparation

Now we select specific columns, renames them, and converts the “individualID” column to a factor, ensuring that the data is in a compatible format for DESeq2’s requirements.
“Basically, here we prepare sampleData”

# Convert sample variable mappings to an appropriate form that DESeq2 can read
head(sampleData)
rownames(sampleData) <- sampleData$Run
keep <- c("Sample.Characteristic.biopsy.site.", "Sample.Characteristic.individual.")
sampleData <- sampleData[, keep]
colnames(sampleData) <- c("tissueType", "individualID")
sampleData$individualID <- factor(sampleData$individualID)
head(sampleData)

Watch Difference in Console Snip

> # Convert sample variable mappings to an appropriate form that DESeq2 can read
> head(sampleData)
        Run Sample.Characteristic.biopsy.site.
1 SRR975551                      primary tumor
2 SRR975552                      primary tumor

  Sample.Characteristic.Ontology.Term.biopsy.site. Sample.Characteristic.disease.
1             <http://www.ebi.ac.uk/efo/EFO_0000616>              colorectal cancer
2             <http://www.ebi.ac.uk/efo/EFO_0000616>              colorectal cancer

  Sample.Characteristic.Ontology.Term.disease.
1         <http://www.ebi.ac.uk/efo/EFO_0005842>
2         <http://www.ebi.ac.uk/efo/EFO_0005842>

  Sample.Characteristic.disease.staging.
1             Stage IV Colorectal Cancer
2             Stage IV Colorectal Cancer

  Sample.Characteristic.Ontology.Term.disease.staging.
1                                                   NA
2                                                   NA

  Sample.Characteristic.individual. Sample.Characteristic.Ontology.Term.individual.
1                             AMC_2                                              NA
2                             AMC_3                                              NA

  Sample.Characteristic.organism. Sample.Characteristic.Ontology.Term.organism.
1                    Homo sapiens <http://purl.obolibrary.org/obo/NCBITaxon_9606>
2                    Homo sapiens <http://purl.obolibrary.org/obo/NCBITaxon_9606>

  Sample.Characteristic.organism.part.
1                                colon
2                                colon

  Sample.Characteristic.Ontology.Term.organism.part. Factor.Value.biopsy.site.
1      <http://purl.obolibrary.org/obo/UBERON_0001155>             primary tumor
2      <http://purl.obolibrary.org/obo/UBERON_0001155>             primary tumor

  Factor.Value.Ontology.Term.biopsy.site. Analysed
1    <http://www.ebi.ac.uk/efo/EFO_0000616>      Yes
2    <http://www.ebi.ac.uk/efo/EFO_0000616>      Yes


> rownames(sampleData) <- sampleData$Run
> keep <- c("Sample.Characteristic.biopsy.site.", "Sample.Characteristic.individual.")
> sampleData <- sampleData[, keep]
> colnames(sampleData) <- c("tissueType", "individualID")
> sampleData$individualID <- factor(sampleData$individualID)


> head(sampleData)
             tissueType individualID
SRR975551 primary tumor        AMC_2
SRR975552 primary tumor        AMC_3
SRR975553 primary tumor        AMC_5

Here is one of the things I had learned

Prepares the data for unsupervised clustering analysis by reordering columns, renaming tissue types, converting variables to factors, and creating a DESeq2DataSet object. Then performing variance stabilizing transformation and calculates a correlation matrix for hierarchical clustering with correlation heatmaps.

# Put the columns of the count data in the same order as row names of the sample mapping, then make sure it worked
rawCounts <- rawCounts[, unique(rownames(sampleData))]
all(colnames(rawCounts) == rownames(sampleData))

# Rename the tissue types
rename_tissues <- function(x) {
  x <- switch(as.character(x), "normal" = "normal-looking surrounding colonic epithelium", "primary tumor" = "primary colorectal cancer", "colorectal cancer metastatic in the liver" = "metastatic colorectal cancer to the liver")
  return(x)
}
sampleData$tissueType <- unlist(lapply(sampleData$tissueType, rename_tissues))

# Order the tissue types so that it is sensible and make sure the control sample is first: normal sample -> primary tumor -> metastatic tumor
sampleData$tissueType <- factor(sampleData$tissueType, levels = c("normal-looking surrounding colonic epithelium", "primary colorectal cancer", "metastatic colorectal cancer to the liver"))

# Modify factor levels to comply with safe naming conventions
levels(sampleData$individualID) <- gsub("[^A-Za-z0-9_.]", "_", levels(sampleData$individualID))
levels(sampleData$tissueType) <- gsub("[^A-Za-z0-9_.]", "_", levels(sampleData$tissueType))

# Create the DESeq2DataSet object
deseq2Data <- DESeqDataSetFromMatrix(countData = rawCounts, colData = sampleData, design = ~ individualID + tissueType)

# Estimate size factors
dds_wt <- estimateSizeFactors(deseq2Data)

# Unsupervised clustering analysis: log transformation using vst
vsd_wt <- vst(dds_wt, blind = TRUE)

# Hierarchical clustering with correlation heatmaps
vsd_mat_wt <- assay(vsd_wt)
vsd_cor_wt <- cor(vsd_mat_wt)

# Save vsd_cor_wt as a TSV file (Optional only to see an overview)
write.table(vsd_cor_wt, file = "vsd_cor_wt.tsv", sep = "\\t", row.names = TRUE, col.names = TRUE)
Let’s Break Our Code a Little Bit:
By reordering the columns of the count data rawCounts to match the row names (gene IDs) of the sample metadata sampleData, it ensures that the samples are in the correct order for downstream analysis. This step is crucial because DESeq2 relies on correctly matched count data and sample metadata to perform valid statistical analysis.

rawCounts <- rawCounts[, unique(rownames(sampleData))]
all(colnames(rawCounts) == rownames(sampleData))
Renaming tissue types to more informative names makes the data easier to interpret and understand.
Providing clearer labels for the different sample groups will particularly be useful for visualizations and result interpretation.
rename_tissues <- function(x) {
  x <- switch(as.character(x), "normal" = "normal-looking surrounding colonic epithelium", "primary tumor" = "primary colorectal cancer", "colorectal cancer metastatic in the liver" = "metastatic colorectal cancer to the liver")
  return(x)
}

sampleData$tissueType <- unlist(lapply(sampleData$tissueType, rename_tissues))

Ordering the tissue types and modifying factor levels ensures that the analysis treats the samples with the intended biological order. This step is crucial for performing meaningful differential expression analysis between specific tissue types
Note: The gsub() function is used to modify the factor levels by replacing any non-alphanumeric characters with underscores. This ensures compliance with safe naming conventions for factor levels and avoids potential issues in subsequent analyses.

sampleData$tissueType <- factor(sampleData$tissueType, levels = c("normal-looking surrounding colonic epithelium", "primary colorectal cancer", "metastatic colorectal cancer to the liver"))

levels(sampleData$individualID) <- gsub("[^A-Za-z0-9_.]", "_", levels(sampleData$individualID))
levels(sampleData$tissueType) <- gsub("[^A-Za-z0-9_.]", "_", levels(sampleData$tissueType))

Creating a DESeq2DataSet object (deseq2Data) is necessary to organize the count data and sample metadata together. DESeq2 requires data to be in this specific object format for differential expression analysis.

deseq2Data <- DESeqDataSetFromMatrix(countData = rawCounts, colData = sampleData, design = ~ individualID + tissueType)

Estimating size factors is a critical step in the normalization process for RNA-seq data. It helps to ensure that the count data is appropriately scaled, making it suitable for meaningful comparisons between samples. Size factors are used to normalize the count data, adjusting for differences in sequencing depth and read coverage, so that genes can be compared more accurately across samples.

dds_wt <- estimateSizeFactors(deseq2Data)

The Variance Stabilizing Transformation (VST) is a powerful statistical method used to stabilize the variance across the range of count values in RNA-seq data. It is based on the negative binomial distribution, which is often used to model count data.

vsd_wt <- vst(dds_wt, blind = TRUE)

The Correlation Matrix for Hierarchical Clustering helps to visualize the relationships and similarities between the samples. Hierarchical clustering with correlation heatmaps allows researchers to identify potential clusters or patterns in the gene expression data, which can be valuable for understanding the underlying biological relationships and differences between the samples.

vsd_mat_wt <- assay(vsd_wt)
vsd_cor_wt <- cor(vsd_mat_wt)

These steps collectively prepare the data for robust differential expression analysis. They ensure that the data is correctly organized, normalized, and transformed to enable reliable detection of differentially expressed genes and meaningful insights into the biological differences between the sample groups under study.

Console Snip

> # Put the columns of the count data in the same order as row names of the sample mapping, then make sure it worked
> rawCounts <- rawCounts[, unique(rownames(sampleData))]
> all(colnames(rawCounts) == rownames(sampleData))
[1] TRUE

Some Visualization

The x-axis tick values will be displayed in a non-scientific notation format, making the plot more readable and user-friendly. The y-axis represents the number of genes falling into each bin of the histogram.

# Add the ggplot code snippet with modified x-axis formatting
ggplot(data.frame(wt_normal1 = rawCounts[, 1])) +
  geom_histogram(aes(x = wt_normal1), stat = "bin", bins = 200) +
  xlab("Raw expression counts") +
  ylab("Number of genes") +
  scale_x_continuous(labels = function(x) format(x, scientific = FALSE))

Plot the heatmap using pheatmap

# Prepare data for pheatmap
data_for_heatmap <- as.matrix(vsd_cor_wt)

# Convert tissueType to a character vector
annotation_row <- as.character(sampleData$tissueType)

# Add spaces between words in the x-axis labels
annotation_row_with_spaces <- paste(" ", annotation_row, " ")

# Plot the heatmap using pheatmap with manual row annotations
pheatmap(data_for_heatmap,
         cluster_rows = TRUE,
         cluster_cols = TRUE,
         color = colorRampPalette(c("blue", "white", "red"))(50),
         show_rownames = FALSE,
         show_colnames = TRUE,
         row_names_side = "left",
         annotation_colors = "black",
         annotation_names_row = FALSE,
         labels_row = annotation_row_with_spaces,
         fontsize_row = 8,     # Adjust the font size of row labels
         fontsize_col = 12,    # Adjust the font size of column labels
         angle_col = 45)       # Set the angle of column labels to 45 degrees

Finally, Perform PCA and provide PCA scores, which can be used for data exploration, visualization, and understanding the underlying structure and relationships within the dataset.

# Calculate PCA scores
sample_scores <- as.data.frame(assay(vsd_wt))
sample_scores$Sample <- rownames(sample_scores)
column_names <- colnames(vsd_wt)

colnames(sample_scores)[2:5] <- c("normal", "fibrosis", "tumor", "metastasis")
sample_scores$PC1 <- sample_scores$normal * -2 + sample_scores$fibrosis * -10 + sample_scores$tumor * 8 + sample_scores$metastasis * 1
sample_scores$PC2 <- sample_scores$normal * 0.5 + sample_scores$fibrosis * 1 + sample_scores$tumor * -5 + sample_scores$metastasis * 6

# Print the PCA scores
print(sample_scores)

Console Snip

> # Print the PCA scores
> print(sample_scores)
                SRR975551    normal  fibrosis     tumor metastasis SRR975556
ENSG00000000003 11.923929  9.968500 10.588833 11.791039  10.058138 10.737563
ENSG00000000005  5.625681  2.799376  4.866074  5.031436   4.065530  3.717037
ENSG00000000419 10.686895  9.059366  9.056915 10.226302  10.243933  8.802204
ENSG00000000457  8.199791  8.879902  8.706264  8.013961   8.158745  8.654508
ENSG00000000460  8.141111  8.024470  6.392902  7.942364   7.214920  5.820591
ENSG00000000938  5.725709  6.301239  7.421013  6.916156   6.736584  6.794418
ENSG00000000971  8.906142  8.952251 10.967832  8.930424   8.158745 10.365427
ENSG00000001036 10.656848 10.452489 10.439085 11.081124  10.571350 10.396267
ENSG00000001084 10.073191 11.118101  9.535102 10.560985   9.788243  9.247774
ENSG00000001167  9.748116 10.326186  9.100712 10.235065   9.469240  9.013992
ENSG00000001460  6.899011  7.258717  8.048669  6.852399   7.791251  7.558024
ENSG00000001461  9.527680  9.811171  9.584137  9.060077  10.353852  9.868912
ENSG00000001497 10.308639  9.804080  9.171732 10.270811   9.505511  9.161493
ENSG00000001561  8.509119  9.179311  9.338776  8.707479   9.155344  8.998760
ENSG00000001617  7.001645  9.451539  8.135472  8.244756   9.540893  8.610090
ENSG00000001626 10.998738 10.733206 11.781040 10.338590  12.181192 12.134195
ENSG00000001629 10.228487 10.303738  9.847778  9.725462  10.016518  9.868912

                     Sample        PC1       PC2
ENSG00000000003 ENSG00000000003 -21.438886 16.966720
ENSG00000000005 ENSG00000000005  -9.942478  5.501761
ENSG00000000419 ENSG00000000419 -16.633528 23.918686
ENSG00000000457 ENSG00000000457 -32.552004 22.028879
ENSG00000000460 ENSG00000000460  -9.224133 13.982838
ENSG00000000938 ENSG00000000938 -24.746778 16.410357
ENSG00000000971 ENSG00000000971 -47.980679 19.744307
ENSG00000001036 ENSG00000001036 -26.075487 23.687810
ENSG00000001084 ENSG00000001084 -23.311101 21.018685
ENSG00000001167 ENSG00000001167 -20.309732 19.903922
ENSG00000001460 ENSG00000001460 -32.393675 24.163534
ENSG00000001461 ENSG00000001461 -32.629246 31.312451
ENSG00000001497 ENSG00000001497 -19.653479 19.752784
ENSG00000001561 ENSG00000001561 -32.931205 25.323103
ENSG00000001617 ENSG00000001617 -24.758855 28.882818
ENSG00000001626 ENSG00000001626 -44.386905 38.541846
ENSG00000001629 ENSG00000001629 -31.265045 26.471447
 [ reached 'max' / getOption("max.print") -- omitted 65200 rows ]

In conclusion, this article demonstrates the use of DESeq2 and R packages like ggplot2 and pheatmap to analyze gene expression data. It covers data preprocessing, unsupervised clustering, heatmap visualization, and principal component analysis (PCA). By providing detailed code snippets and explanations, it enables readers to perform similar analyses and gain valuable insights into biological processes.

DEV Community

Analyzing RNA-seq data with DESeq2 (Part 2)

Starting with

Console Snip

Continue Data Preprocessing and Preparation

Watch Difference in Console Snip

Here is one of the things I had learned

Console Snip

Some Visualization

Plot the heatmap using pheatmap

Console Snip

Top comments (0)

Read next

Terminalizer: Record, Edit, and Share Terminal Sessions

Enjoying free CoPilot? Why not do it safely..

Building SyncBridge: When "Copy Here, Paste There" Gets an Upgrade 🚀

Explore use of agents to comment you code