
MennahTullah Mabrouk


Intro: Analyzing RNA-seq data with DESeq2

“Other Bioconductor packages with similar aims are edgeR, limma, DSS, EBSeq, and baySeq.”


DESeq2 helps identify differentially expressed genes (DEGs) between experimental conditions. The package models read counts with a negative binomial distribution to account for the variability inherent in RNA-seq count data. DESeq2 stores its data in a specific object class called DESeqDataSet. This class extends another class called RangedSummarizedExperiment, which allows the count data to be associated with genomic ranges.
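As a rough sketch of what building such an object can look like, the snippet below wraps a small simulated count matrix in a DESeqDataSet. The gene names, sample names, counts, and the `condition` column are all made up for illustration, and the DESeq2 part only runs if the package is installed.

```r
# Toy data: 10 genes x 6 samples of simulated counts (illustrative only).
set.seed(1)
toy_counts <- matrix(
  rnbinom(60, mu = 100, size = 2),
  nrow = 10,
  dimnames = list(paste0("gene", 1:10), paste0("sample", 1:6))
)

# Sample table: one row per column of the count matrix, in the same order.
toy_coldata <- data.frame(
  condition = factor(rep(c("control", "treated"), each = 3)),
  row.names = colnames(toy_counts)
)

# Only attempt the DESeq2 step when the package is available.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = toy_counts,
    colData   = toy_coldata,
    design    = ~ condition
  )
  print(class(dds))
  # Check the class relationship described above.
  print(methods::is(dds, "RangedSummarizedExperiment"))
}
```

The row names of the sample table are deliberately set to the column names of the count matrix, since the two need to describe the same samples in the same order.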



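DESeq2 expects un-normalized count data as input. The values in the count matrix represent the number of reads or fragments that can be assigned to a specific gene in a particular sample. Because the DESeq2 model corrects for library size internally, values that have already been normalized or transformed should not be used as input.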
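A quick sanity check along these lines can catch a matrix that has already been scaled. This is only a sketch: the toy matrix and the helper `looks_like_raw_counts()` are hypothetical and not part of DESeq2.

```r
# Hypothetical toy matrix: rows are genes, columns are samples.
toy_counts <- matrix(
  c(0L, 12L, 530L, 7L,
    3L, 15L, 498L, 0L,
    1L,  9L, 611L, 4L),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Helper (not a DESeq2 function): TRUE when every value is a
# non-negative whole number with no missing entries.
looks_like_raw_counts <- function(m) {
  is.numeric(m) && !anyNA(m) && all(m >= 0) && all(m == round(m))
}

print(looks_like_raw_counts(toy_counts))        # raw integer counts
print(looks_like_raw_counts(toy_counts / 2.5))  # scaled values with fractions
```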

RNA-Seq questions

  • What genes are differentially expressed between sample groups?
  • Are there any specific genes that are significantly upregulated or downregulated between the sample groups?
  • Can you identify the top differentially expressed genes based on fold change or statistical significance?
  • Are there any specific gene expression patterns that emerge over time or across different conditions?
  • Can you identify clusters or groups of genes that exhibit similar expression patterns over time or across conditions?
  • Which biological processes or molecular pathways are significantly enriched among the differentially expressed genes?
  • Can you provide a comprehensive summary or visualization of the gene expression data, highlighting the most relevant genes and pathways related to the condition of interest?

RNA Workflow


Biological Samples: The experiment begins in the lab with the collection and preparation of biological samples, such as tissues or cells, from the organism of interest.
Library Preparation: RNA molecules are extracted from the samples and converted into cDNA libraries suitable for sequencing.


Sequence Read: The prepared libraries are subjected to high-throughput sequencing, where the cDNA fragments are sequenced to generate short or long reads. This step results in a vast amount of raw sequencing data.

Quality Control: The raw sequencing data undergoes quality control to assess the read quality. This step involves checking for sequence errors, adapter contamination, and other artifacts. Low-quality or problematic reads are filtered or trimmed to improve the accuracy of downstream analyses.

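Splice-Aware Mapping to the Genome: The high-quality reads are aligned, or mapped, to a reference genome using splice-aware alignment algorithms (for example STAR or HISAT2). Splice awareness is needed because reads derived from mature mRNA can span exon-exon junctions and therefore do not map contiguously to the genome.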


Counting Reads Associated with the Genome: Once the reads are aligned, the next step is to count the number of reads associated with each genomic feature, such as genes or exons. This step quantifies the expression levels of genes in terms of read counts.

Statistical Analysis for Differential Expression: The read counts are then used for statistical analysis to identify differentially expressed genes between different conditions or sample groups.

To Start

Data preprocessing: Before using DESeq2, the raw RNA-seq data must go through some preprocessing steps, typically quality control, read alignment, and read counting. DESeq2 expects raw count data as input, so these steps are what produce the count matrix it works on.

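Normalization: DESeq2 performs its own normalization internally by estimating a “size factor” for each sample with the median-of-ratios method. Each count is compared with that gene's geometric mean across all samples, and the median of these ratios within a sample becomes its size factor. Dividing counts by the size factors adjusts for differences in library size (sequencing depth) and RNA composition, so that counts become comparable between samples.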
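To make the idea concrete, here is a small base-R sketch of a median-of-ratios calculation on made-up counts. The function `median_of_ratios()` is written for illustration and is not DESeq2's own implementation; in practice DESeq2 estimates size factors for you.

```r
# Toy counts: s2 was sequenced twice as deeply as s1, s3 half as deeply.
toy_counts <- matrix(
  c(10, 20, 30, 40,
    20, 40, 60, 80,
     5, 10, 15, 20),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3"))
)

# Illustrative helper: one size factor per sample (column).
median_of_ratios <- function(counts) {
  log_counts <- log(counts)
  log_geo_means <- rowMeans(log_counts)   # per-gene log geometric mean
  usable <- is.finite(log_geo_means)      # skip genes with a zero in any sample
  apply(log_counts, 2, function(sample_log) {
    exp(median((sample_log - log_geo_means)[usable]))
  })
}

size_factors <- median_of_ratios(toy_counts)
print(size_factors)

# Divide each column by its size factor to get comparable counts.
normalized <- sweep(toy_counts, 2, size_factors, "/")
print(normalized)
```

In this toy example the three samples differ only in depth, so after dividing by the size factors every column carries the same values.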

Experimental design: DESeq2 requires information about the experimental design, such as the condition each sample belongs to and the replicates for each condition. This information is supplied as a sample table together with a design formula and is used to fit the statistical models. Careful planning matters here: without biological replicates in each condition, the variability within a group cannot be estimated reliably.
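The sketch below shows, in base R, how a sample table and a design formula translate into the model matrix that a model is fitted on. The sample names and the two-level `condition` factor are assumptions made up for this example.

```r
# Hypothetical sample table: three replicates per condition.
sample_table <- data.frame(
  sample    = paste0("sample", 1:6),
  condition = factor(rep(c("control", "treated"), each = 3),
                     levels = c("control", "treated"))  # first level is the reference
)

# Count replicates per condition before modelling anything.
print(table(sample_table$condition))

# Expand the design formula into a model matrix:
# an intercept (the control level) plus one column for treated vs control.
design_matrix <- model.matrix(~ condition, data = sample_table)
print(design_matrix)
```

Setting the factor levels explicitly decides which condition acts as the reference, which in turn decides the direction in which fold changes are reported.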

Statistical analysis: DESeq2 models each gene's counts with a negative binomial distribution, whose dispersion parameter captures variability beyond what a Poisson model would predict. It estimates a dispersion for each gene, sharing information across genes to stabilize the estimates, then fits a generalized linear model and tests (by default with a Wald test) whether the log2 fold change between conditions differs from zero.
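A short simulation can illustrate why the extra dispersion parameter matters. It assumes the common parameterization in which the variance is mu + alpha * mu^2; the mean of 100 and dispersion of 0.2 are arbitrary values chosen for the example.

```r
set.seed(42)
n     <- 100000   # number of simulated counts
mu    <- 100      # mean count (arbitrary)
alpha <- 0.2      # dispersion (arbitrary)

# Poisson counts: variance is expected to equal the mean.
pois_counts <- rpois(n, lambda = mu)

# Negative binomial counts: rnbinom's `size` is 1 / dispersion,
# so the expected variance is mu + alpha * mu^2.
nb_counts <- rnbinom(n, mu = mu, size = 1 / alpha)

expected_nb_var <- mu + alpha * mu^2

print(c(poisson_var = var(pois_counts),
        nb_var      = var(nb_counts),
        nb_expected = expected_nb_var))
```

Both samples share the same mean, but the negative binomial counts spread far more widely, which is the kind of overdispersion seen between biological replicates.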

Interpretation of results: After running DESeq2, the results need to be interpreted correctly. The log2 fold change gives the direction and size of the expression change (a value of 1 means a doubling), while the adjusted p-value is the p-value corrected for multiple testing, by default with the Benjamini-Hochberg procedure, and is the value to threshold when calling genes significant.
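Both quantities can be explored in base R. The numbers below are invented for illustration, and the 0.1 cutoff is just one commonly used choice, not a rule.

```r
# Made-up log2 fold changes; convert back to plain fold changes.
log2_fold_change <- c(geneA = 1, geneB = -2, geneC = 0.1)
fold_change <- 2^log2_fold_change
print(fold_change)

# Made-up raw p-values for five genes.
p_values <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjustment controls the false discovery rate.
padj <- p.adjust(p_values, method = "BH")
print(padj)

# Call genes significant at an (assumed) adjusted p-value cutoff of 0.1.
cutoff <- 0.1
significant <- padj < cutoff
print(sum(significant))
```

Note that the adjusted values are never smaller than the raw p-values, so thresholding on the adjusted column is the more conservative choice when thousands of genes are tested at once.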

To Be Continued

Dataset: RNA-seq of myeloid-derived suppressor cells (M-MDSC) from human multiple myeloma patients

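# Load the libraries
library(DESeq2)   # differential expression analysis
library(ggplot2)  # plotting (used for the histogram below)
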
# Read in the raw read counts
rawCounts <- read.delim("F:\\E-MTAB-9767-raw-counts.tsv")

# Read in the sample mappings
sampleData <- read.delim("F:\\E-MTAB-9767-experiment-design.tsv")

# Also save a copy for later
sampleData_v2 <- sampleData

# Plot the histogram of raw expression counts
ggplot(rawCounts) +
  geom_histogram(aes(x = ERR4843201), stat = "bin", bins = 200) +
  xlab("Raw expression counts") +
  ylab("Number of genes")


Multiple myeloma is a cancer that affects plasma cells, which are a type of white blood cell responsible for producing antibodies. Common symptoms include bone pain, fatigue, recurrent infections, anemia, kidney problems, and weakened bones leading to fractures.

Also to Read

What is Bioconductor in R?
