M.Sc. Transcriptome Analysis 2025
Transcriptome Analysis: Background and Pipeline (HISAT2 and DESeq2)
M.Sc. Bioinformatic Practical 29.04.2025
Introduction
• The transcriptome is the complete set of RNA molecules produced in a cell
under specific conditions or at a specific developmental stage.
• Transcriptome analysis is also referred to as expression profiling: it mostly examines the
expression levels of mRNA, which shows which genes are being expressed in a cell under a
particular condition or at a particular stage.
• Transcriptomics aims to
  • catalogue all species of transcripts, including mRNA, ncRNA and small RNA.
  • determine the transcriptional structure of genes in terms of their start and end sites,
    splicing patterns, etc.
  • quantify the changing expression levels of each transcript under various conditions.
Basic steps of transcriptome sequencing and data analyses
Sampling
RNA isolation
cDNA library preparation
Sequencing
Bioinformatic analyses
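• N50: the contig (transcript) length such that contigs of that length or longer contain
at least 50% of the total assembled transcriptome length.
• Phred quality score: indicates the confidence of a base call in sequencing data. It is
defined as Q = -10 log10(P), where P is the estimated probability that the base call is
wrong, so Q30 corresponds to an error probability of 1 in 1,000.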
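Both definitions above can be checked with a few lines of Python. This is a minimal sketch, not part of the practical's pipeline: the function names and the example lengths are made up for illustration, and the Phred conversion assumes the standard definition Q = -10 log10(P).

```python
import math


def n50(lengths):
    """Return the N50 of a collection of contig/transcript lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


def phred_to_error_probability(q):
    """Convert a Phred score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability P into a Phred score."""
    return -10 * math.log10(p)


# Hypothetical transcript lengths (bp), for illustration only.
example_lengths = [100, 200, 300, 400, 500]
print("N50:", n50(example_lengths))
print("P(error) at Q30:", phred_to_error_probability(30))
```

With these example lengths the total is 1,500 bp; the two longest transcripts (500 + 400 = 900 bp) already cover half of it, so the N50 is the shorter of the two.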
fastp option                    Purpose
-i, -I                          Input paired-end reads (read 1 and read 2)
-o, -O                          Output filtered paired-end reads (read 1 and read 2)
--detect_adapter_for_pe         Auto-detect adapter sequences (very important for paired-end!)
--correction                    Correct mismatches between overlapping paired-end reads
--qualified_quality_phred 30    Count a base as qualified only if its Phred quality is ≥30 (very strict,
                                good for downstream); reads with too many unqualified bases are discarded
-w 16                           Use 16 CPU threads (speeds up fastp)
--html, --json                  Create quality control reports
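The options in the table can be combined into a single fastp call. The sketch below only assembles the command as an argument list, so it runs without fastp installed. The FASTQ and report file names are hypothetical placeholders, not the files used in the practical.

```python
import shlex


def fastp_command(r1, r2, out_r1, out_r2, report_prefix, threads=16):
    """Build a fastp command line from the options listed in the table."""
    return [
        "fastp",
        "-i", r1, "-I", r2,              # input paired-end reads
        "-o", out_r1, "-O", out_r2,      # filtered paired-end reads
        "--detect_adapter_for_pe",       # auto-detect adapters for paired-end data
        "--correction",                  # correct mismatches in overlapping read pairs
        "--qualified_quality_phred", "30",
        "-w", str(threads),
        "--html", f"{report_prefix}.html",
        "--json", f"{report_prefix}.json",
    ]


# Hypothetical file names, for illustration only.
fastp_cmd = fastp_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "sample_R1.clean.fastq.gz", "sample_R2.clean.fastq.gz",
    "sample_fastp",
)
print(shlex.join(fastp_cmd))
```

To actually run the step, the list can be passed to subprocess.run(fastp_cmd, check=True).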
Transcriptome Assembly
• There are two types of assembly:
  • Reference-based assembly: reads are aligned (mapped) to an existing,
    known reference genome or transcriptome.
  • De novo assembly: reads are assembled into transcripts directly, without a reference.
• hisat2-build creates the set of index files that HISAT2 needs for fast mapping.
• Saff_A2_final_assembly.fa is the reference genome file.
• saff is the base name for the output index files.
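The indexing step described above corresponds to a command of the following shape. This sketch assumes the basic hisat2-build form (reference FASTA first, then index base name) and only builds the argument list, so it runs without HISAT2 installed.

```python
import shlex


def hisat2_build_command(reference_fasta, index_base):
    """Build the hisat2-build command: reference FASTA first, then index base name."""
    return ["hisat2-build", reference_fasta, index_base]


index_cmd = hisat2_build_command("Saff_A2_final_assembly.fa", "saff")
print(shlex.join(index_cmd))
```

The index files written by this command all start with the base name saff, which is what the later alignment step refers to.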
Then
• samtools view converts the SAM output to BAM (a compressed binary format).
• samtools sort sorts the BAM file by genomic coordinates (a coordinate-sorted BAM is required
by StringTie; DESeq2 itself works on the count matrix produced later, not on BAM files).
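A possible sequence for mapping, conversion and sorting is sketched below. The alignment command itself is not shown on the slide, so the hisat2 call here (including --dta, which tailors alignments for transcript assemblers such as StringTie) is an assumption, as are all sample file names. The code only builds the argument lists and does not run the tools.

```python
import shlex


def alignment_commands(index_base, r1, r2, sample, threads=16):
    """Build the HISAT2 -> samtools view -> samtools sort commands for one sample."""
    sam = f"{sample}.sam"
    bam = f"{sample}.bam"
    sorted_bam = f"{sample}.sorted.bam"
    return [
        # Map paired-end reads against the index; write alignments as SAM.
        ["hisat2", "-p", str(threads), "--dta", "-x", index_base,
         "-1", r1, "-2", r2, "-S", sam],
        # Convert SAM to compressed binary BAM.
        ["samtools", "view", "-b", "-o", bam, sam],
        # Sort the BAM by genomic coordinates.
        ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, bam],
    ]


# Hypothetical sample and read file names, for illustration only.
align_cmds = alignment_commands(
    "saff", "sample_R1.clean.fastq.gz", "sample_R2.clean.fastq.gz", "sample"
)
for cmd in align_cmds:
    print(shlex.join(cmd))
```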
4. Transcript Assembly and Quantification using StringTie
StringTie option    Purpose
--ref               (Optional) Provide the reference genome FASTA (StringTie can optionally use it)
-G                  Known gene annotation (GTF) for better transcript prediction; mandatory when
                    -e or -B is used
-o                  Output GTF file with the predicted transcripts
-A                  Output gene abundances as a simple tab-separated table
-B                  Create output suitable for Ballgown differential analysis (also useful if you
                    want alternative DE tools)
-e                  Strict mode: only quantify reference transcripts (faster, no novel transcript
                    assembly); requires -G
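The options in the table could be combined per sample as sketched below. The BAM, annotation and output file names are hypothetical, and --ref is left out because the table marks it as optional. The -p thread option is a real StringTie option added here for convenience. The code only builds the argument list.

```python
import shlex


def stringtie_command(sorted_bam, annotation_gtf, sample, threads=16):
    """Build a StringTie call that quantifies reference transcripts for one sample."""
    return [
        "stringtie", sorted_bam,
        "-G", annotation_gtf,                  # reference annotation (needed for -e and -B)
        "-o", f"{sample}.gtf",                 # transcript GTF for this sample
        "-A", f"{sample}_gene_abund.tab",      # gene abundance table
        "-e",                                  # quantify reference transcripts only
        "-B",                                  # also write Ballgown table files
        "-p", str(threads),
    ]


# Hypothetical input names, for illustration only.
stringtie_cmd = stringtie_command("sample.sorted.bam", "annotation.gtf", "sample")
print(shlex.join(stringtie_cmd))
```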
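• FPKM/RPKM: each gene's count is normalized individually, for sequencing depth (per million
mapped fragments/reads) and for gene length (per kilobase). The total of all FPKM values can
still vary between samples, so FPKM/RPKM is best used when comparing genes within the same sample.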
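The FPKM calculation can be written out in a few lines. This is a minimal sketch with made-up counts and gene lengths. It assumes the usual formula FPKM = count × 10^9 / (gene length in bp × total mapped fragments) and, for simplicity, takes the total as the sum of the counts supplied.

```python
def fpkm(fragment_counts, gene_lengths_bp):
    """Compute FPKM per gene from raw fragment counts and gene lengths (bp)."""
    total_fragments = sum(fragment_counts.values())
    return {
        gene: count * 1e9 / (gene_lengths_bp[gene] * total_fragments)
        for gene, count in fragment_counts.items()
    }


# Hypothetical counts and lengths, for illustration only.
example_counts = {"geneA": 100, "geneB": 300}
example_lengths_bp = {"geneA": 1000, "geneB": 3000}
example_fpkm = fpkm(example_counts, example_lengths_bp)
print(example_fpkm)
```

In this example geneB has three times the raw count of geneA but is also three times as long, so both genes end up with the same FPKM. This is the within-sample comparison that length normalization makes possible.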
Option                            Purpose
-i merged_file.txt                Input file listing sample names and their GTF files (same as used before)
-l 151                            Read length used for sequencing (important for accurate count
                                  estimation); here, it's 151 bp reads
-g gene_count_matrix.csv          Output gene-level count matrix (for DESeq2, EdgeR, etc.)
-t transcript_count_matrix.csv    Output transcript-level count matrix (useful for more detailed
                                  isoform analyses)
-v                                Verbose output (prints log messages, useful for debugging)
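The options in this table match those of prepDE.py, the helper script distributed with StringTie. That the slide refers to this script is an assumption, as are the sample names and GTF paths below. The sketch builds the sample listing (one whitespace-separated "sample ID, GTF path" pair per line) and the command as an argument list, without running anything.

```python
import shlex


def sample_listing(sample_to_gtf):
    """Format the sample list: one 'sample_id path/to/sample.gtf' pair per line."""
    return "".join(f"{sample} {gtf}\n" for sample, gtf in sample_to_gtf.items())


def prepde_command(listing_file, read_length=151):
    """Build the count-matrix extraction command from the options in the table."""
    return [
        "python", "prepDE.py",
        "-i", listing_file,
        "-l", str(read_length),
        "-g", "gene_count_matrix.csv",
        "-t", "transcript_count_matrix.csv",
        "-v",
    ]


# Hypothetical samples and paths, for illustration only.
listing_text = sample_listing({
    "sample1": "stringtie/sample1.gtf",
    "sample2": "stringtie/sample2.gtf",
})
prepde_cmd = prepde_command("merged_file.txt")
print(listing_text, end="")
print(shlex.join(prepde_cmd))
```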
6. DESeq2 (Differential Expression Analysis)
STEP 1:
• DESeqDataSetFromMatrix(): builds the DESeqDataSet object from three inputs:
  • countData: the gene count matrix (gene_info), with one row per gene and one column
    per sample.
  • colData: a data frame (sample_info) containing the metadata for the samples, including
    the condition variable.
  • design: the model design (~ condition), where condition is the factor used to compare
    differential expression (e.g., two experimental groups).
• DESeq(): runs the DESeq2 analysis: it estimates size factors and dispersions, then performs
the differential expression testing.
• saveRDS(dds, file = "dds.rds"): saves the DESeqDataSet object to an RDS file for future
reference.
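To give an idea of what "estimating size factors" means, the sketch below re-implements the median-of-ratios idea in plain Python. It is an illustration only, not DESeq2's own code: the function name and counts are made up, and the real analysis is done by DESeq() in R.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts (same gene order everywhere).
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Genes with a zero count in any sample are skipped (their geometric mean is zero).
    usable = [g for g in range(n_genes) if all(counts[s][g] > 0 for s in samples)]
    # Log of the per-gene geometric mean across samples.
    log_geo_mean = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples) for g in usable
    }
    # Size factor = median ratio of a sample's counts to the per-gene geometric mean.
    return {
        s: math.exp(median([math.log(counts[s][g]) - log_geo_mean[g] for g in usable]))
        for s in samples
    }


# Hypothetical counts: sample B was sequenced about twice as deeply as sample A.
example_sf = size_factors({"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]})
print(example_sf)
```

Dividing each sample's counts by its size factor puts the samples on a common scale, so depth differences are not mistaken for expression changes.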
STEP 2:
• Get unique conditions: conditions <- unique(sample_info$condition) extracts the distinct
experimental conditions from the sample_info data frame.
• Create the output directory: dir.create("DESeq2_pairwise_results", showWarnings = FALSE)
creates a directory to store the results. If the directory already exists, the warning is
suppressed.
• Loop over condition pairs: the nested for loops iterate through all pairs of conditions
(conditionA vs. conditionB), ensuring that each pair is only compared once. The cat() function
prints which comparison is being processed.
• Extract results: results(dds, contrast = c("condition", conditionB, conditionA)) extracts the
differential expression between conditionB and conditionA from the fitted dds object. The
contrast argument specifies which levels of the condition factor are compared; log2 fold
changes are reported for conditionB relative to conditionA.
• Order results: res[order(res$padj), ] sorts the results by the adjusted p-value (padj), so
the most significant genes come first, which makes filtering easier.
• Save results: write.csv() writes the ordered results to a CSV file in the
DESeq2_pairwise_results directory. The filename dynamically includes the comparison (e.g.,
DESeq2_conditionB_vs_conditionA.csv).
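The pairing logic of STEP 2 (each pair once, conditionB contrasted against conditionA, one CSV per comparison) can be sketched independently of R. The Python below only enumerates the comparisons and their output paths. The condition names are hypothetical, and the differential expression itself is still computed by DESeq2.

```python
from itertools import combinations


def pairwise_comparisons(conditions, out_dir="DESeq2_pairwise_results"):
    """List every unordered pair of conditions with its contrast and output file."""
    # Keep first-seen order while dropping duplicates, similar to R's unique().
    unique_conditions = list(dict.fromkeys(conditions))
    plan = []
    for cond_a, cond_b in combinations(unique_conditions, 2):
        plan.append({
            # Same ordering as contrast = c("condition", conditionB, conditionA).
            "contrast": ("condition", cond_b, cond_a),
            "file": f"{out_dir}/DESeq2_{cond_b}_vs_{cond_a}.csv",
        })
    return plan


# Hypothetical condition column of sample_info, for illustration only.
example_plan = pairwise_comparisons(
    ["control", "control", "drought", "drought", "heat", "heat"]
)
for comparison in example_plan:
    print(comparison["contrast"], "->", comparison["file"])
```

With k distinct conditions this yields k(k-1)/2 comparisons, which is what the nested loops achieve by never revisiting a pair in the opposite order.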
Output
PCA plot
Heatmap
Presented by Manvi Sharma
Thank you very much!