M.sc Transcriptome Analysis 2025

The document outlines the process of transcriptome analysis, detailing the importance of understanding RNA molecules in cells for interpreting genomic functions and disease mechanisms. It describes the steps involved in transcriptome sequencing, including RNA isolation, cDNA library preparation, quality checks, and bioinformatic analyses using tools like HISAT2 and DESeq2. Additionally, it covers the concepts of transcript assembly, quantification, and differential expression analysis, emphasizing the significance of normalization techniques for accurate comparisons across samples.

Uploaded by

sabbuj54
Transcriptome Analysis
Background and Pipeline (HISAT2 and DESeq2)
M.Sc. Bioinformatic Practical 29.04.2025
Introduction
• The transcriptome is the complete set of RNA molecules produced in a cell under specific conditions or at a specific developmental stage.

• Understanding the transcriptome is essential for:
  • interpreting the functional elements of the genome
  • revealing the molecular constituents of cells and tissues
  • understanding development and disease

• Unlike the genome, which is roughly fixed, the transcriptome varies with external conditions.
Transcriptomics
• Transcriptomics is the study of RNA in any of its forms.

• It is also referred to as expression profiling, because it most often examines the expression levels of mRNA, which indicate which genes are being expressed in a cell under a particular condition or at a particular stage.

• Transcriptomics aims to:
  • catalogue all species of transcripts, including mRNA, ncRNA and small RNA
  • determine the transcriptional structure of genes in terms of their start and end sites, splicing patterns, etc.
  • quantify the changing expression levels of each transcript under various conditions
Basic steps of transcriptome sequencing and data analyses
Sampling → RNA isolation → cDNA library preparation → Sequencing → Bioinformatic analyses

Created with BioRender.com


Pipeline used for bioinformatic analysis
cDNA Library preparation
Terminologies used
• Single-end: only one read is generated, from one end of the fragment.

• Paired-end: two reads are generated, one from each end of the fragment.

• N50: the length of the shortest contig (transcript) among the longest contigs that together contain at least 50% of the total assembled transcriptome length.

• Phred quality score: indicates the confidence of a base call in sequencing data.

Q = -10 × log₁₀(P), where P is the probability that the base call is wrong.

Phred Score | Accuracy | Error Rate
20 | 99% | 1 in 100
30 | 99.9% | 1 in 1000
40 | 99.99% | 1 in 10,000
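As a small worked illustration of two of the terms above, the following Python sketch converts between Phred scores and error probabilities and computes N50 for a list of contig lengths. The function names and the toy contig lengths are made up for this example; they are not part of the original practical.

```python
import math


def phred_from_error(p_error: float) -> float:
    """Q = -10 * log10(P), with P the probability that the base call is wrong."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Inverse of the formula above: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def n50(lengths: list[int]) -> int:
    """Shortest contig length among the longest contigs holding >= 50% of all bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2.0
    running = 0
    # Walk from the longest contig down until half of the total length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    for q in (20, 30, 40):
        print(f"Q{q}: error probability {error_from_phred(q)}")
    # Hypothetical contig lengths, for illustration only.
    print("N50:", n50([2, 3, 4, 5, 6]))
```

For the toy lengths 2, 3, 4, 5 and 6 (total 20), the two longest contigs (6 + 5 = 11) already cover half of the total, so the N50 is 5.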
1. Quality checks
• A quality check of the reads is performed before any other analysis step.

• FASTQ files store each read's sequence, as in FASTA, together with per-base quality information.

• Poor-quality bases at the read ends need to be trimmed.

• Left-over adapter sequences need to be removed.

• This leaves the highest-quality reads for data analysis.

• Tools used for read trimming and filtering: fastp, Trimmomatic.


1. QC: Read Filtering and Adapter Removal with fastp

Option | Purpose
-i, -I | Input paired-end reads
-o, -O | Output filtered paired-end reads
--detect_adapter_for_pe | Auto-detect adapter sequences (very important for paired-end!)
--correction | Correct mismatches between overlapping paired-end reads
--qualified_quality_phred 30 | Count a base as qualified only if its Phred quality is ≥30; reads with too many unqualified bases are discarded (very strict, good for downstream)
-w 16 | Use 16 CPU threads (speeds up fastp)
--html, --json | Create quality control reports
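The options in the table can be put together into one command line. The Python sketch below only assembles and prints that command; it does not run fastp. All file names are placeholders chosen for this example, not the files used in the practical.

```python
import shlex


def fastp_command(r1, r2, out_r1, out_r2, html, json_report,
                  threads=16, min_phred=30):
    """Build a fastp argument list from the options listed in the table."""
    return [
        "fastp",
        "-i", r1, "-I", r2,            # input paired-end reads
        "-o", out_r1, "-O", out_r2,    # filtered paired-end output
        "--detect_adapter_for_pe",     # auto-detect adapters for paired-end data
        "--correction",                # correct mismatches in overlapping regions
        "--qualified_quality_phred", str(min_phred),
        "-w", str(threads),
        "--html", html,
        "--json", json_report,
    ]


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    cmd = fastp_command("sample_R1.fastq.gz", "sample_R2.fastq.gz",
                        "sample_R1.clean.fastq.gz", "sample_R2.clean.fastq.gz",
                        "sample_fastp.html", "sample_fastp.json")
    print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` later without shell quoting problems.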
Transcriptome Assembly
• There are two types of assembly:
  • Reference-based assembly: reads are aligned (mapped) to an existing reference genome or transcriptome.
    • Tools: HISAT2, STAR, TopHat
  • De novo assembly: reads are assembled from scratch, without any reference, based only on their overlaps.
    • Tools: Trinity, SOAPdenovo-Trans

• Trinity is commonly used for de novo assembly. It assembles the reads by constructing de Bruijn graphs.
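To give a feel for what a de Bruijn graph is, here is a toy Python sketch that builds one from a few reads. It is not Trinity's implementation, only the basic idea: every k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases. The reads and the value of k are invented for the example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of size k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix; repeats are kept,
            # so the number of copies reflects how often the k-mer was seen.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Two overlapping toy reads.
    graph = de_bruijn_graph(["ATGGC", "TGGCA"], k=3)
    for node, successors in graph.items():
        print(node, "->", successors)
```

An assembler then walks paths through such a graph to reconstruct longer sequences; overlapping reads share nodes, which is how they get joined without a reference.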
2. Reference Genome Indexing with HISAT2

• hisat2-build creates a set of index files that HISAT2 needs for fast mapping.
• Saff_A2_final_assembly.fa is the reference genome file.
• saff is the base name for the output index files.

3. Alignment of Reads to the Reference Genome using HISAT2


Option | Purpose
--dta | Optimizes output for transcriptome assemblers like StringTie
-p 40 | Use 40 CPU threads
-x | Prefix path to genome index files created by hisat2-build
-1, -2 | Input paired-end reads
--summary-file | Save a summary report (alignment rates)

Then:
• samtools view converts the output SAM to BAM (compressed binary format).
• samtools sort sorts BAM files by genomic coordinates (needed for StringTie).
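Steps 2 and 3 can be sketched as a sequence of four commands. The Python code below only builds and prints them; nothing is executed. The genome file and index base name come from the indexing step above, while the read file names, the sample name, the output file names and the use of `-S` for the SAM output are assumptions made for this example.

```python
import shlex


def hisat2_commands(genome_fa, index_base, r1, r2, sample, threads=40):
    """Return the index, align, convert and sort commands as argument lists."""
    sam = f"{sample}.sam"
    bam = f"{sample}.bam"
    sorted_bam = f"{sample}.sorted.bam"
    return [
        # Build the index once per reference genome.
        ["hisat2-build", genome_fa, index_base],
        # Align paired-end reads; -S names the SAM output (assumed here).
        ["hisat2", "--dta", "-p", str(threads), "-x", index_base,
         "-1", r1, "-2", r2,
         "--summary-file", f"{sample}.summary.txt", "-S", sam],
        # Convert SAM to BAM.
        ["samtools", "view", "-b", "-o", bam, sam],
        # Sort the BAM by genomic coordinates.
        ["samtools", "sort", "-o", sorted_bam, bam],
    ]


if __name__ == "__main__":
    # Read files and sample name are placeholders.
    for cmd in hisat2_commands("Saff_A2_final_assembly.fa", "saff",
                               "sample_R1.clean.fastq.gz",
                               "sample_R2.clean.fastq.gz", "sample"):
        print(shlex.join(cmd))
```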
4. Transcript Assembly and Quantification using StringTie
Option | Purpose
--ref | (Optional) Provide the reference genome FASTA (StringTie can optionally use it)
-G | Mandatory here (required by -e): use known gene annotation (GTF) for better transcript prediction
-o | Output predicted transcript GTF file
-A | Output gene abundances into a simple tab-separated table
-B | Create output suitable for Ballgown differential analysis (also useful if you want alternative DE tools)
-e | Strict mode: only quantify reference transcripts (faster, no novel transcript assembly)

Option | Purpose
--merge | Merge multiple GTF files (from different samples) into a non-redundant reference annotation
-G final.gtf | Use the original annotation GTF as a guide during merging (helps to maintain consistency)
m_file.txt | A text file listing paths to all sample GTF files (one per line)
-o merged.gtf | Output the merged master GTF
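The two tables correspond to two kinds of StringTie call: one that merges the per-sample GTFs, and one that quantifies a sample against an annotation. A possible way to write them down is sketched below in Python; the commands are only built and printed. Using merged.gtf as the guide for the `-e` run, and the sample and output file names, are assumptions for this example and may differ from what was run in the practical.

```python
import shlex


def stringtie_merge_command(annotation_gtf="final.gtf",
                            gtf_list="m_file.txt",
                            merged_gtf="merged.gtf"):
    """Merge per-sample GTFs listed in gtf_list, guided by the annotation."""
    return ["stringtie", "--merge", "-G", annotation_gtf,
            "-o", merged_gtf, gtf_list]


def stringtie_quant_command(sorted_bam, guide_gtf, out_gtf, abundance_tab):
    """Quantify reference transcripts only (-e) and write Ballgown tables (-B)."""
    return ["stringtie", sorted_bam,
            "-e", "-B",
            "-G", guide_gtf,
            "-o", out_gtf,
            "-A", abundance_tab]


if __name__ == "__main__":
    print(shlex.join(stringtie_merge_command()))
    # Placeholder sample files; merged.gtf as guide is an assumption.
    print(shlex.join(stringtie_quant_command(
        "sample.sorted.bam", "merged.gtf",
        "sample/sample.gtf", "sample/sample_gene_abund.tab")))
```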
Quantification of reads (or read count)
• Quantification = counting how many reads map to each gene or transcript.
  • Tools: Salmon, StringTie, kallisto

• Normalization = adjusting those counts to remove biases. It is important because of:
  ➢ different sequencing depths
  ➢ gene length differences
  ➢ library composition bias
  • Tools: kallisto, edgeR, DESeq2


Normalized values
• FPKM (Fragments Per Kilobase per Million)
  • Used for paired-end reads.

• RPKM (Reads Per Kilobase per Million)
  • Used for single-end reads.

• FPKM/RPKM: counts are first normalized by sequencing depth (per million mapped fragments or reads), then by gene length. The sum of the values can differ between samples, so they are used when comparing genes within the same sample.

• TPM (Transcripts Per Million)
  • TPM improves on FPKM by reordering the normalization steps.
  • First normalize read counts by gene length, then scale across the sample so that the values sum to 1 million.
  • All genes are normalized together, so the total is the same in every sample → enables direct comparison.
  • Used when comparing genes across different samples.
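The difference between FPKM and TPM is easiest to see on a small example. The Python sketch below computes both from raw counts and gene lengths for one sample; the two-gene counts and lengths are invented for illustration.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of gene per million mapped fragments."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no mapped fragments")
    # count / (length in kb) / (total in millions) = count * 1e9 / (total * length_bp)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Length-normalize first, then scale so the sample sums to one million."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    if scale == 0:
        raise ValueError("sample has no mapped fragments")
    return [r / scale * 1e6 for r in rates]


if __name__ == "__main__":
    # Hypothetical sample: two genes with raw counts and lengths in bp.
    counts = [100, 200]
    lengths = [1000, 4000]
    print("FPKM:", fpkm(counts, lengths), "sum =", sum(fpkm(counts, lengths)))
    print("TPM: ", tpm(counts, lengths), "sum =", sum(tpm(counts, lengths)))
```

The TPM values always add up to one million for every sample, while the FPKM total depends on the sample, which is why TPM is preferred when comparing across samples.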
5. Preparing for DESeq2 (Differential Expression Analysis)

Option | Purpose
-i merged_file.txt | Input file listing sample names and their GTF files (same as used before)
-l 151 | Read length used for sequencing (important for accurate count estimation); here, 151 bp reads
-g gene_count_matrix.csv | Output gene-level count matrix (for DESeq2, edgeR, etc.)
-t transcript_count_matrix.csv | Output transcript-level count matrix (useful for more detailed isoform analyses)
-v | Verbose output (prints log messages, useful for debugging)
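The options in this table match those of the prepDE.py script that is distributed with StringTie; the slide does not name the script, so treating it as prepDE.py is an assumption. Under that assumption, the sketch below shows the input list format (one sample name and one GTF path per line) and the command built from the table. Sample names and paths are placeholders, and nothing is executed.

```python
import shlex


def sample_list_lines(samples):
    """One 'sample_id path/to/sample.gtf' line per sample."""
    return [f"{sample_id} {gtf_path}" for sample_id, gtf_path in samples.items()]


def prepde_command(list_path="merged_file.txt", read_length=151):
    """Build the count-matrix command from the options in the table."""
    return ["prepDE.py",
            "-i", list_path,
            "-l", str(read_length),
            "-g", "gene_count_matrix.csv",
            "-t", "transcript_count_matrix.csv",
            "-v"]


if __name__ == "__main__":
    # Hypothetical samples and paths.
    samples = {"control_1": "control_1/control_1.gtf",
               "treated_1": "treated_1/treated_1.gtf"}
    print("\n".join(sample_list_lines(samples)))
    print(shlex.join(prepde_command()))
```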
6. DESeq2 (Differential Expression Analysis)
STEP 1:
• DESeqDataSetFromMatrix():
  • countData: the gene count matrix (gene_info), with one row per gene and one column per sample.
  • colData: a data frame (sample_info) containing metadata for the samples, including the condition variable.
  • design: specifies the model design (~ condition), where condition is the factor used to compare differential expression (e.g., two experimental groups).
• DESeq(): runs the DESeq2 analysis: estimates size factors and dispersions, then performs the differential expression test.
• saveRDS(dds, file = "dds.rds"): saves the DESeqDataSet object to an RDS file for future reference.
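DESeq2 itself is an R package, so the following is not DESeq2 code. It is a small Python sketch of the median-of-ratios idea behind the "estimate size factors" part of DESeq(): each sample's counts are compared with the per-gene geometric mean, and the median ratio becomes that sample's size factor. The toy count matrix is invented for the example.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: list of gene rows, each a list of raw counts (one per sample)."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        # Genes with a zero count have a geometric mean of zero and are skipped.
        if any(c <= 0 for c in row):
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    # Median ratio per sample, converted back from log scale.
    return [math.exp(median(r)) for r in log_ratios]


if __name__ == "__main__":
    # Toy matrix: sample 2 was sequenced about twice as deeply as sample 1.
    toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
    factors = size_factors(toy_counts)
    print("size factors:", factors)
    # Dividing each column by its size factor gives normalized counts.
    print([[c / f for c, f in zip(row, factors)] for row in toy_counts])
```

Using the median makes the estimate robust to a few very highly expressed genes, which addresses the library composition bias mentioned earlier.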
STEP 2:
• Get unique conditions: conditions <- unique(sample_info$condition) extracts the distinct experimental conditions from the sample_info data frame.
• Create the output directory: dir.create("DESeq2_pairwise_results", showWarnings = FALSE) creates a directory to store the results. If the directory already exists, the warning is suppressed.
• Loop over condition pairs: the nested for loops iterate through all pairs of conditions (conditionA vs. conditionB), ensuring that each pair is compared only once. The cat() function prints which comparison is being processed.
• Extract DESeq2 results: results(dds, contrast = c("condition", conditionB, conditionA)) computes the differential expression of conditionB relative to conditionA, so a positive log2 fold change means higher expression in conditionB. The contrast argument specifies which levels of the condition factor are compared.
• Order results: res[order(res$padj), ] sorts the results by adjusted p-value (padj), so the most significant genes come first.
• Save results: write.csv() writes the ordered results to a CSV file in the DESeq2_pairwise_results directory. The filename dynamically includes the comparison (e.g., DESeq2_conditionB_vs_conditionA.csv).
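The pairing logic of STEP 2 can be illustrated independently of R. The Python sketch below lists each pair of conditions exactly once, together with the contrast that would be passed to results() and the output file name described above. It does not run DESeq2, and the condition names are made up for the example.

```python
from itertools import combinations
from pathlib import Path


def pairwise_plan(conditions, out_dir="DESeq2_pairwise_results"):
    """Return (contrast, output_path) for every unordered pair of conditions."""
    # Keep first-seen order while dropping duplicates, like unique() in R.
    unique_conditions = list(dict.fromkeys(conditions))
    plan = []
    for cond_a, cond_b in combinations(unique_conditions, 2):
        # conditionB is compared against conditionA (the reference).
        contrast = ("condition", cond_b, cond_a)
        out_path = Path(out_dir) / f"DESeq2_{cond_b}_vs_{cond_a}.csv"
        plan.append((contrast, out_path))
    return plan


if __name__ == "__main__":
    # Hypothetical condition column from a sample sheet.
    sample_conditions = ["control", "control", "heat", "heat", "cold", "cold"]
    for contrast, out_path in pairwise_plan(sample_conditions):
        print(f"Processing: {contrast[1]} vs {contrast[2]} -> {out_path}")
```

With n conditions this gives n(n-1)/2 comparisons, so three conditions produce three result files.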
Output

PCA plot

Heatmap
Presented by Manvi Sharma

Thank you very much!
