Introduction to Bioinformatics Concepts
1. Nucleic Acid and Protein Sequence, Structure, and Function
Nucleic acids (DNA and RNA) store and transmit genetic information. DNA has bases A, T, C, G,
while RNA has A, U, C, G. Their structure can be described at four levels: Primary (sequence),
Secondary (helices, loops), Tertiary (3D folding), and Quaternary (multi-molecule complexes).
Proteins are chains of amino acids (20 standard types). Their function depends on structure: Primary
(sequence), Secondary (α-helices, β-sheets), Tertiary (3D folding), and Quaternary (complexes of
multiple chains). The central dogma describes the flow of information from DNA → RNA → Protein;
sequence in turn determines structure, and structure determines function.
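The DNA → RNA → Protein flow can be illustrated with a minimal Python sketch. It assumes the input is the coding strand, uses the standard genetic code, and stops at the first stop codon; the function names and the example sequence are made up for illustration.

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for start in range(0, len(dna) - 2, 3):
        # "X" marks a codon that is not in the table (e.g. contains N).
        amino_acid = CODON_TABLE.get(dna[start:start + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCATTGTAATGGGCCGCTGA"  # made-up example sequence
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # AUGGCCAUUGUAAUGGGCCGCUGA
print(protein)  # MAIVMGR
```

The sketch covers only the sequence level; predicting structure from the resulting protein sequence is a separate and much harder problem.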
2. Introduction of Drug Molecules
Drug molecules are compounds that interact with proteins, DNA, or RNA to alter biological
processes. They include small molecules (e.g., aspirin, antibiotics) and biologics (e.g., antibodies,
peptides). Drugs work by binding to targets such as enzymes or receptors. For example, many antibiotics
inhibit bacterial ribosomes, and several anti-HIV drugs target viral enzymes such as reverse transcriptase.
3. Scripting Language / Linux Commands for Big Biological Data
Large bioinformatics datasets (genomes, protein databases) are usually handled with Linux command-line tools.
Common commands include:
1 ls, cd, pwd → Navigate files.
2 cat, less, head, tail → View FASTA/FASTQ sequences.
3 grep → Search motifs/patterns.
4 wc -l → Count lines (for four-line FASTQ records, number of reads = lines / 4).
5 cut, awk, sed → Extract sequence headers or IDs.
6 sort, uniq → Handle duplicates.
7 gunzip, tar → Decompress and unpack archived genome databases.
Example: `grep -c '^>' genome.fasta` counts sequences in a FASTA file.
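When the same counts are needed inside a script, they can be reproduced in Python. The sketch below is one possible approach, not a replacement for the shell tools: it assumes header lines start with `>` in FASTA and that FASTQ files use the common four-lines-per-read layout. The helper names and the demo file are hypothetical.

```python
import gzip
import os
import tempfile


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_fasta_records(path):
    """Count header lines (those starting with '>'), like grep -c '^>'."""
    with open_text(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_fastq_reads(path):
    """Count reads, assuming four lines per read."""
    with open_text(path) as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Small demonstration on a temporary, made-up compressed FASTA file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(">seq1\nATGC\n>seq2\nGGCC\nTTAA\n")
    n_records = count_fasta_records(demo_path)

print(n_records)  # 2
```

Reading the file line by line keeps memory use low, which matters for genome-scale files.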
4. Programming Using Python and R
Python in Bioinformatics:
1 Biopython – sequence parsing, BLAST, alignments.
2 Pandas – data manipulation.
3 Matplotlib/Seaborn – visualization.
Example (GC Content Calculation in Python):
from Bio.Seq import Seq
seq = Seq("ATGCGTACGATCG")
gc_content = 100 * (seq.count("G") + seq.count("C")) / len(seq)
print("GC Content:", gc_content, "%")
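The same calculation can be applied to several sequences without Biopython. The following is a minimal sketch in plain Python; the `gc_content` helper, the sequence identifiers, and the second and third sequences are made up for illustration. It treats lower-case bases like upper-case ones and returns 0.0 for an empty sequence to avoid division by zero.

```python
def gc_content(sequence):
    """Return GC content as a percentage; 0.0 for an empty sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return 100 * gc / len(sequence)


# Hypothetical sequences, keyed by made-up identifiers.
sequences = {
    "seq1": "ATGCGTACGATCG",
    "seq2": "ggccaatt",
    "seq3": "ATATATAT",
}

gc_by_id = {name: gc_content(seq) for name, seq in sequences.items()}
for name, value in gc_by_id.items():
    print(f"{name}\t{value:.1f}%")
```

A dictionary of per-sequence values like `gc_by_id` is a convenient input for a table or a histogram.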
R in Bioinformatics:
1 Bioconductor – genomics and transcriptomics.
2 ggplot2 – advanced visualization.
3 dplyr – data wrangling.
Example (GC Content Histogram in R):
gc_content <- c(40, 42, 38, 50, 45)
hist(gc_content, main="GC Content Distribution", xlab="GC%", col="lightblue")
5. Integration of Linux, Python, and R
A complete bioinformatics pipeline often uses all three tools. Linux handles large raw data (FASTA,
FASTQ), Python processes sequences (translation, motif search), and R performs statistical
analysis and visualization (gene expression, RNA-seq).
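The Python stage of such a pipeline might look like the sketch below. It assumes the raw FASTA text has already been prepared on the Linux side, searches each sequence for a motif, and writes a CSV table that R could load with `read.csv()`. The function names, column names, motif, and example records are all hypothetical.

```python
import csv
import io
import re


def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def summarize(fasta_text, motif):
    """Return (id, length, motif_count) rows, counting overlapping hits."""
    pattern = re.compile(f"(?={re.escape(motif)})")  # lookahead allows overlaps
    return [
        (name, len(seq), len(pattern.findall(seq)))
        for name, seq in parse_fasta(fasta_text)
    ]


def write_csv(rows, handle):
    """Write the summary rows with a header line."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "length", "motif_count"])
    writer.writerows(rows)


# Made-up input; in a real pipeline this text would come from a file.
fasta_text = ">geneA demo\nATGATGCC\nATG\n>geneB\nCCGGTTAA\n"
rows = summarize(fasta_text, motif="ATG")
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing data between stages as plain text files (FASTA in, CSV out) keeps each tool independent, so any stage can be rerun or replaced on its own.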