Goals for today: Course Introduction
1. Course overview:
– Staff, students, responses to student survey
– Foundations, frontiers, textbook, homework, quiz
– Final project: teams, mentorship, challenge, relevance,
originality, achievement, presentation
2. Why Computational Biology + Generative AI?
– Why Computation is well-suited to Biology+Genomes
– Why this time it’s different: Gen AI+DeepRepr.Learning
3. Overview of course modules
– Genomes, Expression, Epigenomics,
Networks, Genetics, Evolution, Frontiers
4. Biology primer (in the context of this course)
– Central Dogma of Molecular Biology
– DNA, Epigenomics, RNA, Protein, Networks
– Human genetics, Drug Discovery
I. Administrivia
Introduction to the course and its goals
Course organization and content
Homework and Quiz
Term Project
Introductions
• Lecturer: Manolis Kellis
– MIT CSAIL, CompBio, Broad, Disease mechanism, Epigenomics,
Cancer, Brain, Gene Regulation, Evolution, Single-cell genomics
• Lecturer: Eric Alm
– MIT Biological Engineering, Gen AI, Computational, theoretical,
experimental understanding & engineering human microbiome
• TA: Jared Zheng
– MIT CSAIL, Zhang Lab, Chemistry, Biophysics, protein-ligand
interactions, drug discovery, deep generative models, PLMs
• TA: Sarah Gurev
– MIT EECS, Debbie Marks Lab Harvard, Stanford BS in CS,
protein design and evolution
• TA: Benjamin James
– MIT EECS, CSAIL, Computational Biology, Broad, Epigenomics,
Regulatory Circuitry, Single-Cell, Addiction, Neuroscience
Course Information
• Lectures
– TR 1pm – 2:30pm, Room 32-144
• Recitations/Mentoring/Office Hours:
– On Friday at 3pm in 32-144
– Recitations at MIT
• Course Website
– https://canvas.mit.edu/courses/28242
– or simply: http://compbio.mit.edu/MLCB (redirects to canvas)
– All handouts, lectures, notes, etc. will be posted here.
• Course calendar:
– On Google, add public calendar: “MLCB24 Lectures”
Goals for the term
• Introduction to computational biology
– Fundamental problems in computational biology
– Algorithmic/machine learning techniques for data analysis
– Research directions for active participation in the field
– Understanding how methods work
• Ability to tackle research
– Problem set questions: rigorous algorithmic thinking
– Programming assignments:
hands-on experience w/ real datasets
– Final project experience:
propose and carry out independent original research
present findings in conference format (written, oral)
Computation & Biology | Foundations & Frontiers
• Duality #1 (x-axis): Computation and Biology
– Important, relevant, current biology:
Important biological problems
– Fundamental computer science:
General techniques, principles
• Duality #2 (y-axis): Foundations and Frontiers
– Foundations:
– well-defined problems, general methodologies
– ‘The classics’ of the field
– Frontiers:
– in-depth look at complex, current problems, open questions
– combine techniques learned
– opens to projects, research directions
Course at a Glance
Fall 2020, 2019, 2018: YouTube, and ease of use anywhere
YouTube Playlist: (Fall 2021)
Fall 2020: https://www.youtube.com/playlist?list=PLypiXJdtIca6dEYlNoZJwBaz__CdsaoKJ
Fall 2019: https://www.youtube.com/playlist?list=PLypiXJdtIca6U5uQOCHjP9Op3gpa177fK
Fall 2018: https://www.youtube.com/playlist?list=PLypiXJdtIca6GBQwDTo4bIEDV8F4RcAgt
Features: bookmarks, closed captions, chapters, playlists
Fall 2021, 2022: Panopto, and awesome search capabilities
Panopto (Fall 2021)
https://mit.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx?folderID=7c716154-6516-4a49-9a81-adad0135dcb8
Panopto (Fall 2022)
https://mit.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx?folderID=176f8b23-0433-403d-8c26-af090151a28d
Features: speaker video, shared screen, search function, automatic transcript, 2X speed, automatic chapters (from slide headers), slide navigation
Details on the in-class quiz
• It’s not a midterm, and it’s not a final exam
– It’s a quiz, friendly, fun, interesting, cute, fuzzy
• Demonstrate mastery of the material in 4 modules
– Understand key points emphasized in lecture
– Understand subtleties revealed in the psets
– Ability to apply new skills to solve practical problems
• Types of questions
– Knowledge questions: T/F justify, multiple choice
– Deeper understanding questions: short answers
– Practical problems: work through simple algorithm
– Design problem(s): new/modified algorithm, need
both knowledge and new idea, argue correctness
Final Project: Original Research in Comp Bio
• A major aspect of the course is preparing you for
original research in computational biology.
– Framing a biological problem computationally
– Gathering relevant literature and datasets
– Solving it using new algorithms, machine learning
– Interpreting the results biologically
• Also ability to present your ideas and research
– Crafting a research proposal (fellowships/grants)
– Working in teams of complementary skill sets
– Review peer proposals, find flaws, suggest improvements
– Receiving feedback and revising your proposal
– Writing up your results in a scientific paper format
– Presenting a research talk to a scientific audience
• Term project experience mirrors this process
Project Milestones
• Round 0: Self-introduction
(due Week 2 Friday)
• Round 1: Literature search and paper
description (due Week 4 Friday)
• Round 2: Team formation, project proposal,
feasibility (due Week 6 Friday)
• Round 3: Office Hours, Update, Feedback
(Meet Week 8 + Week 10 Fridays)
• Round 4: Midcourse report
(due Week 12, Friday)
• Round 5: Final report+slides
(due Week 14, Friday)
Details on the final project
• Milestones ensure sufficient planning / feedback
– Set-up: find project matching your skills and interests
– Team: common interests and complementary skills
– Inspiration: last year’s projects, and recent papers
– Proposal: establish milestones, deliverables, expectations
– Midcourse: see endpoint, outline report, methods, figures
• Periodic mentoring sessions
– Senior students and postdocs can serve as your mentors
– Group discussions to share ideas, guidance, feedback
– Peer-review: think critically about peer proposals, receive
feedback/suggestions, respond to critiques, adjust course
• Real-world experience, condensed in a single term
– Grant/fellowships proposals, peer review, yearly reports,
budget time/effort, collaboration, paper writing, give talk
Comm Lab: Help communicating your research!
A free resource for peer feedback from trained EECS grad students and postdocs.
• You can be anywhere in the process:
– Brainstorming
– Outlining
– Revising
– Final polishing
• Why people come to CommLab (number of appointments; total: 400 appointments):
– Resume / CV: 63
– Graduate school appl.: 43
– Other (incl. startup plans, RQE): 38
– Faculty package: 35
– Other report or essay: 34
– Oral presentation: 33
– Fellowship / scholarship appl.: 33
– Thesis: 31
– Manuscript: 29
– Poster / visual: 27
– Thesis proposal: 14
– Abstract: 8
– Grant: 7
– Lab report: 5
• What people say:
– “Very, very valuable. Thank you!” (Elena Glassman, EECS PhD alumna)
– “I strongly encourage students to schedule a session; it’s a very impressive resource.” (Dirk Englund, professor)
– “The experience and coaching helped me apply successfully for an important fellowship this year.” (Joel Jean, EECS grad)
Finding a research mentor / research advisor
• Chance to meet faculty at MIT/Broad/Harvard:
– Through guest lectures and mentoring
– Topics and papers covered in the lectures
– Experts on: (1) human comparative genomics, (2)
lincRNAs, (3) metabolic modeling, (4) disease mapping,
selection, evolution and ecology (following four modules)
• Chance to meet senior students and postdocs:
– On: coding genes, ncRNAs, regulatory motifs, networks,
epigenomics, phylogenomics (again on each module)
– Mentorship sessions with entire MIT CompBio group
• Your own personal research experience:
– collaborators, datasets
– learn active research directions, frontiers
– a living, breathing, changing field
Putting it all together
Course Activities: Mens et Manus
• Learning (25 lectures * 1.5 hours)
• Mentoring (4-7 meetings * 1 hour)
• 3 problem sets: [30% of your grade]
– Out on Tuesdays, due Mondays in 2-3 weeks.
– Each problem set covers 1 module, contains ~4 problems.
– Algorithmic problems and programming assignments
• Final project [40% of your grade]
– Introduction to research in computational biology (full term!)
– Includes peer-reviewed NIH-style proposal and much feedback
• Quiz [25% of your grade]
– In-class quiz. No final exam.
• Office hours/recitations/lectures participation: 5% grade
• Collaboration policy [humans and AI]
– Collaboration allowed, but you must:
• Work independently on each problem before discussing it
• Write solutions on your own
• Acknowledge sources and collaborators. No outsourcing.
– ChatGPT / LLM policy
• Acknowledge the way you would for a collaboration partner
• Be transparent, save your chats, possibly submit w/ homework
Why Computational Biology ?
Why Computational Biology: Last year’s answers
• Lots of data (* lots of data)
• There are rules
• Pattern finding
• It’s all about data
• Ability to visualize
• Simulations, temporal relationships
• Guess + verify (generate hypotheses for testing)
• Propose mechanisms / theory to explain observations
• Networks / combinations of variables
• Efficiency (reduce experimental space to cover)
• Informatics infrastructure (ability to combine datasets)
• Correlations, higher-order relationships
• Cycle from hypothesis generation to testing condensed
• Life itself is digital. Understand cellular instruction set
Why Computational Biology: Live in Zoom Chat F20
• Data-rich in a historically data-poor domain (Matthew West)
• potential to do whatever you want without waiting for experiments (Stuti Khandwala)
• DNA is a massive dataset (Pablo X Villalobos)
• More efficient and in depth way to explore biology (Lilly K Edwards)
• There're tons of biological datasets waiting to be analyzed (Hieu Q Dinh)
• Because you can use other people’s datasets and then get good research done on a budget (Ari)
• Might be the biggest frontier of computing today (Erez Kaminski)
• More and more sequencing data are coming out (Evelyn Tong)
• New technologies - lots of data - (Manu Ponnapati)
• Biology benefits from approximation (Thomas Xiong)
• The need to integrate multi-omics data to gain more insights (Kathleen Sucipto)
• Its interesting and new (Daniel R Gutierrez)
• Can use expertise from other engineering fields to impact health (Swathi Manda)
• Complex patterns in biological data (Farhan Khodaee)
• impact real human lives, important applications (Lucy Zhang)
• answers questions not easily solvable by traditional experimental biology (Andrew D Hennes)
• Expands our horizons in asking biological questions (Dylan McCormick)
• Computational biology and simulations can help deconvolve results from experiments (Raina Thomas)
ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA
TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC
TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC
TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT
CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG
AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA
GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT
TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG
TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT
TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA
TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG
CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA
TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT
CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT
GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA
TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC
CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG
TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG
GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA
AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT
TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG
CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC
ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA
GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA
ATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT
AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA
GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG
ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG
CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC
GAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA
AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA
AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA
TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT
CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT
AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA
GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA
GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA
CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC
CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT
[Slides: the same DNA sequence repeated as a background, with these annotations]
Genes encode proteins
Regulatory motifs control gene expression
Extracting signal from noise
The components of genomes and gene regulation
Goal: A systems-level understanding of genomes and gene regulation:
• The genome: Map reads, align genes/genomes, assembly strategies
• The genes: Protein-coding exons, introns, non-coding RNA, RNA folding
• The control regions: Promoters, enhancers, insulators, chromatin states
• The actual words: Regulatory motifs, high-resolution accessibility maps
• The regulators: Transcription factors, chromatin modifiers, nucleosomes
• The dynamics: Changing maps between cell types, across development
• The networks: regulator→enhancer→target, ChIP-seq, correlated activity
• The grammars: TF/motif/mark combinations, predictive models
• Human variation: Human diversity, population genomics, linkage maps
• Evolution: Phylogenetics, phylogenomics, coalescent, human ancestry
• GWAS/QTLs: Genome variation → organismal/molecular phenotypes
• Disease: Personal (epi)genomics, pharmacogenomics, synthetic biology
Deep Data and the Next Wave of Medicines
[Figure: Enkoimesis, Epidauros, 4th BC. Arch. Museum of Piraeus. Labeled roles: God / temple physician; Committee (peer review); Priestess (nurse); Patient]
• Hippocrates, Alkmaion, Asclepius, Humorism, Aristotle
• Renaissance-1900: Anatomy, Microscopy, Vesalius, Leonardo, Cajal
• Alois Alzheimer, 1911: AD Plaques+Tangles
• Human genome, genetic studies, GWAS 2007
• Single-cell: 430 donors, 2M cells
Three major paradigm shifts: Data, Genomes, AI
• Shift 1: Hypothesis-driven research → Data-driven research
– Hypothesis-driven: Formulate hypothesis → gather data. Lots of thinking before target study. Problem: highly biased, little novelty
– Data-driven: Gather data → ask questions later. Systematic datasets, build resources, massive data sharing, comprehensive
• Shift 2: Correlation-based analysis → Causality-based analysis
– Correlation-based: More coffee → better health. More chocolate → more Nobel Prizes. ‘Epidemiology’ is all about correlations
– Causality-based: Genetic variants → disease outcome. Polygenic risk score → causal factors. Perturbation experiments → confirm
• Shift 3: Classical data analysis → Generative AI + Deep Learning
– Classical: New methodology for each problem. Human scientist does all the ‘thinking’. Few parameters, targeted models
– Generative AI: Foundation models, multi-modality. Representation learning, hierarchical. Truly ‘understand’ concepts → insights
Dissect mechanisms of disease-associated regions
1. Disease genetics reveals common + rare variants/regions
2. Profile RNA + Epigenome in healthy + disease samples
3. Integrate data to predict driver genes, regions, cell types
4. Validate predictions in human cells + mouse models (cell cultures, mouse models)
5. Disseminate results
References on slide: Roadmap, Nature 15; Boix, EpiMap, Nature 21; Claussnitzer, NEJM’15; Blanchard, Nature, 2022; Park, NBT 15
Non-coding circuitry helps interpret disease loci
Region of association
• Expand each GWAS locus using SNP linkage disequilibrium (LD)
– Recognize relevant cell types: tissue-specific enhancer enrichment
– Recognize driver TFs: enriched motifs in multiple GWAS loci
– Recognize target genes: linked to causal enhancers (Quon, bioRxiv 467852)
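As a rough illustration of the LD-expansion step above, the sketch below grows a locus from a lead SNP to every SNP whose squared genotype correlation (r²) with it passes a threshold. The function names, the toy dosage matrix, and the 0.8 cutoff are assumptions made for this example, not values taken from the slides. Dosage-based r² is also only an approximation of haplotype-based LD.

```python
import numpy as np

def ld_r2(genotypes):
    """Pairwise r^2 between SNP columns of a (samples x SNPs) dosage matrix."""
    r = np.corrcoef(genotypes, rowvar=False)
    return r ** 2

def expand_locus(genotypes, lead_index, r2_threshold=0.8):
    """Indices of SNPs whose r^2 with the lead SNP meets the threshold."""
    r2 = ld_r2(genotypes)
    return [int(j) for j in np.flatnonzero(r2[lead_index] >= r2_threshold)]

# Hypothetical dosages (0/1/2 alternate alleles): 8 individuals x 4 SNPs.
# SNP 1 tracks SNP 0 closely, SNP 2 is perfectly anti-correlated with SNP 0,
# and SNP 3 is essentially unlinked.
genotypes = np.array([
    [0, 0, 2, 1],
    [1, 1, 1, 0],
    [2, 2, 0, 1],
    [0, 0, 2, 2],
    [1, 1, 1, 0],
    [2, 2, 0, 2],
    [0, 0, 2, 1],
    [2, 1, 0, 0],
])

locus = expand_locus(genotypes, lead_index=0)
print(locus)  # SNP indices kept in the expanded locus
```

The expanded set of SNPs is what would then be intersected with enhancer annotations or motif instances in the three "Recognize" steps.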
FTO & Obesity: Uncover & manipulate circuitry → reverse disease phenotypes
[Figure: BMI association (-log10P) by SNP genomic position (23 chrs); Speliotes NG 2010]
• Lean: C-to-T, increased ARID5B, decreased IRX3, IRX5
• Obese: T-to-C, decreased ARID5B, increased IRX3, IRX5
• CRISPR-edit human fat cells: able to burn calories again (Claussnitzer, NEJM 2015)
• IRX3 KD: burn calories in their sleep, 54% weight loss, can’t gain weight
ApoE4 & Alzheimer’s: Cholesterol transport → Oligo ER accumulation → Myelin → Cognition
• scRNA of ApoE33, ApoE34, ApoE44 individuals
• Cholesterol transport & biosynthesis in oligos
• Cholesterol accumulates in ER, myelination decreases
• Restoring cholesterol transport (cyclodextrin) restores myelination & restores cognition
• Causality: Lack of myelination recapitulated in ApoE4 iPSC-derived oligodendrocytes (Blanchard, Nature, 2022)
With: Joel Blanchard, Leyla Akay, Jose Davila-Velderrain, Djuna von Maydel, Li-Huei Tsai
Reverse cancer w/ immunotherapy: scRNA + epig + TFs → personalized combination treatment
# first sample 2011-12-19, last sample 2020-09-02
# mean age 62
# 28 F and 56 M
# 61 PRE, 57 ON, 23 POST, 7 PRO, 3 NA
# 93 ICI (51 PD1, 30 PD1+CTLA4, 9 CTLA4, 3 PDL1) (43 no prior treatment), 20 targeted, 13 targeted+ICI, 7 other, 7 other+ICI, 2 NA
# 69 responders (R), 74 progressive disease (PD), 8 NA
# 143 mets, 5 normal, 3 melanoma primary
bioRxiv 506051 ’22
Jackie Yang,
David Liu (Dana Farber),
Kunal Rai (MD Anderson),
Genevieve Boland (MGH)
What is GenAI and how can it help cure disease?
Manolis Kellis
GenAI Key idea: Representation learning
• ‘Modern’ deep learning: hierarchical representation learning
• ‘Classical’ fully-connected neural networks
• Two tasks: feature extraction and classification
• In deep learning, the two tasks are coupled:
– The classification task “drives” the feature extraction
– Extremely powerful and general paradigm
• Be creative! The field is still in its infancy!
– New application domains (e.g. beyond images) can have structure that current architectures do not capture/exploit
– Genomics/biology/neuroscience can help drive development of new architectures
Deep learning: many layers of abstraction
• Convolutional Neural Networks: learn complex scenes/objects from simpler parts
• Bottom-up building of world representations
• Convolution operation: scanning for features in a field
• Layers, from low to high level (Goodfellow 2016): edges, dark spots → eyes, ears, nose → facial structure
Deep Convolutional Neural Networks for Genomics
Steps, from input to output:
1. Scan sequence using filters (convolutional filters learn motifs, PSSM)
2. Threshold scores using ReLU
3. Max pool thresholded scores over windows
4. Predict probabilities using logistic neuron
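The four operations on this slide (scan with filters, ReLU threshold, max pool, logistic neuron) can be sketched as a forward pass in plain NumPy. This is a minimal illustration, not a trained model. The `tata` filter, the threshold of 3.0, the pooling window, and the logistic weights and bias are all hypothetical values chosen so that the behavior is easy to follow.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length x 4) matrix, columns ordered A, C, G, T."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x

def scan(x, motif_filter):
    """Slide a (width x 4) filter along the sequence; one score per position."""
    width = motif_filter.shape[0]
    return np.array([np.sum(x[i:i + width] * motif_filter)
                     for i in range(x.shape[0] - width + 1)])

def relu(scores, threshold=0.0):
    """Keep only the part of each score that exceeds the threshold."""
    return np.maximum(scores - threshold, 0.0)

def max_pool(scores, window):
    """Maximum over consecutive, non-overlapping windows."""
    return np.array([scores[i:i + window].max()
                     for i in range(0, len(scores), window)])

def logistic(features, weights, bias):
    """Logistic neuron: weighted sum squashed into a probability."""
    return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))

# Hypothetical filter for the motif "TATA": +1 per matching base, -1 per mismatch.
tata = 2.0 * one_hot("TATA") - 1.0

def predict(seq, window=4):
    scores = scan(one_hot(seq), tata)           # 1. scan sequence using the filter
    active = relu(scores, threshold=3.0)        # 2. only full matches survive
    pooled = max_pool(active, window)           # 3. max pool over windows
    weights = np.full(len(pooled), 4.0)         # illustrative, untrained weights
    return float(logistic(pooled, weights, -2.0))  # 4. logistic neuron

print(predict("GGCTATAAGGCC"), predict("GGCCGGCCGGCC"))
```

In a real network the filter values would be learned by backpropagation rather than written by hand, which is how convolutional filters end up resembling motif PSSMs.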
Deep Learning Architectures: Graph Neural Networks (GNNs)
Graph Convolutional Networks
Idea: Node’s neighborhood defines a computation graph
• Determine node computation graph
• Propagate and transform information
• Learn how to propagate information across the graph to compute node features
Basic approach: Average information from neighbors and apply a neural network
(1) average messages from neighbors
(2) apply neural network
[Kipf and Welling, ICLR 2017]
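The two-step recipe (average messages from neighbors, then apply a neural network) can be written as a single matrix operation. The sketch below uses plain mean aggregation with self-loops, which is a simplification: Kipf and Welling use a symmetrically normalized adjacency matrix. The toy graph, features, and weight matrix are made up for illustration.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One mean-aggregation graph convolution layer."""
    a_hat = adj + np.eye(adj.shape[0])           # self-loops: a node also keeps its own features
    deg = a_hat.sum(axis=1, keepdims=True)       # neighborhood size per node
    averaged = (a_hat / deg) @ feats             # (1) average messages from neighbors
    return np.maximum(averaged @ weight, 0.0)    # (2) apply a one-layer neural network (ReLU)

# Toy path graph 0 - 1 - 2 with one feature per node (hypothetical values).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
feats = np.array([[1.0], [2.0], [3.0]])
weight = np.array([[1.0]])

out = gcn_layer(adj, feats, weight)
```

Stacking several such layers lets information propagate across multi-hop neighborhoods.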
NLP, words, sentences: Distributional Semantics
• Terms that appear in the same contexts (surrounded by the same words) are probably semantically related
• Every term is mapped to a high-dimensional vector (the embedding space)
• Ever more sophisticated versions of embeddings; the earliest can be viewed as a form of matrix factorization
• Word2Vec
• GloVe
• ELMo
• BERT
• GPT
Embedding space calculations:
Plausibility of semantic claims
t-Distributed Stochastic Neighbor Embedding (t-SNE) of high-dim space
Mapping words to a conceptual embedding space: Word2Vec
• Words with similar contexts should
map to similar coordinates in the
embedding space
• To achieve this, predict from context: encoding [embedding, representation learning], then decoding [actual prediction]
• Train weights through densely-
connected network [dense] and
through embeddings [emb] with
backpropagation
• Initial embeddings are scattered,
but after training, characters group
together [and words similarly]
• Use multiple consecutive characters to increase context information → prediction improves
• From characters to words: need
larger context, more layers, higher-
dimensional representation
From Words → Sentences → Docs: Attention, Transformer, Re-shaping
• Attention is all you need. NeurIPS 2017
• 125k citations, as of March 1, 2023
(Watson & Crick’s 1953 Nature: 17k)
Key Idea: How important is this word,
with respect to ALL other words?
Encoder: reads the entire sequence all at once.
Decoder: reads left to right (but parallelized)
Positional encodings
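The key idea (score every word against all other words) and the encoder/decoder distinction can be sketched with scaled dot-product self-attention. This is a minimal single-head illustration without learned query/key/value projections; the random embeddings are placeholders, and `causal=True` mimics the left-to-right decoder by masking future positions.

```python
import numpy as np

def positional_encoding(length, d_model):
    """Sinusoidal positional encodings: sin on even dimensions, cos on odd ones."""
    pos = np.arange(length)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, causal=False):
    """Scaled dot-product attention; returns outputs and the attention weights."""
    scores = q @ k.T / np.sqrt(q.shape[-1])       # importance of each word w.r.t. all others
    if causal:
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # decoder: no peeking at later words
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8)) + positional_encoding(5, 8)   # 5 placeholder "words", 8 dimensions
enc_out, enc_w = attention(x, x, x)                # encoder: whole sequence at once
dec_out, dec_w = attention(x, x, x, causal=True)   # decoder: left to right
```

All positions are computed in one matrix product, which is why the decoder can be trained in parallel even though it reads left to right.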
Multi-modal generative AI: Image ↔ Text Translation
Paint a classroom of students listening to a lecture on multi-modality with astronauts and knights and princesses where the lecturer is a giant bear
The image depicts a bright classroom scene. There are multiple rows of wooden
desks, each accommodating two students, and the room is filled with children who
appear to be in elementary school. The students are wearing casual clothing, with a
variety of patterns including stripes and plaids. The majority of the children have
their hands raised, signaling eagerness to participate or answer a question. In the
background, there is a teacher standing next to a whiteboard, which is partially
obscured in the image. The whiteboard appears to be blank. The room has large
windows that allow plenty of natural light to fill the space, and there are white walls
and a green chalkboard behind the teacher. The desks have open fronts where books
and notebooks can be stored, and there are papers and books on the desks. The
children's attention is focused on the teacher, indicating an interactive and engaging
class environment
Cross-modal “Visual-Semantic Embeddings”
WSABIE (Weston et al 2010), DeVise (Frome et al 2013),
Cross-Modal Transfer (Socher et al 2013)
[Figures: Frome et al. 2013; Socher et al. 2013]
WSABIE: Scaling Up To Large Vocabulary Image Annotation
• Improve image annotation/tagging, scale to large annotation vocabulary
• Joint embedding space: images + annotation words
• Ranking loss function: learn representations tuned to ranking annotations of a given image
Cross-Modal Transfer (Socher 2013)
• Zero-Shot Learning Through Cross-Modal Transfer
• Object/concept recognition in one modality (e.g. image) even when description only in other modality (e.g., text)
• NNs to understand relationship between modalities
• Zero-Shot Learning: categories not seen during training
• Semantic Mapping of visual and textual features in common space where they can be compared and associated
Latent AI Embedding Space
Cartography and Navigation
Accelerate Discovery Process Itself: AI for the Future of Work
[Figure: embedding map with topic-cluster labels: Cancer Research & Biomarkers; Cancer Genetics & Epigenetics; Gene Expression & Genome Stability; Neurobiological Factors in Neurological Disorders; Genomics & Gene Regulation; Genetics & Healthspan; Gene Expression & Regulation Genomics]
Example 1: Embedding 155,011 papers citing our work
• Idea Navigator: Explicit Interactive Embedding Space Exploration
• Multi-Resolution: Team, Institution, Sub-Field, Humanity, Custom
• Multi-Scale: Hours, Day, Week, Month, Year, Humanity, Custom
• Multi-Modal: Video,Audio,Notes,Github,Gdocs,Dropbox,Email,Slack
• Density, relatedness, temporality, correlation, bridges, dynamics
• Navigate, explore, transparency, guidance, uncharted territories
• Now: Team collaboration, match ideas ↔ projects ↔ people ↔ data
• Applications: self-reflection, coordination, planning, evaluation, growth
• Training: students, new team members, re-allocation, re-training
• Ultimate Goal: enrich ways we think, create, collaborate, plan, reflect
• Interactivity: summarize, extrapolate, innovate, follow up, connect
Spruce Campbell, Will Hathaway, Dakota Goldberg, Evan Liu, Brian Zheng
Example 2: Embedding 10,428 meeting auto-transcripts
Pathological map of 100s of Patients
• Every person is a point, based on their cellular expression patterns
• Reveal impact of gene expression, phenotype, genotype
• Use ‘cartography’ from gene expression to map phenotype
• Common foundational map → reason about health impact
Multi-modal Embeddings of 2.4 million Human Cells
Every dot is a 20,000-dimensional vector
Integrate 2.4 million ‘documents’
Impact:
• Understand gene relationships
• Understand impact of phenotype
• Understand impact of age, sex
• Understand pathway correlations
• Understand gene co-variation
• Map phenotype to cell space
Functional knowledge graph of 20,000 human genes
Knowledge graph integrates:
• Diseases, phenotypes, drugs, anatomy, exposures
• Biological process, molecular function, localization
• Biological Pathways and Biological Functions
Reveal dimensions of variation
• Function of every protein, in the context of all knowledge
• Map protein structure, chemical function, gene expression
• Foundational ‘Google Maps’ Layout for layering on knowledge
Joint Map of Protein Structure, Function, Text
Biomedical Knowledge ↔ Ontologies ↔ Protein Structure ↔ Drugs ↔ Pathways ↔ Disease
Geometric deep learning drug design: heart, cancer, Alz
• Disease areas: Cardiovascular, Metastatic melanoma, Alzheimer’s
• Methods: Large language models, Biological language models, Single-cell AI models, Geometric deep learning
• Reasoning, interactive, user-guided, AI-powered drug design
• Structure-to-Function for proteins+chemistry
• Self-supervised struct. foundation models
With: Brad Pentelute, Marinka Zitnik, Owen Queen, Yepeng Huang, Tianlong Chen, Tom Hartvigsen, Tom Cobley
Literature Description Map of 20k Human Proteins
Navigate 3 million
NY Times articles
• Papers + Grants + Patents + Startups + Offices: Flow of Knowledge, resources
• Dynamics over time: Knowledge evolution; Disciplines emerging, maturing, changing
• 100,000 loans, clustered by description: Context-specific predictive algorithms
• Education: MIT, EdX, AP, High-school, YouTube: Multi-modal learning, interdisciplinary links; Match CVs, job descriptions, skill sets
• Google News, Podcasts, Websites, Wikipedia: Ontology creation and labeling; Auto-link creation, paragraph level
• Collaboration, Productivity, Team Progress: Link projects across team members; Within-meeting live track of productivity
The power of Maps for Physical Space Navigation
Maps give us
Landscape
Landmarks
Anchor points
Street names
Highway names
Simplification
Abstraction
Summarization
Decision making
The road ahead: Systematic understanding in biology + work
• AI as a discovery partner: multi-modal foundation models
• Gain insights previously inaccessible to human scientists
• Build rich intuition on biological + therapeutic space
• AI as “Google Maps” for navigating knowledge space
• Visual search + integration through millions of documents
• Hierarchical interactive knowledge representation + manipulation
Course at a Glance
Challenges in Computational Biology
1 Gene Finding
2 Sequence alignment
3 Database lookup
4 Genome Assembly
5 Regulatory motif discovery
6 Comparative Genomics
7 Evolutionary Theory
8 Gene expression analysis
9 Cluster discovery
10 Gibbs sampling
11 Protein network analysis
12 Metabolic modelling
13 Emerging network properties
[Figure: DNA, RNA transcript, and example aligned sequences TCATGCTAT, TCGTGATAA, TGAGGATAT, TTATCATAT, TTATGATTT]
Aligning and Modeling Genomes
• Foundations vs. frontiers
– Foundations: Classical computational methods / biological topics
– Frontiers: Latest developments, open questions, research areas
– Duality for each: basic problems / fundamental techniques
• Sequence alignment:
– Local/global alignment: infer nucleotide-level evolutionary events
– Database search: scan for regions that may have common ancestry
• Hidden Markov Models
– Hidden Markov Models (HMMs): Central tool in CS
– Decoding, evaluation, parsing, likelihood, scoring
Dynamic Programming Algorithms: Align, HMMs
[Figure, two panels: • Sequence alignment: DP matrix over x1…xM and y1…yN • Hidden Markov Models: matrix Vk(i) of states (1, 2, …) against positions x1…xN]
• DP: Core computational technique
– Pervasive in computer science, and computational biology
– Fully explore exponential search spaces in poly time!
– Greedy algorithms will not work: save sub-problem solutions, then back-track to recover the optimum
– Special requirements: Optimal substructure
– Found in: alignment, HMMs, phylogeny, genetics, pop gen…
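As one concrete instance of the DP recipe above (table of sub-solutions, optimal substructure, back-tracking), here is a minimal sketch of global alignment in the Needleman-Wunsch style. The scoring values (match +1, mismatch -1, linear gap -1) are illustrative assumptions, not the scheme used in the course problem sets.

```python
def global_align(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    # Fill: each cell depends only on three already-solved neighbors (optimal substructure).
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Back-track from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return score[m][n], "".join(reversed(ax)), "".join(reversed(ay))

best, top, bottom = global_align("ACGT", "AGT")
```

The table has (M+1)(N+1) cells, so the exponential number of possible alignments is searched in O(MN) time.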
Gene expression analysis and transcripts
• Computational foundations:
– Unsupervised Learning: Expectation Maximization
– Supervised learning: generative/discriminative models
– Read mapping, significance testing, splice graphs
• Biological frontiers:
– PS2: Modeling conservation, GC content, CpG islands
– L6/L7: Genome annotation and parsing
– L8: Gene expression analysis: cluster genes/conditions
– L9: Regulatory motif discovery: EM, Gibbs sampling, info
Natural 1st step: group similar rows/columns
Clustering (expression matrix: Genes × Conditions)
• Similar cell types: reveal common ‘conditions’ (Armstrong, Nature Gen 2002)
• Similarly-behaving groups of genes: reveal common gene behaviors (Alizadeh, Nature 2000)
If labels are known: find more of same type
Classification
• Classify diseases: find features that distinguish known classes (Armstrong, Nature Gen 2002)
• Classify genes in different pathways: find additional members of existing gene classes, predict function of uncharacterized genes (Alizadeh, Nature 2000)
Epigenomics and gene regulation
• Computational Foundations
– Hidden Markov Models (HMMs): Central tool in CS
– Decoding, evaluation, parsing, likelihood, scoring
– Unsupervised Learning: Expectation Maximization
– Supervised learning: generative/discriminative models
• Biological frontiers:
– PS2: Modeling conservation, GC content, CpG islands
– L6/L7: Genome annotation and parsing
– L8: Gene expression analysis: cluster genes/conditions
– L9: Regulatory motif discovery: EM, Gibbs sampling, info
Motifs summarize TF sequence specificity
• Summarize
information
• Integrate many
positions
• Measure of
information
• Distinguish motif
vs. motif instance
• Assumptions:
– Independence
– Fixed spacing
Starting positions ↔ Motif matrix
• Given aligned sequences (shared motif sequence positions): easy to compute profile matrix

     1    2    3    4    5    6    7    8
A   0.1  0.3  0.1  0.2  0.2  0.4  0.3  0.1
C   0.5  0.2  0.1  0.1  0.6  0.1  0.2  0.7
G   0.2  0.2  0.6  0.5  0.1  0.2  0.2  0.1
T   0.2  0.3  0.2  0.2  0.1  0.3  0.3  0.1

• Given profile matrix: easy to find starting position probabilities
Key idea: Iterative procedure for estimating both, given uncertainty
(learning problem with hidden variables: the starting positions)
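Both directions of the iteration can be sketched with the profile matrix from the slide. This is a simplified, EM-style illustration that assumes exactly one motif occurrence per sequence, a uniform background of 0.25 per base, and a pseudocount of 1; the example sequence is made up, with the profile's consensus CAGGCAAC planted at offset 4.

```python
import numpy as np

BASES = "ACGT"
# Profile matrix from the slide: rows A, C, G, T; columns are motif positions 1..8.
PROFILE = np.array([
    [0.1, 0.3, 0.1, 0.2, 0.2, 0.4, 0.3, 0.1],
    [0.5, 0.2, 0.1, 0.1, 0.6, 0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6, 0.5, 0.1, 0.2, 0.2, 0.1],
    [0.2, 0.3, 0.2, 0.2, 0.1, 0.3, 0.3, 0.1],
])

def start_posteriors(seq, profile, background=0.25):
    """Given the profile: probability that the motif starts at each offset."""
    width = profile.shape[1]
    ratios = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        ratios.append(np.prod([profile[BASES.index(b), k] / background
                               for k, b in enumerate(window)]))
    ratios = np.array(ratios)
    return ratios / ratios.sum()

def update_profile(seqs, posteriors, width, pseudocount=1.0):
    """Given (soft) starting positions: re-estimate the profile from weighted counts."""
    counts = np.full((4, width), pseudocount)
    for seq, post in zip(seqs, posteriors):
        for start, p in enumerate(post):
            for k in range(width):
                counts[BASES.index(seq[start + k]), k] += p
    return counts / counts.sum(axis=0, keepdims=True)

seq = "TTTTCAGGCAACTTTT"          # hypothetical sequence
post = start_posteriors(seq, PROFILE)
new_profile = update_profile([seq], [post], width=8)
```

Alternating the two functions until the profile stops changing gives the iterative procedure; Gibbs sampling replaces the soft posteriors with a sampled starting position.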
Multivariate HMM for Chromatin States
[Figure: a genomic region (DNA) spanning an Enhancer, Transcription Start Site, and Transcribed Region, divided into 200bp intervals]
• Observed chromatin marks (K4me1, K4me3, K27ac, K36me3), called based on a Poisson distribution
• Most likely hidden state path: 1 2 3 4 6 6 6 6 6 5 5 5
• Each state has its own high-probability chromatin marks (emission probabilities of 0.7 to 0.9 in the example)
• All probabilities are learned from the data
Ernst and Kellis, Nature Biotech 2010
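The "most likely hidden state" row on this slide is a Viterbi decoding. Below is a minimal sketch for a multivariate HMM in which each interval carries a binary present/absent call per mark, and marks are assumed independent given the state. The two states, two marks, and all probabilities are hypothetical toy values, not parameters learned from data.

```python
import numpy as np

def viterbi(obs, start, trans, emit):
    """obs: (T, M) binary mark calls; emit: (K, M) P(mark present | state)."""
    # Log-probability of each interval's marks under each state (independent Bernoullis).
    log_e = obs @ np.log(emit).T + (1 - obs) @ np.log(1 - emit).T   # shape (T, K)
    n_steps, n_states = log_e.shape
    v = np.log(start) + log_e[0]
    back = np.zeros((n_steps, n_states), dtype=int)
    for t in range(1, n_steps):
        cand = v[:, None] + np.log(trans)     # cand[i, j]: best path ending in i, then i -> j
        back[t] = cand.argmax(axis=0)
        v = cand.max(axis=0) + log_e[t]
    path = [int(v.argmax())]
    for t in range(n_steps - 1, 0, -1):       # back-track through stored pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy model: state 0 is promoter-like (K4me3), state 1 is transcribed-like (K36me3).
start = np.array([0.5, 0.5])
trans = np.array([[0.8, 0.2],
                  [0.2, 0.8]])
emit = np.array([[0.9, 0.1],     # columns: K4me3, K36me3
                 [0.1, 0.9]])
obs = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])

states = viterbi(obs, start, trans, emit)
```

Learning the transition and emission probabilities themselves is an unsupervised problem, typically solved with Expectation Maximization.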
Evolution/phylogeny/populations
• Phylogenetics / Phylogenomics
– Phylogenetics: Evolutionary models, Tree building, Phylo inference
– Phylogenomics: gene/species trees, reconciliation, coalescent, pops
• Population genomics:
– Learning population history from genetic data (David Reich)
– Statistical genetics: disease mapping in populations (Mark Daly)
– Measuring natural selection in human populations (Pardis Sabeti)
– The missing heritability in genome-wide associations (Yaniv Erlich)
• And we’re done! Last pset Nov 21st, In-class quiz on Nov 22nd
– No lab 4! Then entire focus shifts to projects, Thanksgiving, Frontiers
Characterizing sub-threshold variants in heart arrhythmia
Focus on sub-threshold variants
(e.g. rs1743292, P = 10^-4.2)
Trait: QRS/QT interval
(1) Large cohorts, (2) many known hits
(3) well-characterized tissue drivers
Protein folding, 3D structure, Chemical Structure, Geometric Deep Learning, GNNs, PLMs
Course at a Glance
Biology primer
Quick introduction to molecular biology
and information transfer within the cell
“Central dogma” of Molecular Biology
DNA
makes
RNA
makes
Protein
DNA: The double helix
• The most noble molecule of our time
DNA: the molecule of heredity
• Self-complementarity sets molecular basis of heredity
– Knowing one strand gives a template for the other
– “It has not escaped our notice that the specific pairing we have postulated immediately
suggests a possible copying mechanism for the genetic material.” Watson & Crick, 1953
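Self-complementarity means one strand fully determines the other. A minimal sketch: because the strands are anti-parallel, the partner strand read 5' to 3' is the reverse complement. The helper name is ours, chosen for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand):
    """Return the partner strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

partner = reverse_complement("ATGC")
```

Applying the function twice returns the original strand, which is the copying mechanism hinted at in the Watson and Crick quote.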
DNA: chemical details
• Bases hidden on the inside
• Phosphate backbone outside
• Weak hydrogen bonds hold the two strands together
• This allows low-energy opening and re-closing of two strands
• Anti-parallel strands
• Extension 5’→3’, tri-phosphate coming from newly added nucleotide
• The only pairings are:
– A with T
– C with G
[Figure: paired nucleotides T–A and C–G on the two strands, with sugar carbons numbered 1’ to 5’]
DNA: the four bases
Purine (A, G) vs. Pyrimidine (C, T)
Weak (A, T) vs. Strong (C, G)
Amino (A, C) vs. Keto (G, T)
“Central dogma” of Molecular Biology
DNA Epigenomics
makes
RNA
makes
Protein
Chromosomes inside the cell
• Eukaryote cell
• Prokaryote cell
DNA packaging
• Why packaging
– DNA is very long
– Cell is very small
• Compression
– Chromosome is 50,000
times shorter than
extended DNA
• Using the DNA
– Before a piece of DNA
is used for anything,
this compact structure
must open locally
• Now emerging:
– Role of accessibility
– State in chromatin itself
– Role of 3D interactions
Diverse epigenetic modifications
Image source: http://nihroadmap.nih.gov/epigenomics/
Diversity of epigenetic modifications
• 100+ different histone modifications
– Histone protein: H3/H4/H2A/H2B
– AA residue: Lysine4 (K4)/K36…
– Chemical modification: Met/Pho/Ubi
– Number: Me-Me-Me (me3)
– Shorthand: H3K4me3, H2BK5ac
• In addition:
– DNA modifications: Methyl-C in CpG / Methyl-Adenosine
– Nucleosome positioning
– DNA accessibility
• The constant struggle of gene regulation
– TF/histone/nucleo/GFs/Chrom compete
[Figure: DNA wrapped around histone proteins; modifications on histone tails]
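The shorthand packs four fields into one name: histone protein, residue, position, and modification with its count. A small parser makes that explicit. The regular expression and the modification codes (`me`, `ac`, `ph`, `ub`) are our assumptions about common usage, and cover only simple names such as those on these slides.

```python
import re

# Assumed pattern: histone, one-letter residue, position, modification, optional count.
MARK = re.compile(r"^(H2A|H2B|H3|H4)([A-Z])(\d+)(me|ac|ph|ub)(\d?)$")

def parse_mark(name):
    """Split a histone-mark shorthand such as 'H3K4me3' into its fields."""
    m = MARK.match(name)
    if m is None:
        raise ValueError(f"unrecognized mark: {name}")
    histone, residue, position, modification, count = m.groups()
    return {
        "histone": histone,
        "residue": residue,
        "position": int(position),
        "modification": modification,
        "count": int(count) if count else 1,
    }

k4me3 = parse_mark("H3K4me3")
k5ac = parse_mark("H2BK5ac")
```

Read this way, H3K4me3 is tri-methylation of lysine 4 on histone H3, and H2BK5ac is acetylation of lysine 5 on histone H2B.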
Epigenomics Roadmap across 100+ tissues/cell types
Diverse epigenomic assays:
1. Histone modifications
• H3K4me3, H3K4me1
• H3K36me3
• H3K27me3, H3K9me3
• H3K27/9ac, +20 more
2. Open chromatin:
• DNase
3. DNA methylation:
• WGBS, RRBS, MRE/MeDIP
4. Gene expression
• RNA-seq, Exon Arrays
Diverse tissues and cells:
1. Adult tissues and cells (brain, muscle, heart, digestive, skin, adipose, lung, blood…)
2. Fetal tissues (brain, skeletal muscle, heart, digestive, lung, cord blood…)
3. ES cells, iPS, differentiated cells (meso/endo/ectoderm, neural, mesench, trophobl)
Art: Rae Senarighi, Richard Sandstrom
Deep sampling of 9 reference epigenomes (e.g. IMR90)
UWash Epigenome Browser, Ting Wang
Chromatin state+RNA+DNAse+28 histone marks+WGBS+Hi-C
Diverse chromatin signatures encode epigenomic state
Enhancers     Promoters     Transcribed     Repressed
• H3K4me1     • H3K4me3     • H3K36me3      • H3K9me3
• H3K27ac     • H3K9ac      • H3K79me2      • H3K27me3
• DNase       • DNase       • H4K20me1      • DNAmethyl
• H3K4me3
• H3K4me1
• H3K27ac
• H3K36me3
• H4K20me1
• H3K79me3
• H3K27me3
• H3K9me3
• H3K9ac
• H3K18ac
• 100s of known modifications, many new still emerging
• Systematic mapping using ChIP-, Bisulfite-, DNase-Seq
Chromatin state annotations across 127 epigenomes
Reveal epigenomic variability: enh/prom/tx/repr/het
Anshul Kundaje
“Central dogma” of Molecular Biology
DNA
makes
RNA
makes
Protein
Genes control the making of cell parts
• The gene is a fundamental unit of inheritance
– Each DNA molecule: 10,000+ genes
– 1 gene → 1 functional element (one “part” of cell machinery)
– Every time a “part” is made, the corresponding gene is:
• Copied into mRNA, transported, used as blueprint to make protein
• RNA is a temporary copy
– The medium for transporting genetic information from the
DNA information repository to the protein-making machinery
is an RNA molecule
– The more parts are needed, the more copies are made
– Each mRNA only lasts a limited time before degradation
mRNA: The messenger
• Information changes medium
– single strand vs. double strand
– ribose vs. deoxyribose sugar
DNA: A T T A C G G T A C C G T
RNA: U A A U G C C A U G G C A
– Compatible base-pairing in DNA-RNA hybrid
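The pairing shown above (each DNA base of the template paired with its RNA partner, with U in place of T) can be reproduced directly. A minimal sketch: the output is written in pairing order, position by position, as on the slide; because the strands are anti-parallel, reading the RNA 5' to 3' would mean reversing it.

```python
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template):
    """Pair each template DNA base with its complementary RNA base."""
    return "".join(DNA_TO_RNA[base] for base in template)

rna = transcribe("ATTACGGTACCGT")
```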
From DNA to RNA: Transcription
From pre-mRNA to mRNA: Splicing
• In Eukaryotes, not every part of a gene is coding
– Functional exons interrupted by non-translated introns
– During pre-mRNA maturation, introns are spliced out
– In humans, primary transcript can be 10^6 bp long
– Alternative splicing can yield different exon subsets for the same gene,
and hence different protein products
RNA can be functional
• Single Strand allows complex structure
– Self-complementary regions form helical stems
– Three-dimensional structure allows functionality of RNA
• Four types of RNA
– mRNA: messenger of genetic information
– tRNA: codon-to-amino acid specificity
– rRNA: core of the ribosome
– snRNA: splicing reactions
• To be continued…
– We’ll learn more in a dedicated lecture on RNA world
– Once upon a time, before DNA and protein, RNA did all
RNA structure: secondary and tertiary
Splicing machinery made of RNA
“Central dogma” of Molecular Biology
DNA
makes
RNA
makes
Protein
Proteins carry out the cell’s chemistry
• More complex polymer
– Nucleic Acids have 4 building blocks
– Proteins have 20. Greater versatility
– Each amino acid has specific properties
• Sequence → Structure → Function
– The amino acid sequence determines the
three-dimensional fold of protein
– The protein’s function largely depends on
the features of the 3D structure
• Proteins play diverse roles
– Catalysis, binding, cell structure, signaling,
transport, metabolism
Protein structure
• Helix-turn-helix: Common motif for DNA-binding proteins that often play a regulatory role as transcription factors (mRNA level)
• Beta-barrel: Some antiparallel β-sheet domains are better described as β-barrels rather than β-sandwiches, for example streptavidin and porin. Note that some structures are intermediate between the extreme barrel and sandwich arrangements.
• Alpha-beta horseshoe: This placental ribonuclease inhibitor is a cytosolic protein that binds extremely strongly to any ribonuclease that may leak into the cytosol. 17-stranded parallel β-sheet curved into an open horseshoe shape, with 16 α-helices packed against the outer surface. It doesn't form a barrel although it looks as though it should. The strands are only very slightly slanted, being nearly parallel to the central ‘axis’.
Protein building blocks
• Amino Acids
From RNA to protein: Translation
• tRNA
• Ribosome
The Genetic Code
Use evolutionary and compositional properties
to computationally discover protein-coding genes
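The genetic code is a lookup table from 64 codons to 20 amino acids plus stop, and reading an open reading frame is the simplest compositional signal for a protein-coding gene. The sketch below builds the standard code from its conventional U, C, A, G ordering and translates from the first AUG to the first stop codon. It ignores real-world complications such as introns, alternative start sites, and the reverse strand, and the example mRNA is made up.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order (UUU, UUC, UUA, UUG, UCU, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna):
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":          # stop codon: UAA, UAG, or UGA
            break
        protein.append(aa)
    return "".join(protein)

peptide = translate("GGAUGGCCUGGUAAGG")   # hypothetical mRNA: AUG GCC UGG UAA
```

Because the code is degenerate, synonymous changes (often at the third codon position) leave the protein unchanged, which is one of the evolutionary signatures used to recognize coding regions.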
Summary: The Central Dogma
DNA makes RNA makes Protein
Inheritance
Messages
Reactions
Cellular dynamics and regulation
How cells move through this Central Dogma
DNA
makes
Gene regulation RNA
makes
Protein
Animal/Human gene regulation:
One genome → Many cell types
ACCAGTTACGACGGTCA
GGGTACTGATACCCCAA
ACCGTTGACCGCATTTA
CAGACGGGGTTTGGGTT
TTGCCCCACACAGGTAC
GTTAGCTACTGGTTTAG
CAATTTACCGTTACAAC
GTTTACAGGGTTACGGT
TGGGATTTGAAAAAAAG
TTTGAGTTGGTTTTTTC
ACGGTAGAACGTACCGT
TACCAGTA
Image Source wikipedia
Eukaryotic Gene Regulation
Diverse roles for regulatory non-coding RNAs
• Small RNA pathways (18-21 nt)
– microRNAs:
• Repress genes by targeting their 3’UTRs by complementarity
• Double-stranded RNA is then recognized and degraded
• Recently found to also target promoter regions in rare cases
– piwiRNAs
• Target and repress transposable elements in germline
– snoRNAs
– 21U-RNAs
• Long non-coding RNAs (1000s nt, many exons)
– Scaffolds for protein/TF binding
– Scaffolds for 3D structure of RNA
Regulation of Gene Expression
• Upstream of genes are promoter regions
• Contain promoter sequences or motifs
• Transcription factors (TFs) bind to motifs
• TFs recruit RNA polymerase
• Gene transcription
[Figure labels: Transcription Factor, Polymerase, Promoter, Transcription Factor Binding Site, mRNA]
Examples:
Predicted motif drivers
of enhancer modules
• Activator and
repressor motifs
consistent with
tissues
Pouya Kheradpour
Network components reveal functional modules
• Feed-forward loops in developmental patterning
• Cooperation of master reg. & downstream reg.
Zeitlinger et al, Genes & Development 2007
Systematic motif dissection in 2000 enhancers:
5 activators and 2 repressors in 2 cell lines
54000+ measurements (x2 cells, 2x repl)
Kheradpour et al Genome Research 2013
Emerging properties of regulatory networks
• Hierarchical levels of regulatory control
– Small number of backward-pointing edges
• Specific / distinct feedback by microRNAs at each level
– Two classes of TFs: miRNA regulators and miR-regulated
From Systems Biology to Synthetic Biology
• Components with known properties
• Assemble based on engineering goals / principles
• Implement within engineered cells and organisms
• Study behavior & adjust as needed
[Figure panels: Synthetic Regulatory Networks (Jim Collins); Synthetic Metabolic Pathways (Jay Keasling)]
Over-expressing a single microRNA leads to a new wing
• Discovery of sense/anti-sense miRNAs
• Regulatory switch selects between two developmental programs
• By over-expressing one strand (miRNAas) the balance is tilted
• Wing program launched vs. haltere
[Figure labels: WT, wing, haltere, wing w/bristles, sensory bristles, sense, antisense. Note: C, D, E same magnification]
Stark et al, Genes & Development 2007
Brief intro to Human Genetics
The role of genetic alterations
DNA
makes
RNA
makes
Protein
Brief intro to human genetics
• Human genome: 3.2B letters, 2 copies, 23 chromosomes,
20k genes, ~3M common SNPs, ~500k haplotype blocks
The power and challenge of disease-association studies
Slide credit: Luke Ward, Mark Daly
• Large associated blocks with many variants: Fine-mapping challenge
• No information on cell type/mechanism, most variants non-coding
Epigenomic annotations help find relevant cell types / nucleotides
The power of GWAS: reveal new disease genes
IL23R: cytokine receptor on a subset of effector T-cells
rs11209026    A     G
Cases         22    976
Controls      68    932
Chi-sq = 24.5, p = 7.3 × 10^-7
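The statistic on this slide can be reproduced from the 2×2 allele-count table with a Pearson chi-square test on one degree of freedom. A minimal sketch using only the standard library: no continuity correction is applied, which is our assumption, and it is the choice that matches the quoted 24.5.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 d.o.f., no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# rs11209026 counts from the slide: cases (A=22, G=976), controls (A=68, G=932).
chi2, p_value = chi_square_2x2(22, 976, 68, 932)
```

A genome-wide scan repeats this test at every variant, which is why the significance threshold must be far stricter than 0.05.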
Genomewide association in schizophrenia
with 40,000 cases
More than 100 distinct regions of the genome associated with schizophrenia!!!
Stephan Ripke
Interpreting non-coding variants
• Disease-associated SNPs enriched for enhancers in relevant cell types
• E.g. lupus SNP in GM enhancer disrupts Ets1 predicted activator
Mechanistic predictions for top disease-associated SNPs
Lupus erythematosus in GM lymphoblastoid cells:
• Disrupt activator Ets-1 motif
• Loss of GM-specific activation
• Loss of enhancer function
• Loss of HLA-DRB1 expression
Erythrocyte phenotypes in K562 leukemia cells:
• Creation of repressor Gfi1 motif
• Gain K562-specific repression
• Loss of enhancer function
• Loss of CCDC162 expression
Characterizing sub-threshold variants in heart arrhythmia
Focus on sub-threshold variants
(e.g. rs1743292, P = 10^-4.2)
Trait: QRS/QT interval
(1) Large cohorts, (2) many known hits
(3) well-characterized tissue drivers
GWAS hits in enhancers of relevant cell types
Linking traits to their relevant cell/tissue types
ES
Liver
Brain
Digestive
Heart
T cells B cells
Methylation differences: a causal component of AD
• Methylation probes altered in AD are enriched in AD-associated SNPs
• AD predictive power reduced after removing meQTL effect
• Set-wise causality testing
[Figure: alternative causal models relating G, M, and D]
Uncovering the molecular basis of top obesity gene
Lean vs. Obese
• ARID5B KD (obesity); ARID5B OE (anti-obesity)
• IRX3, IRX5 knock-down (anti-obesity phenotypes); IRX3, IRX5 overexpression (pro-obesity phenotypes)
• C-to-T motif rescue (anti-obesity phenotypes); T-to-C motif disruption (pro-obesity phenotypes)
Model: beige ↔ white adipocyte development
Shift therapeutic focus from brain to adipocytes
Course at a Glance