WO2024249933A1 - Identification of marker sequences for rapid classification of microbial pathogens through machine learning analysis of genome assemblies
- Publication number
- WO2024249933A1 (PCT/US2024/032104)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- genes
- kmers
- identify
- machine learning
- genomes
Concepts
- 239000003550 marker Substances 0.000 title claims abstract description 85
- 238000010801 machine learning Methods 0.000 title claims abstract description 34
- 238000000429 assembly Methods 0.000 title claims abstract description 24
- 230000000712 assembly Effects 0.000 title claims abstract description 24
- 244000000010 microbial pathogen Species 0.000 title claims description 19
- 238000004458 analytical method Methods 0.000 title abstract description 10
- 238000000034 method Methods 0.000 claims abstract description 97
- 230000000813 microbial effect Effects 0.000 claims abstract description 23
- 108090000623 proteins and genes Proteins 0.000 claims description 137
- 244000052769 pathogen Species 0.000 claims description 29
- 108091034117 Oligonucleotide Proteins 0.000 claims description 26
- 230000001717 pathogenic effect Effects 0.000 claims description 22
- 125000003729 nucleotide group Chemical group 0.000 claims description 20
- 238000012549 training Methods 0.000 claims description 18
- 239000002773 nucleotide Substances 0.000 claims description 17
- 102000004169 proteins and genes Human genes 0.000 claims description 15
- 108700026244 Open Reading Frames Proteins 0.000 claims description 13
- 241000894006 Bacteria Species 0.000 claims description 11
- 238000001914 filtration Methods 0.000 claims description 9
- 241000233866 Fungi Species 0.000 claims description 7
- 241000700605 Viruses Species 0.000 claims description 6
- 230000000295 complement effect Effects 0.000 claims description 6
- 108091028043 Nucleic acid sequence Proteins 0.000 claims description 5
- 244000052616 bacterial pathogen Species 0.000 claims description 4
- 230000001018 virulence Effects 0.000 claims description 3
- 206010059866 Drug resistance Diseases 0.000 claims description 2
- 244000052613 viral pathogen Species 0.000 claims description 2
- 239000013615 primer Substances 0.000 description 76
- MYSWGUAQZAJSOK-UHFFFAOYSA-N ciprofloxacin Chemical compound C12=CC(N3CCNCC3)=C(F)C=C2C(=O)C(C(=O)O)=CN1C1CC1 MYSWGUAQZAJSOK-UHFFFAOYSA-N 0.000 description 46
- 241000588724 Escherichia coli Species 0.000 description 24
- 229960003405 ciprofloxacin Drugs 0.000 description 23
- 238000004891 communication Methods 0.000 description 22
- 230000015654 memory Effects 0.000 description 22
- 238000010200 validation analysis Methods 0.000 description 19
- 230000002068 genetic effect Effects 0.000 description 15
- 238000013459 approach Methods 0.000 description 12
- 102000039446 nucleic acids Human genes 0.000 description 11
- 108020004707 nucleic acids Proteins 0.000 description 11
- 150000007523 nucleic acids Chemical class 0.000 description 11
- 230000005291 magnetic effect Effects 0.000 description 10
- 230000003287 optical effect Effects 0.000 description 10
- 239000000523 sample Substances 0.000 description 10
- 238000012360 testing method Methods 0.000 description 10
- 108091033319 polynucleotide Proteins 0.000 description 9
- 102000040430 polynucleotide Human genes 0.000 description 9
- 239000002157 polynucleotide Substances 0.000 description 9
- 238000012545 processing Methods 0.000 description 9
- 108010041052 DNA Topoisomerase IV Proteins 0.000 description 8
- 238000004590 computer program Methods 0.000 description 8
- 208000015181 infectious disease Diseases 0.000 description 7
- 230000008569 process Effects 0.000 description 7
- 150000001413 amino acids Chemical class 0.000 description 6
- 230000000845 anti-microbial effect Effects 0.000 description 6
- 230000005540 biological transmission Effects 0.000 description 6
- 230000035772 mutation Effects 0.000 description 6
- 241000894007 species Species 0.000 description 6
- 241000193830 Bacillus <bacterium> Species 0.000 description 5
- 108020004414 DNA Proteins 0.000 description 5
- 101100266926 Escherichia coli (strain K12) ydiJ gene Proteins 0.000 description 5
- 230000003321 amplification Effects 0.000 description 5
- 230000001580 bacterial effect Effects 0.000 description 5
- 238000004422 calculation algorithm Methods 0.000 description 5
- 238000006243 chemical reaction Methods 0.000 description 5
- 238000011109 contamination Methods 0.000 description 5
- 230000006870 function Effects 0.000 description 5
- 101150070420 gyrA gene Proteins 0.000 description 5
- 230000007246 mechanism Effects 0.000 description 5
- 238000003199 nucleic acid amplification method Methods 0.000 description 5
- 230000003068 static effect Effects 0.000 description 5
- 239000000758 substrate Substances 0.000 description 5
- 241001148471 unidentified anaerobic bacterium Species 0.000 description 5
- 101100536415 Bacillus subtilis (strain 168) tatC2 gene Proteins 0.000 description 4
- 102000004190 Enzymes Human genes 0.000 description 4
- 108090000790 Enzymes Proteins 0.000 description 4
- JLCPHMBAVCMARE-UHFFFAOYSA-N [3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-hydroxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methyl [5-(6-aminopurin-9-yl)-2-(hydroxymethyl)oxolan-3-yl] hydrogen phosphate Polymers 
Cc1cn(C2CC(OP(O)(=O)OCC3OC(CC3OP(O)(=O)OCC3OC(CC3O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c3nc(N)[nH]c4=O)C(COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3CO)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cc(C)c(=O)[nH]c3=O)n3cc(C)c(=O)[nH]c3=O)n3ccc(N)nc3=O)n3cc(C)c(=O)[nH]c3=O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)O2)c(=O)[nH]c1=O JLCPHMBAVCMARE-UHFFFAOYSA-N 0.000 description 4
- 238000001514 detection method Methods 0.000 description 4
- 238000010586 diagram Methods 0.000 description 4
- 230000007613 environmental effect Effects 0.000 description 4
- 230000000670 limiting effect Effects 0.000 description 4
- 230000007170 pathology Effects 0.000 description 4
- 239000004055 small Interfering RNA Substances 0.000 description 4
- 241000203069 Archaea Species 0.000 description 3
- 208000035143 Bacterial infection Diseases 0.000 description 3
- 108010078791 Carrier Proteins Proteins 0.000 description 3
- 241000606161 Chlamydia Species 0.000 description 3
- 241000588722 Escherichia Species 0.000 description 3
- 241000124008 Mammalia Species 0.000 description 3
- 108091005804 Peptidases Proteins 0.000 description 3
- 101710199192 Uncharacterized membrane protein Proteins 0.000 description 3
- 238000000137 annealing Methods 0.000 description 3
- 238000003556 assay Methods 0.000 description 3
- 208000022362 bacterial infectious disease Diseases 0.000 description 3
- 102000023732 binding proteins Human genes 0.000 description 3
- 108091008324 binding proteins Proteins 0.000 description 3
- 230000015572 biosynthetic process Effects 0.000 description 3
- 102200089577 c.239G>T Human genes 0.000 description 3
- 230000001413 cellular effect Effects 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 231100000655 enterotoxin Toxicity 0.000 description 3
- 229940023064 escherichia coli Drugs 0.000 description 3
- 230000006872 improvement Effects 0.000 description 3
- 238000000126 in silico method Methods 0.000 description 3
- 239000000463 material Substances 0.000 description 3
- 239000011159 matrix material Substances 0.000 description 3
- 244000005700 microbiome Species 0.000 description 3
- 229920000642 polymer Polymers 0.000 description 3
- 238000003786 synthesis reaction Methods 0.000 description 3
- 230000002103 transcriptional effect Effects 0.000 description 3
- 238000011277 treatment modality Methods 0.000 description 3
- 102000005416 ATP-Binding Cassette Transporters Human genes 0.000 description 2
- 108010006533 ATP-Binding Cassette Transporters Proteins 0.000 description 2
- 241000589968 Borrelia Species 0.000 description 2
- 241000589562 Brucella Species 0.000 description 2
- 241000589876 Campylobacter Species 0.000 description 2
- 241000193403 Clostridium Species 0.000 description 2
- 208000035473 Communicable disease Diseases 0.000 description 2
- 108091035707 Consensus sequence Proteins 0.000 description 2
- 101710186981 DNA gyrase subunit A Proteins 0.000 description 2
- 239000003155 DNA primer Substances 0.000 description 2
- 230000004568 DNA-binding Effects 0.000 description 2
- 102000016928 DNA-directed DNA polymerase Human genes 0.000 description 2
- 108010014303 DNA-directed DNA polymerase Proteins 0.000 description 2
- 241000588914 Enterobacter Species 0.000 description 2
- 241000194033 Enterococcus Species 0.000 description 2
- 241000194032 Enterococcus faecalis Species 0.000 description 2
- 101100214816 Escherichia coli (strain K12) abgB gene Proteins 0.000 description 2
- 101100213781 Escherichia coli (strain K12) yldA gene Proteins 0.000 description 2
- 241000710831 Flavivirus Species 0.000 description 2
- DHMQDGOQFOQNFH-UHFFFAOYSA-N Glycine Chemical compound NCC(O)=O DHMQDGOQFOQNFH-UHFFFAOYSA-N 0.000 description 2
- 241000590002 Helicobacter pylori Species 0.000 description 2
- 241000282412 Homo Species 0.000 description 2
- 241000589248 Legionella Species 0.000 description 2
- 208000007764 Legionnaires' Disease Diseases 0.000 description 2
- 208000016604 Lyme disease Diseases 0.000 description 2
- 102000018697 Membrane Proteins Human genes 0.000 description 2
- 108010052285 Membrane Proteins Proteins 0.000 description 2
- 241001465754 Metazoa Species 0.000 description 2
- 241000186362 Mycobacterium leprae Species 0.000 description 2
- 241000204031 Mycoplasma Species 0.000 description 2
- 102000004316 Oxidoreductases Human genes 0.000 description 2
- 108090000854 Oxidoreductases Proteins 0.000 description 2
- 238000002944 PCR assay Methods 0.000 description 2
- 102000035195 Peptidases Human genes 0.000 description 2
- 241000606701 Rickettsia Species 0.000 description 2
- 108091027967 Small hairpin RNA Proteins 0.000 description 2
- 108020004459 Small interfering RNA Proteins 0.000 description 2
- 241000191967 Staphylococcus aureus Species 0.000 description 2
- 108091036408 Toxin-antitoxin system Proteins 0.000 description 2
- 108020004566 Transfer RNA Proteins 0.000 description 2
- 241000589886 Treponema Species 0.000 description 2
- 101710159648 Uncharacterized protein Proteins 0.000 description 2
- ISAKRJDGNUQOIC-UHFFFAOYSA-N Uracil Chemical compound O=C1C=CNC(=O)N1 ISAKRJDGNUQOIC-UHFFFAOYSA-N 0.000 description 2
- 241000607598 Vibrio Species 0.000 description 2
- 241000607734 Yersinia <bacteria> Species 0.000 description 2
- 230000001154 acute effect Effects 0.000 description 2
- 241001148470 aerobic bacillus Species 0.000 description 2
- 230000001147 anti-toxic effect Effects 0.000 description 2
- 210000004027 cell Anatomy 0.000 description 2
- 239000003153 chemical reaction reagent Substances 0.000 description 2
- OPTASPLRGRRNAP-UHFFFAOYSA-N cytosine Chemical compound NC=1C=CNC(=O)N=1 OPTASPLRGRRNAP-UHFFFAOYSA-N 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 201000010099 disease Diseases 0.000 description 2
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 description 2
- 229940079593 drug Drugs 0.000 description 2
- 239000003814 drug Substances 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 230000002538 fungal effect Effects 0.000 description 2
- 102000054767 gene variant Human genes 0.000 description 2
- UYTPUPDQBNUYGX-UHFFFAOYSA-N guanine Chemical compound O=C1NC(N)=NC2=C1N=CN2 UYTPUPDQBNUYGX-UHFFFAOYSA-N 0.000 description 2
- 239000012678 infectious agent Substances 0.000 description 2
- 239000006166 lysate Substances 0.000 description 2
- 108020004999 messenger RNA Proteins 0.000 description 2
- MYWUZJCMWCOHBA-VIFPVBQESA-N methamphetamine Chemical compound CN[C@@H](C)CC1=CC=CC=C1 MYWUZJCMWCOHBA-VIFPVBQESA-N 0.000 description 2
- 108091070501 miRNA Proteins 0.000 description 2
- 239000002679 microRNA Substances 0.000 description 2
- 239000000203 mixture Substances 0.000 description 2
- 239000002777 nucleoside Substances 0.000 description 2
- 150000004713 phosphodiesters Chemical class 0.000 description 2
- 108090000765 processed proteins & peptides Proteins 0.000 description 2
- 235000019833 protease Nutrition 0.000 description 2
- 239000013074 reference sample Substances 0.000 description 2
- 108020004418 ribosomal RNA Proteins 0.000 description 2
- 238000012216 screening Methods 0.000 description 2
- 230000008685 targeting Effects 0.000 description 2
- RWQNBRDOKXIBIV-UHFFFAOYSA-N thymine Chemical compound CC1=CNC(=O)NC1=O RWQNBRDOKXIBIV-UHFFFAOYSA-N 0.000 description 2
- 235000011178 triphosphate Nutrition 0.000 description 2
- 239000001226 triphosphate Substances 0.000 description 2
- 238000012070 whole genome sequencing analysis Methods 0.000 description 2
- 108091032973 (ribonucleotides)n+m Proteins 0.000 description 1
- 108020004465 16S ribosomal RNA Proteins 0.000 description 1
- 241000589291 Acinetobacter Species 0.000 description 1
- 241000588626 Acinetobacter baumannii Species 0.000 description 1
- 241000186361 Actinobacteria <class> Species 0.000 description 1
- 241000186046 Actinomyces Species 0.000 description 1
- 241000186041 Actinomyces israelii Species 0.000 description 1
- 241000251468 Actinopterygii Species 0.000 description 1
- 229930024421 Adenine Natural products 0.000 description 1
- GFFGJBXGBJISGV-UHFFFAOYSA-N Adenine Chemical compound NC1=NC=NC2=C1N=CN2 GFFGJBXGBJISGV-UHFFFAOYSA-N 0.000 description 1
- 108700028369 Alleles Proteins 0.000 description 1
- 241000710929 Alphavirus Species 0.000 description 1
- 108091093088 Amplicon Proteins 0.000 description 1
- 101001062362 Arabidopsis thaliana Berberine bridge enzyme-like 3 Proteins 0.000 description 1
- 241000271566 Aves Species 0.000 description 1
- 241000193738 Bacillus anthracis Species 0.000 description 1
- 101100344619 Bacillus subtilis (strain 168) mccA gene Proteins 0.000 description 1
- 241000606660 Bartonella Species 0.000 description 1
- 241000588807 Bordetella Species 0.000 description 1
- 241000588832 Bordetella pertussis Species 0.000 description 1
- 241000589969 Borreliella burgdorferi Species 0.000 description 1
- 241000589567 Brucella abortus Species 0.000 description 1
- 241001509299 Brucella canis Species 0.000 description 1
- 241001148106 Brucella melitensis Species 0.000 description 1
- 241001148111 Brucella suis Species 0.000 description 1
- 241000589875 Campylobacter jejuni Species 0.000 description 1
- 108090000994 Catalytic RNA Proteins 0.000 description 1
- 102000053642 Catalytic RNA Human genes 0.000 description 1
- 241001647372 Chlamydia pneumoniae Species 0.000 description 1
- 241001647378 Chlamydia psittaci Species 0.000 description 1
- 241000606153 Chlamydia trachomatis Species 0.000 description 1
- 241000193163 Clostridioides difficile Species 0.000 description 1
- 241000193155 Clostridium botulinum Species 0.000 description 1
- 241000193468 Clostridium perfringens Species 0.000 description 1
- 241000193449 Clostridium tetani Species 0.000 description 1
- 108091026890 Coding region Proteins 0.000 description 1
- RYGMFSIKBFXOCR-UHFFFAOYSA-N Copper Chemical compound [Cu] RYGMFSIKBFXOCR-UHFFFAOYSA-N 0.000 description 1
- 241000186216 Corynebacterium Species 0.000 description 1
- 241000186227 Corynebacterium diphtheriae Species 0.000 description 1
- 241000938605 Crocodylia Species 0.000 description 1
- 241000195493 Cryptophyta Species 0.000 description 1
- 102000053602 DNA Human genes 0.000 description 1
- 108090000626 DNA-directed RNA polymerases Proteins 0.000 description 1
- 102000004163 DNA-directed RNA polymerases Human genes 0.000 description 1
- 241000196324 Embryophyta Species 0.000 description 1
- 241000588697 Enterobacter cloacae Species 0.000 description 1
- 241000194031 Enterococcus faecium Species 0.000 description 1
- 241000186811 Erysipelothrix Species 0.000 description 1
- 241000186810 Erysipelothrix rhusiopathiae Species 0.000 description 1
- 101100213486 Escherichia coli (strain K12) yiaM gene Proteins 0.000 description 1
- 101100544388 Escherichia coli (strain K12) yjaA gene Proteins 0.000 description 1
- 101100053600 Escherichia coli (strain K12) yrhA gene Proteins 0.000 description 1
- 108700024394 Exon Proteins 0.000 description 1
- 241000589601 Francisella Species 0.000 description 1
- 241000589602 Francisella tularensis Species 0.000 description 1
- 102000003793 Fructokinases Human genes 0.000 description 1
- 108090000156 Fructokinases Proteins 0.000 description 1
- 241000605986 Fusobacterium nucleatum Species 0.000 description 1
- 239000004471 Glycine Substances 0.000 description 1
- 241000606790 Haemophilus Species 0.000 description 1
- 241000606768 Haemophilus influenzae Species 0.000 description 1
- 241000589989 Helicobacter Species 0.000 description 1
- 108010034145 Helminth Proteins Proteins 0.000 description 1
- 102100034343 Integrase Human genes 0.000 description 1
- 108091092195 Intron Proteins 0.000 description 1
- 241000588748 Klebsiella Species 0.000 description 1
- 241000588915 Klebsiella aerogenes Species 0.000 description 1
- 241000588747 Klebsiella pneumoniae Species 0.000 description 1
- 241000589242 Legionella pneumophila Species 0.000 description 1
- 241000589902 Leptospira Species 0.000 description 1
- 241000589929 Leptospira interrogans Species 0.000 description 1
- 241000186781 Listeria Species 0.000 description 1
- 241000186779 Listeria monocytogenes Species 0.000 description 1
- 241000588771 Morganella <proteobacterium> Species 0.000 description 1
- 241001529936 Murinae Species 0.000 description 1
- 101100463616 Mus musculus Pfkl gene Proteins 0.000 description 1
- 241000186359 Mycobacterium Species 0.000 description 1
- 241001467553 Mycobacterium africanum Species 0.000 description 1
- 241000186367 Mycobacterium avium Species 0.000 description 1
- 241000186366 Mycobacterium bovis Species 0.000 description 1
- 241000187484 Mycobacterium gordonae Species 0.000 description 1
- 241000186363 Mycobacterium kansasii Species 0.000 description 1
- 241000187479 Mycobacterium tuberculosis Species 0.000 description 1
- 241000187917 Mycobacterium ulcerans Species 0.000 description 1
- 241000202934 Mycoplasma pneumoniae Species 0.000 description 1
- BAWFJGJZGIEFAR-NNYOXOHSSA-O NAD(+) Chemical compound NC(=O)C1=CC=C[N+]([C@H]2[C@@H]([C@H](O)[C@@H](COP(O)(=O)OP(O)(=O)OC[C@@H]3[C@H]([C@@H](O)[C@@H](O3)N3C4=NC=NC(N)=C4N=C3)O)O2)O)=C1 BAWFJGJZGIEFAR-NNYOXOHSSA-O 0.000 description 1
- 241000588653 Neisseria Species 0.000 description 1
- 241000588652 Neisseria gonorrhoeae Species 0.000 description 1
- 241000588650 Neisseria meningitidis Species 0.000 description 1
- 238000000636 Northern blotting Methods 0.000 description 1
- 108020004711 Nucleic Acid Probes Proteins 0.000 description 1
- 108020005187 Oligonucleotide Probes Proteins 0.000 description 1
- 241000588912 Pantoea agglomerans Species 0.000 description 1
- 102000009097 Phosphorylases Human genes 0.000 description 1
- 108010073135 Phosphorylases Proteins 0.000 description 1
- 102000029797 Prion Human genes 0.000 description 1
- 108091000054 Prion Proteins 0.000 description 1
- 239000004365 Protease Substances 0.000 description 1
- 241000588769 Proteus <enterobacteria> Species 0.000 description 1
- 241000588768 Providencia Species 0.000 description 1
- 241000589516 Pseudomonas Species 0.000 description 1
- 241000589517 Pseudomonas aeruginosa Species 0.000 description 1
- 108010092799 RNA-directed DNA polymerase Proteins 0.000 description 1
- 238000011529 RT qPCR Methods 0.000 description 1
- 102100037486 Reverse transcriptase/ribonuclease H Human genes 0.000 description 1
- 102000004431 Riboflavin transporter Human genes 0.000 description 1
- 108091028664 Ribonucleotide Proteins 0.000 description 1
- 241000606695 Rickettsia rickettsii Species 0.000 description 1
- 241000607142 Salmonella Species 0.000 description 1
- 241001138501 Salmonella enterica Species 0.000 description 1
- 241000293871 Salmonella enterica subsp. enterica serovar Typhi Species 0.000 description 1
- 241000293869 Salmonella enterica subsp. enterica serovar Typhimurium Species 0.000 description 1
- 241000607720 Serratia Species 0.000 description 1
- 108010017898 Shiga Toxins Proteins 0.000 description 1
- 241000607768 Shigella Species 0.000 description 1
- 241000607760 Shigella sonnei Species 0.000 description 1
- 238000002105 Southern blotting Methods 0.000 description 1
- 241000589970 Spirochaetales Species 0.000 description 1
- 241000191940 Staphylococcus Species 0.000 description 1
- 241000191963 Staphylococcus epidermidis Species 0.000 description 1
- 241001147691 Staphylococcus saprophyticus Species 0.000 description 1
- 241000122971 Stenotrophomonas Species 0.000 description 1
- 241000122973 Stenotrophomonas maltophilia Species 0.000 description 1
- 102000048514 Stomatin Human genes 0.000 description 1
- 108700037714 Stomatin Proteins 0.000 description 1
- 241001478880 Streptobacillus moniliformis Species 0.000 description 1
- 241000194017 Streptococcus Species 0.000 description 1
- 241000193985 Streptococcus agalactiae Species 0.000 description 1
- 241000194049 Streptococcus equinus Species 0.000 description 1
- 241000193998 Streptococcus pneumoniae Species 0.000 description 1
- 241000193996 Streptococcus pyogenes Species 0.000 description 1
- 241001505901 Streptococcus sp. 'group A' Species 0.000 description 1
- 241000193990 Streptococcus sp. 'group B' Species 0.000 description 1
- 241001312524 Streptococcus viridans Species 0.000 description 1
- 101710127774 Stress response protein Proteins 0.000 description 1
- 241000589904 Treponema pallidum subsp. pertenue Species 0.000 description 1
- 241000251539 Vertebrata <Metazoa> Species 0.000 description 1
- 241000607626 Vibrio cholerae Species 0.000 description 1
- 101100071828 Vibrio cholerae serotype O1 (strain ATCC 39315 / El Tor Inaba N16961) hutX gene Proteins 0.000 description 1
- 241000607479 Yersinia pestis Species 0.000 description 1
- SWPYNTWPIAZGLT-UHFFFAOYSA-N [amino(ethoxy)phosphanyl]oxyethane Chemical compound CCOP(N)OCC SWPYNTWPIAZGLT-UHFFFAOYSA-N 0.000 description 1
- 230000021736 acetylation Effects 0.000 description 1
- 238000006640 acetylation reaction Methods 0.000 description 1
- 239000002253 acid Substances 0.000 description 1
- 150000007513 acids Chemical class 0.000 description 1
- 230000009471 action Effects 0.000 description 1
- 229960000643 adenine Drugs 0.000 description 1
- 230000007815 allergy Effects 0.000 description 1
- 239000004599 antimicrobial Substances 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 229940065181 bacillus anthracis Drugs 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 238000003339 best practice Methods 0.000 description 1
- 238000006664 bond formation reaction Methods 0.000 description 1
- 229940056450 brucella abortus Drugs 0.000 description 1
- 229940038698 brucella melitensis Drugs 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 150000001720 carbohydrates Chemical class 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 229940038705 chlamydia trachomatis Drugs 0.000 description 1
- 238000010367 cloning Methods 0.000 description 1
- 239000002299 complementary DNA Substances 0.000 description 1
- 150000001875 compounds Chemical class 0.000 description 1
- 230000021615 conjugation Effects 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- NIUVHXTXUXOFEB-UHFFFAOYSA-J coproporphyrinogen III(4-) Chemical compound C1C(=C(C=2C)CCC([O-])=O)NC=2CC(=C(C=2C)CCC([O-])=O)NC=2CC(N2)=C(CCC([O-])=O)C(C)=C2CC2=C(C)C(CCC([O-])=O)=C1N2 NIUVHXTXUXOFEB-UHFFFAOYSA-J 0.000 description 1
- 230000008878 coupling Effects 0.000 description 1
- 238000010168 coupling process Methods 0.000 description 1
- 238000005859 coupling reaction Methods 0.000 description 1
- 238000002790 cross-validation Methods 0.000 description 1
- 229940104302 cytosine Drugs 0.000 description 1
- 238000003066 decision tree Methods 0.000 description 1
- 238000004925 denaturation Methods 0.000 description 1
- 230000036425 denaturation Effects 0.000 description 1
- 238000001739 density measurement Methods 0.000 description 1
- 239000005549 deoxyribonucleoside Substances 0.000 description 1
- 239000005547 deoxyribonucleotide Substances 0.000 description 1
- 125000002637 deoxyribonucleotide group Chemical group 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- XPPKVPWEQAFLFU-UHFFFAOYSA-J diphosphate(4-) Chemical compound [O-]P([O-])(=O)OP([O-])([O-])=O XPPKVPWEQAFLFU-UHFFFAOYSA-J 0.000 description 1
- 235000011180 diphosphates Nutrition 0.000 description 1
- 229940092559 enterobacter aerogenes Drugs 0.000 description 1
- 229940032049 enterococcus faecalis Drugs 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 239000000835 fiber Substances 0.000 description 1
- 235000013305 food Nutrition 0.000 description 1
- 239000012634 fragment Substances 0.000 description 1
- 229940118764 francisella tularensis Drugs 0.000 description 1
- 230000007274 generation of a signal involved in cell-cell signaling Effects 0.000 description 1
- 229910000078 germane Inorganic materials 0.000 description 1
- 230000013595 glycosylation Effects 0.000 description 1
- 238000006206 glycosylation reaction Methods 0.000 description 1
- 150000003278 haem Chemical class 0.000 description 1
- 229940047650 haemophilus influenzae Drugs 0.000 description 1
- 229940037467 helicobacter pylori Drugs 0.000 description 1
- 244000000013 helminth Species 0.000 description 1
- 229940025294 hemin Drugs 0.000 description 1
- BTIJJDXEELBZFS-QDUVMHSLSA-K hemin Chemical compound CC1=C(CCC(O)=O)C(C=C2C(CCC(O)=O)=C(C)\C(N2[Fe](Cl)N23)=C\4)=N\C1=C/C2=C(C)C(C=C)=C3\C=C/1C(C)=C(C=C)C/4=N\1 BTIJJDXEELBZFS-QDUVMHSLSA-K 0.000 description 1
- 101150020446 hflC gene Proteins 0.000 description 1
- 101150093316 hmuV gene Proteins 0.000 description 1
- 238000000338 in vitro Methods 0.000 description 1
- 230000002458 infectious effect Effects 0.000 description 1
- 239000004615 ingredient Substances 0.000 description 1
- 230000003834 intracellular effect Effects 0.000 description 1
- 238000002372 labelling Methods 0.000 description 1
- 229940115932 legionella pneumophila Drugs 0.000 description 1
- 230000029226 lipidation Effects 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 238000002844 melting Methods 0.000 description 1
- 230000008018 melting Effects 0.000 description 1
- 108010009674 methylaspartate mutase Proteins 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 238000010369 molecular cloning Methods 0.000 description 1
- 230000006855 networking Effects 0.000 description 1
- 239000002853 nucleic acid probe Substances 0.000 description 1
- 125000003835 nucleoside group Chemical group 0.000 description 1
- -1 nucleoside triphosphates Chemical class 0.000 description 1
- 239000002751 oligonucleotide probe Substances 0.000 description 1
- 244000045947 parasite Species 0.000 description 1
- 230000003071 parasitic effect Effects 0.000 description 1
- 239000013610 patient sample Substances 0.000 description 1
- 239000008188 pellet Substances 0.000 description 1
- 239000000816 peptidomimetic Substances 0.000 description 1
- 230000026731 phosphorylation Effects 0.000 description 1
- 238000006366 phosphorylation reaction Methods 0.000 description 1
- 238000013081 phylogenetic analysis Methods 0.000 description 1
- 239000013612 plasmid Substances 0.000 description 1
- 229920001184 polypeptide Polymers 0.000 description 1
- 239000013641 positive control Substances 0.000 description 1
- 239000002987 primer (paints) Substances 0.000 description 1
- 102000004196 processed proteins & peptides Human genes 0.000 description 1
- 102000016670 prohibitin Human genes 0.000 description 1
- 108010028138 prohibitin Proteins 0.000 description 1
- 230000001915 proofreading effect Effects 0.000 description 1
- 230000000644 propagated effect Effects 0.000 description 1
- 235000019419 proteases Nutrition 0.000 description 1
- 238000003908 quality control method Methods 0.000 description 1
- 238000011002 quantification Methods 0.000 description 1
- 238000003762 quantitative reverse transcription PCR Methods 0.000 description 1
- LISFMEBWQUVKPJ-UHFFFAOYSA-N quinolin-2-ol Chemical compound C1=CC=C2NC(=O)C=CC2=C1 LISFMEBWQUVKPJ-UHFFFAOYSA-N 0.000 description 1
- 238000007637 random forest analysis Methods 0.000 description 1
- 238000003753 real-time PCR Methods 0.000 description 1
- 230000002829 reductive effect Effects 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 238000012552 review Methods 0.000 description 1
- 108020001053 riboflavin transporter Proteins 0.000 description 1
- 239000002336 ribonucleotide Substances 0.000 description 1
- 125000002652 ribonucleotide group Chemical group 0.000 description 1
- 108091092562 ribozyme Proteins 0.000 description 1
- 229940075118 rickettsia rickettsii Drugs 0.000 description 1
- 239000004576 sand Substances 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 238000012772 sequence design Methods 0.000 description 1
- 238000012163 sequencing technique Methods 0.000 description 1
- 229940115939 shigella sonnei Drugs 0.000 description 1
- 229910052710 silicon Inorganic materials 0.000 description 1
- 239000010703 silicon Substances 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 229940031000 streptococcus pneumoniae Drugs 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000012706 support-vector machine Methods 0.000 description 1
- 238000001308 synthesis method Methods 0.000 description 1
- 230000002194 synthesizing effect Effects 0.000 description 1
- 229940113082 thymine Drugs 0.000 description 1
- 231100000765 toxin Toxicity 0.000 description 1
- 239000003053 toxin Substances 0.000 description 1
- 108700012359 toxins Proteins 0.000 description 1
- 238000013518 transcription Methods 0.000 description 1
- 230000035897 transcription Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 125000002264 triphosphate group Chemical class [H]OP(=O)(O[H])OP(=O)(O[H])OP(=O)(O[H])O* 0.000 description 1
- 201000008827 tuberculosis Diseases 0.000 description 1
- 229940035893 uracil Drugs 0.000 description 1
- 239000013598 vector Substances 0.000 description 1
- 229940118696 vibrio cholerae Drugs 0.000 description 1
- 230000003612 virological effect Effects 0.000 description 1
- 239000002351 wastewater Substances 0.000 description 1
- XLYOFNOQVPJJNP-UHFFFAOYSA-N water Substances O XLYOFNOQVPJJNP-UHFFFAOYSA-N 0.000 description 1
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6888—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6888—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
- C12Q1/689—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for bacteria
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6888—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
- C12Q1/6895—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for plants, fungi or algae
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/70—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving virus or bacteriophage
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
Definitions
- TECHNICAL FIELD [003] The disclosure provides methods to identify marker sequences for rapid classification of microbial strains through machine learning analysis of genome assemblies. INCORPORATION BY REFERENCE OF SEQUENCE LISTING [004] Accompanying this filing is a Sequence Listing entitled, “00015-421WO1.xml” created on May 31, 2024 and having 56,079 bytes of data, machine formatted on IBM-PC, MS-Windows operating system. The sequence listing is hereby incorporated by reference in its entirety for all purposes. BACKGROUND [005] A challenge in dealing with acute bacterial infections in a clinical setting is to obtain rapid identification of the invading pathogen.
- AMR antimicrobial resistance
- This pangenome compendium of strains is then run through a machine-learning pipeline alongside a classification schema (e.g., pathogenic vs. nonpathogenic strains) to identify genes which can robustly classify microbial strains into the desired schema. These genes are then used to develop primer sequences which can rapidly identify these microbial strains through PCR tests performed in a laboratory.
- a prokaryotic pangenome of interest e.g., a known pathogenic species like E. coli or S. aureus
- a desired classification scheme e.g., pathotypes, anti-microbial resistance (AMR), etc.
- the disclosure presented herein reproduces three different classification schemes: (1) determining the phylogenetic group of Escherichia coli traditionally defined by Clermont Typing, (2) determining the clonal complex of Staphylococcus aureus, and (3) determining resistance against ciprofloxacin for Escherichia coli isolates in the B2 phylogroup.
- the disclosure provides a method to identify a set of oligonucleotide marker sequences for rapid classification of microbial pathogens from a dataset of microbial genome assemblies, comprising carrying out steps (1)-(8): (1) clustering open reading frames by protein sequence to form a pangenome from the dataset of microbial genome assemblies, wherein individual clusters of open reading frames are designated as “genes”; (2) filtering the genes to identify sets of genes that are associated by a targeted classification scheme; (3) training machine learning models to iteratively identify minimal sets of related “genes” that are (a) able to accurately reproduce a targeted classification scheme, and (b) can be accurately identified by a marker sequence, given all observed variants of the gene and against the background of the pangenome, wherein the identified minimal sets of associated genes that meet criteria (a) and (b) are designated as candidate genes; (4) identifying sequence variants from each of the candidate genes, and from which, subsequences having a defined number of nucleotides are enumerated
- the microbial pathogens are selected from bacterial pathogens, viral pathogens, and fungal pathogens.
- the dataset comprises microbial genome assemblies from bacteria, viruses or fungi.
- the microbial genome assemblies contain genomes that have been filtered for quality based on genome status, size, number of contigs, number of genes and consistency.
- the microbial genome assemblies have been annotated to mark open reading frames in the genomes.
- the open reading frames are clustered using a program that clusters and compares protein or nucleotide sequences.
- the targeted classification scheme comprises genes that are related to drug-resistance, virulence, pathotype, or phylogenetically-defined subgroups.
- the genes are filtered for those associated with the classification scheme, wherein association is based upon a pair of classes for which the gene is observed in X% of class 1 genomes and Y% of class 2 genomes, where X > Y + 10.
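The association rule above (a gene observed in X% of class 1 genomes and Y% of class 2 genomes, with X > Y + 10) can be illustrated with a minimal sketch. The function name `associated_genes`, the dictionary layout, and the toy genomes are assumptions made for illustration; they are not taken from the disclosure.

```python
from itertools import permutations


def associated_genes(presence, labels, margin=10.0):
    """Keep genes whose frequency in one class exceeds another class by > margin points.

    presence: gene name -> set of genome ids carrying the gene (hypothetical layout)
    labels:   genome id -> class label
    """
    classes = sorted(set(labels.values()))
    members = {c: [g for g, lab in labels.items() if lab == c] for c in classes}
    kept = []
    for gene, carriers in presence.items():
        # Percentage of genomes in each class that carry this gene.
        pct = {c: 100.0 * sum(g in carriers for g in ids) / len(ids)
               for c, ids in members.items()}
        # X > Y + margin for at least one ordered pair of classes.
        if any(pct[a] > pct[b] + margin for a, b in permutations(classes, 2)):
            kept.append(gene)
    return kept


labels = {"g1": "B2", "g2": "B2", "g3": "A", "g4": "A"}
presence = {
    "geneA": {"g1", "g2"},              # 100% of B2, 0% of A
    "geneB": {"g1", "g2", "g3", "g4"},  # present everywhere, uninformative
    "geneC": {"g1", "g3"},              # 50% in both classes
}
selected = associated_genes(presence, labels)
```

In this toy input only `geneA` passes the filter, since the other two genes occur at the same frequency in both classes.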
- the machine learning models are trained to identify 4-8 genes that most accurately recover the targeted classification scheme.
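To make the idea of a small gene set that recovers a classification concrete, the sketch below scores a gene set by how well its presence/absence patterns separate the classes, then searches for the smallest set reaching a target accuracy. It uses a plain exhaustive search and a majority-vote lookup table as stand-ins; it is not the machine learning or genetic-algorithm procedure of the disclosure, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def pattern_accuracy(genes, presence, labels):
    """Training accuracy of a lookup table: presence/absence pattern -> majority class."""
    by_pattern = defaultdict(Counter)
    for genome, label in labels.items():
        pattern = tuple(genome in presence[g] for g in genes)
        by_pattern[pattern][label] += 1
    # Each pattern predicts its most common class; count the genomes it gets right.
    correct = sum(counts.most_common(1)[0][1] for counts in by_pattern.values())
    return correct / len(labels)


def smallest_gene_set(presence, labels, max_size=8, target=1.0):
    """Return the first gene set (smallest size first) whose accuracy reaches target."""
    for size in range(1, max_size + 1):
        for genes in combinations(sorted(presence), size):
            if pattern_accuracy(genes, presence, labels) >= target:
                return genes
    return None


labels = {"g1": "A", "g2": "A", "g3": "B1", "g4": "B2"}
presence = {
    "x": {"g1", "g2"},
    "y": {"g3"},
    "z": {"g1", "g2", "g3", "g4"},
}
best = smallest_gene_set(presence, labels)
```

Exhaustive search is only feasible for toy inputs; with thousands of candidate genes the number of combinations grows too quickly, which is one reason a heuristic search would be preferred in practice.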
- the disclosure also provides a computer readable medium comprising instructions to cause a computer to identify a set of oligonucleotide marker sequences for rapid classification of microbial pathogens from a dataset of microbial genome assemblies, the computer instructions causing the computer to: (1) cluster open reading frames by protein sequence to form a pangenome from the dataset of microbial genome assemblies, wherein individual clusters of open reading frames are designated as “genes”; (2) filter the genes to identify sets of genes that are associated by a targeted classification scheme; (3) train machine learning model(s) to iteratively identify minimal sets of related “genes” that are (a) able to accurately reproduce a targeted classification scheme, and (b) can be accurately identified by a marker sequence, given all observed variants of the gene and against the background of the pangenome, wherein the identified minimal sets of associated genes that meet criteria (a) and (b) are designated as candidate genes; Attorney docket No.00015-421WO1 (4) identify sequence variants from each of the candidate genes, and from which
- the disclosure also provides a catalog of oligonucleotide marker sequences obtained by the method or a computer running the computer readable medium of the disclosure. [0012] The disclosure also provides a method of identifying a pathogen in a sample, the method comprising comparing sequence reads obtained from the genome of the pathogen in the sample to the catalog of oligonucleotide markers to identify the pathogen with a complementary sequence. DESCRIPTION OF DRAWINGS [0013] Figure 1A-B presents exemplary flowcharts/workflows for identifying a minimal set of marker sequences or “primers” that accurately reproduces a known genome classification scheme using massive public datasets and genetic algorithms.
- FIG. 1 A flowchart demonstrating how a minimal set of marker sequences can be generated from a collection of genome assemblies, with various decisions indicated to improve machine learning model accuracy.
- FIG. 2 Additional workflows to identify a minimal set of marker sequences or “primers” using different intermediate data types.
- Figure 2 displays confusion matrices for classifying E. coli into phylogroups from primer sets of 4, 5, and 6 sequences. For marker sets of size 4 and 5, predictions corresponding to any of the cryptic clades, non-Escherichia “Non-Esc.”, or unknown were not generated.
- Figure 3 displays confusion matrices for classifying E.
- Figure 4 presents interpretation of the 4-sequence primer set for predicting E. coli phylogroup. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data.
- Figure 5 presents interpretation of the 5-sequence primer set for predicting E. coli phylogroup. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data.
- Figure 6 presents interpretation of the 6-sequence primer set for predicting E. coli phylogroup.
- Figure 10 presents interpretation of the 4-sequence primer set for predicting S. aureus clonal complex. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data.
- Figure 11 presents interpretation of the 5-sequence primer set for predicting S. aureus clonal complex. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data.
- Figure 12 provides interpretation of the 6-sequence primer set for predicting S. aureus clonal complex. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data.
- Figure 13A-B provides performance and interpretation of four primers for predicting the ciprofloxacin resistance phenotype of E. coli B2 strains.
- A Confusion matrix for the binary classification of E. coli B2 strains by ciprofloxacin resistance phenotype from the four primers.
- B Interpretation of the four primers to predict ciprofloxacin resistance phenotype. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data.
- Figure 14 provides genetic and antimicrobial resistance profiles of 442 candidate strains for ciprofloxacin resistance marker sequence validation.
- Figure 15 provides a comparison of predicted vs. experimentally observed ciprofloxacin resistance phenotypes for 72 validation strains.
- Prediction method “ML” refers to the pangenome-based XGBoost model
- Marker refers to the original proposed interpretation of the four marker sequences
- Marker+1 refers to the modified interpretation adding a fifth marker to also capture the S80I double SNP mutation.
- Strains have been sorted by observed resistance phenotype first, then by strain genetic cluster.
- Figure 16 shows detection of ciprofloxacin resistance marker sequence targets using a PCR-based approach.
- Figure 17 illustrates a block diagram of an example machine upon which one or more embodiments (e.g., discussed methodologies) can be implemented.
- Figure 18 depicts a block diagram for a system or related method of an embodiment of the present invention in whole or in part. DETAILED DESCRIPTION [0031] As used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise.
- the term “about,” as used herein can mean within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which can depend in part on how the value is measured or determined, e.g., the limitations of the measurement system. Alternatively, “about” can mean a range of plus or minus 20%, plus or minus 10%, plus or minus 5%, or plus or minus 1% of a given value. Alternatively, particularly with respect to biological systems or processes, the term can mean within an order of magnitude, within 5-fold, or within 2-fold, of a value. Where particular values are described in the application and claims, unless otherwise stated the term “about” means within an acceptable error range for the particular value.
- the ranges and/or subranges can include the endpoints of the ranges and/or subranges.
- each intervening number there between with the same degree of precision is explicitly contemplated.
- the numbers 7 and 8 are contemplated in addition to 6 and 9, and for the range 6.0-7.0, the numbers 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, and 7.0 are explicitly contemplated.
- amplifying refers to the process of synthesizing nucleic acid molecules that are complementary to one (or both strands) of a template nucleic acid molecule.
- Amplifying a nucleic acid molecule typically includes denaturing the template nucleic acid, particularly if the template nucleic acid is double-stranded, annealing one or more primers (e.g., primers generated by the methods of the disclosure) to the template nucleic acid at a temperature that is below the melting temperatures of the primers, and enzymatically elongating from the primers to generate an amplification product.
- synthesis initiates at the 3′ end of a primer and proceeds in a 5′ to 3′ direction along the template nucleic acid strand.
- Amplification typically requires the presence of deoxyribonucleoside triphosphates, a polymerase enzyme (e.g., DNA or RNA polymerase or T7 for in vitro transcription in TMA) and an appropriate buffer and/or co-factors for optimal activity of the polymerase enzyme (e.g., MgCl2 and/or KCl).
- the term is distinct from the “particular taxon of pathogens”.
- the different taxon of pathogenic microorganisms does not overlap with the particular taxon of pathogens.
- a particular taxon of pathogenic microorganisms includes the family of Flavivirus
- the different taxon of pathogenic microorganisms does not include Flavivirus but can include another family of viruses, such as Alphaviruses, bacterial, fungal, archaea, algal, protozoan, and/or parasitic pathogens.
- the particular taxon of pathogenic microorganisms and different taxon of pathogenic microorganisms are from the same domain (e.g., bacterial domain), the two taxa identified by the method are distinct.
- microorganism or “microbial organism” is used in its broadest sense and includes Gram negative aerobic bacteria, Gram positive aerobic bacteria, Gram negative microaerophilic bacteria, Gram positive microaerophilic bacteria, Gram negative facultative anaerobic bacteria, Gram positive facultative anaerobic bacteria, Gram negative anaerobic bacteria, Gram positive anaerobic bacteria, Gram positive asporogenic bacteria, Actinomycetes, fungal microorganism, protozoan microorganism and the like.
- pathogen refers to a virus, bacterium, protozoa, prion, archaea, fungus, algae, parasite, or other microbe (helminth) that causes or induces disease or illness in a subject or that may be found in biological and/or environmental samples.
- the term includes both the disease-causing organism per se and toxins produced by the pathogen (e.g., Shiga toxins) present in a sample.
- Detection of a pathogen as set forth in the methods disclosed herein includes detection of a portion of the genome of the pathogen or a nucleic acid molecule that is complementary or substantially complementary (i.e., at least 90% complementary) to a portion of the genome of the pathogen.
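The "at least 90% complementary" criterion above can be checked with a short calculation. The sketch below assumes an ungapped, equal-length comparison between a probe and its target site, which is a simplification chosen for illustration; the function names are hypothetical and the disclosure does not prescribe this particular computation.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def percent_complementary(probe, target_site):
    """Percent of probe positions that base-pair with an equal-length target site."""
    if len(probe) != len(target_site):
        raise ValueError("probe and target site must have equal length")
    # A perfectly complementary probe equals the reverse complement of the site.
    expected = reverse_complement(target_site)
    matches = sum(p == e for p, e in zip(probe.upper(), expected))
    return 100.0 * matches / len(probe)


def is_substantially_complementary(probe, target_site, threshold=90.0):
    """Apply the >= 90% threshold used in the definition above."""
    return percent_complementary(probe, target_site) >= threshold


site = "ACGTACGTAC"
perfect = reverse_complement(site)   # "GTACGTACGT"
one_mismatch = "GTACGTACGA"          # last base altered
two_mismatches = "GTACGTACAA"        # last two bases altered
```

With a 10-nt site, one mismatch leaves 90% complementarity and still meets the threshold, while two mismatches drop it to 80%.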
- polynucleotide refers to a polymeric form of nucleotides of any length, either deoxyribonucleotides or ribonucleotides, or analogs thereof.
- polynucleotides coding or non-coding regions of a gene or gene fragment, loci (locus) defined from linkage analysis, exons, introns, messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), short interfering RNA (siRNA), short-hairpin RNA (shRNA), micro-RNA (miRNA), ribozymes, cDNA, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, and primers.
- a polynucleotide may comprise methylated nucleotides and nucleotide analogs.
- the terms “polypeptide”, “peptide” and “protein” are used interchangeably herein to refer to polymers of amino acids of any length.
- the polymer may be linear or branched.
- the terms also encompass an amino acid polymer that has been modified; for example, disulfide bond formation, glycosylation, lipidation, acetylation, phosphorylation, or any other manipulation, such as conjugation with a labeling component.
- amino acid includes natural and/or unnatural or synthetic amino acids, including glycine and both the D or L optical isomers, and amino acid analogs and peptidomimetics.
- the term “primer” refers to oligomeric compounds, primarily to oligonucleotides containing naturally occurring nucleotides such as adenine, guanine, cytosine, thymine and/or uracil, but may also include modified oligonucleotides (e.g., modified nucleotides, nucleosides, synthetic nucleotides having modified base moieties and/or modified sugar moieties (See, Protocols for Oligonucleotide Conjugates, Methods in Molecular Biology, Vol 26, (Sudhir Agrawal, Ed., Humana Press, Totowa, N.J., (1994)); and Oligonucleotides and Analogues, A Practical Approach (Fritz Eck).
- Oligonucleotides can be prepared by any suitable method, including, for example, cloning and restriction of appropriate sequences and direct chemical synthesis by a method such as the phosphotriester method of Narang et al., 1979, Meth. Enzymol. 68:90-99; the phosphodiester method of Brown et al., 1979, Meth. Enzymol. 68:109-151; the diethylphosphoramidite method of Beaucage et al., 1981, Tetrahedron Lett. 22:1859-1862; and the solid support method of U.S. Pat. No. 4,458,066.
- a review of synthesis methods is provided in Goodchild, 1990, Bioconjugate Chemistry 1(3):165-187.
- a primer is typically a single-stranded deoxyribonucleic acid.
- the appropriate length of a primer depends on the intended use of the primer but typically ranges from 6 to 50 nucleotides. Short primer molecules (e.g., having a length within a range of 11-17 nucleotides) generally require cooler temperatures to form sufficiently stable hybrid complexes with a template (or target) nucleic acid.
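The relationship between primer length and hybridization temperature can be illustrated with the Wallace rule, a widely used rough estimate for short oligonucleotides: Tm ≈ 2 °C × (A + T) + 4 °C × (G + C). The rule is brought in here purely as an illustration; the disclosure does not specify a melting-temperature formula, and the example sequences are arbitrary.

```python
def wallace_tm(primer):
    """Rough melting temperature (deg C) by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("primer must contain only A, C, G, T")
    return 2 * at + 4 * gc


short_primer = "ACGTACGTACG"         # 11 nt
longer_primer = "ACGTACGTACGTACGTA"  # 17 nt
tm_short = wallace_tm(short_primer)
tm_longer = wallace_tm(longer_primer)
```

The 11-nt example gives an estimate of 34 °C and the 17-nt example 50 °C, consistent with the statement that shorter primers need cooler temperatures to form stable hybrids. More accurate estimates would use nearest-neighbor thermodynamics and salt corrections.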
- the terms “subject,” “individual,” and “patient” are used interchangeably herein to refer to a vertebrate, preferably a mammal, more preferably a human. Mammals include, but are not limited to, murines, simians, humans, farm animals, sport animals, and pets.
- taxon or “taxa”, “taxonomic group,” and “taxonomic unit” are used interchangeably to refer to a group of one or more organisms that comprises a node in a clustering tree.
- the level of a cluster is determined by its hierarchical order.
- a taxon is a group tentatively assumed to be a valid taxon for purposes of phylogenetic analysis.
- a taxon is any of the extant taxonomic units under study.
- a taxon is given a name and a rank.
- a taxon can represent a domain, a sub-domain, a kingdom, a sub-kingdom, a phylum, a sub-phylum, a class, a sub-class, an order, a sub-order, a family, a subfamily, a genus, a subgenus, or a species.
- taxa can represent one or more organisms from the kingdoms eubacteria, protista, or fungi at any level of a hierarchical order.
- the identity of such sequences can be obtained through the assessment of genomic sequences from a large set of strains of the invading pathogenic species.
- the disclosure provides methods to identify marker sequences for rapid classification of microbial strains through machine learning analysis of genome assemblies. Once the proper phylogroup (i.e., which strain subgroup of a species, which is known to impact the severity of infection) designation of the microbe has been made, a follow-up panel of genetic tests can be applied to determine the AMR status of the microbe.
- the reference sequences are from one or more of bacteria, archaea, chromalveolata, viruses, fungi, plants, fish, amphibians, reptiles, birds, mammals, and humans.
- the database of reference sequences consists of sequences from a reference individual or a reference sample source.
- the method may further comprise identifying polynucleotides from the sample source as being derived from the reference individual or the reference sample source.
- the database of reference sequences comprises one or more mutations with respect to known polynucleotide sequences, such that a plurality of variants of the known polynucleotide sequences are represented in the database of reference sequences.
- the database of reference sequences can comprise marker gene sequences for taxonomic classification of bacterial sequences, such as 16S rRNA sequences.
- the database of reference sequences consists of sequences associated with a condition. One or more such sequences may form a biosignature for a condition, a plurality of which may together form the reference database.
- the record database is associated with a condition of the sample source to establish a biosignature for a condition.
- the method may further comprise identifying a condition of the sample source by comparison of the record database to a biosignature, including identifying the sample source as having the condition.
- the condition may be contamination, such as food contamination, surface contamination, or environmental contamination.
- the condition is infection.
- the reference database consists of sequences associated with infectious disease or contamination
- the sequences may be derived from and associated with any of a variety of infectious agents.
- the infectious agent can be bacterial.
- Non-limiting examples of bacterial pathogens include Mycobacteria (e.g. M. tuberculosis, M. bovis, M. avium, M. leprae, and M. africanum), rickettsia, mycoplasma, chlamydia, and legionella.
- bacterial infections include, but are not limited to, infections caused by Gram positive bacillus (e.g., Listeria, Bacillus such as Bacillus anthracis, Erysipelothrix species), Gram negative bacillus (e.g., Bartonella, Brucella, Campylobacter, Enterobacter, Escherichia, Francisella, Hemophilus, Klebsiella, Morganella, Proteus, Providencia, Pseudomonas, Salmonella, Serratia, Shigella, Vibrio and Yersinia species), spirochete bacteria (e.g., Borrelia species including Borrelia burgdorferi that causes Lyme disease), anaerobic bacteria (e.g., Actinomyces and Clostridium species), Gram positive and negative coccal bacteria, Enterococcus species, Streptococcus species, Pneumococcus species, Staphylococcus species, and Neisseria species.
- infectious bacteria include, but are not limited to: Helicobacter pylori, Legionella pneumophila, M. intracellulare, M. kansasii, M. gordonae, Staphylococcus aureus, Neisseria gonorrhoeae, Neisseria meningitidis, Listeria monocytogenes, Streptococcus pyogenes (Group A Streptococcus), Streptococcus agalactiae (Group B Streptococcus), Streptococcus viridans, Streptococcus faecalis, Streptococcus bovis, Streptococcus pneumoniae, Haemophilus influenzae, Bacillus anthracis, Erysipelothrix rhusiopathiae, Clostridium tetani, Enterobacter aerogenes, Klebsiella pneumoniae, Pasteurella multocida,
- FIG. 1A-B provides an exemplary workflow/flowchart to identify marker sequences for rapid classification of microbial strains through machine learning analysis of genome assemblies. As shown, for a species for which a classification scheme is to be developed, relevant public genomic data are used, assembled and structured into a “pangenome” as described in Hyun et al. (BMC Genomics 23(1):7 (2022)), which is incorporated herein in its entirety.
- BV-BRC Bacterial and Viral Bioinformatics Resource Center
- genomes are filtered for quality based on genome status, size, number of contigs, number of genes, and consistency.
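A quality filter of the kind described above might look like the following sketch. Every threshold, the accepted status labels, and the reading of "consistency" as "genome size close to the median of the collection" are assumptions chosen for illustration; the disclosure does not give these values here.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Assembly:
    genome_id: str
    status: str    # assumed labels such as "Complete" or "WGS"
    size_bp: int
    contigs: int
    genes: int


def filter_assemblies(assemblies, max_contigs=200, size_tolerance=0.2,
                      min_genes=1000, ok_status=("Complete", "WGS")):
    """Keep assemblies that pass all (hypothetical) quality thresholds."""
    typical_size = median(a.size_bp for a in assemblies)
    kept = []
    for a in assemblies:
        # "Consistency" is interpreted here as size within a tolerance of the median.
        size_ok = abs(a.size_bp - typical_size) <= size_tolerance * typical_size
        if (a.status in ok_status and size_ok
                and a.contigs <= max_contigs and a.genes >= min_genes):
            kept.append(a)
    return kept


assemblies = [
    Assembly("g1", "Complete", 5_000_000, 1, 4800),
    Assembly("g2", "WGS", 5_100_000, 120, 4900),
    Assembly("g3", "WGS", 5_050_000, 900, 4700),  # too fragmented
    Assembly("g4", "WGS", 2_000_000, 50, 1900),   # far smaller than the median
]
kept_ids = [a.genome_id for a in filter_assemblies(assemblies)]
```

Here `g3` is rejected for its contig count and `g4` for its size, leaving `g1` and `g2`.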
- open reading frames that are either publicly annotated or generated using a program (e.g., Prodigal) are clustered by protein sequence using CD-HIT (Cluster Database at High Identity with Tolerance) into “genes”.
- GA genetic algorithm
- the identified gene sets across the resulting GA models are then combined as a short list of candidate genes from which marker sequences are derived. For each candidate gene, all DNA sequence variants are identified and subsequences of length 11-50 nt (e.g., about 25 nt) are enumerated, referred to as “kmers”.
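The enumeration of fixed-length subsequences ("kmers") from the sequence variants of a candidate gene can be sketched as follows. The function name, the dictionary layout, and the toy sequences are illustrative assumptions; the default `k=25` follows the "about 25 nt" example in the text.

```python
def enumerate_kmers(variants, k=25):
    """Map every length-k subsequence to the set of variants that contain it.

    variants: variant name -> DNA sequence (hypothetical layout)
    """
    kmers = {}
    for name, seq in variants.items():
        seq = seq.upper()
        # Slide a window of width k across the sequence, one base at a time.
        for i in range(len(seq) - k + 1):
            kmers.setdefault(seq[i:i + k], set()).add(name)
    return kmers


# Toy variants with a small k so the result is easy to inspect.
variants = {"v1": "ACGTAC", "v2": "ACGTTC"}
kmers = enumerate_kmers(variants, k=4)
```

In the toy input, `ACGT` is shared by both variants, while the remaining four kmers are each specific to one variant.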
- Such marker sequences can then be used to synthesize oligonucleotide probes and/or primers.
- This two-phase approach of analyzing the classification at the gene-level then at the kmer-level provides two benefits towards developing marker sequence sets. First, there are far fewer genes than unique kmers observed within a pangenome, so the two problems of identifying predictive genes followed by identifying predictive kmers among those genes are both much more computationally tractable than the direct approach of enumerating all observed kmers and identifying predictive kmers directly. Second, this approach yields marker sequences that closely track specific genes, allowing for a more straightforward biological interpretation of proposed marker sequences and potentially facilitating commercial adoption.
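The second phase, picking kmers from a candidate gene that can serve as markers, can be sketched under one plausible selection criterion: a kmer is kept if it occurs in every observed variant of the gene and on neither strand of any background sequence. This criterion and all names below are assumptions for illustration; the disclosure's actual scoring of kmers may differ.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_set(seq, k):
    """All distinct length-k subsequences of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_markers(variants, background, k):
    """Kmers shared by all gene variants and absent from both strands of the background."""
    shared = set.intersection(*(kmer_set(v.upper(), k) for v in variants))
    off_target = set()
    for seq in background:
        seq = seq.upper()
        off_target |= kmer_set(seq, k) | kmer_set(revcomp(seq), k)
    return sorted(shared - off_target)


variants = ["TTGACCA", "TTGACGA"]  # two variants of one candidate gene
background = ["GGTTGAGG"]          # sequence from elsewhere in the pangenome
markers = candidate_markers(variants, background, k=4)
```

Both `TTGA` and `TGAC` are conserved across the two variants, but `TTGA` also appears in the background, so only `TGAC` remains as a candidate.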
- the methods of the disclosure are innovative in that they fully utilize big data and generate large pangenome collections which fully represent all publicly available sequences of strains. This improves the accuracy of the classification schema over traditional methods and better identifies rarer classification subtypes. Moreover, the methods of the disclosure provide faster classification of pathogens isolated from a patient than is currently possible. Millions of pathogens are classified in pathology labs in the US annually. Similarly, the methods of the disclosure can rapidly identify drug-resistant strains. Other applications can also be envisaged, such as screening wastewater or other samples (e.g., patient samples, environmental samples, etc.) for specific strains of bacteria or viruses (e.g., COVID/SARS-CoV-2).
- the markers or primers developed by the methods of the disclosure can be used to screen for a pathogen or pathogen’s antimicrobial susceptibility.
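As a simple in-silico analogue of screening a sample, the sketch below reports which markers from a catalog occur in a set of sequence reads, checking both strands by exact substring match. The catalog contents, marker names, and the exact-match criterion are illustrative assumptions; a real screen would need to tolerate sequencing errors and would typically rely on an aligner.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def screen_reads(reads, catalog):
    """Return marker name -> True if the marker occurs in any read on either strand."""
    reads = [r.upper() for r in reads]
    result = {}
    for name, marker in catalog.items():
        marker = marker.upper()
        rc = revcomp(marker)
        result[name] = any(marker in r or rc in r for r in reads)
    return result


catalog = {"m1": "GATTACA", "m2": "CCGGTTAA"}  # hypothetical markers
forward_hits = screen_reads(["TTTGATTACATTT", "GGGGCCCC"], catalog)
reverse_hits = screen_reads(["AATGTAATCAA"], catalog)
```

The resulting presence/absence pattern across markers is the kind of input that a lookup of marker combinations to classes would consume.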
- the methods (and resulting oligonucleotide compositions and kits) provide improved identification and/or quantification of target nucleic acid molecules in a sample from a subject, e.g., by RT-qPCR and/or next-generation sequencing (NGS).
- NGS next-generation sequencing
- a number of different assay techniques can use the oligonucleotide primers obtained by the methods of the disclosure including, but not limited to, lateral flow assays, PCR, NGS, southern blots, northern blots, and the like.
- FIG. 17 illustrates a block diagram of an example machine 400 upon which one or more embodiments (e.g., the foregoing discussed methodologies) can be implemented (e.g., run).
- Examples of machine 400 can include logic, one or more components, circuits (e.g., modules), or mechanisms. Circuits are tangible entities configured to perform certain operations. In an example, circuits can be arranged (e.g., internally or with respect to external entities such as other circuits) in a specified manner.
- one or more computer systems e.g., a standalone, client or server computer system
- one or more hardware processors can be configured by software (e.g., instructions, an application portion, or an application) as a circuit that operates to perform certain operations as described herein.
- the software can reside (1) on a non-transitory machine readable medium or (2) in a transmission signal.
- the software when executed by the underlying hardware of the circuit, causes the circuit to perform the certain operations.
- a circuit can be implemented mechanically or electronically.
- a circuit can comprise dedicated circuitry or logic that is specifically configured to perform one or more techniques such as discussed above, such as including a special-purpose processor, a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
- a circuit can comprise programmable logic (e.g., circuitry, as encompassed within a general-purpose processor or other programmable processor) that can be temporarily configured (e.g., by software) to perform the certain operations. It will be appreciated that the decision to implement a circuit mechanically (e.g., in dedicated and permanently configured circuitry), or in temporarily configured circuitry (e.g., configured by software) can be driven by cost and time considerations.
- processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors can constitute processor-implemented circuits that operate to perform one or more operations or functions. In an example, the circuits referred to herein can comprise processor-implemented circuits.
- the methods described herein can be at least partially processor-implemented. For example, at least some of the operations of a method can be performed by one or more processors or processor-implemented circuits. The performance of certain of the operations can be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines.
- the processor or processors can be located in a single location (e.g., within a home environment, an office environment or as a server farm), while in other examples the processors can be distributed across a number of locations.
- the one or more processors can also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations can be performed by a group of computers (as examples of machines including processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., Application Program Interfaces (APIs)).
- Example embodiments can be implemented in digital electronic circuitry, in computer hardware, in firmware, in software, or in any combination thereof.
- Example embodiments can be implemented using a computer program product (e.g., a computer program, tangibly embodied in an information carrier or in a machine readable medium, for execution by, or to control the operation of, data processing apparatus such as a programmable processor, a computer, or multiple computers).
- a computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a software module, subroutine, or other unit suitable for use in a computing environment.
- a computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
- operations can be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Examples of method operations can also be performed by, and example apparatus can be implemented as, special purpose logic circuitry (e.g., a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)).
- the computing system can include clients and servers.
- a client and server are generally remote from each other and generally interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- both hardware and software architectures require consideration.
- the choice of whether to implement certain functionality in permanently configured hardware e.g., an ASIC
- temporarily configured hardware e.g., a combination of software and a programmable processor
- a combination of permanently and temporarily configured hardware can be a design choice.
- hardware e.g., machine 400
- software architectures that can be deployed in example embodiments.
- the machine 400 can operate as a standalone device or the machine 400 can be connected (e.g., networked) to other machines.
- the machine 400 can operate in the capacity of either a server or a client machine in server-client network environments.
- machine 400 can act as a peer machine in peer-to-peer (or other distributed) network environments.
- the machine 400 can be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a mobile telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) specifying actions to be taken (e.g., performed) by the machine 400.
- Example machine 400 can include a processor 402 (e.g., a central processing unit (CPU), a graphics processing unit (GPU) or both), a main memory 404 and a static memory 406, some or all of which can communicate with each other via a bus 408.
- the machine 400 can further include a display unit 410, an alphanumeric input device 412 (e.g., a keyboard), and a user interface (UI) navigation device 411 (e.g., a mouse).
- the display unit 410, input device 412 and UI navigation device 411 can be a touch screen display.
- the machine 400 can additionally include a storage device (e.g., drive unit) 416, a signal generation device 418 (e.g., a speaker), a network interface device 420, and one or more sensors 421, such as a global positioning system (GPS) sensor, compass, accelerometer, or other sensor.
- the storage device 416 can include a machine readable medium 422 on which is stored one or more sets of data structures or instructions 424 (e.g., software) embodying or utilized by any one or more of the methodologies or functions described herein.
- the instructions 424 can also reside, completely or at least partially, within the main memory 404, within static memory 406, or within the processor 402 during execution thereof by the machine 400.
- one or any combination of the processor 402, the main memory 404, the static memory 406, or the storage device 416 can constitute machine readable media.
- [0078] While the machine readable medium 422 is illustrated as a single medium, the term “machine readable medium” can include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that are configured to store the one or more instructions 424.
- machine readable medium can also be taken to include any tangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure or that is capable of storing, encoding or carrying data structures utilized by or associated with such instructions.
- machine readable medium can accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media.
- machine readable media can include non-volatile memory, including, by way of example, semiconductor memory devices (e.g., Electrically Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM)) and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- Example communication networks can include a local area network (LAN), a wide area network (WAN), a packet data network (e.g., the Internet), mobile telephone networks (e.g., cellular networks), Plain Old Telephone Service (POTS) networks, and wireless data networks (e.g., IEEE 802.11 standards family known as Wi-Fi®, IEEE 802.16 standards family known as WiMax®), peer-to-peer (P2P) networks, among others.
- the term “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding or carrying instructions for execution by the machine, and includes digital or analog communications signals or other intangible medium to facilitate communication of such software.
- FIG. 18 is a block diagram that illustrates a system 130 including a computer system 140 and the associated Internet 11 connection upon which an embodiment may be implemented.
- Such configuration is typically used for computers (hosts) connected to the Internet 11 and executing a server or a client (or a combination) software.
- a source computer such as laptop, an ultimate destination computer and relay servers, for example, as well as any computer or processor described herein, may use the computer system configuration and the Internet connection shown in FIG. 18.
- the system 140 may be used as a portable electronic device such as a notebook/laptop computer, a media player (e.g., MP3 based or video player), a cellular phone, a Personal Digital Assistant (PDA), a sample device, an image processing device (e.g., a digital camera or video recorder), and/or any other handheld computing devices, or a combination of any of these devices.
- While FIG. 17 illustrates various components of a computer system, it is not intended to represent any particular architecture or manner of interconnecting the components, as such details are not germane to the present invention. It will also be appreciated that network computers, handheld computers
- Computer system 140 includes a bus 137, an interconnect, or other communication mechanism for communicating information, and a processor 138, commonly in the form of an integrated circuit, coupled with bus 137 for processing information and for executing the computer executable instructions.
- Computer system 140 also includes a main memory 134, such as a Random Access Memory (RAM) or other dynamic storage device, coupled to bus 137 for storing information and instructions to be executed by processor 138.
- Main memory 134 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 138.
- Computer system 140 further includes a Read Only Memory (ROM) 136 (or other non-volatile memory) or other static storage device coupled to bus 137 for storing static information and instructions for processor 138.
- the hard disk drive, magnetic disk drive, and optical disk drive may be connected to the system bus by a hard disk drive interface, a magnetic disk drive interface, and an optical disk drive interface, respectively.
- the drives and their associated computer-readable media provide non- volatile storage of computer readable instructions, data structures, program modules and other data for the general purpose computing devices.
- OS Operating System
- An operating system commonly processes system data and user input, and responds by allocating and managing tasks and internal system resources, such as controlling and allocating memory, prioritizing system requests, controlling input and output devices, facilitating networking and managing files.
- Non-limiting examples of operating systems are Microsoft Windows, Mac OS X, and Linux.
- processor is meant to include any integrated circuit or other electronic device (or collection of devices) capable of performing an operation on at least one instruction including, without limitation, Reduced Instruction Set Core (RISC) processors, CISC microprocessors, Microcontroller Units (MCUs), CISC-based Central Processing Units (CPUs), and Digital Signal Processors (DSPs).
- the hardware of such devices may be integrated onto a single substrate (e.g., silicon “die”), or distributed among two or more substrates.
- various functional aspects of the processor may be implemented solely as software or firmware associated with the processor.
- Computer system 140 may be coupled via bus 137 to a display 131, such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a flat screen monitor, a touch screen monitor or similar means for displaying text and graphical data to a user.
- the display may be connected via a video adapter for supporting the display.
- the display allows a user to view, enter, and/or edit information that is relevant to the operation of the system.
- An input device 132 is coupled to bus 137 for communicating information and command selections to processor 138.
- cursor control 133 such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 138 and for controlling cursor movement on display 131.
- This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
- the computer system 140 may be used for implementing the methods and techniques described herein. According to one embodiment, those methods and techniques are performed by computer system 140 in response to processor 138 executing one or more sequences of one or more instructions contained in main memory 134.
- Such instructions may be read into main memory 134 from another computer-readable medium, such as storage device 135. Execution of the sequences of instructions contained in main memory 134 causes processor 138 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the arrangement. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.
- the term “computer-readable medium” (or “machine-readable medium”) as used herein is an extensible term that refers to any medium or any memory, that participates in providing instructions to a processor, (such as processor 138) for execution, or any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer).
- Such a medium may store computer- executable instructions to be executed by a processing element and/or control logic, and data which is manipulated by a processing element and/or control logic, and may take many forms, including but not limited to, non-volatile medium, volatile medium, and transmission medium.
- Transmission media include coaxial cables, copper wire and fiber optics, including the wires that comprise bus 137. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.).
- the remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem.
- a modem local to computer system 140 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal.
- An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 137.
- Bus 137 carries the data to main memory 134, from which processor 138 retrieves and executes the instructions.
- the instructions received by main memory 134 may optionally be stored on storage device 135 either before or after execution by processor 138.
- Computer system 140 also includes a communication interface 141 coupled to bus 137.
- Communication interface 141 provides a two-way data communication coupling to a network link 139 that is connected to a local network 111.
- communication interface 141 may be an Integrated Services Digital Network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line.
- communication interface 141 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN.
- An Ethernet-based connection based on the IEEE 802.3 standard may be used, such as 10/100 BaseT, 1000 BaseT (gigabit Ethernet), 10 gigabit Ethernet (10 GE or 10 GbE or 10 GigE per IEEE Std 802.3ae-2002 as standard), 40 Gigabit Ethernet (40 GbE), or 100 Gigabit Ethernet (100 GbE as per Ethernet standard IEEE P802.3ba), as described in Cisco Systems, Inc. Publication number 1-587005-001-3 (June 1999), “Internetworking Technologies Handbook”, Chapter 7: “Ethernet Technologies”, pages 7-1 to 7-38, which is incorporated in its entirety for all purposes as if fully set forth herein.
- network link 139 may provide a connection through local network 111 to a host computer or to data equipment operated by an Internet Service Provider (ISP) 142.
- ISP 142 provides data communication services through the world wide packet data communication network Internet 11.
- Local network 111 and Internet 11 both use electrical, electromagnetic or optical signals that carry digital data streams.
- satellite and network satellite communication and modules may be implemented.
- the signals through the various networks and the signals on the network link 139 and through the communication interface 141, which carry the digital data to and from computer system 140, are exemplary forms of carrier waves transporting the information.
- a received code may be executed by processor 138 as it is received, and/or stored in storage device 135, or other non-volatile storage for later execution.
- computer system 140 may obtain application code in the form of a carrier wave.
- the concept of rapid classification of DNA for applications in various clinical, agricultural, environmental and military/forensic scenarios is disclosed herein, and may be implemented and utilized with the related processors, networks, computer systems, internet, and components and functions according to the schemes disclosed herein.
- the methods and computer implemented methods of the disclosure can be implemented on the computers, systems and architecture described herein. The methods may be implemented by a computer program or programs stored on a computer readable medium such that the program causes the computer to carry out the methods and steps set forth above.
- Example 1: Primers for rapid classification of E. coli isolates into Clermont phylogroups.
- 15,278 E. coli genomes were split between a training set of 14,638 public genomes from BV-BRC and RefSeq, selected using quality control measures as described in Hyun et al. (BMC Genomics 23(1):7 (2022)), and a validation set of 640 internal genomes. All genomes were assigned phylogroup annotations in line with current practice using ClermonTyping.
- 293,826 genes were identified through pangenome construction and were filtered for those potentially associated with the phylogroup classification (defined as: there exists a pair of classes for which the gene is observed in X% of class 1 genomes and Y% of class 2 genomes, where X > Y + 10), yielding 14,836 genes. [0095] Genetic algorithms (GAs) were applied to iteratively identify genes that (1) both accurately discriminate phylogroups and (2) can be reliably identified by a marker sequence, given all observed variants and against the background of the E. coli pangenome.
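The pre-filter criterion stated above (a gene is kept if, for some pair of classes, it is observed in X% of class 1 genomes and Y% of class 2 genomes with X > Y + 10) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `prefilter_genes`, the input layout (gene to set of genomes, genome to class label) and the toy data are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import permutations


def prefilter_genes(gene_presence, labels, margin=10.0):
    """Keep genes whose prevalence differs by more than `margin`
    percentage points between at least one ordered pair of classes.

    gene_presence: dict gene -> set of genomes carrying the gene (assumed layout)
    labels:        dict genome -> class label (e.g., phylogroup)
    """
    class_sizes = defaultdict(int)
    for cls in labels.values():
        class_sizes[cls] += 1

    kept = []
    for gene, genomes in gene_presence.items():
        counts = defaultdict(int)
        for genome in genomes:
            counts[labels[genome]] += 1
        # Percentage of genomes in each class that carry the gene.
        pct = {c: 100.0 * counts[c] / n for c, n in class_sizes.items()}
        # X > Y + margin for some (class 1, class 2) pair.
        if any(pct[a] > pct[b] + margin for a, b in permutations(pct, 2)):
            kept.append(gene)
    return kept


# Hypothetical toy data: two classes with two genomes each.
labels = {"g1": "A", "g2": "A", "g3": "B", "g4": "B"}
gene_presence = {
    "geneX": {"g1", "g2"},              # 100% of A, 0% of B: kept
    "geneY": {"g1", "g3"},              # 50% of A, 50% of B: dropped
    "geneZ": {"g1", "g2", "g3", "g4"},  # 100% of both: dropped
}
filtered = prefilter_genes(gene_presence, labels)
```

Because the test is applied to every ordered pair of classes, it is equivalent to asking whether the highest and lowest per-class prevalence differ by more than the margin.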
- for each candidate gene, all 25 nt subsequences or “kmers” were counted across all observed variants of the gene, and kmers occurring in at least 90% of variants, weighted by frequency, were identified as primer candidates.
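The kmer enumeration step can be sketched as below. The sketch assumes that "weighted by frequency" means each gene variant is weighted by the number of genomes in which it is observed; the text does not spell this out, so treat that weighting, the function name `primer_candidates` and the toy sequences as hypothetical. A short k is used in the toy call only to keep the example readable; the text uses k = 25.

```python
from collections import Counter


def primer_candidates(variant_counts, k=25, min_coverage=0.90):
    """Return kmers present in at least `min_coverage` of gene variants,
    with each variant weighted by how many genomes carry it.

    variant_counts: dict variant sequence -> number of genomes (assumed layout)
    """
    total = sum(variant_counts.values())
    coverage = Counter()
    for seq, n_genomes in variant_counts.items():
        # Use a set so a kmer repeated within one variant is counted once.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for kmer in kmers:
            coverage[kmer] += n_genomes
    return {
        kmer: count / total
        for kmer, count in coverage.items()
        if count / total >= min_coverage
    }


# Hypothetical toy gene with two variants seen in 19 and 1 genomes.
toy_variants = {"ACGTAC": 19, "ACGTTT": 1}
candidates = primer_candidates(toy_variants, k=4)
```

Here the kmer shared by both variants has coverage 1.0, kmers unique to the common variant have coverage 0.95 and are retained, and kmers unique to the rare variant (coverage 0.05) are discarded.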
- 100 genomes were randomly sampled for each gene with balanced representation of genomes with/without the gene and across all phylogroups, and the presence/absence of primer candidates in the sampled genomes was used to compute the F1 score of each primer candidate at recovering its corresponding gene. Genes for which no primer candidate achieved F1 > 90% were removed, completing an iteration of the marker identification process.
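The F1 score used to judge how well a primer candidate recovers its gene can be computed from paired presence/absence calls over the sampled genomes. The sketch below uses the standard definition F1 = 2TP / (2TP + FP + FN); the function name and the toy calls are illustrative assumptions.

```python
def f1_score(kmer_present, gene_present):
    """F1 of kmer presence as a predictor of gene presence.

    Both arguments are equal-length sequences of booleans,
    one entry per sampled genome.
    """
    pairs = list(zip(kmer_present, gene_present))
    tp = sum(1 for k, g in pairs if k and g)      # kmer found, gene present
    fp = sum(1 for k, g in pairs if k and not g)  # kmer found, gene absent
    fn = sum(1 for k, g in pairs if g and not k)  # gene present, kmer missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical calls for six sampled genomes.
kmer_calls = [True, True, True, False, False, True]
gene_calls = [True, True, True, True, False, False]
score = f1_score(kmer_calls, gene_calls)  # tp=3, fp=1, fn=1
```

A candidate with this score (0.75) would fall below both the 90% per-iteration threshold and the 80% final threshold described in the text.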
- a total of six iterations were completed, after which all primer candidates across all iterations with F1 > 80% were identified.
- the primers selected by the most accurate GA for each primer set size (4-8) are included as the final set of marker sequences for classifying E. coli by phylogroup.
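The text does not give the GA's encoding, operators or fitness function, so the following is only a generic sketch of how a genetic algorithm might pick a fixed-size marker set. Everything in it is an assumption for illustration: the fitness (accuracy of a lookup table keyed on the presence/absence pattern of the chosen markers), the crossover (sampling from the union of two parents), the mutation rate, and all names and toy data.

```python
import random
from collections import Counter, defaultdict


def pattern_accuracy(markers, presence, labels):
    """Accuracy when each presence/absence pattern predicts its majority class."""
    groups = defaultdict(Counter)
    for genome, present in presence.items():
        pattern = tuple(m in present for m in markers)
        groups[pattern][labels[genome]] += 1
    correct = sum(c.most_common(1)[0][1] for c in groups.values())
    return correct / len(presence)


def select_markers(candidates, presence, labels, k,
                   generations=30, pop_size=20, seed=0):
    """Toy GA over k-subsets of `candidates` (a list with more than k items)."""
    rng = random.Random(seed)

    def fitness(subset):
        return pattern_accuracy(sorted(subset), presence, labels)

    population = [frozenset(rng.sample(candidates, k)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # elitism: keep the top half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = set(rng.sample(sorted(a | b), k))  # crossover
            if rng.random() < 0.3:                     # mutation: swap one marker
                child.remove(rng.choice(sorted(child)))
                child.add(rng.choice([c for c in candidates if c not in child]))
            children.append(frozenset(child))
        population = parents + children
    best = max(population, key=fitness)
    return sorted(best), fitness(best)


# Hypothetical toy data: markers m1 and m2 together separate three classes.
presence = {"g1": {"m1", "m3"}, "g2": {"m1"}, "g3": {"m2", "m3"},
            "g4": {"m2"}, "g5": {"m3"}, "g6": set()}
labels = {"g1": "A", "g2": "A", "g3": "B", "g4": "B", "g5": "C", "g6": "C"}
candidates = ["m1", "m2", "m3", "m4", "m5"]
markers, accuracy = select_markers(candidates, presence, labels, k=2)
```

Running one such search per set size (4 through 8 in the text) and keeping the most accurate result for each size mirrors the selection described above, although the actual procedure may differ in every detail.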
- the marker sequences for each size (4-8) are available in Table 1, with annotations for corresponding genes derived from eggNOG-mapper. Confusion matrices describing the performance of each marker set are presented in FIG. 2 and FIG. 3. Interpretation of marker sequences to predict E. coli phylogroup is described in FIGs. 4-8.
- Table 1: Sets of marker sequences for the identification of E. coli phylogroups. For each marker, the predicted protein product of the targeted gene, gene name, and gene identifiers are shown when available. Markers 4B and 8A (starred) are identical.
- S. aureus genomes of high quality and with clonal complex metadata were identified from the BV-BRC database, resulting in 753 genomes.
- the genomes spanned 14 clonal complexes with at least 7 genomes each, as well as 8 other rarer clonal complexes which were grouped together as “other”. 10% of genomes were randomly selected to be the validation set, with the remaining 90% as the training set.
- a S. aureus pangenome was prepared similarly to the previous example, and genes were filtered for those present in at least three genomes or missing in at least three genomes, resulting in 6,990 genes. [00100] Sets of genes and marker sequences were iteratively identified using the same genetic algorithm approach as the previous example and as described in FIG.
- RefSeq/GenBank refers to the most common variant of the gene targeted by the primer.
| ID | Primer | Primer Target Description | RefSeq/GenBank | COG |
|---|---|---|---|---|
| 2-4A | CATGACATGTTCGTAAATGATGATT (SEQ ID NO:31) | Staphylococcal enterotoxin type 26 | WP_001622271.1 | 2CC33 |
| 2-4B | TCTGAAAAACCAAATTGTACAGACG (SEQ ID NO:32) | MFS transporter | WP_000130758.1 | COG0477 |
| 2-4C | CAAATGTATAAATAATAAATGCTAT (SEQ ID NO:33) | Uncharacterized membrane protein | WP_000956429.1 | COG4858 |
| 2-4D | TTAGATAATTATTTAGTATTAGCAT (SEQ ID NO:34) | HTH-type transcriptional regulator SarT | ADC38645.1 | COG1846 |
| 2-5A | TTAGAATCTTTTGCCTTTACCGCAT (SEQ ID NO:35) | LPXTG-anchored surface protein SasK | AYV00666.1 | - |
- 2-5
- E. coli genomes from Example 1 were filtered to those with antimicrobial resistance metadata against ciprofloxacin available on BV-BRC, resulting in 1044 genomes (179 resistant, 865 susceptible).
- the initial feature set was expanded from genes to also include individual amino acid variants of those genes or “alleles”, resulting in 267,328 initial genetic features. These features were pre-filtered based on association with the resistance phenotype similarly to Example 1, yielding 13,953 features that satisfy: the feature is present in X% of resistant genomes and Y% of susceptible genomes and
- a single set of marker sequences (size 4) is available in Table 3 with annotations for corresponding genes derived from eggNOG-mapper and NCBI blastp. Larger primer sets were excluded as they targeted similar sets of genes and did not confer meaningful improvements to accuracy. The confusion matrix and interpretation of this marker sequence set are available in FIG. 13. Notably, the GA approach independently selected primers targeting regions of gyrA and parC, which are known determinants of ciprofloxacin resistance.
- Table 3: Four marker sequences for the determination of resistance against ciprofloxacin in E. coli B2 strains. For each primer, the predicted protein product of the targeted gene, gene name, and gene identifiers are shown when available.
- the target of primer 3-4D is an undercharacterized protein represented by ESA89826.1 (GenBank).
| ID | Primer | Primer Target Description | Gene | bnum | COG | KEGG |
|---|---|---|---|---|---|---|
| 3-4A | TTCGTGGTATTCGTTTAGGCGAAGG (SEQ ID NO:46) | DNA gyrase subunit A | gyrA | b2231 | COG0188 | K02469 |
| 3-4B | TAACAGGCAATATCGCCGTGCGGAT (SEQ ID NO:47) | DNA topoisomerase IV subunit A | parC | b3019 | COG0188 | K02621 |
| 3-4C | GTTATCGCGATGAATATAAACTGGC (SEQ ID NO:48) | Putative FAD-linked oxidoreductase | ydiJ | b1687 | COG0247 | - |
| | CAGCATGGCCCATCCTACTGAAACT | - | | | | |
- [00103] Validation of Example 3: Marker sequences for determining resistance against ciprofloxacin for E.
- AMR phenotypes (binary ciprofloxacin resistant/susceptible calls) were predicted in silico and independently of the marker sequences by first constructing a pangenome combining 1260 publicly available B2 strains with ciprofloxacin resistance data from BV-BRC (Olson et al., Nucl. Acids. Res., 51:D678-689, 2023) with the 442 candidate validation strains.
- 3-4B: Targets part of the quinolone resistance-determining region of parC (DNA topoisomerase IV subunit A), capturing a single SNP against the consensus that yields the resistance mutation S80I (AGT -> ATT).
- 3-4C: Targets a stable region of ydiJ (putative FAD-linked oxidoreductase). Aligned exactly in all 72 validation strains. Was previously found missing very rarely in the initial training strains used to develop the marker sequences.
- 3-4D: Targets a stable region of rarely detected abi (abortive infection family protein). Aligned in only 6 validation strains, and all such alignments were exact matches.
- [00107] Ciprofloxacin resistance testing of validation strains.
- Ciprofloxacin resistance predictions based on the four original marker sequences were 72% accurate with 20 false negatives and 0 false positives (FIG. 15). Analysis of the errors found that 17/20 false negatives could be attributed to missing a double SNP mutation (AGT -> ATC) that would yield the same S80I substitution as the single SNP captured by marker 3-4B. This double SNP can be captured by adding a fifth marker, 3-4B* (TAACAGGCGATATCGCCGTGCGGAT (SEQ ID NO:50)), which differs from marker 3-4B by a single base.
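The relationship between markers 3-4B and 3-4B* can be checked directly from the two sequences given in the text. The short sketch below confirms that they differ at a single position and, under the assumption that the markers are written on the strand complementary to the parC coding strand (an inference from the sequences, not something the text states), recovers the ATT and ATC codons described for the S80I substitution. Helper names are illustrative.

```python
MARKER_3_4B = "TAACAGGCAATATCGCCGTGCGGAT"       # SEQ ID NO:47
MARKER_3_4B_STAR = "TAACAGGCGATATCGCCGTGCGGAT"  # SEQ ID NO:50

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def mismatches(a, b):
    """Positions (0-based) at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


diff = mismatches(MARKER_3_4B, MARKER_3_4B_STAR)

# Map the differing position onto the reverse complement. ATT and ATC differ
# at the third codon base, so the codon is taken to end at that position
# (assumption based on the AGT -> ATT / AGT -> ATC description in the text).
pos = len(MARKER_3_4B) - 1 - diff[0]
codon_3_4b = reverse_complement(MARKER_3_4B)[pos - 2:pos + 1]
codon_3_4b_star = reverse_complement(MARKER_3_4B_STAR)[pos - 2:pos + 1]
```

Both ATT and ATC encode isoleucine in the standard genetic code, which is consistent with the two markers capturing the same S80I substitution.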
- annealing temperature was selected which optimized the likelihood of the gyrA and parC ARMS primer pairs working well on their intended targets but poorly or not at all on the non-targeted counterparts.
- the ydiJ and abi primers were adjusted to function well at the same temperature so that all six assays could be run on the same plate at the same time.
- the PCR reactions were all run in a 96-well format on a CFX Duet Real-Time PCR System. Thirty-five cycles were run in total, each cycle consisting of 10 seconds at 95°C, 15 seconds at 69°C, and 30 seconds at 72°C.
- Outcomes were scored by first setting a threshold to cross all of the amplification curves in their log-linear region, and calls were based on Cq numbers as follows: For gyrA and parC targets, the marker or variant was scored as positive if its Cq number was lower than its counterpart by at least 5 cycles. For samples whose amplification curves did not reach threshold, 35, the total number of cycles run, was used for the calculations. For ydiJ and abi targets, Cq ⁇ 25 was considered positive. [00112] Consistency between PCR assay and in silico marker sequence presence/absence calls derived from assembled genomes on validation strains.
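The Cq-based scoring rule described above for the gyrA and parC ARMS primer pairs can be sketched as follows: a target is called positive when its Cq is at least 5 cycles lower than its counterpart's, and a curve that never reaches threshold is assigned 35, the total number of cycles run. Representing "did not reach threshold" as `None`, and the "indeterminate" label for cases where neither side wins by 5 cycles, are assumptions made for this example. The ydiJ and abi rule is omitted here.

```python
MAX_CYCLES = 35   # total cycles run; used when a curve never reaches threshold
MIN_DELTA = 5     # required Cq advantage, in cycles


def effective_cq(cq):
    """Substitute the cycle count for curves that did not reach threshold."""
    return MAX_CYCLES if cq is None else cq


def arms_call(marker_cq, counterpart_cq):
    """Score a marker/counterpart ARMS primer pair from their Cq values."""
    marker = effective_cq(marker_cq)
    counterpart = effective_cq(counterpart_cq)
    if counterpart - marker >= MIN_DELTA:
        return "marker"
    if marker - counterpart >= MIN_DELTA:
        return "counterpart"
    return "indeterminate"   # hypothetical label; not defined in the text


# Hypothetical Cq values for three samples.
call_a = arms_call(20.0, None)   # counterpart never amplified: 35 - 20 = 15
call_b = arms_call(30.0, 22.5)   # counterpart wins by 7.5 cycles
call_c = arms_call(24.0, 27.0)   # difference of 3 cycles is below the cutoff
```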
Abstract
The disclosure provides methods to identify marker sequences for rapid classification of microbial strains through machine learning analysis of genome assemblies.
Description
Attorney docket No.00015-421WO1 IDENTIFICATION OF MARKER SEQUENCES FOR RAPID CLASSIFICATION OF MICROBIAL PATHOGENS THROUGH MACHINE LEARNING ANALYSIS OF GENOME ASSEMBLIES CROSS REFERENCE TO RELATED APPLICATIONS [001] This application claims priority under 35 U.S.C. §119 from Provisional Application Serial No. 63/470,417, filed June 1, 2023 the disclosures of which are incorporated herein by reference. STATEMENT OF GOVERNMENT SUPPORT [002] This invention was made with Government support under Grant Nos U01-AI124316, awarded by the National Institute of Allergy and Infectious Diseases. The Government has certain rights in the invention. TECHNICAL FIELD [003] The disclosure provides methods to identify marker sequences for rapid classification of microbial strains through machine learning analysis of genome assemblies. NCORPORATION BY REFERENCE OF SEQUENCE LISTING [004] Accompanying this filing is a Sequence Listing entitled, “00015-421WO1.xml” created on May 31, 2024 and having 56,079 bytes of data, machine formatted on IBM-PC, MS-Windows operating system. The sequence listing is hereby incorporated by reference in its entirety for all purposes. BACKGROUND [005] A challenge in dealing with acute bacterial infections in a clinical setting is to obtain rapid identification of the invading pathogen. Once the taxonomic identity of the pathogen has been determined, a second challenge is to determine if it has known antimicrobial resistance (AMR) characteristics, which are important for determining the appropriate treatment modality. Currently, these determinations are made by time consuming laboratory procedures that are typically performed in a pathology laboratory that require a day or two for analysis. SUMMARY [006] The disclosure provides methods to identify marker sequences for rapid classification of microbial strains through machine learning analysis of genome assemblies. 
In a particular embodiment, a method disclosed herein provides for generating a pangenome compendium from genome assemblies of all publicly available
microbial strains of interest. This pangenome compendium of strains is then run through a machine-learning pipeline alongside a classification schema (e.g., pathogenic vs. nonpathogenic strains) to identify genes that can robustly classify microbial strains into the desired schema. These genes are then used to develop primer sequences that can rapidly identify these microbial strains through PCR tests performed in a laboratory. [007] For example, given a prokaryotic pangenome of interest (e.g., a known pathogenic species like E. coli or S. aureus) and a desired classification scheme (e.g., pathotypes, anti-microbial resistance (AMR), etc.), the methods disclosed herein can identify a limited number of short DNA sequences that can accurately reproduce the classification scheme. As an exemplary embodiment, the disclosure presented herein reproduces three different classification schemes: (1) determining the phylogenetic group of Escherichia coli traditionally defined by Clermont Typing, (2) determining the clonal complex of Staphylococcus aureus, and (3) determining resistance against ciprofloxacin for Escherichia coli isolates in the B2 phylogroup.
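One way to picture how a limited number of short DNA sequences could reproduce a classification scheme is the minimal Python sketch below. Each genome is reduced to a presence/absence pattern over the markers, and each pattern observed in training is mapped to the class most often seen with it. The marker sequences, class labels, and function names are hypothetical placeholders, and the majority-vote lookup is an illustrative stand-in, not the machine learning model of the disclosure.

```python
from collections import Counter, defaultdict

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def marker_pattern(genome, markers):
    """Presence/absence tuple: 1 if the marker occurs on either strand."""
    return tuple(int(m in genome or revcomp(m) in genome) for m in markers)

def fit_lookup(genomes, labels, markers):
    """Map each observed pattern to the majority class seen with it."""
    votes = defaultdict(Counter)
    for genome, label in zip(genomes, labels):
        votes[marker_pattern(genome, markers)][label] += 1
    return {pattern: counts.most_common(1)[0][0]
            for pattern, counts in votes.items()}

def classify(genome, markers, lookup):
    """Return the class for a genome, or 'unknown' for an unseen pattern."""
    return lookup.get(marker_pattern(genome, markers), "unknown")

# Toy data (hypothetical sequences and classes, for illustration only).
MARKERS = ["ACGTAC", "GGATCC"]
TRAIN = ["TTACGTACTT", "TTGGATCCTT", "AAGTACGTAA", "CCCCCCCCCC"]
LABELS = ["A", "B", "A", "C"]
LOOKUP = fit_lookup(TRAIN, LABELS, MARKERS)
```

In this sketch, a genome whose presence/absence pattern never appeared in the training set is reported as "unknown" rather than forced into a class.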
[008] The disclosure provides a method to identify a set of oligonucleotide marker sequences for rapid classification of microbial pathogens from a dataset of microbial genome assemblies, comprising carrying out steps (1)-(8): (1) clustering open reading frames by protein sequence to form a pangenome from the dataset of microbial genome assemblies, wherein individual clusters of open reading frames are designated as “genes”; (2) filtering the genes to identify sets of genes that are associated by a targeted classification scheme; (3) training machine learning models to iteratively identify minimal sets of related “genes” that are (a) able to accurately reproduce a targeted classification scheme, and (b) can be accurately identified by a marker sequence, given all observed variants of the gene and against the background of the pangenome, wherein the identified minimal sets of associated genes that meet criteria (a) and (b) are designated as candidate genes; (4) identifying sequence variants from each of the candidate genes, and from which, subsequences having a defined number of nucleotides are enumerated, wherein the enumerated subsequences of length k are
designated “kmers”; (5) identifying conserved kmers for all candidate genes across the pangenome; (6) filtering the conserved kmers to identify those conserved kmers that accurately reproduce the presence/absence of their corresponding gene, wherein these identified conserved kmers are designated as accurate kmers; (7) training machine learning models to classify genomes from the accurate kmers and filtering out genes that cannot be reproduced accurately from a single accurate kmer, wherein steps (3)-(6) are repeated until a kmer-based machine learning model able to reproduce the target classification scheme with accuracy greater than 95%, 98%, or 99% is generated; and (8) selecting a minimal set of oligonucleotide marker sequences from the accurate machine learning model, wherein the minimal set of oligonucleotide marker sequences can be used to generate primers for the classification of microbial pathogens. In one embodiment, the microbial pathogens are selected from bacterial pathogens, viral pathogens, and fungal pathogens. In another or further embodiment, the dataset comprises microbial genome assemblies from bacteria, viruses or fungi. In another embodiment, for step (1), the microbial genome assemblies contain genomes that have been filtered for quality based on genome status, size, number of contigs, number of genes and consistency. In still another or further embodiment, in step (1), the microbial genome assemblies have been annotated to mark open reading frames in the genomes. In still another or further embodiment, in step (1), the open reading frames are clustered using a program that clusters and compares protein or nucleotide sequences. In yet another or further embodiment, in step (2), the targeted classification scheme comprises genes that are related to drug-resistance, virulence, pathotype, or phylogenetically-defined subgroups.
In another or further embodiment, in step (2), the genes are filtered for those associated with the classification scheme, wherein association is based upon a pair of classes for which the gene is observed in X% of class 1 genomes and Y% of class 2 genomes, where X > Y + 10. In yet another or further embodiment, in step (3), the machine learning models are trained to identify 4-8 genes that most accurately recover the targeted classification scheme. In still another or further embodiment, in step (4), the subsequences have a number of
nucleotides selected from 18 nt, 19 nt, 20 nt, 21 nt, 22 nt, 23 nt, 24 nt, 25 nt, 26 nt, 27 nt, 28 nt, 29 nt, 30 nt, 31 nt, 32 nt, 33 nt, 34 nt, and 35 nt, or a range of nucleotides that includes or is between any two of the foregoing numbers. In yet a further embodiment, the subsequences have a number of nucleotides selected from 20 nt to 30 nt. In still another or further embodiment, in step (5), the conserved kmers occur in greater than 80%, 85%, 90%, 91%, 92%, 93%, 94% or 95% of instances of the gene. In another or further embodiment, in step (6), the accuracy of the conserved kmers to reproduce the presence/absence of their corresponding gene is determined based upon the F1 score. In still another or further embodiment, in step (7), the machine learning models are trained to identify 4-8 kmers that most accurately recover the targeted classification scheme. In still another or further embodiment, of any of the foregoing, the method further comprises: classifying a microbial pathogen using one or more oligonucleotide marker sequences selected in step (8). [009] The disclosure also provides that the foregoing method can be implemented by a computer. In one embodiment, the computer system implementing the method can be linked to an oligonucleotide synthesizer.
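Several of the filtering and kmer steps recited above can be illustrated with a short Python sketch. It covers the class-association test of step (2) (X > Y + 10), kmer enumeration from step (4), the conservation threshold of step (5), and the F1 score of step (6). All function names, default thresholds, and input layouts are assumptions chosen for illustration; the disclosure does not specify an implementation, and this sketch ignores strand orientation, ambiguous bases, and empty classes.

```python
def is_associated(gene_present, labels, class1, class2, margin=10.0):
    """Step (2): gene seen in X% of class1 and Y% of class2, X > Y + margin."""
    def percent(cls):
        hits = [p for p, lab in zip(gene_present, labels) if lab == cls]
        return 100.0 * sum(hits) / len(hits)
    return percent(class1) > percent(class2) + margin

def enumerate_kmers(seq, k):
    """Step (4): set of all subsequences of length k in one sequence variant."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def conserved_kmers(variants, k, min_fraction=0.8):
    """Step (5): kmers found in more than min_fraction of the gene's variants."""
    counts = {}
    for variant in variants:
        for kmer in enumerate_kmers(variant, k):
            counts[kmer] = counts.get(kmer, 0) + 1
    return {kmer for kmer, n in counts.items()
            if n / len(variants) > min_fraction}

def f1_score(kmer, genomes, gene_present):
    """Step (6): F1 of 'kmer occurs in genome' as a predictor of gene presence."""
    predicted = [kmer in g for g in genomes]
    tp = sum(p and t for p, t in zip(predicted, gene_present))
    fp = sum(p and not t for p, t in zip(predicted, gene_present))
    fn = sum(t and not p for p, t in zip(predicted, gene_present))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)
```

For example, with three toy variants `["ACGTACGT", "ACGTACGA", "ACGTACGT"]` and k = 4, the kmer `ACGA` occurs in only one of three variants and is dropped, while kmers shared by all three variants are retained as conserved.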
[0010] The disclosure also provides a computer readable medium comprising instructions to cause a computer to identify a set of oligonucleotide marker sequences for rapid classification of microbial pathogens from a dataset of microbial genome assemblies, the computer instructions causing the computer to: (1) cluster open reading frames by protein sequence to form a pangenome from the dataset of microbial genome assemblies, wherein individual clusters of open reading frames are designated as “genes”; (2) filter the genes to identify sets of genes that are associated by a targeted classification scheme; (3) train machine learning model(s) to iteratively identify minimal sets of related “genes” that are (a) able to accurately reproduce a targeted classification scheme, and (b) can be accurately identified by a marker sequence, given all observed variants of the gene and against the background of the pangenome, wherein the identified minimal sets of associated genes that meet criteria (a) and (b) are designated as candidate genes;
(4) identify sequence variants from each of the candidate genes, and from which, subsequences having a defined number of nucleotides are enumerated, wherein the enumerated subsequences of length k are designated “kmers”; (5) identify conserved kmers for all candidate genes across the pangenome; (6) filter the conserved kmers to identify those conserved kmers that accurately reproduce the presence/absence of their corresponding gene, wherein these identified conserved kmers are designated as accurate kmers; (7) train machine learning models to classify genomes from the accurate kmers and filter out genes that cannot be reproduced accurately from a single accurate kmer, wherein steps (3)-(6) are repeated until a kmer-based machine learning model able to reproduce the target classification scheme with accuracy greater than 95%, 98%, or 99% is generated; and (8) select a minimal set of oligonucleotide marker sequences from the accurate machine learning model, wherein the minimal set of oligonucleotide marker sequences can be used to generate primers for the classification of microbial pathogens. [0011] The disclosure also provides a catalog of oligonucleotide marker sequences obtained by the method or a computer running the computer readable medium of the disclosure. [0012] The disclosure also provides a method of identifying a pathogen in a sample, the method comprising comparing sequence reads obtained from the genome of the pathogen in the sample to the catalog of oligonucleotide markers to identify the pathogen with a complementary sequence.
DESCRIPTION OF DRAWINGS
[0013] Figure 1A-B presents exemplary flowcharts/workflows for identifying a minimal set of marker sequences or “primers” that accurately reproduces a known genome classification scheme using massive public datasets and genetic algorithms.
(A) A flowchart demonstrating how a minimal set of marker sequences can be generated from a collection of genome assemblies, with various decisions indicated to improve machine learning model accuracy. (B) Additional workflows to identify a minimal set of marker sequences or “primers” using different intermediate data types. [0014] Figure 2 displays confusion matrices for classifying E. coli into phylogroups from primer sets of 4, 5, and 6 sequences. For
marker sets of size 4 and 5, predictions corresponding to any of the cryptic clades, non-Escherichia “Non-Esc.”, or unknown were not generated. [0015] Figure 3 displays confusion matrices for classifying E. coli into phylogroups from primer sets of 7 or 8 sequences. The label “Non-Esc.” corresponds to a non-Escherichia classification. [0016] Figure 4 presents interpretation of the 4-sequence primer set for predicting E. coli phylogroup. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data. [0017] Figure 5 presents interpretation of the 5-sequence primer set for predicting E. coli phylogroup. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data. [0018] Figure 6 presents interpretation of the 6-sequence primer set for predicting E. coli phylogroup. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data. [0019] Figure 7 presents interpretation of the 7-sequence primer set for predicting E. coli phylogroup. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data. [0020] Figure 8 presents interpretation of the 8-sequence primer set for predicting E. coli phylogroup. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data. [0021] Figure 9 displays confusion matrices for classifying S. aureus into clonal complexes from primer sets of 4, 5, and 6 sequences. [0022] Figure 10 presents interpretation of the 4-sequence primer set for predicting S. aureus clonal complex. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data.
[0023] Figure 11 presents interpretation of the 5-sequence primer set for predicting S. aureus clonal complex. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data.
[0024] Figure 12 provides interpretation of the 6-sequence primer set for predicting S. aureus clonal complex. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data. [0025] Figure 13A-B provides performance and interpretation of four primers for predicting the ciprofloxacin resistance phenotype of E. coli B2 strains. (A) Confusion matrix for the binary classification of E. coli B2 strains by ciprofloxacin resistance phenotype from the four primers. (B) Interpretation of the four primers to predict ciprofloxacin resistance phenotype. White indicates the primer sequence is present, black indicates absence. Combinations not shown were not encountered in the training data. [0026] Figure 14 provides genetic and antimicrobial resistance profiles of 442 candidate strains for ciprofloxacin resistance marker sequence validation. Genetic cluster membership and predicted ciprofloxacin resistance phenotypes are shown for all 442 candidates (top) and the 72 selected validation strains (bottom). [0027] Figure 15 provides a comparison of predicted vs. experimentally observed ciprofloxacin resistance phenotypes for 72 validation strains. Prediction method “ML” refers to the pangenome-based XGBoost model, “Marker” refers to the original proposed interpretation of the four marker sequences, and “Marker+1” refers to the modified interpretation adding a fifth marker to also capture the S80I double SNP mutation. Strains have been sorted by observed resistance phenotype first, then by strain genetic cluster. [0028] Figure 16 shows detection of ciprofloxacin resistance marker sequence targets using a PCR-based approach. [0029] Figure 17 illustrates a block diagram of an example machine upon which one or more embodiments (e.g., discussed methodologies) can be implemented.
[0030] Figure 18 depicts a block diagram for a system or related method of an embodiment of the present invention in whole or in part.
DETAILED DESCRIPTION
[0031] As used herein and in the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to
"a gene" includes a plurality of such genes and reference to "the gene variant" includes reference to one or more gene variants and equivalents thereof known to those skilled in the art, and so forth. [0032] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood to one of ordinary skill in the art to which this disclosure belongs. Although many methods and reagents are similar or equivalent to those described herein, the exemplary methods and materials are disclosed herein. [0033] All publications mentioned herein are incorporated by reference in full for the purpose of describing and disclosing methodologies that might be used in connection with the description herein. The publications are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the inventors are not entitled to antedate such disclosure by virtue of prior disclosure. Moreover, with respect to any term that is presented in one or more publications that is like, or identical with, a term that has been expressly defined in this disclosure, the definition of the term as expressly provided in this disclosure will control in all respects. [0034] Also, the use of “or” means “and/or” unless stated otherwise. Similarly, “comprise,” “comprises,” “comprising,” “include,” “includes,” and “including” are interchangeable and not intended to be limiting. [0035] It is to be further understood that where descriptions of various embodiments use the term “comprising,” those skilled in the art would understand that in some specific instances, an embodiment can be alternatively described using language “consisting essentially of” or “consisting of.” [0036] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
Allen et al., Remington: The Science and Practice of Pharmacy 22 ed., Pharmaceutical Press (September 15, 2012); Hornyak et al., Introduction to Nanoscience and Nanotechnology, CRC Press (2008); Singleton and Sainsbury, Dictionary of Microbiology and Molecular Biology 3 ed., revised ed., J. Wiley & Sons (New York, NY 2006);
Smith, March’s Advanced Organic Chemistry Reactions, Mechanisms and Structure 7 ed., J. Wiley & Sons (New York, NY 2013); Singleton, Dictionary of DNA and Genome Technology 3 ed., Wiley-Blackwell (November 28, 2012); and Green and Sambrook, Molecular Cloning: A Laboratory Manual 4th ed., Cold Spring Harbor Laboratory Press (Cold Spring Harbor, NY 2012), provide one skilled in the art with a general guide to many of the terms used in the present application. [0037] All headings and subheadings provided herein are solely for ease of reading and should not be construed to limit the invention. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the invention, suitable methods and materials are described below. [0038] It should be understood that this disclosure is not limited to the particular methodology, protocols, and reagents, etc., described herein and as such may vary. The terminology used herein is for the purpose of describing particular embodiments or aspects only and is not intended to limit the scope of the present disclosure. [0039] Other than in the operating examples, or where otherwise indicated, all numbers expressing quantities of ingredients or reaction conditions used herein should be understood as modified in all instances by the term "about." The term "about" when used to describe embodiments of the disclosure, in connection with percentages means ±1%. The term “about,” as used herein can mean within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which can depend in part on how the value is measured or determined, e.g., the limitations of the measurement system. Alternatively, “about” can mean a range of plus or minus 20%, plus or minus 10%, plus or minus 5%, or plus or minus 1% of a given value.
Alternatively, particularly with respect to biological systems or processes, the term can mean within an order of magnitude, within 5-fold, or within 2-fold, of a value. Where particular values are described in the application and claims, unless otherwise stated the term “about” means within an acceptable error range for the particular value. Also, where ranges and/or subranges of values are provided, the
ranges and/or subranges can include the endpoints of the ranges and/or subranges. [0040] For the recitation of numeric ranges herein, each intervening number therebetween with the same degree of precision is explicitly contemplated. For example, for the range of 6-9, the numbers 7 and 8 are contemplated in addition to 6 and 9, and for the range 6.0-7.0, the numbers 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, and 7.0 are explicitly contemplated. [0041] As used herein, the term “amplifying” refers to the process of synthesizing nucleic acid molecules that are complementary to one or both strands of a template nucleic acid molecule. Amplifying a nucleic acid molecule typically includes denaturing the template nucleic acid, particularly if the template nucleic acid is double-stranded, annealing one or more primers (e.g., primers generated by the methods of the disclosure) to the template nucleic acid at a temperature that is below the melting temperatures of the primers, and enzymatically elongating from the primers to generate an amplification product. Generally, synthesis initiates at the 3′ end of a primer and proceeds in a 5′ to 3′ direction along the template nucleic acid strand. Amplification typically requires the presence of deoxyribonucleoside triphosphates, a polymerase enzyme (e.g., DNA or RNA polymerase or T7 for in vitro transcription in TMA) and an appropriate buffer and/or co-factors for optimal activity of the polymerase enzyme (e.g., MgCl2 and/or KCl). [0042] With respect to the term “different taxon of pathogens”, the term is distinct from the “particular taxon of pathogens”. Here, the different taxon of pathogenic microorganisms does not overlap with the particular taxon of pathogens.
For example, if a particular taxon of pathogenic microorganisms includes the family of Flavivirus, the different taxon of pathogenic microorganisms does not include Flavivirus but can include another family of viruses, such as Alphaviruses, bacterial, fungal, archaea, algal, protozoan, and/or parasitic pathogens. If the particular taxon of pathogenic microorganisms and different taxon of pathogenic microorganisms are from the same domain (e.g., bacterial domain), the two taxa identified by the method are distinct.
[0043] The term "microorganism" or “microbial organism” is used in its broadest sense and includes Gram negative aerobic bacteria, Gram positive aerobic bacteria, Gram negative microaerophilic bacteria, Gram positive microaerophilic bacteria, Gram negative facultative anaerobic bacteria, Gram positive facultative anaerobic bacteria, Gram negative anaerobic bacteria, Gram positive anaerobic bacteria, Gram positive asporogenic bacteria, Actinomycetes, fungal microorganisms, protozoan microorganisms and the like. [0044] As used herein, the term “pathogen” refers to a virus, bacterium, protozoan, prion, archaeon, fungus, alga, parasite, or other microbe (helminth) that causes or induces disease or illness in a subject or that may be found in biological and/or environmental samples. The term includes both the disease-causing organism per se and toxins produced by the pathogen (e.g., Shiga toxins) present in a sample. Detection of a pathogen as set forth in the methods disclosed herein includes detection of a portion of the genome of the pathogen or a nucleic acid molecule that is complementary or substantially complementary (i.e., at least 90% complementary) to a portion of the genome of the pathogen. [0045] The terms "polynucleotide", "nucleotide sequence", "nucleic acid" and "oligonucleotide" are used interchangeably. They refer to a polymeric form of nucleotides of any length, either deoxyribonucleotides or ribonucleotides, or analogs thereof.
The following are non-limiting examples of polynucleotides: coding or non-coding regions of a gene or gene fragment, loci (locus) defined from linkage analysis, exons, introns, messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), short interfering RNA (siRNA), short-hairpin RNA (shRNA), micro-RNA (miRNA), ribozymes, cDNA, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, and primers. A polynucleotide may comprise methylated nucleotides and nucleotide analogs. [0046] The terms "polypeptide", "peptide" and "protein" are used interchangeably herein to refer to polymers of amino acids of any length. The polymer may be linear or branched. The terms also encompass an amino acid polymer that has been modified; for example, disulfide bond formation, glycosylation, lipidation, acetylation,
phosphorylation, or any other manipulation, such as conjugation with a labeling component. As used herein the term "amino acid" includes natural and/or unnatural or synthetic amino acids, including glycine and both the D and L optical isomers, and amino acid analogs and peptidomimetics. [0047] As used herein, the term “primer” refers to oligomeric compounds, primarily to oligonucleotides containing naturally occurring nucleotides such as adenine, guanine, cytosine, thymine and/or uracil, but may also include modified oligonucleotides (e.g., modified nucleotides, nucleosides, synthetic nucleotides having modified base moieties and/or modified sugar moieties (See, Protocols for Oligonucleotide Conjugates, Methods in Molecular Biology, Vol 26, (Sudhir Agrawal, Ed., Humana Press, Totowa, N.J., (1994)); and Oligonucleotides and Analogues, A Practical Approach (Fritz Eckstein, Ed., IRL Press, Oxford University Press, Oxford))) that are able to prime polynucleotide (e.g., DNA) synthesis by an enzyme, typically in a template-dependent manner, i.e., the 3′ end of the primer provides a free 3′-OH group to which further nucleotides are attached by the enzyme (e.g., DNA polymerase or reverse transcriptase) establishing a 3′ to 5′ phosphodiester linkage whereby nucleoside triphosphates are used and pyrophosphate is released. Oligonucleotides can be prepared by any suitable method, including, for example, cloning and restriction of appropriate sequences and direct chemical synthesis by a method such as the phosphotriester method of Narang et al., 1979, Meth. Enzymol. 68:90-99; the phosphodiester method of Brown et al., 1979, Meth. Enzymol. 68:109-151; the diethylphosphoramidite method of Beaucage et al., 1981, Tetrahedron Lett. 22:1859-1862; and the solid support method of U.S. Pat. No. 4,458,066. A review of synthesis methods is provided in Goodchild, 1990, Bioconjugate Chemistry 1(3):165-187.
[0048] A primer is typically a single-stranded deoxyribonucleic acid. The appropriate length of a primer depends on the intended use of the primer but typically ranges from 6 to 50 nucleotides. Short primer molecules (e.g., having a length within a range of 11-17 nucleotides) generally require cooler temperatures to form sufficiently stable hybrid complexes with a template (or target) nucleic acid.
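The relationship between primer length and hybridization temperature noted above can be illustrated with the Wallace rule, a widely used rule of thumb that estimates melting temperature as 2 °C per A/T base plus 4 °C per G/C base. This rule is not part of the disclosure and is only a rough approximation intended for short oligonucleotides; the function name and example sequences in the Python sketch below are hypothetical.

```python
def wallace_tm(primer):
    """Estimate melting temperature (deg C) as 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    if at + gc != len(primer):
        raise ValueError("primer must contain only A, C, G, T")
    return 2 * at + 4 * gc

# Hypothetical example primers (not taken from the Sequence Listing).
short_primer = "ACGTACGTACG"             # 11 nt
long_primer = "ACGTACGTACGTACGTACGTAC"   # 22 nt
```

Under this approximation the 11-nt example yields a lower estimate than the 22-nt example, consistent with shorter primers needing cooler temperatures to form stable hybrids. More accurate estimates (e.g., nearest-neighbor thermodynamic models) would be used in practical primer design.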
[0049] The terms "subject," "individual," and "patient" are used interchangeably herein to refer to a vertebrate, preferably a mammal, more preferably a human. Mammals include, but are not limited to, murines, simians, humans, farm animals, sport animals, and pets. [0050] The terms "taxon" or "taxa", "taxonomic group," and "taxonomic unit" are used interchangeably to refer to a group of one or more organisms that comprises a node in a clustering tree. The level of a cluster is determined by its hierarchical order. In one embodiment, a taxon is a group tentatively assumed to be a valid taxon for purposes of phylogenetic analysis. In another embodiment, a taxon is any of the extant taxonomic units under study. In yet another embodiment, a taxon is given a name and a rank. For example, a taxon can represent a domain, a sub-domain, a kingdom, a sub-kingdom, a phylum, a sub-phylum, a class, a sub-class, an order, a sub-order, a family, a subfamily, a genus, a subgenus, or a species. In some embodiments, taxa can represent one or more organisms from the kingdoms eubacteria, protista, or fungi at any level of a hierarchical order. [0051] A challenge in dealing with acute infections (e.g., bacterial infections) in a clinical setting is to obtain rapid identification of the invading pathogen. Once the taxonomic identity of the pathogen has been determined, a second challenge is to determine if it has known antimicrobial resistance (AMR) characteristics, which are important for determining the appropriate treatment modality. Currently, these determinations are made by time-consuming laboratory procedures that are typically performed in a pathology laboratory and require a day or two to perform. [0052] An alternative is to perform genetic assessment of the invading pathogens.
Such tests can be performed more rapidly than pathology laboratory procedures requiring pathogen cultures and test screening, thereby enabling earlier implementation of appropriate treatment modality. Genetic tests require a priori knowledge of what DNA sequences to look for in an invading pathogen. The identity of such sequences can be obtained through the assessment of genomic sequences from a large set of strains of the invading pathogenic strain. Currently, there is a rapid increase in the number of
genomic sequences of strains of major human bacterial pathogens that can be used to achieve such an assessment. [0053] The disclosure provides methods to identify marker sequences for rapid classification of microbial strains through machine learning analysis of genome assemblies. Once the proper phylogroup (i.e., which strain subgroup of a species, which is known to impact the severity of infection) designation of the microbe has been made, a follow-up panel of genetic tests can be applied to determine the AMR status of the microbe. [0054] The methods of the disclosure use various machine learning processes. Machine learning processes may include, but are not limited to, one or more of any combination of the following: Naïve Bayes Classifier, Neural Networks, Decision Trees, Generalized Linear Models, Nearest Neighbors, Support Vector Machines, or “ensemble” methods such as Random Forests that combine the predictions of multiple supervised machine learning models. Still yet, the training may be accomplished through simulation. [0055] In addition, the methods of the disclosure can use various genetic/genome databases. The database of reference sequences can comprise any of a variety of reference sequences. In some embodiments, the reference sequences are from one or more of bacteria, archaea, chromalveolata, viruses, fungi, plants, fish, amphibians, reptiles, birds, mammals, and humans. In some cases, the database of reference sequences consists of sequences from a reference individual or a reference sample source. In this case, the method may further comprise identifying polynucleotides from the sample source as being derived from the reference individual or the reference sample source.
In some embodiments, the database of reference sequences comprises one or more mutations with respect to known polynucleotide sequences, such that a plurality of variants of the known polynucleotide sequences are represented in the database of reference sequences. The database of reference sequences can comprise marker gene sequences for taxonomic classification of bacterial sequences, such as 16S rRNA sequences. [0056] In some embodiments, the database of reference sequences consists of sequences associated with a condition. One or more such sequences may form a biosignature for a condition, a plurality of
which may together form the reference database. In some cases, the record database is associated with a condition of the sample source to establish a biosignature for a condition. When sequences are associated with a condition, the method may further comprise identifying a condition of the sample source by comparison of the record database to a biosignature, including identifying the sample source as having the condition. The condition may be contamination, such as food contamination, surface contamination, or environmental contamination. In some embodiments, the condition is infection. [0057] Where the reference database consists of sequences associated with infectious disease or contamination, the sequences may be derived from and associated with any of a variety of infectious agents. The infectious agent can be bacterial. Non-limiting examples of bacterial pathogens include Mycobacteria (e.g. M. tuberculosis, M. bovis, M. avium, M. leprae, and M. africanum), rickettsia, mycoplasma, chlamydia, and legionella. Other examples of bacterial infections include, but are not limited to, infections caused by Gram positive bacillus (e.g., Listeria, Bacillus such as Bacillus anthracis, Erysipelothrix species), Gram negative bacillus (e.g., Bartonella, Brucella, Campylobacter, Enterobacter, Escherichia, Francisella, Haemophilus, Klebsiella, Morganella, Proteus, Providencia, Pseudomonas, Salmonella, Serratia, Shigella, Vibrio and Yersinia species), spirochete bacteria (e.g., Borrelia species including Borrelia burgdorferi that causes Lyme disease), anaerobic bacteria (e.g., Actinomyces and Clostridium species), Gram positive and negative coccal bacteria, Enterococcus species, Streptococcus species, Pneumococcus species, Staphylococcus species, and Neisseria species. Specific examples of infectious bacteria include, but are not limited to: Helicobacter pylori, Legionella pneumophila, M. intracellulare, M. kansasii, M. 
gordonae, Staphylococcus aureus, Neisseria gonorrhoeae, Neisseria meningitidis, Listeria monocytogenes, Streptococcus pyogenes (Group A Streptococcus), Streptococcus agalactiae (Group B Streptococcus), Streptococcus viridans, Streptococcus faecalis, Streptococcus bovis, Streptococcus pneumoniae, Haemophilus influenzae, Bacillus anthracis, Erysipelothrix rhusiopathiae, Clostridium tetani, Enterobacter aerogenes, Klebsiella pneumoniae, Pasteurella multocida,
Fusobacterium nucleatum, Streptobacillus moniliformis, Treponema pallidum, Treponema pertenue, Leptospira, Rickettsia, Actinomyces israelii, Acinetobacter, Bacillus, Bordetella, Borrelia, Brucella, Campylobacter, Chlamydia, Chlamydophila, Clostridium, Corynebacterium, Enterococcus, Haemophilus, Helicobacter, Mycobacterium, Mycoplasma, Stenotrophomonas, Treponema, Vibrio, Yersinia, Acinetobacter baumannii, Bordetella pertussis, Brucella abortus, Brucella canis, Brucella melitensis, Brucella suis, Campylobacter jejuni, Chlamydia pneumoniae, Chlamydia trachomatis, Chlamydophila psittaci, Clostridium botulinum, Clostridium difficile, Clostridium perfringens, Corynebacterium diphtheriae, Enterobacter sakazakii, Enterobacter agglomerans, Enterobacter cloacae, Enterococcus faecalis, Enterococcus faecium, Escherichia coli, Francisella tularensis, Helicobacter pylori, Legionella pneumophila, Leptospira interrogans, Mycobacterium leprae, Mycobacterium tuberculosis, Mycobacterium ulcerans, Mycoplasma pneumoniae, Pseudomonas aeruginosa, Rickettsia rickettsii, Salmonella typhi, Salmonella typhimurium, Salmonella enterica, Shigella sonnei, Staphylococcus epidermidis, Staphylococcus saprophyticus, Stenotrophomonas maltophilia, Vibrio cholerae, Yersinia pestis, and the like. [0058] FIG. 1A-B provides an exemplary workflow/flowchart to identify marker sequences for rapid classification of microbial strains through machine learning analysis of genome assemblies. As shown, for a species for which a classification scheme is to be developed, relevant public genomic data are used, assembled and structured into a “pangenome” as described in Hyun et al. (BMC Genomics 23(1):7 (2022)), which is incorporated herein in its entirety.
[0059] In one embodiment of the disclosure, starting with all available public genome sequences for the species on the Bacterial and Viral Bioinformatics Resource Center (BV-BRC; see, e.g., internet at bv-brc.org), genomes are filtered for quality based on genome status, size, number of contigs, number of genes, and consistency. Across all remaining high-quality genomes, open reading frames that are either publicly annotated or generated using a program (e.g., Prodigal) are clustered by protein sequence using CD-HIT (Cluster Database at High Identity with Tolerance) into “genes”.
The gene content across all genomes is then represented as a binary matrix of presence/absence calls between genes and genomes. All genomes are classified following a target classification scheme with an appropriate existing method (e.g., by examining existing metadata or implementing the current best practice), and genes are filtered for those associated with the classification (defined as: there exists a pair of classes for which the gene is observed in X% of class 1 genomes and Y% of class 2 genomes, where X > Y + 10). Examples of a target classification scheme include, but are not limited to, classifying by pathogenic vs. nonpathogenic, classifying by drug-resistant vs. drug-susceptible, classifying by virulence, classifying by clonal complex, phylogroup, or other phylogenetically-defined subgroup, etc. [0060] Next, a genetic algorithm (GA) approach is used to iteratively identify minimal sets of genes that (1) are able to accurately reproduce the target classification scheme, and (2) can be accurately identified by a marker sequence, given all observed variants of the gene and against the background of the species’ pangenome. In one embodiment, prior to testing a non-identified sample, 5-15 (e.g., 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 or 15) randomly seeded GA models are trained to identify 4-8 (e.g., 4, 5, 6, 7, or 8) genes that most accurately recover the target classification scheme. More particularly, the machine learning model can be implemented by splitting genomes into training/testing sets, with the following parameter set: num_generations=200, sol_per_pop=512, num_parents_mating=256, keep_elitism=1, parent_selection_type='sss', crossover_type='uniform', mutation_type='random', mutation_probability=0.20. The identified gene sets across the resulting GA models are then combined as a short list of candidate genes from which marker sequences are derived.
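The association filter described above (a gene is kept when some pair of classes satisfies X > Y + 10) can be illustrated with a minimal Python sketch. This is not the disclosed implementation: the function names, the sparse representation of the presence/absence matrix as a set of carrier genomes per gene, and the toy data are all assumptions made for illustration.

```python
def class_percentages(carriers, labels):
    """Percent of genomes in each class that carry a gene.

    carriers: set of genome IDs in which the gene is present (hypothetical layout).
    labels:   dict mapping genome ID -> class label.
    """
    totals, hits = {}, {}
    for genome, cls in labels.items():
        totals[cls] = totals.get(cls, 0) + 1
        hits[cls] = hits.get(cls, 0) + (genome in carriers)
    return {cls: 100.0 * hits[cls] / totals[cls] for cls in totals}


def is_associated(carriers, labels, margin=10.0):
    """True if some pair of classes has X > Y + margin (in percentage points)."""
    pct = class_percentages(carriers, labels).values()
    # The most extreme pair of classes decides whether any pair qualifies.
    return max(pct) - min(pct) > margin


def filter_associated_genes(gene_carriers, labels, margin=10.0):
    """Keep only genes whose frequency differs between at least two classes."""
    return [gene for gene, carriers in gene_carriers.items()
            if is_associated(carriers, labels, margin)]


# Toy data (illustrative only): four genomes in two classes, three genes.
labels = {"g1": "A", "g2": "A", "g3": "B", "g4": "B"}
gene_carriers = {
    "geneX": {"g1", "g2"},              # 100% of A, 0% of B -> kept
    "geneY": {"g1", "g2", "g3", "g4"},  # 100% of both classes -> dropped
    "geneZ": {"g1", "g3"},              # 50% of both classes -> dropped
}
selected = filter_associated_genes(gene_carriers, labels)
```

Comparing only the highest and lowest class percentages is sufficient here, because a qualifying pair exists exactly when the extreme pair qualifies.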
For each candidate gene, all DNA sequence variants are identified and subsequences of length 11-50 nt (e.g., about 25 nt), referred to as “kmers”, are enumerated. Conserved kmers (occurring in > 90% of instances of the gene) are identified, and the presence/absence of such kmers for all candidate genes is computed across all genomes; the kmers are then filtered for those that accurately reproduce the presence/absence of their corresponding gene (e.g., based on an F1 score). GA models are then similarly trained to identify 4-8 of the remaining kmers that accurately recover the target classification. If the final kmer-based GA models are not sufficiently accurate, genes that cannot be reproduced accurately by a single kmer are filtered out and the entire process is repeated, starting from the GA models trained on genes. If a sufficiently accurate GA model is found, then the selected kmers of that model are a novel set of marker sequences that accurately reproduces the original classification scheme. Such marker sequences can then be used to synthesize oligonucleotide probes and/or primers. [0061] This two-phase approach of analyzing the classification at the gene level and then at the kmer level provides two benefits towards developing marker sequence sets. First, there are far fewer genes than unique kmers observed within a pangenome, so the two problems of identifying predictive genes followed by identifying predictive kmers among those genes are both much more computationally tractable than the direct approach of enumerating all observed kmers and identifying predictive kmers directly. Second, this approach yields marker sequences that closely track specific genes, allowing for a more straightforward biological interpretation of proposed marker sequences and potentially facilitating commercial adoption. [0062] The methods of the disclosure are innovative in that they fully utilize big data and generate large pangenome collections which fully represent all publicly available sequences of strains. This improves the accuracy of the classification schema over traditional methods and better identifies rarer classification subtypes. Moreover, the methods of the disclosure provide faster classification of pathogens isolated from a patient than is currently possible. Millions of pathogens are classified in pathology labs in the US annually.
Similarly, the methods of the disclosure can rapidly identify drug-resistant strains. Other applications can also be envisaged, such as screening wastewater or other samples (e.g., patient samples, environmental samples, etc.) for specific strains of bacteria or viruses (e.g., SARS-CoV-2).
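The kmer step of paragraph [0060] (enumerate kmers per gene variant, keep those occurring in more than 90% of instances of the gene, then score each kmer by how well its presence tracks the gene) can be sketched in Python as follows. This is a simplified illustration rather than the disclosed implementation: the function names and toy sequences are hypothetical, reverse-complement matches are ignored, and a short k is used only to keep the example readable (the disclosure uses 11-50 nt, e.g., about 25 nt).

```python
def kmers(seq, k=25):
    """Set of all length-k subsequences of a DNA string."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def conserved_kmers(variants, k=25, min_fraction=0.9):
    """Kmers present in more than min_fraction of gene instances.

    variants: dict mapping variant sequence -> number of observed instances,
              so that each variant is weighted by its frequency.
    """
    total = sum(variants.values())
    counts = {}
    for seq, n in variants.items():
        for km in kmers(seq, k):
            counts[km] = counts.get(km, 0) + n
    return {km for km, c in counts.items() if c / total > min_fraction}


def kmer_f1(kmer, genomes, has_gene):
    """F1 score of 'kmer found in genome' as a predictor of gene presence.

    genomes:  dict genome ID -> assembled DNA string (forward strand only here).
    has_gene: dict genome ID -> bool from the gene presence/absence matrix.
    """
    tp = fp = fn = 0
    for genome, dna in genomes.items():
        found = kmer in dna
        if found and has_gene[genome]:
            tp += 1
        elif found:
            fp += 1
        elif has_gene[genome]:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy data (illustrative only): two variants of one gene, k shortened to 4.
variants = {"ACGTACGA": 9, "ACGTTCGA": 1}
conserved = conserved_kmers(variants, k=4)

genomes = {"g1": "TTACGTACGATT", "g2": "GGACGTTCGAGG",
           "g3": "TTTTGGGGCCCC", "g4": "CCACGTCC"}
has_gene = {"g1": True, "g2": True, "g3": False, "g4": False}
score = kmer_f1("ACGT", genomes, has_gene)
```

In the toy data only `ACGT` occurs in both variants, so it is the only kmer above the 90% threshold; it also occurs by chance in a genome lacking the gene, which lowers its F1 score and shows why the second filtering step is needed.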
[0063] The markers or primers developed by the methods of the disclosure can be used to screen for a pathogen or a pathogen’s antimicrobial susceptibility. The methods (and resulting oligonucleotide compositions and kits) provide improved identification and/or quantification of target nucleic acid molecules in a sample from a subject, e.g., by RT-qPCR and/or next-generation sequencing (NGS). [0064] A number of different assay techniques can use the oligonucleotide primers obtained by the methods of the disclosure including, but not limited to, lateral flow assays, PCR, NGS, Southern blots, northern blots, and the like. [0065] FIG. 17 illustrates a block diagram of an example machine 400 upon which one or more embodiments (e.g., the foregoing discussed methodologies) can be implemented (e.g., run). Examples of machine 400 can include logic, one or more components, circuits (e.g., modules), or mechanisms. Circuits are tangible entities configured to perform certain operations. In an example, circuits can be arranged (e.g., internally or with respect to external entities such as other circuits) in a specified manner. In an example, one or more computer systems (e.g., a standalone, client or server computer system) or one or more hardware processors (processors) can be configured by software (e.g., instructions, an application portion, or an application) as a circuit that operates to perform certain operations as described herein. In an example, the software can reside (1) on a non-transitory machine readable medium or (2) in a transmission signal. In an example, the software, when executed by the underlying hardware of the circuit, causes the circuit to perform the certain operations. [0066] In an example, a circuit can be implemented mechanically or electronically.
For example, a circuit can comprise dedicated circuitry or logic that is specifically configured to perform one or more techniques such as discussed above, such as including a special-purpose processor, a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC). In an example, a circuit can comprise programmable logic (e.g., circuitry, as encompassed within a general-purpose processor or other programmable processor) that can be temporarily configured (e.g., by software) to
perform the certain operations. It will be appreciated that the decision to implement a circuit mechanically (e.g., in dedicated and permanently configured circuitry), or in temporarily configured circuitry (e.g., configured by software) can be driven by cost and time considerations. [0067] The various operations of method examples described herein can be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors can constitute processor-implemented circuits that operate to perform one or more operations or functions. In an example, the circuits referred to herein can comprise processor-implemented circuits. [0068] Similarly, the methods described herein can be at least partially processor-implemented. For example, at least some of the operations of a method can be performed by one or more processors or processor-implemented circuits. The performance of certain of the operations can be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In an example, the processor or processors can be located in a single location (e.g., within a home environment, an office environment or as a server farm), while in other examples the processors can be distributed across a number of locations. [0069] The one or more processors can also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations can be performed by a group of computers (as examples of machines including processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., Application Program Interfaces (APIs)).
[0070] Example embodiments (e.g., apparatus, systems, or methods) can be implemented in digital electronic circuitry, in computer hardware, in firmware, in software, or in any combination thereof. Example embodiments can be implemented using a computer program product (e.g., a computer program, tangibly embodied in an information carrier or in a machine readable medium, for execution
Attorney docket No.00015-421WO1 by, or to control the operation of, data processing apparatus such as a programmable processor, a computer, or multiple computers). [0071] A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a software module, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. [0072] In an example, operations can be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Examples of method operations can also be performed by, and example apparatus can be implemented as, special purpose logic circuitry (e.g., a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)). [0073] The computing system can include clients and servers. A client and server are generally remote from each other and generally interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In embodiments deploying a programmable computing system, it will be appreciated that both hardware and software architectures require consideration. Specifically, it will be appreciated that the choice of whether to implement certain functionality in permanently configured hardware (e.g., an ASIC), in temporarily configured hardware (e.g., a combination of software and a programmable processor), or a combination of permanently and temporarily configured hardware can be a design choice. 
Below are set out hardware (e.g., machine 400) and software architectures that can be deployed in example embodiments. [0074] In an example, the machine 400 can operate as a standalone device or the machine 400 can be connected (e.g., networked) to other machines. [0075] In a networked deployment, the machine 400 can operate in the capacity of either a server or a client machine in server-client
network environments. In an example, machine 400 can act as a peer machine in peer-to-peer (or other distributed) network environments. The machine 400 can be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a mobile telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) specifying actions to be taken (e.g., performed) by the machine 400. Further, while only a single machine 400 is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein. [0076] Example machine (e.g., computer system) 400 can include a processor 402 (e.g., a central processing unit (CPU), a graphics processing unit (GPU) or both), a main memory 404 and a static memory 406, some or all of which can communicate with each other via a bus 408. The machine 400 can further include a display unit 410, an alphanumeric input device 412 (e.g., a keyboard), and a user interface (UI) navigation device 411 (e.g., a mouse). In an example, the display unit 410, input device 412 and UI navigation device 411 can be a touch screen display. The machine 400 can additionally include a storage device (e.g., drive unit) 416, a signal generation device 418 (e.g., a speaker), a network interface device 420, and one or more sensors 421, such as a global positioning system (GPS) sensor, compass, accelerometer, or other sensor. [0077] The storage device 416 can include a machine readable medium 422 on which is stored one or more sets of data structures or instructions 424 (e.g., software) embodying or utilized by any one or more of the methodologies or functions described herein.
The instructions 424 can also reside, completely or at least partially, within the main memory 404, within static memory 406, or within the processor 402 during execution thereof by the machine 400. In an example, one or any combination of the processor 402, the main memory 404, the static memory 406, or the storage device 416 can constitute machine readable media.
[0078] While the machine readable medium 422 is illustrated as a single medium, the term “machine readable medium” can include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that are configured to store the one or more instructions 424. The term “machine readable medium” can also be taken to include any tangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure or that is capable of storing, encoding or carrying data structures utilized by or associated with such instructions. The term “machine readable medium” can accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media. Specific examples of machine readable media can include non-volatile memory, including, by way of example, semiconductor memory devices (e.g., Electrically Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM)) and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. [0079] The instructions 424 can further be transmitted or received over a communications network 426 using a transmission medium via the network interface device 420 utilizing any one of a number of transfer protocols (e.g., frame relay, IP, TCP, UDP, HTTP, etc.). Example communication networks can include a local area network (LAN), a wide area network (WAN), a packet data network (e.g., the Internet), mobile telephone networks (e.g., cellular networks), Plain Old Telephone Service (POTS) networks, and wireless data networks (e.g., IEEE 802.11 standards family known as Wi-Fi®, IEEE 802.16 standards family known as WiMax®), peer-to-peer (P2P) networks, among others.
The term “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding or carrying instructions for execution by the machine, and includes digital or analog communications signals or other intangible medium to facilitate communication of such software. [0080] FIG. 18 is a block diagram that illustrates a system 130 including a computer system 140 and the associated Internet 11 connection upon which an embodiment may be
Attorney docket No.00015-421WO1 implemented. Such configuration is typically used for computers (hosts) connected to the Internet 11 and executing a server or a client (or a combination) software. A source computer such as laptop, an ultimate destination computer and relay servers, for example, as well as any computer or processor described herein, may use the computer system configuration and the Internet connection shown in FIG. 18. The system 140 may be used as a portable electronic device such as a notebook/laptop computer, a media player (e.g., MP3 based or video player), a cellular phone, a Personal Digital Assistant (PDA), a sample device, an image processing device (e.g., a digital camera or video recorder), and/or any other handheld computing devices, or a combination of any of these devices. Note that while FIG. 17 illustrates various components of a computer system, it is not intended to represent any particular architecture or manner of interconnecting the components; as such details are not germane to the present invention. It will also be appreciated that network computers, handheld computers, cell phones and other data processing systems which have fewer components or perhaps more components may also be used. Computer system 140 includes a bus 137, an interconnect, or other communication mechanism for communicating information, and a processor 138, commonly in the form of an integrated circuit, coupled with bus 137 for processing information and for executing the computer executable instructions. Computer system 140 also includes a main memory 134, such as a Random Access Memory (RAM) or other dynamic storage device, coupled to bus 137 for storing information and instructions to be executed by processor 138. [0081] Main memory 134 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 138. 
Computer system 140 further includes a Read Only Memory (ROM) 136 (or other non-volatile memory) or other static storage device coupled to bus 137 for storing static information and instructions for processor 138. A storage device 135, such as a magnetic disk or optical disk, a hard disk drive for reading from and writing to a hard disk, a magnetic disk drive for reading from and writing to a magnetic disk, and/or an optical disk drive (such as DVD) for
Attorney docket No.00015-421WO1 reading from and writing to a removable optical disk, is coupled to bus 137 for storing information and instructions. The hard disk drive, magnetic disk drive, and optical disk drive may be connected to the system bus by a hard disk drive interface, a magnetic disk drive interface, and an optical disk drive interface, respectively. The drives and their associated computer-readable media provide non- volatile storage of computer readable instructions, data structures, program modules and other data for the general purpose computing devices. Typically computer system 140 includes an Operating System (OS) stored in a non-volatile storage for managing the computer resources and provides the applications and programs with an access to the computer resources and interfaces. An operating system commonly processes system data and user input, and responds by allocating and managing tasks and internal system resources, such as controlling and allocating memory, prioritizing system requests, controlling input and output devices, facilitating networking and managing files. Non-limiting examples of operating systems are Microsoft Windows, Mac OS X, and Linux. [0082] The term “processor” is meant to include any integrated circuit or other electronic device (or collection of devices) capable of performing an operation on at least one instruction including, without limitation, Reduced Instruction Set Core (RISC) processors, CISC microprocessors, Microcontroller Units (MCUs), CISC-based Central Processing Units (CPUs), and Digital Signal Processors (DSPs). The hardware of such devices may be integrated onto a single substrate (e.g., silicon “die”), or distributed among two or more substrates. Furthermore, various functional aspects of the processor may be implemented solely as software or firmware associated with the processor. 
[0083] Computer system 140 may be coupled via bus 137 to a display 131, such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a flat screen monitor, a touch screen monitor or similar means for displaying text and graphical data to a user. The display may be connected via a video adapter for supporting the display. The display allows a user to view, enter, and/or edit information that is relevant to the operation of the system. An input device 132, including alphanumeric and other keys, is
Attorney docket No.00015-421WO1 coupled to bus 137 for communicating information and command selections to processor 138. Another type of user input device is cursor control 133, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 138 and for controlling cursor movement on display 131. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane. [0084] The computer system 140 may be used for implementing the methods and techniques described herein. According to one embodiment, those methods and techniques are performed by computer system 140 in response to processor 138 executing one or more sequences of one or more instructions contained in main memory 134. Such instructions may be read into main memory 134 from another computer-readable medium, such as storage device 135. Execution of the sequences of instructions contained in main memory 134 causes processor 138 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the arrangement. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software. [0085] The term “computer-readable medium” (or “machine-readable medium”) as used herein is an extensible term that refers to any medium or any memory, that participates in providing instructions to a processor, (such as processor 138) for execution, or any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). 
Such a medium may store computer- executable instructions to be executed by a processing element and/or control logic, and data which is manipulated by a processing element and/or control logic, and may take many forms, including but not limited to, non-volatile medium, volatile medium, and transmission medium. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 137. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications, or other form of propagated
signals (e.g., carrier waves, infrared signals, digital signals, etc.). Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch-cards, paper-tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read. [0086] Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to processor 138 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 140 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 137. Bus 137 carries the data to main memory 134, from which processor 138 retrieves and executes the instructions. The instructions received by main memory 134 may optionally be stored on storage device 135 either before or after execution by processor 138. [0087] Computer system 140 also includes a communication interface 141 coupled to bus 137. Communication interface 141 provides a two-way data communication coupling to a network link 139 that is connected to a local network 111. For example, communication interface 141 may be an Integrated Services Digital Network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line.
As another non-limiting example, communication interface 141 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. For example, Ethernet based connection based on IEEE802.3 standard may be used such as 10/100 BaseT, 1000 BaseT (gigabit Ethernet), 10 gigabit Ethernet (10 GE or 10 GbE or 10 GigE per IEEE Std 802.3ae-2002 as standard), 40 Gigabit Ethernet (40 GbE), or 100 Gigabit Ethernet (100 GbE as per Ethernet
standard IEEE P802.3ba), as described in Cisco Systems, Inc. Publication number 1-587005-001-3 (June 1999), “Internetworking Technologies Handbook”, Chapter 7: “Ethernet Technologies”, pages 7-1 to 7-38, which is incorporated in its entirety for all purposes as if fully set forth herein. In such a case, the communication interface 141 typically includes a LAN transceiver or a modem, such as Standard Microsystems Corporation (SMSC) LAN91C111 10/100 Ethernet transceiver described in the Standard Microsystems Corporation (SMSC) data-sheet “LAN91C111 10/100 Non-PCI Ethernet Single Chip MAC+PHY” Data-Sheet, Rev. 15 (Feb. 20, 2004), which is incorporated in its entirety for all purposes as if fully set forth herein. [0088] Wireless links may also be implemented. In any such implementation, communication interface 141 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information. [0089] Network link 139 typically provides data communication through one or more networks to other data devices. For example, network link 139 may provide a connection through local network 111 to a host computer or to data equipment operated by an Internet Service Provider (ISP) 142. ISP 142 in turn provides data communication services through the world wide packet data communication network Internet 11. Local network 111 and Internet 11 both use electrical, electromagnetic or optical signals that carry digital data streams. Also, satellite and network satellite communication and modules may be implemented. The signals through the various networks and the signals on the network link 139 and through the communication interface 141, which carry the digital data to and from computer system 140, are exemplary forms of carrier waves transporting the information.
[0090] A received code may be executed by processor 138 as it is received, and/or stored in storage device 135, or other non-volatile storage for later execution. In this manner, computer system 140 may obtain application code in the form of a carrier wave. [0091] The concept of rapid classification of DNA for applications in various clinical, agricultural, environmental and military/forensic scenarios is disclosed herein, and may be
implemented and utilized with the related processors, networks, computer systems, internet, and components and functions according to the schemes disclosed herein. [0092] The methods and computer implemented methods of the disclosure can be implemented on the computers, systems and architecture described herein. The methods may be implemented by a computer program or programs stored on a computer readable medium such that the program causes the computer to carry out the methods and steps set forth above. [0093] Although some embodiments have been described, the same and other embodiments are described in the Examples below, which are meant to illustrate, but not limit, the invention. EXAMPLES [0094] Example 1: Primers for rapid classification of E. coli isolates into Clermont phylogroups. 15,278 E. coli genomes were split between a training set of 14,638 public genomes from BV-BRC and RefSeq using quality control measures as described in Hyun et al. (BMC Genomics 23(1):7 (2022)) and a validation set of 640 internal genomes. All genomes were assigned phylogroup annotations in line with current practice using ClermonTyping. 293,826 genes were identified through pangenome construction and were filtered for those potentially associated with the phylogroup classification (defined as: there exists a pair of classes for which the gene is observed in X% of class 1 genomes and Y% of class 2 genomes, where X > Y + 10), yielding 14,836 genes. [0095] Genetic algorithms (GAs) were applied to iteratively identify genes that both (1) accurately discriminate phylogroups and (2) can be reliably identified by a marker sequence, given all observed variants and against the background of the E. coli pangenome. In a single iteration of this process, GAs were first trained to identify 4-8 genes from this set that best reproduce phylogroup classification, randomly seeded with seed=1-10 for a total of 50 GA models (five set sizes, ten seeds each).
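The scoring of a candidate gene set, as detailed in the sentences that follow, can be sketched in Python: genomes are grouped by their presence/absence pattern over the selected genes, and purity is the fraction of genomes that belong to the majority class of their group. This is an illustrative sketch only. The function names and toy data are hypothetical, and the reciprocal form used for `fitness` (with a small `eps` to avoid division by zero) is an assumption chosen to match the stated aim that small purity gains near 100% keep producing larger fitness gains.

```python
from collections import Counter, defaultdict


def cluster_purity(gene_set, gene_carriers, labels):
    """Purity of the clusters induced by presence/absence patterns.

    gene_set:      list of selected gene names.
    gene_carriers: dict gene -> set of genome IDs carrying it (hypothetical layout).
    labels:        dict genome ID -> true class (e.g., phylogroup).
    """
    clusters = defaultdict(Counter)
    for genome, cls in labels.items():
        # Each distinct presence/absence combination is one predicted cluster.
        pattern = tuple(genome in gene_carriers[g] for g in gene_set)
        clusters[pattern][cls] += 1
    majority_total = sum(max(c.values()) for c in clusters.values())
    return majority_total / len(labels)


def fitness(purity, eps=1e-9):
    """Assumed reciprocal fitness; grows sharply as purity approaches 1."""
    return 1.0 / (1.0 - purity + eps)


# Toy data (illustrative only): six genomes in three classes, two genes.
labels = {"g1": "A", "g2": "A", "g3": "B", "g4": "B", "g5": "C", "g6": "C"}
gene_carriers = {"gene1": {"g1", "g2"}, "gene2": {"g3", "g4", "g5"}}

purity_two = cluster_purity(["gene1", "gene2"], gene_carriers, labels)
purity_one = cluster_purity(["gene1"], gene_carriers, labels)
```

With both genes the three patterns separate class A cleanly but mix one class C genome with class B, giving a purity of 5/6; with `gene1` alone classes B and C collapse into one cluster and purity drops to 4/6.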
Performance was quantified by treating all possible presence/absence combinations of a given gene set as a predicted cluster and computing the purity of the resulting cluster predictions against true phylogroups for all genomes. Fitness was defined as 1/(1 - purity) to enable smaller improvements closer to 100% purity to continue translating to larger fitness gains. GAs were implemented using PyGAD with the following parameters (num_generations=200, sol_per_pop=512, num_parents_mating=256, keep_elitism=1, crossover_type='uniform', mutation_type='random', mutation_probability=0.20, parent_selection_type='sss'), trained on the public genomes and evaluated on the internal genomes. Across all trained GAs, solutions at generation 110, 120, ..., 200 were extracted and genes selected in at least three solutions were identified as candidates.

[0096] For each candidate gene, all 25 nt subsequences or “kmers” were counted for all observed variants of the gene, and kmers occurring in at least 90% of variants weighted by frequency were identified as primer candidates. 100 genomes were randomly sampled for each gene with balanced representation of genomes with/without the gene and across all phylogroups, and the presence/absence of primer candidates in the sampled genomes was used to compute the F1 score of each primer candidate at recovering its corresponding gene. Genes for which no primer candidate had F1 > 90% were removed, completing an iteration of the marker identification process.

[0097] A total of six iterations were completed, after which all primer candidates across all iterations with F1 > 80% were identified. The presence/absence of these primer candidates across all genomes was computed, and GAs were similarly trained to identify 4-8 primers from this set that best discriminate phylogroups (using identical parameters as before, except num_generations=1000). The primers selected by the most accurate GA for each primer set size (4-8) are included as the final set of marker sequences for classifying E. coli by phylogroup. The marker sequences for each size (4-8) are available in Table 1, with annotations for corresponding genes derived from eggNOG-mapper. Confusion matrices describing the performance of each marker set are presented in FIG.
2 and FIG. 3. Interpretation of marker sequences to predict E. coli phylogroup is described in FIGs. 4-8.

[0098] Table 1: Sets of marker sequences for the identification of E. coli phylogroups. For each marker, the predicted protein product of the targeted gene, gene name, and gene identifiers are shown when available. Markers 4B and 8A (starred) are identical.

ID | Primer | Primer Target Description | Gene | bnum | COG | KEGG |
---|---|---|---|---|---|---|
1-4A | TGCAATACCGCATTCATCACCTCCG (SEQ ID NO:1) | Regulator of protease activity, stomatin/prohibitin superfamily | hflC | - | COG0330 | - |
1-4B* | TTATGGCGGCGATTATCGTCATGGC (SEQ ID NO:2) | Uncharacterized small protein | yldA | b4734 | - | - |
1-4C | GTGCTTGCAGCACATGTTGCGGATC (SEQ ID NO:3) | DNA-binding transcriptional regulator | yiaG | b3555 | COG2944 | K07726 |
1-4D | TTTCCTGTTATTCGGTAATGGCGAT (SEQ ID NO:4) | Predicted peptidase | abgB | - | COG1473 | K01451 |
1-5A | GTTGTCACAATCTAAATTTGCCGAT (SEQ ID NO:5) | DNA-binding transcriptional regulator | yiaG | b3555 | COG2944 | K07726 |
1-5B | CAGTTAACGAAGCCGATTGATCATT | of | | - | COG0330 | - |
1-5C | GGCGGTTTCCTGTTATTCGGTAATG (SEQ ID NO:7) | Predicted peptidase | abgB | - | COG1473 | K01451 |
1-5D | CGCGATATCGGTCTCGACGGCGTCG (SEQ ID NO:8) | Coproporphyrinogen III oxidase | hutW | - | COG0635 | K21936 |
1-5E | GCCGATGATTTATTTGATAAATTAG (SEQ ID NO:9) | Antitoxin component of type II toxin-antitoxin system (YafQ- | dinJ | b0226 | COG3077 | K07473 |
1-6D | TACTGGCCGGGATGGGGCTGACCAT | Antitoxin component of type II toxin-antitoxin system (YafQ- | dinJ | b0226 | COG3077 | K07473 |
1-7D | TACCGTCTGCGAATGGGGCAACGTC (SEQ ID NO:19) | Heme utilization protein ChuX/HutX | hutX | - | COG3721 | K07227 |
1-7E | GTCATTTATATCAACTCATCACGCG (SEQ ID NO:20) | TRAP-type C4-dicarboxylate transporter subunit | yiaM | b3577 | COG3090 | K21394 |
1-7F | CCAATTTCTTTGTTGCAGAAAAAGT (SEQ ID NO:21) | Stress response protein, single-species biofilm formation | yjaA | b4011 | 2DQT3 | - |
1-7G | CGCGATTGTGGGTGACGCGGTCAAT (SEQ ID NO:22) | SNARE-associated membrane protein | dedA | b2317 | COG0586 | K03975 |
1-8A* | TTATGGCGGCGATTATCGTCATGGC (SEQ ID NO:23) | Uncharacterized protein | yldA | b4734 | - | - |
1-8B | TAAGAATGAAGAGAAATCACTAAAC (SEQ ID NO:24) | Uncharacterized protein | yrhA | - | 292QA | - |
1-8C | ATGGTTATTTCTTGATGAACCAACT (SEQ ID NO:25) | ABC-type hemin transporter subunit (hmuTUV) | hmuV | - | COG4559 | K02013 |
1-8D | CACGGTGCTTGTCGATCTGAACGAT (SEQ ID NO:26) | PfkB family carbohydrate kinase | scrK | - | COG0524 | K00847 |
 | GCAGCCGTCGCGGGGATGTCTGGTG | Glutamate mutase | | | | |
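The purity-based scoring used by the genetic algorithms in Example 1 can be illustrated with a short Python sketch. Each distinct presence/absence profile over the selected genes is treated as one predicted cluster, and purity is the fraction of genomes that belong to the majority class of their cluster. The fitness form `1 / (1 - purity)` is an assumption made for this sketch (it is one form consistent with the stated goal that gains near 100% purity keep producing large fitness gains), and the `eps` cap that avoids division by zero at perfect purity is likewise an addition of this sketch.

```python
from collections import Counter, defaultdict


def cluster_purity(profiles, labels):
    """Purity of clusters defined by identical presence/absence profiles.

    profiles: one tuple of 0/1 values per genome (one entry per selected gene).
    labels:   true class of each genome, in the same order as profiles.
    """
    clusters = defaultdict(Counter)
    for profile, label in zip(profiles, labels):
        clusters[tuple(profile)][label] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels)


def fitness(purity, eps=1e-6):
    """Assumed fitness form 1 / (1 - purity), capped at 1 / eps for purity == 1."""
    return 1.0 / max(1.0 - purity, eps)


# Toy example (hypothetical data): two genes, five genomes, three classes.
toy_profiles = [(1, 0), (1, 0), (0, 1), (0, 1), (0, 1)]
toy_labels = ["A", "A", "B", "B", "D"]
toy_purity = cluster_purity(toy_profiles, toy_labels)  # 4 of 5 genomes match their cluster majority
toy_fitness = fitness(toy_purity)
```

Under this assumed form, moving from 90% to 99% purity raises fitness roughly tenfold, whereas plain purity would change by only 0.09, which is the behavior the text describes.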
isolates into clonal complexes. S. aureus genomes of high quality and with clonal complex metadata were identified from the BV-BRC database, resulting in 753 genomes. The genomes spanned 14 clonal complexes with at least 7 genomes each, as well as 8 other rarer clonal complexes which were grouped together as “other”. 10% of genomes were randomly selected to be the validation set, with the remaining 90% as the training set. An S. aureus pangenome was prepared similarly to the previous example, and genes were filtered for those present in at least three genomes or missing in at least three genomes, resulting in 6,990 genes.

[00100] Sets of genes and marker sequences were iteratively identified using the same genetic algorithm approach as the previous example and as described in FIG. 1, with a total of four iterations completed. The marker sequences were developed for sets of size 4, 5, and 6, and are presented in Table 2. Annotations for corresponding genes are supplemented with blastp results from NCBI, due to poor coverage from eggNOG-mapper. Confusion matrices describing the performance of each marker set are presented in FIG. 9. Interpretation of marker sequences to predict S. aureus clonal complex is described in FIGs. 10-12.

Table 2: Sets of marker sequences for the identification of S. aureus clonal complexes. For each marker, the predicted protein product of the targeted gene and gene identifiers are shown when available. “RefSeq/GenBank” refers to the most common variant of the gene targeted by the primer.

ID | Primer | Primer Target Description | RefSeq/GenBank | COG |
---|---|---|---|---|
2-4A | CATGACATGTTCGTAAATGATGATT (SEQ ID NO:31) | Staphylococcal enterotoxin type 26 | WP_001622271.1 | 2CC33 |
2-4B | TCTGAAAAACCAAATTGTACAGACG (SEQ ID NO:32) | MFS transporter | WP_000130758.1 | COG0477 |
2-4C | CAAATGTATAAATAATAAATGCTAT (SEQ ID NO:33) | Uncharacterized membrane protein | WP_000956429.1 | COG4858 |
2-4D | TTAGATAATTATTTAGTATTAGCAT (SEQ ID NO:34) | HTH-type transcriptional regulator SarT | ADC38645.1 | COG1846 |
2-5A | TTAGAATCTTTTGCCTTTACCGCAT (SEQ ID NO:35) | LPXTG-anchored surface protein SasK | AYV00666.1 | - |
 | | | | - |
2-5C | CTTAAGCATAAAAATTTATATGAAT (SEQ ID NO:37) | Staphylococcal enterotoxin type U | WP_000764690.1 | 2CC33 |
2-5D | TCAATTTTATAGTCTGTAGTCTTTG (SEQ ID NO:38) | Iron-hydroxamate ABC transporter substrate-binding protein | WP_000825510.1 | COG0614 |
2-5E | TGATCAGCCATTGACTTAATCGGTG (SEQ ID NO:39) | Uncharacterized membrane protein | WP_000956429.1 | COG4858 |
2-6A | CATTTAACTGATTAGTATCTAATTT (SEQ ID NO:40) | Hypothetical protein | WP_000410720.1 | - |
2-6B | ATTGGGACAATATTATTAAAAGCAT (SEQ ID NO:41) | ECF-type riboflavin transporter substrate-binding protein | WP_000743714.1 | COG4720 |
2-6C | CATTAAAAAAGATTCATAAAGGAAT (SEQ ID NO:42) | RES family NAD+ phosphorylase | WP_103145692.1 | - |
2-6D | TTCAATTGTTCTGGTTTAGGATTGC (SEQ ID NO:43) | Staphylococcal enterotoxin type U | WP_000764690.1 | 2CC33 |
2-6E | TATAGTCTGTAGTCTTTGTCGAGTT (SEQ ID NO:44) | Iron-hydroxamate ABC transporter substrate-binding protein | WP_000825510.1 | COG0614 |
2-6F | AATAACAATTCAACACGTAATTTTT (SEQ ID NO:45) | Uncharacterized membrane protein | WP_000956429.1 | COG4858 |

[00101] Example 3: Primers for determining resistance against ciprofloxacin for E. coli isolates in the B2 phylogroup. E. coli genomes from Example 1 were filtered to those with antimicrobial resistance metadata against ciprofloxacin available on BV-BRC, resulting in 1044 genomes (179 resistant, 865 susceptible). To
capture mechanisms of resistance conferred by point mutations, the initial feature set was expanded from genes to also include individual amino acid variants of those genes or “alleles”, resulting in 267,328 initial genetic features. These features were pre-filtered based on association with the resistance phenotype similarly to Example 1, yielding 13,953 features that satisfy: the feature is present in X% of resistant genomes and Y% of susceptible genomes and |X - Y| > 10.

[00102] Smaller sets of features predictive of resistance were identified using the same genetic algorithm approach as the previous examples. GA models from 10 different randomization states were trained to identify 4-8 features that distinguish resistant from susceptible genomes, for a total of 50 models. Across all trained GAs, solutions at generation 110, 120, ..., 200 were extracted and features selected in at least three solutions were identified as candidate features to target. Primer candidates for each feature were identified using the same kmer counting and F1 score filtering approach as the previous examples. Only a single iteration of the full workflow described in FIG. 1 was completed before sufficiently accurate primer sets were identified. However, it is expected that the accuracy of the primer set can be improved with additional iterations and/or longer GA generation limits, as in Examples 1-2. A single set of marker sequences (size 4) is available in Table 3 with annotations for corresponding genes derived from eggNOG-mapper and NCBI blastp. Larger primer sets were excluded as they targeted similar sets of genes and did not confer meaningful improvements to accuracy. The confusion matrix and interpretation of this marker sequence set are available in FIG. 13. Notably, the GA approach independently selected primers targeting regions of gyrA and parC, which are known determinants of ciprofloxacin resistance.
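The kmer counting and F1 filtering steps referred to above can be sketched as follows. This is a simplified illustration under stated assumptions: variants are given as a mapping from sequence to the number of genomes carrying that variant, genomes are plain strings searched on one strand only (reverse complements are not considered), and the function names and input formats are hypothetical rather than taken from the disclosure.

```python
from collections import Counter


def conserved_kmers(variant_counts, k=25, min_fraction=0.90):
    """Kmers present in at least min_fraction of variants, weighted by variant frequency.

    variant_counts: dict mapping each observed variant sequence of a gene
                    to the number of genomes carrying it (assumed format).
    """
    total = sum(variant_counts.values())
    weight = Counter()
    for seq, count in variant_counts.items():
        # Use a set so a kmer repeated within one variant is counted once.
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            weight[kmer] += count
    return {kmer for kmer, w in weight.items() if w / total >= min_fraction}


def kmer_f1(kmer, genomes, has_gene):
    """F1 score of 'kmer found in genome' as a predictor of 'gene present in genome'.

    genomes:  dict genome ID -> genome sequence (single strand, assumed format).
    has_gene: dict genome ID -> bool, true gene presence from the pangenome.
    """
    tp = fp = fn = 0
    for genome_id, sequence in genomes.items():
        predicted = kmer in sequence
        actual = has_gene[genome_id]
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual and not predicted:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy example with k=4 (hypothetical data): one common and one rare variant.
toy_variants = {"ACGTAC": 9, "ACGTTC": 1}
toy_candidates = conserved_kmers(toy_variants, k=4)
```

In the workflow described, a candidate kmer would then be kept only if its F1 score on the sampled genomes exceeds the chosen threshold.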
Table 3: Four marker sequences for the determination of resistance against ciprofloxacin in E. coli B2 strains. For each primer, the predicted protein product of the targeted gene, gene name, and gene identifiers are shown when available. The target of primer 3-4D is an undercharacterized protein represented by ESA89826.1 (GenBank).

ID | Primer | Primer Target Description | Gene | bnum | COG | KEGG |
---|---|---|---|---|---|---|
3-4A | TTCGTGGTATTCGTTTAGGCGAAGG (SEQ ID NO:46) | DNA gyrase subunit A | gyrA | b2231 | COG0188 | K02469 |
3-4B | TAACAGGCAATATCGCCGTGCGGAT (SEQ ID NO:47) | DNA topoisomerase IV subunit A | parC | b3019 | COG0188 | K02621 |
3-4C | GTTATCGCGATGAATATAAACTGGC (SEQ ID NO:48) | Putative FAD-linked oxidoreductase | ydiJ | b1687 | COG0247 | - |
3-4D | CAGCATGGCCCATCCTACTGAAACT | | | | | - |

[00103] Validation of Example 3: Marker sequences for determining resistance against ciprofloxacin for E. coli isolates in the B2 phylogroup

[00104] Selection of a genetically and phenotypically diverse set of strains for marker validation. A set of 442 B2 E. coli strains not used for marker sequence design were characterized as follows. Starting with the strains’ genome assemblies, pairwise genetic distances were computed using MASH (Ondov et al., Genome Biol, 17:132, 2016), which were then used to generate a hierarchical clustering (average linkage) of the strains and assign the strains to one of 23 resulting genetic clusters. AMR phenotypes (binary ciprofloxacin resistant/susceptible calls) were predicted in silico and independently of the marker sequences by first constructing a pangenome combining 1260 publicly available B2 strains with ciprofloxacin resistance data from BV-BRC (Olson et al., Nucl. Acids. Res., 51:D678-689, 2023) with the 442 candidate validation strains. An XGBoost machine learning model was trained on the 1260 BV-BRC strains to predict ciprofloxacin resistance from pangenomic features (n_estimators = 32, max_depth = 3, colsample_bytree = 0.75, subsample = 0.75), achieving mean test MCC=0.95 and accuracy=0.98 during 5-fold cross validation. This model was then used to assign predicted AMR phenotypes to the 442 candidate validation strains.

[00105] From the characterized 442 candidate validation strains, a balanced set of 72 strains was selected by randomly selecting 2-4 predicted-susceptible strains from each MASH cluster to a total of 36 strains, and similarly for 36 predicted-resistant strains. Clusters with fewer than four strains were excluded. Only two MASH clusters had any predicted-resistant strains, compared to 11 with predicted-susceptible strains (FIG. 14).
[00106] Characterization of proposed marker sequences against 72 validation strains. The four marker sequences were aligned to the genome assemblies of the 72 validation strains and interpreted against their corresponding consensus sequences.
● 3-4A: Targets a stable region of gyrA (DNA gyrase subunit A) matching the consensus sequence. SNPs against the consensus are observed rarely in 3/25 positions covered by this sequence.
● 3-4B: Targets part of the quinolone resistance-determining region of parC (DNA topoisomerase IV subunit A), capturing a single SNP against the consensus that yields the resistance mutation S80I (AGT -> ATT).
● 3-4C: Targets a stable region of ydiJ (putative FAD-linked oxidoreductase). Aligned exactly in all 72 validation strains. Was previously found missing very rarely in the initial training strains used to develop the marker sequences.
● 3-4D: Targets a stable region of rarely detected abi (abortive infection family protein). Aligned in only 6 validation strains, and all such alignments were exact matches.

[00107] Ciprofloxacin resistance testing of validation strains. Ciprofloxacin resistance was measured experimentally in growth/no growth experiments with 0.5 mg/L CIP. Each strain isolate was grown on CA-MHB media at 37°C, transferred to CA-MHB + ciprofloxacin media to an initial density of OD=0.10 and ciprofloxacin concentration of 0.5 mg/L, then incubated at 37°C for 48 hours with shaking at 800 rpm and density measurements every 15 minutes (BioTek plate reader). 39 strains were resistant and 33 strains were susceptible, confirming the validation set to be nearly phenotypically balanced (FIG. 15).

[00108] Performance of the original proposed and modified marker sequences on validation strains. Ciprofloxacin resistance predictions based on the four original marker sequences were 72% accurate with 20 false negatives and 0 false positives (FIG. 15).
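The reported accuracy for the original markers follows directly from the counts given above. The short Python sketch below rebuilds the confusion matrix from the stated totals (39 resistant, 33 susceptible, 20 false negatives, 0 false positives), assuming that "positive" means resistant, so a false negative is a resistant strain predicted susceptible. The function names are illustrative only.

```python
def confusion_from_errors(n_resistant, n_susceptible, false_negatives, false_positives):
    """Rebuild a 2x2 confusion matrix from class totals and error counts."""
    return {
        "tp": n_resistant - false_negatives,    # resistant, predicted resistant
        "fn": false_negatives,                  # resistant, predicted susceptible
        "tn": n_susceptible - false_positives,  # susceptible, predicted susceptible
        "fp": false_positives,                  # susceptible, predicted resistant
    }


def summarize(cm):
    """Accuracy, sensitivity and specificity from a confusion matrix dict."""
    total = cm["tp"] + cm["fn"] + cm["tn"] + cm["fp"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total,
        "sensitivity": cm["tp"] / (cm["tp"] + cm["fn"]),
        "specificity": cm["tn"] / (cm["tn"] + cm["fp"]),
    }


original_markers = summarize(confusion_from_errors(39, 33, 20, 0))
# accuracy = 52/72 (about 0.722), sensitivity = 19/39 (about 0.487), specificity = 33/33 = 1.0
```

The breakdown shows that all errors of the original marker set are missed resistant strains, which is why the error analysis that follows concentrates on the false negatives.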
Analysis of the errors found that 17/20 false negatives could be attributed to missing a double SNP mutation (AGT -> ATC) that would yield the same S80I substitution as the single SNP captured by marker 3-4B. This double SNP can be captured by the addition of a fifth marker, 3-4B* (TAACAGGCGATATCGCCGTGCGGAT (SEQ ID NO:50)), which differs from marker 3-4B by one base. Updating the interpretation of the four markers to accept either 3-4B or 3-4B* as being positive for 3-4B increases accuracy to 94% with 3 false negatives and 1 false positive (FIG. 15).

[00109] Development of a PCR assay for direct capture of marker sequence targets. A PCR-based approach was developed to capture the targets of the modified marker sequences without requiring whole genome sequencing, as follows. Templates were prepared by first making cleared lysates from plated cells. Each strain isolate was plated on LB and grown overnight. An amount of cells roughly the size of a grain of sand was transferred to a 200 µL tube with 40 µL water, heated to 95°C in a thermocycler with a heated lid for 5 minutes to lyse, and then centrifuged to pellet the cellular debris. Six 12.5 µL PCR reactions per strain were run using an equally divided master mix which included 1X SYBR dye and 3 µL of the cleared lysates but no primers. Six primer pairs (Table 4) were then added to the samples, one primer pair per well. For targets involving gyrA (marker 3-4A) and parC (markers 3-4B, 3-4B*) the Amplification Refractory Mutation System PCR (ARMS-PCR) (Little, Curr. Protoc. Hum. Genet., Ch. 9, Unit 9.8, 2001) was used to design primers that distinguish between the relevant variants. Both targets required two sets of primers, one to detect the dominant form of the region targeted by the marker sequence and the other to detect the variant described by the marker sequence. For targets involving ydiJ (marker 3-4C) and abi (marker 3-4D), standard PCR primers targeting the genes were used after verifying that they did not yield non-specific amplicons on strains lacking these genes.

[00110] Table 4: PCR primer sequences for detection of ciprofloxacin resistance marker sequences.
ID ID
abi_F CAGCATGGCCCATCCTACTG (SEQ ID NO:61) abi_R 3-4D (abi) TGCCGCCATCCTGATCC (SEQ ID NO:62)
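The relationship between markers 3-4B and 3-4B* described in [00108] can be checked directly from the two sequences. The Python sketch below is illustrative: it counts mismatches between the markers and then compares their reverse complements around the mismatch. Reading the triplet that ends at the mismatch position on the reverse complement is an assumption of this sketch, chosen because it reproduces the ATT and ATC codons named in the text; the helper names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mismatch_positions(a, b):
    """0-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


MARKER_3_4B = "TAACAGGCAATATCGCCGTGCGGAT"       # SEQ ID NO:47
MARKER_3_4B_STAR = "TAACAGGCGATATCGCCGTGCGGAT"  # SEQ ID NO:50

# The two markers differ at exactly one position.
marker_diff = mismatch_positions(MARKER_3_4B, MARKER_3_4B_STAR)

# On the opposite strand, the mismatch is the last base of an ATT / ATC triplet.
rc = reverse_complement(MARKER_3_4B)
rc_star = reverse_complement(MARKER_3_4B_STAR)
pos = mismatch_positions(rc, rc_star)[0]
triplets = (rc[pos - 2:pos + 1], rc_star[pos - 2:pos + 1])
```

Both ATT and ATC encode isoleucine in the standard genetic code, which is consistent with the statement that either variant produces the same S80I substitution and that accepting either marker as positive recovers the missed resistant strains.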
proofreading activity, was used for all of the PCR reactions. Good performance was observed for both the hot start and standard versions of the master mix. When the standard version was used, the plate was loaded onto a block that was pre-heated to the initial denaturation temperature. The six primer pairs were initially run on their respective positive control templates using an annealing gradient of eight different temperatures. An annealing temperature was selected which optimized the likelihood of the gyrA and parC ARMS primer pairs working well on their intended targets but poorly or not at all on the non-targeted counterparts. The ydiJ and abi primers were adjusted to function well at the same temperature so that all six assays could be run on the same plate at the same time. The PCR reactions were all run in a 96-well format on a CFX Duet Real-Time PCR System. Thirty-five cycles were run in total, each cycle consisting of 10 seconds at 95°C, 15 seconds at 69°C, and 30 seconds at 72°C. Outcomes were scored by first setting a threshold to cross all of the amplification curves in their log-linear region, and calls were based on Cq numbers as follows: For gyrA and parC targets, the marker or variant was scored as positive if its Cq number was lower than its counterpart by at least 5 cycles. For samples whose amplification curves did not reach threshold, a Cq of 35 (the total number of cycles run) was used for the calculations. For ydiJ and abi targets, Cq < 25 was considered positive.

[00112] Consistency between PCR assay and in silico marker sequence presence/absence calls derived from assembled genomes on validation strains. Against the 72 validation strains, the PCR-based approach was able to exactly match in silico presence/absence calls for each marker sequence based on assembled genomes (FIG. 16), enabling prediction of ciprofloxacin resistance with 94% accuracy without requiring whole genome sequencing.

[00113] A number of embodiments have been described herein. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of this disclosure. Accordingly, other embodiments are within the scope of the following claims.
Claims
Attorney docket No.00015-0421WO1
WHAT IS CLAIMED IS:
1. A method to identify a set of oligonucleotide marker sequences for rapid classification of microbial pathogens from a dataset of microbial genome assemblies, comprising carrying out steps (1)-(8):
(1) clustering open reading frames by protein sequence to form a pangenome from the dataset of microbial genome assemblies, wherein individual clusters of open reading frames are designated as “genes”;
(2) filtering the genes to identify sets of genes that are associated with a targeted classification scheme;
(3) training machine learning models to iteratively identify minimal sets of related “genes” that are (a) able to accurately reproduce a targeted classification scheme, and (b) can be accurately identified by a marker sequence, given all observed variants of the gene and against the background of the pangenome, wherein the identified minimal sets of associated genes that meet criteria (a) and (b) are designated as candidate genes;
(4) identifying sequence variants from each of the candidate genes, and from which, subsequences having a defined number of nucleotides are enumerated, wherein the enumerated subsequences of length k are designated “kmers”;
(5) identifying conserved kmers for all candidate genes across the pangenome;
(6) filtering the conserved kmers to identify those conserved kmers that accurately reproduce the presence/absence of their corresponding gene, wherein these identified conserved kmers are designated as accurate kmers;
(7) training machine learning models to classify genomes from the accurate kmers and filtering out genes that cannot be reproduced accurately from a single accurate kmer, wherein steps (3)-(6) are repeated until a kmers-based machine learning model able to reproduce the target classification scheme with accuracy greater than 95%, 98%, or 99% is generated; and
(8) selecting a minimal set of oligonucleotide marker sequences from the accurate machine learning model, wherein the minimal set of oligonucleotide marker sequences can be used to generate primers for the classification of microbial pathogens.
2. The method of claim 1, wherein the microbial pathogens are selected from bacterial pathogens, viral pathogens, and fungal pathogens.
3. The method of claim 1 or claim 2, wherein the dataset comprises microbial genome assemblies from bacteria, viruses or fungi.
4. The method of claim 1, wherein for step (1), the microbial genome assemblies contain genomes that have been filtered for quality based on genome status, size, number of contigs, number of genes and consistency.
5. The method of claim 1, wherein for step (1), the microbial genome assemblies have been annotated to mark open reading frames in the genomes.
6. The method of claim 1, wherein for step (1), the open reading frames are clustered using a program that clusters and compares protein or nucleotide sequences.
7. The method of claim 1, wherein for step (2), the targeted classification scheme comprises genes that are related to drug-resistance, virulence, pathotype, or phylogenetically-defined subgroups.
8. The method of claim 1, wherein for step (2), the genes are filtered for those associated with the classification scheme, wherein association is based upon a pair of classes for which the gene is observed in X% of class 1 genomes and Y% of class 2 genomes, where X > Y + 10.
9. The method of claim 1, wherein for step (3), the machine learning models are trained to identify 4-8 genes that most accurately recover the targeted classification scheme.
10. The method of claim 1, wherein for step (4), the subsequences have a number of nucleotides selected from 18 nt, 19 nt, 20 nt, 21 nt, 22 nt, 23 nt, 24 nt, 25 nt, 26 nt, 27 nt, 28 nt, 29 nt, 30 nt, 31 nt, 32 nt, 33 nt, 34 nt, and 35 nt, or a range of nucleotides that includes or is between any two of the foregoing numbers.
11. The method of claim 1, wherein for step (4), the subsequences have a number of nucleotides selected from 20 nt to 30 nt.
12. The method of claim 1, wherein for step (5), conserved kmers occur in greater than 80%, 85%, 90%, 91%, 92%, 93%, 94% or 95% of instances of the gene.
13. The method of claim 1, wherein for step (6), the accuracy of the conserved kmers to reproduce the presence/absence of their corresponding gene is determined based upon the F1 score.
14. The method of claim 1, wherein for step (7), the machine learning models are trained to identify 4-8 kmers that most accurately recover the targeted classification scheme.
15. The method of claim 1, wherein the method further comprises: classifying a microbial pathogen using one or more oligonucleotide marker sequences selected in step (8).
16. The method of any one of the preceding claims, wherein the method is implemented by a computer.
17. The method of claim 16, wherein the computer system is linked to an oligonucleotide synthesizer.
18. A computer readable medium comprising instructions to cause a computer to identify a set of oligonucleotide marker sequences for rapid classification of microbial pathogens from a dataset of microbial genome assemblies, the computer instructions causing the computer to:
(1) cluster open reading frames by protein sequence to form a pangenome from the dataset of microbial genome assemblies, wherein individual clusters of open reading frames are designated as “genes”;
(2) filter the genes to identify sets of genes that are associated with a targeted classification scheme;
(3) train machine learning model(s) to iteratively identify minimal sets of related “genes” that are (a) able to accurately reproduce a targeted classification scheme, and (b) can be accurately identified by a marker sequence, given all observed variants of the gene and against the background of the pangenome, wherein the identified minimal sets of associated genes that meet criteria (a) and (b) are designated as candidate genes;
(4) identify sequence variants from each of the candidate genes, and from which, subsequences having a defined number of nucleotides are enumerated, wherein the enumerated subsequences of length k are designated “kmers”;
(5) identify conserved kmers for all candidate genes across the pangenome;
(6) filter the conserved kmers to identify those conserved kmers that accurately reproduce the presence/absence of their corresponding gene, wherein these identified conserved kmers are designated as accurate kmers;
(7) train machine learning models to classify genomes from the accurate kmers and filter out genes that cannot be reproduced accurately from a single accurate kmer, wherein steps (3)-(6) are repeated until a kmers-based machine learning model able to reproduce the target classification scheme with accuracy greater than 95%, 98%, or 99% is generated; and
(8) select a minimal set of oligonucleotide marker sequences from the accurate machine learning model, wherein the minimal set of oligonucleotide marker sequences can be used to generate primers for the classification of microbial pathogens.
19. A catalog of oligonucleotide marker sequences obtained by the method of claim 1 or a computer running the computer readable medium of claim 18.
20. A method of identifying a pathogen in a sample, the method comprising comparing sequence reads obtained from the genome of the pathogen in the sample to the catalog of oligonucleotide markers of claim 19 to identify the pathogen with a complementary sequence.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202363470417P | 2023-06-01 | 2023-06-01 | |
US63/470,417 | 2023-06-01 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024249933A1 (en) | 2024-12-05 |
Family
ID=93658564
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2024/032104 WO2024249933A1 (en) | 2023-06-01 | 2024-05-31 | Identification of marker sequences for rapid classification of microbial pathogens through machine learning analysis of genome assemblies |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024249933A1 (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180365375A1 (en) * | 2015-04-24 | 2018-12-20 | University Of Utah Research Foundation | Methods and systems for multiple taxonomic classification |
US20220122696A1 (en) * | 2018-12-06 | 2022-04-21 | Yanmei Huang | System and method for achieving high gene data resolution using training sets |
WO2022159838A1 (en) * | 2021-01-22 | 2022-07-28 | Idbydna Inc. | Methods and systems for metagenomics analysis |
-
2024
- 2024-05-31 WO PCT/US2024/032104 patent/WO2024249933A1/en unknown
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180365375A1 (en) * | 2015-04-24 | 2018-12-20 | University Of Utah Research Foundation | Methods and systems for multiple taxonomic classification |
US20230055403A1 (en) * | 2015-04-24 | 2023-02-23 | University Of Utah Research Foundation | Methods and systems for multiple taxonomic classification |
US20220122696A1 (en) * | 2018-12-06 | 2022-04-21 | Yanmei Huang | System and method for achieving high gene data resolution using training sets |
WO2022159838A1 (en) * | 2021-01-22 | 2022-07-28 | Idbydna Inc. | Methods and systems for metagenomics analysis |
Non-Patent Citations (1)
Title |
---|
HYUN ET AL.: "Comparative pangenomics: analysis of 12 microbial pathogen pangenomes reveals conserved global structures of genetic and functional diversity", GENOMICS, vol. 23, no. 7, 4 January 2022 (2022-01-04), pages 1 - 18, XP021300775, DOI: 10.1186/s12864-021-08223-8 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 24816619 Country of ref document: EP Kind code of ref document: A1 |