BMI705 Lecture1

Information about BMI705 Lecture1

Published on October 15, 2007

Author: Mertice

Source: authorstream.com

Content

Slide1: IBGP/BMI 730 Biomedical Informatics
  Director: Prof. Kun Huang

Slide2: What is Bio(medical)-informatics?
  bio·in·for·mat·ics : the collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied in molecular genetics and genomics.
  Source: Merriam-Webster's Medical Dictionary, © 2002 Merriam-Webster, Inc.

Slide3: Myth 1: Bioinformatics is about genomics
  Nucleotide – DNA, RNA, …
  Genome – Sequences, chromosomes, expressed data, …
  Protein – Sequences, 3-D structure, interaction, …
  System – Gene network, protein network, TFs, …
  Other – Mass spec, microarray, images, lab records, journals, literature, …
  The goal is to understand how the system works.

Slide4: Myth 2: Data vs. Information
  Data
    Nucleotide – DNA, RNA, …
    Genome – Sequences, chromosomes, expressed data, …
    Protein – Sequences, 3-D structure, interaction, …
    System – Gene network, protein network, TFs, …
    Other – Mass spec, microarray, images, lab records, journals, literature, …
  Information
    Genotype
    Phenotype
    Genotype-Phenotype relationship
    SNPs
    Pathways
    Drug targets
  Getting data is “easy”; extracting information is hard!

Slide5: Myth 3: Computers are intelligent
  Pros
    Repeated work
    Accurate storage
    Precise computation
    Fast communication
    …
  Cons
    Cannot generalize
    No real intelligence
    …
  The results must be reviewed and validated by biologists. In addition, biologists must have some understanding of how computers process data (algorithms). That is why we need to learn bioinformatics.

Slide6: Biology – Biomedical informatics – System biology
  Biomedical Informatics

Slide7:
  Biology: domain knowledge, hypothesis testing, experimental work, genetic manipulation, quantitative measurement, validation
  System Sciences: theory, analysis, modeling, synthesis/prediction, simulation, hypothesis generation
  Informatics: data management, database, computational infrastructure, modeling tools, high performance computing, visualization
  System Biology: Prediction!

Slide8: What information do we want to extract?

Slide9: The Theme of Modern Biology

Slide10: Where does large data come from (who to blame)?
  High-throughput techniques
  Fred Sanger
    Nobel Prize in Chemistry in 1958 "for his work on the structure of proteins, especially that of insulin"
    Nobel Prize in Chemistry in 1980 "for their contributions concerning the determination of base sequences in nucleic acids"

Slide11: High-throughput techniques
  DNA Sequencing
    1970’s – Nobel prize
    1980’s – Ph.D. thesis
    Early 1990’s – Major research projects
    Late 1990’s to now – $20

Slide12: Human Genome Project
  The Beginning (1988)
  Cold Spring Harbor Laboratory, Long Island, New York

Slide13: June 26, 2000 at the White House

Slide14: Initial Analysis of the Human Genome

Slide15: What information do we want to extract?
  Total genetic difference (# of bases) is 4%
  35 million single base substitutions plus 5 million insertions or deletions (indels)
  The average protein differs by only two amino acids, and 29% of proteins are identical.
  Genotype – Phenotype relationship!!!

Slide17: Phenotype
  mRNA level
  Protein expression
  Protein structure
  Cell morphology
  Tissue morphology
  System physiological functions
  Behavior
  …

Slide18: High-throughput techniques
  High-throughput protein crystallization
  Mass spectrometry
  Microarray
  High-throughput cell imaging
  High-throughput in vivo screening
  …

Slide20: “A key element of the GTL program is an integrated computing and technology infrastructure, which is essential for timely and affordable progress in research and in the development of biotechnological solutions. In fact, the new era of biology is as much about computing as it is about biology. Because of this synergism, GTL is a partnership between our two offices within DOE’s Office of Science—the Offices of Biological and Environmental Research and Advanced Scientific Computing Research.
Only with sophisticated computational power and information management can we apply new technologies and the wealth of emerging data to a comprehensive analysis of the intricacies and interactions that underlie biology. Genome sequences furnish the blueprints, technologies can produce the data, and computing can relate enormous data sets to models linking genome sequence to biological processes and function.”

Slide21: How to extract the information?
  Computational tools
  Building the databases
  Perform analysis/extract features
  Data mining
  Classification/statistical learning
  Visualization/representation
  Biological information!!!

Slide22: What we are going to do:
  Search the databases
  Perform analysis
  Present output
  Be a salient user!

Slide23: What we are going to teach:
  Genomics
  Proteomics
  Microarray analysis
  Other aspects
  Ontology
  Imaging informatics
  System biology
  Machine/statistical learning
  Visualization
  Data sources (databases)
  Available tools
  Major issues in using the databases and tools
  Other resources

Slide24: Jump Start for Bioinformatics
  Biology
  PubMed
  GenBank

Slide25: Review of Biology
  Central dogma

Slide26: Review of Biology
  Operon

Slide27: Review of Biology
  mRNA, cDNA, exon, intron

Slide28: Review of Biology
  Codon, reading frames
  Sequence – open reading frame (ORF) – amino acids
  Six possible reading frames instead of three!!! (Why?)
  In eukaryotes, usually only one reading frame is used, and it is often the longest one.
  An ORF starts with an ATG (Met) in most species and ends with a stop codon (TAA, TAG, or TGA).
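
As a minimal sketch of the reading-frame idea on Slide28, the Python below translates a DNA sequence in all six frames: three offsets on the given strand plus three on its reverse complement, which is why there are six frames rather than three. The function names (`translate`, `reverse_complement`, `six_frames`) are illustrative choices, not part of the lecture, and the code assumes the standard genetic code and an unambiguous A/C/G/T alphabet.

```python
from typing import Dict, List

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS: List[str] = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE: Dict[str, str] = dict(zip(CODONS, AMINO_ACIDS))

# Example sequence taken from the Slide28 reading-frame illustration.
SLIDE_SEQ = "atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str, offset: int = 0) -> str:
    """Translate complete codons starting at `offset`; '*' marks a stop codon."""
    seq = seq.upper()
    # Unknown codons (e.g. containing N) are reported as 'X' in this sketch.
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(offset, len(seq) - 2, 3)
    )


def six_frames(seq: str) -> List[str]:
    """Return translations for frames 1-3 (forward) and 4-6 (reverse strand)."""
    forward = [translate(seq, k) for k in range(3)]
    rc = reverse_complement(seq)
    reverse = [translate(rc, k) for k in range(3)]
    return forward + reverse


if __name__ == "__main__":
    for number, protein in enumerate(six_frames(SLIDE_SEQ), start=1):
        print(number, protein)
```

Only the first frame of the slide's sequence runs from ATG to a stop codon without interruption, so it is the only frame that forms a complete ORF across the whole sequence; the other forward frames hit stop codons early.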
  5'                                                    3'
     atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa
  1  atg ccc aag ctg aat agc gta gag ggg ttt tca tca ttt gag gac gat gta taa
      M   P   K   L   N   S   V   E   G   F   S   S   F   E   D   D   V   *
  2   tgc cca agc tga ata gcg tag agg ggt ttt cat cat ttg agg acg atg tat
       C   P   S   *   I   A   *   R   G   F   H   H   L   R   T   M   Y
  3    gcc caa gct gaa tag cgt aga ggg gtt ttc atc att tga gga cga tgt ata
        A   Q   A   E   *   R   R   G   V   F   I   I   *   G   R   C   I

Slide29: Review of Biology
  Protein folding and structure

Slide30: Databases
  GenBank  www.ncbi.nlm.nih.gov/GenBank/
  EMBL     www.ebi.ac.uk/embl/
  DDBJ     www.ddbj.nig.ac.jp
  Synchronized daily. Accession numbers are managed in a consistent way.
  AceDB, DDJP, DNA, JJPID, MIPS, PHRED, PIR, PROSITE, RDP, TIGR, UNIGENE, …

Slide31: Resources
  Local: OSU library
  Web:
    PubMed
    JSTOR (http://www.jstor.com)
    http://www.expasy.org
    http://www.genecards.org
    http://www.pathguide.org/

Slide32: Resources – What’s out there?

Slide33: PubMed – Entrez
  PubMed: http://www.pubmed.gov, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
  PubMed training: http://www.nlm.nih.gov/bsd/disted/pubmed.html
  Entrez: http://www.ncbi.nlm.nih.gov/Database/index.html
  Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.

Slide34: Entrez Databases

Slide35: Literature
  Examples: E2F3, Retinoblastoma
  Constraints: automatic vs. manual
  Save: Tutorial at http://www.nlm.nih.gov/bsd/viewlet/myncbi/saving_searches.swf

Slide36: Literature

Slide37: Literature

Slide38: Literature
  Examples: E2F3, Retinoblastoma
  Constraints: automatic vs. manual

Slide39: Literature

Slide40: Nucleotide
  Gene, Genome, Sequence, mRNA, cDNA, SNP
  Name, Accession number, GI number, Version number, Alias

Slide41: Accession number, GI number, Version
  Accession number (GenBank): The accession number is the unique identifier assigned to the entire sequence record when the record is submitted to GenBank. The GenBank accession number is a combination of letters and numbers, usually in the format of one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456). The accession number for a particular record will not change even if the author submits a request to change some of the information in the record. Note that an accession number is a unique identifier for a complete sequence record, while a Sequence Identifier, such as a Version, GI, or ProteinID, is an identification number assigned just to the sequence data. The NCBI Entrez System is searchable by accession number using the Accession [ACCN] search field.

  GI (GenBank): A GI or "GenInfo Identifier" is a sequence identifier that can be assigned to a nucleotide sequence or protein translation. Each GI is a numeric value of one or more digits. The protein translation and the nucleotide sequence contained in the same record will each be assigned different GI numbers. Every time the sequence data for a particular record is changed, its version number increases and it receives a new GI. However, while each new version number is based upon the previous version number, a new GI for an altered sequence may be completely different from the previous GI. For example, in the GenBank record M12345, the original GI might be 7654321, but after a change in the sequence is submitted, the new GI for the changed sequence could be 10529376. Individuals can search for nucleotide sequences and protein translations by GI using the UID search field in the NCBI sequence databases.

  GI number is NOT GeneID.
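
The identifier rules on Slide41 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an NCBI tool: the helper names are hypothetical, the pattern covers only the two "usual" accession layouts named on the slide (other layouts exist, such as the underscore-prefixed accession seen in the lecture's FASTA example), and the `accession.version` split assumes the dot-suffix convention visible in that same example (NM_000240.2).

```python
import re
from typing import Optional, Tuple

# Only the two layouts described on the slide: one letter + five digits,
# or two letters + six digits. Other real-world layouts are not covered.
CLASSIC_ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def is_classic_accession(identifier: str) -> bool:
    """True if the identifier matches one of the two layouts on the slide."""
    return CLASSIC_ACCESSION_RE.match(identifier.upper()) is not None


def split_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'ACCESSION.N' into (accession, N); no suffix gives (identifier, None)."""
    accession, dot, version = identifier.partition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return identifier, None


def accession_query(accession: str) -> str:
    """Build an Entrez search term restricted to the Accession [ACCN] field."""
    return f"{accession}[ACCN]"


if __name__ == "__main__":
    for candidate in ("M12345", "AC123456", "7654321", "NM_000240.2"):
        print(candidate, is_classic_accession(candidate), split_version(candidate))
    print(accession_query("M12345"))
```

A purely numeric string such as 7654321 fails the accession check, which matches the slide's point that a GI is a separate, numeric sequence identifier and not an accession number.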

Slide42: Example: E2F3

Slide43: Example: E2F3

Slide44: Data Format
  FASTA (.fasta file)

>gi|33469954|ref|NM_000240.2| Homo sapiens monoamine oxidase A (MAOA), nuclear gene encoding mitochondrial protein, mRNA
GGGCGCTCCCGGAGTATCAGCAAAAGGGTTCGCCCCGCCCACAGTGCCCGGCTCCCCCCGGGTATCAAAA
GAAGGATCGGCTCCGCCCCCGGGCTCCCCGGGGGAGTTGATAGAAGGGTCCTTCCCACCCTTTGCCGTCC
CCACTCCTGTGCCTACGACCCAGGAGCGTGTCAGCCAAAGCATGGAGAATCAAGAGAAGGCGAGTATCGC
GGGCCACATGTTCGACGTAGTCGTGATCGGAGGTGGCATTTCAGGACTATCTGCTGCCAAACTCTTGACT
GAATATGGCGTTAGTGTTTTGGTTTTAGAAGCTCGGGACAGGGTTGGAGGAAGAACATATACTATAAGGA
ATGAGCATGTTGATTACGTAGATGTTGGTGGAGCTTATGTGGGACCAACCCAAAACAGAATCTTACGCTT
GTCTAAGGAGCTGGGCATAGAGACTTACAAAGTGAATGTCAGTGAGCGTCTCGTTCAATATGTCAAGGGG
AAAACATATCCATTTCGGGGCGCCTTTCCACCAGTATGGAATCCCATTGCATATTTGGATTACAATAATC
TGTGGAGGACAATAGATAACATGGGGAAGGAGATTCCAACTGATGCACCCTGGGAGGCTCAACATGCTGA
CAAATGGGACAAAATGACCATGAAAGAGCTCATTGACAAAATCTGCTGGACAAAGACTGCTAGGCGGTTT
GCTTATCTTTTTGTGAATATCAATGTGACCTCTGAGCCTCACGAAGTGTCTGCCCTGTGGTTCTTGTGGT
ATGTGAAGCAGTGCGGGGGCACCACTCGGATATTCTCTGTCACCAATGGTGGCCAGGAACGGAAGTTTGT
AGGTGGATCTGGTCAAGTGAGCGAACGGATAATGGACCTCCTCGGAGACCAAGTGAAGCTGAACCATCCT
GTCACTCACGTTGACCAGTCAAGTGACAACATCATCATAGAGACGCTGAACCATGAACATTATGAGTGCA
AATACGTAATTAATGCGATCCCTCCGACCTTGACTGCCAAGATTCACTTCAGACCAGAGCTTCCAGCAGA
GAGAAACCAGTTAATTCAGCGGCTTCCAATGGGAGCTGTCATTAAGTGCATGATGTATTACAAGGAGGCC
TTCTGGAAGAAGAAGGATTACTGTGGCTGCATGATCATTGAAGATGAAGATGCTCCAATTTCAATAACCT
TGGATGACACCAAGCCAGATGGGTCACTGCCTGCCATCATGGGCTTCATTCTTGCCCGGAAAGCTGATCG
ACTTGCTAAGCTACATAAGGAAATAAGGAAGAAGAAAATCTGTGAGCTCTATGCCAAAGTGCTGGGATCC
CAAGAAGCTTTACATCCAGTGCATTATGAAGAGAAGAACTGGTGTGAGGAGCAGTACTCTGGGGGCTGCT
ACACGGCCTACTTCCCTCCTGGGATCATGACTCAATATGGAAGGGTGATTCGTCAACCCGTGGGCAGGAT
TTTCTTTGCGGGCACAGAGACTGCCACAAAGTGGAGCGGCTACATGGAAGGGGCAGTTGAGGCTGGAGAA
CGAGCAGCTAGGGAGGTCTTAAATGGTCTCGGGAAGGTGACCGAGAAAGATATCTGGGTACAAGAACCTG
…

>gi|4557735|ref|NP_000231.1| monoamine oxidase A [Homo sapiens]
MENQEKASIAGHMFDVVVIGGGISGLSAAKLLTEYGVSVLVLEARDRVGGRTYTIRNEHVDYVDVGGAYV
GPTQNRILRLSKELGIETYKVNVSERLVQYVKGKTYPFRGAFPPVWNPIAYLDYNNLWRTIDNMGKEIPT
DAPWEAQHADKWDKMTMKELIDKICWTKTARRFAYLFVNINVTSEPHEVSALWFLWYVKQCGGTTRIFSV
TNGGQERKFVGGSGQVSERIMDLLGDQVKLNHPVTHVDQSSDNIIIETLNHEHYECKYVINAIPPTLTAK
IHFRPELPAERNQLIQRLPMGAVIKCMMYYKEAFWKKKDYCGCMIIEDEDAPISITLDDTKPDGSLPAIM
GFILARKADRLAKLHKEIRKKKICELYAKVLGSQEALHPVHYEEKNWCEEQYSGGCYTAYFPPGIMTQYG
RVIRQPVGRIFFAGTETATKWSGYMEGAVEAGERAAREVLNGLGKVTEKDIWVQEPESKDVPAVEITHTF
WERNLPSVSGLLKIIGFSTSVTALGFVLYKYKLLPRS

Slide45: Data Format
  Other formats
  NBRF/PIR (.pir file): begins with “>P1;” for a protein sequence and “>N1;” for a nucleotide sequence.
  GDE (.gde file): similar to a FASTA file, but begins with “%” instead of “>”.

Slide46: Exercises
  Question 1 – Database search
    Find the following genes in GenBank. Write down their accession numbers, GI numbers, and chromosome numbers: Rb1 (human), Rb1 (mouse), Rb1 (rat), Rb1 (dog), Rb1 (bovine).
    Find the protein sequences for the above. Present them in FASTA format.
    Note: choose the closest match (e.g., if both Rb1 and Rb are present, choose Rb1).
  Question 2 – Gene information search
    Find the function and aliases for the following genes: PTEN, Col4A1, MMP9, and WASP.
  Reading – Entrez tutorial
    http://www.ncbi.nlm.nih.gov/entrez/query/static/help/entrez_tutorial_BIB.pdf
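
For the FASTA exercises, a small reader such as the following Python sketch may help when checking downloaded records. It assumes the header layout shown on the Data Format slide (`gi|<GI>|<database>|<accession.version>| description`); the function names are hypothetical, and headers in other layouts are simply returned as a plain description. The example text reuses the protein header from the slide with only the first two sequence lines.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Header and first two sequence lines of the protein record from the slide.
EXAMPLE = """>gi|4557735|ref|NP_000231.1| monoamine oxidase A [Homo sapiens]
MENQEKASIAGHMFDVVVIGGGISGLSAAKLLTEYGVSVLVLEARDRVGGRTYTIRNEHVDYVDVGGAYV
GPTQNRILRLSKELGIETYKVNVSERLVQYVKGKTYPFRGAFPPVWNPIAYLDYNNLWRTIDNMGKEIPT
"""


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; sequence lines are joined without breaks."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_ncbi_header(header: str) -> Dict[str, str]:
    """Split a 'gi|GI|db|accession.version| description' header into fields."""
    parts = header.split("|", 4)
    if len(parts) < 5 or parts[0] != "gi":
        # Not the layout assumed here; keep the whole header as description.
        return {"description": header.strip()}
    accession, _, version = parts[3].partition(".")
    return {
        "gi": parts[1],
        "db": parts[2],
        "accession": accession,
        "version": version,
        "description": parts[4].strip(),
    }


if __name__ == "__main__":
    for header, sequence in read_fasta(EXAMPLE):
        print(parse_ncbi_header(header), len(sequence))
```

Joining the sequence lines before any analysis matters because the line breaks in a FASTA file are only a display convention and carry no biological meaning.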
