Published on October 16, 2007
BioSci D145 Lecture #4:
Bruce Blumberg ([email protected])
4103 Nat Sci 2 - office hours Tu/Th 3:30-4:30 (or by appointment)
phone 824-8573
TA – John Ycaza ([email protected])
4351 Nat Sci 2, 824-6873, 3116
Check e-mail and noteboard daily for announcements, etc.
If you do not have ready access to e-mail or the web, speak with me ASAP.
Please use the course noteboard for discussions of the material.
I will post all questions received via e-mail on the course noteboard. If you object to your question being posted, please indicate this clearly in the message.
Lectures will be posted on web pages after lecture:
http://blumberg.bio.uci.edu/biod145-w2007
http://blumberg-serv.bio.uci.edu/biod145-w2007

mRNA frequency and cloning:
mRNA frequency classes
classic references: Bishop et al., 1974 Nature 250, 199-204; Davidson and Britten, 1979 Science 204, 1052-1059
abundant
  10-15 mRNAs that together represent 10-20% of the total RNA mass
  each is > 0.2% of the total
intermediate
  1,000-2,000 mRNAs together comprising 40-45% of the total
  each is 0.05-0.2% of the total
rare
  15,000-20,000 mRNAs comprising 40-45% of the total
  abundance of each is less than 0.05% of the total
  some of these might only occur at a few copies per cell

Normalization and subtraction:
How to identify genes that might only occur at a few copies per cell?
Normalization - process of reducing the frequency of abundant and increasing the frequency of rare mRNAs
  Bonaldo et al., 1996 Genome Research 6, 791-806
Subtraction - removing cDNAs (mRNAs) expressed in two populations, leaving only the differentially expressed ones
  Sagerström et al. (1997) Ann. Rev. Biochem. 66, 751-783
Both alter the representation of the cDNAs in a library or probe

Normalization and subtraction:
Normalization - reduce abundant, increase rare mRNAs
  normalization should bring cDNA abundance to within 10x; it rarely works this well
  Typically, abundant genes are reduced 10x, rare ones increased 3-10x
  Intermediate class genes do not change much at all
Approach
  make a population of cDNAs single stranded - the tester
  hybridize with a large excess of cDNA or mRNA (the driver) to Cot½ = 5.5
  Cot½ value is critical for success of normalization: 5-10 is optimal, higher values are NOT better

Normalization and subtraction (contd):
Approach (contd)
various approaches to make driver
  use mRNA - may not be easy to get
  make ssRNA by transcribing library
  ssDNA from gene II/ExoIII treating inserts from plasmid library
  PCR amplification of library
best approach is to use driver derived from the same library by PCR
  rapid, simple and effective
other approaches each have various technical difficulties
  see the Bonaldo review for details

Normalization and subtraction (contd):
What are normalized libraries good for?
  EST sequencing
  gene identification
  biggest use is to reduce the number of cDNAs that must be screened
  good general purpose target to screen
  subtracted libraries are useful but limited in utility
Drawbacks
  Not trivial to make
  Size distribution of library changes
  Longer cDNAs lost

Normalization and subtraction (contd):
Subtraction - removing cDNAs (mRNAs) expressed in two populations, leaving only the differentially expressed ones
  Sagerström et al. (1997) Ann. Rev. Biochem. 66, 751-783
+/- screening: St. John and Davis (1979) Cell 16, 443-452.
Hybridize the same library with probes prepared from two different sources and compare the results
  example - hybridize a normal liver cDNA library with probes from normal and cancerous liver
  Colonies or plaques that are differentially expressed in the target tissue (tumor) compared with control are picked
  Why aren’t all colonies labeled by the normal tissue probe?

Normalization and subtraction (contd):
+/- screening (contd)
Advantages
  Relatively simple approach
  Doesn’t require difficult manipulations on probes
Disadvantages
  Housekeeping genes often appear to be differential
  Sensitivity is less than subtracted screening
  +/- screening typically requires a >10 fold difference in expression levels using standard methods
Not widely used any longer, BUT microarray analysis is really just a refined version of +/- screening

Normalization and subtraction (contd):
Subtractive screening - Sargent and Dawid (1983) Science 222, 135-139.
Make 1st strand cDNA from a tissue and then hybridize it to excess mRNA from another
  larger Cot½ is best, >20 at least – WHY?
  remove double stranded materials -> common seqs
  make a probe or library from the remaining single stranded cDNA
It takes a long time to remove rare common sequences

Normalization and subtraction (contd):
Subtractive screening (contd)
benefits
  sensitive
  can simultaneously identify all cDNAs that are differentially present in a population
  good choice for identifying unknown, tissue specific genes
drawbacks
  easy to have abundant housekeeping genes slip through
  multistage subtraction is best - in effect normalize first, then subtract
  libraries have limited applications - may not be useful for multiple purposes

Normalization and subtraction (contd):
rule of thumb
  make a high quality representative library from a tissue of interest
  save subtraction and other fancy manipulations for making probes to screen such libraries with
  unlimited screening
  easy to use libraries for different purposes, e.g. the liver library: hepatocarcinoma, cirrhosis, regeneration specific genes

DNA sequence analysis:
DNA sequencing = determining the nucleotide sequence of DNA
Two main methods shared the Nobel Prize in 1980
  Chemical cleavage – Maxam and Gilbert
  Enzymatic sequencing (based on polymerization reaction)
Nobel Prize in Chemistry 1980: Walter Gilbert (Harvard) & Frederick Sanger (MRC Labs) (Sanger also won the Nobel in 1958 for protein sequencing)
How many others have won 2 Nobel prizes? In the same field?
Only people to have won 2 Nobel Prizes:
Marie Sklodowska Curie - 1903 in Physics (radioactivity), 1911 in Chemistry (radium and polonium)
Linus Pauling - 1954 in Chemistry (nature of the chemical bond), 1962 in Peace (crusade to ban atmospheric nuclear testing)
John Bardeen - 1956 in Physics (transistor), 1972 in Physics (superconductivity)
Frederick Sanger - 1958 in Chemistry (protein sequencing), 1980 in Chemistry (DNA sequencing)
Marie Curie’s husband Pierre, daughter Irène, and son-in-law Frédéric Joliot also won, as did son-in-law Henry R. Labouisse (UNICEF)

Parent-children Nobel Prizes:
Curies - 1903 in Physics: Marie and Pierre Curie; 1911 in Chemistry: Marie Curie; 1935 in Chemistry: Irène Joliot-Curie
Braggs - 1915 in Physics: Sir William Henry Bragg and (Sir) William Lawrence Bragg
Bohrs - 1922 in Physics: Niels Bohr; 1975 in Physics: Aage Bohr
Kornbergs - 1959 in Physiology or Medicine: Arthur Kornberg; 2006 in Chemistry: Roger Kornberg
Siegbahns - 1924 in Physics: Karl Manne Siegbahn; 1981 in Physics: Kai Siegbahn
Thomsons - 1906 in Physics: Sir Joseph John Thomson; 1937 in Physics: Sir George Paget Thomson
Von Eulers - 1929 in Chemistry: Hans von Euler-Chelpin; 1970 in Physiology or Medicine: Ulf von Euler

Sibling Nobel Prizes:
Tinbergens - 1969 in Economics: Jan Tinbergen; 1973 in Physiology or Medicine: Nikolaas Tinbergen

Spousal Nobel Prizes:
Curies - 1903 in Physics: Pierre and Marie Curie
Joliot-Curies - 1935 in Chemistry: Irène Joliot-Curie and Frédéric Joliot
Coris - 1947 in Physiology or Medicine: Carl and Gerty Cori

DNA sequence analysis:
Maxam and Gilbert
One of the first reasonable sequencing methods
Very popular in late 70s and early 80s
VERY TEDIOUS!!
Totally superseded by dideoxy sequencing now

DNA sequence analysis (contd):
Dideoxy sequencing – Sanger 1977
Virtually all sequencing is done this way now
Requires a modified nucleotide: 2’,3’-dideoxy NTP (ddNTP)
DNA polymerase incorporates the ddNTP and chain elongation terminates
Original method used 4 separate elongation reactions
Products separated by denaturing PAGE and visualized by autoradiography

DNA sequence analysis (contd):
Dideoxy sequencing (contd) – Sanger 1977
Dideoxy NTPs present at ~1% of [dNTP]
Each reaction has an identified end
In principle, all possible chain lengths are represented
  varies by [dNTPs], [ddNTPs], [primer] and [template] and their ratios

Automated DNA sequence analysis:
Trace files (dye signals) are analyzed and bases called to create chromatograms. Chromatograms from opposite strands are reconciled with software to create double-stranded sequence data.
How to improve throughput of sequencing?
Incorporate fluorescent ddNTPs, separate products by PAGE
  Base calling and lane calling issues
Key advance was capillary sequencers
  Separate DNA in a thin capillary instead of gel
  Very accurate, no tracking errors, much more automation friendly

Automated DNA sequence analysis:
Capillaries vs gels
  Capillaries much faster – higher field strength possible
  Fully automated = higher throughput
Applied Biosystems PRISM 377 (Gel, 34-96 lanes)
Applied Biosystems PRISM 3700 (Capillary, 96 capillaries)

PCR – polymerase chain reaction amplification of DNA:
PCR is the most routinely used method to amplify DNA
Exponential amplification of DNA by polymerases – Saiki et al., 1985
  2^n fold amplification, n = # cycles
  35 cycles = 2^35 = 3.4 x 10^10 fold
Originally used DNA polymerase I
  Needed to add fresh enzyme at every cycle because heat denaturation of template killed the enzyme
  Not widely used – too painful to do manually
Nobel Prize to Kary Mullis in 1993 for deciding to use Taq DNA polymerase for PCR
  He was middle author on paper!

Hot water bacteria: Thermus aquaticus
Taq DNA polymerase
Life at High Temperatures by Thomas D. Brock, Biotechnology in Yellowstone © 1994 Yellowstone Association for Natural Science http://www.bact.wisc.edu/Bact303/b27

Cycle sequencing – fusion of PCR and fluorescent ddNTP sequencing:
http://www.dnalc.org/ddnalc/resources/animations.html
Combine PCR amplification with dideoxy sequencing – cycle sequencing
  Linear amplification of template in the presence of fluorescent ddNTPs
  When nucleotides are used up, the reaction is over
  Separate on capillary electrophoresis instrument
Advantages
  Fast, single tube reaction
  Works with small amounts of starting material
Disadvantages
  Still need to prepare high quality template to sequence
  Cost and time
    Many sequencing centers spend time, $$ on template prep
  Automation requirements

Isothermal amplification – the solution to template preparation:
How to make template preparation faster, easier and more reliable?
  Eliminate automation requirement, amplify starting material in some other way
Φ29 DNA polymerase (aka TempliPhi)
http://www1.amershambiosciences.com/aptrix/upp01077.nsf/Content/autodna_templiphi_intro
  Enzyme has high processivity and strand displacement activity
  Isothermal reaction produces huge quantities of DNA from tiny amount of input
  More efficient than PCR (no temp change, no machine, no cleanup)

Modern DNA sequence analysis:
Cycle sequencing
  Virtually all DNA sequencing today is done by cycle sequencing with fluorescent ddNTPs
  ABI Big Dye chemistry
Template preparation still tedious for small scale
  TempliPhi used in genome centers (obviated need for most automation)
Capillary sequencers predominant form of technology in use

DNA sequence analysis:
Landmarks in DNA sequencing
Sanger, Nicklen and Coulson. Sequencing with chain terminating inhibitors. Proc. Natl. Acad. Sci. 74, 5463-5467 (1977).
Sanger, F. et al. The nucleotide sequence of bacteriophage ΦX174. J Mol Biol 125, 225-46 (1978).
Sutcliffe, J. G. Complete nucleotide sequence of the Escherichia coli plasmid pBR322. Cold Spring Harb Symp Quant Biol 43, 77-90 (1979).
Sanger et al. Nucleotide sequence of bacteriophage lambda DNA. J Mol Biol 162, 729-73 (1982).
Messing, J., Crea, R. & Seeburg, P. H. A system for shotgun DNA sequencing. Nucl. Acids Res 9, 309-21 (1981).
Anderson, S. et al. Sequence and organization of the human mitochondrial genome. Nature 290, 457-65 (1981).
Deininger, P. L. Random subcloning of sonicated DNA: application to shotgun DNA sequence analysis. Anal Biochem 129, 216-23 (1983).
Baer et al. DNA sequence and expression of the B95-8 Epstein-Barr virus genome. Nature 310, 207-11 (1984). (189 kb)
Innis et al. DNA sequencing with Taq DNA polymerase and direct sequencing of PCR-amplified DNA. Proc. Natl. Acad. Sci. 85, 9436-9440 (1988).

DNA sequence analysis (contd):
Landmarks in DNA sequencing (contd)
1995 - Haemophilus influenzae (1.83 Mb) - first bacterium sequenced, human pathogen
1995 - Mycoplasma genitalium (0.58 Mb) - smallest free living organism
1996 - Saccharomyces cerevisiae genome (13 Mb)
1996 - Methanococcus jannaschii (1.66 Mb) - first Archaebacterium
1997 - Escherichia coli (4.6 Mb)
1997 - Bacillus subtilis (4.2 Mb)
1997 - Borrelia burgdorferi (1.44 Mb) - Lyme disease
1997 - Archaeoglobus fulgidus (2.18 Mb) - first sulfur metabolizing bacterium
1997 - Helicobacter pylori (1.66 Mb) - first bacterium proven to cause cancer

DNA sequence analysis (contd):
Landmarks in DNA sequencing (contd)
1998 - Treponema pallidum (1.14 Mb)
1998 - Caenorhabditis elegans genome (97 Mb)
1999 - Deinococcus radiodurans (3.28 Mb) - resistant to radiation, starvation, oxidative stress
2000 - Drosophila melanogaster (120 Mb)
2000 - Arabidopsis thaliana (115 Mb)
2001 - Escherichia coli O157:H7 (4.1 Mb) - pathogenic variant of E. coli
2001 – draft Human “genome”
2002 – mouse genome
2002 – Ciona intestinalis - primitive chordate
2003 – “complete” human genome
2004 – rat genome
2006 – Human “genome”: complete sequence of all chromosomes
Many more genomes underway, check JGI, Sanger and other web sites

DNA Sequence analysis:
Complete DNA sequence (all nts both strands, no gaps)
complete sequence is desirable but takes time
  how long depends on size and strategy employed
which strategy to use depends on various factors
  how large is the clone? cDNA or genomic?
  How fast is sequence required?
sequencing strategies
  primer walking
  cloning and sequencing of restriction fragments
  progressive deletions: bidirectional, unidirectional
  Shotgun sequencing
    whole genome
    with mapping: map first (C. elegans), map as you go (many)

DNA Sequence analysis (contd):
Primer walking - walk from the ends with oligonucleotides
sequence, back up ~50 nt from end, make a primer and continue
Why back up?
Need to see overlap to be sure about the sequence you are reading

DNA Sequence analysis (contd):
Primer walking (contd)
advantages
  very simple
  no possibility to lose bits of DNA (unlike restriction mapping and deletion methods)
  no restriction map needed
  best choice for short DNA
disadvantages
  slowest method - about a week between sequencing runs
  oligos are not free (and not reusable)
  not feasible for large sequences
applications
  cDNA sequencing when time is not critical
  targeted sequencing: verification, closing gaps in sequences

DNA Sequence analysis (contd):
Cloning and sequencing of restriction fragments
once the most popular method
  make a restriction map, subclone fragments
  sequence
advantages
  straightforward
  directed approach
  can go quickly
  cloned fragments often useful otherwise: RNase protection, nuclease mapping, in situ hybridization
disadvantages
  possible to lose small fragments - must run high quality analytical gels
  depends on quality of restriction map - mistaken mapping -> wrong sequence
  restriction site availability
applications
  sequencing small cDNAs
  isolating regions to close gaps

DNA Sequence analysis (contd):
nested deletion strategies - sequential deletions from one end of the clone
cut, close and sequence
Approach
  make restriction map
  use enzymes that cut in polylinker and insert
  Religate, sequence from end with restriction site
  repeat until finished, filling in gaps with oligos
advantages
  Fast, simple, efficient
disadvantages
  limited by restriction site availability in vector and insert
  need to make a restriction map

DNA Sequence analysis (contd):
nested deletion strategies (contd)
Exonuclease III-mediated deletion
  cut with polylinker enzyme
    protect ends - 3’ overhang or phosphorothioate
  cut with enzyme between first cut and the insert
    can’t leave 3’ overhang
  timed digestions with Exonuclease III
  stop reactions, blunt ends
  ligate and size select recombinants
  sequence
advantages
unidirectional processivity of enzyme gives nested deletions

DNA Sequence analysis (contd):
Nested deletion strategies
Exonuclease III-mediated deletion (contd)
disadvantages
  need two unique restriction sites flanking insert on each side
  best used successively to get > 10 kb total deletions
  may not get complete overlaps of sequences - fill in with restriction fragments or oligos
applications
  method of choice for moderate size sequencing projects: cDNAs, genomic clones
  good for closing larger gaps
Small-scale sequence analysis – how is it practiced today?
  Primer walking
  ExoIII-mediated deletion with primer walking

Genome sequencing:
The problem
  Genome sizes for most eukaryotes are large (10^8-10^9 bp)
  High quality sequences only about 600-800 bp per run
The solution
  Break genome into lots of bits and sequence them all
  Reassemble with computer
The benefit
  Rapid increase in information about genome size, gene comparisons, etc.
The cost
  3 x 10^9 bp (human haploid genome) ÷ 600 bp/reaction = 5 x 10^6 reactions for 1x coverage!
  Need both strands (x2), need overlaps and need to be sure of sequences
  ~10^7-10^8 reactions/runs required for a human-sized genome
  About $1-2 per reaction these days.

Genome sequencing (contd):
Shotgun sequencing
NOT invented by Craig Venter
  Messing 1981 - first description of shotgun
  Sanger lab developed current methods in 1983
approach
  blast genome into small chunks
  clone these chunks
    3-5 kb, 8 kb plasmid
    40 kb fosmid - jump repetitive sequences
  sequence + assemble by computer
A priori difficulties
  how to get nice uniform distribution
  how to assemble fragments
  what to do about repeats?
  How to minimize sequence redundancy?

Genome sequencing (contd):
Shotgun sequencing (contd)
How to minimize sequence redundancy?
Best way to minimize redundancy is to map before you start
  C. elegans was done this way - when the sequence was finished, it was FINISHED
  mapping took almost 10 years
mapping much too tedious and nonprofitable for Celera
  who cares about redundancy, let’s sequence and make $$
why does redundancy matter?
  Finished sequence today costs about $0.50/base

Genome sequencing (contd):
Mapping by fingerprinting
Mapping by hybridization

Traditional (map first) vs STC (map as you go along) mapping

The human genome:
On Feb 12, 2001, Celera and the Human Genome Project published “draft” human genome sequences
Number of genes
  Celera -> 39,114
  Ensembl -> 29,691
  Consensus from all sources ~30K
  C. elegans – 19,000
  Arabidopsis – 25,000
Predictions had been from 50-140K human genes
What’s up with that?
  Are we only slightly more complicated than a weed?
  How can we possibly get a human with less than 2x the number of genes as C. elegans?
Implications?
  UNRAVELING THE DNA MYTH: The spurious foundation of genetic engineering, Barry Commoner, Harper’s Magazine, Feb 2002

The human genome:
The answer – gene sets don’t overlap completely
  Floor is 42K (= 42,113)
  85,793 UniGene clusters (from EST and mRNA sequencing); down from 105,680 last year and 128,826 the previous year

Genome sequencing (contd):
Whole genome shotgun sequencing (Celera)
premise is that rapid generation of draft sequence is valuable
why bother trying to clone and sequence difficult regions?
  Basically just forget regions of repetitive DNA - not cost effective
using this approach, genome is alleged to be 90% finished
  rule of thumb is that it takes at least as long to finish the last 5% as it took to get the first 95%
problems
  sequence may never be complete as is C. elegans
  much redundant sequence with many sparse regions and lots of gaps.
Fragment assembly for regions of highly repetitive DNA is dubious at best
“Finished” fly and human genomes lack more than a few already characterized genes

The human genome:
How finished is the human genome sequence?
Draft sequence to high coverage
Chromosome by chromosome finishing now
  Chr 22 – 1999
  Chr 21 – 2000
  Chr 20 – 2001
  Chr 15 – 2003
  Chr 6, 7, Y – 2003
  Chr 13, 19 – 2004
  May 2006 – all finished

Genome sequencing (contd):
Knowing what we know now – how to approach a large new genome?
Xenopus tropicalis - 1.7 Gb (about ½ human)
  BAC end sequencing
  Whole genome shotgun
  Gaps closed with BACs
  8x coverage by end of 2004
  Finishing dependent on additional funding

Genome sequencing:
DOE – Joint Genome Institute http://www.jgi.doe.gov/
Numerous advances in sequencing technology
  Increased pass rate from ~70% to > 90%
  Lowered cost nearly 3 fold
Total (3/99-1/29/07): 140.546 Billion; 232,940,658; 92%; 661 (table column headings not preserved)

Useful software for molecular biology (contd):
NCBI – www.ncbi.nlm.nih.gov
main information and analysis resource
indispensable resource

Useful software for molecular biology (contd):
NCBI – Blast – how to find similar genes
www.ncbi.nlm.nih.gov/BLAST/

Useful software for molecular biology (contd):
Why pay Celera?
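
Worked example (not part of the original slides): the reaction-count estimate from the Genome sequencing slide (3 x 10^9 bp at about 600 bp per reaction) can be checked with a short Python sketch. The function names reads_needed and fraction_covered are hypothetical, chosen for illustration. The coverage estimate assumes that reads land uniformly at random along the genome, which is a simplification the slides do not state, and the $1 price is simply the low end of the $1-2 per reaction quoted above.

```python
import math

HUMAN_HAPLOID_BP = 3_000_000_000  # 3 x 10^9 bp, as on the slide
READ_BP = 600                     # high quality bases per reaction, as on the slide


def reads_needed(genome_bp: int, read_bp: int, coverage: float = 1.0) -> int:
    """Reactions needed to reach the requested average coverage."""
    return math.ceil(genome_bp * coverage / read_bp)


def fraction_covered(coverage: float) -> float:
    """Expected fraction of bases sequenced at least once.

    Assumption (not from the slides): reads land uniformly at random,
    so per-base depth is approximately Poisson with mean = coverage.
    """
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    for cov in (1, 2, 8):
        n = reads_needed(HUMAN_HAPLOID_BP, READ_BP, cov)
        # Hypothetical price of $1 per reaction (low end of the slide's range),
        # so the dollar figure equals the reaction count.
        print(f"{cov}x: {n:,} reactions, ~${n:,} at $1 each, "
              f"{fraction_covered(cov):.1%} of bases expected covered")
```

Under these assumptions, 1x coverage reproduces the slide's 5 x 10^6 reactions yet leaves a large share of bases unsequenced, which is one way to see why the slide's working estimate rises to 10^7-10^8 reactions once both strands and overlaps are required.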