2004L9Stat246

Information about 2004L9Stat246

Published on February 24, 2008

Author: Nivedi

Source: authorstream.com

Content

HMM in crosses and small pedigrees, cont.:  HMM in crosses and small pedigrees, cont. Lecture 9, Statistics 246, February 19, 2004 Another problem: reconstructing haplotypes:  Another problem: reconstructing haplotypes The problem here is to reconstruct the childrens’ haplotypes as in the figure, from marker data on both the children and the parents. The Lander-Green HMM:  The Lander-Green HMM Recap. The states of the Markov chain are the inheritance vectors. At any locus on a chromosome, the entry in the inheritance vector for a non-founder are 0 if the parental variant passed on at that locus was grandmaternal, and 1 otherwise. Consider a two-parent two-child nuclear family and suppose that the mother and father are 1/2 and 3/4, respectively, while the first (girl) child is 1/3 and the second (boy) is 2/4. Then the inheritance vector is of length 4, v = (vgm , vgp , vbm, vbp), where gm represents the girl’s maternal meiosis, gp her paternal meiosis, and so on. What are the assigments for v at the marker? We don’t know which of the mother’s alleles 1 and 2 came from her mother and which came from her father, but we can arbitrarily declare that it was the one she passed on to her daughter, and similarly for the alleles 3 and 4 of the father. With this assignment, we find that v = (0, 0, 1, 1), because in each case, the boy received alleles from his parents different from his sister’s. In fact the specification of the paternal and maternal chromosomes of a founder is completely arbitrary, and we’ll mention later how this can be turned into a symmetry which speeds up the calculations. Transition probabilities:  Transition probabilities Now suppose that the same family has genotypes 1’/2’, 3’/4’, 1’/3’ and 2’/4’ at a locus near the first one. If the recombination fraction between the two loci is r, and r is small, then we might expect the inheritance vector v’ at the second locus to coincide with v = (0,0,1,1). But what if the genotypes were 1’/2’, 3’/4’, 1’/3’ and 2’/3’, respectively? This suggests that v’ = (0,0,1,0), with the 10 in the boy’s paternally inherited chromosome denoting a recombination. How do we weigh up these competing possibilities? As with the mouse chromosomes, we need a transition matrix P(r) connecting adjacent inheritance vectors. The form of P is as in the mouse case, namely, a tensor power of the 22 matrices having 1-r on the diagonal, and r on the off-diagonal elements, here P = R(r)4 . In general, it is an 2nth tensor power, where n is the number of non-founders. Thus we have our states and our transition probability matrix, and hence our (product) Markov chain. To complete the specification of our HMM, we need observations and the associated emission probabilities, and an initial distribution. Alternative representation:  Alternative representation The purpose of inheritance vectors is to describe the possible patterns of gene flow through a pedigree. Once ordered pairs of alleles are assigned to founders, the 0s and 1s in the inheritance vectors specify the alleles that are passed from parent to offspring down the pedigree. As mentioned in the last lecture, an alternative representation of this gene flow is via what is known as a descent graph, see Lange’s book, ch 9. A recent paper gave an even more economical representation of what is needed, via what the authors called a sparse gene flow tree. I leave those interested to consult Abecasis et al, Nature Genetics 30 2002: 97-101. There (in the program Allegro) the efficient matrix multiplication that we will describe shortly is replaced by a sparse matrix-vector multiplication algorithm more general that the one we will give. Observations (phenotypes):  Observations (phenotypes) We now turn to our observations. The data we have on our (small) pedigree will generally consist of genotypes at many marker loci, and perhaps additional disease or other phenotype data, see the pedigree on p.2, where black filling of a square or circle indicated that the person is affected by some specified disease. In the case of interest to us here: reconstructing haplotypes, we’ll assume that there are just marker data. Suppose the unordered pairs of alleles (i.e. genotypes) at marker locus t come from a set (t). Then for f founders and n non-founders, our observations come from (t)f(t)n. Here I could have simply written the (f+n)th power, but it is convenient to keep founders and non-founders notationally distinct. As with the mouse crosses, we could add in ambiguity and missing data “observations”, but for simplicity we won’t do so here. Since we are mainly interested in marker data, let’s denote a typical (vector) observation at locus t by mt. Emission probabilities:  Emission probabilities Referring to our general discussion of HMM in the previous lecture, we now need to specify the equivalent of the emission probabilities q(i,j,k; t) at each locus t. These probabilities are generally functions of the current and previous state, but here they just depend on the current state, vt , and take the form q(vt ,mt ; t). We want this to be q(vt ,mt ; t) = pr(observations at t = mt | inheritance vector at t = vt ), but right now vt just describes gene flow: we haven’t got started. To complete our description, let us write at = (at,1 ,at,2 ….,at, 2f-1 ,at, 2f ) for the assignment of an ordered pair of alleles to each founder. Suppose that the frequency of allele at,h is pt,h . Under the population equilibrium assumptions we have previously mentioned, pr(at ) = ∏h pt,h . Finally, we sum over all at (in practice, over all at compatible with the observed phenotypes). This gets us started towards defining q(vt ,mt ; t). Penetrances:  Penetrances Having got started, note that alleles at for the founders and an inheritance vector vt gives us a set of ordered genotypes gt = (at ,vt ) at locus t by following the flow. We are almost there. The observations on each individual in the pedigree can now be assigned probabilities given their (ordered) genotypes. This last step involves the terms we have previously called penetrances - probabilities of observed phenotypes given genotypes - and a non-trivial assumption: that the probabilities of the different individuals’ phenotypes are conditionally independent given their (ordered) genotypes, and depend on the pedigree’s (ordered) genotypes only on their own (ordered) genotypes, i.e. pr(mt | gt ) = ∏i pr(mt,i | gt,i ), where the product is over individuals i in the pedigree, and the terms in the product are penetrances. Putting all this together, our HMM emission probabilities have the form q(vt ,mt ; t) = ∑ pr(at )pr(mt | at ,vt ) = ∑ ∏ pt,h ∏i pr(mt,i | gt,i ). Pedigree calculations:  Pedigree calculations The HMM is fully specified as soon as we give an initial distribution for the states, and this is simply the uniform distribution: each inheritance vector is assigned initial probability 2-2n. We now turn to carrying out the calculations efficiently, so that we can deal with as large small pedigrees as possible. Since the HMM formulation of pedigree analysis was introduced by Lander and Green in 1987, there have been a number of improvements. Looking back over nearly 20 years of these, the arms race comes to mind. Details of the first human version Crimap were never published, but it would have used the standard HMM simplifications we will review shortly. In 1995 Genehunter came on the scene, Kruglyak et al (1995,1996), and this had a few new tricks. Then Genehunter+ incorporated founder phase symmetry and what Kruglyak and Lander (1998) called Fourier analysis (on {0,1}n). The next improvement was Allegro (Gudbjartsson et al 2000), whose tricks are yet to be revealed to the world, and the most recent and fastest is Merlin (Abecasis et al, 2002), which uses sparse gene flow trees and some ideas from coding theory. The race will go on! The forward equations:  The forward equations It is time to present the standard HMM formulae, these going back to the original work of Baum et al, c1970. We will revert to the general HMM notation for this discussion, that is, our HMM has components (Xt, Yt), with Xt “hidden” and Yt observed. With Baum et al, define the forward quantities 1(i) = pr(Y1,X1 = i), where i is an arbitrary state of X1 , and t+1(j) = pr(Y1,Y2,…..,Yt+1 , Xt+1 = j) = ∑i pr(Y1,Y2,…,Yt ,Yt+1 , Xt = i, Xt+1 = i) = ∑i pr(Y1,Y2,…,Yt , Xt = i)  pr(Xt+1 = j |Y1,Y2,…Yt , Xt = i)  pr(Yt+1 |Y1,Y2,…,Yt , Xt = i, Xt+1 = j ) = ∑i t(i)p(i, j; t)q(i, j,Yt+1 ; t+1). When these are all computed up to T, simply sum over states i of XT to get pr(Y1,Y2,….., YT) = ∑iT(i). The work is in evaluating the sum, and the q terms. We look at the sum first. Evaluating the sum (a):  Evaluating the sum (a) In our small pedigree problem, T is the number of marker loci along our chromosome, which might be anything from 20 to 1,000, and the transition probability matrices p are all 2n2n, where n is the number of non-founders. For small small pedigrees, such as our nuclear family, this is not a problem, but as we try to push the limits of this algorithm, the evaluation of products of such matrices by conformal vectors (which is what our sum is) becomes limiting. How can we speed up the calculation of such products? It has been known for decades, perhaps centuries, that the application of NN matrices which are (equivalent to) tensor powers to N1 vectors can have the O(N2) multiplications reduced to O(N logN). I say centuries because it is said that Gauss used a form of the fast Fourier transform (FFT), and all these speed-ups are similar in spirit to the famous (and many times discovered) FFT. In our 2n2n case, the speed-up goes back to Yates in the 1930s, who used it to carry out the calculations for 2n factorial experiments. The details are in my class notes Stat 260, Week 8, Appendix to the Tuesday lecture. Evaluating the sum (b):  Evaluating the sum (b) The other term needing attention in the sum is the emission probability q(vt, mt; t), which we have defined as a potentially large sum over ordered pairs of alleles for all founders. A lot of effort has been devoted to finding ways of reducing the calculations To be completed The backward equations:  The backward equations In a manner quite analogous to that which gave us the forward equations, we can define T (i) = 1, and t-1 (i) = pr(Yt,Yt+1,…..,YT | Xt-1 = i), and find that it reduces to = ∑j t(j)p(i, j; t-1)q(i, j,Yt ; t). Exercise: Prove this. Note that Y = (Y1,Y2,…..,YT ), then pr(Y, Xt = i) = t(i)t(i), and hence pr(Y) = ∑i t(i)t(i) for any i. Exercise: Prove that pr (Xt = i | Y) = t(i)t(i) / ∑i t(i)t(i) , and that pr(Xt+1 = j | Xt = i, Y) = p(i, j ; t )q(i, j, Yt+1; t+1)t+1(j) / t(i) . Exercise: Reconstructing the haplotypes:  Reconstructing the haplotypes At last we come to our initial problem. To be completed…. Slide15:  Founders: individuals without parents in the pedigree. Non-founders: individuals with one or more parent in the pedigree Slide16:  Founders: individuals without parents in the pedigree. Non-founders: individuals with one or more parent in the pedigree Slide17:  Founders: individuals without parents in the pedigree. Non-founders: individuals with one or more parent in the pedigree

Related presentations


Other presentations created by Nivedi

7 habits of highly effective ir
24. 10. 2007
0 views

7 habits of highly effective ir

Intro to Middle East
23. 10. 2007
0 views

Intro to Middle East

Ulysses
01. 10. 2007
0 views

Ulysses

chasm
02. 10. 2007
0 views

chasm

english version
03. 10. 2007
0 views

english version

griffin BGP tutorial
07. 10. 2007
0 views

griffin BGP tutorial

eurjap2004pres
09. 10. 2007
0 views

eurjap2004pres

aboriginal art
10. 10. 2007
0 views

aboriginal art

Galloway
15. 10. 2007
0 views

Galloway

room
15. 10. 2007
0 views

room

keymicrobial rosenberg
23. 10. 2007
0 views

keymicrobial rosenberg

franklin
15. 10. 2007
0 views

franklin

JavaVsDotNET
21. 10. 2007
0 views

JavaVsDotNET

strategies maximize
29. 10. 2007
0 views

strategies maximize

rehder050407
10. 12. 2007
0 views

rehder050407

Chuck Bedsole
25. 10. 2007
0 views

Chuck Bedsole

WA1 2 Redmond
29. 10. 2007
0 views

WA1 2 Redmond

2005 Hand Washing Findings rev
30. 10. 2007
0 views

2005 Hand Washing Findings rev

ch5slides
07. 11. 2007
0 views

ch5slides

Turkey CoalRestructuring
26. 11. 2007
0 views

Turkey CoalRestructuring

chapter 28
23. 12. 2007
0 views

chapter 28

Crisis Management Lecture 2
29. 12. 2007
0 views

Crisis Management Lecture 2

Lecture 25
16. 10. 2007
0 views

Lecture 25

DERMATOLOGY QUIZ ANSWERS
05. 01. 2008
0 views

DERMATOLOGY QUIZ ANSWERS

Maeve Foreman
07. 01. 2008
0 views

Maeve Foreman

revNotes1stMC
04. 10. 2007
0 views

revNotes1stMC

Schuster
11. 10. 2007
0 views

Schuster

IWGT comet final2
30. 10. 2007
0 views

IWGT comet final2

mccay4
12. 10. 2007
0 views

mccay4

Presentazione ICE TUNISI
23. 10. 2007
0 views

Presentazione ICE TUNISI

Kapitel4
24. 10. 2007
0 views

Kapitel4

CAARI Lab 00
12. 10. 2007
0 views

CAARI Lab 00

Taipei August 05
19. 10. 2007
0 views

Taipei August 05

LOWBACKPAIN2
16. 02. 2008
0 views

LOWBACKPAIN2

berne
28. 02. 2008
0 views

berne

PAM
23. 10. 2007
0 views

PAM

slides bird flu
30. 03. 2008
0 views

slides bird flu

A105 003 Sky
13. 11. 2007
0 views

A105 003 Sky

hunter lovins ifm07
30. 10. 2007
0 views

hunter lovins ifm07

Psicotrópicos
24. 10. 2007
0 views

Psicotrópicos

IntroToRealOptions
16. 04. 2008
0 views

IntroToRealOptions

QUMRAN COMPRESSED
14. 02. 2008
0 views

QUMRAN COMPRESSED

third exam review
18. 04. 2008
0 views

third exam review

McGraw Hill
22. 04. 2008
0 views

McGraw Hill

posp72 0
11. 10. 2007
0 views

posp72 0

7 Metlin
11. 10. 2007
0 views

7 Metlin

cshcn
07. 05. 2008
0 views

cshcn

ls3 d2 room22
30. 04. 2008
0 views

ls3 d2 room22

Open GL
02. 05. 2008
0 views

Open GL

physics of accelerators
02. 05. 2008
0 views

physics of accelerators

thuy
02. 05. 2008
0 views

thuy

maryland 080202
19. 02. 2008
0 views

maryland 080202

CONFECA2005
22. 10. 2007
0 views

CONFECA2005

20061201 afrinic5 report
27. 03. 2008
0 views

20061201 afrinic5 report

Roe Summer Lecture
13. 10. 2007
0 views

Roe Summer Lecture

opa
22. 10. 2007
0 views

opa

break this glass update
07. 01. 2008
0 views

break this glass update

GGunn Feb05
12. 03. 2008
0 views

GGunn Feb05

cronqvist
31. 12. 2007
0 views

cronqvist

jcdl
02. 10. 2007
0 views

jcdl

jje
10. 10. 2007
0 views

jje

milcho
15. 10. 2007
0 views

milcho

Masahiro Satake session 6
09. 10. 2007
0 views

Masahiro Satake session 6

TIANS 2003 final
11. 03. 2008
0 views

TIANS 2003 final

mariana
22. 10. 2007
0 views

mariana

Schulthess X1Review Feb2004
15. 10. 2007
0 views

Schulthess X1Review Feb2004

april croatia
26. 03. 2008
0 views

april croatia

200793142735
03. 01. 2008
0 views

200793142735

CultureClashWestPoint
17. 10. 2007
0 views

CultureClashWestPoint

notifications
10. 10. 2007
0 views

notifications

Daar
16. 10. 2007
0 views

Daar

IGEM2006 Imperial Powerpoint
01. 01. 2008
0 views

IGEM2006 Imperial Powerpoint

booklet osijek2006
18. 03. 2008
0 views

booklet osijek2006

12km MM5 Issues Mar8 9 2005
29. 10. 2007
0 views

12km MM5 Issues Mar8 9 2005