ALEX BUERKLE: One is that I have-- and you looked at earlier a commands.r file and a couple of JAGS files that are in the lessons that I'm teaching. I updated those since you were last working at some point. And so please make sure that you're working with a fresh copy of that. So if you downloaded it before, while I'm getting set up, why don't you download another one? There is a chance that one of them didn't change, but might as well have the fresh one.
The other thing that I wanted to say is, at the beginning, I said that there was a molecular ecology paper that Zach and I had written about how low you should go in terms of sequence coverage. If you read that or if you've read it already, you'll recognize that these population allele frequency models that I was talking about earlier are described in that paper. In the supplement to that, there is JAGS code and code for simulating allele frequencies. It's similar to what we're going to use in a little bit.
But if you want to be able to work with that analysis and understand how low coverage works and how it affects genotype probabilities, there is code there. It's not exactly the same code as we'll use today, because we'll be addressing slightly different questions, but you should be well-enabled to be able to use that in the future.
The third thing, which stems from a question or a conversation that I was having with somebody, is to remind you that I don't really expect that many of you will ever, ever need to program a beta binomial or be able to estimate these things or specify them yourselves. We're trying to take a peek under the covers to understand how these things work. But if you're using Structure or if you're using many of the estimation programs for population genetics, this is what's happening underneath.
And so it gives you some idea. And like I said at the beginning, it's like reading the nutrition label on food. We can understand what choices are being made. Are we assuming beta 1, 1 and no information sharing? Or are we sharing information in a way that we like?
And what are the trade-offs associated with that? I said I was going to get organized. I'll get organized now, if you guys could pull down those copies of files and we'll pick up where we left off.
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: Pardon. So on the Materials page, there are kind of two main web pages. There are two tabs there at the top. The materials-- I think I'm towards the top, Buerkle lessons there. And there is commands.r, locusdiversity.jags, and LocusFmodel, I believe are the three files.
AUDIENCE: Make sure you refresh the page.
ALEX BUERKLE: Yeah, you might need to refresh the page.
OK, are you guys ready to get started then? We had talked about three different allele frequency models. The last one we talked about had just introduced the possibility that we weren't observing genotypes directly, but instead that we had allele counts at a locus for an individual and that those were our data. We want to modify that model now and write a new multi-locus model for allele frequencies and diversity, again with data that involve genotype uncertainty, with population pool data.
So let's start out as we have before in terms of specifying the model, where we're going to specify the posterior probability distribution of all the parameters we're going to estimate given the data. So it's going to be our vector of allele frequencies. I'm going to write q for the frequency of that allele in our pool and theta as our three parameters.
We're going to have read counts for a pool and the different alternative states now. So that's going to be just like our x before, the read counts of the alternative states, but they're going to come from a bar-coded pool of individuals. We know how many reads we have at that locus. And we have to have an idea of the number-- we have to know the number of individuals that we put into our pool, because that's going to be a level that's going to affect our sampling along the way.
So what parameter is missing in here relative to the last one? What have we taken away?
AUDIENCE: Genotypes.
ALEX BUERKLE: Genotypes. There are no more genotypes. So in a way, you could say this is not a model for genotype uncertainty. It's not even a model for genotypes. We just have reads that affect the frequency of that state in the pool.
And that frequency of that state in the pool is going to be affected by its frequency in the population that we sampled from. So we can start specifying our model further. We're going to take a product across loci again. And we're going to have our read data at that locus. And it's just going to be a function of the frequency of that allele state in the pool and the number of reads that we have at that locus.
So that's a new part. That's going to be multiplied by-- I'm sorry. Yeah, that's fine. Product is probability of x given q and n. And we're going to multiply that by the probability of q, the frequency of that allele state in the pool. And that's going to be conditional on a couple of things.
So if we have DNAs that we've pooled in an equimolar concentration in a very, very careful way to try to make sure that each individual is represented by the same number of chromosomal copies in the pool of DNA-- you've extracted it. You very carefully quantified how much DNA was in each tube. You know how terrible that is, but you're doing it anyway.
And you then pipette those in equimolar concentrations into a tube. That's going to be a sample from the true population. And it's going to be affected by the number of individuals that went into that pool.
The more individuals that you sample, the more it should reflect the true population allele frequency. One individual is not a population pool. And it would be a bad indication of population allele frequency.
100 individuals going into a pool is going to tell you more about the population allele frequency. So whatever we do in terms of parameterizing this, we should take into account the number of individuals that went into the pool. And having more individuals should decrease our variance.
So what we're going to do is we're going to de-- this is going to be a beta distribution. And we're going to be a little bit fancier than we've been so far with beta distributions. We're going to have two terms that make up our parameter.
We're going to have 2n as part of it. And that's going to be multiplied by p, the allele frequency in the population. That's going to be our parameter.
So the expectation for our allele frequency in the pool is just what it is in the population, p sub j. We're going to scale that with a scalar, which is our intensity of sampling that we did. And remember, we thought about betas a bunch today.
As beta parameters get big, the variance goes down. We tighten up around it. When we go below 1, that's when we have the highest variance.
So if we sample a bunch of individuals, this parameter will get big, 2n, the 2n allele copies that we sampled into our equimolar population pool. So that's how we're going to parameterize. That will be a beta.
And it will look like this. I'll leave myself a little bit more space. That will be a beta with our two parameters as 2n times p sub j, and the alternative will be 2n times (1 minus p sub j).
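To make that concrete, here is a minimal sketch of that Beta(2n·p, 2n·(1−p)) pool-frequency prior and how its variance shrinks as more individuals go into the pool. This is Python rather than the course's R, and `pool_frequency` and `pool_variance` are invented names, not anything from commands.r:

```python
import random

def pool_frequency(p, n_ind, rng=random):
    """Draw the pool allele frequency q ~ Beta(2n*p, 2n*(1 - p)),
    where n_ind diploid individuals contribute 2n allele copies."""
    return rng.betavariate(2 * n_ind * p, 2 * n_ind * (1 - p))

def pool_variance(p, n_ind):
    """Variance of Beta(a, b) is a*b / ((a+b)^2 * (a+b+1)); with
    a = 2n*p and b = 2n*(1 - p) it simplifies to p*(1-p)/(2n + 1)."""
    return p * (1 - p) / (2 * n_ind + 1)

random.seed(1)
q = pool_frequency(0.5, 100)       # one stochastic pool frequency
print(pool_variance(0.5, 1))       # 1 individual: variance ~0.083
print(pool_variance(0.5, 100))     # 100 individuals: variance ~0.0012
```

The two printed variances show exactly the point being made on the board: one individual is a terrible pool, 100 individuals tighten the beta up around the population frequency.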
And we need to keep going. I haven't finished. We need a hierarchical prior for p sub j, which is just our theta, and then a hyperprior for probability of theta. That part is the same as previous. So let's talk a little bit more about this.
This is a common way in which we can work with the beta. We can have a point estimate, like an expectation. And we can apply a multiplier to it to scale it in terms of our variance.
And we did a bunch of examples of those, that series of different betas that we did with 1, 1, the expectation was 0.5. And we went to 10, 10 and 100, 100. Those all had the same expectation of 0.5, but we reduce the variance. And so that's-- we're taking advantage of that feature of the beta here, that we can decompose the expectation and have a scalar.
Now, n is not a parameter here, is it? It's fixed. We know how many individuals went in there. So there doesn't need to be some type of hyperprior for n.
Someone asked me a question about that, though. If you're working with-- and it has come up multiple times, because there are people who do marine biology and they don't know how many individuals go into their sample. There is a chance you might be able to estimate n. But I don't know whether you could estimate it from a sample. It would be a cool thing to try out. You probably can't.
So that's our beta. How do you feel about that? Doing OK? Questions? So that's a way we can go from population allele frequency to pool frequency. Then we have this pool frequency.
And we're going to sample reads from that. What type of distribution is that going to be? So this is our conditional prior on q. What about our likelihood? What type of distribution is that going to be?
AUDIENCE: Binomial?
ALEX BUERKLE: It's a binomial, right? We're doing Bernoulli trials. Each allele copy that we-- or each read that we sample out could be one state or another in this, the way we've done this. We're dealing with biallelic SNPs. It can be one state or another.
And we have a certain number of reads at that locus. And so it will just be binomial with our parameter qj and the number of reads at that locus. I'm sorry, not given, comma.
So that's the model specification. There are some fancy things that you can do. And you can integrate out all the uncertainty in q to make this a little bit more compact. But this is a fine way to specify the model. This is all the steps. And it illustrates the mechanism of building up from a prior for allele frequencies at a locus.
We have those allele frequencies in the population, those parameters. And we then have pool frequencies. And we're then drawing reads stochastically from that pool. Any questions about that? Yes?
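The whole hierarchy just described, population frequency p to pool frequency q to reads, can be simulated in a few lines. This is a hedged Python sketch rather than the course's R/JAGS code, and `simulate_pool_reads` is an invented name:

```python
import random

def simulate_pool_reads(p, n_ind, n_reads, rng):
    """Simulate one locus through the full hierarchy:
    population freq p -> pool freq q -> read counts x."""
    # conditional prior: q ~ Beta(2n*p, 2n*(1 - p))
    q = rng.betavariate(2 * n_ind * p, 2 * n_ind * (1 - p))
    # likelihood: x ~ Binomial(n_reads, q), built from Bernoulli trials
    x = sum(rng.random() < q for _ in range(n_reads))
    return q, x

rng = random.Random(42)
q, x = simulate_pool_reads(p=0.3, n_ind=50, n_reads=200, rng=rng)
# x / n_reads is a noisy view of q, which is itself a noisy view of p
```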
AUDIENCE: q was an estimate of the genotype?
ALEX BUERKLE: It is an estimate of the frequency of that state in the pool. And so there were a lot of genotypes that went into it, some heterozygotes, some homozygotes. Each one of those allele copies went into the pool. So a heterozygous individual contributed equally to both states. And that's why the whole equimolar concentration is going to be important, because the sequencer is going to sample reads out of that pool according to their molar concentration.
You've lost track of individuals. You pooled them. And so anything that skews the frequency of individual chromosomal copies by, like, having low DNA concentration in there, will skew your estimate of q. So it's important that all n individuals are represented equally or that you're OK with your allele frequencies being slightly wrong.
Generally, in all cases, I recommend barcoding individuals. There is a great interest in pool-seq. And Christian Schlötterer has written a number of papers. We've written models that can use pools.
But under all circumstances, I will avoid using them because of this problem, is that you don't observe-- you lose track of individuals. You lose other information. But there is-- technical problems can mislead you about the allele frequencies.
AUDIENCE: So p is the real frequency in the population. q is like the allele frequency and your [INAUDIBLE].
ALEX BUERKLE: Yeah, tube, and you're hoping that reflects what was in the population. And it will, up to the variance from the 2n allele copies that went in there, as long as all n individuals are in there equally. OK, I wanted to just comment briefly on graphs of these models. In that paper from this year, we drew little pictures of them. And they can be fairly useful. So let's draw four pictures for these four models that we've done just now, as a bit of review.
So there was the first model that we did where we set alpha and beta equal to 1. We fix those. Those were parameters that were important for understanding the allele frequency at a locus. And then we had genotype at the bottom. So that was our first model.
And in each case, this distribution was beta p given alpha and beta. And this distribution was binomial for the genotype given p and n. And n equals 2 at this level, because we're talking about the two diploid copies in an individual.
The second model that we had, we relaxed this assumption of alpha and beta equals 1. And instead we wrote alpha equals theta and beta equals theta. And we made those depend on a hyperprior for theta.
And that hyperprior had a uniform distribution. The rest of the model was the same. It was a model for p and for g.
The third model, we then had genotype uncertainty. And so all of these top things were the same. We just layered one thing on top, or on bottom of it, depending on how you look at the world, where genotypes are a function of allele frequencies in the population. And then your reads are a function of the underlying genotype.
And those reads are sampled with a binomial, given the genotype and the number of reads that you have for that locus. So that figure is, I think, in that 2013 paper. It might be useful for you to just organize it.
One of the things I thought we'd do quickly as a review is, I didn't draw the picture for the fourth model. So can you-- they're lines. I know it's not that complicated to draw them in series. But why don't you draw the analogous model, the directed acyclic graph for that population pool model? Take a couple of minutes to do that. And then we'll look at our answers.
Shall we look at this together? So the first levels here were the same, right? We used our flexible beta with theta estimates. And that continued all the way down to the allele frequency in the population.
The next level there, instead of having genotypes, we had the pool allele frequency. And then we had data, which were our reads. So we just need to line up next to that what the different distributions were. And the one that has changed or the one that needs to be clarified is this beta, where we have parameter p times 2n and 1 minus p times 2n.
That's the distribution at that level. And then the final one is, again, a binomial for our reads. But it just has-- it's a binomial for our reads that's given q and our number of reads at that locus in the pool.
Sorry, that's a little bit sloppy. But I think we wrote it cleaner before, so the beta with 2n times p and 2n times (1 minus p). And then we're doing binomial sampling from the read pool. Does that make sense?
I think these graphs can be useful as a little exercise. And we're not going to have time to do it today. But as you go and read the specification of models like in this case of structure, I think it's very useful to maybe draw one of these for yourself.
In particular, what we don't have here that's obvious is any branching or different arrows coming in from different directions. And it can be useful for organizing your thinking and recognizing what the dependency is to go in, oh, I'm now doing genome-wide association mapping. What does one of these graphs look like for that type of model?
And it would be a great service if people who wrote the papers to begin with, if that's what you're going to do, would put those figures in, I think. Although, I say that. I've never done it except for in this one paper. But I now recognize the value of them, I think, for being able to peer into what's going on.
All right, ready to move on? I am ready for topic number three, which is F models for population differentiation. So I'm just going to say something very briefly, that the allele frequencies in populations, the amount of genetic differentiation between populations, is obviously influenced by all kinds of evolutionary processes, drift, demographic histories, all these things that we're interested in.
And so we can use parameter estimates to shed some light on those things. It's a step to do some statistics, to estimate these parameters, to try to connect them to underlying process. That's why we're interested in this thing, this measurement of population differentiation. So I'm going to be very brief there, because I think everybody has an appreciation of why you might want to measure population differentiation.
So let's talk a little bit about different ways of quantifying differentiation or choices with FST. It's one of the commonly-used statistics for quantifying differentiation. And it's a measure of the variance in allele frequencies among populations.
So maybe that sounds mundane, but it's actually pretty profound. It's a measure of the extent to which frequencies are homogeneous and the same in all of our populations, or whether they vary from one another. And so we want some type of statistic that does a good job of summarizing that variance and actually has information about it. Not all of them do.
And in general, there is ambiguity in what we mean when we say "FST." Not all people mean the same thing when they write FST. In fact, a lot of people write FST and they're not estimating FST.
They're using some estimator of FST. They're using GST. Or they're using something else. So we need to be fairly clear about what we're using.
But the ambiguity that's probably the most consequential is that there are actually two different ways of looking at this. One is that it's a deterministic statistic. So it's a deterministic transformation of the allele frequencies to give you that parameter. That should be an a right there. So it's a deterministic statistic or summary of allele frequencies.
What's this definition? If you're thinking of FSTs and the different ones that you know about, which ones or which family of FSTs is just a deterministic transformation of the allele frequencies? Anybody know?
I'll write it. And then we'll see if you guys recognize it. If you've seen this, you have an equation of H sub T minus H sub S, over H sub T. A lot of textbooks have that in it, that that is FST.
That is a transformation of the allele frequencies, because allele frequencies are the heterozygosities. That's how you get those. And it's transforming those by taking the difference of two terms and dividing it by another term.
So it's a transformation. It's deterministic. If you have some observation of the allele frequency, you have FST. That is a value.
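As a sketch of that deterministic view, here is the (H_T − H_S)/H_T transformation in Python rather than the course's R. `fst_from_freqs` is an invented name, and it assumes a biallelic locus with equally weighted populations:

```python
def fst_from_freqs(freqs):
    """Deterministic (H_T - H_S) / H_T for a biallelic locus,
    given per-population frequencies of one allele."""
    k = len(freqs)
    p_bar = sum(freqs) / k
    h_t = 2 * p_bar * (1 - p_bar)                    # total expected het
    h_s = sum(2 * p * (1 - p) for p in freqs) / k    # mean within-pop het
    return (h_t - h_s) / h_t

print(fst_from_freqs([0.2, 0.8]))   # diverged populations -> ~0.36
print(fst_from_freqs([0.3, 0.3]))   # identical frequencies -> 0.0
```

Given the allele frequencies, there is nothing stochastic here: the same frequencies always give the same value, which is exactly the point being made.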
So you might be wondering, well, what is the alternative? It's that you could consider it a random effects evolutionary parameter instead, because it's possible that the same evolutionary parameter, the same population size, if we did evolution multiple times, would lead to different allele frequencies. It's a stochastic process with an expectation, evolution is. Drift is the main thing that's going to affect allele frequencies.
And so if you ran evolution multiple times, you-- with the same evolutionary parameter, you would get different allele frequencies. So the ones that we observe are a random observation, a stochastic observation, a deviate from an underlying probability distribution. And we want to characterize that evolution.
We want to understand that evolutionary parameter that tells us something about the demographic history. And that type of approach to things underlies using coalescent estimators of FST. Or it also is the idea behind Weir and Cockerham's theta. It's also the idea behind various F models. My pen is somewhat-- it's flagging somewhat.
So in the first case, if we have uncertainty about FST, it's only going to reflect our uncertainty in allele frequencies, the finite sample of individuals that we had. But if we have a random effects model, we want to have both a reflection of our finite sample that we took, but also recognize that that's just a random draw of that. That is, that population that we studied is one instance of many possible outcomes of the same evolution. Does that make sense?
So and the argument in favor of the random effects is just that. It's that, we don't think that if we had that same population start at the same place and we watched it for the same number of generations under the same conditions, that we would necessarily get the same allele frequencies. When we see allele frequencies, there is some uncertainty about which process gave rise to those. How much FST? How much drift gave rise to them? And so we want to have some uncertainty in terms of the evolutionary stochastic process that gave rise to those allele frequency differences, not just our sampling.
This, you can calculate in Excel. But it doesn't incorporate your uncertainty about the evolutionary process that gave rise to it. So that's our motivation to think some about F models. But Weir and Cockerham's theta or other estimators incorporate that perspective as well as a random effects model.
So let me give you an overview of F models. I'm going to grab another pen. Thanks. There is one.
So the cut to the chase a little bit, we are-- an F model, what it says is that allele frequencies have a beta distribution. So it's consistent with other things I've been saying today. So we have a beta distribution of population allele frequencies.
And it's going to be a beta distribution with parameters pi and theta and 1 minus pi and theta. This is a different theta than we were using before. There are a limited number of characters. But there is reasons to choose that.
This doesn't say anything about FST yet. But theta equals 1 over FST minus 1 does. So this type of parameterization of the beta should look familiar. We were just doing this a moment ago, where we decompose the parameter into two terms, a pi type of term and some type of scalar that's next to it.
And so what we have here is a pi, an expectation. So this pi, you can think of as the expectation. And the reason that works is, remember that the mean or expectation of a beta is alpha over alpha plus beta.
If you do that with these two things, you will see that you will solve for pi. The theta, or the scalar, goes away. So if you want to do that for yourself and do a little bit of arithmetic, at one point you will be adding pi plus 1 minus pi.
And so that will go away, because that will then be 1. They cancel out. And you'll divide by theta. And you'll just be left with pi.
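Written out, the arithmetic being described is:

```latex
% Mean of the F-model beta, with \alpha = \pi\theta and \beta = (1-\pi)\theta:
\mathrm{E}[p]
  = \frac{\alpha}{\alpha + \beta}
  = \frac{\pi\theta}{\pi\theta + (1-\pi)\theta}
  = \frac{\pi\theta}{\theta\,(\pi + 1 - \pi)}
  = \frac{\pi\theta}{\theta}
  = \pi
```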
So our expectation for allele frequencies in a population is pi. And so you can think of that pi as either the allele frequency in the migrant pool in an infinite island setting or the allele frequency in some ancestor. OK, so that's where we should be. And then the scalar is the amount of drift from that or the amount of evolution, whatever caused the difference in allele frequencies. It's some type of distance and some type of variance that would be expected under that parameter.
OK, so how does that work? If theta is really big, are allele frequencies in the population going to match the common ancestor or be far away from it?
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: I'll try.
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: Yeah, so if theta is really big-- that's our scalar thing in our beta that we played with this morning. If that's a big number, are our allele frequencies in the population going to match pi? Or are they going to be far away from it? What's our expectation for the variance?
AUDIENCE: [INAUDIBLE].
ALEX BUERKLE: They're going to be right on top of it. So if you have a beta of 1,000, 1,000, you're right on top of the expectation. But if they're really low, you could have had a lot of drift from it. And that's where FST comes in, because that theta-- so if FST is close to 0 and we do 1 over that, that becomes an enormous value.
0 FST expects us to be right on top of the expectation, whereas if FST is big, then we can have much more variance from the ancestral allele frequency or from the allele frequency in the common migrant pool. So that's cool, right? It's all hanging together for us.
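That relationship between theta and FST can be checked numerically. This is a Python sketch with invented names (the course code is in R), using the standard beta variance formula, which under this parameterization collapses to FST·pi·(1−pi):

```python
def fmodel_params(pi, fst):
    """Beta parameters under the F model, with theta = 1/FST - 1."""
    theta = 1.0 / fst - 1.0
    return pi * theta, (1 - pi) * theta

def fmodel_variance(pi, fst):
    """Var of Beta(a, b) = a*b / ((a+b)^2 * (a+b+1)).  With
    a = pi*theta, b = (1-pi)*theta and theta = 1/FST - 1,
    this collapses to FST * pi * (1 - pi)."""
    a, b = fmodel_params(pi, fst)
    return a * b / ((a + b) ** 2 * (a + b + 1))

print(fmodel_variance(0.5, 0.01))   # small FST: tight around pi (~0.0025)
print(fmodel_variance(0.5, 0.4))    # big FST: wide spread (~0.1)
```

So an FST near 0 makes theta enormous and pins the descendant frequencies on top of pi, and a big FST lets them drift far from it, which is the verbal argument in numbers.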
So this F model arises in those two circumstances. And I should say, it arises approximately blah, blah, blah, in pop-gen terms. It arises approximately under those two conditions, the infinite island model and under the circumstance of divergence from a common ancestor.
I'll say a little bit more about those, but let's write one more note, that when those don't hold, FST measured this way is still a measure of differentiation. And so it's still a measure, what I mean by that, it's still a measure of the variance in the allele frequencies in your sample. It just might be that it's not, strictly speaking, under that circumstance.
So for example, one of the ways that this condition could not hold-- let's say you have 10 populations and you have hierarchical structure within them, so that five of them are more closely related to one another in one pool, and five are more closely related in another clade. And then those have some hierarchy in the structure. Then the infinite islands and the single migrant pool doesn't hold. Nevertheless, FST is still a good measure of variance in allele frequencies across those 10 populations.
AUDIENCE: But I don't understand why [INAUDIBLE] distribution is the best distribution for FST.
ALEX BUERKLE: It's not the best distribution. It just happens to be that the F model posits that it is beta distributed. And this is the relationship. It's not the best. It is a way of modeling genetic differentiation.
And so it's the one I'm going to present today, because it gives rise to a number of attractive features. But I would not argue that it is the best or the only, by any means. It does have the attractive feature that it's a random effects model and that we're integrating over all the different evolutionary processes that could give us the same allele frequency shifts.
And so it has some attractive features. Was that addressing your issue? OK.
So I'm not the only person who is excited about the F model. It's been used before. And honestly, most people who have used it didn't realize that they did, unless they used, for example, Foll and Gaggiotti's F model, where it's explicitly called out as one.
Where people probably don't know that they've used it is that it's in structure. And so the Pritchard et al. 2000 paper has a model that they called the correlated allele frequencies. It's in the appendix.
Lots of people use it. And they use this character, a little f. But it clearly has an FST interpretation. So it exists in other software. And we've put it in software that we've used.
What I'd like to do is to use R for a moment to look at how allele frequencies and FST interact. And so if you go-- I asked you to download a new copy of the commands.r. In there at this section for where we are, 3.2, this number four, go look for the code there. There are two plots to do. And what I'm asking you to do is to look at how allele frequency and FST interact.
We want to look at pi equals 0.1 so that we have a rare allele or an intermediate frequency allele. And we want to combine that with FST that's very small so that we would expect to not have very much drift from that, whatever pi is. And more interestingly, probably, let's put a big FST in there, 0.4.
So what I have in the code for you is an intermediate allele frequency with high and low FST. I've already modeled that for you. But go ahead and try it with an alternative allele frequency as well.
So pull up that code. Make a plot of what the beta distribution will look like with FST of 0.5, and those two different ancestral frequencies, and the different FSTs. And I'll pull it up on my computer as well so that we can look at it there.
Is everybody getting that to work? Do you guys know what I'm talking about? OK. That's what we're-- I'm going to see what I can do.
Have you guys had a chance to paste that in? So let's look at that together. So those are the two curves that I gave you code for. So let's think of what those two are. Both of them have an expectation of 0.5.
We put in pi of 0.5, meaning that the ancestral or the common migrant pool frequency is 0.5. And then we're applying a scalar term to that. It affects the variance in allele frequencies that we expect due to the evolutionary process.
So we have an ancestral allele frequency of 0.5. The blue line came from a distribution, or an FST, of 0.01, an FST that we would all consider to be small. If you saw that in a paper, you would think, that's a small FST. An FST of 0.01 will distort allele frequencies from their ancestral frequency, or relative to the common ancestral pool, that much. That's our probability distribution for allele frequency in the descendant population.
And I lost track of this, but we can think of this. When I'm thinking about descendant from the common ancestral population, I'm thinking pi, the ancestral allele frequency. And we have samples from it that are our p's for our individual populations. And this FST is going to reflect, give us an indication of how much evolution occurred between that ancestral population and our descendant populations.
We can have this much variance in allele frequencies from a tiny FST, which gives you an idea of, oh, well, yes, we knew that drift was stochastic, and we knew that it had a big outcome, a big effect on allele frequencies, and we knew that the coalescent leads to really wide variation under the same circumstances. This is that same phenomenon.
Now, what about if FST is 0.4, which is remarkably high? But between species, you certainly could have something like that. Or maybe you have a really widely distributed species.
It could be 0.4. What does that do in terms of where we sit in sampled populations relative to the ancestral population frequency? We can be very far from it. We have high probability of going to fixation or loss for individual alleles relative to that. And there is very, very high variance expected amongst replicates, amongst populations.
The same type of thing would go for loci if we had different FSTs, or a common FST for the organism and we were sampling different loci, we might expect huge ranges of variation amongst loci when it's that large. FST and the greater amount of evolution increases the variance enormously. And it does it in that parametric way that we've now learned about, where we have pi times theta. So if theta is big, the variance will be low. So that's one of the reasons we played with the beta distribution so much.
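That blow-up in variance can also be seen by Monte Carlo rather than from the density. This is a hedged Python sketch with an invented name, `sample_descendants`; the plots in commands.r do the analogous thing in R:

```python
import random

def sample_descendants(pi, fst, k, rng):
    """Draw k descendant-population allele frequencies from the F-model
    beta, Beta(pi*theta, (1-pi)*theta) with theta = 1/FST - 1."""
    theta = 1.0 / fst - 1.0
    return [rng.betavariate(pi * theta, (1 - pi) * theta) for _ in range(k)]

def spread(freqs, pi):
    """Mean squared deviation of the draws from the ancestral pi."""
    return sum((f - pi) ** 2 for f in freqs) / len(freqs)

rng = random.Random(7)
tight = sample_descendants(0.5, 0.01, 1000, rng)   # small FST
wide = sample_descendants(0.5, 0.4, 1000, rng)     # big FST
# spread(tight, 0.5) sits near 0.0025 and spread(wide, 0.5) near 0.1,
# matching FST * pi * (1 - pi) for each case.
```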
Let's contrast that briefly with a slight alternative. Where would we need to change things to do this exact same thing, but for low allele frequencies? We would just need to change all these 0.5's to 0.1's, and look at when we have a common allele and a rare allele, how does that play out then?
I think I got both things copied over. It's worth taking the y limits off, I think is the issue. So that's with lots of drift. This was a little less.
So now we've shifted things so that we have an allele frequency of 0.1 in our ancestor. So there is a common allele and a rare allele. And we could do the same thing with 0.9. It would be symmetrical, right? So we have a common allele and a rare allele. And we've applied those two levels of FST.
With very low FST, we will tend to still sit on top of that ancestral expectation. But with high FST, we're almost always going to fix or lose the rare allele. And in general, if you play with some of these things and adjust different FSTs, what you will recognize, and the reason I gave you this code, is that because allele frequencies are rarely intermediate, or at least that's my view of the world, very often there will be very little information about FST from gathering sequence data.
The more that they're intermediate, the more-- to me, that's a much more neutral way of saying it. The more intermediate your allele frequencies are in the common ancestor, the more opportunity there is to learn about FST. But the closer you are up against 0 and 1 for your alternative alleles, the less opportunity there is to reconstruct that variance parameter about the evolutionary process. So are there new questions about F models? Yes?
AUDIENCE: So just two types of [INAUDIBLE] measuring FST directly versus these kind of [INAUDIBLE]?
ALEX BUERKLE: Yeah, so this is reflecting our uncertainty about FST. It stems both from our finite sample of individuals, which gives rise to uncertainty in allele frequencies, and from recognizing that different FSTs can give rise to the same allele frequencies.
Evolution is a probabilistic process. So we could have a set of data from these populations that tell us a whole bunch about the allele frequencies within them, and we would know those with high confidence; we just sample enough individuals.
And we'd have three estimates of this. But there is another level of stochastic process operating. The allele frequency differences that we see between these populations are just one sample, or in this case, three samples, from an evolutionary history.
And if you reran this same demographic history to produce three populations again, you would get different allele frequencies and different FSTs. So this allows us to incorporate our uncertainty about that, because we know it's real. We know that this is just a sample.
And so for example, if you go and sample three populations, that's much less information about the amount of evolution, the amount of drift relative to the common migrant pool, than 50 populations would be. And so you want to characterize and incorporate that level of uncertainty as well. I mean, you're not alone in not fully understanding this. In most empirical papers, people are hard pressed to tell you which type of estimator they used; they can just tell you the software that they used to calculate FST.
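To see how the number of sampled populations limits what you can learn about FST, here is a hypothetical Python sketch: each replicate "study" draws k population allele frequencies under the F model and forms a simple method-of-moments estimate, var(p)/(pi(1−pi)), treating the ancestral pi as known. The estimator is illustrative, not the workshop's; the point is how much the estimates spread across replicates.

```python
import random
import statistics

def fst_estimates(true_fst, n_pops, n_reps=2000, pi=0.5, seed=7):
    """Repeat a 'study' n_reps times: draw n_pops allele frequencies
    under the F model, then estimate FST as var(p) / (pi*(1-pi)),
    treating the ancestral allele frequency pi as known."""
    rng = random.Random(seed)
    theta = (1 - true_fst) / true_fst
    ests = []
    for _ in range(n_reps):
        p = [rng.betavariate(pi * theta, (1 - pi) * theta)
             for _ in range(n_pops)]
        ests.append(statistics.variance(p) / (pi * (1 - pi)))
    return ests

# Spread of FST estimates when sampling 3 versus 50 populations
for k in (3, 50):
    e = fst_estimates(0.1, k)
    print(k, round(statistics.stdev(e), 3))
```

With only three populations, the estimates scatter widely around the true FST of 0.1; with fifty, they cluster tightly. That extra evolutionary-sampling variance is exactly what a random-effects treatment is meant to carry forward.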
And then you tell them, well, it's Weir and Cockerham's theta that you calculated. And many people wouldn't be able to tell you that that's a random-effects model that incorporates this uncertainty in the same way, that those population allele frequencies are stochastic samples from that evolutionary process.
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: So for loci that have intermediate allele frequencies, with pi near 0.5, you can play with these numbers and show that you're going to tend to be able to learn more about FST. This is related to the point that microsatellites are not very good for estimating FST: you have an allele frequency for each of the alleles, and all of them are small, because there are a bunch of them, so you can't move very far between them. Anyway, it's somewhat related.
OK, we want to take a coffee break pretty soon. But I want to add a point. And I'm going to have to curtail it a little bit relative to what I had planned.
But the next thing that I wanted to talk about is that-- and this is one of the reasons to like the F model, is that you can--
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: Thank you. I do that by doing this. I get much more practice once the semester starts. We can define a locus-specific F model. The reason we'd be interested in this is that we think loci can have different FSTs. So FST can vary among loci.
FST could, of course, also vary among populations. What we've been talking about so far is all of our replicate populations being samples from the same FST so that they share one FST. That's not necessarily an assumption that we need to make.
We could instead estimate a separate theta for each population relative to the common ancestor. Similarly, when we write this locus-specific model, we can relax the assumption that all individual loci have the same FST. Instead, we can say that the locus-specific FSTs come from a distribution themselves and study that distribution.
Ideally, in a world that's impossible, we would like to be able to calculate locus-specific FSTs within populations and have that nestedness. And various people have found that that's not possible, because you don't have information. And so you can either calculate locus-specific FSTs with replication across populations. Or you can calculate population-specific FSTs. And all of your loci give you common information about that.
So what I'm going to focus on today is just the locus-specific angle. But you could also have a population-specific FST in an analogous way. So what we're going to be thinking about is a group of populations, maybe a pair, which is what people often end up doing, a group of populations that share locus-specific FSTs. So this group could be a pair. And I'm going to try to sketch out what that looks like here.
It's going to be analogous to what I sketched out just a minute ago, when I told you that an F model posits that allele frequencies are beta distributed. We're now just going to index those by locus. And we're not going to require that there is a single theta; we're going to index theta by locus and allow the thetas to vary from one another.
And so we're simply going to write a conditional prior on allele frequencies that has that pi and theta, but that's now going to be across loci: beta distributed, with an allele frequency for that locus in the migrant pool or common ancestor, and a separate theta term. I'm sorry, this should be a product. So it's pi sub i times theta sub i, and then 1 minus pi sub i times theta sub i, just like we had before, but now we're indexing them.
And so there should be some choices that are apparent to you. We're now going to have to make some decisions. This is a big pi, a product across them, and this is a little pi. I recognize that that is ambiguous when projected onto a screen; it looks beautiful in my notes. That's a big pi, a product across all the individual loci.
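Written out, the conditional prior being described on the board, with i indexing loci and j indexing populations (the inner product over populations is my reading of the setup, made explicit here), would be:

```latex
p(\mathbf{p} \mid \boldsymbol{\pi}, \boldsymbol{\theta})
  = \prod_{i} \prod_{j}
    \mathrm{Beta}\!\left(p_{ij} \,\middle|\,
      \pi_i \theta_i,\; (1 - \pi_i)\,\theta_i\right)
```

The big pi symbols are the products across loci and populations; the little pi sub i is the ancestral allele frequency at locus i, and theta sub i is that locus's drift parameter.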
So we need to think about information sharing, don't we? To what extent will allele frequencies at different loci share information? What have we been doing in terms of modeling? What was the prior probability distribution for allele frequencies at a locus, typically?
It was a beta, right? We used that allele-frequency spectrum. So we can share information about that pi sub i in that way.
How will loci share information about FST or this theta here? What can we do there? What would be some choices about information sharing?
They could come from a normal distribution with the average FST and some variance around it. We could do that. But there are problems with that: it's an unbounded distribution.
What's the scale for FST? 0 to 1. And we've seen how flexible the beta distribution is. It can be asymmetrical, and we can calculate its variance. So we're going to put a prior on FST, on this theta, that is itself a beta distribution.
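A minimal sketch of that hierarchy (Python standard library; the Beta(1, 10) hyperprior on FST and the Beta(1, 1) prior on pi are illustrative choices of mine, not the workshop's): each locus draws its own FST from a beta hyperprior, which is bounded on (0, 1) as FST must be, and population allele frequencies are then drawn conditionally on that locus's parameters.

```python
import random

def sim_hierarchical(n_loci, n_pops, a=1.0, b=10.0, seed=3):
    """Sketch of the locus-specific F model:
    FST_i ~ Beta(a, b)          (hyperprior on locus-specific FST)
    pi_i  ~ Beta(1, 1)          (ancestral allele frequency)
    p_ij  ~ Beta(pi_i*theta_i, (1-pi_i)*theta_i),
    with theta_i = (1 - FST_i) / FST_i."""
    rng = random.Random(seed)
    loci = []
    for _ in range(n_loci):
        fst_i = rng.betavariate(a, b)      # locus-specific FST
        pi_i = rng.betavariate(1.0, 1.0)   # ancestral allele frequency
        theta_i = (1 - fst_i) / fst_i
        p = [rng.betavariate(pi_i * theta_i, (1 - pi_i) * theta_i)
             for _ in range(n_pops)]
        loci.append((fst_i, pi_i, p))
    return loci

for fst_i, pi_i, p in sim_hierarchical(n_loci=5, n_pops=3):
    print(round(fst_i, 2), round(pi_i, 2), [round(x, 2) for x in p])
```

In the workshop's JAGS code this hierarchy would be written as nested priors instead of a simulation loop, but the structure, loci sharing information through a common beta hyperprior on FST, is the same.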
Alex Buerkle, associate professor of evolutionary genetics at the University of Wyoming, gives an overview of F-models for population differentiation, as part of a population genomics workshop at Cornell University, July 23-24, 2013.
Population genomics involves sampling, financial, and bioinformatics trade-offs, so proper experimental design requires understanding probability, sequencing technologies and evolutionary theory and how they relate to research trade-offs. The workshop, "Next Generation Population Genomics for Nonmodel Taxa," explored the strengths and weaknesses of different approaches to genome sequencing and bioinformatics when studying population genomics in nonmodel species.
Go to the workshop website for information associated with these videos, including lecture notes, descriptions of exercises, and computer code. The website is a site for ongoing learning about methods for population genomics.