share
interactive transcript
request transcript/captions
live captions
download
|
MyPlaylist
[SIDE CONVERSATION] SPEAKER 1: OK, so let me introduce Alex Buerkle from University of Wyoming. Hello?
ALEX BUERKLE: This feels like class, now.
[LAUGHTER]
SPEAKER 1: Come on-- sorry, I'm the taskmaster. So Alex is in the Botany Department there, even though he works on a lot of animal systems. And I saw Alex's videotaped lecture from a molecular ecology meeting last year. And he was making some, I think, really good points about strategies for sampling.
And so it was [? great, ?] he also was a speaker in the AGA Speciation Symposium. So I was very pleased that he was willing to be one of the instructors here. So without further ado--
ALEX BUERKLE: OK, thanks-- so I have a variety of tasks today. One is, I'm really thinking about introducing some applications today, so to motivate, because I'm going to be teaching a bunch tomorrow. And so I want to give you some applications of some things that we've done with models and next generation sequence data.
I get very excited thinking about the methods and the things that Matt were just talking about, because we've developed methods, and messed around with all kinds of things in the lab. And I think a lot about those, but I'm not going to talk about them. And I'm not going to talk about how you prep data. Others are going to talk about that.
I'm going to assume you have sequence data. And so all the teaching that I'm going to do is going to assume you have some kind of sequence data from some type of source, some type of library probably. And so the sequence data gets you into a world of uncertainty. And so my title is how I suspended worrying, I gave up worrying, and embraced uncertainty, and started working on models that dealt with the uncertainty explicitly.
And it's really changed things for me. I was a new professor at the University of Wyoming and working on non-modal taxa. And it was really fairly frustrating to do marker-based work with 10 or 20 microsatellites that are impossibly difficult to score and almost impossible to estimate their allele frequencies, because there are 15 alleles. And I only have 20 samples from a population. That is not a good situation.
And so I did a lot of modeling instead. And so but now I'm in a position to be able to gather data. And so I'm going to present some data today that illustrates some of this stuff. And then at the end, I'm going to talk about-- I'll give you an outline in a bit.
Most of my work is interested in speciation. And so the reason I was invited was to talk in the speciation workshop that just happened the last two days. And so I apologize to the few of you who were there as well. A couple of slides will overlap between those talks. But I'm really interested in trying to understand the processes that result in the formation and maintenance of species, and gathering data and models, and generating models that allow us to make progress in that area. That's really the big thing that we do. But we do lots of conservation projects. We do lots of really applied things that are funded by USDA or BLM as well along the way, but that's kind of the basic science objective.
This is the slide that overlaps with the speciation symposium. I use hybridization as a tool to understand genetics of speciation. And so the framework there is that of geographic range is fragmented. And those genomes diverge in those isolated populations by mutation and all the different evolutionary processes that shift allele frequencies.
And then populations come back into secondary contact. And that secondary contact, one person, smart person has referred to that as a species collider. And so you're colliding two species into one another. And I'm looking at the particles that fly off and the interesting patterns that emerge in the species collider.
And so we've done work to understand the genetics of boundaries between hybridizing species. And that's the type of work that I'll present today. And I also work on homoploid hybrid speciation, because one of the possible outcomes of secondary contact is that something new is spun off. It could be a polyploid hybrid species.
But homoploid hybrid species are possible. And with better data, we're recognizing that it happens more frequently than we might have thought, including in animals, which is novel to some zoologists. And I'm particularly interested in recognizing repeated components in both of those categories, things that are consistent amongst multiple hybrid zones or inconsistent, and to recognize what might be some of the drivers of that, or things that are consistent between multiple bouts of homoploid hybrid speciation or inconsistent amongst them and see whether we can learn something from those, that comparative work that we can do.
So today I'm going to focus on two questions. One is, to what extent do genetic regions with a potential history of divergent selection while organisms were in allopatry also contribute to hybrid unfitness or reproductive isolation between species? OK, and we're going to do that in two different systems, in butterflies and in birds.
And then the second question that I'm going to answer is more directly applicable to the teaching I'm going to do, which is, what can hierarchical Bayesian models do for you? So I'm going to make an argument for why you should care about this stuff I'm going to try to teach tomorrow. And I hope I succeed in that. I'll at least illustrate it.
So the first organisms that we're going to talk about are Lycaeides butterflies. They're very charismatic. If you spend time in the West in the summer, you will definitely see them fly. It's a big group of butterflies. Not all of the blues that you will see are Lycaeides.
But they occur across Western North America. And in Northwestern Wyoming, Southern Montana-- so these are Rocky Mountains here-- two forms come into contact, Lycaeides idas as a nominal species and Lycaeides melissa. We think this is secondary contact. And we have hybrid populations in Jackson Hole. And so we can use that as an opportunity to examine the secondary contact and understand what is the nature of the species boundaries.
I work in this system. I got into it through my student Zach Gompert, who was my graduate student until last year and now has a faculty job, and a number of other really great collaborators. They're lucky enough to be out in the field right now, in fact, up in Jackson catching butterflies. But I'm here. I'm representing us here. We made a deal. So that's the first organism I'm going to talk about.
The second organism I'm going to talk about, maybe surprisingly, since I was introduced as being in the botany department, are manakin birds. I did my PhD on birds. I'm legit, in terms of being able to talk about them. I didn't lose my qualifications by taking a job in a botany department.
In any event, they hybridized. They're two different species that hybridized. And some really important work was done in this hybrid zone. And we're actually just taking samples that Robb Brumfield collected for his PhD and re-analyzing. And so we're using DNA that was in the freezer and getting genotyping by sequencing data and applying it to that same data set.
And so Tom Parchman was a postdoc in my lab. He just left. He has a job at the University of Nevada Reno now as a faculty member. And Zach was involved. And then there were a number of collaborators who were actually the bird biologists in this.
And I probably need to introduce manakins. They're charismatic. They're on the covers of books. But they hybridized. They have lots of really cool biology, but we want to understand the same issues that we're interested in in the butterflies, and how they interact in this hybrid zone, and how the loci behavior in the hybrid zone compares to species differentiation.
I think I should say, since this is a non-model workshop, these qualify as non-models. And so that's why I thought they would be fine to present here.
So our first thing that we need to think about, we're going to think about isolation-- or differentiation that happened while species were in isolation, so the differentiation between species. And there are a variety of ways to measure differentiation. The way that I'm going to be using today and the one that I'm going to be teaching you about is called an F model.
And FST is the common metric that we think of in terms of genetic differentiation, but there are actually lots of different definitions of FST. We're using one that you might not have heard of, an F model. Maybe you have if you use BayeScan or something like that. But this is something actually that dates back to [? Wright ?] but we haven't been doing it, so [? sule ?] [? Wright. ?] It dates back to the 1940s and '50s, but we just haven't been doing it. But this definition has been around.
And what we're going to do is, we're going to quantify variation in FST amongst loci to detect outliers. So this is common, but we're going to use a slightly different method than maybe what you're familiar with. We're going to estimate a genome-wide FST, so a parameter for the whole genome.
When I say a parameter, we're going to have a probability distribution for the parameter. Each of our loci is going to be a sample or an instant-- or a deviant sampled from that distribution, an instance of information from that common distribution. And those individual locus-specific FSTs are going to provide us information about what we should expect in terms of locus-specific allele frequencies in a population. And then the allele frequencies in a population should tell us something about the genotypes that we observe.
And so this is a hierarchical model. We will have genotypes. They will give us information about the allele frequencies, which will give us information about the locus-specific FST, and ultimately, the genome-wide FST.
And so that's one of the things we're going to do tomorrow. And we're going to-- that's why I asked you to install JAGS on your computer. So that puzzle was solved for you. So I'm going to be using that today.
The way that looks when you're thinking about measuring FST with an F model, is that you're assuming that allele frequencies at a locus are beta distributed. And we're going to be talking a lot about the beta distribution. I recognize that it's probably unfamiliar. It was unfamiliar to me until two years ago.
Imagine it as a distribution that is truncated by 0 and 1. That's good. We're thinking about allele frequencies. Being truncated by 0 and one has-- but that's unusual for distributions. There aren't very many distributions that are truncated by 0 and 1.
It has two parameters, a first parameter and a second parameter. In R, they are smartly called shape1 and shape2. In math texts, they're called alpha and beta. Really, a beta distribution has a beta parameter? That's problematic.
Anyway, so we have two parameters that dictate the shape of this distribution. Betas can be very flexible. They can be flat between 0 and 1. They can be unimodal with some intermediate frequency. Or they can be like the allele frequency spectra that we're seeing in [? Annie's ?] talk, where they're cup-shaped, where there is a common allele and a rare allele. And that's the common situation.
The parameters come from-- we're going to decompose, in fact. This multiplication or gives us the first primary. We're going to decompose it into an expectation, which is the mean allele frequency, and a psi, a variance. And that variance is related to FST.
And so what we're going to say is that the allele frequency at a locus is a function of both this common expectation and some variance term. And it's related to FST. That definition emerges in two cases, one, in an infinite island model with gene flow, where the pi represents the allele frequency in the common migrant pool.
OK, so if you're talking about marine organisms that have broadcast dispersal, it's the allele frequency in that migrant pool. Or it also-- and this might be easier to think about-- it emerges in simultaneous divergence of multiple populations from a common ancestor. And there, pi is the ancestral allele frequency. And we're thinking about this divergence from that common ancestor. They're evolving away from that.
And the more evolution there has been, the bigger FST will be, which means that there will be more variance amongst loci, whereas if they're tightly constrained to be right on top of the ancestral frequency, they're right there on top of the expectation as if that ancestral and the descendant populations are the same, there will be low variance. And we'll talk lots more about that. And I'll illustrate more. But this gives you an introduction to the fact that this is different in terms of thinking about FST. And it allows us to combine information from loci quite nicely.
So here's our non-model organism. We had 17,000 SNPs from genotyping by sequencing type data, or RADs, or whatever you want to call them. We had restriction sites. And we sequenced off the ends.
And we identified SNPs. We filtered them. We get an average FST for the whole genome. And this is the distribution. So here are all the different values of FST for the loci and the quantile in the genome-wide distribution where they fell.
Where were they in that parametric distribution for the whole genome? Were they right on top of the median? Were they above or below?
And what we see is that we have some loci that are exceptionally high relative to the parametric distribution. They're improbable. They have high quantile values relative to the mean expectation from that parametric distribution.
And we can define outliers by various different criteria. That wasn't actually the objective in this study, but it is a method by which you can evaluate how likely is that locus to have come from the genome-wide distribution. So those are the data from Lycaeides.
We'll do the same thing for manakins. They're more differentiated. Their average FST for the genome is 0.2, which is kind of high. And we have variation. We have some loci that are exceptionally very, very high differentiation.
And we can define outliers again. So we've quantified differentiation between the species. That's all we've done. I don't really care about the outliers in this context. We're really just looking at the range of variation and getting locus-specific estimates.
But we often are put in a situation where we want to interpret the FST and understand, well, what could be the cause of these outliers. And typically, what we're thinking about, or the explanation that you see for high FST outliers in the literature is that you have selection against immigrant alleles, and so that an individual allele that moves into the other population is disfavored, and so that you have high differentiation at those loci, whereas at neutral loci, there is no consequence of being a migrant. Or there is divergent directional selection in the populations that leads to exceptional differentiation at those loci.
But there are a number of other potential explanations for these. And some of them are not very easy to discount. So there is a decent possibility that there is balancing selection all over the genome, that the typical situation is background selection. It's purifying things away and removing variance, and that that's what characterizes allele frequencies and their differentiation across most of the genome, that those are really held fairly similar to one another.
And the outliers might be the neutral loci that are free to vary and become differentiated from one another in divergent populations. It might be worth thinking about. So those are at least three of the selection possibilities.
And there were some interesting things that were said at the genetics symposium, that selection is not all the same, that we shouldn't just think of selection as divergent. And so I've mentioned different forms of selection. But of course, selection could be of different intensity. And that would have-- and of variable intensity, and that would have consequences.
The other things that will affect allele frequencies are all the other evolutionary processes. Some of those we don't know very much about like, recombination hotspots in deserts and the amount of linkage to other sites under selection. That all is going to constrain allele frequencies different ways. I'm particularly excited about learning about more about gene conversion and what that does to allele frequencies.
It looks like in meiosis, gene conversion might be, in some organisms, very, very common. OK, I'm sorry, I shouldn't have said that. This is class. How many of you don't know what gene conversion is?
Really, that's it? I would have thought at least half of you. So a bunch of you know what gene conversion is. Do you feel comfortable with it?
AUDIENCE: I'd love an explanation.
ALEX BUERKLE: You'd love an explanation, cool, because people usually think about gene conversion in the context of molecular evolution. And there were these people who were doing fancy molecular evolution that was all in model organisms that talked about this, but the rest of us didn't care about it in population genetics. But it turns out that, when chromosomes pair, and they find one another through a homology search, for proper disjunction, they have to have crossovers. You know this from basic genetics.
And so, on average, for any chromosome pair in any diploid organism, you have, on average, one crossover per chromosome pair per meiosis. But there are a bunch of other points on the chromosome where they met. And it looks like, at least in arabidopsis, that 100 of those involve gene conversion work. The allele copy from one chromosome just gets copied over into the other chromosome in tracts of 300 to maybe 3,000 base pairs.
And so it's, instead of being a single crossover where you go from one ancestry to the other, it's actually a double crossover that's very small, which means that we wouldn't have been able to see it until we start sequencing at very, very high density within the genome. But that's going to do stuff to allele frequencies. And so that could lead to-- that was a long aside, which is just like class, about gene conversion. So we need to think about that and take it into account when we're comparing rates across the genome. And of course, mutation rates could be variable as well. So that was all about differentiation.
Let's think now about the species collider in populations where we actually have mixed ancestry, where we have hybrids. And I like using this figure from a paper that we did on house mouse a few years ago. And what I'm showing you are raw data. And these are, I think, 39 SNPs across 19 autosomes.
And what we're looking at is ancestry in a house mouse hybrid zone. And we've ordered individuals in terms of their ancestry. So on the right here you have the proportion of the musculus ancestry, or so one species or the other.
So at the top, we have one of the species in the species pair. And at the bottom, we have one the other species in the species pair. Dark green is both allele copies are of one ancestry. Light green are both allele copies are of the other ancestry. And the medium green are interspecific ancestry, so they have combined alleles.
This was cool because, at the time, these were SNPs. And they were diagnostic SNPs. So we didn't have to do any modeling to infer ancestry. And it gives us a bird's-eye view of ancestry variation in a hybrid zone.
And we were struck at the time by the amount of variation that there is amongst loci. Maybe this is broadly accepted at this point. But for a long time, we really thought about reproductive isolation as applying to a lot of the genome in big blocks and that we should see consistent patterns amongst many of the loci in the genome, that reproductive isolation was a feature of the species. But this looks like reproductive isolation depends on where you look in the genome, because evidently, if you go on some autosomes, you can take the ancestry from one species and it introgresses very, very far into the other species' background, as if it has no fitness consequence at all as long as you're homozygous of that one ancestor.
And the opposite is true as well. So this heterogeneity was striking. And so this was just raw data.
The thing that we have done is try to develop statistical models for that ancestry variation across loci so that we can both compare across loci and make a comparison amongst them to understand which of these loci are not comparing that are not consistent with the rest of the genome. And we also wanted to be able to compare amongst hybrid zones. I'm not going to talk about that today, but it's a really interesting problem to go to multiple transects like they do in the crickets and compare amongst multiple hybrid zones, and see the extent to which reproductive isolation is the same in one geographic location or the other.
So we worked on models that allow us to build what we call genomic clines. And so what I have here is hybrid index. And so that's the proportion of ancestry from one genome or the other.
So this is an axis of variation between one species and the other. And then what I have on the vertical axis is the probability of ancestry at a single locus. And so if every locus is behaving as the rest of the genome, it should be on that dotted line, just on that one-- it's not dotted, in this case-- one-to-one line, on the solid black line, in this case.
And in fact, there are common models that assume this structure, a model that many of you have perhaps used assumes that every locus is on that one-to-one line. And so what one of our contributions was to relax that assumption. Not all loci are obeying that one-to-one line. We had empirical data that suggested that they didn't. And so we relaxed that, and so that we can have excess ancestry in one direction or the other.
And we can also have shifts in the rate of transition between the two species. They can become steeper or more shallow. And so those are the pictures of those. We have a model for that. It's just a linear-- it starts with the one-to-one line and distorts it with two parameters, an alpha parameter and a beta parameter, where the alpha parameter looks at where the center is, how much excess ancestry is there in one direction or the other, and the beta parameter is the slope. So those are the models, or that's the equation for the line.
We didn't know what to fully expect in terms of the shape of those clines, of those genomic clines under different types of selection. They are not the Barton clines that you might be familiar with that are clines in space, where there was explicit population genetic theory for the consequences of different types of selection in hybrids. And so we did simulations to understand how different types of genetic architecture in reproductive isolation would distort allele frequencies or ancestries in a hybrid zone.
And what we found, which is maybe counter-intuitive, or at least it's counter than what you would think from a spatial cline, is that genomic center parameter is the most responsive to selection. And the easiest way to illustrate that is that, if heterozygotes are killed, that's selection. And it would be pretty good species var. Like, if they're totally dead, then you don't have hybrids. But if they're mostly dead, most of the individuals are dead, what's going to make up for those heterozygotes that are missing two alleles at a locus?
AUDIENCE: Homozygotes?
ALEX BUERKLE: Yeah, one or the other one, so either the one species' ancestry homozygote or the other species' ancestry homozygotes will make up for the missing heterozygotes. You have a population. You sampled hybrids. There is going to be something in them. So if the heterozygotes are missing mostly, it's going to be one species' ancestry or the other. And so you'll get distortions like this from under dominance for selection against them. It's not going to shift the slope.
And the other thing that we were struck with and was a little bit disappointing is that the allele frequency shifts from all kinds of different selection are pretty modest in hybrid zones. Another way of saying that is that it's really hard to make a species boundary. So I can talk more about that, but not today. So let's show you some of the model output. We have genomic cline centers for the Lycaeides butterflies. There is variation in the one parameter, the one that we thought would vary more. The beta parameter did not vary very much. And so I've plotted here, for a representative sample of the loci, the variation in ancestry L relative to the hybrid index. And so we get variation in the alpha parameter for Lycaeides.
For manakins, we get both. Both alpha and beta varied quite a bit. And so here is some of the variation. And I can plot all of the cline parameters here. And so here are little cartoons of the cline parameters. So down here, this would look like this type of ancestry variation in the hybrid zone. This would look like this excess ancestry. So we're going between the extremes, where there is one species ancestry almost fixed in the hybrids versus the other species ancestry almost fixed within the hybrids. And we had more SNPs with the manakins. I forgot to mention that earlier.
And if you're following the bird, well, you know that people are sequencing lots of birds. We found out as we were doing this project that somebody had sequenced manakins. And so we were able to put all of our SNPs on to the assembly post hoc and look at variation in our evolutionary parameters along, in this case, the biggest scaffold. They're fairly variable, both for the FST and our alpha and beta parameters.
So it doesn't look like neighboring SNPs on the scaffold. Some of these SNPs, are within the same 100-base-pair read, are behaving very much like one another. This is a picture. And it's certainly not for all 59,000. So we can use things like spatial autocorrelation of parameters to look at the drop off, and so the lag like you would look at in ecology, and the similarity between SNPs at different distances.
And so for SNPs that are within the same read, we have some correlation, but if we even go out to 1,000 base pairs, the correlations drop off very, very rapidly. And so the associations of evolutionary parameters across the genome are evidently being parsed fairly. Finally, it doesn't drop off to zero until you go to entirely unlinked things. So there is some similarity between things on scaffolds, but it drops off pretty rapidly.
So now we need to put these two things together, the genomic clines analysis and the FST, to figure out whether the extent to which things that are highly differentiated between the species are also unusual in the hybrid zone. And so we put these two things together and quantify the association with correlations. And so we have a correlation between FST and the alpha parameter in our Lycaeides.
And we get a correlation. So that's the correlation of alpha and FST. And it's significant, probably mostly because we have lots of points, of course. But there is a non-negligible correlation, which struck us as kind of cool and surprising. We didn't really know what to expect-- so that high FST loci also behave weird in the hybrid zone.
If you recall, in manakins, we had variation in both alpha and beta. And so we can look at the correlation individually. And we see correlations with both of them, and maybe cooler to look at them in combination in this 3D plot, which is kind of cup-shaped. I'm not sure how well it renders. It looks OK. I think it looks like basically your hand is up-- so that the high FSTs are at the margins of the alpha and beta combinations.
So there is an association there as well. So there is this potential-- or one interpretation of this is that, loci with a potential history of divergence selection between the species also affect hybrid fitness. This is, I guess, noteworthy, because it's genomic evidence for this. Lots of people have done experiments to show that divergence selection has contributed to phenotypes that should lead to reproductive isolation or contribute to it. This is kind of cool population genetic evidence in two non-model taxa, that at least some-- that divergence selection may contribute some to the evolution of reproductive isolation.
But there is a lot of independents. We didn't explain that much of the variation. There is clearly lots of other evolutionary processes operating within the background species that lead to the variation in FST.
There are obviously other reasons why loci might vary from the genome-wide expectation other than hybrid fitness. One of them is drift. These particular methods aren't particularly sensitive to drift, but we need to learn more about spatial structure within hybrid zones and include that as well.
I guess just a couple of implications-- reproductive isolation can have a complex genetic basis. It's cool to think about speciation genes, but I don't think they can explain that level of variation between mice. It's cool to map sterility genes in mice, but we have a huge amount of variation in introgression within mouse, where we actually know something about sterility phenotypes.
We looked in these two non-model taxa. We see huge amounts of variation. So I think that the genetic basis of reproductive isolation is likely to be very, very complex. And to say something like a speciation gene is to do disjustice. Is that a word? Misjustice, something--
AUDIENCE: Injustice.
ALEX BUERKLE: Injustice, thank you, injustice-- I did injustice to the language. Yeah, it just doesn't make sense, I think, to think of a speciation gene and to think that that might be something that characterized reproductive isolation. The other thing I have noted before is that population differentiation does not equal reproductive isolation, particularly in these secondary contact situations. You can measure FST, but it isn't a measure of reproductive isolation. The evidence for that in my talk was the independence of those two things, that loci don't behave as according to their FST parameters.
And then we were struck by the remarkable heterogeneity across the genome in terms of our parameters. People worry a lot about linkage disequilibrium and the extent to which closely spaced markers will tell us the same thing. That made that worry go away from me. And the other thing that we really need to work on is that, this is all a statistical description of pattern. I haven't tried to infer any or done very much infer a process. But there's a lot of decoupling between process and pattern. You know this. Lots of different patterns can-- I'm sorry, processes, can give the same pattern. And so when we see pattern in statistics, working backwards, it's a challenge. And there is a lot of progress to be made in that area.
So I realize I'm rocking along kind of fast, but I wanted to get to this. The point I want to make here is to answer, what can Bayesian models do for you? And part of it is that, even if you didn't stay for the rest of the workshop, if I could tell you this, that models, Bayesian models are widely understood and misrepresented-- there actually was a science paper commentary fairly recently by a mathematician, like, in the last several weeks. This is kind of bad, because they talked all about the priors.
And maybe your discussions and your teaching or whatever, that you've had exposure to Bayesian models before, has been very worried about what priors do you have. I'm sorry. I have 200 million reads. I'm not that worried about the influence of the priors on my conclusions. I am going to overwhelm the prior with my data. So that's flippant, but you guys are going to get the picture pretty quickly of how I think about things.
The other reason you should like Bayesian models is that, if you like probability-- and [? Shawn ?] has heard this before-- you like Bayesian probability, because likelihood is not probability. It's based on probability, but Bayes' theorem gives you a probability for your parameters given the data. And it has an equation. And over on the left side, it is a probability.
In likelihood, it's cool. And it can lead to important inference. And it will lead to the same maximum likelihood estimates typically, same point estimates, but it doesn't lead to a probability. It's proportional to it. And there are no methods to do anything fancy to turn it into a probability.
With Bayesian methods, even if you can't calculate this, there are methods to ensure that you will obtain samples from your posterior probability distribution. So if you like probability-- and I think pretty much everybody does-- you're kind of stuck with Bayesian models, as far as I'm concerned.
Hierarchical models, generally the reason we want to think about those is that, higher up in the hierarchy, or lower down depending on how you think about a hierarchy, who's on top, interesting parameters will emerge, like the FST that I estimated, or the alpha and the beta parameters, or something that I'm not going to get to talk about are, if you're interested in genetic mapping in a non-model organism. A lot of times we're not interested in the particular SNPs that are involved, but we're interested in estimates of the genetic architecture. And we want to put constraints on, oh, how many loci are involved?
Is it 10 to 50? Or is it 10 to 1,000? Because if it's 10 to 50, you might be able to understand something about function. If it's 10 to the 1,000 or 10,000, maybe not, if you can't constrain it any more than that.
And so you can get interesting estimates higher up in the hierarchy in Bayesian models. And you can do hierarchical models in likelihood land, but they don't give you true estimates of your uncertainty. So if you're solving for the marginal uncertainty for a particular parameter, you have to hold all the other parameters at their maximum likelihood estimates to solve for the uncertainty in this parameter in most ways that people do likelihood. So that was a disclaimer. So there are ways around it, but that's not typically how people do things.
The reason I got on into this is because we have this low-coverage sequencing. So all the stuff that I talked about previously with the butterflies and the birds were fairly low-budget projects. So the butterfly project that I talked about, that was my graduate student Zach Gompers PhD work. And that was funded by DDIG.
And so you know, DDIGs are not that big. These are not genome projects. But he paid for all of that out of his sequencing budget. And we did more. So it's pretty low budget.
But my other students are getting funding to work on fish, or spruce, or something like that, as applied projects, are doing-- scoring tens of thousands of loci for thousands of individuals now. They want me to correct this, because they're all doing over 2,000 individuals for the population genetic surveys. And we're operating in the range of 1x coverage. And so we're obviously working in the world of uncertainty. I'm happy to have one read from a locus for an individual.
And we're going to put that into models. And the models that I showed you for FST, and the Bayesian genomic clines, or for genetic mapping, or any of those things can handle that. And so we have to operate with genotype uncertainty.
So we build hierarchical models where we have parameters, let's say a locus parameter and a genome-wide parameter, but now genotype is a parameter. And depending on what your background is, you may have seen genotype as a parameter before. People impute genotypes all the time, even in crop genetics, and even in model organisms.
There are certain SNPs that are not observed. And they reconstruct them. That's why there are a lot of these haplotype methods that exist, is to be able to impute unobserved states. And so we're going to estimate genotype as a parameter where we have uncertainty. And we'll get a probability distribution for it.
So we'll have sequence data. We'll have the probability of our sequence data given genotype. We'll have probability of genotype given the locus-specific parameters, probability locus-specific given the genome, and then a hyperprior for that parameter. And so we're going to build those up.
For sequence data, all we need to do is, for a particular genotype, we're going to use multinomial. So let's say you're dealing with SNPs. So there are only three possible categories that exist for a SNP if they're two alleles segregating, so biallelic SNPs is what I'm assuming.
So there are three. You'd have the two homozygotes and the heterozygote. Those are your three categories in the multinomial. And so you're going to-- it's a multinomial distribution with the number of reads that you observed.
And notice I included zero as a possibility. So if you never observed any reads for that individual, you can integrate over your uncertainty for genotype. And maybe, as Andy was saying, neighboring loci might affect your estimate there, but it doesn't mean you have to drop that locus just because you didn't have loci. Of course, it's not going to give you any information. That individual isn't. But it allows you to work at low coverage.
And so one read is better than zero. And I'll show you, actually, that one read is a lot of the information that you can have at a locus. And then there is immediately diminishing returns beyond one read. And so we have these conditional priors, then, for genotyping.
And those conditional priors can be interesting things, like allele frequencies at a locus in a population. That turns out to be very useful, because that's what you're going to put in maybe into a clustering algorithm, or calculating FSTs by various means. We actually get a allele frequency spectra across all loci in the genome.
So the allele frequency spectra that Andy was presenting, those are for single loci. Typically they can also be collapsed across many loci, but we get actually a parametrized not a summary statistic approach to calculating allele frequency spectra. Five minutes?
SPEAKER 1: Yeah.
ALEX BUERKLE: Thanks, and then those F models, and FST, and the clines-- so let's illustrate this a little bit. So we can imagine that we have genotype. And I'm going to just assume here that we know genotype just for simplicity.
So we've scored genotypes by some traditional method, and so that a genotype is a function of the allele frequency in the population and the two draws, the two allele copies for that individual. You've observed both of them. So in that case, for an individual, you can get 0, 1, or 2 as the output of that binomial. That's that in discrete distribution.
And we can put a prior on allele frequency. This is this beta prior. And like I told you, it's alpha and beta. In R, it's shape1 and shape2. We can fix those at 1, 1. And a beta with 1, 1 is flat.
And we could go about estimating allele frequencies in this context. And what we would be doing is saying that, with our prior, that all your frequencies are equally likely. An alternative way of doing things is, let's relax this assumption of alpha and beta equal to 1, 1, but instead, let's put a parameter there that we allow to vary. We're going to keep it symmetrical so that they're the same two parameters.
If those two numbers aren't the same, alpha and beta, then the distribution can shift to the left or the right. But let's estimate those from the data. And then you can get a prior for allele frequencies at a locus that tells you something about the population. It looks u-shaped. Or it looks hump-shaped.
So a u-shaped distribution is one common allele and one rare allele is the prior probability. And you can share information across all your loci to get a very good estimate of that theta parameter. And it's very flexible in terms of, like I said, these are all different beta distributions with different parameters. If you go below 1, it becomes u-shaped. If it's above 1, it becomes unimodal and hump-shaped, where intermediate allele frequencies would be common.
So that seems nice. We would like to be able to estimate that parameter. A second example comes-- let's just look at the first chromosome of humans. This is GC content along there. The blue line is the average.
If it were 50/50, it would be up here in this [? gray ?] line. So the first chromosome is GC-biased. We could do similar things with the recombination rate. And that shows some of the heterogeneity. We might be interested in that variation in the genome, where we're looking within the genome to find exceptionally high GC-content regions relative to others.
Well, what we could we use as a model? So we have data. So maybe you haven't dealt with GC data. So these are the sequences you're looking at, in this case, a 100-KB window, so 100,000 base pairs. You're looking at the number of bases that were GC or AT.
So you have two different categories. So our data, the counts of nucleotides that are one or the other, will binomially distributed. So we could take a binomial for GC, where we have 100,000 base pairs. [? Those are ?] [? our ?] numbers of samples. And we're just going to count that.
For our binomial, we could use a naive prior again of 1, 1, where all of them are equally likely. What do you think? If you take 100,000 draws from a binomial distribution-- think about coin tossing. Do 100,000 of those. How likely are you going to hit very close to 0.5?
AUDIENCE: Very likely.
ALEX BUERKLE: You're very, very likely. You have a lot, a lot of draws. So your expectation of the variance, if you expect 0.5, 0.5, is very, very small. And so if we use this type of approach, we would think that every GC point on here was unlikely from a binomial distribution.
Well, that's not very satisfying. We know they came from that genome. We've just parametrized the expectation by the overall GC content for the whole genome. But we have 100,000 draws from that, none of them are likely, because it ranges up to 0.6. It's really unlikely to get 0.6 with 100,000 draws from a parameter-- from a distribution with a parameter of 0.41.
So let's relax that assumption that there is a homogeneous rate of p across the whole genome. And instead, let's break it apart into alpha and beta parameters. And this time, let's break it up in the same-- similar way that we're going to do with FST, where there is an expectation, so this pi, for the rate, and then a variance term.
So we'll estimate the variance from the genome itself. So this gives us our expectation for the beta. And this will be the variance.
And then we'll get hyperpriors on those. And let's estimate those as well. When we do that, we get this gray window here for the 95%-credible interval for GC content across the genome.
Given the variance that we see that we know is inherent, if we estimate that variance, we still find some portions of the genome that are still exceptionally variable relative to that intrinsic variance. And that's really what we would be interested in. These are exceptional relative to chromosome 1. So maybe that illustrates things for you as well.
I think the future for hierarchical Bayesian models, if you use them, it has the potential to give us interesting, high-level parameters for genomes, populations, trade architectures. The modeling of sequence data with uncertainty actually turns out to be mathematically very simple. I showed you already, we can model things with a multinomial. So there is no reason that we can't use this.
There are reasons that it hasn't caught on. Well, there are lots of reasons it hasn't caught on. One of them is that there are real computational challenges associated with this.
So the problem is that, you have, let's say, 100,000 SNPs that you've scored. And you have 2,000 individuals. And now you're going to estimate a parameter for every locus and individual combination. And you're going to do that in an MCMC chain.
So I don't know about you, but lots of people complain about things that run for longer than a day. And this will literally fill hard drives. And so we have to come up with new computational approaches to solving that problem. And those are going to be real for some time. And so actually, the data handling for sequences in your 200 million reads actually starts to look small as a problem comparing to estimate-- parameter estimates, or make parameter estimates.
And so I talked a little bit about some examples here of hybrid zones and population differentiation, what they can tell us about reproductive isolation and the maintenance of species barriers. And I hope I did a little bit to illustrate the benefits of hierarchical Bayesian models.
I want to acknowledge Zach, my former graduate student, and Tom-- and I'm super, super happy that they have jobs now-- and my collaborators on the Lycaeides and the manakins, and funding that helped with this work, thanks.
[APPLAUSE]
SPEAKER 1: We started about 15 minutes late. We're just going to shift everything 15 minutes, which means there is 13 minutes for questions for Alex.
ALEX BUERKLE: I can just start talking. I actually had some comments on [INAUDIBLE].
SPEAKER 1: Please, talk about it.
ALEX BUERKLE: But I'll let you go [INAUDIBLE].
SPEAKER 1: [INAUDIBLE] get it started.
AUDIENCE: Yeah so, the manakins' FST was kind of high.
ALEX BUERKLE: Right.
AUDIENCE: Do they violate infinite sites? And if they do, how does that affect the modeling of sequence uncertainty when you may actually not have just two alternatives of hets?
ALEX BUERKLE: So I don't know the answer to the question, whether they model infinite-- violate infinite sites, but I don't know what you mean for sure by that.
AUDIENCE: Well, so what's the probability of having sort of multiple different allele? Like the model you have right now is two alleles segregating. What if there is a third? [INAUDIBLE]
ALEX BUERKLE: We've eliminated all loci at which we saw more than--
AUDIENCE: Oh, you [INAUDIBLE] through those first?
ALEX BUERKLE: Yeah, so that's a general comment-- and this relates some to Matt-- is that, at the beginning, and maybe it wasn't clear, is, I was saying, I'm assuming you're starting with sequence data. And by that, I mean you've done your best with assembly, your best with filtering out errors and recognizing. And you're using state of the art in every way, including maybe filtering out loci for which you have too few individuals that have read data, because the more individuals that don't have read data, the higher probability is that you have a dropout allele. But if you apply that type of criterion, you can screen-- there are all kinds of little steps that actually don't take long and are a little Perl things that you're going to write or have written to dump data that violates some assumption. But that wasn't all of your question.
AUDIENCE: No, that was really it. That was really it, is if data in-- say you're trying to model some degree of sequence uncertainty. And so one part of that for the model is that you have [INAUDIBLE] alternative states. And so--
ALEX BUERKLE: Yeah, and so it's totally generalizable. In our first project with this, and one of our papers, just developing methods were, like, we re-implemented AMOVA for genotype uncertainty. We made it for any number of haplotypes at a locus and all the possible heterozygotes between them.
And so I talked about the binomial and the beta. Those are specific cases of distributions for two things, but the multinomials, the generalization for any number of things and the generalization of the beta for any number of things is the Dirichlet. And you saw that. If you've already-- I'm sure you've read the Structure paper at some point, the Pritchard et al. paper, 2000. They mention the Dirichlet distribution, but that just collapses down to the beta for two alleles.
AUDIENCE: I'm just curious. For those of us who are interested in utilizing Bayesian frameworks and models but may be thoroughly intimidated by the process of even starting it, do you have any resources you would recommend as a good starting point for learning about-- or learning about--
ALEX BUERKLE: Software or the model?
AUDIENCE: --learning about the software, specifically in the realm of bioinformatics, and maybe a little bit of the theoretical background to that as well.
ALEX BUERKLE: So I'm going to try to teach some of the theoretical background. So I'm not going to teach you guys how to use any software, which may be a disappointment. And I apologize for that.
But even if you're calling-- let's say you're using BCFtools in the Samtools set. Their model for calling SNPs is a Bayesian model. But learning about what they've specifically done in particular software and what the alternative choices are can be cryptic at this point.
I said yesterday in my talk that genomics has been disruptive to our lives. Part of the difficulty is that, even for human work, documentation has just catch-- I don't know when it's going to catch up to the software. Even though there are hundreds of people working on the problem of assembly, and variant calling, and that kind of thing, very often with top-of-the-line software, it can be opaque, even if you know what you're doing, for Bayesian methods. So that's kind of a downer message.
AUDIENCE: So essentially, the best we can probably do at this point is educate ourselves on the theoretical background and then just take it on faith?
ALEX BUERKLE: Well, I mean, unless you're going to dive into the C code and see how they translated their model specification into C--
AUDIENCE: [INAUDIBLE]
ALEX BUERKLE: But sometimes you actually have to look in the C to see what comments they wrote in the code, because that's where it's documented. I think the dust will settle. And that's why I-- so this is my argument for teaching you some the theoretical stuff is that, well, we're going to learn about transcriptomes and we're interested in sequence capture, I have accepted the fact that we will all be sequencing everything at some point, and so that we better be learning for the long haul.
And that's true, because I started my career right when PCR and sequencing became feasible. And before that, people were sequencing each nucleotide in four separate tubes. And seriously, there were still people doing that in the lab. And so I'm not that old.
And so that will happen to you. Technology will totally, totally change. What we are doing right now in 10 or 20 years will look so funny.
But like Matt said, we need answers now. You might be working with an agency. Like one of my people one of the people I work with, she's working on animals that are restricted to aquifers in Texas that are drying up. And they want to know, well, which of these population sources should we use and bring into tanks?
So you better want to have-- you want to have a good assembly. You want to make sure that you've called alleles correctly, and that kind of thing. But you have to do it now provisionally.
And I guess that's the other thing. I mean, this is just a general comment. We're going to make mistakes. And I've made mistakes. I know I have. And I'll make more. But I'm OK with that, because we have to get answers. And they're going to be provisional ones.
AUDIENCE: With the low sequencing coverage like the one-- I guess I could see how that [? wouldn't ?] work [? very ?] [? well ?] for different population parameters and allele frequencies. But how does it apply to individuals? And can you get an individual's genotype from that very well? And were those, like, barcoded individuals, or were those pooled?
ALEX BUERKLE: Yeah, so these are all barcoded individuals in the stuff that I talked about. So I had individual-based barcodes. And I think we have eight plates of barcodes. And so we do up to 760 into 68 individuals per lane.
But your question before that was not whether they were barcoded or not. How well does it work for individual-based data? And how well do you recover the likely genotype of an individual, is one way to think about that question. And so let me illustrate that with this figure.
Suppose your allele frequency spectrum looks like the thing on the right. I guess I might have to go like that and point like this. It looks like this so that there is a common allele and a rare allele. And your first read from an individual is the common allele.
This is the big thing. The big contrast is that, I'm thinking about individuals as coming from populations. I know where we collected them. And I can characterize the population allele frequency.
And so I know something about what to expect for an individual. And so, if I've seen one read that is that allele, it's very likely that-- and it's the common allele, and I've estimated the allele frequency from individuals and there is a common one [INAUDIBLE], the second allele is probably the same thing. And I can put a probability on that.
What a lot of the models are doing is that they're assuming we're completely naive about where an individual came from and what population it was in. And in some cases, that's true. You don't know how to characterize the allele frequency that you observe.
But I still think we can do better about information sharing amongst loci to make up for the fact that we don't need a completely naive model for individuals' genotypes. We've seen more than the individual reads. We can do some type of hierarchical model that allows us to share information so that genotypes end up being fairly good. And so even for something like genetic mapping, where you want to associate phenotype with individual SNPs, in the mapping world, people are recognizing that actually, low coverage is better. More individuals is better.
I'm sorry, low coverage in itself is not better. It's only better if you make up for it with a bunch of individuals, because in mapping, for example, you want to capture a bunch of recombination. And you have information sharing and tracts of ancestry that tell you where the recombination events likely were. And a lot of the genotyping software, genetic mapping software actually takes, as input, genotype probabilities.
AUDIENCE: You mentioned early on that the beat is a good useful distribution if there is an [INAUDIBLE] or divergence from a common ancestor. I'm wondering about other [? correlation ?] models.
ALEX BUERKLE: That's a really good question. And I just-- I was kind of hurrying. It arises in those populations genetic contexts. And in fact-- I didn't say this-- in those population genetic contexts, the parameter that you get here, this alpha and this beta-- here I wrote it as theta-- it actually is, in those contexts, an estimate of 4n mu. It's a real estimate of that population genetic parameter. In any case, it's a fine statistical summary of the allele frequency variation. So the beta is fine. We think allele frequencies vary between 0 and 1. And we can summarize the frequency distribution of different variants. It's just that, in those two populations genetic contexts, this definition arises. And it has a clear population genetic interpretation when there is a balance between drift and mutation. Then it is equal to 4n mu.
AUDIENCE: So it's just the interpretation of it?
ALEX BUERKLE: It's just the interpretation. It's a fine statistical summary, in any case. And it gives you an idea of the-- it's a summary. It's a parameter for an allele frequency distribution. So it's interesting.
So the earlier talk on allele frequency, [? basically ?] they're all different estimators of theta. There is [? Waterson's. ?] There's pi. There are all these different things. This is the definition of theta that [? Wright ?] knew about. But we very rarely had data to estimate it.
And so I think that's why it's has been rediscovered multiple times, is that now, we're getting data to estimate it across the genome. And it's very useful in a applied context to have this estimate, so that if you have 50 populations across the range of a rare plant-- it's probably not rare if there are 50 populations-- 50 populations across a plant species, you can compare theta across the range. And so it gives you a very nice synthetic high-level parameter in the Bayesian hierarchy that summarizes genetic diversity. It summarizes the allele frequency spectrum, how much heterozygosity do I have, do I have intermediate allele frequencies, in a nice way that is, I think, going to lead to a really nice comparison amongst lots and lots of different organisms, because it will all have been measured in the same way.
AUDIENCE: Having genes [? like ?] probabilities rather than known allele frequencies, can you still pick up the state and throw it into any conventional population genetic program? Or are the analyses more specific?
ALEX BUERKLE: I'm sorry, [INAUDIBLE].
AUDIENCE: [INAUDIBLE]
[LAUGHTER]
ALEX BUERKLE: I'm sorry, [INAUDIBLE]. It's pretty early in the morning already to be this-- some. So for example, one of the things you can do with structure that people have done in the literature is, if you're OK with low coverage, you can just take one read, one random read from every individual, and make pseudohaploid data, and put it in the structure. And honestly, when you have 10,000 loci, you really don't need that second locus-- second allele copy to estimate something like an admixture proportion.
They're just ridiculously confident compared to anything you ever did with microsatellites or things where you had difficulty estimating allele frequencies, because one read from every individual in a lot of individuals will give you a lot of information about allele frequencies, which is what structures a model for allele frequencies. So you can do that.
Yeah, other than that, there are not very many pieces of software yet. We've implemented the Bayesian analysis of molecular variance. We have F models that are written in C code. And this Bayesian genomic clines is in C code. And so it exists, but it will take a while for the dust to settle.
And so the thing you can absolutely do in a very standard way is you-- the principal components that Andy mentioned, you can absolutely do those on genotype probability. You do not need to know genotypes with certainty for that. You just take the point estimate for genotype for that individual. Rather than it being 0, 1, or 2, now it's a continuous variable between 0 and 2. That's fine.
And you put that in PCA. And I don't think that will lead to any problems at all. And we're writing more programs. So we've re-implemented structure for next generation sequence data.
SPEAKER 1: You could also take a threshold approach, right? I mean, it's a little dumbed down way of-- if you've gotten the probabilities, it's not the way you would want to do it.
ALEX BUERKLE: Thank you for--
SPEAKER 1: But in order to use the software that's available, just take your probabilities [? than ?] were [? ran at ?] 95%. And those are your known genotypes.
ALEX BUERKLE: Thank you for reminding me of that.
SPEAKER 1: Of course.
ALEX BUERKLE: That's the easy way forward, is just, you just threshold. And you should get rid of a lot of your loci where you can't maybe-- you don't have enough information or something. But that's fine for a lot of analyses. It will get forward. I think that's one thing to overcome and one thing to come out of the workshop. You might be sad about wasting 80% of your reads from a lane, or 95%, but if you get an answer, you should not be sad.
[LAUGHTER]
SPEAKER 1: Well, we should--
AUDIENCE: I just had one question.
SPEAKER 1: OK, last one.
AUDIENCE: So I agree that using low coverage sequencing of many, many individuals can be pretty accurate for allele frequencies, but I don't know that-- for diploid organisms, I don't think you can confidently call diploid genotypes with low-coverage data. And there are methods that use [INAUDIBLE], so like [? confidence-based ?] test, anything that [INAUDIBLE], which may give you more power than only using allele frequencies. And so I was wondering if you could comment on thinking about how much coverage you need for different types of questions.
ALEX BUERKLE: Yeah, and so I've mostly, and I'm going to talk mostly about models for allele frequency. But in relation to another question, I really think that I'm OK. Yeah, I mean, if you have-- I really don't think that there's an imperative to call genotypes with certainty. I think it's OK to have a point estimate of genotype, and if you're still uncertain about that, to then have a method that can accommodate that uncertainty.
You're absolutely right that, if you have tracts of ancestry, you can share information-- or actually, so you have multiple [? loci ?] that you know their linkage relationship, you can do information sharing with that, but that's not precluded by having genotype uncertainty depending on what type of model you're thinking about. And so I see us revisiting most of the classical models, like linkage disequilibrium, and finding ways to estimate linkage disequilibrium. Of course, linkage disequilibrium is just another model for allele frequencies, so I don't see any problems there.
For haplotype block things, those are already things that deal with genotype uncertainty, because often, SNPs aren't observed, and so you're trying to impute them from your known set of ancestral haplotypes. And there are Bayesian methods in human genetics that actually have this-- there is this infinite haplotype models, where they'll actually reconstruct haplotypes probabilistically as you need them in the model. So I really see there-- it's silly in a way, but we're going to have to revisit a lot of the models that exist, just my thought.
SPEAKER 1: Not silly at all, I mean, computational tools [INAUDIBLE]. But we should move on. Thank--
[APPLAUSE]
Alex Buerkle, associate professor of evolutionary genetics at the University of Wyoming, introduces participants to applications of next generation sequence data, as part of a population genomics workshop at Cornell University, July 23-24, 2013.
Population genomics involves sampling, financial, and bioinformatics trade-offs, so proper experimental design requires understanding probability, sequencing technologies and evolutionary theory and how they relate to research trade-offs. The workshop, "Next Generation Population Genomics for Nonmodel Taxa," explored the strengths and weaknesses of different approaches to genome sequencing and bioinformatics when studying population genomics in nonmodel species.
Go to the workshop website for information associated with these videos including lecture notes, descriptions of exercises, and computer code. This website is a site for on-going learning about methods for population genomics.