Software for Analyzing Population Genetic Data
Adam Porter lab page
Here are some programs I wrote for analyzing data from real and simulated populations, most of which have shown up in my publications.
Right now there are three programs, some supporting algorithms, and the code library I use.
SexLinkedFstats
This program calculates Wright's F-statistics for sex-linked loci and haplodiploid organisms. It handles both codominant and dominant markers, and uses both least-squares and restricted maximum likelihood methods for each. You can use the program to
1) calculate F-statistics from a data file you supply;
2) make up a data file of sex-linked data with specified F-statistics;
3) run a simulation to try out the effects of different sampling patterns on sex-linked/haplodiploid F-statistics.
The underlying population-genetic and statistical theory for this program is in manuscript form:
Porter, AH. (MS submitted.) F-statistics for sex-linked and haplodiploid codominant and dominant markers: models and estimators.
Please email me if you need to see a copy prior to publication.
The file you want to check should be in the same folder as the program, and it needs a filename with no spaces in it.
SexLinkedFstats MacOS 8-10.x (Carbon)
SexLinkedFstats Windows (not compiled yet)
Sample data set (use this to see the data formatting)
IslandModelTest
This program implements a parametric bootstrap to determine whether genotypic data show patterns that deviate significantly from Wright's (1931, 1969) neutral infinite-island model. Under this model, allele frequencies in different populations settle into an equilibrium shape described by a multivariate beta-distribution. The parameters of this beta-distribution are the allele frequencies of the pooled set of populations (Q), and the standardized variance in allele frequencies (Fst). Random samples of allele frequencies are repeatedly drawn from this beta-distribution to generate the null distributions of Fst for each locus. The observed Fst scores for each locus are compared to this null distribution to see if they fall outside the 95% confidence limits. The analysis will therefore reveal which loci show significant deviations from the average, multilocus pattern.
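The final comparison step can be sketched as follows. This is a minimal Python illustration, not the program's own C++ code: the function names `percentile` and `outside_null_limits` are hypothetical, and the use of simple interpolated percentiles for the 95% limits is an assumption about how such limits might be taken from a list of simulated Fst values.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 1]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def outside_null_limits(observed_fst, null_fsts, level=0.95):
    """Compare one locus's observed Fst with its simulated null distribution.

    Returns (is_outside, lower_limit, upper_limit). The percentile method
    used here is an assumption made for illustration.
    """
    ordered = sorted(null_fsts)
    tail = (1.0 - level) / 2.0
    lower = percentile(ordered, tail)
    upper = percentile(ordered, 1.0 - tail)
    is_outside = observed_fst < lower or observed_fst > upper
    return is_outside, lower, upper
```

A locus would be flagged when its observed Fst falls below the lower limit or above the upper limit of its own null distribution.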
The null distribution includes sampling variation from two
sources, the sampling pattern of the original data (the numbers
of populations and individuals sampled) and the estimation error
around the parameters used to create the beta-distribution, namely
the observed values of Q, Fst and Fis. To handle the first source
of error, the parametric beta-distribution is resampled following
the same sampling pattern of the original data. The expected genotype
frequencies are constructed from these allele frequencies using
Fis, the within-population inbreeding coefficient, because this
also influences sampling variation. To handle the second source
of error, each replicate used to build the null distribution is
generated from a different realization of the beta-distribution,
created using unique values of Q, Fst and Fis. These values are
obtained in turn by calculating them from a standard bootstrap
sample from the original data set (resampling populations, then
individuals within populations).
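The standard bootstrap sample mentioned above (resampling populations, then individuals within populations) might look like this in outline. It is a sketch under stated assumptions: the data layout (a list of populations, each a list of individuals) and the name `bootstrap_sample` are hypothetical, and the program's own implementation may differ.

```python
import random


def bootstrap_sample(populations, rng):
    """Two-level bootstrap: populations first, then individuals within them.

    `populations` is a list of populations; each population is a list of
    individuals (for example, tuples of genotypes). Both levels are drawn
    with replacement, and each resampled population keeps the sample size
    of the population it was drawn from.
    """
    n_pops = len(populations)
    chosen = [populations[rng.randrange(n_pops)] for _ in range(n_pops)]
    return [
        [pop[rng.randrange(len(pop))] for _ in range(len(pop))]
        for pop in chosen
    ]
```

Q, Fst and Fis would then be recalculated from each such sample to parameterize one replicate of the null distribution.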
Finding significance means that one or more of the island-model assumptions is violated, but it doesn't tell you which. It could be because the deviant locus is under natural selection, or it could be that the migration patterns are different from the island model's, or for a host of other reasons. It could even be that most loci are under similar selection regimes but the deviant locus is neutral! Independent studies are needed to properly ascribe a cause.
The program takes data in my own format (very similar to Swofford's
BIOSYS) or Arlequin
format. It supports only diploid genotypic data with codominant
loci. Soon I will put out versions for haploid and for dominant
data (such as RFLP or AFLP). Please lower any expectations you
have about the convenience of the user interface!
Fixed in version 0.5: An interface bug that kept users from opting for a bootstrap resampling scheme when calculating the null Fst distributions. The bootstrap scheme is now a default setting.
Please note that this is an alpha-release, and you may
encounter bugs, especially in the interface. In my experience,
most bugs arise when users present the program with data having
idiosyncrasies that I didn't expect. I would appreciate
feedback before you give up on it!
IslandModelTest Carbon v0.5 (Macintosh)
IslandModelTest Win v0.5 (Windows)
IslandModelTest -- User's Guide (you'll be lost without it!)
IslandModelTest -- sample data (randomly generated, so no significant effects will be found). If you can't get this to read properly, it's quite possibly an issue involving end-of-line characters in text files, which differ on different platforms. Try either:
--> running the file through my little reformatting program, WinMacNewLineFormatter, supplied below;
--> opening the file in BBEdit and saving in the format of your favorite platform.
Source code (also requires AdamLibraries, below)
Supporting publication (PDF format)
Contact me if you can't get these to download properly.
Algorithm elements:
IslandModelTest relies on two main algorithmic components, one that draws allele frequencies from the island model's null distribution, and one that draws individual genotypes from an expected genotypic distribution.
IslandModelRandomAlleles provides random draws of allele frequencies from a multi-allele beta distribution (= Dirichlet distribution), given Fst and a list of expected allele frequencies.
IslandModelRandomAlleles (Macintosh)
IslandModelRandomAlleles (Windows)
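For readers who want to see the idea without downloading the program, here is a minimal Python sketch of drawing allele frequencies from a Dirichlet distribution. It is not the downloadable algorithm itself. It assumes the common parameterization in which allele i has Dirichlet parameter p_i(1 - Fst)/Fst, so that the variance of its frequency is p_i(1 - p_i)Fst; the function name `draw_island_allele_freqs` is hypothetical.

```python
import random


def draw_island_allele_freqs(expected_freqs, fst, rng):
    """Draw one set of allele frequencies from a Dirichlet distribution.

    Assumed parameterization: alpha_i = p_i * (1 - Fst) / Fst.
    All expected frequencies must be positive, and 0 < Fst < 1.
    """
    if not 0.0 < fst < 1.0:
        raise ValueError("Fst must lie strictly between 0 and 1")
    if any(p <= 0.0 for p in expected_freqs):
        raise ValueError("expected allele frequencies must be positive")
    theta = (1.0 - fst) / fst
    while True:
        # A Dirichlet draw is a set of independent gamma draws, normalized.
        gammas = [rng.gammavariate(p * theta, 1.0) for p in expected_freqs]
        total = sum(gammas)
        if total > 0.0:  # redraw in the rare case that every draw underflows
            return [g / total for g in gammas]
```

Each call returns one population's allele frequencies; repeated calls give the scatter among populations expected for the given Fst.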
ExpectedGenotype provides expected diploid genotype distributions, given a list of allele frequencies and an Fis value. Although Fis can range from -1 to 1, negative Fis values are actually constrained to be above Fis = -1 when allele frequencies are unequal (otherwise negative genotype frequencies can be returned). This algorithm takes the constraints into account. It is described in the appendix of the Molecular Ecology paper.
ExpectedGenotype (Macintosh)
ExpectedGenotype (Windows)
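The constraint on negative Fis can be illustrated with a short Python sketch. It uses the textbook expectations (homozygote frequency p_i^2 + Fis p_i(1 - p_i), heterozygote frequency 2 p_i p_j (1 - Fis)) and simply clamps Fis at the lowest value that keeps every homozygote frequency non-negative. The clamping rule and the names `min_fis` and `expected_genotypes` are assumptions for illustration; the appendix of the Molecular Ecology paper describes how the actual algorithm handles the constraint.

```python
def min_fis(freqs):
    """Lowest Fis that keeps all expected homozygote frequencies >= 0."""
    # p^2 + F * p * (1 - p) >= 0 requires F >= -p / (1 - p) for each allele.
    bounds = [-p / (1.0 - p) for p in freqs if 0.0 < p < 1.0]
    return max(bounds + [-1.0])


def expected_genotypes(freqs, fis):
    """Expected diploid genotype frequencies, keyed by allele index pairs.

    Fis below the feasible minimum is clamped (an assumption of this sketch).
    """
    if fis > 1.0:
        raise ValueError("Fis cannot exceed 1")
    f = max(fis, min_fis(freqs))
    genotypes = {}
    for i, p in enumerate(freqs):
        # max() guards against tiny negative values from round-off.
        genotypes[(i, i)] = max(0.0, p * p + f * p * (1.0 - p))
        for j in range(i + 1, len(freqs)):
            genotypes[(i, j)] = 2.0 * p * freqs[j] * (1.0 - f)
    return genotypes
```

With allele frequencies of 0.2 and 0.8, for example, the lowest feasible Fis is -0.25, at which point the rarer homozygote disappears entirely.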
ClineFit
This program fits genotypic data to equilibrium cline models developed by Nick Barton. It uses a numerical maximum likelihood algorithm (an MCMC method aka a Metropolis-Hastings algorithm aka a biased random walk), and returns maximum-likelihood estimates and 2-unit support limits. It supports diploid, haplodiploid or sex-linked genotypic data using codominant loci.
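To make "maximum-likelihood estimates and 2-unit support limits" concrete, here is a small Python sketch for the simplest case, a two-parameter cline. Several things in it are assumptions rather than a description of the program: the sigmoid (tanh) cline shape, the binomial likelihood on allele counts, the function names, and above all the search method, which is a plain grid search swapped in for the Metropolis-Hastings search so that the example is deterministic.

```python
import math


def cline_freq(x, center, width):
    """Allele frequency at location x under a sigmoid (tanh) cline.

    Width is defined here as 1 / (maximum slope), an assumed convention.
    """
    return 0.5 * (1.0 + math.tanh(2.0 * (x - center) / width))


def log_likelihood(samples, center, width):
    """Binomial log-likelihood; samples are (location, allele_count, n_alleles)."""
    total = 0.0
    for x, k, n in samples:
        # Keep p away from 0 and 1 so the logarithms stay finite.
        p = min(max(cline_freq(x, center, width), 1e-9), 1.0 - 1e-9)
        total += k * math.log(p) + (n - k) * math.log(1.0 - p)
    return total


def fit_center(samples, centers, widths):
    """Grid-search estimate of the center, with 2-unit support limits.

    For each candidate center the log-likelihood is maximized over the
    candidate widths (a profile likelihood). The support limits are the
    lowest and highest centers whose profile log-likelihood lies within
    2 units of the maximum.
    """
    profile = [
        (max(log_likelihood(samples, c, w) for w in widths), c)
        for c in centers
    ]
    best_lnl, best_center = max(profile)
    supported = [c for lnl, c in profile if lnl >= best_lnl - 2.0]
    return best_center, best_lnl, (min(supported), max(supported))
```

A grid search becomes impractical as parameters are added, which is one reason a random-walk search is attractive for the 4-, 6- and 8-parameter models.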
ClineFit takes data in my own format (very similar to Swofford's BIOSYS) or Arlequin format. It requires that the location of each population sample be a number, placed as the last piece of information in the population's name. It also requires that you identify, as part of the name of each locus, the alleles that will be most frequent on the right side of the cline.
ClineFit gives you considerable flexibility in determining the models that you can use. You can fit:
--> clines with 2, 4, 6 & 8 primary shape parameters, including:
- center & width
- 4 parameters describing introgression tails on either side of the cline
- 2 parameters describing frequencies of asymptotic polymorphisms on each side of the cline, if they are not fixed.
--> single or multiple markers
This last feature gives you considerable flexibility for hypothesis testing. For example, you can determine if one trait has a unique center (or other parameter value) by first estimating the shape with that trait's center estimated as unique, then estimating the model with that trait's center co-estimated with the remaining traits. That trait has a significantly unique center (at level alpha) if twice the difference between the log-likelihoods of these two estimates is greater than the tabled value for level alpha in a chi-square distribution with 1 degree of freedom. Generally, the number of degrees of freedom is the difference in the number of parameters estimated in fitting the two models. The motivation for these sorts of likelihood tests is well described in Hilborn & Mangel (1997), The Ecological Detective (Princeton Monographs).
--> sex linkage and haplodiploidy
--> disequilibrium estimates, from which dispersal and selection estimates are obtained in 6- & 8-parameter models
--> models that omit parameters of your choice, such as the introgression tail on one side of the cline
--> models that omit such parameters for some traits but not others, under your control
--> models that combine parameters in ways that you control. For example, you can estimate a single center for all traits, or a unique center for each trait, or any combination in between.
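The likelihood comparison described above for testing whether one trait has a unique center can be written out in a few lines. This is a Python sketch for the 1-degree-of-freedom case only, taking the likelihoods on the log scale; the function names are hypothetical, and the tail probability uses the identity that a chi-square variate with 1 degree of freedom is the square of a standard normal variate.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    if x <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


def likelihood_ratio_test(lnl_full, lnl_reduced):
    """Compare two nested fits that differ by one estimated parameter.

    lnl_full is the log-likelihood of the model with the extra parameter
    (e.g. a unique center for one trait); lnl_reduced is the log-likelihood
    with that parameter co-estimated. Returns (test statistic, p-value).
    """
    statistic = 2.0 * (lnl_full - lnl_reduced)
    return statistic, chi2_sf_1df(statistic)
```

For instance, log-likelihoods of -100.0 and -103.0 give a statistic of 6.0, which exceeds the tabled 5% value of 3.84 for 1 degree of freedom.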
If you want to measure only cline shape without underlying dispersal and selection parameters, you can also fit clines of
--> cytoplasmic markers (such as mitochondrial or chloroplast loci)
--> dominant markers (such as RFLP or AFLP)
ClineFit uses the method published in Evolution
(Porter et al. 1997), which in turn follows Barton's methods very
closely. Its main difference is that it fits genotypes to the
cline shape directly, rather than fitting transformed data to
a linearized model; this isn't such a big difference. I hope to
put out versions for quantitative traits, dominant
data (such as RFLP or AFLP), and analyses that incorporate cytonuclear
disequilibrium.
Please lower any expectations you have about the convenience of the user interface! If your data format deviates even slightly from the sample data, you might well run into error messages, crashes or plain nonsense.
ClineFit is an alpha release (really pre-alpha). I've used the main algorithms for a while now, but the interface may still have bugs. It's possible that new data sets with extreme conditions will uncover inconsistencies that I haven't anticipated in the numerical estimators. It's a null hypothesis that any program is bug-free.
ClineFit_v0.2 MacOS 8-10.x (Carbon)
ClineFit_v0.2 Windows
sample cline allozyme data (use this data format). More on the formatting can be found in the user guide for IslandModelTest, above. If you can't get this to read properly after combing it for formatting inconsistencies, it's likely an issue involving end-of-line characters in text files, which differ on different platforms. Try opening the file in BBEdit and saving in the format of your favorite platform.
Source code - v0.2. It relies on AdamLibraries20081116.
User Guide -- not available yet. One thing: put your data file into a folder named ClineFitFiles. Then, the ClineFit program should be in the same directory as ClineFitFiles.
Obsolete versions:
ClineFit_v0.1 - has several interface bugs, and small rounding errors in the last digit of the output (the internal calculations were unaffected).
WinMacNewLineFormatter
This program is obsolete. Use BBEdit for this.
This text is just an archive:
This little program may clean up end-of-line problems, formatting files explicitly for the operating system of choice no matter what the current end-of-line formatting is. Run the program and choose the options you want. There are a few constraints, since I don't specialize in programming interfaces. If it doesn't work, then try pasting a text file into an email program, sending it to yourself, and then pasting it back into your computer into a new file.
The file you want to check should be in the same folder as the program, and it needs a filename with no spaces in it.
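For anyone without BBEdit, the same kind of conversion can be sketched in a few lines of Python. This is not the WinMacNewLineFormatter code; the function name `convert_newlines` and the target labels are hypothetical, chosen only for this example.

```python
def convert_newlines(data: bytes, target: str) -> bytes:
    """Rewrite all line endings in `data` for the chosen platform.

    Works whatever mixture of endings the input contains. `target` is one
    of "unix" (LF), "windows" (CR LF) or "classic-mac" (CR).
    """
    endings = {"unix": b"\n", "windows": b"\r\n", "classic-mac": b"\r"}
    if target not in endings:
        raise ValueError("unknown target platform: " + target)
    # First reduce every kind of ending to LF, then expand to the target.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", endings[target])
```

Reading and writing the file in binary mode keeps Python from translating the line endings itself along the way.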
WinMacNewLineFormatter MacOS 8-10.x (Carbon)
WinMacNewLineFormatter Windows
AdamLibraries
These are the core algorithms of my analyses and simulations,
in C++, which I compile using MetroWerks
CodeWarrior, which is unfortunately no longer supported. With
work, they are portable to the GCC compiler, and presumably others as
well. Eventually, as CodeWarrior begins to fail on the newer operating
systems, I'll have to switch too.
I wrote most of these algorithms myself, but some are based on public domain or restricted-use code. MemoryManager and the cumulative distribution functions are not mine. All my classes and algorithms are copyrighted, and you may not use them without my permission. I will provide prompt written permission for most non-profit uses. But you have to ask.
I update these libraries occasionally, whenever I develop a new algorithm that I think I might want to re-use later, and whenever I find a bug or undesirable feature.
If you find a bug, please let me know!
Updated: 10 February 2010, A. Porter