After being disabled and not supported for several months, ‘hierfstat’ (by Jerome Goudet) now has lots of useful (and fast) calculations of summary statistics, including expected and observed heterozygosity, Fst and Jost’s Dest.
In response to Greg’s suggestion, here’s some code to illustrate how I’ve used it (so far).
My data files are organised with individuals in columns and SNPs in rows (missing data are coded as NA). The first few columns contain information about the SNPs (contig, position etc.). I also read a text file containing the population assignments of the individuals in the form of a single column, ordered as they are in the data file. This is used to identify which columns belong to each populations, which are extracted to the dataframe “gen” before being converted from “GT” format to “34” format with the command above. Then you have to add a column containing the population assignments, converted to integers.
##################################################################### ## Setup ##################################################################### # Find input file prefix="radCOtoAnnML.grsa.full80-100perc.var2" complete=read.table(paste(prefix,".fullgenotypes.dat",sep=""),sep="\t",header=T,na.strings = c("NA","-"),stringsAsFactors = FALSE) # Specify other important information info=3 # number of information columns preceding SNP genotypes pop.names=c("D","N","I","A","P") # desired order for the populations pop.pairs=c("DN","DI","DA","DP","NI","NA","NP","IA","IP","AP") #desired order for pairs of populations listdir="~/rfiles/SNPanalysis/" poplist=factor(unlist(read.table(paste(datadir,"ecotyperaw.txt",sep=""))),levels=pop.names) # a text file containing a single column indicating the population assignment of each individual in the order of the columns in the SNP file subpoplist=as.factor(unlist(read.table(paste(datadir,"popraw.txt",sep="")))) # optional # Calculate population columns listcols0=lapply(levels(poplist),function(x)which(poplist==x)) listcols=lapply(levels(poplist),function(x)which(poplist==x)+info) datcols=info+1:length(poplist)
##################################################################### ## Single population statistics ##################################################################### # Set up frame to hold results paramlist=c("n","Ho","Hs","Fis") columnlist=paste(rep(paramlist,rep(length(pop.names),length(paramlist))),rep(pop.names,length(paramlist)),sep="_") tmpmat=matrix(nrow=nrow(complete),ncol=length(columnlist), dimnames=list(rownames(complete),columnlist)) tmpframe=as.data.frame(tmpmat) # Extract data columns gen=complete[,datcols] # Run for all columns, only extract single-population statistics kmax=ceiling(nrow(gen)/5000) for(k in 1:kmax){ n=5000 include=5000*(k-1)+(1:n) if(5000*k > nrow(complete))include=(5000*(k-1)+1):nrow(complete) gen=complete[include,datcols] for(i in 1:ncol(gen)){for(j in 1:4)(gen[,i]=gsub(c("A","C","G","T")[j],j,gen[,i]))} for(i in 1:ncol(gen))gen[,i]=as.numeric(gen[,i]) dat=as.data.frame(cbind(factor(poplist,levels=pop.names),t(gen))) tmplist=basic.stats(dat,diploid=TRUE,digits=4) tmpframe[include,]=cbind(tmplist$n.ind.samp,tmplist$Ho,tmplist$Hs,tmplist$Fis) } sumstats=tmpframe
I’m glad to hear they fixed this! thanks for letting me know. Could you post an example data frame of what the data should look like? That would be awesome.
Good idea Greg – I was being lazy.
This is great! Its nice to have a complete example of functioning code. thanks