POPSICLE: a software suite to determine population structure and to establish genotype-phenotype associations using next-generation sequencing data
Analytical Pipeline to determine Population Structure
The input to POPSICLE is the set of alignments of the short reads to the genome of interest, in binary alignment map (BAM) format and sorted by genomic position. POPSICLE takes these alignments and determines local and global ancestries.
Step 1: Determine somies using the FindSomies utility of the POPSICLE package. See the FindSomies page for more details.
Step 2: Find alleles using the findAlleles utility of the POPSICLE package. See the FindAlleles page for more details.
Step 3: Generate the POPSICLE input using the somy file generated in Step 1 and the allele files generated in Step 2. See the GenerateInputFromAlleleFiles page for more details.
Here, -i is the directory containing the allele files generated in Step 2 of this pipeline. -j is the directory containing the single nucleotide polymorphism (SNP) files generated using a utility such as samtools. Alternatively, one may use one's own markers of choice. -k is the format of the SNP files (VCF is the preferred format; markers can be submitted in tab-delimited format using the "tab" tag). -l is the somies file generated in Step 1 of the POPSICLE pipeline.
Step 4: Remove loci that are not variant across the samples (optional). See the RemoveInsignificantLoci page for more details.
Here, -i is the POPSICLE input file generated using the GenerateInputFromAlleleFiles utility of the POPSICLE pipeline (see Step 3). -o is the filtered output file, written after removing the markers that do not pass the filter. -m is the factor that determines which markers are filtered out. If -m is set to 0.8 and 80% of the samples have an identical allele at a marker, that marker is filtered out.
Step 5: Remove loci with lots of missing data. See the RemoveLociWithLotsOfMissingData page for more details.
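
The -m threshold of Step 4 can be illustrated with a short Python sketch. It is not the RemoveInsignificantLoci implementation. The in-memory marker dictionary, the function names, and the "at least" reading of the threshold are all assumptions made for illustration.

```python
from collections import Counter

def should_filter(alleles, m=0.8):
    """Return True when the most common allele is shared by at least
    a fraction m of the samples (assumed reading of the -m factor)."""
    if not alleles:
        return False
    most_common_count = Counter(alleles).most_common(1)[0][1]
    return most_common_count / len(alleles) >= m

def remove_insignificant_loci(markers, m=0.8):
    """Keep only the markers that vary enough across samples.
    `markers` maps a hypothetical marker name to one allele per sample."""
    return {name: alleles for name, alleles in markers.items()
            if not should_filter(alleles, m)}

# Hypothetical example data: five samples per marker.
markers = {
    "chr1:100": ["A", "A", "A", "A", "G"],  # 4 of 5 samples share A
    "chr1:250": ["A", "G", "A", "G", "T"],  # no allele reaches 80%
    "chr2:75":  ["C", "C", "C", "C", "C"],  # all samples identical
}
kept = remove_insignificant_loci(markers, m=0.8)
print(sorted(kept))
```

With -m set to 0.8, only the marker with no dominant allele survives in this example.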
Step 6: Find divergence from baseline. The samples are compared against the reference at each marker position, and a score is assigned based on the divergence from the reference. See the FindDivergenceFromBaseline page for more details.
Step 7: Sample the baseline file generated in Step 6 to generate a smaller file with fewer markers. This sampled file allows faster downstream processing, such as clustering.
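
The idea behind Step 7 can be sketched in Python as follows. The source does not state how POPSICLE chooses the markers, so uniform random sampling, the one-row-per-marker layout, and the function name are assumptions.

```python
import random

def sample_markers(rows, n_markers, seed=0):
    """Return n_markers rows chosen uniformly at random, in their
    original order. `rows` is a hypothetical list with one entry per
    marker; the seed makes the selection reproducible."""
    if n_markers >= len(rows):
        return list(rows)
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(rows)), n_markers))
    return [rows[i] for i in chosen]

# Hypothetical baseline with ten markers, reduced to four.
rows = ["m%d" % i for i in range(10)]
sampled = sample_markers(rows, 4, seed=1)
print(sampled)
```

Keeping the sampled rows in their original order preserves the genomic ordering of the markers, which later block-based steps may rely on.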
Step 8: Convert the files generated in Steps 6 and 7 into .ARFF format
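
For readers unfamiliar with ARFF (the attribute-relation file format used by Weka), the following Python sketch shows what such a conversion might produce. The layout of one row per sample and one numeric attribute per marker, the relation name, and the marker names are assumptions, not the layout POPSICLE actually writes.

```python
def to_arff(relation, marker_names, samples):
    """Render a samples-by-markers numeric table as ARFF text.
    Attribute names containing spaces would need quoting, so the
    hypothetical names used here avoid them."""
    lines = ["@RELATION " + relation, ""]
    for name in marker_names:
        lines.append("@ATTRIBUTE " + name + " NUMERIC")
    lines.append("")
    lines.append("@DATA")
    for scores in samples:
        if len(scores) != len(marker_names):
            raise ValueError("each sample needs one score per marker")
        lines.append(",".join(str(s) for s in scores))
    return "\n".join(lines) + "\n"

# Hypothetical divergence scores: two samples, three markers.
marker_names = ["chr1_100", "chr1_250", "chr2_75"]
samples = [[0, 1, 2], [2, 0, 1]]
arff_text = to_arff("popsicle_baseline", marker_names, samples)
print(arff_text)
```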
Step 9: Cluster the samples to find the major sample groups. The code generates a clustering for each cluster count from the minimum to the maximum and assigns the samples to clusters for each count. It also reports a score associated with each cluster count.
Here, -i is the sampled ARFF file generated in Step 8. -o is the output clusters file that indicates which samples are assigned to which clusters and the score associated with each clustering. -n is the minimum number of clusters to which the samples are assigned. -m is the maximum number of clusters to which the samples are assigned.
Step 10: Find local ancestries using POPSICLE.
Here, -i is the POPSICLE baseline file generated in Step 7. -j is the clusters file generated in Step 9. -o is the output ancestry file with local ancestry information. -k is the baseline ARFF file generated in Step 8. -l is the directory where temporary files are placed.
Step 11: Polish the local ancestries (for plotting purposes only). This step polishes the ancestry file generated in Step 10. The ancestry profiles within the specified number of blocks are searched, and the current block is assigned the ancestry that is present in the largest number of those blocks.
Here, -i is the ancestry file generated in Step 10. -n is the number of blocks that are searched for consistency; it can be any odd number. If -n is 5, the current block, the two blocks before it, and the two blocks after it are searched, and the ancestry that is present in the most blocks is reported. If no single ancestry has the maximum count, the current ancestry is retained.
Step 12: Arrange the ancestries by their cluster assignments (for plotting purposes only).
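
The windowed majority vote used for polishing in Step 11 can be sketched in Python as below. This is an illustration rather than the POPSICLE code. Voting on the original (unpolished) labels and truncating the window at the ends of the sequence are assumptions, and the function name is hypothetical.

```python
from collections import Counter

def polish(ancestries, n=5):
    """Assign each block the ancestry found in the most blocks of an
    n-block window centred on it; keep the current ancestry on a tie."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    half = n // 2
    polished = []
    for i, current in enumerate(ancestries):
        # Window is cut short at the sequence ends (assumption).
        window = ancestries[max(0, i - half): i + half + 1]
        counts = Counter(window).most_common()
        top_label, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        polished.append(current if tied else top_label)
    return polished

# Hypothetical ancestry labels for eight consecutive blocks.
blocks = ["A", "A", "B", "A", "A", "B", "B", "B"]
print(polish(blocks, n=3))
```

In this example the isolated "B" at the third block is replaced by "A", while the run of "B" blocks at the end is left intact.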
Step 13: Find genotype-phenotype associations.
java -jar LPDtools.jar FindGenotypePhenotypeBootstrap -i popsicleFile -j classLabels -o output -n iterations
Here, -i is the local ancestry file generated using the POPSICLEIntermediate utility of POPSICLE (Step 10). -j is the class label file containing two columns: the first column contains the sample names and the second column contains the phenotype information. -n is the number of bootstrap iterations desired; for most practical purposes, -n 50 is sufficient. -o is the output file with the score of each block and its associated p-value.
Step 14: Find annotations of the blocks.
java -jar LPDtools.jar findAnnotations -i GenotypePhenotypeAssociationFile -j gffFile -o outputFileWithAnnotations
Here, -i is the genotype-phenotype association file generated in Step 13, -j is the GFF file containing annotations, and -o is the output file.
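
The annotation lookup of Step 14 amounts to intersecting genomic blocks with GFF features. A minimal Python sketch follows. It relies only on the standard nine-column, tab-separated GFF layout with 1-based inclusive coordinates. The (chromosome, start, end) representation of a block and the function names are assumptions, since the format of the association file is not described here.

```python
def parse_gff(lines):
    """Parse GFF lines into (seqid, start, end, type, attributes) tuples,
    skipping comments and lines without the nine standard columns."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        seqid, _source, ftype, start, end, _score, _strand, _phase, attrs = cols
        features.append((seqid, int(start), int(end), ftype, attrs))
    return features

def annotate_blocks(blocks, features):
    """Map each (chrom, start, end) block to the features overlapping it.
    Coordinates are treated as 1-based and inclusive on both ends."""
    result = {}
    for chrom, b_start, b_end in blocks:
        result[(chrom, b_start, b_end)] = [
            f for f in features
            if f[0] == chrom and f[1] <= b_end and f[2] >= b_start
        ]
    return result

# Hypothetical GFF content and blocks.
gff_lines = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t100\t500\t.\t+\t.\tID=geneA",
    "chr1\tsrc\tgene\t900\t1200\t.\t-\t.\tID=geneB",
    "chr2\tsrc\tgene\t50\t300\t.\t+\t.\tID=geneC",
]
blocks = [("chr1", 400, 950), ("chr2", 400, 600)]
annotations = annotate_blocks(blocks, parse_gff(gff_lines))
for block, hits in annotations.items():
    print(block, [h[4] for h in hits])
```

The linear scan over all features is adequate for a sketch; a real tool handling whole-genome annotations would more likely index features by chromosome and position.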
CITATION: Jahangheer S. Shaik, Asis Khan and Michael E. Grigg, "POPSICLE: A Software Suite to Study Population Structure and Ancestral Determinants of Phenotypes using Whole genome Sequencing Data", submitted to PLoS special edition