AnalyticalPipelinePopulationStructure


POPSICLE: a software suite to determine population structure and to establish genotype-phenotype associations using next-generation sequencing data

Analytical Pipeline to determine Population Structure
The input to POPSICLE is a set of alignments of short reads to the genome of interest in Binary Alignment Map (BAM) format, sorted by genomic position. POPSICLE takes these alignments and determines local and global ancestries.
Step 1: Determine somies using the FindSomies utility from the POPSICLE package. See the FindSomies page for more details.

java -jar LPDtools.jar FindSomies -i directoryBAMfiles -o outputSomies.txt -m 1 -k chrSizes.txt

Here, -i is the directory containing the sorted, aligned BAM files. -o is the output file containing the somies of the chromosomes in each sample. -m is the ploidy of the organism, and -k is a two-column file: the first column holds the chromosome names and the second column holds the chromosome sizes.
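The two-column chromosome size file passed with -k can be derived from a FASTA index (.fai) file such as the one written by samtools faidx, whose first two columns are the sequence name and its length. The following is a minimal sketch in Python; the function name fai_to_chr_sizes, the example index lines, and the use of a tab as the column separator are assumptions, since the manual does not state the delimiter that FindSomies expects.

```python
def fai_to_chr_sizes(fai_lines):
    """Turn FASTA index (.fai) lines into 'name<TAB>size' lines.

    A .fai line has five tab-separated columns; only the first two
    (sequence name, sequence length) are needed here.
    """
    rows = []
    for line in fai_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        name, length = fields[0], int(fields[1])
        rows.append(f"{name}\t{length}")
    return rows


# Hypothetical index content, for illustration only.
example_fai = [
    "chr1\t2000000\t6\t60\t61",
    "chr2\t1500000\t2033345\t60\t61",
]
chr_sizes = fai_to_chr_sizes(example_fai)
print("\n".join(chr_sizes))
```

The chromosome names should presumably match the reference names used in the BAM files, so building the file from the same reference that was used for alignment avoids mismatches.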
Step 2: Find alleles using the findAlleles utility of the POPSICLE package. See the FindAlleles page for more details.

java -jar LPDtools.jar findAlleles -i directoryBAMfiles -n 1 -o outputDirectoryAlleleFiles

Here, -i is the directory containing the sorted, aligned BAM files. -o is the output directory for the allele files. -n is the window size in bp; -n 1 finds the allele composition at each base.
Step 3: Generate POPSICLE input using the somy file generated in Step 1 and allele files generated in Step 2. See GenerateInputFromAlleleFiles page for more details

java -jar LPDtools.jar GenerateInputFromAlleleFiles -i directoryAlleleFiles -j directoryOfSNPfiles -o outputPopsicleFile.txt -k "vcf" -l somiesFile

Here, -i is the directory containing the allele files generated in Step 2 of this pipeline. -j is the directory containing the single nucleotide polymorphism (SNP) files generated using a utility such as samtools. Alternatively, users may supply their own markers of choice. -k is the format of the SNP files (VCF is the preferred format; markers can be submitted in tab-delimited format using the "tab" tag). -l is the somies file generated in Step 1 of the POPSICLE pipeline.
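Steps 1 to 3 feed into one another, so it can help to assemble the three commands in one place. The sketch below only builds and prints the command lines using the flags documented above; the helper popsicle_cmd and the directory names bams, alleles and snps are hypothetical, and the call that would actually run each command is left commented out.

```python
import shlex

JAR = "LPDtools.jar"  # path to the POPSICLE jar; adjust as needed


def popsicle_cmd(utility, **flags):
    """Build a 'java -jar LPDtools.jar <utility> -flag value ...' command list."""
    cmd = ["java", "-jar", JAR, utility]
    for flag, value in flags.items():
        cmd += [f"-{flag}", str(value)]
    return cmd


# Directory and file names here are placeholders, not fixed by POPSICLE.
steps = [
    popsicle_cmd("FindSomies", i="bams", o="outputSomies.txt", m=1, k="chrSizes.txt"),
    popsicle_cmd("findAlleles", i="bams", n=1, o="alleles"),
    popsicle_cmd("GenerateInputFromAlleleFiles", i="alleles", j="snps",
                 o="outputPopsicleFile.txt", k="vcf", l="outputSomies.txt"),
]

for cmd in steps:
    print(shlex.join(cmd))
    # To execute: import subprocess; subprocess.run(cmd, check=True)
```

Keeping the somies file name in one place makes it harder to pass a stale file to Step 3.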
Step 4: Remove loci that are not variant across the samples (optional). See RemoveInsignificantLoci page for more details

java -jar LPDtools.jar RemoveInsignificantLoci -i popsicleInputFile -o OutputFilteredPopsicleFile1.txt -m 0.9

Here, -i is the POPSICLE input file generated using the GenerateInputFromAlleleFiles utility of the POPSICLE pipeline (see Step 3). -o is the filtered output file, written after removing markers that don't pass the filter. -m is the factor that determines which markers are filtered out. For example, if -m is set to 0.8 and 80% of the samples share an identical allele at a marker, that marker is filtered out.
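The filtering rule can be illustrated with a small Python sketch. This is not the RemoveInsignificantLoci implementation; the marker dictionary, the function is_invariant, and the reading of the threshold as "at least a fraction m of samples share the most common allele" are assumptions made for illustration.

```python
from collections import Counter


def is_invariant(alleles, m):
    """True if the most common allele is carried by at least fraction m of samples."""
    top_count = Counter(alleles).most_common(1)[0][1]
    return top_count / len(alleles) >= m


# Hypothetical markers: one allele call per sample.
markers = {
    "chr1:100": ["A", "A", "A", "A", "G"],  # 80% of samples carry A
    "chr1:250": ["A", "G", "G", "A", "T"],  # most common allele in 40% of samples
    "chr1:900": ["C", "C", "C", "C", "C"],  # identical in all samples
}

kept = [pos for pos, alleles in markers.items() if not is_invariant(alleles, 0.8)]
print(kept)
```

With a threshold of 0.8, only the marker at chr1:250 survives in this toy data set; raising the threshold to 0.9, as in the command above, would also keep chr1:100.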
Step 5: Remove loci with lots of missing data. See the RemoveLociWithLotsOfMissingData page for more details.

java -jar LPDtools.jar RemoveLociWithLotsOfMissingData -i popsicleInputFile -o outputFilteredPopsicleFile -m 0.1

Here, -i is the POPSICLE input file generated in Step 3 or the filtered POPSICLE file generated in Step 4. -o is the output file generated after removing loci with lots of missing data. -m is the maximum tolerable fraction of missing data. For example, if -m is set to 0.7, markers with 70% or more missing data are ignored.
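As a rough illustration of the missing-data rule, the sketch below drops every marker whose fraction of missing calls reaches the threshold. The data layout, the use of None for a missing call, and the function names are assumptions; the actual POPSICLE file format is not described here.

```python
def missing_fraction(calls):
    """Fraction of samples with no call (None) at a marker."""
    return sum(call is None for call in calls) / len(calls)


def drop_sparse_markers(markers, m):
    """Keep markers whose missing fraction is below m (m or more is dropped)."""
    return {pos: calls for pos, calls in markers.items()
            if missing_fraction(calls) < m}


# Hypothetical markers with four samples each.
markers = {
    "chr1:100": ["A", "A", "G", "A"],      # no missing calls
    "chr1:250": ["A", None, None, "G"],    # 50% missing
    "chr1:900": [None, "C", "C", "C"],     # 25% missing
}

filtered = drop_sparse_markers(markers, 0.5)
print(sorted(filtered))
```

Here a threshold of 0.5 removes chr1:250 and keeps the other two markers.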
Step 6: Find divergence from baseline. The samples are compared against the reference at each marker position and a score is allotted based on the divergence from the reference. See the FindDivergenceFromBaseline page for more details.

java -jar LPDtools.jar FindDivergenceFromBaseline -i inputPopsicleFile -o outputPopsicleBaselineFile

Here, -i is the input POPSICLE file generated in Step 3 or any of the filtered versions generated in Steps 4 or 5. -o is the output file.
Step 7: Sample the baseline file generated in Step 6 to generate a smaller file with a few markers. This sampled file is used for faster processing such as for clustering.

java -jar LPDtools.jar SampleBaseLineFile -i popsicleBaselineFile -o outputSampledBaselineFile -m 2

Here, -i is the baseline file generated in Step 6. -o is the sampled version of the file. -m is the factor that determines the number of markers to retain per kb. With -m 2, two markers per kb are retained.
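The idea of retaining a fixed number of markers per kb can be sketched as follows. How SampleBaseLineFile chooses which markers to keep within each kb is not documented here, so this example simply keeps the first m markers in each 1000 bp window; that choice, the function name, and the (chromosome, position) tuples are assumptions.

```python
from collections import defaultdict


def sample_per_kb(markers, m, window=1000):
    """Keep at most m markers per window on each chromosome.

    markers: iterable of (chromosome, position) tuples, sorted by position.
    """
    seen = defaultdict(int)  # markers kept so far in each (chromosome, window)
    kept = []
    for chrom, pos in markers:
        key = (chrom, pos // window)
        if seen[key] < m:
            seen[key] += 1
            kept.append((chrom, pos))
    return kept


# Hypothetical marker positions.
markers = [("chr1", 10), ("chr1", 250), ("chr1", 900), ("chr1", 1200), ("chr2", 40)]
sampled = sample_per_kb(markers, 2)
print(sampled)
```

In this toy input the third marker in the first kb of chr1 is dropped, while the markers in other windows and on chr2 are kept.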
Step 8: Convert the files generated in Steps 6 and 7 into .ARFF format

java -jar LPDtools.jar Convert2ARFFformat -i baselineFileFromStep6 -o outputArffFile
java -jar LPDtools.jar Convert2ARFFformat -i baselineFileFromStep7 -o outputSampledArffFile

Here, -i is the baseline file generated in Step 6 or Step 7. -o is the corresponding output file in .arff format (see WEKA for details on the ARFF file format).
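For readers unfamiliar with ARFF, the sketch below writes a minimal ARFF document: a relation name, one numeric attribute per column, and comma-separated data rows. It only illustrates the general file layout; the attribute names, the row layout, and the function to_arff are assumptions and do not describe the exact output of Convert2ARFFformat.

```python
def to_arff(relation, attribute_names, rows):
    """Return a minimal ARFF document with numeric attributes."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in attribute_names]
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines)


# Hypothetical divergence scores: one row per sample, one column per marker.
arff_text = to_arff(
    "baseline",
    ["chr1_100", "chr1_250"],
    [[0, 1], [2, 0], [0, 2]],
)
print(arff_text)
```

Attribute names that contain spaces would need quoting in ARFF, which is why the example uses underscores.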
Step 9: Cluster the samples to find the major sample groups. The code generates a clustering for each number of clusters from the minimum to the maximum and assigns the samples to clusters for each of them. It also reports a score associated with each number of clusters.

java -jar LPDtools.jar PerformKmeansClustering -i sampledARFFfile -o outputClustersFile -n minClusters -m maxClusters

Here, -i is the sampled ARFF file generated in Step 8. -o is the output clusters file that indicates which samples are assigned to which clusters and a score associated with such clustering. -n is the minimum number of clusters to which the samples are assigned. -m is the maximum number of clusters to which the samples are assigned.
Step 10: Find local ancestries using POPSICLE

java -jar LPDtools.jar POPSICLEIntermediate -i inputPOPSICLEbaselineFile -j clustersFile -o outputAncestryFile -n blockSize -k baselineARFFFile -l temporaryDirectory

Here, -i is the POPSICLE baseline file generated in Step 7. -j is the clusters file generated in Step 9. -o is the output ancestry file with local ancestry information. -n is the block size. -k is the baseline ARFF file generated in Step 8. -l is the directory where temporary files are placed.
Step 11: Polish the local ancestries (for plotting purposes only). This step polishes the ancestry file generated in Step 10. The ancestry profiles within the specified number of blocks are searched, and the current block is assigned the ancestry that is present in the largest number of those blocks.

java -jar LPDtools.jar PolishPosicle -i inputAncestryFile -n blocks

Here, -i is the ancestry file generated in Step 10. -n is the number of blocks that are searched for consistency. -n can be any odd number. With -n 5, the current block, the two blocks before it and the two blocks after it are searched, and the ancestry present in the most blocks is reported. If no single ancestry is the most frequent, the current ancestry is retained.
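The polishing rule is essentially a sliding-window majority vote. The sketch below illustrates it on a list of ancestry labels for consecutive blocks of one sample. It is not the PolishPosicle code: the handling of the first and last blocks (a truncated window) and the choice to vote on the original, unpolished labels are assumptions.

```python
from collections import Counter


def polish(ancestries, n):
    """Majority-vote smoothing over a window of n blocks centered on each block."""
    if n % 2 == 0:
        raise ValueError("n must be odd")
    half = n // 2
    polished = []
    for i, current in enumerate(ancestries):
        # Window is truncated at the ends of the chromosome (assumption).
        window = ancestries[max(0, i - half): i + half + 1]
        ranked = Counter(window).most_common(2)
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            polished.append(current)        # tie: keep the current ancestry
        else:
            polished.append(ranked[0][0])   # clear majority wins
    return polished


# Hypothetical ancestry labels for eight consecutive blocks.
blocks = ["A", "A", "B", "A", "A", "B", "B", "B"]
smoothed = polish(blocks, 3)
print(smoothed)
```

In this example the isolated B at the third block is replaced by A, while the run of B blocks at the end is left unchanged.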
Step 12: Arrange the ancestries by their cluster assignments. (For plotting purposes only)

java -jar LPDtools.jar ArrangePopsicleByClustersFormed -i polishedAncestryFile -o outputPolishedArrangedFile -j clustersFile

Here, -i is the polished ancestry file generated in Step 11. -j is the clusters file generated in Step 9. -o is the output file with the samples arranged by their cluster membership as specified in the clusters file.
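Conceptually, this step reorders the samples so that members of the same cluster sit next to each other in a plot. A minimal sketch, assuming a hypothetical mapping from sample name to cluster number (the real clusters file format is not described here):

```python
def arrange_by_cluster(samples, cluster_of):
    """Order samples by cluster; the original order is kept within a cluster."""
    return sorted(samples, key=lambda sample: cluster_of[sample])


# Hypothetical sample names and cluster assignments.
samples = ["s1", "s2", "s3", "s4", "s5"]
cluster_of = {"s1": 2, "s2": 1, "s3": 2, "s4": 1, "s5": 3}

arranged = arrange_by_cluster(samples, cluster_of)
print(arranged)
```

Because Python's sort is stable, samples within one cluster keep the order they had in the input.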
Step 13: Find genotype-phenotype associations

java -jar LPDtools.jar FindGenotypePhenotypeBootstrap -i popsicleFile -j classLabels -o output -n iterations

Here, -i is the local ancestry file generated using the POPSICLEIntermediate utility of POPSICLE (Step 10). -j is the class label file containing two columns: the first column contains the sample names and the second column contains the phenotype information. -n is the desired number of bootstrap iterations. For most practical purposes, -n 50 is sufficient. -o is the output file with the score of each block and the associated p-values.
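Before running the association step, it can be useful to check the two-column class label file. The sketch below reads such a file and groups the samples by phenotype. The whitespace delimiter, the sample names and the phenotype values are assumptions made for this example.

```python
import io
from collections import defaultdict


def read_class_labels(handle):
    """Read 'sample phenotype' lines into a dict mapping sample to phenotype."""
    labels = {}
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        sample, phenotype = line.strip().split(None, 1)
        labels[sample] = phenotype
    return labels


# Hypothetical class label file content.
label_file = io.StringIO("s1\tvirulent\ns2\tavirulent\ns3\tvirulent\n")
labels = read_class_labels(label_file)

by_phenotype = defaultdict(list)
for sample, phenotype in labels.items():
    by_phenotype[phenotype].append(sample)
print(dict(by_phenotype))
```

A quick look at the group sizes shows whether any phenotype class has too few samples to be informative.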

Step 14: Find annotations of the blocks

java -jar LPDtools.jar findAnnotations -i GenotypePhenotypeAssociationFile -j gffFile -o outputFileWithAnnotations

Here, -i is the genotype-phenotype association file generated in Step 13. -j is the GFF file containing annotations, and -o is the output file.
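Annotating a block amounts to finding the GFF features whose coordinates overlap it. The sketch below parses GFF lines (nine tab-separated columns, with 1-based inclusive start and end in columns 4 and 5) and reports the features that overlap a block. The block tuple, the example GFF lines and the function names are hypothetical, and the findAnnotations utility may report annotations differently.

```python
def parse_gff(lines):
    """Return (seqid, start, end, type, attributes) tuples from GFF lines."""
    features = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        f = line.rstrip("\n").split("\t")
        features.append((f[0], int(f[3]), int(f[4]), f[2], f[8]))
    return features


def annotate(block, features):
    """Features on the same chromosome that overlap block = (chrom, start, end)."""
    chrom, start, end = block
    return [feat for feat in features
            if feat[0] == chrom and feat[1] <= end and feat[2] >= start]


# Hypothetical GFF content.
gff_lines = [
    "##gff-version 3",
    "chr1\t.\tgene\t1000\t2000\t.\t+\t.\tID=gene1",
    "chr1\t.\tgene\t2400\t3000\t.\t-\t.\tID=gene2",
    "chr2\t.\tgene\t1500\t1800\t.\t+\t.\tID=gene3",
]
features = parse_gff(gff_lines)
hits = annotate(("chr1", 1500, 2500), features)
print([feat[4] for feat in hits])
```

For large annotation files, sorting the features and using an interval index would be faster than this linear scan.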

CITATION: Jahangheer S. Shaik, Asis Khan and Michael E. Grigg, "POPSICLE: A Software Suite to Study Population Structure and Ancestral Determinants of Phenotypes using Whole genome Sequencing Data", submitted to PLoS special edition