Scripts used in GBS Pipeline

UGbS-Flex pipeline: Usage

The scripts used in the pipeline are given below.

PERJ.pl (written by Xuewen Wang)

Joins forward and reverse reads from paired-end data. See manual for further information.

EL.1.4.2.py (written by Peng Qi)

This script is applied to the output of the 'flash' program. 'Flash' merges overlapping paired-end reads and places non-overlapping forward and reverse reads in the files Solo.1.fq and Solo.2.fq, respectively.

1. Removes reads from Solo.1.fq and Solo.2.fq that are shorter than a given size; if a forward read is removed, the reverse read is also removed.

2. Reverse complements the reverse reads.

3. Artificially joins the reverse complemented reverse reads to the 3' end of forward reads.

4. Adds As (Ns would be considered low quality bases) at the end of overlapping paired-end reads that were merged using 'flash' to make them the same length as the artificially joined reads.
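
The four steps above can be sketched as follows. This is a minimal illustration, not the code of EL.1.4.2.py: the function names are hypothetical, only the sequence line is handled (the quality strings are ignored), and it assumes that a pair is dropped when either mate is shorter than the cutoff.

```python
# Hypothetical helpers illustrating the four steps; not taken from EL.1.4.2.py.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Step 2: reverse complement a read."""
    return seq.translate(COMPLEMENT)[::-1]

def join_pair(forward, reverse, min_len):
    """Steps 1 and 3: drop the pair if either mate is too short (assumption),
    otherwise append the reverse-complemented reverse read to the 3' end
    of the forward read."""
    if len(forward) < min_len or len(reverse) < min_len:
        return None
    return forward + reverse_complement(reverse)

def pad_merged(merged, target_len):
    """Step 4: pad a 'flash'-merged read with As up to target_len."""
    return merged + "A" * max(0, target_len - len(merged))
```

For example, join_pair("ACGT", "AACG", 4) joins the forward read to the reverse complement of "AACG", and pad_merged("ACGT", 8) appends four As.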

FCT.pl (written by Debkanta Chakraborty)

This script is applied to the output of 'cstacks' (Catchen et al. 2011).

Selects the tags generated by 'cstacks' that are uniquely present in a specified number of accessions. Only consensus tags that consist of at most one 'ustacks' tag per accession will be selected.

Usage: perl FCT.pl batch_1.catalog.tags.tsv N Z

N: minimum number of accessions that have contributed to a 'cstacks' tag

Z: maximum number of accessions that have contributed to a 'cstacks' tag (typically this will be the total number of samples analyzed by 'ustacks').
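
The selection rule can be sketched as follows. This is an illustration only, not a port of FCT.pl: it assumes that the contributing 'ustacks' tags of each catalog tag have already been read from the catalog file as a comma-separated string of sample_locus identifiers (for example "1_5,2_7"), and the function name select_tags is hypothetical.

```python
from collections import Counter

def select_tags(catalog, n_min, z_max):
    """catalog: list of (tag_id, members) pairs, where members is a
    comma-separated string of 'sample_locus' identifiers (assumed layout).
    Keeps tags with at most one 'ustacks' tag per accession and between
    n_min and z_max contributing accessions."""
    selected = []
    for tag_id, members in catalog:
        per_sample = Counter(m.split("_")[0] for m in members.split(","))
        one_per_accession = all(count == 1 for count in per_sample.values())
        if one_per_accession and n_min <= len(per_sample) <= z_max:
            selected.append(tag_id)
    return selected
```

With N = 2 and Z = 3, a tag built from "1_5,2_7,3_9" would be kept, while a tag built from "1_6,1_8,2_3" would be rejected because accession 1 contributes two 'ustacks' tags.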

Ref_98_v1.1 (written by Peng Qi)

Filters the reference to remove tags with 98% similarity.

ASustacks.py (written by Peng Qi)

This script generates an artificial .fastq file from a 'ustacks' .tags.tsv file. This is done for all samples. 'ustacks' (instead of 'cstacks') can then be used to generate stacks across samples. The 'ASustacks' approach requires less memory, runs faster and identifies more stacks common to multiple samples than 'cstacks' when read depth is high.
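
As a rough illustration of what an artificial .fastq record looks like, the sketch below turns tag sequences into FASTQ text. It is not the code of ASustacks.py: parsing of the .tags.tsv file is left out, the function name fastq_records is hypothetical, and the constant quality character "I" is an assumption.

```python
def fastq_records(tags, quality_char="I"):
    """tags: iterable of (name, sequence) pairs, assumed to have been
    extracted from a .tags.tsv file beforehand. Returns FASTQ text with a
    constant, artificial quality string for every record."""
    lines = []
    for name, seq in tags:
        lines.extend(["@" + name, seq, "+", quality_char * len(seq)])
    return "\n".join(lines) + "\n" if lines else ""
```

Each record consists of the usual four FASTQ lines: header, sequence, separator and a quality string of the same length as the sequence.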

ASustacks_ref.py (written by Peng Qi)

This script generates a GBS reference from the 'ustacks' .tags.tsv output.

SNPs_ISL.pl (written by Debkanta Chakraborty)

This script needs to be run before the script 'Rm_adj_SNPs.pl' is run.

Usage: perl SNPs_ISL.pl input_file.vcf

The following output is generated: “more_than_one_SNPs_for_same_chromosome_number_in_same_line.vcf”

Rm_adj_SNPs.pl (written by Debkanta Chakraborty)

This script removes adjacent SNPs, which are more likely to result from misalignments, from the .vcf file generated by GATK. The script 'SNPs_ISL.pl' needs to be run to prepare the correct file format.

Usage: perl Rm_adj_SNPs.pl more_than_one_SNPs_for_same_chromosome_number_in_same_line.vcf
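
The idea of removing adjacent SNPs can be illustrated as below. This sketch is not derived from Rm_adj_SNPs.pl: it assumes that "adjacent" means consecutive positions on the same chromosome and that every SNP in such a run is discarded, and it works on (chromosome, position) pairs rather than on a .vcf file. The function name is hypothetical.

```python
def remove_adjacent_snps(snps):
    """snps: list of (chrom, pos) tuples. Returns the SNPs that have no
    other SNP at pos - 1 or pos + 1 on the same chromosome (assumed
    definition of 'adjacent')."""
    positions = set(snps)
    return [
        (chrom, pos)
        for chrom, pos in snps
        if (chrom, pos - 1) not in positions and (chrom, pos + 1) not in positions
    ]
```

Under these assumptions, SNPs at positions 10 and 11 of the same chromosome are both removed, while an isolated SNP at position 50 is retained.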

SNP_genotyper.v3.0.py (written by Peng Qi)

1. Scores the SNPs as A, B, H, C and D using a GATK .vcf file (either raw .vcf or filtered .vcf) as input. Only SNPs with a given read depth will be scored.

2. Consolidates all SNPs within a GBS tag (if a GBS reference is used) or within a physical distance of 500 bp (if a genome assembly is used as reference), and gives the resulting consolidated SNP a unique name consisting of ‘Tag_’ plus a number. The output is a tab-delimited text file (*.txt) that can be opened with Excel.
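
The grouping by physical distance in step 2 can be sketched as follows. This is one possible reading, not the logic of SNP_genotyper.v3.0.py: it assumes that a group contains all SNPs within 500 bp of the first SNP in that group, and it does not show how the genotype scores of the grouped SNPs are merged. The function name consolidate is hypothetical.

```python
def consolidate(snps, max_dist=500):
    """snps: list of (chrom, pos) tuples. Groups SNPs on the same chromosome
    that lie within max_dist bp of the first SNP in the group (assumed
    window definition) and names each group 'Tag_' plus a number."""
    groups = {}
    tag_no = 0
    start = None  # (chrom, pos) of the first SNP in the current group
    for chrom, pos in sorted(snps):
        if start is None or chrom != start[0] or pos - start[1] > max_dist:
            tag_no += 1
            start = (chrom, pos)
        groups.setdefault(f"Tag_{tag_no}", []).append((chrom, pos))
    return groups
```

With this window definition, SNPs at positions 100 and 400 on one chromosome share a name, whereas a SNP at position 700 starts a new group because it lies more than 500 bp from position 100.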

SNP_genotyper_only.py (written by Peng Qi)

Same as SNP_genotyper.py, but without SNP consolidation. Scores the SNPs as A, B, H, C and D using a GATK .vcf file as input. Only SNPs with a given read depth will be scored.

SNP_selectByParents.py (written by Peng Qi)

Removes all SNPs that are not homozygous for different alleles in the parents (only applies to some biparental mapping populations). This script is run on the output of 'SNP_genotyper.py'.
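
A minimal sketch of this filter is given below. It assumes that the scores A and B denote the two homozygous classes and that the genotype table has already been loaded into a dictionary; the function name and the data layout are hypothetical and not taken from SNP_selectByParents.py.

```python
def select_by_parents(table, parent1, parent2):
    """table: dict mapping SNP name -> dict mapping sample name -> score.
    Keeps SNPs for which one parent is scored 'A' and the other 'B'
    (assumed to be the two homozygous classes)."""
    kept = {}
    for snp, scores in table.items():
        if {scores.get(parent1), scores.get(parent2)} == {"A", "B"}:
            kept[snp] = scores
    return kept
```

A SNP for which either parent is heterozygous, unscored, or carries the same allele as the other parent is dropped.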

SNP_cosegregation.py (written by Peng Qi)

Retains only the SNP with the most complete genotypic information if two or more SNPs are cosegregating. This script can be applied after ‘SNP_genotyper.py’, or it can be run independently from ‘SNP_genotyper.py’.
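
One simplified way to express this filter is shown below. It is an illustration under stated assumptions, not the algorithm of SNP_cosegregation.py: two SNPs are treated as cosegregating when their scores agree in every sample where both are scored, missing data are coded as "-", and SNPs are processed from most to least complete. All names are hypothetical.

```python
def cosegregate(a, b, missing="-"):
    """True if two score strings agree wherever both are scored."""
    return all(x == y for x, y in zip(a, b) if x != missing and y != missing)

def filter_cosegregating(snps, missing="-"):
    """snps: dict mapping SNP name -> string of scores, one per sample.
    Visits SNPs from most to least complete and keeps a SNP only if it does
    not cosegregate with an already kept SNP."""
    ranked = sorted(snps, key=lambda name: sum(s == missing for s in snps[name]))
    kept = []
    for name in ranked:
        if not any(cosegregate(snps[name], snps[k], missing) for k in kept):
            kept.append(name)
    return kept
```

Given the scores "AB-H" and "ABAH" for two SNPs, only the second would be retained because it carries no missing data.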

Example data set

An example data set can be downloaded here. Follow the commands in the boxes in the document 'UGbS-Flex pipeline: Usage' to run the pipeline with the example data set.
