PgmNr 1225: The GCAD workflow for processing 5000 whole genomes and 11,000 whole exomes from the Alzheimer’s Disease Sequencing Project using Amazon cloud.
Authors:
Y.-F. Chou 1,2; Y.Y. Leung 1,2; O. Valladares 1,2; A.B. Kuzma 1,2; L. Cantwell 1; L. Qu 1,2; H.-J. Lin 1,2; P. Gangadharan 1,2; Y. Zhao 1,2; J. Malamon 1; Alzheimer’s Disease Sequencing Project (ADSP) 1; A. Naj 1; W.J. Salerno 3; G.D. Schellenberg 1,2; L.-S. Wang 1,2
1) Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA; 2) Institute for Biomedical Informatics, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA; 3) Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX
Late-onset Alzheimer’s disease (AD) is a devastating disease that is projected to affect 16 million Americans by 2050. Established in 2016, the Genome Center for Alzheimer’s Disease (GCAD) coordinates the integration and meta-analysis of all available AD-relevant genetic data, with the goal of identifying AD risk factors and eventual therapeutic targets. The effort is a collaboration among a diverse group of scientists within the Alzheimer’s Disease Sequencing Project (ADSP) and GCAD.
To minimize data heterogeneity and process thousands of genomes efficiently, GCAD has built an Amazon cloud-based pipeline that is 1) designed and optimized for large-scale production of WGS/WES data; 2) linked to a tracking database that can display a variety of quality metrics instantly; and 3) developed in coordination with other NIH programs, including TOPMed and the CCDG.
The pipeline takes FASTQ or BAM files as input and re-maps the reads to the hg38 genome assembly using BWA. Supplementary alignments are retained for downstream structural variant (SV) optimization. The pipeline then performs duplicate marking, base recalibration, indel realignment, and quality score binning using GATK 3.7, followed by SNV genotype calling, also using GATK. All individual gVCFs from WES and WGS are scaled and aggregated into a project-level file by joint genotype calling. The final output contains all possible sites for every genotype position, allowing for easy comparison with genomes from other sequencing projects.
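The stages described above can be sketched as a simple ordered plan. This is only an illustration of the per-sample versus project-level structure: the stage labels, the `Stage` class, and `build_plan` are hypothetical names, not identifiers from the GCAD code base. The stages are listed in the order the abstract gives them, which may differ from the order in which the pipeline actually runs them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str   # illustrative label (assumption), not a GCAD identifier
    tool: str   # tool as named in the abstract
    scope: str  # "sample" = run once per genome/exome, "project" = run once overall


# Stages in the order the abstract lists them.
STAGES: List[Stage] = [
    Stage("remap_to_hg38", "BWA", "sample"),
    Stage("mark_duplicates", "GATK 3.7", "sample"),
    Stage("recalibrate_base_qualities", "GATK 3.7", "sample"),
    Stage("realign_indels", "GATK 3.7", "sample"),
    Stage("bin_quality_scores", "GATK 3.7", "sample"),
    Stage("call_snv_genotypes_to_gvcf", "GATK", "sample"),
    Stage("joint_genotype_call", "unspecified in the abstract", "project"),
]


def build_plan(sample_ids: List[str]) -> List[Tuple[str, str]]:
    """Expand the stage list into (target, stage name) steps for a batch of samples."""
    plan: List[Tuple[str, str]] = []
    # Per-sample stages: every sample goes through the same sequence.
    for sample_id in sample_ids:
        for stage in STAGES:
            if stage.scope == "sample":
                plan.append((sample_id, stage.name))
    # Project-level stages run once, after every sample has produced its gVCF.
    for stage in STAGES:
        if stage.scope == "project":
            plan.append(("ALL_SAMPLES", stage.name))
    return plan


if __name__ == "__main__":
    for target, stage_name in build_plan(["sample_A", "sample_B"]):
        print(f"{target}\t{stage_name}")
```

Keeping the joint call as a separate project-scoped step reflects the point made in the abstract: per-sample gVCFs are produced independently and only combined at the end.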
Alongside the pipeline, we built a live statistics page backed by a database that stores >100 quality metrics for each sequenced sample. These include information on how the sequencing was performed, processing time, BAM quality metrics, and genome-wide statistics of the called SNPs and indels. On average, the workflow processes a complete genome in <30 hours. The pipeline is available through Bitbucket (https://bitbucket.org/GCAD/SNV_pipeline/) and as an Amazon Machine Image (ami-e6df35f0).
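A per-sample metrics store of the kind described above might look like the following minimal sketch. The abstract does not name the database engine or schema, so SQLite, the `sample_metric` table, the metric names (`processing_hours`, `mean_depth`), and all function names here are assumptions chosen for illustration.

```python
import sqlite3
from typing import List


def open_tracking_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical tracking database with one row per sample/metric."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS sample_metric (
            sample_id TEXT NOT NULL,
            metric    TEXT NOT NULL,
            value     REAL NOT NULL,
            PRIMARY KEY (sample_id, metric)
        )
        """
    )
    return conn


def record_metric(conn: sqlite3.Connection, sample_id: str, metric: str, value: float) -> None:
    """Insert a metric, overwriting any earlier value for the same sample and metric."""
    conn.execute(
        "INSERT OR REPLACE INTO sample_metric (sample_id, metric, value) VALUES (?, ?, ?)",
        (sample_id, metric, value),
    )
    conn.commit()


def samples_over_time_budget(conn: sqlite3.Connection, max_hours: float = 30.0) -> List[str]:
    """Return sample IDs whose recorded processing time exceeds max_hours."""
    rows = conn.execute(
        "SELECT sample_id FROM sample_metric "
        "WHERE metric = 'processing_hours' AND value > ? ORDER BY sample_id",
        (max_hours,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = open_tracking_db()
    record_metric(db, "sample_A", "processing_hours", 27.5)
    record_metric(db, "sample_B", "processing_hours", 33.0)
    record_metric(db, "sample_B", "mean_depth", 30.0)
    print(samples_over_time_budget(db))
```

A long (sample, metric, value) layout is used here only because it accommodates more than 100 metrics without schema changes; a wide table with one column per metric would serve equally well.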
To date, we have processed 5,000 WGS and 11,000 WES samples from 26 cohorts of European, Caribbean Hispanic, and African American descent, sequenced across 5 sequencing centers at a mean depth of 30x for WGS and with 75% of targeted regions at >20x coverage for WES. From the WGS data, 21M novel SNPs were discovered across the sequenced individuals, of which 23% were singletons (present in a single individual). A total of 39.6M SNPs reached the 99.8% sensitivity tranche threshold, of which 42% are novel.
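For readers who want the last percentage as an absolute count, the arithmetic can be sketched as follows. It uses only the two rounded figures quoted in the abstract, so the result (about 16.6M novel SNPs in the tranche) is approximate; the function names are illustrative.

```python
# Rounded figures quoted in the abstract.
TRANCHE_SNPS = 39_600_000  # SNPs reaching the 99.8% sensitivity tranche threshold
NOVEL_PERCENT = 42         # share of those SNPs reported as novel


def novel_in_tranche(total: int = TRANCHE_SNPS, percent: int = NOVEL_PERCENT) -> int:
    """Approximate number of novel SNPs in the tranche (integer arithmetic)."""
    return total * percent // 100


def known_in_tranche(total: int = TRANCHE_SNPS, percent: int = NOVEL_PERCENT) -> int:
    """Approximate number of previously known SNPs in the tranche."""
    return total - novel_in_tranche(total, percent)


if __name__ == "__main__":
    print(novel_in_tranche())  # 16632000
    print(known_in_tranche())  # 22968000
```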