Researchers have access to new data on thousands of exomes. Here’s what we found.

Click here to see an interactive, searchable interface for the results described below.

At Helix, we believe that the best genetic insights come from deep dives into a person’s DNA. That’s why we’ve designed and implemented a proprietary Exome sequencing assay (Exome+) to capture both the medical exome and a comprehensive sequencing backbone, to go beyond microarray genotyping chip platforms that are commonly in use today. Combining the breadth of the medical exome with this sequencing backbone allows us to capture common variants that are found in large numbers of people, while also revealing many of the rare (and sometimes novel) protein-coding variants that make each of us unique.
This month, the UK Biobank released a data set containing whole exome sequences from 50,000 volunteers, each of whom has additionally provided information regarding thousands of traits (phenotypes) and donated their information to the research community. This was a big deal for genomics, and the first time that such a rich resource of exome and matched phenotype data has been available to researchers around the world. It provides huge opportunities for genetic discovery, especially for rare protein-coding variants.
Naturally, as one of the world’s largest producers of clinical exomes and an approved UK Biobank research organization, we dove in. This is the first blog (of many!) describing our work analyzing and comparing these data to the data we’ve gathered from our own research-consented customers. We’re kicking things off by highlighting a few of our results. We’ll be adding new blog entries and updating our findings as we uncover many more new insights.

Gene-based collapsing analysis

Starting with 1,317 UK Biobank phenotypes, we began with a gene-based collapsing analysis, an analytical technique tailored for studying rare variants. This is different than the analysis method known as genome-wide association study (GWAS), a standard method that is used with microarrays but performs poorly on rare variants found in the exome. If a variant is only ever seen in one or a few people, then it is difficult to identify a statistically significant association when comparing to many thousands of other people, as is done with GWAS. In contrast, because a single rare variant can have a high functional impact in an individual, grouping it together into a “gene unit” with single variants from other individuals is a way to gain enough statistical power to see a signal through the noise.

Gene-based collapsing analysis. A) First, variants in each gene are identified by sequencing in cases and controls. B) Variants that are predicted to be damaging—those that are rare and annotated as likely to affect the functionality of the gene, such as coding variants—are then selected for analysis, while variants that are common or do not have damaging annotations are excluded from further analysis. The exact parameters used to select variants can be flexible and tailored to each study. C) Finally, the number of cases with a qualifying variant in each gene is compared to the number of controls with a qualifying variant. This comparison of cases to controls produces one statistical result per gene instead of one per variant. (Image credit: Alex Buckley)

This is exactly the approach we took in our first perusal of the UK Biobank Exome data. Each person was coded as a 1 if they had a qualifying rare coding variant in a gene, and a 0 if they did not. Overall, we identified 66 associations in 42 genes with 40 phenotypes. On average, the population frequency for these identified variants was 0.005%.

Selected findings

1. Rare carriers of GP1BB variants affect mean platelet volume

We identified numerous associations with different blood and platelet phenotypes. Many of these associations make sense given what is already known about the biology of the genes involved. For example, GP1BB variants have been implicated in bleeding and platelet disorders (Kunishima et al 1997). However, our analysis shows, for the first time, that rare coding variants in this gene are associated with higher mean platelet volumes in the general population.

Shown is the mean platelet volume of the general population in the UK Biobank (non-carrier) and the individuals who carry rare coding variants in GP1BB (carrier). (Image credit: Alex Buckley)

2. Loss of function variants in SMAD6 associate with vision phenotypes

Some findings provide completely novel insights, such as the association of loss-of-function variants in SMAD6 with eye measurements. SMAD6 is a member of the SMAD family of signal transducers, inhibitors of BMP signaling. While variants in this gene have been recognized to increase susceptibility to craniosynostosis and aortic valve phenotypes (OMIM:602931), its suggested role in human eye development is novel. Support for this hypothesis includes the known role of BMP signaling pathways in many human development processes (reviewed in Wang et al 2014), including eye development, and a recent finding that the SMAD6 mouse homolog, Smad6, is essential for blood vessel function in the developing mouse retina (Wylie et al 2018).

Shown is the 6mm weak meridian of the general population in the UK Biobank (non-carrier) and the individuals who carry rare coding variants in SMAD6 (carrier). (Image credit: Alex Buckley)

3. TYRP1 variants and Blonde Hair

One of our novel findings is the association of TYRP1 variants with blonde hair color in those of British ancestry. TYRP1 made a big splash in 2012 when researchers found that a variant in this gene caused blonde hair in dark-skinned individuals of Melanesian ancestry from the Solomon Islands. This genetic variation is separate from the ones that are known to cause blonde hair in those of European ancestry and, until now, there has been no evidence that this gene also played a role in European hair coloring. The specific variant that causes blonde hair in Solomon Islanders, rs387907171, an arginine to cysteine substitution at amino acid position 93, was only found in 3 of the 40,648 individuals analyzed in our UK Biobank study. However, more than 30 other rare coding variants in this gene were found in 1% of the 4,671 British ancestry blonde individuals. Previous studies have shown that the Solomon Island variant is recessive, meaning that people require two copies of it to have blonde hair. In contrast, the majority of UK Biobank carriers did not have two rare coding variants in this gene, and further studies will be needed to characterize what combination of genetic variants led to blonde hair in these individuals. Importantly, the association between rare variants in this gene and hair color in those of European ancestry has not been previously identified because of the insufficiency of microarray genotyping chips for identifying these types of rare variant associations.

4. A launchpad for biological study

We identified several additional strongly implicated signals that have plausible explanations for their biological significance. These associations provide numerous opportunities for further research to test the validity of the findings and uncover their specific functional role in each trait’s biology.
As expected, LDLR variants were associated with high cholesterol in this study. This is a well-characterized gene with known variant associations to this quantitative trait, the hallmark of the most common mendelian disorder, Familial Hypercholesterolemia (FH). We found a variety of coding variants, some known, others novel, potentially expanding the causative variants that could be used for FH diagnosis.
Among the interesting novel associations, our study implicates numerous rare variants in what may be one of the first genes specifically linked to handedness, INTS10, a gene that is most highly expressed in pituitary and thyroid tissues. We additionally identified an association between more than a dozen rare variants in TACR3, a gene that is known to cause the recessive disease hypogonadotropic hypogonadism, and age at menarche. Finally, we found more than 20 rare variants in TNPO3, a gene previously implicated in limb girdle muscular dystrophy, to be associated with a higher incidence of self-reported stillbirths and miscarriages.

The first of many forays into a rich dataset

This analysis is one of the first large-scale analyses of rare variants in diverse phenotypes in the general population. While many of these associations could be inferred from previous research, no study has quantified their relationship with the traits in this manner. What’s especially interesting is that the vast majority of these associations are due to the combined effects of multiple rare variants, and their associations would not have been picked up with an analysis in which each variant is analyzed individually.
There are many other ways to group rare variants together, and we will be presenting these in future posts. We’ll also be exploring, together with our research partners such as the Healthy Nevada Project, how these results replicate and/or can be improved in additional populations. In addition, we’ll use these findings in combination with the common variants found with our Exome+ assay to create novel polygenic scores to account for contributions from both rare and common variants. In time, this means better and more meaningful insights in the apps from which our partners, and ultimately our customers, will benefit.
In the meantime, here is a summary of the 66 associations that achieved a statistical significance of p<1×10-6. To contribute to the scientific community, we provide the results of our analysis for download here, and technical details of the analysis below. If you want to explore the dataset yourself in more detail, we’ve created a searchable interface over the results. Comments or questions can be sent to

Blood ▼

Gene Trait(s) p-value Freq Model TUBB1 Platelet distribution width; Platelet count; Mean platelet volume; Platelet crit 4.70E-40 0.29% LoF; coding JAK2 Platelet crit; Platelet count 3.20E-19 0.34% coding COL4A4 Unspecified haematuria (ICD10: R31) 5.90E-15 0.45% coding KLF1 Red blood cell (erythrocyte) distribution width; Mean corpuscular haemoglobin; Mean corpuscular volume 1.40E-14 0.03% LoF; coding GP9 Mean platelet (thrombocyte) volume ; platelet count 9.80E-13 0.11% coding MPL Platelet crit; Platelet count 1.80E-12 0.31% coding IQGAP2 Mean platelet (thrombocyte) volume 2.10E-12 0.20% LoF; coding GP1BB Mean platelet (thrombocyte) volume ; platelet count 1.00E-09 0.04% coding TMPRSS6 Mean corpuscular haemoglobin; Mean corpuscular volume 2.90E-09 0.38% coding ITGA2B Platelet count 4.60E-09 0.41% coding ASXL1 Red blood cell (erythrocyte) distribution width 6.30E-08 0.12% LoF SEC23B Red blood cell (erythrocyte) distribution width 5.20E-07 0.32% coding CXCR2 Neutrophill count 5.50E-07 0.08% coding IFRD2 High light scatter reticulocyte count, percentage 6.40E-07 0.33% coding GRK6 Neutrophill percentage 8.80E-07 0.18% coding GFI1B Mean platelet (thrombocyte) volume 9.00E-07 0.14% coding CAD Mean reticulocyte volume 9.90E-07 0.01% LoF

Hair Color ▶

Gene Trait(s) p-value Freq Model SLC45A2 Hair colour (natural, before greying): Blonde / Dark brown 1.30E-23 0.22% coding TYRP1 Hair colour (natural, before greying): Blonde 9.80E-10 0.28% coding MC1R Hair colour (natural, before greying): Red 1.20E-09 0.30% coding TG Hair colour (natural, before greying): Dark brown 8.60E-07 1.30% coding

Vision ▶

Gene Trait(s) p-value Freq Model TCTN1 Age cataract diagnosed 2.10E-08 0.30% coding BDNF hypermetropia (far-sightedness) 9.20E-08 0.13% coding THBS4 6mm weak meridian (left); 6mm strong meridian (left) 9.30E-08 0.41% coding SMAD6 6mm weak meridian (right); 6mm strong meridian (right); 6mm weak meridian (left) 9.80E-08 0.05% LoF CRELD2 myopia (near-sightedness) 1.00E-07 0.07% coding CYB5D2 Age cataract diagnosed 1.30E-07 0.30% coding CHD8 cataract 5.70E-07 0.61% coding RNF31 3mm strong meridian (left) 5.90E-07 0.26% coding MFRP 6mm weak meridian (left) 9.10E-07 0.41% coding

Other ▶

Gene Trait(s) p-value Freq Model LDLR high cholesterol; taking cholesterol lowering medication 7.80E-17 0.30% coding PAPPA Height: standing, sitting 3.20E-10 0.03% LoF MYLIP Bleeding gums 2.10E-08 0.09% coding TNPO3 Ever had stillbirth, spontaneous miscarriage or termination 1.10E-07 0.12% coding TACR3 age at menarche 1.80E-07 0.13% coding TPMT Ever taken oral contraceptive pill 2.00E-07 0.24% coding ITGA11 Protein intake 2.90E-07 0.06% LoF INTS10 Handedness (chirality/laterality): Left-handed; Right-handed 3.30E-07 0.16% coding CATSPER1 Standing height 3.50E-07 0.13% coding FAM78A Mouth ulcers 4.70E-07 0.11% coding ITGA11 Energy (caloric) intake 5.30E-07 0.06% LoF PICK1 Hand grip strength (right) 5.60E-07 0.04% LoF HEPHL1 Type of tobacco previously smoked: Manufactured cigarettes 9.80E-07 0.53% coding

Detailed methods

Data processing was performed using Hail, using the FE version of the UK Biobank data (field 23160).
We performed simple gene-based collapsing analyses, where each person is coded as a 1 if they have one or more qualifying rare coding variants in a gene and a 0 if they do not, regardless of zygosity. This differs from a burden test where the total number of rare variants per person per gene is considered. We defined “qualifying” as coding (stop_lost, missense_variant, start_lost, splice_donor_variant, inframe_deletion, frameshift_variant, splice_acceptor_variant, stop_gained, or inframe_insertion) and not Polyphen or SIFT benign (Polyphen benign is <0.15, SIFT benign is >0.05). We used a MAF cutoff of 0.1%. We also ran a loss of function (LoF) model that only included LoF variants (stop_lost, start_lost, splice_donor_variant, frameshift_variant, splice_acceptor_variant, or stop_gained).
We used VEP for annotation with the Ensembl canonical transcript coding for variant consequence. Coding regions were defined according to Gencode version GENCODE 26.
To pass the MAF filter, the variant must be below that frequency cutoff in all gnomad populations as well as the strict British ancestry UK Biobank exomes (defined by UK Biobank as Genetic Ethnicity = Caucasian).
Qualifying variants were also restricted to the high-confidence regions of the genome, in this case defined by the Genome in a Bottle resource for NA12878.
We used the Neale lab modified version of PHESANT when transforming the quantitative traits to normality and breaking up the categorical traits.
Our analysis was restricted to those of strict British ancestry (n=40,468) and used BOLTLMM, which corrects for relatedness and population structure. It is meant for quantitative traits but works well for binary traits so long as there is an adequate number of cases who carry the variant to be analyzed. We used age and sex as covariates. Sex-specific traits, such as menopausal symptoms or enlarged prostate, were analyzed only in the appropriate sex.
We required at least 10 individuals to carry any qualifying variant in the gene of interest for analysis of that gene to proceed. For binary analyses, we only used phenotypes with at least 50 cases, and we set the carrier frequency cutoff so that at least 10 people would be expected to have a qualifying variant in the case (or lower-frequency) group. Given that most genes have <1% of individuals with a qualifying variant, this means that power for discovery is fairly low for conditions with fewer than 500 cases. In fact, 91 of the phenotypes analyzed had 0 genes with enough carriers of rare variants for statistical analysis to be reliably performed in the coding model. This number was even higher in the LoF model, where 714 phenotypes had 0 genes with sufficient carriers for analysis. Associations driven by fewer than this number of cases carrying qualifying variants are suspect (and plentiful), requiring larger datasets for confirmation. Even 10 carriers, accompanied by statistical significance, does not ensure a true finding, and results should be interpreted with caution when they are driven by relatively low numbers of carriers. For example, the Neale lab has suggested 25 carriers as the minimum requirement to avoid false findings. Finally, this parameter requires a more tailored cutoff when working with rare variants in rare traits. For example, a cutoff of at least 10 observed case carriers as opposed to 10 expected case carriers will make more sense in situations where both being a variant carrier and being a case are very rare events, but carriers are heavily enriched in the cases. As our analyses continue, we will adjust the parameters and methods used to identify additional associations.
Our rare variant analysis methods differed from those used in the BioRxiv preprint by Regeneron and GSK on these data, most importantly in the choice of MAF cutoff and the inclusion of a coding variant model. As a result, we overlap on some hits and differ on others. When we raised the MAF cutoff of our LoF model to 1% and relaxed the criteria for the minimum number of variant carriers, we identified the same rare variant associations presented in this previous work.
We generated QQ plots to visually check for test statistic inflation in our main analyses, which we did not find. Here are a few examples:

R31: Unspecified haematuria
(n=1,109 vs. 39,359, coding model)

x2395_3: balding pattern 3
(n=4,994 vs. 13,485, coding model)

This research has been conducted using the UK Biobank Resource under Application Number 40436.

Here’s how you can reference our work:
Cirulli, E.T. and Washington, N.L. Helix Blog, 27 Mar. 2019,