Projects in Complex Trait Genetics

Ben Voight January 11, 2024 February 09, 2026

Discovery and Analysis: Type-2 Diabetes and its Complications

I have a longstanding interest in dissecting the genetic basis of type 2 diabetes (T2D). A great deal of progress has been made in characterizing the basis of the disease, due in no small part to the advent of genome-wide association studies (GWAS) and the scale at which data can be collected. My efforts over my career thus far have contributed to data collection and analysis that have resulted in the discovery of hundreds of associated loci to date. While work continues to expand discovery of new loci for the disease, especially towards characterizing genetic liability across populations, a great deal of work still remains “post” genome-wide association study. My fundamental drive is to identify genes and pathways that are therapeutically actionable, to expand the collection of treatments available, perhaps personally tailored to the sub-structure of genetic risk to T2D that an individual might carry. I believe genomics data collected in humans is central to that effort: identifying relevant leads, the therapeutic direction of effect, the ‘horsepower’ of specific human genetic targets, and so forth. As such, there is a range of GWAS and post-GWAS activities of active interest in the lab, with projects invariably collaborative and large-scale, conducted with the T2D-GGI and MVP consortia. To enumerate a few areas of interest:

Locus Discovery. Our most recent work reported the assembly of >2.5M participants for the largest multi-ancestry GWAS of T2D to date, which led to 145 new loci for the disease and brought the number of distinct loci to 611 and signals to 1,289. However, this still did not bring the complete collection of data to bear. Data collection is ongoing with All of Us and MVP, and potentially with additional cohorts to cover population groups that were not as well represented in this effort as would have been desired (e.g., from South Asia).
Signal Fine Mapping. Mapping causal variants is directly relevant to causal gene identification, understanding mechanisms, and disease progression. Signal fine mapping is the appropriate statistical method to identify multiple, statistically independent signals associated with disease, and is important as it: (i) extends the heritability explained by genetics, allowing us to deduce the relevance of causal variants shared across (but also potentially unique to) ancestry groups, (ii) allows us to connect association signals to additional metabolic and progression traits directly, and (iii) provides additional leads to connect to a putative causal nearby gene. In our initial efforts for T2D-GGI, we applied a fairly coarse (heuristic, but not quantitative) approach to capture putatively independent signals of association at each distinct T2D-associated locus. However, the toolkit of available approaches and data sets to deeply characterize the full spectrum of allelic heterogeneity has never been better. I am particularly interested in applying these tools in T2D-GGI, bringing together true ‘in sample’ LD reference panels for large biobanks with the summary data they contributed to meta-analysis to attack this problem.
Causal inference. Within the T2D-GGI, collaborations are ongoing to use the framework of Mendelian randomization (MR), QTL data, and multi-ancestry information to triangulate on variants and their predicted effects on nearby genes, and to identify the direction of effect and the ‘horsepower’ that a given perturbation of that gene might be expected to have on T2D risk. In addition, I am interested in understanding non-linear effects of established causal risk factors and lifestyle factors in the context of T2D genetic liability. These questions can be addressed with the framework of MR, but require individual-level data, which is available in MVP. Use of these data could further facilitate exploration of the role of causal risk factors for T2D across populations, and relate those risk factors to risk of complications of T2D.
Colocalization: Variant-to-Gene Mapping and multi-trait physiology. It is highly desirable to connect non-coding variants, which account for the lion’s share of tagged association at T2D risk loci, with a cognate, putatively relevant gene in cis. This can be achieved computationally through variant-to-gene mapping approaches, e.g., using variation associated with change in transcript abundance (eQTLs) in relevant tissues or cell types (e.g., beta-cells of pancreatic islets). We are interested in attacking the variant-to-gene mapping problem generally, through analysis of existing data as well as by generating new QTL data sets for this purpose, and also in the context of credible sets identified from signal fine-mapping efforts. In addition, we are interested in attaching the complete spectrum of physiologic associations to each signal of association. This approach can (i) help us understand the traits that might be expected to be perturbed (which may be beneficial or antagonistic to therapeutic indications), (ii) serve as a vehicle to help characterize pleiotropy generally in the genome, and (iii) potentially provide leads for the tissue or cell type of action that may be physiologically implicated at the associated signal.
Genetic Sub-Typing via Trait Clustering. Our recent work, consistent with reports from others, shows that it is possible to group T2D signals by the cardiometabolic and anthropometric trait associations at those loci. These clusters appear to map onto distinctive functional genomic and tissue groupings, e.g., beta-cell biology vs. adipose / obesity biology. In our paper, we demonstrated initial evidence that these sub-types may also associate differently with risk of T2D complications, though clearly more work to fuel stratification is necessary. In additional collaborative work within the T2D-GGI, we are exploring and contrasting different approaches to clustering, and are interested in evaluating new approaches fueled by additional data or new ways to approach the problem (e.g., colocalization or fine-mapping). The objective is to develop and identify sub-types that predict downstream complications (e.g., CAD, kidney disease, retinopathy, etc.), which could be transformative if successful.
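
As a minimal illustration of the MR framework mentioned under Causal inference above, the sketch below computes a fixed-effect inverse-variance weighted (IVW) estimate from per-variant summary statistics. It is not the lab's or the consortium's pipeline: the function name, argument names, and the toy numbers are hypothetical, and the estimator assumes independent instruments, no directional pleiotropy, and negligible error in the variant-exposure effects.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW Mendelian randomization estimate (illustrative only).

    Each argument is a sequence with one entry per genetic instrument:
    the variant-exposure effect, the variant-outcome effect, and the
    standard error of the variant-outcome effect.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    if len(beta_exposure) == 0:
        raise ValueError("at least one instrument is required")

    # Weighted regression of outcome effects on exposure effects through
    # the origin, with weights 1 / se_outcome^2.
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    if denominator == 0:
        raise ValueError("instruments have no effect on the exposure")

    estimate = numerator / denominator
    standard_error = math.sqrt(1.0 / denominator)
    return estimate, standard_error


# Toy, made-up summary statistics for three hypothetical instruments.
toy_beta_exposure = [0.1, 0.2, 0.3]
toy_beta_outcome = [0.05, 0.10, 0.15]
toy_se_outcome = [0.01, 0.01, 0.01]

causal_effect, causal_se = ivw_estimate(
    toy_beta_exposure, toy_beta_outcome, toy_se_outcome
)
print(f"IVW estimate: {causal_effect:.3f} (SE {causal_se:.3f})")
```

The non-linear MR questions described above cannot be answered from summary statistics in this way; as noted, they require individual-level data.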

Discovery and Analysis: Metabolic Dysfunction-Associated Steatotic Liver Disease and its Progression

Through involvement with MVP, I have recently become interested in dissecting the genetic basis of metabolic dysfunction-associated steatotic liver disease (MASLD, previously referred to as NAFLD) and its progression. Our efforts involve not just locus discovery for the trait (or proxy) and progression phenotypes, but also post-GWAS analyses, and are a collaborative activity with many investigators at Penn and the VA, including Kyong-Mi Chang and Marijana Vujkovic, among many others.

Locus Discovery. Following on from our previous work, efforts continue to map new susceptibility loci for the disease, or for related traits like liver cancer or cirrhosis. As data sets continue to grow, we are interested in aggregating them for novel discovery as well as for downstream analyses.
Signal fine mapping. Mapping causal variants is directly relevant to causal gene identification, understanding mechanisms, and disease progression. Signal fine mapping is the appropriate statistical method to identify multiple, statistically independent signals associated with disease. These signals allow us to extend the heritability explained as well as find new targets for variant-to-gene mapping. As with T2D, I am particularly interested in applying novel tools in MVP to deeply characterize the full spectrum of allelic heterogeneity; the available approaches and data sets for doing so have never been better. The collection of liver disease and progression traits is an opportune target for these efforts.
Multi-Trait mapping. By its new definition, MASLD carries with it at least one cardiometabolic trait association. I am interested in characterizing the trait physiology perturbed at loci for MASLD and its progression traits, identifying loci that are associated with lean MASLD, and performing multi-trait GWAS to expand the catalog of relevant associations that could be targets for functional studies beyond our primary scan.
Functional validation of MASLD Loci in the Zebrafish model system. Of established MASLD associations, a few transcripts identified through GWAS have been assigned causality for relevant phenotypic traits, such as liver fat, injury, or fibrosis/cirrhosis. These transcripts harbor variants with positive or negative risk associations that act through loss- and gain-of-function mechanisms in either hepatocytes or communicating cell types (e.g., immune cells). But as with other complex diseases, many of the leading variants associated with MASLD are non-coding and do not demarcate a clear candidate gene. While computational approaches can narrow down the list of potential candidates, genes nominated in this way still require functional validation to demonstrate a phenotypic effect, direction of effect, and underlying mechanisms. Because we have dozens of MASLD candidate genes to evaluate, we need an in vivo model system that offers tractable throughput, is scalable, and yet retains phenotypic depth and relevance. As such, we have embarked on a collaboration with the den Hoed Lab (Uppsala) to use zebrafish as a model system for validation. Zebrafish have a well-annotated genome with orthologues of at least 71% of human genes, or 82% of human disease-associated genes. Our results so far show strong directional consistency in effects on liver fat across zebrafish larvae and humans or mice for genes previously implicated through pLOF variants.
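
To make the idea of a fine-mapped credible set concrete, the sketch below uses Wakefield's approximate Bayes factor under the simplifying assumption of exactly one causal variant at a locus and equal prior probability for every variant. This is deliberately simpler than the multi-signal methods discussed under Signal fine mapping above (it ignores LD and allelic heterogeneity), and the function names, the prior standard deviation of 0.2, and the variant labels and effect sizes are all hypothetical.

```python
import math


def log_abf(beta, se, prior_sd=0.2):
    """Log of Wakefield's approximate Bayes factor in favor of association.

    prior_sd is an assumed prior standard deviation of the true effect.
    """
    v = se ** 2          # sampling variance of the estimated effect
    w = prior_sd ** 2    # prior variance of the true effect
    r = w / (v + w)
    z = beta / se
    return 0.5 * math.log(1.0 - r) + 0.5 * r * z * z


def credible_set(variants, coverage=0.95, prior_sd=0.2):
    """Posterior probabilities and a credible set under one causal variant.

    variants: list of (variant_id, beta, se) tuples for a single locus.
    Returns (ids in the credible set, dict of posterior probabilities).
    """
    log_bfs = [log_abf(beta, se, prior_sd) for _, beta, se in variants]
    # Subtract the maximum before exponentiating for numerical stability.
    max_log_bf = max(log_bfs)
    weights = [math.exp(lbf - max_log_bf) for lbf in log_bfs]
    total = sum(weights)
    posterior = {
        variant_id: weight / total
        for (variant_id, _, _), weight in zip(variants, weights)
    }

    # Add variants in decreasing order of posterior probability until the
    # cumulative probability reaches the requested coverage.
    ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for variant_id, probability in ranked:
        selected.append(variant_id)
        cumulative += probability
        if cumulative >= coverage:
            break
    return selected, posterior


# Made-up summary statistics for three hypothetical variants at one locus.
toy_locus = [
    ("variant_A", 0.10, 0.01),
    ("variant_B", 0.02, 0.01),
    ("variant_C", 0.00, 0.01),
]
selected_ids, posterior_probs = credible_set(toy_locus)
print(selected_ids)
```

Loci with several independent signals, or analyses spanning ancestry groups, need methods that model LD explicitly, which is where in-sample LD reference panels become important.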

Quantitative Trait Locus Mapping Activities

We have recently expanded our interest and expertise in performing genetic mapping of molecular phenotypes, i.e., quantitative trait locus (QTL) discovery. Previously, we performed scans in cell models for transcript abundance (eQTLs) or proportion of alternatively spliced isoforms (sQTLs), and collaborated with local investigators to study primary tissues, including kidney (with Katalin Susztak). More recently, we have become involved in several projects that utilize primary tissue in bulk and at single-cell resolution, and have expanded to chromatin accessibility QTL (caQTL) mapping. Active projects include:

Single-cell eQTL and caQTL mapping for islet-cell types. In collaboration with the Gaulton Lab (UCSD) and Grant Lab (Penn/CHOP), and led by a shared graduate student (Mary Ann Weidekamp), we are using single-cell data generated by the Human Pancreas Analysis Program, which is collecting multi-omic readouts of pancreatic tissue obtained from deceased donors. These data include scRNA-Seq and scATAC-Seq and, more recently, multiome RNA/ATAC-seq readouts. Our initial focus has been on mapping eQTLs in alpha, beta, and delta-cell populations, as well as in additional cell types captured in this experiment (e.g., acinar cells).
QTL mapping of B/T cells in a pediatric cohort. In collaboration with the Romberg Lab (CHOP), we are engaged in eQTL mapping in matched immune cell types (T cell, B cell) obtained from tonsil tissues extracted from pediatric cohorts, under baseline and stimulated conditions. These data are particularly novel, as they characterize immune cell populations from within a tissue context (not circulating in blood, i.e., PBMCs), and are derived from a pediatric cohort (age <18). Our initial efforts in >100 subjects have mapped ~14,000 eGenes in basal and stimulated states, including >1,700 not previously reported either in large-scale data sets (i.e., GTEx or eQTLGen) or in catalogs of eQTLs from immune cell type data obtained from PBMCs (i.e., the DICE consortium).
caQTL mapping in Bulk Liver. Led by a graduate student (Bradon Wenz) during his time working with Casey Brown, we are focused on the analysis of ~180 primary bulk liver samples that he profiled using chromatin accessibility mapping via ATAC-Seq. The result is a massive map of regions of accessible chromatin in bulk liver tissue as well as a catalog of caQTLs. We are working together to finalize the analyses for these data.
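
At its core, a QTL scan of the kind described above tests whether genotype at a variant predicts a molecular phenotype such as transcript abundance or chromatin accessibility. The sketch below shows that single test as an ordinary least-squares regression of a phenotype on allele dosage. It is only an illustration: real analyses adjust for covariates, normalize the phenotype, and correct for multiple testing, none of which is shown here, and the function name and toy values are hypothetical.

```python
import math


def qtl_association(dosages, phenotype):
    """Regress a molecular phenotype on allele dosage (0, 1, or 2).

    Returns (slope, standard error, t statistic) for the additive effect.
    Covariates and multiple-testing correction are intentionally omitted.
    """
    n = len(dosages)
    if n != len(phenotype):
        raise ValueError("dosages and phenotype must have the same length")
    if n < 3:
        raise ValueError("at least three samples are required")

    mean_dosage = sum(dosages) / n
    mean_phenotype = sum(phenotype) / n
    sxx = sum((g - mean_dosage) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum(
        (g - mean_dosage) * (y - mean_phenotype)
        for g, y in zip(dosages, phenotype)
    )

    slope = sxy / sxx
    intercept = mean_phenotype - slope * mean_dosage
    residual_ss = sum(
        (y - intercept - slope * g) ** 2 for g, y in zip(dosages, phenotype)
    )
    # Residual variance uses n - 2 degrees of freedom (intercept and slope).
    standard_error = math.sqrt(residual_ss / (n - 2) / sxx)
    t_statistic = slope / standard_error if standard_error > 0 else math.inf
    return slope, standard_error, t_statistic


# Made-up dosages and normalized expression values for six samples.
toy_dosages = [0, 0, 1, 1, 2, 2]
toy_expression = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
effect, effect_se, t_stat = qtl_association(toy_dosages, toy_expression)
print(f"slope={effect:.3f}, se={effect_se:.3f}, t={t_stat:.2f}")
```

A gene or accessible region is typically counted as having a QTL only after this test has been repeated across all nearby variants and the results corrected for the number of tests.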

Additional Collaborative Projects

We also have a number of ongoing projects that generally involve collaborations or shared graduate students with another lab whose expertise complements our own:

Fine-mapping and functional validation in Bone-mineral Density. Genetic factors strongly influence bone mineral density (BMD) and osteoporosis risk over time. To date, genome-wide association studies (GWAS) have discovered over 1,100 independent BMD signals; however, the causal variant(s) and respective effector gene(s) for the majority of these loci remain unknown. In collaboration with the Grant Lab, we previously published the results of a 3D genomics-based ‘variant-to-gene’ mapping method for BMD loci based on ATAC-seq, RNA-seq, and chromatin conformation capture data from hMSC-derived osteoblasts. This approach links putative causal variants in open chromatin to candidate effector genes via chromatin interactions with open promoters. Led by a graduate student (Mitch Conery) co-advised by Struan and me, we are applying it to a dataset derived from the human fetal osteoblast 1.19 cell line (hFOB). For validation of both sets of findings at scale, we conducted a CRISPRi screen in hFOBs differentiated for five days. In addition, Mitch is performing multi-trait signal fine-mapping for these signals across a range of related traits, to inform further prioritization efforts and potentially link BMD signals to novel biological and physiological mechanisms.
Genome-wide, phenome-wide Association studies in MVP. I was fortunate to join colleagues nationwide to report a genome-wide, phenome-wide association study across MVP participants. The scope and volume of computation required to complete this project called for collaborations with the DOE and supercomputer access with modified, efficient implementations of analysis tools, leading to new collaborations and inference at a scale that had not previously been reported. There, across >2,000 traits in >630K MVP participants, we reported >38,000 independent variants associated with one or more traits, and performed fine-mapping that resolved 6,318 signals from 613 traits to single-variant resolution. Future work extends in this vein to involve X-chromosome analysis and expanded fine-mapping using multi-ancestry aware tools like SuSiEx, meSuSiE, mJAM, or flashfm.