Recent advances in next-generation sequencing (NGS) and single nucleotide polymorphism (SNP) genotyping promise to greatly accelerate crop improvement if properly deployed. High-throughput SNP genotyping offers a number of advantages over previous marker systems, including an abundance of markers, rapid processing of large populations, a variety of genotyping systems to meet different needs, and straightforward allele calling and database storage due to the bi-allelic nature of SNP markers. NGS technologies have enabled rapid whole genome sequencing, providing extensive SNP discovery pools to select informative markers for different sets of germplasm. Highly multiplexed fixed array platforms have enabled powerful approaches such as genome-wide association studies. On the other hand, routine deployment of trait-specific SNP markers requires flexible, low-cost systems for genotyping smaller numbers of SNPs across large breeding populations, using platforms such as Fluidigm’s Dynamic Arrays™, Douglas Scientific’s Array Tape™, and LGC’s automated systems for running KASP™ markers. At the same time, genotyping by sequencing (GBS) is rapidly becoming popular for low-cost high-density genome-wide scans through multiplexed sequencing. This review will discuss the range of options available to modern breeders for integrating SNP markers into their programs, whether by outsourcing to service providers or setting up in-house genotyping facilities, and will provide an example of SNP deployment for rice research and breeding as demonstrated by the Genotyping Services Lab at the International Rice Research Institute.
Over the past few decades, there has been a large investment around the world in basic plant science research, from trait characterization to functional genomics, as well as the infrastructure to collect, store, and characterize the genetic resources of important crop species. Although some of these efforts are embarked upon purely for scientific discovery, the underlying justification of many of these initiatives is that the advances in genomics and germplasm collections will prove essential to make future gains in crop improvement to feed a growing world (Delmer 2005; McCouch
This opportunity rests on two main developments: the accumulated knowledge of useful genetic diversity, genes, and QTLs, and the technical advances in sequencing, genotyping and bioinformatics that have enabled rapid, high-throughput molecular marker approaches. Since the shift to simple sequence repeat (SSR) markers around 15 years ago and subsequently to SNP markers, excellent progress has been made in characterizing the genetic diversity of major crop species, mapping QTLs for key traits, and cloning genes important for crop improvement. For major crop species such as rice, maize and wheat, there are a large number of fine-mapped and cloned genes with associated functional markers that now provide breeders with a molecular marker toolkit for transferring traits into new varieties (Lübberstedt
This review will discuss the advantages that have led to the recent shift to SNP genotyping, show how NGS has provided valuable SNP discovery data sets, review the various SNP genotyping platforms including fixed arrays, flexible low-cost approaches, and genotyping by sequencing (GBS), and finally describe the key issues to consider when integrating SNP markers into a breeding program. While these techniques are relevant to all crop species, the examples provided will focus on rice research and breeding, with a case study of the high-throughput SNP genotyping facility at the International Rice Research Institute (IRRI).
To understand the shift to single nucleotide polymorphism (SNP) markers, we must first look at the limitations of SSR markers. First, there are limited numbers of SSR motifs in the genome, which becomes a constraint when trying to saturate a region with markers or to identify gene-based markers. Second, one of the main advantages of SSRs, their high information content from multiple alleles per locus, also presents difficulties when merging SSR data from different platforms and curating allele sizes in databases. Third, gel-based SSRs are labor intensive, and automated fragment sizing systems have limited scope for multiplexing. SSR genotyping therefore quickly reaches a point where low throughput and higher cost become limiting factors, in contrast to recent SNP genotyping techniques.
The main advantages of SNP markers relate to their ease of data management along with their flexibility, speed, and cost-effectiveness. Data from bi-allelic SNP markers are straightforward to merge across groups and to compile into large databases of marker information, since there are only two alleles per locus and different genotyping platforms will provide the same allele calls once proper data QC has been performed. Although it is important to have a bioinformatics data management and curation team to convert SNP calls from different platforms to the same DNA strand, that is less challenging than trying to harmonize SSR allele sizes from different systems. With the help of a high quality reference genome, merging sequence and SNP data also enables more powerful analyses of the complete SNP catalog or “SNP universe” for each species. As the most common type of DNA polymorphism, SNPs also offer flexibility, both in the choice of SNP variants at target loci and in the large numbers of genome-wide loci available when selecting sets of informative markers for specific germplasm pools.
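The strand conversion mentioned above can be illustrated with a minimal Python sketch. It assumes that genotype calls are stored as unphased two-letter strings (for example `AG`, with `N` for a missing base) and that each assay is annotated with the strand (`+` or `-`) on which it was designed; the function names and this data layout are hypothetical and not taken from any particular platform.

```python
# Complement table for converting minus-strand calls to the forward strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def to_forward_strand(call, strand):
    """Return an unphased genotype call (e.g. 'AG') on the forward strand.

    Alleles are sorted so that 'GA' and 'AG' are stored identically.
    Unknown symbols such as 'N' are passed through unchanged.
    """
    if strand == "+":
        alleles = list(call)
    elif strand == "-":
        alleles = [COMPLEMENT.get(base, base) for base in call]
    else:
        raise ValueError("strand must be '+' or '-'")
    return "".join(sorted(alleles))


def is_strand_ambiguous(allele_a, allele_b):
    """A/T and C/G SNPs look the same on both strands, so the strand
    cannot be inferred from the alleles alone."""
    return COMPLEMENT.get(allele_a) == allele_b


# A minus-strand 'AG' call corresponds to 'CT' on the forward strand.
print(to_forward_strand("AG", "-"))
```

A/T and C/G SNPs deserve special care in such a workflow, because their alleles are identical on both strands and the design strand must come from the assay annotation rather than from the data.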
A major factor in the advantages of SNP markers for flexibility, speed and cost-effectiveness is the range of genotyping platforms available to address a variety of needs for different marker densities and costs per sample. Whereas early SNP genotyping techniques relied on gel-based methods such as cleaved amplified polymorphic sequence (CAPS) markers (Thiel
The rise of NGS has led to a flood of sequence data for most agriculturally relevant plant and animal species (Rounsley
One major challenge has been to identify and validate sets of informative genome-wide SNP loci from large sequence data sets that will function well as SNP markers. The first steps are to filter the data to ensure that the specific SNP variant in question has been observed multiple times, is single copy in the genome, and has no nearby variant that might interfere with the assay design. This can be further refined with population-based filtering across accessions to ensure that the SNP has a minor allele frequency (MAF) above a certain threshold within and between target germplasm groups, which will eliminate sequence errors and rare SNPs while maximizing the chances that the SNP markers will be polymorphic and informative (Fig. 1). Correlation between SNPs is also used to select tagging SNPs that represent all of the linkage disequilibrium (LD) blocks across the genome (Zhao
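The population-based filtering step described above can be sketched in a few lines of Python. This is only an illustration: the thresholds (`min_maf=0.05`, `min_call_rate=0.9`), the SNP identifiers, and the two-letter genotype encoding with `N` for missing data are assumptions chosen for the example, not values from the studies cited.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Compute MAF from unphased calls such as 'AA', 'AG', 'GG'.

    Bases other than A, C, G, T (e.g. 'N') are ignored.
    Returns 0.0 for monomorphic or fully missing loci.
    """
    counts = Counter()
    for call in genotypes:
        for allele in call:
            if allele in ("A", "C", "G", "T"):
                counts[allele] += 1
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    # Second most common allele is the minor allele at a bi-allelic SNP.
    return sorted(counts.values())[-2] / total


def filter_snps(snp_calls, min_maf=0.05, min_call_rate=0.9):
    """Keep SNP ids whose call rate and MAF pass the given thresholds."""
    kept = []
    for snp_id, genotypes in snp_calls.items():
        called = [g for g in genotypes if "N" not in g]
        call_rate = len(called) / len(genotypes)
        if call_rate >= min_call_rate and minor_allele_frequency(called) >= min_maf:
            kept.append(snp_id)
    return kept


# Hypothetical calls for three SNPs across ten accessions.
calls = {
    "snp1": ["AA", "AG", "GG", "AA", "GG", "AA", "AG", "GG", "AA", "GG"],
    "snp2": ["CC"] * 10,  # monomorphic
    "snp3": ["TT", "NN", "NN", "TC", "CC", "TT", "NN", "CC", "TT", "CC"],  # low call rate
}
print(filter_snps(calls))
```

In practice the same function would be applied separately within each target germplasm group, so that a SNP is retained only if it is informative in the groups where it will be used.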
Large pools of NGS data are also valuable for characterizing SNP haplotypes that enable precise tracking of beneficial alleles for breeding applications. Now that many important genes and QTLs have been cloned, there has been progress in identifying functional SNPs and gene-based SNP haplotypes to improve selection of target alleles. This process requires knowledge of the genetic donor providing the gene or QTL, along with some idea of the LD of the markers linked to the allele of interest. For marker-assisted backcrossing (MABC) applications, this is fairly straightforward since the donor introgression will usually be at least 1 Mb in size and easily tracked with flanking SNP markers polymorphic between the donor and recurrent parent. However, the availability of high density SNP genotyping and genome-wide association studies (GWAS) across sets of diverse germplasm has led to the identification of important chromosome segments with smaller LD blocks (on the order of 50–100 kb for inbred crops such as rice, but much smaller for outcrossing species such as maize), along with an interest in predicting target alleles across breeding lines and varieties having unknown relationships with the original genetic donors. The challenge then becomes characterizing the desired chromosomal segments, or identical-by-descent (IBD) blocks, that are predictive for the trait of interest at that particular gene. In a few cases the functional nucleotide polymorphism (FNP) can be assayed directly, especially if the causal variant is a SNP, but more commonly a set of closely linked SNPs will define the unique haplotype that is associated with the targeted IBD block.
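As a rough illustration of how a set of closely linked SNPs can be used to screen lines for a target haplotype, the following Python sketch checks whether a line is homozygous for the target allele at every SNP that defines the haplotype. The SNP names, alleles, and line names are invented for the example, and treating missing or heterozygous calls as "not carrying" is a deliberately conservative assumption.

```python
def carries_haplotype(line_calls, target_haplotype):
    """Return True if the line is homozygous for the target allele at
    every SNP in the haplotype. Missing SNPs count as a mismatch."""
    for snp_id, target_allele in target_haplotype.items():
        call = line_calls.get(snp_id, "NN")
        if call != target_allele * 2:
            return False
    return True


# Hypothetical three-SNP haplotype tagging a favorable allele.
target = {"snp_a": "G", "snp_b": "T", "snp_c": "C"}

lines = {
    "line_1": {"snp_a": "GG", "snp_b": "TT", "snp_c": "CC"},  # full match
    "line_2": {"snp_a": "GG", "snp_b": "TT", "snp_c": "CT"},  # heterozygous at snp_c
    "line_3": {"snp_a": "GG", "snp_b": "TT"},                 # snp_c missing
}

carriers = [name for name, calls in lines.items() if carries_haplotype(calls, target)]
print(carriers)
```

A production workflow would likely report partial matches and heterozygotes separately rather than discarding them, since those lines may still be useful in segregating populations.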
This is where NGS data becomes important, since a haplotype map (HapMap) of the genome can help define the extent of local LD decay, along with common haplotype block segments, as has been recently characterized for the maize genome (Chia
The early successes with high-throughput SNP genotyping relied on fixed sets of SNP markers assayed using microarrays. For example, the first phase of the human HapMap project employed whole-genome SNP arrays from Illumina and Affymetrix for large scale SNP genotyping (International HapMap Consortium, 2005). The Illumina BeadArray technology uses beads covered with specific oligos that fit into patterned microwells allowing for highly multiplexed SNP detection, initially employing the GoldenGate assay that incorporates locus and allele-specific oligos for hybridization followed by allele-specific extension and fluorescent scanning (Shen
The advantages of fixed array SNP platforms include a range of multiplex levels providing rapid high-density genome scans, robust allele calling with high call rates, and cost-effectiveness per data point when genotyping large numbers of SNPs. Between the different Illumina and Affymetrix technologies, a wide range of options is available for custom genotyping of various numbers of samples × SNPs, such as running 24 samples by 3K up to 700K SNPs or 384 samples by 50K SNPs. By carefully selecting informative, evenly spaced SNPs across the genome, these arrays are powerful tools for GWAS and diversity analysis, as has been achieved in rice using Illumina 1,536 and 50K SNP arrays (Zhao
While fixed arrays have been the SNP genotyping workhorses over the past decade, they have several disadvantages. First, it is expensive to design a custom SNP array, which limits the number of re-designs that can be used to optimize the chip, and a large initial commitment is needed to obtain the volume discounts that make arrays cost-effective. For this reason, fixed arrays are best used when a “universal” design can make them widely usable across a broad range of germplasm, allowing the development cost to be spread across a large number of users, as can be implemented through a consortium model for designing custom SNP chips useful to the larger community. However, this presents a key challenge: to cover rare SNPs across multiple germplasm groups, a universal design can quickly become too large and expensive (and will result in large numbers of monomorphic loci for non-target germplasm groups), while using multiple population-specific chips adds to the development costs and limits the number of users needing any particular chip. In addition, the process of selecting informative SNPs for different germplasm groups will introduce ascertainment bias: these SNP variants no longer represent a set of random, neutral loci, but instead present a biased view of genetic relationships depending on what selection criteria were used to select the SNPs (Moragues
In addition to genotyping systems employing fixed SNP arrays, there are a number of high-throughput technologies available to run flexible sets of SNP markers. At the low end of the spectrum, PCR-based fluorescently-labeled SNP assays, such as TaqMan® and KASP™ markers, can be run one marker at a time and scanned on real-time PCR machines or fluorescent plate readers. For these methods, the cost is determined by the PCR reaction volume, since smaller volumes need fewer reagents. Thus, a 5 μL reaction in a 384-well PCR plate is more cost-effective than a 15 μL reaction in a 96-well PCR plate. Moving to 1,536-well PCR plates can further reduce the cost, but at this point automation becomes necessary. The 5′-nuclease TaqMan® assay, which combines PCR with competitive hybridization, has been considered the gold standard in SNP genotyping since it was introduced almost 20 years ago (Livak
The concept of miniaturizing the reaction volumes to reduce PCR reagent costs has been further advanced in a number of flexible high-throughput SNP systems, including Array Tape™ by Douglas Scientific, the OpenArray® system from Life Technologies, and Dynamic Arrays™ from Fluidigm. The Douglas Scientific technology uses a production line of automated modules to process spools of Array Tape™ that contain the equivalent of 200 microplates with 800 nL to 1.6 μL reaction volumes through automated assay setup, PCR, and fluorescent detection, genotyping up to 150,000 data points per day using either TaqMan or KASP assays (
No single genotyping system is most efficient for every application, so each has its own pros and cons; each breeding program, institute, or research community should select the platform that best addresses its specific needs. The flexible systems described above share the key advantage of being able to mix and match different SNPs for each set of samples, which reduces the resources wasted on genotyping monomorphic loci, as occurs with fixed SNP sets, especially when genotyping mapping populations. For diversity and fingerprinting, different subsets of informative SNPs for various germplasm groups can be optimized to enable running smaller sets of SNPs than a universal fixed array would require. Flexible SNP systems are also ideal for targeted SNPs, including functional SNPs and trait-specific haplotypes, since they have a very low cost per data point; however, for genome-wide scans they will quickly reach a threshold where fixed arrays or GBS will be more efficient. That threshold is determined by the cost per data point × number of SNP loci for a single-plex system versus the cost per sample of a multiplex system; for example, it may still be reasonable to run 200–300 genome-wide SNPs on Fluidigm or Array Tape, but for a higher SNP density it might be more cost-effective to run a 6K SNP chip or GBS instead. However, these systems vary greatly in the initial capital investment required to purchase the equipment, and greater capital investments are often needed to reach the very low costs per data point (Table 1). This is a major factor to consider when choosing between setting up several small labs and having centralized genotyping facilities, as will be discussed further below.
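The break-even comparison described above (cost per data point × number of SNP loci versus a fixed multiplex cost per sample) can be written out as a small Python sketch. The prices used here are purely hypothetical placeholders; real costs vary by platform, provider, and sample volume.

```python
def single_plex_cost_per_sample(n_snps, cost_per_data_point):
    """Per-sample cost on a flexible system that is charged per data point."""
    return n_snps * cost_per_data_point


def break_even_snps(cost_per_data_point, multiplex_cost_per_sample):
    """Number of SNPs at which both approaches cost the same per sample."""
    return multiplex_cost_per_sample / cost_per_data_point


def cheaper_platform(n_snps, cost_per_data_point, multiplex_cost_per_sample):
    """Return 'flexible' or 'multiplex', whichever is cheaper per sample."""
    flexible = single_plex_cost_per_sample(n_snps, cost_per_data_point)
    return "flexible" if flexible < multiplex_cost_per_sample else "multiplex"


# Hypothetical prices: 0.05 per data point versus 20 per sample on an array.
COST_PER_DATA_POINT = 0.05
ARRAY_COST_PER_SAMPLE = 20.0

for n in (24, 300, 6000):
    print(n, cheaper_platform(n, COST_PER_DATA_POINT, ARRAY_COST_PER_SAMPLE))
```

This comparison deliberately ignores capital investment and the share of monomorphic loci on a fixed array, both of which shift the threshold in practice.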
There are a number of examples of flexible SNP systems being used successfully for crop research and breeding. A major effort was initiated by the Generation Challenge Program as part of their Integrated Breeding Platform to validate KASP markers across globally-important field crops, ranging from 96 SNPs for groundnut to over 1,000 SNPs for ten other crops, including 1,250 for maize, 1,864 for wheat, and 2,015 for rice (
In addition to fixed arrays and flexible methods, the approach of using NGS for low-cost genotyping, called “genotyping by sequencing” (GBS), has become increasingly popular. While whole genome sequence data can be used to call SNP variants, for most crop species it is still too expensive to obtain deep sequence data merely for genotyping purposes. Thus, a number of approaches have been developed to bring down the cost of NGS to a level where it can be used for routine genotyping, which entails lower coverage sequencing, often by running multiple barcoded DNA samples in a single lane of an NGS machine. At the simplest level, genotyping by sequencing can be achieved by low coverage “skim” sequencing, as has been used recently in rice with 0.02X-0.13X sequence coverage for mapping populations and approximately 1X sequence coverage for diverse germplasm (Huang
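The trade-off between multiplexing level and per-sample coverage in skim sequencing is simple arithmetic, sketched below in Python. The lane yield (30 Gb) and the rice genome size (roughly 390 Mb) are illustrative assumptions, and the calculation assumes reads are divided evenly among samples, which real libraries only approximate.

```python
def coverage_per_sample(total_bases_per_lane, n_samples, genome_size_bp):
    """Average sequence coverage (X) per sample when a lane is shared evenly."""
    return total_bases_per_lane / (n_samples * genome_size_bp)


def max_samples_for_coverage(total_bases_per_lane, target_coverage, genome_size_bp):
    """Largest number of samples per lane that still reaches the target coverage."""
    return int(total_bases_per_lane // (target_coverage * genome_size_bp))


# Assumed values for illustration only.
LANE_YIELD_BP = 30e9      # hypothetical yield of one sequencing lane
RICE_GENOME_BP = 390e6    # approximate rice genome size

print(round(coverage_per_sample(LANE_YIELD_BP, 384, RICE_GENOME_BP), 2))
print(max_samples_for_coverage(LANE_YIELD_BP, 0.1, RICE_GENOME_BP))
```

Under these assumptions, several hundred barcoded samples can share one lane and still reach coverage in the range reported for mapping populations.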
Most genotyping by sequencing techniques make use of restriction enzyme (RE) digestion, followed by adapter ligation, PCR and sequencing. The first of these sequenced restriction-site associated DNA (RAD) tags, generated by restriction digestion and ligation of adapters containing unique barcode sequences for sample multiplexing (Baird
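Sample multiplexing with barcoded adapters implies a demultiplexing step after sequencing, in which each read is assigned back to its sample by the barcode at its start. The Python sketch below shows the idea under simplifying assumptions: exact barcode matches only, no barcode being a prefix of another, and invented barcodes, sample names, and reads. Published pipelines typically also tolerate sequencing errors in the barcode.

```python
def demultiplex(reads, barcode_to_sample):
    """Assign reads to samples by their leading barcode.

    Returns (by_sample, unassigned), where by_sample maps each sample name
    to its reads with the barcode trimmed off.
    """
    by_sample = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    for read in reads:
        for barcode, sample in barcode_to_sample.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read.
            unassigned.append(read)
    return by_sample, unassigned


# Hypothetical barcodes and reads.
barcodes = {"ACGT": "line_1", "TGCA": "line_2"}
reads = ["ACGTCTGCAGGATC", "TGCACTGCATTTAA", "GGGGCTGCAAAAAA"]

by_sample, unassigned = demultiplex(reads, barcodes)
print(by_sample)
print(unassigned)
```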
While the original GBS protocol used a single enzyme, a two-enzyme modification has been successfully employed in barley, wheat and oat (Poland
In contrast to the random, genome-wide SNP loci produced by RE-based GBS approaches, there are also several targeted re-sequencing approaches that can be used for genotyping. In the past, Sanger sequencing of PCR amplicons has been used for SNP variant detection, but it is too expensive for large-scale genotyping projects. Thus, recent efforts have focused on taking advantage of the power of NGS while maximizing the number of amplicons and samples that can be pooled into a single NGS run. For example, the Targeted Amplicon Sequencing (TAS) method uses a two-step PCR process to amplify specific targets across the genome and then add a barcode multiplex identifier across multiple individuals before pooling and sequencing (Bybee
One major challenge with GBS approaches is the considerable investment needed for bioinformatics support to properly analyze, curate and store the massive amounts of sequence data obtained from running GBS on large populations. GBS analysis pipelines are required to group the sequence tags, align to a reference genome (if available), call SNP variants, and assign calls to individual samples. For example, the
GBS has a number of advantages that have led to its rapid uptake (Poland and Rife, 2012). First, GBS performs SNP discovery and genotyping simultaneously, without the ascertainment bias that occurs when selecting sets of SNPs for fixed arrays, and without any prior information needed. It also has a low entry cost for establishing a manual GBS library preparation workflow, while at the same time it is amenable to an automated workflow using liquid handling workstations. The greatest advantage, however, is that GBS leverages the rapidly falling costs of NGS to provide an excellent balance of low costs per sample and high-density genome-wide SNP data. By adjusting the choice of restriction enzymes, the number of samples multiplexed per run, and the sequencing platform, GBS can be further fine-tuned to provide a wide range of SNP densities at varying costs per sample (Beissinger
One of the main issues to consider when evaluating options for SNP genotyping is whether to develop in-house facilities or outsource to a service provider. In most cases, the new genotyping technologies require a large capital investment in order to provide very low costs per sample; moreover, these platforms are most efficient when they run very large numbers of samples, due to discounts for high volume purchases of reagents and consumables. Thus there has been a shift for smaller labs to outsource their genotyping needs to commercial service providers, or for core facilities or “genotyping hubs” to be set up to serve the needs of local or regional communities of researchers and breeders. Although it can be convenient to outsource to a service provider who takes on the risk of upgrading infrastructure when equipment becomes obsolete, there comes a point when having a core facility in-house becomes more efficient, especially if there is enough demand to keep the genotyping platforms running at full capacity. In these cases, the advantages of having a centralized core facility in-house include faster turnaround times, being able to optimize protocols and markers for a few target crops, and avoiding the hassle of shipping seeds, leaf tissue or DNA samples out of the country. On the other hand, having service providers and core facilities available to accept DNA samples from anywhere in the world allows for unprecedented flexibility for smaller labs and breeding programs. In either case, it is essential to have professional level sample tracking, along with solid QA/QC measures, to ensure that reliable and accurate data is provided.
Another issue to consider is how much effort should be spent on MAS and MABC of targeted, trait-specific SNPs for known genes and QTLs versus employing more high-density genome-wide SNP scans (Fig. 2). This will depend on several factors, including the crop species, the genetic architecture of the trait of interest, and the number of breeding-relevant, large-effect genes and QTLs that are fine-mapped and cloned. As was discussed earlier in this review, one important aspect of targeted selection is knowledge of the size of the LD block and the IBD status of the target: whether it is included in a large introgression from a known, recent donor, or selected from a smaller LD block across a set of diverse germplasm. “Diagnostic” SNP markers, whether functional SNPs or gene-based haplotypes, can be used to profile diverse sets of germplasm with unknown pedigrees for the specific allele of interest, while flanking SNPs are best used to transfer introgressions from a known genetic donor from a recent cross. On the other hand, some traits do not lend themselves to a targeted approach and are better suited to genome-wide prediction or genomic selection methods that use precise phenotyping and high-density genotyping on a training population to calculate genomic estimated breeding values (GEBVs), which are then used on breeding populations for rapid cycles of selection with the genotype data alone (Heffner
Above all these factors, bioinformatics plays an essential role behind any SNP genotyping program. Whether analyzing clusters of two-color fluorescent intensities or complex sequence data from GBS, robust pipelines need to be set up for routine allele calling, preferably in relation to a high quality reference genome. Moreover, many labs and breeding programs need to merge data across platforms, such as whole genome sequence data, fixed arrays, GBS and targeted single-plex assays, which requires careful attention to the DNA strand used to design the SNP assay, in addition to the relation of the SNP to the reference genome. At the same time, lower density data can be imputed using NGS data, whether from related lines or from a global HapMap. Once data is compiled and imputed, it needs to be analyzed for quality control and stored in a database structure that allows for user-friendly queries and downstream data analysis. The final step is then enabling access and decision support tools for breeders to integrate SNP markers into their selections to accelerate the progress in their breeding programs, such as the breeding information management systems being developed at IRRI (E. Nissilä, pers. comm.) and by the Integrated Breeding Platform (
An example of a core facility for SNP genotyping is IRRI’s Genotyping Services Laboratory (GSL), which was recently set up to provide rapid and cost-effective marker services to research and breeding groups at IRRI and the larger rice community. GSL currently has 12 full time staff divided into teams for marker validation, optimizing lab operations, running the routine genotyping services, and interfacing with IRRI’s bioinformatics group. A sample processing workflow is being optimized to efficiently move leaf tissue in the greenhouse and field into the DNA extraction and SNP genotyping pipelines. Leaf tissue is sampled using a Brooks PlantTrak Hx™ handheld plant sampling and barcoding device, which allows up to 12 leaf punches per sample and 100 samples per plastic magazine cartridge, and reduces issues of human error during the sampling process (
As of late 2014, the GSL genotyping platforms were focused on using a Fluidigm EP1™ Reader for targeted SNP markers and an Illumina Infinium rice 6K chip for genome-wide scans. For targeted genotyping, 96.96 Dynamic Array IFCs are used for diversity analysis, QTL mapping and background selection, while 192.24 sample × SNP format IFCs are used for running trait-specific SNPs across large populations. Targeted SNPs have been selected as flanking key gene and QTL positions from the rice 44K SNP chip (Zhao
Recent efforts at GSL have also aimed towards improving sample tracking, SNP analysis, and data management in the lab. An integrated laboratory information management system (LIMS) is being optimized for GSL’s operations using the web-based, cloud-hosted Biotracker™ LIMS from Ocimum Biosolutions (
Recent advances in molecular marker technology have enabled rapid high-throughput genotyping for pre-breeding discovery research as well as SNP deployment in breeding programs. Research and breeding groups now have a large number of options, including outsourcing to genotyping service providers or setting up a core facility based on one of the many genotyping platforms. With the rapid decrease in NGS costs, genotyping by sequencing (GBS) will become increasingly attractive to handle high-density genome-wide marker scans, as long as adequate bioinformatics support is available. Future prospects to increase the efficiency and impact of SNP genotyping will come on several fronts, including improved DNA extraction, more predictive SNP markers, more efficient GBS, and improved bioinformatics tools for SNP data analysis, management, and integration with breeders’ selection decisions. While techniques for DNA extraction from leaf tissue can be further improved, a larger gain can be made by switching to automated seed chipping, which saves the embryo for germination while extracting DNA from the remainder of the seed, allowing genotyping to screen out unwanted individuals before going to the field (see Monsanto patent EP1869961B1). At the same time, further progress in cloning important QTLs and characterizing functional SNPs and allele-specific haplotypes will continue to provide improved predictive markers for targeted selection. Moreover, as whole genome sequence and GBS data accumulates, it will become more feasible to impute functional variants with genome-wide data. Alternatively, it may be increasingly possible to use a smaller number of low-cost markers for genome-wide scans and then impute back to the whole genome sequence data of the parental lines. 
In either case, having improved bioinformatics and SNP data management tools will be essential: the gains of the future will largely rest on the bioinformatics teams who are optimizing allele calling pipelines, building infrastructure for managing massive GBS data sets, and developing the tools that will seamlessly link SNP data with downstream applications for calculating GEBVs, tracking haplotypes, and assisting the breeders in making selections. The tsunami of sequence and SNP data has arrived; we should be prepared to take advantage of the data to accelerate progress in trait development and gene discovery, and to increase the rate of genetic gain for crop improvement.
Table 1. Examples of high-throughput SNP genotyping technologies.
| Genotyping platform | Technology | SNP × sample combinations | Capital investment | Cost per sample | Advantages |
|---|---|---|---|---|---|
| Illumina Infinium iSelect HD | Fixed array | 3,072–700K SNPs × 24 samples | High (iScan) | Moderate to high | Highly multiplexed |
| Affymetrix Axiom | Fixed array | 50K SNPs × 384 samples; 650K SNPs × 96 samples | High (GeneTitan) | Moderate to high | Highly multiplexed |
| Douglas Array Tape | Flexible, PCR-based | 1 SNP/sample × 76,800 reactions/reel | Very high (Nexar, Soellex, Araya) | Very low | Ultra high-throughput |
| Fluidigm Dynamic Arrays | Flexible, PCR-based | 96 SNPs × 96 samples; 24 SNPs × 192 samples | Moderate (IFC Controller, FC1, EP1) | Low | High-throughput |
| RE-based GBS | Genotyping by sequencing | ~10K–100K SNPs × 96 or 384 samples | Low to moderate (NGS outsourced or in-house) | Low to moderate | Lots of data relative to the cost |
| Amplicon sequencing | Genotyping by sequencing | Variable (e.g. 20–500 SNPs × 48–384 samples) | Low to moderate (NGS outsourced or in-house) | Low to moderate | Multiple targeted loci at once |