Characterization of Chloroplast Genomes, Nuclear Ribosomal DNAs, and Polymorphic SSR Markers Using Whole Genome Sequences of Two Euonymus hamiltonianus Phenotypes
Plant Breeding and Biotechnology 2019;7:50-61
Published online March 30, 2019
© 2019 Korean Society of Breeding Science.

Junki Lee1,2, Shin-Jae Kang1, Hyeonah Shim1, Sang-Choon Lee3, Nam-Hoon Kim3, Woojong Jang1, Jee Young Park1, Jeong Hwa Kang4, Wan Hee Lee4, Taek Joo Lee4, Gyoungju Nah2, Tae-Jin Yang1,5,*

1Department of Plant Science, Plant Genomics and Breeding Institute, and Research Institute of Agriculture and Life Sciences, College of Agriculture and Life Sciences, Seoul National University, Seoul 08826, Korea, 2Genome Analysis Center at National Instrumentation Center for Environmental Management, Seoul National University, Seoul 08826, Korea, 3Phyzen Genomics Institute, Seongnam 13558, Korea, 4Hantaek Botanical Garden, Yongin 17183, Korea, 5Crop Biotechnology Institute/GreenBio Science and Technology, Seoul National University, Pyeongchang 25354, Korea
Corresponding author: *Tae-Jin Yang, tjyang@snu.ac.kr, Tel: +82-2-880-4547, Fax: +82-2-873-2056
Received February 20, 2019; Revised February 23, 2019; Accepted February 23, 2019.
This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract

Although genomics provides useful tools for crops, most wild resource plants still lack molecular data. To retrieve useful genomic data and thus provide fundamental information for a resource plant, we established a multi-directional approach using two low-coverage whole-genome shotgun sequence (WGS) datasets of Euonymus hamiltonianus, a wild resource plant with potential as a medicinal and ornamental plant. We assembled the complete chloroplast genome and nuclear ribosomal DNA (nrDNA) sequences and analyzed polymorphic simple sequence repeats (pSSRs) in the nuclear genome by comparing WGS data from two different phenotypes. We developed a bioinformatics pipeline to identify pSSR motifs by systematic comparison of two WGS datasets. The pipeline is composed of multiple steps, including end-joining of paired reads, isolation of joined reads harboring SSR motifs derived from unique non-repetitive regions, identification of pSSRs via in silico comparison with counterpart WGS reads, design of pSSR primer sets, and validation. The pipeline was applied to WGS data of E. hamiltonianus and identified 161 contigs with pSSR motifs between the two phenotypes. Based on the pSSR motifs, 20 primer pairs were designed, of which seven were successfully validated as real pSSR markers. We expect this information to be applicable to the development of genomic resources for E. hamiltonianus.

Keywords : Euonymus hamiltonianus, Chloroplast, nrDNA, Polymorphic SSR, Bioinformatic pipeline
INTRODUCTION

Over the past decade, high-throughput parallel DNA sequencing technologies, popularly called next-generation sequencing (NGS), have become widely accessible and have decreased the cost of DNA sequencing (Metzker 2010). NGS technologies have evolved rapidly alongside robust protocols for generating sequencing libraries and effective solutions for data analysis (Shendure et al. 2008). Beginning with the genome project of Arabidopsis thaliana, hundreds of plant genomes have been examined and analyzed (Arabidopsis Genome Initiative 2000; Michael and VanBuren 2015). Although NGS has greatly facilitated whole-genome shotgun sequencing (WGS), obtaining a complete reference genome sequence remains a major challenge because it requires massive WGS data as well as a variety of supporting data. As a result, reference genome sequencing is usually geared towards model plants or major crops, while the genomes of most resource plants remain uncharacterized. Because of these financial and technological challenges, extensive genomic analysis of non-model plants is still limited, and an efficient analytical method is needed for resource plants.

A molecular marker is a particular DNA fragment that carries specific genetic information and differs at the genomic level (Agarwal et al. 2008). Molecular markers have been widely used as a valuable tool for assessing genetic variation and have greatly improved genetic analysis in crop plants (Varshney et al. 2009). Various genomic components of plant cells have been considered as targets of molecular markers. Among them, the chloroplast genome and 45S nuclear ribosomal DNA (nrDNA) are key elements used to explain plant genetic diversity and evolution (Qiu et al. 1999; Soltis et al. 1999; Kim et al. 2015). Chloroplast genomes are circular DNA molecules ranging from 120 to 217 kb in length with highly conserved structures and gene order (Palmer 1985). They are composed of a large single copy (LSC) region, a small single copy (SSC) region, and two copies of inverted repeats (IR). The nrDNA unit has a fundamental genetic role in linking transcription to translation (Richard et al. 2008). The 45S nrDNA contains the 28S large subunit, the 18S small subunit, the 5.8S gene, two internal transcribed spacers (ITS1 and ITS2), and large intergenic spacers (IGS), whereas the 5S rDNA contains the 5S ribosomal RNA gene (Long and Dawid 1980). Both are generally high-copy, tandemly repeated transcription units in the plant genome (Rogers and Bendich 1987).

Simple sequence repeats (SSRs, or microsatellites) are tandemly repeated motifs of one to six or more nucleotides arranged head to tail, and they are widely used as genetic markers (Kelkar et al. 2010; Choi et al. 2011; Kim et al. 2012). SSR-based molecular markers are universally used for population genetic studies such as parentage analysis, fingerprinting, genetic structure analysis, and genetic mapping because of their copy number variations, polymorphic features, and reproducibility (Mittal and Dubey 2009; Meng et al. 2014; Grover and Sharma 2016; Lee et al. 2017a). Recently, various software tools for finding SSR motifs have been developed, such as MISA (http://pgrc.ipk-gatersleben.de/misa/), SSR Locator, and FullSSR (Rozen and Skaletsky 2000; da Maia et al. 2008; Abdelkrim et al. 2009; Metz et al. 2016). However, conventional SSR marker development and experimental authentication still require considerably long sequence information as a guide and a labor-intensive procedure to find polymorphisms (Ma et al. 2009).

In this study, we developed a pipeline that retrieves essential genomic information from low-coverage WGS data. We analyzed Euonymus hamiltonianus, a non-model species that is a medicinal and valuable ornamental plant. To identify genomic diversity and evolution, the chloroplast genome, 45S nrDNA, and polymorphic simple sequence repeats (pSSRs) were characterized with a simple and effective pipeline using small amounts of whole-genome NGS data.

MATERIALS AND METHODS

Plant materials, genomic DNA extraction and NGS sequencing

Two naturally-occurring phenotypes of E. hamiltonianus, one with normal leaves (EH-n) and the other with variegated leaves (EH-v), were sampled from Hantaek Botanical Garden (http://www.hantaek.co.kr, South Korea). Genomic DNAs were extracted from leaves using a modified cetyltrimethylammonium bromide (CTAB) method (Allen et al. 2006) and then used to construct genomic libraries with insert sizes of about 500 bp, according to Illumina paired-end (PE) standard protocol (http://www.illumina.com). The libraries were sequenced using Illumina MiSeq genome analyzer at LabGenomics (www.labgenomics.co.kr).

Sequence assembly and phylogenetic analysis of chloroplast genome and 45S nrDNA

Complete chloroplast genomes and 45S nrDNA units were assembled by the dnaLCW protocol using CLC_novo_assemble (ver. 4.21.104315, CLC Inc, Aarhus, Denmark) and manual curation (Kim et al. 2015). The complete chloroplast genome of E. hamiltonianus was aligned with the complete chloroplast genomes of eleven plants using MAFFT 7 (Katoh et al. 2002). A phylogenetic tree was generated by maximum likelihood analysis using MEGA 7.0 with 1,000 bootstrap replicates (Kumar et al. 2016).

Sequence preparation (quality control and paired end joining), identification of SSR motif and clustering

The WGS reads of EH-n and EH-v were trimmed by Trimmomatic (ver. 0.33) based on quality score and sequence length (minimum quality score: ≥ 20; read length: ≥ 70 bp) (Bolger et al. 2014). Trimmed PE WGS reads of EH-n were joined by clc_overlap_reads (ver. 4.21.104315, CLC Inc, Aarhus, Denmark) with a minimum overlapping length of 20 bp and 95% similarity (applying ‘-o 20 -s .95’). SSR motifs in joined and non-joined (≥ 250 bp) contigs were identified using the microsatellite search module MISA (http://pgrc.ipk-gatersleben.de/misa/), with minimum repeat-unit numbers of 6, 5, 5, 5, 5, and 4 for di-, tri-, tetra-, penta-, hexa-, and hepta-nucleotides, respectively. The primary trimming of redundant contigs of the reference (EH-n) was performed by sequence clustering using BLASTCLUST with a sequence-identity cutoff of 90% and a length coverage threshold of 50% (Altschul et al. 1997). Only single-clustered contigs were used for further analysis.
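The repeat-number thresholds above can be illustrated with a minimal Python sketch of a perfect-SSR finder. This is not MISA itself and does not reproduce its behavior for compound or interrupted repeats; the function names `find_ssrs` and `is_primitive` are hypothetical, and only the thresholds (6, 5, 5, 5, 5, and 4 for di- to hepta-nucleotides) come from the text.

```python
import re

# Minimum repeat counts per motif length, taken from the thresholds stated above.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5, 7: 4}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'TCTC')."""
    n = len(motif)
    return not any(n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, end, motif, repeat_count) tuples for perfect SSRs.

    Coordinates are 0-based with an exclusive end.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A unit followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                count = (m.end() - m.start()) // unit_len
                hits.append((m.start(), m.end(), motif, count))
    return sorted(hits)


if __name__ == "__main__":
    contig = "GGCA" + "TC" * 17 + "AAGT" + "AGA" * 10 + "CCGT"
    for hit in find_ssrs(contig):
        print(hit)
```

The primitive-motif check keeps a (TC)n run from being reported a second time as a tetra- or hexa-nucleotide repeat, and it also excludes mononucleotide runs, which have no threshold in the text.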

Discovery of polymorphic SSR between two WGS data

The trimmed WGS reads of EH-v were aligned to the single-clustered SSR contigs of EH-n using CLC_mapper (ver. 4.21.104315, CLC Inc, Aarhus, Denmark) with a matched length fraction of 90% and sequence similarity of 90%. The second trimming of redundant contigs of EH-n was done by removing high-depth contigs with a mapping coverage of more than 10 by WGS reads of EH-v, because highly mapped contigs could have been derived from redundant DNA regions such as repeats within the genome. Variation sites that indicate a consistent difference between the contigs of EH-n and the WGS reads of EH-v were found using CLC_find_variation (ver. 4.21.104315, CLC Inc, Aarhus, Denmark). The sequence variation file and the SSR motif file from MISA (http://pgrc.ipk-gatersleben.de/misa/) were combined using an in-house Python program to estimate the polymorphic regions that include microsatellite sites. Primers were designed at primer binding sites (PBS) using Primer3 (Rozen and Skaletsky 2000).
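The in-house program is not described in detail, so the following Python sketch only illustrates one way the depth filter and the combination of variation and SSR information could work. The record types `SSR` and `Variant`, the function `polymorphic_ssrs`, and the choice to count only length-changing variants are assumptions made for illustration; the coverage cutoff of 10 and the exclusion of unmapped contigs come from the text.

```python
from collections import namedtuple

# Hypothetical record layouts; the actual file formats are not given in the text.
SSR = namedtuple("SSR", "contig start end motif")       # 0-based, end exclusive
Variant = namedtuple("Variant", "contig pos ref alt")   # 0-based position on the contig


def polymorphic_ssrs(ssrs, variants, coverage, max_coverage=10.0):
    """Return SSRs overlapped by a length-changing variant on contigs that pass the depth filter."""
    indels_by_contig = {}
    for v in variants:
        if len(v.ref) != len(v.alt):  # assumption: keep insertions/deletions only
            indels_by_contig.setdefault(v.contig, []).append(v)

    selected = []
    for s in ssrs:
        depth = coverage.get(s.contig, 0.0)
        if depth == 0.0 or depth > max_coverage:
            continue  # unmapped contig, or likely derived from a repetitive region
        for v in indels_by_contig.get(s.contig, []):
            v_end = v.pos + max(len(v.ref), 1)
            if v.pos < s.end and v_end > s.start:  # intervals overlap
                selected.append(s)
                break
    return selected


if __name__ == "__main__":
    ssrs = [SSR("c1", 100, 134, "TC"), SSR("c2", 50, 80, "AGA"), SSR("c3", 10, 30, "AT")]
    variants = [Variant("c1", 112, "TCTC", ""), Variant("c2", 60, "A", "G"),
                Variant("c3", 12, "ATAT", "")]
    coverage = {"c1": 3.2, "c2": 2.0, "c3": 250.0}
    print(polymorphic_ssrs(ssrs, variants, coverage))
```

In this toy input, only the SSR on contig c1 is kept: c2 carries a substitution rather than a length change, and c3 exceeds the coverage cutoff.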

PCR validation of designed SSR markers

Genomic DNAs of two E. hamiltonianus accessions were used for PCR validation of SSR markers. PCR amplification was performed in a 25 μL reaction volume containing the following components: 20 ng of DNA template, 10 μM of primer set, 5 mM of dNTP, and one unit of Taq DNA polymerase (Vivagen, Seongnam, Korea). The amplification conditions were as follows: 5 minutes at 94°C; 35 cycles of 94°C for 20 seconds, 58°C for 20 seconds, and 72°C for 20 seconds; and then 72°C for 7 minutes. PCR products were separated by 12% polyacrylamide gel electrophoresis for two and a half hours to identify polymorphisms. The gel was stained with ethidium bromide and visualized under a UV illuminator for manual genotyping.

RESULTS

Complete chloroplast genome and 45S nuclear ribosomal DNA assembly

Two types of E. hamiltonianus showing different leaf patterns, one with normal leaves (EH-n) and the other with variegated leaves (EH-v), were collected from Hantaek Botanical Garden (http://www.hantaek.co.kr, South Korea) (Fig. 1). Approximately 477 Mbp and 481 Mbp of WGS sequence data were obtained from EH-n and EH-v, respectively, by NGS analysis using the Illumina MiSeq platform.

Complete chloroplast genome sequences of EH-n and EH-v were successfully assembled through de novo assembly using low coverage WGS (dnaLCW) with high-quality reads followed by manual curation (Table 1, Fig. 2A) (Kim et al. 2015). EH-n and EH-v had identical chloroplast genome sequences, both 157,360 bp in length (GenBank accession no. KY921875). The genome was composed of four sections: an LSC of 86,399 bp, an SSC of 18,317 bp, and a pair of IRs of 26,322 bp each. A total of 113 genes, including 79 protein-coding genes, 30 transfer RNA genes, and 4 ribosomal RNA genes, were annotated in the chloroplast genome (Fig. 2A). Phylogenetic analysis was performed using the complete chloroplast genome sequence of E. hamiltonianus together with the chloroplast genomes of 11 other species. The species were divided into three monophyletic groups corresponding to the cohorts Rosids, Asterids, and Commelinids. The chloroplast genome-based phylogenetic tree showed that E. hamiltonianus was grouped with the Rosids (Fig. 2B). The 45S nrDNA sequences of 5,824 bp, which include the complete transcriptional unit, were also assembled for EH-n and EH-v (GenBank accession no. KY926695) (Table 1, Fig. 2C).

Establishment of bioinformatics pipeline for detection of polymorphic SSR using low coverage of WGS (dpsLCW)

The dpsLCW pipeline consists of five steps: 1) contig set construction from the reference WGS data set by PE joining and length-based singleton selection, 2) identification of SSR motifs in the contig set, 3) primary selection of potentially non-repetitive (NR) contigs by clustering, 4) secondary selection of potential NR contigs by alignment with WGS reads of the alternative species, and 5) finding pSSR candidate contigs.

First, at least two sets of WGS data from closely related species (or from within a species) are required. The paired-end reads of both samples are trimmed based on quality score (minimum 20), and each read pair is joined (or assembled) into a contig by overlapping the two reads. One sample is assigned as the ‘reference’ and the other as the ‘alternative’. Among the overlapped contigs and non-overlapped singleton reads of the reference, sequences long enough to contain PBS and to be compared with the alternative WGS reads are selected. In this study, contigs under 250 bp in length were filtered out. The MiSeq platform with a 500 bp library insert size is recommended, because PE reads can be joined easily and non-joined singletons are generally long, with an average length of 300 bp. Second, the reference contigs are used to find robust SSR motifs using MISA (http://pgrc.ipk-gatersleben.de/misa/). Third, NR candidates among the contigs with SSR motifs are primarily selected by sequence clustering. Single-clustered contigs were used in this study. Fourth, the trimmed WGS reads of the alternative are aligned to the selected NR contigs with SSR motifs of the reference. Contigs with high mapping depth are discarded because they could have been derived from redundant genomic regions such as repetitive DNA sequences. Finally, pSSR contigs are selected using the alignment information between reference and alternative from the previous step. All steps of this process are portrayed in the schematic diagram in Fig. 3.
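The PE joining step can be sketched in Python as a simple suffix-prefix overlap search between the forward read and the reverse complement of its mate. This is a minimal illustration under stated assumptions, not the algorithm used by clc_overlap_reads: the function names are hypothetical, quality scores are ignored, and only the minimum overlap (20 bp) and similarity (95%) are taken from the Materials and Methods.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def join_pair(read1, read2, min_overlap=20, min_identity=0.95):
    """Join a read pair into one contig, or return None if no acceptable overlap exists.

    The longest overlap that satisfies the identity cutoff is used.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for olen in range(longest, min_overlap - 1, -1):
        tail, head = read1[-olen:], mate[:olen]
        matches = sum(a == b for a, b in zip(tail, head))
        if matches / olen >= min_identity:
            return read1 + mate[olen:]
    return None


if __name__ == "__main__":
    shared = "GTTGGTGTTG" * 3                      # 30 bp shared by both reads
    read1 = "A" * 40 + shared
    read2 = reverse_complement(shared + "C" * 40)
    print(join_pair(read1, read2))                 # joined 110 bp sequence
    print(join_pair("A" * 50, "G" * 50))           # None: no overlap
```

Read pairs that return None correspond to the non-joined singletons, which the pipeline keeps only when they are long enough (at least 250 bp in this study).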

Verification of dpsLCW protocol with E. hamiltonianus

The dpsLCW protocol was applied to identify intraspecific variations between EH-n and EH-v. Quality-trimmed sequences of 426 Mb and 423 Mb were prepared for EH-n and EH-v, respectively. A total of 508,265 PE reads of EH-n were joined into single contigs by alignment between the two reads of each pair (63.8%). A total of 457,385 PE reads of EH-v were also joined into single contigs (57.9%). EH-n was used as the reference and EH-v as the alternative, because the number of joined contigs of EH-v was smaller than that of EH-n. The joined contigs and the remaining 596,964 non-joined singletons of EH-n were filtered to retain sequences of at least 250 bp, which are long enough to readily contain potential PBS. The 805,049 qualified contigs were used to search for SSR motifs. Among these contigs, MISA detected 19,053 contigs containing various SSR motifs. Sequence clustering was then performed using BLASTCLUST to investigate NR contigs (Altschul et al. 1997). A total of 7,629 single-clustered contigs were chosen for further analysis based on the median value of all clusters (median value: 1) (Table 2).

To identify pSSR motifs between the two samples, 1,600,199 WGS reads of EH-v were aligned to the 7,629 contigs of EH-n. Only 46,284 reads of EH-v (2.89%) were mapped onto EH-n contigs, with an average coverage of 2.78 (maximum coverage: 10243.3). A total of 116 contigs with a mapping coverage of more than ten by WGS reads of EH-v were discarded, because highly mapped contigs might have originated from repetitive DNA sequences. Moreover, 6,054 unmapped contigs were excluded because pSSRs cannot be assessed without mapped reads. Among the remaining 1,459 contigs, which were mapped at a coverage of ten or less by EH-v reads, 161 contigs with a total length of 60,859 bp were finally selected for the detection of pSSRs (Table 2, Supplementary Fig. S1). A total of 163 pSSR motifs were found between EH-n and EH-v; one contig had three SSR motifs, including one dinucleotide and two trinucleotide SSR motifs. Of the contigs, 117 had dinucleotide SSR motifs, 31 had trinucleotide SSR motifs, eight had tetranucleotide SSR motifs, and six had more than pentanucleotide SSR motifs.
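The counts reported in this filtering step can be checked with a few lines of Python. This is only an arithmetic check of the figures stated in the text; the variable names are illustrative.

```python
# Contig counts reported in the text
single_clustered = 7_629   # single-clustered SSR contigs of EH-n
high_depth = 116           # discarded: mapping coverage above ten
unmapped = 6_054           # excluded: no EH-v reads mapped

retained = single_clustered - high_depth - unmapped
print(retained)            # 1459

# Read counts reported in the text
ehv_reads = 1_600_199      # trimmed WGS reads of EH-v
mapped_reads = 46_284      # EH-v reads mapped onto EH-n contigs

mapped_pct = 100 * mapped_reads / ehv_reads
print(round(mapped_pct, 2))  # 2.89
```

Both results agree with the values given above: 1,459 contigs remain after the two exclusions, and 2.89% of the EH-v reads were mapped.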

Validation of predicted pSSR for E. hamiltonianus

To validate that the identified pSSRs show actual variation between EH-n and EH-v, 20 contigs were randomly chosen and primer pairs were designed for them using Primer3 (Rozen and Skaletsky 2000) (Table 3). All 20 primer pairs produced amplification products (Supplementary Fig. S2). Among them, seven polymorphic markers were developed, including the dominant marker EhSSR08 (Supplementary Fig. S2). Using these markers, EH-n and EH-v can be distinguished from each other (Supplementary Fig. S2). These SSR markers will be valuable for the evaluation and classification of other E. hamiltonianus accessions.
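The estimated PCR product sizes in Table 3 can be related to the repeat counts with a short Python sketch. It assumes that the flanking sequences of the two phenotypes are identical, so that the amplicon length difference comes from the repeat-count difference alone; the function name is hypothetical, and the example values are taken from Table 3.

```python
def expected_size_difference(motif, repeats_a, repeats_b):
    """Amplicon length difference (bp) implied by a repeat-count difference alone."""
    return len(motif) * abs(repeats_a - repeats_b)


# EhSSR01: (TC)17 in EH-n vs (TC)6 in EH-v; estimated products 254 and 232 bp
print(expected_size_difference("TC", 17, 6))    # 22
# EhSSR04: (AGA)10 in EH-n vs (AGA)4 in EH-v; estimated products 280 and 262 bp
print(expected_size_difference("AGA", 10, 4))   # 18
```

For these two markers the computed differences match the differences between the estimated product sizes (254 − 232 = 22 bp and 280 − 262 = 18 bp). Differences of this size are what the 12% polyacrylamide gel is expected to resolve.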

DISCUSSION

Complete chloroplast genomes and 45S nrDNA sequences of E. hamiltonianus

Non-model or underutilized plants are plants that have potential value but are not widely grown (Park et al. 2009). Compared to major crops, non-model plants lack genetic resources, including information on applicable molecular markers. To address this, dnaLCW was previously developed (Kim et al. 2015) to help improve the knowledge and availability of non-model plants. To investigate the genetic relationship between EH-n and EH-v, the complete chloroplast genome and 45S nrDNA were generated for both E. hamiltonianus types (Figs. 2A, 2C). However, the two E. hamiltonianus types did not have any polymorphic sites in the complete chloroplast genome and 45S nrDNA sequences (Fig. 2A, 2C). The phylogenetic tree of the chloroplast genomes of E. hamiltonianus and 11 other species showed that the Rosids and Asterids clades of core eudicots were distinguished from the Commelinid clade of monocots. As expected, E. hamiltonianus was placed close to E. japonicus (GenBank accession no. KB189362), a species in the same genus, with a nucleotide similarity of 97%, within the Rosids clade (Fig. 2B).

False pSSRs derived from dpsLCW in E. hamiltonianus

SSR markers are among the most applicable markers for the genetic evaluation and utilization of crops because of their benefits: 1) high reproducibility, 2) hyper-variable nature, and 3) co-dominant nature (Park et al. 2009). However, developing SSR markers has not been easy in non-model plants because of insufficient genomic information (Gong et al. 2008; Ma et al. 2009). Therefore, an efficient protocol named dpsLCW was devised in this study to identify pSSR motifs for marker development and to support fundamental research on non-model plants.

The dpsLCW was applied to find pSSRs between two E. hamiltonianus phenotypes (EH-n and EH-v). Twenty SSR primer pairs designed by dpsLCW produced amplification products (EhSSR01-20). Among them, polymorphisms were observed in seven primer pairs (35% of the designed EhSSR primers), including the dominantly amplified marker EhSSR08. The PCR products of those primer sets were largely consistent with the estimated sizes (Supplementary Fig. S2). The non-amplification of EhSSR08 in EH-v might be caused by nucleotide polymorphisms in the PBS. The other, unexpectedly non-polymorphic PCR products may have been due to paralogous sequences in E. hamiltonianus. Candidate paralogous pairs were observed in some sequences with polymorphic sequences in the flanking regions of SSR motifs (Supplementary Fig. S3). Sequencing or assembly errors during the “PE joining & Singleton selection” step of dpsLCW could be another reason for unexpected non-polymorphic PCR products. Nevertheless, the 20 SSR markers designed by dpsLCW showed polymorphisms at a high success rate (7 out of 20). The seven markers and the other 142 contigs with pSSR candidates will provide fundamental information for the evaluation of useful genetic resources and breeding in E. hamiltonianus. The dpsLCW could be a considerably useful pipeline for polymorphic marker development and could be applied to other non-model plants.

The advantages of dpsLCW

The dpsLCW is a relatively economical and efficient method for non-model or non-sequenced plants because the pipeline requires only two small-scale WGS datasets, regardless of whether a reference or long sequence information is available. Although recent advances in single-molecule real-time sequencing can significantly increase the read length (Eid et al. 2009), NGS technology could still be more compatible with dpsLCW because of its low cost and high-throughput productivity for wide genomic coverage.

The dpsLCW is considerably precise in detecting authentic pSSRs. In this study, 35% (7 of 20) of the tested SSR markers were successfully identified as polymorphic (Supplementary Fig. S2). Unlike conventional approaches, two factors may contribute to the high success rate and the low rate of multiple bands: (1) The dpsLCW method can select pSSR candidates through direct comparison between WGS reads from two datasets. Conventional research on pSSR identification required time-consuming wet experiments because of low polymorphic rates among the SSR candidates (with a success rate of less than 5%), since it utilized sequences harboring the SSR motif from only one genotype (Choi et al. 2011; Kim et al. 2012; Izzah et al. 2014). (2) The dpsLCW can remove highly abundant sequences such as repetitive DNA sequences through the filtering steps “Primary NR filtering by clustering” and “Secondary NR filtering by alignments”. Repetitive DNA, especially retrotransposons, can also be used as molecular markers in the form of inter-retrotransposon amplified polymorphism (IRAP), sequence-specific amplified polymorphism (S-SAP), and retrotransposon-microsatellite amplified polymorphism (REMAP) (Kalendar and Schulman 2006). However, most retrotransposon-based molecular markers usually produce multiple bands with a wide range of amplicon lengths and are strongly affected by the genomic tendencies of the targeted retrotransposons. Additionally, prior information on retrotransposons in the plant under study is needed. Moreover, a considerable portion of the dual-filtered contigs might be composed of NR regions or genic regions, such as those represented by ESTs, in the genome. To verify whether the dual-trimmed contigs of EH-n were related to genic regions, they were mapped with ESTs of E. alatus, which belongs to the same genus, Euonymus. The 3,279,262 ESTs used for mapping were produced by the 454 sequencing platform (SRA accession no. SRA025080). Among the 7,629 contigs, 1,334 (17%) were mapped with ESTs of E. alatus. It may be deduced that approximately one in six of the contigs filtered through dpsLCW was related to a genic region. However, it is hard to say that the rest of the contigs were not related to genic regions, because the number of targeted ESTs of E. alatus was too small.

The dpsLCW is scalable. The designed SSR primer sets could be successfully applied to related species because of the variability of SSR motifs in plants. Furthermore, other molecular marker systems, such as cleaved amplified polymorphic sequences (CAPS) or derived cleaved amplified polymorphic sequences (dCAPS) based on single nucleotide polymorphisms (SNPs) or insertions and deletions (INDELs) in genome sequences, might be developed through dpsLCW. Moreover, our protocol could be applied to WGS data of plant species registered in public databases.

However, the dpsLCW has its limitations. One limitation is that the read length of the WGS data affects the assembly process and its efficiency. In this study, reads with an average length of 300 bp generated on the Illumina MiSeq platform were used, and thus dpsLCW was able to easily select sequences long enough to develop SSR markers with potential PBS. When using the relatively short Illumina reads of HiSeq, NextSeq, or other platforms, a de novo assembly process or control of the WGS library size is required to generate long contigs. The other limitation is that the NR filtering steps by clustering and alignment in dpsLCW cannot perfectly filter out all repeat-rich regions in the genome because of the small amount of WGS data.

Solution for genomic study of non-model plants using low coverage WGS

This study has shown that fundamental genomic analysis can be achieved with only low coverage WGS (LCW) data by using an efficient bioinformatics pipeline. The key concept of the pipeline is that repetitive sequences in plant genomes or cells are abundant enough to be sufficiently represented in LCW data. Organelle genome and 45S nrDNA sequences could be completely assembled and phylogenetically analyzed using dnaLCW (Kim et al. 2015). On the same basis, comparative analysis of pSSRs was conducted by dpsLCW. Recently, repeat analysis has been demonstrated by genomic quantification using LCW (Lee et al. 2017b), and a large number of plants can be simultaneously investigated through multiplexed sequencing technology (Smith et al. 2010). These results indicate that fundamental and comparative genomic analysis in non-model plants can be successfully conducted using our bioinformatics pipeline (Fig. 4).

ACKNOWLEDGEMENTS

This work was carried out with the support of “Cooperative Research Program for Agriculture Science & Technology Development (Project No. PJ013238)” Rural Development Administration, and the Bio & Medical Technology Development Program of the NRF funded by the Korean government, MSIP (NRF-2015M3A9A5030733), Republic of Korea.

Figures
Fig. 1. Leaf morphologies of two E. hamiltonianus phenotypes. (A, C) Leaves and unopened flowers of normal plants. (B, D) Variegated leaves and unopened flowers of a natural mutant plant that is being developed as an ornamental cultivar. The pictures were taken at Hantaek Botanical Garden on May 12, 2017.
Fig. 2. Analysis of complete chloroplast genome and 45S nrDNA units of E. hamiltonianus. (A) Chloroplast genome map of E. hamiltonianus. The complete chloroplast genome sequence was annotated by the DOGMA program. The map was generated using OGDRAW. Genes in the inner circle and outer circle are transcribed clockwise and anti-clockwise, respectively. The GC content is displayed in the inner ring together with the internal blocks of the chloroplast genome: long single copy section (LSC), inverted repeat B (IRB), short single copy section (SSC), and inverted repeat A (IRA). (B) Phylogenetic analysis of the chloroplast genome sequences of E. hamiltonianus and eleven different species. The green, blue, and red letters indicate the cohorts of Commelinids, Asterids, and Rosids, respectively. The tree was generated by maximum likelihood using MEGA 7. (C) Schematic diagram of a complete 45S nrDNA unit of E. hamiltonianus. The WGS reads of EH-v were mapped again to the assembled EH-v 45S nrDNA unit. GC content per 100-bp unit length is indicated by the red line.
Fig. 3. Pipeline for dpsLCW. (A) Steps for WGS reads of reference and alternative species are shown in different colored boxes, blue and red, respectively. The adopted programs in each protocol are shown in orange boxes. (B) Yellow colored, variously colored, and grey colored small rectangles indicate SSR motifs, reference WGS reads, and alternative WGS reads, respectively. Same colored rectangles of reference WGS represent homologous WGS reads. WGS reads in the pink box indicate one cluster in step 4. The WGS reads in light grey regions in step 4 and 5 were not used in further steps.
Fig. 4. A schematic pipeline showing various genomic analyses using plant LCW.
Tables

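Table 1. Assembly status of chloroplast genome and nrDNA of two E. hamiltonianus phenotypes.

| Name | Chloroplast genome: Length (bp) | Chloroplast genome: Coverage (x) | Chloroplast genome: Aligned reads | nrDNA: Length (bp) | nrDNA: Coverage (x) | nrDNA: Aligned reads |
|---|---|---|---|---|---|---|
| E. hamiltonianus (normal) | 157,360 | 72.72 | 46,187 | 5,824 | 381.69 | 9,500 |
| E. hamiltonianus (variegated) | 157,360 | 707.52 | 441,491 | 5,824 | 355.17 | 8,829 |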

Table 2. WGS reads of two E. hamiltonianus accessions used in the dpsLCW pipeline.

| Phase | Contents of phase | EH-n (normal): Reads | EH-n (normal): Bases | EH-v (variegated): Reads | EH-v (variegated): Bases |
|---|---|---|---|---|---|
| | Raw data | 1,637,296 | 477,360,027 | 1,625,572 | 481,075,391 |
| Step 1 | Trimmed data | 1,613,494 | 426,154,231 | 1,600,199 | 423,728,477 |
| Step 2 | PE joined & singleton (≥ 250) | 805,049 | 284,940,618 | | |
| Step 3 | Reads including SSR motif | 19,053 | 7,169,090 | | |
| Step 4 | Singlet SSR reads | 7,629 | 2,564,370 | | |
| Step 5 | Non-repeat SSR reads | 1,459 | 564,592 | | |
| Step 6 | Reads containing pSSR | 161 | 60,859 | | |

Table 3. Primer information of candidate pSSR markers designed in this study.

| Marker ID | SSR motif (EH-n) | SSR motif (EH-v) | Contig length (bp) | Estimated PCR product size, EH-n (bp) | Estimated PCR product size, EH-v (bp) | Primer sequences | Description based on Blastx searches (e-value) |
|---|---|---|---|---|---|---|---|
| EhSSR01 | (TC)17 | (TC)6 | 488 | 254 | 232 | F GAAATTGTGCACTCCCCTGTT; R TCTCAAAATGCGAAGCGCAG | |
| EhSSR02 | (TC)17 | (TC)7 | 404 | 220 | 200 | F CGGATCAACCAGTCGTCCAA; R TACTGTGCTAGCCCAAACCG | XP_011035889 probable methyltransferase PMT23 (4e-04) |
| EhSSR03 | (AT)16 | (AT)6 | 463 | 215 | 195 | F GGTGCAGGTTCAGAAAGGCT; R AGAGCCAAATCGACAAAAAGGG | |
| EhSSR04 | (AGA)10 | (AGA)4 | 393 | 280 | 262 | F TCACTAACCTGCTTGCACCAA; R GAGAGCGATGAAGATGCGTG | |
| EhSSR05 | (GA)12 | (GA)3 | 287 | 188 | 174 | F TAGTAGTCGAGTGGGATGGGG; R TCATGTGCCACCGAAATACCAA | |
| EhSSR06 | (GAAAGGA)6 | (GAAAGGA)4 | 284 | 207 | 193 | F CCGAGCCGGATCTTGAAAGT; R TGGATAGGTCCGGATTGCCT | |
| EhSSR07 | (CT)26 | (CT)19 | 379 | 298 | 284 | F TGTGTGGCCAAGACACAAGT; R ACTGGCAACTTTCCTAGACTGA | |
| EhSSR08 | (GA)16 | (GA)10 | 477 | 273 | 261 | F CCAGCAAAAGCTTAAGGAAACGA; R GCACATCTCCATTGCAAGTTCA | |
| EhSSR09 | (TA)11 | (TA)5 | 453 | 283 | 271 | F GGCCTCGTTACTGCTATGCT; R TGCCATCGTATTTGGGTCCT | XP_015382307 patatin-like protein 2, partial (3e-59) |
| EhSSR10 | (CTG)7 | (CTG)3 | 430 | 232 | 220 | F GCCATGGACTAATTGCTGCG; R TGGGACCAACAAGCCAACAT | XP_002526966 protein SIEVE ELEMENT OCCLUSION B (0.073) |
| EhSSR11 | (GA)9 | (GA)16 | 301 | 231 | 244 | F ACGTCACATCCACCATGCAA; R ATGGCATTCCGTCCGTGATT | |
| EhSSR12 | (AT)10 | (AT)6 | 371 | 236 | 228 | F GAATGCATGCCACTCCAACA; R ATAAGCAATTGGGGAACCTAGTA | XP_013315502 hypothetical protein PV05_07244 (5.7) |
| EhSSR13 | (AG)8 | (AG)5 | 270 | 251 | 245 | F TCAGGTCTTGCAGTCTCTGATTT; R GAAGAAGGGGCAGAGGTTGTT | XP_015868974 uncharacterized protein LOC107406380 (3e-13) |
| EhSSR14 | (TGT)6 | (TGT)4 | 270 | 179 | 173 | F ACATACACGCACCTTAGGTCA; R CAATCGCAGCAGCAACAGTATC | XP_018845393 DEAD-box ATP-dependent RNA helicase 8-like (4e-04) |
| EhSSR15 | (TA)7 | (TA)4 | 286 | 236 | 230 | F AGTCCCCGCTAAGAGGCATA; R AACACAGAGAAGTCTGCGGG | |
| EhSSR16 | (ACA)6 | (ACA)4 | 532 | 202 | 196 | F AGGACAGACATGGCCTTTCAC; R CCGAGAAGTTCGGAGGTTGT | XP_016689476 homeobox protein knotted-1-like 3 (2e-07) |
| EhSSR17 | (ATG)8 | (ATG)6 | 407 | 220 | 211 | F CCAAAGCGAGATGAGTGTGTTAAT; R TCGTCCAGTTGGGGTCCTTT | XP_002301160 hypothetical protein POPTR_0002s12380g (9e-14) |
| EhSSR18 | (GGT)5 | (GGT)3 | 436 | 269 | 263 | F GTTGGTTTATCTGGGTTGGCT; R ATTGGGTGAGCAGCACTGTA | |
| EhSSR19 | (ATAA)5 | (ATAA)4 | 301 | 170 | 166 | F TGCACAAGAGTTCTTTATTTCAGCA; R GCAGTAGCTTAGCATGGGTCA | |
| EhSSR20 | (ATGT)5 | (ATGT)4 | 301 | 155 | 147 | F AGCTTGGCTTGCCTTTTTCAG; R ACAATTATGGATGCATTTGTTGTTT | |

References
  1. Abdelkrim J, Robertson B, Stanton JA, Gemmell N. 2009. Fast, cost-effective development of species-specific microsatellite markers by genomic sequencing. Biotechniques. 46: 185-192.
  2. Agarwal M, Shrivastava N, Padh H. 2008. Advances in molecular marker techniques and their applications in plant sciences. Plant Cell Rep. 27: 617-631.
  3. Allen GC, Flores-Vergara MA, Krasynanski S, Kumar S, Thompson WF. 2006. A modified protocol for rapid DNA isolation from plant tissues using cetyltrimethylammonium bromide. Nat Protoc. 1: 2320-2325.
  4. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
    Pubmed KoreaMed CrossRef
  5. Arabidopsis Genome Initiative 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 408: 796-815.
    Pubmed CrossRef
  6. Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 30: 2114-2120.
    Pubmed KoreaMed CrossRef
  7. Choi H-I, Kim NH, Kim JH, Choi BS, Ahn I-O, Lee J-S, et al. 2011. Development of reproducible EST-derived SSR markers and assessment of genetic diversity in Panax ginseng cultivars and related species. J Ginseng Res. 35: 399-412.
    Pubmed KoreaMed CrossRef
  8. da Maia LC, Palmieri DA, de Souza VQ, Kopp MM, de Carvalho FI, de Oliveira AC. 2008. SSR Locator: tool for simple sequence repeat discovery integrated with primer design and PCR simulation. Int J Plant Genomics. 2008: 412696.
    Pubmed KoreaMed CrossRef
  9. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, et al. 2009. Real-time DNA sequencing from single polymerase molecules. Science. 323: 133-138.
    Pubmed CrossRef
  10. Gong L, Stift G, Kofler R, Pachner M, Lelley T. 2008. Microsatellites for the genus Cucurbita and an SSR-based genetic linkage map of Cucurbita pepo L. Theor Appl Genet. 117: 37-48.
    Pubmed KoreaMed CrossRef
  11. Grover A, Sharma PC. 2016. Development and use of molecular markers: past and present. Crit Rev Biotechnol. 36: 290-302.
    Pubmed CrossRef
  12. Izzah NK, Lee J, Jayakodi M, Perumal S, Jin M, Park BS, et al. 2014. Transcriptome sequencing of two parental lines of cabbage (Brassica oleracea L. var. capitata L.) and construction of an EST-based genetic map. BMC Genomics. 15: 149.
    Pubmed KoreaMed CrossRef
  13. Kalendar R, Schulman AH. 2006. IRAP and REMAP for retrotransposon-based genotyping and fingerprinting. Nat Protoc. 1: 2478-2484.
    Pubmed CrossRef
  14. Katoh K, Misawa K, Kuma K, Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30: 3059-3066.
    Pubmed KoreaMed CrossRef
  15. Kelkar YD, Strubczewski N, Hile SE, Chiaromonte F, Eckert KA, Makova KD. 2010. What is a microsatellite: a computational and experimental definition based upon repeat mutational behavior at A/T and GT/AC repeats. Genome Biol Evol. 2: 620-635.
    Pubmed KoreaMed CrossRef
  16. Kim K, Lee SC, Lee J, Lee HO, Joh HJ, Kim NH, et al. 2015. Comprehensive survey of genetic diversity in chloroplast genomes and 45S nrDNAs within Panax ginseng species. PLoS One. 10: e0117159.
    Pubmed KoreaMed CrossRef
  17. Kim K, Lee SC, Lee J, Yu Y, Yang K, Choi BS, et al. 2015. Complete chloroplast and ribosomal sequences for 30 accessions elucidate evolution of Oryza AA genome species. Sci Rep. 5: 15655.
    Pubmed KoreaMed CrossRef
  18. Kim NH, Choi HI, Ahn IO, Yang TJ. 2012. EST-SSR marker sets for practical authentication of all nine registered ginseng cultivars in Korea. J Ginseng Res. 36: 298-307.
    Pubmed KoreaMed CrossRef
  19. Kumar S, Stecher G, Tamura K. 2016. MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Mol Biol Evol. 33: 1870-1874.
    Pubmed CrossRef
  20. Lee J, Joh JH, Kim N-H, Lee S-C, Jang W, Choi BS, et al. 2017a. High-throughput development of polymorphic simple sequence repeat markers using two whole genome sequence data in Peucedanum japonicum. Plant Breed Biotech. 5: 134-142.
    CrossRef
  21. Lee J, Waminal NE, Choi HI, Perumal S, Lee SC, Nguyen VB, et al. 2017b. Rapid amplification of four retrotransposon families promoted speciation and genome size expansion in the genus Panax. Sci Rep. 7: 9045.
    Pubmed KoreaMed CrossRef
  22. Long EO, Dawid IB. 1980. Repeated genes in eukaryotes. Annu Rev Biochem. 49: 727-764.
    Pubmed CrossRef
  23. Ma K-H, Kim N-S, Lee G-A, Lee S-Y, Lee JK, Yi JY, et al. 2009. Development of SSR markers for studies of diversity in the genus Fagopyrum. Theor Appl Genet. 119: 1247-1254.
    Pubmed CrossRef
  24. Meng W, Fei X, Peng Y, Duan X-Y, Zhou Y-L, Shen C-Y, et al. 2014. Development of SSR markers for a phytopathogenic fungus, Blumeria graminis f. sp. tritici, using a FIASCO protocol. J Integr Agric. 13: 100-104.
    CrossRef
  25. Metz S, Cabrera JM, Rueda E, Giri F, Amavet P. Array. FullSSR: Microsatellite Finder and Primer Designer. Adv Bioinformatics. Article ID 6040124
    Pubmed KoreaMed CrossRef
  26. Metzker ML. 2010. Sequencing technologies - the next generation. Nat Rev Genet. 11: 31-46.
    Pubmed CrossRef
  27. Michael TP, VanBuren R. 2015. Progress challenges and the future of crop genomes. Curr Opin Plant Biol. 24: 71-81.
    Pubmed CrossRef
  28. Mittal N, Dubey AK. 2009. Microsatellite markers-A new practice of DNA based markers in molecular genetics. Pharmacogn Rev. 3: 235-246.
  29. Palmer JD. 1985. Comparative organization of chloroplast genomes. Annu Rev Genet. 19: 325-354.
    Pubmed CrossRef
  30. Park Y-J, Lee JK, Kim N-S. 2009. Simple sequence repeat polymorphisms (SSRPs) for evaluation of molecular diversity and germplasm classification of minor crops. Molecules. 14: 4546-4569.
    Pubmed KoreaMed CrossRef
  31. Qiu YL, Lee J, Bernasconi-Quadroni F, Soltis DE, Soltis PS, Zanis M, et al. 1999. The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nature. 402: 404-407.
    Pubmed CrossRef
  32. Richard GF, Kerrest A, Dujon B. 2008. Comparative genomics and molecular dynamics of DNA repeats in eukaryotes. Microbiol Mol Biol Rev. 72: 686-727.
    Pubmed KoreaMed CrossRef
  33. Rogers SO, Bendich AJ. 1987. Heritability and variability in ribosomal RNA genes of Vicia faba. Genetics. 117: 285-295.
    Pubmed KoreaMed
  34. Rozen S, Skaletsky H. 2000. Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol. 132: 365-386.
  35. Shendure J, Ji H. 2008. Next-generation DNA sequencing. Nat Biotechnol. 26: 1135-1145.
    Pubmed CrossRef
  36. Smith AM, Heisler LE, St Onge RP, Farias-Hesson E, Wallace IM, Bodeau J, et al. 2010. Highly-multiplexed barcode sequencing: an efficient method for parallel analysis of pooled samples. Nucleic Acids Res. 38: e142.
    Pubmed KoreaMed CrossRef
  37. Soltis PS, Soltis DE, Chase MW. 1999. Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature. 402: 402-404.
    Pubmed CrossRef
  38. Varshney RK, Nayak SN, May GD, Jackson SA. 2009. Next-generation sequencing technologies and their implications for crop genetics and breeding. Trends Biotechnol. 27: 522-530.
    Pubmed CrossRef

