
Although milk thistle can be distinguished from
A total of six milk thistle accessions were used in this study. Four were obtained from the National Agrobiodiversity Center, National Institute of Agricultural Sciences, Rural Development Administration (K001033 from Canada, K044886 from Germany, K153821 from North Korea, K227004 from Moldova), and the other two were purchased from a local market in Gyeonggi-do, Korea (unknown genetic sources; ‘912036’ and ‘912171’ from EL&I, Co., Ltd.). The collected seeds were grown and observed in pots to develop homogeneous plants. The selfed seeds of the six accessions were sown separately in May. DNA was extracted from a single plant of each accession by the cetyltrimethylammonium bromide (CTAB) method (Murray and Thomson 1980). Each DNA sample was quantified with a NanoDrop 2000 (Thermo Fisher Scientific, USA), and only high-quality DNA samples were used for genome sequencing.
An Illumina paired-end (PE) library with a 400 bp insert size was constructed according to the manufacturer’s recommendations and sequenced on an Illumina NovaSeq platform with 2 × 150 bp reads. Low-quality sequences (Phred score ≤ 20) and Illumina adapter sequences were removed from the raw fastq files using Trimmomatic v.0.39 (http://www.usadellab.org/cms/?page=trimmomatic), and chloroplast sequences were collected by mapping the trimmed fastq files to the chloroplast sequence of milk thistle (GenBank acc# KT267161) using BWA (v0.7.17).
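To make the quality threshold concrete, the following is a minimal sketch of end trimming on a single read, assuming Phred+33 quality encoding. It is a simplified illustration only and does not reproduce Trimmomatic's actual algorithm or the specific trimming steps used in this study, which are not stated; the function names `error_probability` and `trim_low_quality_ends` are hypothetical.

```python
def error_probability(phred: int) -> float:
    """Convert a Phred score Q to a base-call error probability, 10^(-Q/10)."""
    return 10 ** (-phred / 10)


def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in quality]


def trim_low_quality_ends(seq: str, qual: str, threshold: int = 20):
    """Remove bases with Phred score <= threshold from both ends of a read.

    Simplified illustration; not Trimmomatic's implementation.
    """
    scores = phred_scores(qual)
    start, end = 0, len(scores)
    while start < end and scores[start] <= threshold:
        start += 1
    while end > start and scores[end - 1] <= threshold:
        end -= 1
    return seq[start:end], qual[start:end]


# '#' encodes Q2, 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
trimmed_seq, trimmed_qual = trim_low_quality_ends("ACGTACGT", "##IIII5#")
print(trimmed_seq, trimmed_qual)
```

A Phred score of 20 corresponds to an expected error rate of 1 in 100 base calls, which is why it is a common cut-off.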
Whole-genome sequences from the other five accessions were aligned to the ‘912036’ chloroplast sequence using BWA-MEM (v.0.7.17-r1188), and variants were called using the Genome Analysis Toolkit (GATK v.3.8). Variants were filtered using vcftools (v.0.1.15), excluding those that met any of the following conditions: read coverage < 5; genotype quality < 20; missing genotypes > 20%.
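The filtering thresholds can be read as follows; this is a sketch of one plausible interpretation, not the vcftools implementation or the exact command used. It assumes that a genotype call failing the depth or quality threshold is treated as missing, and that a site is kept only if no more than 20% of its genotypes are missing. The `Genotype` class and `keep_site` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Genotype:
    depth: Optional[int]    # read coverage at the site; None if no call
    quality: Optional[int]  # genotype quality (GQ); None if no call


def keep_site(genotypes: List[Genotype],
              min_depth: int = 5,
              min_gq: int = 20,
              max_missing: float = 0.20) -> bool:
    """Return True if a variant site passes the three thresholds.

    Assumption: calls below the depth or quality threshold count as missing.
    """
    if not genotypes:
        return False
    missing = 0
    for g in genotypes:
        if (g.depth is None or g.quality is None
                or g.depth < min_depth or g.quality < min_gq):
            missing += 1
    return missing / len(genotypes) <= max_missing


# Five accessions, one of which has low coverage at this site.
site = [Genotype(30, 99), Genotype(28, 99), Genotype(3, 99),
        Genotype(41, 60), Genotype(35, 99)]
print(keep_site(site))
```

With five accessions, a single failing call gives exactly 20% missing genotypes, so such a site sits on the boundary of the filter under this reading.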
Based on our preliminary chemical analysis and the agronomic traits of the six sequenced plants, ‘912036’ was selected for chloroplast genome construction. ‘912036’ produced the highest level of silybin B (around 3.50 mg/g) in dried seeds and showed the most typical shape of flower sets with vigorous thorns.
After trimming, 127.7 million reads covering 18.9 Gb were retained from a total of 149 million raw reads (about 22.5 Gb). About 7.9% of total reads (∼10 million reads) were identified as chloroplast reads in chloroplast mapping and they were used for assembly (Table 1). Chloroplast genome sequence was assembled
Table 1 . Pre-processing statistics of the sequencing products of the chloroplast of a
Data set | Reads | Length (bp) | Ratio | Q30 (%) | Q20 (%) | GC (%) |
---|---|---|---|---|---|---|
Raw Data | 149,012,860 | 22,500,941,860 | - | 88.61 | 95.32 | 36.23 |
Trimmed Data | 127,679,160 | 18,932,927,244 | 84.14% | 92.36 | 97.83 | 35.66 |
CP Data | 10,036,686 | 1,495,223,450 | 7.90% | 92.56 | 97.92 | 37.72 |
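The percentages in Table 1 can be reproduced from the base counts. The short sketch below assumes that each percentage is the number of bases at one step divided by the number of bases at the preceding step (trimmed over raw, chloroplast over trimmed); the variable names are illustrative.

```python
# Base and read counts taken from Table 1.
RAW_BASES = 22_500_941_860
TRIMMED_BASES = 18_932_927_244
CP_BASES = 1_495_223_450
TRIMMED_READS = 127_679_160
CP_READS = 10_036_686


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


trimmed_pct = percent(TRIMMED_BASES, RAW_BASES)    # bases kept after trimming
cp_pct = percent(CP_BASES, TRIMMED_BASES)          # chloroplast bases among trimmed bases
cp_read_pct = percent(CP_READS, TRIMMED_READS)     # chloroplast reads among trimmed reads

print(f"trimmed / raw bases:   {trimmed_pct:.2f}%")   # 84.14%
print(f"CP / trimmed bases:    {cp_pct:.2f}%")        # 7.90%
print(f"CP / trimmed reads:    {cp_read_pct:.2f}%")
```

The read-based and base-based chloroplast fractions differ slightly because trimmed reads vary in length.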
The complete chloroplast genome of
Table 2 . The complete chloroplast genome structure of a
Structure | Length (bp) | GC (%) | Start | End |
---|---|---|---|---|
LSC | 83,535 | 35.81 | 1 | 83,535 |
IR | 25,195 | 43.1 | 83,536 | 108,730 |
SSC | 18,631 | 31.45 | 108,731 | 127,361 |
IR | 25,195 | 43.1 | 152,556 | 127,362 |
Total | 152,556 | 37.69 | | |
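As a consistency check on Table 2, the region lengths should sum to the genome size, each length should match its coordinates, and the length-weighted mean of the regional GC contents should agree with the total GC content up to rounding. A minimal sketch (the tuple layout and variable names are illustrative):

```python
# (name, length in bp, GC %, start, end) taken from Table 2.
# The second IR is listed with its coordinates in reverse order in the table.
REGIONS = [
    ("LSC", 83_535, 35.81, 1, 83_535),
    ("IR", 25_195, 43.1, 83_536, 108_730),
    ("SSC", 18_631, 31.45, 108_731, 127_361),
    ("IR", 25_195, 43.1, 152_556, 127_362),
]

total_length = sum(length for _, length, _, _, _ in REGIONS)

# Length-weighted mean GC content across the four regions.
weighted_gc = sum(length * gc for _, length, gc, _, _ in REGIONS) / total_length

# Each region's length should equal the span of its coordinates,
# regardless of the order in which start and end are written.
spans_ok = all(abs(end - start) + 1 == length
               for _, length, _, start, end in REGIONS)

print(total_length)            # 152556
print(f"{weighted_gc:.2f}")    # 37.69
print(spans_ok)                # True
```

Because the regional GC values are themselves rounded, the weighted mean is expected to match the reported total only to about two decimal places.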
A total of 87 protein-coding genes with 104 exons were annotated (Fig. 1 and Table 3). The average size of the protein-coding sequences is 854 bp, with a G+C content of 38.51%. In addition, 37 tRNAs and eight rRNAs were annotated in the chloroplast DNA. Most photosynthesis-related genes were located within the LSC region.
Table 3 . Annotation result of
Annotation Info | Value |
---|---|
Genome size (bp) | 152,556 |
Genome G+C content (%) | 37.69 |
Protein-coding genes | 87 |
Exons | 104 |
Protein coding (%) (excluding introns) | 48.7 |
Average protein-coding sequence size (bp) | 854 |
Average exon size (bp) | 715.1 |
Protein-coding G+C content (%) | 38.51 |
tRNAs | 37 |
tRNA G+C content (%) | 52.78 |
rRNAs | 8 |
rRNA G+C content (%) | 55.21 |
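The protein-coding percentage in Table 3 follows from the gene count and the average coding sequence size. The sketch below recomputes it in two ways, from genes and from exons, assuming the reported averages are rounded means; the variable names are illustrative.

```python
GENOME_SIZE = 152_556   # bp
N_GENES = 87
AVG_CDS_SIZE = 854      # bp, average protein-coding sequence size
N_EXONS = 104
AVG_EXON_SIZE = 715.1   # bp

# Total coding length estimated from genes and, independently, from exons.
coding_from_genes = N_GENES * AVG_CDS_SIZE
coding_from_exons = N_EXONS * AVG_EXON_SIZE

pct_from_genes = 100.0 * coding_from_genes / GENOME_SIZE
pct_from_exons = 100.0 * coding_from_exons / GENOME_SIZE

print(f"{pct_from_genes:.1f}")   # 48.7
print(f"{pct_from_exons:.1f}")   # 48.7
```

Both estimates agree with the reported 48.7% at one decimal place; the small difference between them reflects rounding of the average sizes.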
The evolutionary history was inferred using the Neighbor-Joining method (Saitou and Nei 1987). The optimal tree with the sum of branch length = 0.07854853 is shown (Fig. 2). The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (1000 replicates) is shown next to the branches (Felsenstein 1985). The evolutionary distances were computed using the Maximum Composite Likelihood method (Tamura
The chloroplast genome assembled in this study was very close to
The NGS sequences of the other five milk thistle accessions were mapped against the ‘912036’ reference, but we found no sequence polymorphisms among them, although they originated from different European and Asian countries. Therefore, the chloroplast genome from this study can be used to develop
The identification of useful herbal and medicinal plants is becoming increasingly important worldwide. Genome sequence information makes it possible to produce uniform, certified seed and to identify plants correctly. At this point, utilizing chromosomal DNA for species identification will be useful.
This work was supported by the Cooperative Research Program for Agriculture Science & Technology Development (Project No. PJ01418503) of the Rural Development Administration, Republic of Korea.