Genome Characterization of Probiotic Lactobacillus delbrueckii Subsp. bulgaricus Strain NDO2 Isolated from Traditional Yogurt using High Throughput Next Generation Sequencing

Mati Ullah1*, Muhammad Rizwan2, Ali Raza3, Xianlin Zhao4, Yanling Sun4, Sarah Gul5, Muhammad Ihtesham Waheed6, Muhammad Nadeem Khan7, Alvina Gul8, Sami Ullah Jan9* and Chao Huang4* 1Department of Microbiology, Faculty of Veterinary & Animal Sciences, The University of Agriculture, Dera Ismail Khan, Pakistan 2College of Animal Sciences and Technology, Anhui Agricultural University, Hefei, Anhui, China 3Hefei National Laboratory for Physical Sciences at Microscale, School of Life Sciences, University of Science and Technology of China, Hefei, Anhui, China 4Biotechnology Research Center, Shandong Academy of Agricultural Sciences, Jinan, China 5Department of Biological Sciences, Faculty of Basic and Applied Sciences, International Islamic University, Islamabad, Pakistan 6Department of Plant Breeding and Genetics, University of Agriculture, Peshawar 7Department of Microbiology, Faculty of Biological Sciences, Quaid-I-Azam University, Islamabad, Pakistan 8Atta-Ur-Rahman School of Applied Biosciences, National University of Sciences and Technology, H-12 Campus, Islamabad, Pakistan 9Department of Bioinformatics and Biosciences, Faculty of Health & Life Sciences, Capital University of Science and Technology, Islamabad, Pakistan Article Information Received 08 July 2021 Revised 20 January 2022 Accepted 04 February 2022 Available online 07 April 2023 (early access)


INTRODUCTION
Y ogurt, the fermented milk product is a famed dairy item around the globe and is usually consumed in various O n l i n e

F i r s t A r t i c l e
yogurts can improve lactose digestion (Lerebours et al., 1989;Labayen et al., 2001), affect the intestinal transit time (Conway et al., 1987), and improve the immune system of the gut ( Van de Water et al., 1999;Aldinucci et al., 2002). Lactic acid bacteria (LAB) are the main and prominent microbial component of dairy-related fermented foods. The member of this important group of bacteria is used as a starter culture in the fermentation process. These are gram positive bacteria and are present as indigenous microflora in milk and yogurts. This specific group of bacteria contains numerous species from the genus Lactobacillus, Leuconostoc, and Lactococcus (Tserovska et al., 2002;Papadimitriou et al., 2015). LAB serves as a natural source of preservation, associated with the texture, flavor, and nutritional value of the fermented foods (Atrih et al., 2001;Messens and De, 2002;Mohd Adnan and Tan, 2007). Probiotics are live microorganisms which when administered in adequate amounts confer a health benefit to the host (Hill et al., 2014;Papadimitriou et al., 2015). Numerous strains of different probiotic bacterial species are associated with human health. The reduction of cholesterol level, modulation of immunity, and effective treatment of atopic dermatitis, diarrhea, and Crohn's disease are linked with probiotics (Reid, 1999).
Lactobacillus delbrueckii subsp. Bulgaricus is a gram positive, facultatively anaerobic, and industrially important lactic acid bacteria. This species is not native to humans, however, can survive the gastric transit (Mater et al., 2005). It has been used for decades in combination with Streptococcus thermophilus and has been very important industrially ever since. Both species have synergistic association, but thickness and aroma of the yogurt is linked with Lactobacillus delbreuckii's pH lowering and acetaldehyde producing abilities (Gezginc et al., 2015). Particularly, L. delbreuckii subsp bulgaricus NDO2, due to its moderate acidity, high viscosity and excellent water holding capacity (Sun et al., 2011). In combination with other gastrointestinal flora including, but not limited to Lactobacillus casei, Lactobacillus acidophilus, and Lactobacillus rhamnosus, L. delbreruki has been used to improve nutritional values of the yogurt (Heller, 2001;Vaughan et al., 2002). Some strains of L. delreuckii has been reported as Bacteriocin producers, which suggests that strains of this specie have defensive properties and can be used as preservatives and therapeutic agents (Boris et al., 2001;Simova et al., 2008). The L. delbreuckii subspp bulgricus NDO2 has been known resistant to freeze drying temperature and desiccation which further confirms it a best starter candidate for industrial use (Shao et al., 2014). Initial results of L. delbreucki research in recent years claim that this specie has potential to manage many medical complications like fatty liver disease, flue, Antibiotic associated diarrhoea, Inflammatory bowel disease, eczema, hay fever, colic, tooth decay and enterocolitis (Hempel et al., 2012;Ghouri et al., 2014;Ouwehand et al., 2016).
No matter how many benefits a bacterium may have, according to FDA, it cannot be introduced to the food chain for use until its genome is known (Degnan, 2008). Genomic studies of bacteria are vital as it authenticate the experimental work done in laboratories, enlightens the possibilities for future research and discovers safety fears. Bacteria are genetically flexible and have high adoptability with respect to habitual and nutritional situations. Hence, making their characteristic strain specific. Because, Lactic acid bacteria's genome is fragile and is prone to shredding, it is possible that its genome may have time dependent variation. The application of high throughput next-generation sequencing in microbial genome characterization has gained much attention over the last few years. In the current study, we used this emerging sequencing approach to comprehensively analyze the genome of industrially and medically important probiotic L. delreuckii NDO2 bacteria isolated from traditional yogurt sold in the Karak district of Khyberpakhtoonkhwa, Pakistan. The findings of the current study can provide a detailed insight into the genome of L. delbreuckii NDO2 as well as advance the dairy research in terms of starter culture characterization using the next-generation sequencing platform.

Yogurt samples
Samples of cow's milk artisanal yogurt were collected from local vendors in the Karak district of Khyber Pakhtunkhwa, Pakistan. Sterile sampling bottles with and ice pack were used for collection of the samples. Soon after collection the samples were sent to the laboratory.

Isolation and identification of probiotic bacterial species
The yogurt samples were aseptically homogenized, and one milliliter of the sample was diluted in 9 milliliters of phosphate-buffered saline (PBS) followed by serial dilutions. Different dilutions were plated on Man-Rogosa-Sharpe (MRS) pH 5.2 agar plates. Plates were incubated in triplicate under anaerobic conditions at 37, and 45 °C for 24-72 h. The Yuejin anaerobic incubator was used for probiotic bacterial growth (Shanghai Yuejin Medical Instruments Co., Ltd, Shanghai, China). Colonies from plates were sub-cultured till obtaining pure isolated colonies. The pure culture isolates were used for Gram reaction, cell morphology, etc.

DNA extraction
The pure bacterial colony from the agar plate was grown overnight on MRS broth followed by DNA extraction using a Tiangen bacterial DNA isolation kit (Tiangen, Beijing, China). Further, Qubit 2.0 fluorometer (Invitrogen, Carlsbad, CA, USA) was used to quantify the extracted DNA.

DNA library preparation and sequencing
DNA Libraries for high-throughput sequencing were made by commercially available Vazyme TruePrep™ DNA Library Prep Kit V2 for Illumina (Vazyme Biotech, Nanjing, China). The sequencing of the sample libraries was achieved on an Illumina HiSeq-2000 sequencing platform (BGI-Shenzhen, Shenzhen, China).

Bioinformatics analysis
Trimming, quality control, and de novo assembly Trimming of the sequencing reads was completed by means of the Trimmomatic version 0.38 (Bolger et al., 2014). The contaminated reads were screened through BWA aligned to the reference sequence. The assembly of the high-quality sequencing reads was made with shovill (version 1.1.0, https://github.com/tseemann/shovill) assembler. The de novo assembly results were evaluated using the Quast software (Gurevich et al., 2013). DNA Plotter (Carver et al., 2009), an interactive Java application was used for generation of circular DNA plots. The draft genome of isolate LB2 was taken as the reference genome, and randomly connected the Contig sequences of the LB2 genome to form a pseudo-single genome sequence. The Prokka program was used for annotations and Align the Y14 and Y16 genomes to the reference genomic group constructed by LB2, using a minimum similarity of 70%, and convert the comparison results into GFF format. The results were imported in the form of image.

Genome annotation
The assembled genome was annotated through the Prokka (version 1.14.5, 10.1093/bioinformatics/btu153) pipeline. The subsystem identification in the genome and functions assigning was carried out through the proteincoding sequences (CDS). The comparison of the assembled genome sequences was assessed in the SEED Subsystem (Overbeek et al., 2014). The evaluation of orthologous genes was performed via clusters of orthologous genes (COGs) (Galperin et al., 2015) and eggnog (Huerta-Cepas et al., 2016). To examine the high level functions of the three isolates, the Kyoto encyclopedia of genes and genomes (KEGG) database (https://www.genome.jp/ kegg/) was retrieved.

Genome analysis
The pan-genome analysis of the three genomes was done by roary (version 3.13.0, 10.1093/molbev/msz284) pipeline. Briefly, the protein sequences from gff 3 files were extracted to a reduced set of protein sequences followed by an all against all comparison with BLASTP. The sequence identity was set 98% with e-value 1E-6. Based on BLASTP similarity score, sequences were clustered with a Markov cluster (MCL) algorithm. Further, a concatenated core gene alignment sequence was produced using PRANK (10.1007/978-1-62703-646-7_10).
The raw data used in the current study have been deposited in NCBI and allocated the accession number of PRJNA664263.

Sequencing data and assembled genome properties
The summary and statistics of the sequence data and assembly results is shown in Table I. The number of clean reads used in the analysis was in the range of 2303582 to 6604528, with a mean value of 4848434.29. The GC contents were in the range between 38.79 to 49.09%, with 45.3% is the mean value. The coverage and GC contents of each identified species/strains in the corresponding sample are shown in Figure 1 as well as in Figure 2. The assembled contigs from the genome sequencing data of the three bacterial genomes vary significantly and were in the range of 116 (lowest) for sample NY4 and 1605 (highest) for sample LB2. The other important genome features characterized in the current analysis are summarized in Table I.

Comparative genome analysis
The annotated genomes of the three isolates of the L. delbrueckii subsp. bulgaricus strain NDO2 were mutually compared. The pan-genome analysis of the three isolates of L. delbrueckii subsp. bulgaricus NDO2 identified a total of 2284 genes among which 970 represent core genes and 1314 represent shell genes respectively. The three isolates of the species L. delbrueckii subsp. bulgaricus NDO2 share a similar core genome, however, regarding the accessory genes pattern, the isolates of Y14 and Y18 showed more similarity compare to LB2. The pattern of the number of conserved genes shows a continuous decline compare to the number of total genes as the number of genomes embedded in the analysis. Similarly, the number of both unique and new genes declined as the number of genomes were embedded into the analysis. Graphical summarized illustration of the comparative gene analysis is presented in Figure 3. Detail of the pangenome analysis is given in Supplementary Table SI.

Functional genome annotation Subsystem category distribution
The subsystem category distribution of the dominant systems identified in the three annotated genomes using the SEED database. Among these subsystems, carbohydrates was the largest subsystem, followed by O n l i n e

F i r s t A r t i c l e
Genome Characterization of Probiotic Lactobacillus delbrueckii 5 protein metabolism. The other dominant systems were DNA metabolism and nucleoside and nucleotide (Fig. 4). Besides this, numerous other subsystems with different numbers were also identified. Their detail is provided in Supplementary Tables SII, SIII, SIV).

Genome annotation with eggNOG database
The eggnog database, hosted by the European molecular biology laboratory, is a database of orthologous groups and function annotation of the genome. The functional study analysis of the three annotated genomes revealed that the COGs associated with carbohydrate transport and metabolism were dominant followed by those associated with Replication, recombination, and repair (Fig. 5). The other dominant categories with most COGs associated were amino acid transport and metabolism and translation ribosomal structure and biogenesis, respectively. For viewing the results in detail Supplementary data can be accessed (Supplementary  Tables SV, SVI, SVII).

KEGG analysis
The information provided by the KEGG database is related to the high level functions for high throughput sequencing data regarding genetic information processing, metabolism, cellular processes, human diseases, environmental information processing, drug development, and organismal systems. Using the KEGG database pathways as a reference, the genome content in the current analysis were characterized to identify the biological pathways. Most gene contents were involved in metabolic pathways, biosyn thesis of secondary metabolites, microbial metabolism in diverse environments, and biosynthesis of antibiotics respectively (Fig. 6). More detail of the results is given in Supplementary Tables SVIII, SIX, SX.

DISCUSSION
Bacterial strains used as starter cultures for fermentation of yogurt are of inordinate curiosity because they have a direct effect on nutritional, therapeutic and commercial values of the final product. In dairy industry for fermentation of milk, commercially available strains of two bacteria, S. thermophilus and L. bulgaricus are most used. These strains can produce yogurt with necessary properties in a gainful way, but some wild strains to be discovered may be superior to them and upsurge the qualities of the product, reduce the expenses for yogurt manufacture, or yield new types of yogurt with different properties. To translate microbial potentials, complete genome sequencing of industrially important microorganisms holds vital significance. In order to understand and easily utilize important bacteria, whole genome sequencing of multiple species and re-sequencing of existing genomes are extremely needed.
In this study, we aimed to isolate L. debreukii subsp bulgaricus NDO2 strains from traditional yogurt and performed their whole genome sequencing for functional understanding and comparison. This study is also important because no other genome of this strain has yet submitted to the NCBI genome data base (Sun et al., 2011).
The number of clean reads used in the analysis was in the range of 2303582 to 6604528, with a mean value of 4848434.29. while in case of the initial reported strain NDO 2 total of reads were reported to be 8,778,388 that constituted complete genome of ND02 having 2,125,753 base pairs (Sun et al., 2011). The GC contents we analyzed for the genomes were in the range between 38.79 to 49.09%, with 45.3% the mean value. The published literature supports ours results as lactic acid bacteria have GC contents ranging from 32 to 52%/ (Mendes-Soares et al., 2014). While GC contents of the NDO2 which is used as a reference strain are 49.56% (Sun et al., 2011). The difference in percentage of GC content is directly linked with genomic size and structure, and ecological niche to which a microorganism belongs, thus it may vary even for strains (Šmarda et al., 2014). The assembled draft genome contains a mean of 1620 coding sequences that encodes 1749 genes. The submitted genome of early reported strain contains 2151 coding sequences with 2177 genes (Sun et al., 2011). The variation in coding sequences and gene number is also strain dependent trait.
The assembled contigs from the genome sequencing data of the three bacterial genomes vary significantly and were in the range of 116 (lowest) for sample NY4 and 1605 (highest) for sample LB2. The effect of contig number on the functional annotation has not been proven by the literature or there is any standard for the number of contigs. The variation in contig number is associated with the technology being used for sequencing and methodology used for assembly (Smits, 2019).
The pan-genome analysis of the three isolates of L. delbrueckii subsp. bulgaricus identified a total of 2284 genes among which 970 represent core genes and 1314 represent shell genes respectively (need to know functions of core genes and shell genes). The three isolates of the species L. delbrueckii subsp. bulgaricus share a similar core genome, however, regarding the accessory genes pattern, the isolates of Y14 and Y18 showed more similarity compare to LB2. The pattern of the number of conserved genes shows a continuous decline compare to the number of total genes as the number of genomes embedded in the analysis. Similarly, the number of both unique and new genes declined as the number of genomes were embedded into the analysis. Differences in the accessory genes is associated with strain specific characteristics (Pintara et al., 2020).
The subsystem category distribution of the dominant systems identified in the three annotated genomes using the SEED database. Among these subsystems, carbohydrates was the largest subsystem, followed by protein metabolism. The other dominant systems were DNA metabolism and nucleoside and nucleotide. Milk contain high amount of carbohydrates (i.e 5g/100ml) followed by proteins (3.5g/100ml) and fats (1.5/100ml). That is carbohydrate metabolism genes are most common in lactic acid bacteria followed by proteins and fats metabolisms (Buron-Moles et al., 2019). The eggnog database, hosted by the European molecular biology laboratory, is a database of orthologous groups and function annotation of the genome. The functional study analysis of the three annotated genomes revealed that the COGs associated with carbohydrate transport and metabolism were dominant followed by those associated with replication, recombination, and repair. The other dominant categories with most COGs associated were amino acid transport and metabolism and translation ribosomal structure and biogenesis, respectively the eggnog database, hosted by the European molecular biology laboratory, is a database of orthologous groups and function annotation of the genome. The functional study analysis of the three annotated genomes revealed that the COGs associated with carbohydrate transport and metabolism were dominant followed by those associated with Replication, recombination, and repair. The other dominant categories with most COGs associated were amino acid transport and metabolism and translation ribosomal structure and biogenesis, respectively.

Funding
This research received no external funding.

O n l i n e F i r s t A r t i c l e
Genome Characterization of Probiotic Lactobacillus delbrueckii 7 Data availability statement All the data related to this manuscript are included in tables, figures, and supplemental files

Supplementary material
There is supplementary material associated with this article. Access the material online at: https://dx.doi. org/10.17582/journal.pjz/20210708190732

Statement of conflicts of interest
The authors have declared no conflict of interest.