Informative Regions In Viral Genomes

Viruses, far from being just parasites affecting hosts’ fitness are major players in any microbial ecosystem. In spite of their broad abundance, viruses, in particular bacteriophages remain largely unknown since only about 20% of sequences obtained from viral community DNA surveys could be annotated by comparison to public databases. In order to shed some light into this genetic dark matter we expanded the search of orthologous groups as potential markers to viral taxonomy from bacteriophages and included eukaryotic viruses, establishing a set of 31,150 ViPhOGs. To this, we examine the non-redundant viral diversity stored in public databases, predict proteins in genomes lacking such information and all annotated and predicted proteins were used to identify potential protein domains. Clustering of domains and unannotated regions into orthologous groups was done using cogSoft. Finally, we employed a supervised classification method by using a Random Forest implementation to classify genomes into their taxonomy and found that the presence/absence of ViPhOGs is significantly associated with their taxonomy. Furthermore, we established a set of 1,457 ViPhOGs that given their importance for the classification could be considered as markers/signatures for the different taxonomic groups defined by the ICTV at the Order, Family, and Genus levels. Both, the ViPhOGs and informative ViPhOGs sets are available at the Open Science Framwork (OSF) under the ViPhOGs project (https://osf.io/2zd9r/).


Introduction
Viruses are entities widely spread all around the biosphere. It is estimated that viral particles are ten times more abundant than other types of microorganisms and although their inclusion in a new life domain remains controversial, it is clear that they are not merely parasites (Ignacio-Espinoza et al. 2013;Martínez Martínez et al. 2014;Koonin et al. 2015). Viruses actively participate in ecosystem remodelling, population dynamics and a wide variety of ecological, biogeochemical, genetic and physiological processes (Kristensen et al. 2010).
Despite their importance and abundance, the viral diversity has not been well characterized. Difficulties in the isolation of pure cultures and the description of viral cycles are common limitations in virus research, since less than 1% of environmental microorganisms can be grown in the laboratory (Hugenholtz et al. 1998). Next generation sequencing techniques enabled us to partially overcome these difficulties, revolutionizing this field of virology. Deep sequencing surveys of viral communities (viromes) have revealed a diversity beyond all expectations, and they have evidenced the lack of knowledge we currently have on global viral diver-sity. Only about 25% of sequences in viromes from marine environments have a match (e-value <= 0.001) to a known sequence (Breitbart and Rohwer 2005;Hurwitz et al. 2016;López-Bueno et al. 2009). The large diversity and the lack of universal molecular markers make it difficult to organize and characterize known and new viral genomes.
Currently, computational tools such as VIROME (Wommack et al. 2012), MetaVir2 (Roux et al. 2014), and MG-RAST (Keegan et al. 2016) use available molecular databases to analyze genomes and metagenomes. Nevertheless, they are limited by considering only the fraction of the data generated that has significant similarity to previously annotated data. New approaches based not only on annotated sequence comparisons, but also on all the available information promise to be useful for analyzing individual viral genomes and viromes. For instance, Skewe-Cox et al., used protein sequence clustering to generate viral profile Hidden Markov Models ("vFams") that were subsequently used for classifying highly divergent sequences (Skewes-Cox et al. 2014). Furthermore, the identification of highly conserved genes in specific taxonomic groups has been another approach for the taxonomic classification of viral sequences. For instance, diversity analyses of cyanophages, algae viruses, T4 and T7 phages have been conducted following this method, and they enabled the characterization of the viral diversity in the studied environments (Zhong et al. 2002;Short and Suttle 2005;Fujihara et al. 2010). However, the mentioned studies were restricted to specific families and ecosystems.
Using another approach, Kristensen et al. constructed a collection of phage orthologous groups ("POGs") from bacterial and archeal viruses (Kristensen et al. 2011;Kristensen et al. 2013). Orthologous gene sets are widely used as a powerful technique in comparative genomics and for viruses it has been suggested that marker genes could be obtained using this technique (Kristensen et al. 2011;Kristensen et al. 2013). Based on the same concept, eggNOG has implemented in its latest version a database of orthologous groups focused on viruses (Viral OGs) (Powell et al. 2014). However, they do not include the whole breadth of viral diversity represented in the public databases, since genomes of eukaryotic viruses were not included by either of the mentioned studies. Given the large amount of currently available genetic information, it is imperative to develop and implement new tools to reliably and efficiently analyze these data to better describe viral diversity.
Computational techniques have been used previously for similar issues in biology. Machine learning methods, for D R A F T example, are algorithms which learn through the experience, attempting to classify information according to shared features. These techniques allow us to extract patterns, trends and finally analyze the information using a non-deterministic way. Supervised learning algorithms such as Random Forest, Support Vector Machine and Neural Networks have been successfully introduced to solve complex biological problems, such as image analysis, microarray expression analysis, QTLs analysis, detection of transcription start sites, epitopes detection, protein identification and function, among others (Libbrecht and Noble 2015;Vervier et al. 2016).
In this work we used the methodology proposed by Kristensen et al., for the identification of gene and domain orthologous groups from related viral sequences. We expanded the reach of this approach by incorporating genomes of eukaryotic viruses and applying Random Forest as a machine learning strategy to identify taxonomically informative orthologous groups. We denominated the set of orthologous groups as ViPhOGs (Virus and Phages Orthologous Groups).

Results
The viral diversity represented in public (NCBI) databases. We searched for all viral genomes stored in either RefSeq or Genbank databases (April 2015) and obtained 50,728 entries by using the selected set of queries (see methods). Depuration of the search results led to the exclusion of 6,617 entries, as some of the keywords in their descriptions indicated that they did not correspond to complete genomes. Entries kept following the depuration step were clustered based on sequence identity in order to collapse near identical viral sequences, which resulted in an overall 57% reduction that reflects a very high redundancy in the searched databases. Bacteriophage sequences decreased from 3,573 to 2,071 (57.9%), while sequences of eukaryotic viruses went from 40,538 to 13,011 (32.1%). Finally, a second de-replication at the protein level was conducted following the prediction of genes for those genome accessions without protein annotations (see methods).This process led to a final reduced set of 14,057 entries, comprising 1,974 bacteriophages and 12,083 viruses. Those accessions are considered as the non-redundant viral diversity stored in NCBI public databases (Table S1).
According to the type of genetic material stated in the description of the accessions, these were categorized into: double stranded DNA (dsDNA), single stranded DNA (ss-DNA), double stranded RNA (dsRNA), single stranded RNA -despite the sense-(ssRNA) and retro-transcribing viruses (rtviruses). A total of 1,122 accessions didn't have a complete taxonomic annotation at the moment of the study; those accessions were found to be either unclassified phages (66), unclassified viruses (80), satellites (203) or assemblies from marine metagenomes (773). The longest genome belonged to the dsDNA virus Pandoravirus salinus (NC_022098) with a genome length of 2,473,870 bp, whereas the smallest genome belongs to the ssRNA Lucerne transient streak satellite virus with 324 bp. In general, DNA viruses are larger than RNA viruses (Mannwhitney test: p-value=1.12e-135). A compar-ison of the genome length distribution of phages and eukaryotic viruses shows that dsDNA and ssDNA phages tend to have larger genomes than dsDNA and ssDNA eukaryotic viruses (Mannwhitney test when comparing dsDNA viruses: p-value=7.19e-06; Mannwhitney test when comparing ss-DNA viruses: p-value=3.08e-64). In the case of ssRNA viruses, eukaryotic viruses tend to have larger genomes than phages (Mannwhitney test when comparing ssRNA viruses: p-value=7.89e-16) (Figure 1).
The set of 14,057 non-redundant genomes code for a total of 442,007 proteins, where we observe that the number of genes is directly proportional to the length of the genome. Interestingly, a linear regression suggests a gene density of 12 proteins per kilobase in the case of phages, while in the case of eukaryotic viruses the gene density is only about 2.5 proteins per kilobase; indicating a lower gene density for eukaryotic viruses in comparison with phages ( Figure 2).

Viruses and Phages Orthologous Groups (ViPhOGs).
We searched for domains in all identified and predicted proteins using InterProScan (Jones et al. 2014). Domains were found only in 52.59% of the proteins (232,033 proteins), which means that even for the sequences stored in public databases half of the information belongs to the viral dark matter. In an attempt to gain further information, unannotated proteins were used as a query against vFams, but only 39,344 (8.9%) proteins had a significant match to entries in this database.
Given the proportion of unannotated sequences, protein regions without domain or vFam annotation were also considered for further analysis, meaning that the final set of viral regions consisted of: 365,368 annotated regions (309,251 InterPro domains + 56,137 vFam matches), 157,591 unannotated regions of at least 40 residues (69,333 regions in between annotated domains + 88,258 regions in between vFam matches) and 170,637 unannotated proteins (proteins with no hit to vFam or InterProScan). Viruses and Phages Orthologous Groups (ViPhOGs) were built from the symmetric best matches between viral regions using the software COGsoft

Fig. 2. Gene density.
Scatter plot showing the gene density of (A) eukaryotic viruses and (B) phages. Viruses are colored by the type of genetic material and each dot represents a genome. Best-fit lines with 95% confidence intervals from linear regression are plotted. (Kristensen et al. 2010) (see methods). A total of 31,150 orthologous groups with at least three members were obtained. Interestingly, most of the ViPhOGs were built from a single type of viral regions: unannotated proteins (9,953), unannotated regions (8,103) or annotated regions (8,023). Among all possible combinations of viral region types, the highest number of clusters was obtained for the combination of unannotated proteins and unannotated regions (2,309) ( Figure 3). This suggests that although the large majority of regions and proteins in viral genomes are uncharacterized, they are conserved among the different viruses used.
The median amount of regions clustered in a ViPhOG was 5 (IQR:3,11) with the largest ViPhOG having 3,440 regions from 1,180 different genomes of both phages and eukaryotic viruses. This large ViPhOG contained regions mainly annotated as Helicases. However, it was not the only ViPhOG that comprised a rather large number of regions, as a total of 1,081 ViPhOGs contained more than 100 regions (Table S2). In terms of the host type, we found 14,746 ViPhOGs represented exclusively by eukaryotic viral genomes, 10,100 ViPhOGs represented only by phage genomes and the remaining 6,304 ViPhOGs were represented by both phages and eukaryotic viral genomes. As a ViPhOG may include paralogs, any given genome can contribute with several regions to a single ViPhOG. However, the number of regions per genome for each ViPhOG was on average 1.008 (max: 9.162), which indicates that the vast majority of ViPhOGs are composed of orthologs instead of paralogs. This is also evidenced in Table S2, where most of the ViPhOGs have the same number of genomes as regions, indicating that each genome contributed only one region to each orthologous group.
A Random Forest classifier correctly classifies viral genomes according to the presence of ViPhOGs. We used the Scikit-learn Random Forest implementation to test if ViPhOGs can be used as features to predict taxonomy (see methods). From the model testing, it was determined that in general a model with 60 estimators had a reasonably good balance between classification-score and computation time, as models with 60 estimators result in a high classification score and less variance ( Figure 4). Therefore, a Random Forest classifier with 60 estimators was run separately for each of the evaluated taxonomic levels. For each taxonomic level, all genomes classified into a taxon at the analysed level were used. For the Order level the algorithm received a matrix of 1,031 ViPhOGs and 4,698 genomes to classify into 7 Orders. In the case of Family the matrix contained 11,328 ViPhOGs and 11,978 genomes from 84 different Families, and for the Genus level, the size of the matrix was 20,310 ViPhOGs and 10,151 genomes from 335 different genera. For each case, matrices were split in 100 train/test (70:30 distribution) sets. The mean accuracy score achieved was 99.06%, 95.60%, and 89.58% for Order, Family and Genus, respectively (Figures 5, 6 and 7).
As the classification score suggests, the chosen algorithm excelled at accurately classifying the genomes into their respective taxonomic groups; where most of the misclassification cases were a small proportion of the genomes represented by each of the assessed taxons ( Figure 8). The 10 most common classification mistakes per taxonomic level are shown in the Table S3. Although we observed that the classification error does not perfectly correlate with the total number of available genomes for the classification, it is evident that the lower the number of genomes available for a given taxon, the higher the classification error ( Figure 9).
Informormative ViPhOGs: signatures of the taxonomy of viral genomes. As the good performance of the models suggests, there is a set of ViPhOGs that allows to identify with high accuracy the taxonomic group of each genome. Therefore, we aimed at minimizing the number of ViPhOGs capable of reaching high accuracy of classification for each given taxonomic level (see methods). That approach allowed us to determine that a reduced set of 20 ViPhOGs was enough to achieve a high accuracy score for the Order level. For the Family and Genus levels, the number of ViPhOGs needed was 388 and 1,392, respectively ( Figure 10). We designated those ViPhOGs as "Informative ViPhOGs". Their taxonomic assignment and their functional annotation (if available) is presented in the Table S4.
We used the set of Informative ViPhOGs to build an UPGMA tree that delineates the viral diversity clustering, based on the presence/absence of these genomic characteristics (see methods). Genomes that belong to a defined Or-  der constitute a defined clade in the tree. Furthermore, in all cases except for Caudovirales, there was a consistent branching of the Orders containing the corresponding Families and Genera designated by the International Committee on Taxonomy of Viruses (ICTV) (Figure 11). A closer look at the tree revealed some interesting viral features.

D R A F T
(i) Bacteriophages. The Family Tectiviridae (dsDNA viruses) shares a clade with the genus Rosemblanvirus and Salasvirus, both members of the Podoviridae Family (also dsDNA viruses from the order Caudovirales), while the other bacteriophage families Inoviridae, Microviridae (both ssDNA), and Leviviridae (ssRNA) appear as independent clades without shared characteristics among them or members of the Order Caudovirales. Regarding archaeal viruses, the family Fuselloviridae and the Order Ligamenvirales, which includes the families Rudiviridae and Lipothrixviridae, form a single clade. Furthermore, most of the families of archaeal viruses had very few representatives, revealing a bias in the explored diversity.
(ii) Characteristics shared between Eukaryotic and Bacterial viruses. The Order Herpesvirales appears as a sister clade of a subset of the Caudovirales, in particular, members of the Myoviridae Family. Moreover, the Nucleo-Cytoplastmatic Large DNA Viruses (NCLDV) group is enclosed by the Caudovirales clade. We looked for ViPhOGs present in members of all NCLDV families and found ViPhOGs number 937 and 1598, which are associated with helicase domains and, as mentioned before, prevalent among dsDNA viruses in general; ViPhOG 821, which codes for a ribonucleotide reductase and is shared mainly among members of the families Myoviridae, Herpesviridae and Poxviridae; ViPhOG 865, a serine/threonine kinase domain, and ViPhOG 72, an EF binding domain, both also very common in members of Herpesviridae.

D R A F T
(iii) RNA viruses. Those from the family Chrysoviridae (dsRNA) grouped with Totiviridae (dsRNA), in particular with the Genus Totivirus that also infects fungi. Interestingly, there was a clade formed by different positive sense ssRNA viruses, which had no other common taxonomic assignment. This clade included the Families Tombusviridae, Nodaviridae, Bromoviridae, Virgaviridae, Togaviridae, Hepeviridae, Closteroviridae and the Families of the Order Tymovirales.

Discussion
In recent years, as metagenomics has revealed a great diversity of phages from different biomes, and evidenced the huge unexplored diversity that the viral world holds. The use of genetic signatures to characterize the diversity of specific viruses and phages have been successfully applied to describe specific groups of viruses in diverse contexts (Dwivedi et al. 2013;Sakowski et al. 2014 (Grazziotin et al. 2017). Here, we followed the methodology by Kristensen et al., and took it a step further by extending the search of orthologous groups beyond bacterial and archeal viruses to also include eukaryotic viruses, which consolidated a final set of 14,057 non-redundant genomes.
We took a non-waste-information approach in order to get orthologous groups among (i) annotated domains, (ii) unannotated regions of annotated proteins and (iii) unanno-tated proteins, all derived from the non-redundant diversity of viruses stored in public databases. This strategy proved to be useful given that a large majority of the ViPhOGs were constituted solely or in combination of unannotated regions or unannotated proteins. Finally, we established a comprehensive set of 31,150 orthologous groups that we denominated ViPhOGs.
As the ICTV provides a single classification scheme that reflects the evolutionary relationship among viruses, we evaluated the possibility that the presence/absence of ViPhOGs in viral genomes reflected the ICTV taxonomy using a machine learning approach. The low misclassification scores reached by the Random Forest algorithm suggested that the use of ViPhOGs as features for performing taxonomic classification of viruses had great potential. Therefore, we determined the subset of informative ViPhOGs that could be considered as markers/signatures for the different taxonomic groups defined by the ICTV at the Order, Family, and Genus levels.
We found a high degree of agreement between clustering identified using informative ViPhOGs and the monophyletic Orders described by ICTV. Example of those were the Nidovirales (Gorbalenya et al. 2006), Ligamenvirales (Prangishvili and Krupovic 2012), Mononegavirales (Afonso et al. 2016), and Tymovirales (Martelli et al. 2007) whose branches shows a clear separation in accordance to the proposed Families and Genera.
Importantly, the current approach has been consistent with recent changes in the ICTV taxonomy. For example the family Pneumoviridae, whose members were considered as a subfamily of the family Paramyxoviridae up till 2016 (Afonso et al. 2016;Rima et al. 2017). Our classification mechanism used the ICTV classification from 2014, but was capable of showing the separation of the family Paramyxoviridae. The tree clearly showed how viruses from the Genera Metapneumovirus and Pneumovirus (now known as Metapneumovirus and Orthopneumovirus, respectively) form a separate clade (now Family Pneumoviridae), whose sister clade is the Paramyxoviridae Family.
Despite the absence of a link between several ss-RNA(+) families in the ICTV taxonomy of 2014, in the ViPhOG-based tree built here families Tombusviridae, Nodaviridae, Bromoviridae, Virgaviridae, Closteroviridae and Hepeviridae were grouped together with the families of the Order Tymovirales. In the most recent ICTV taxonomy the families Bromoviridae, Virgaviridae, Togaviridae and Closteroviridae were assigned to the Order Martellivirales; and the family Hepesviridae to the Order Hepelivirales. These two new Orders (Hepelivirales and Martellivirales), together with the Tymovirales, belong now to the Class Alsuviricetes and the Phylum Kitrinoviricota. Furthermore, in our tree the families Tombusviridae and Nodaviridae are a sister clade of what seems now to be the Alsuviricetes clade. Tombusviridae and Nodaviridae belong now to the Orders Tolivirales (Class:Tolucaviricetes) and Nodamuvirales (class:Magsaviricetes), respectively. All classes, together with Alsuviricetes, constitute now the Kingdom Kitrinoviri- cota. This suggests that a tree generated with conserved amino acid features could identify basal evolutionary relationships among viruses matching the new scope of the ICTV (International Committee on Taxonomy of Viruses. 2020).

D R A F T
Misclassification cases were very limited and more common in the lowest taxonomic level (Genus) than in the highest taxonomic level (Order). Although we did not observe a perfect negative correlation between the number of genomes available and the number of misclassification cases, we did observe that Genera like Mupapillomavirus, Yetapoxvirus, Kappapapillomavirus and Families such as Alphatetraviridae and Amalgaviridae with two or three genomes available per taxa were frequently misclassified. Another kind of misclassification event occurred between related taxa, such was the case for viruses from the Genus Vesiculovirus that were confounded with members of the Genus Sprivivirus. Both Genera belong to the Family Rhabdoviridae. Interestingly, this pair of genera share more ViPhOGs between each other than against any other member of the Family Rhabdoviridae. This observation could be the basis of a more in-depth study which could potentially lead to the suggestion of both genera being part of a new sub-family which separates them from the rest of the family. Lastly, regarding misclassification events, we want to acknowledge that there is still a place for improvement of the classification models. We identified misclassification cases where a taxon was misclassified and the confusion does not appear to be directed by genomic relatedness. As an example we chose to discuss the case of the Caulimoviridae Family. This Family had 123 representative genomes in our database

D R A F T
and in 10% of the cases it was misclassified as Myoviridae. Only a few representative genomes of each Family have (at most) 3 ViPhOGs in common (ViPhOGs number 731, 1158, and 269). Those 3 ViPhOGs are not informative ViPhOGs for Myoviridae, and appear to be present in several different viral families and clades, therefore there is no clear answer to why the classifier confused these two unrelated families.

Fig. 7. Random Forest accurately classifies genomes into their respective
Genera. Heatmap representing the confusion matrix obtained after classifying viral genomes at the Genus level. The color code indicates the proportion of genomes of the genus in the x-axis classified as a genome of the genus in the y-axis. Genome names were removed for ease of interpretation.
One of the major strengths of the presented work is that, in addition to genomes of prokaryotic and archeal viruses, we included genomes of eukaryotic viruses. As expected, not a single ViPhOG was present in all viral genomes. Viruses do not encode for ribosomes or any other universal markers that allow the study of their phylogenetic relationships. Furthermore, it has been accepted that viruses have not evolved from a single common ancestor (Holmes 2011;Koonin et al. 2015;Iranzo et al. 2016;Koonin and Yutin 2018), which might be reflected in the high number of polytomies observed in the informative ViPhOGs tree. Besides the absence of an universal ViPhOG, a not negligible number of ViPhOGs were formed by regions from phages and eukaryotic viruses. Further analyses would be needed to determine if the fact that a ViPhOG is shared between eukaryotic and prokaryotic/archeal viruses is due to functional convergence, or if it is because those viruses presumably have an evolutionary relationship as is the case for Herpesviridae and Siphoviridae (Baker et al. 2005;Rixon and Schmid 2014;Andrade-Martínez et al. 2019) or as ssRNA(+) viruses, which presumably co-evolved with their hosts before they split into eukaryotes (Koonin et al. 2015;Wolf et al. 2018).
The fact that a machine learning approach, based solely on genomic features reached a high score when classifying viruses in their assigned taxa, highlights how the viral taxonomy based on ecological (e.g. pathogenicity and host range) and molecular (e.g. composition of the virus genome and sequence similarity) features is a robust system able to depict the evolutionary relationships among viruses. The informative ViPhOGs dataset is, therefore, nothing but a reflection of the efforts done to establish a taxonomic system for viruses and the strength of machine learning algorithms that were able to depict patterns among a comprehensive dataset. We consider that the result, the ViPhOGs and the informative ViPhOGs datasets, may be used as a start point to hypothesize about the genetic relationships among known viral groups and as an useful tool to attempt to characterize and define the viral dark matter that is being exposed via metagenomics. We released the ViPhOGs dataset hoping that: (i) the community can use it as a tool to explore the genetic relationships among viral clades encouraging viral research, (ii) to facilitate the exploration of specific viral groups by the use of its ViPhOGs, and (iii) to obtain viral profiles in specific biomes. We want to encourage the community to exploit the benefits of the use of this comprehensive set of orthologous groups in a world of fast evolving entities that quickly lose their protein sequence conservation. previously used by Kristensen et al. (Kristensen et al. 2013 To download complete genomes stored in the Genbank database but not in the RefSeq database, the negation of the proposition srcdb_refseq[Properties] was used in the queries.

Methods
Following the retrieval of viral and phage genome sequences, a series of filtering steps was applied to the query results to remove incomplete genome sequences and to reduce the redundancy in the dataset. First, keyword depuration: all entries with the words "segment", "ORF", "Gene", "mutant", "protein", "complete sequence", "Region", "CDS", "UTR", "recombinant" or "terminal repeat" were removed. Second, genomic dereplication: genomes were clustered to a 95% se- quence identity over the full length of the shorter sequence using CD-HIT-EST (Li and Godzik 2006). Only the representative sequence of each cluster, the longest one according to CD-HIT documentation, was used for further analysis. Finally, dereplication at the protein level: as in Kristensen et al. (Kristensen et al. 2011), to remove redundancy of sequences that are not in synteny but are essentially the same genome, genomes were clustered according to their protein content using a complete linkage approach. To identify shared proteins among genomes, all proteins from all genomes were grouped at a global sequence identity of 99% using CD-HIT. Genomes coding for 20 proteins or less must share all the proteins to be D R A F T Fig. 11. UPGMA tree representation of the non-redundant viral diversity. Unrooted (center) and circular middle point rooted (outer circle) representations of an UPGMA tree of the non-redundant viral diversity, based on the presence/absence of informative ViPhOGs. Colored branches highlight ICTV designated Orders. Bold black branches highlight phage families without an order assignment. Names between the trees indicate the name of the ICTV taxonomic order colored in the same color for the branches. Tip labels indicate the Family of each genome and are colored to facilitate their differentiation within an Order not to provide a different color to each family. clustered while genomes coding more than 20 proteins must share at least 90% of their proteins to be clustered.
Nucleotide and protein sequences from the genomes that passed the aforementioned filters were considered as the non-redundant viral diversity available in NCBI at the time this study was conducted and were used for further analyses.
Gene prediction. Genes were predicted for genomes without any annotation using Glimmer (Delcher et al. 1999) as implemented in RAST-tk (Brettin et al. 2015), Gene-MarkS (v.2.0) (Borodovsky and Lomsadze 2014) and Prodi-gal (v.2.6.3) (Hyatt et al. 2010). The protein prediction was carried out separately for viruses and phages; and in the particular case of GeneMark, it was possible to specify if the genome was single or double stranded according to the taxonomy annotation of each genome. Predicted proteins per genome were de-replicated using CD-HIT at 99% sequence identity to collapse the predictions made by the 3 packages.

Domain prediction and ViPhOGs.
To split proteins into their component domains we first used InterProScan, which combines several signature recognition methods to predict D R A F T the presence of functional domains (Jones et al. 2014). Domains were extracted and protein regions without domain annotations and comprising at least 40 residues were also kept.
In an attempt to get an annotation for proteins and protein regions without InterProScan annotations, we used them as queries against vFams. a database of Hidden Markov Models (HMMs) built from viral RefSeq proteins (Skewes-Cox et al. 2014). Protein sequences that matched entries in the vFam database inherited the corresponding annotations, whenever these were available.
After this process, complete proteins without any domain annotation, protein regions of at least 40 residues, and domains identified by either InterProScan or vFams were considered for further analysis and referred as viral regions from now on. Orthologous groups were built from the symmetric best matches between viral regions using the software COGsoft (Kristensen et al. 2010), and only considering matches with an E-value < 0.1 and that covered at least 50% of the viral region lengths. The clusters of orthologous groups built from viral regions of viruses and phages are hereafter denominated ViPhOGs (Viruses and Phages clusters of Orthologous Groups).

Random Forest classifiers.
To test if the ViPhOGs can be used as a set of features that defines every virus or phage genome in our dataset, we aimed to correctly assign viral taxa to each genome according to the presence/absence of ViPhOGs. To solve this supervised learning task, we used the scikit-learn implementation of the Random Forest Classifier algorithm (Pedregosa et al. 2011) to, independently, perform the classification process at three different taxonomic levels: Order, Family and Genus. For each taxonomic level half of the genomes were randomly chosen for training, while the other half were used for testing the classifiers. Although randomly chosen, we constrained the selection of genomes to balance the taxa represented in both the training and testing sets. As a consequence, taxons with a single representative were not included in the classification process.
To evaluate the effect of the number of estimators in the classification we varied the number of estimators from 10 to 100 (by increments of 10 on each test), using 50 random training/testing sets for each model. The mean of the generalization score and the elapsed real time were the variables analysed to set the optimal number of estimators in the final model.
After establishing the number of estimators for the model, we sought to reduce the number of features or ViPhOGs used for the classification. A set of ViPhOGs was pre-selected by calculating both sensitivity/Precision (SP) and mutual information (MI) metrics as in Reyes 2015(Reyes et al. 2015. The ViPhOGs selected as features were those showing both, (i) a low SP index (high sensitivity and precision for the evaluated taxa) and (ii) a high MI index for the evaluated taxa. The selected set of ViPhOGs were used for the Random Forest algorithm to solve the classification problems. This time, 100 random training/testing sets were used for each model. Selection of informative ViPhOGs. In order to identify the most informative ViPhOGs for each classification model, features were ranked in descending order according to their mean Gini importance. Then each model was run again several times, but each time the number of features was reduced by 1 5 of the number used in the previous run to exclude the least important ViPhOGs. This process was repeated until the model was runned with the 4 most important ViPhOGs. Finally, the classification score of each iteration was plotted as a function of the number of features, and the smallest set of features that reached the highest mean classification score was selected as the set of informative ViPhOGs for each model. To depict how the viral diversity is related (or not), we built an UPGMA tree based on the presence/absence of informative ViPhOGs for each genome.