Sparse Feature Selection for Classification and Prediction of Metastasis in Endometrial Cancer: Extended Abstract

Metastasis via pelvic and/or para-aortic lymph nodes is a major risk factor for endometrial cancer. Lymph-node resection ameliorates risk but is associated with significant co-morbidities. Incidence in patients with stage I disease is 4-22%, but no mechanism exists to accurately predict it. Therefore, national guidelines for primary staging surgery include pelvic and para-aortic lymph node dissection for all patients whose tumor exceeds 2 cm in diameter. We sought to identify a robust molecular signature that can accurately classify risk of lymph node metastasis in endometrial cancer patients. We introduce a new feature selection algorithm, lone star, for applications where the number of samples is far smaller than the number of measured features per sample. We applied lone star to develop a predictive miRNA expression signature on a training cohort. When applied to an independent testing cohort, the classifier correctly predicted 90% of node-positive cases and 80% of node-negative cases (FDR = 6.25%). Our results indicate that the evaluation of the quantitative sparse-feature classifier proposed here in clinical trials may lead to significant improvement in the prediction of lymphatic metastases in endometrial cancer patients.


INTRODUCTION
Incidence of pelvic and para-aortic node metastasis in patients with stage I endometrial cancer varies from 4-22% depending on grade, depth of invasion, lymphovascular space invasion, and histologic subtype [6]. Patients harboring tumors less than 2 centimeters in diameter and with less than 50% myometrial invasion are considered to be at low risk for lymphatic metastasis [12]. In a key clinical study, patients whose tumors violate these criteria were recommended for lymphadenectomy. Yet, within this high-risk group, only 22% had lymph node metastasis, suggesting that 78% of the lymphadenectomies were unnecessary [12]. A more recent study [11] that separately considered pelvic versus para-aortic lymph node invasion showed little improvement in this statistic. It is therefore clear that current best-practice clinical-pathologic parameters are grossly insufficient for reasonable prediction of metastatic disease [12].
In this paper, we develop a new classification algorithm which combines the best aspects of the 1-norm SVM of [5] and the Elastic Net formulation of [16], which uses a convex combination of the 1-norm and the 2-norm squared. We applied this new algorithm to quantitative genome-scale microRNA expression data from 86 clinically annotated primary endometrial tumors and recovered 18 micro-RNAs that are sufficient to predict the risk of lymph node metastasis within the training cohort. This biomarker panel was tested on an independent cohort of 28 tumors, and returned predictions with high sensitivity, low false discovery rate, and P < 0.0004. The panel therefore provides a path towards the development of a practical molecular diagnostic to avoid unnecessary surgeries (and their associated morbidities) in patients who are not at risk. This study is thus a transdisciplinary combination of two distinct advances: (i) a new algorithm for sparse feature selection in binary classification problems, and (ii) its application to predict the risk of metastasis in endometrial cancer.
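For orientation, the two regularizers that the new algorithm draws on can be written in commonly used forms. The notation below (weight vector $w$, threshold $\theta$, labels $y_i \in \{-1,+1\}$, tuning parameters $\lambda$ and $\gamma$) is assumed for illustration and is not taken from [5] or [16]; the exact objective function of the new algorithm is not stated in this abstract.

```latex
% 1-norm SVM, written here in a hinge-loss form (assumed notation)
\min_{w,\theta}\; \sum_{i=1}^{m} \max\bigl(0,\; 1 - y_i\,(w^\top x_i - \theta)\bigr) \;+\; \lambda\,\|w\|_1

% Elastic Net penalty: convex combination of the 1-norm and the squared 2-norm
P_\gamma(w) \;=\; (1-\gamma)\,\|w\|_1 \;+\; \gamma\,\|w\|_2^2, \qquad \gamma \in [0,1]
```

The 1-norm term drives many weights to exactly zero, which is what makes the resulting feature set sparse, while the 2-norm term tends to stabilize the solution when features are correlated.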

Selection of training cohort and generation of the predictive feature matrix
Quantitative measurement of miRNA expression was chosen for detection of putative predictive features. As a family, miRNAs represent a relatively compact feature set which is, nevertheless, profoundly integrated with cell and tissue behavior [3,7,13]. Moreover, miRNA expression patterns have been identified that can predict benign vs. malignant disease, histologic subtypes, survival, and response to chemotherapy [4,10,9]. Two recent surveys highlight the role of miRNAs in cancer in general [1] and endometrial cancer in particular [15].
Total cellular miRNA was extracted from all tissues and measured using LNA-based detection arrays (Supplemental Table 1). 86 samples passed quality controls based on RNA integrity and expression array performance. Among the 1,428 available probe sets, 213 miRNAs were detectable in all 86 samples. An unsupervised two-way hierarchical clustering of the resulting miRNA expression values within each subclass revealed substantial expression variation between tumors, with no qualitatively evident distinctions between subclasses.
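The filtering step described above, which retains only probes detectable in every sample, can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function name `detectable_in_all`, the use of `None` to mark a non-detected probe, and the toy values are all assumptions.

```python
from typing import Dict, List, Optional

# Hypothetical representation: expression[probe] holds one entry per sample,
# and None marks a probe that was not detected in that sample (assumption).
Expression = Dict[str, List[Optional[float]]]


def detectable_in_all(expression: Expression) -> List[str]:
    """Return the probes that have a measured value in every sample."""
    return [
        probe
        for probe, values in expression.items()
        if all(v is not None for v in values)
    ]


# Toy data with three samples; names and numbers are illustrative only.
toy = {
    "probe_a": [7.1, 6.8, 7.4],
    "probe_b": [5.0, None, 5.2],  # not detected in the second sample
    "probe_c": [9.3, 9.1, 8.7],
}
kept = detectable_in_all(toy)  # probe_b is dropped
```

Restricting to features measured in every sample avoids having to impute missing values before training, at the cost of discarding probes that are only intermittently detected.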

Generation of Molecular Signature for Predicting Lymph Node Metastasis
To detect candidate quantitative microRNA feature sets within the primary tumors that may discriminate between node-positive and node-negative disease, and to obtain a numerical procedure for combining the measured values of the features, we turned to machine learning protocols. When the number of features is larger than the number of samples, which is typical for biological problems such as the one here, machine learning approaches commonly encounter a phenomenon known as "over-fitting," wherein a classifier does an excellent job on the training data but generalizes poorly. To overcome this problem, we developed a sparse classification algorithm that uses a convex combination of the 1- and 2-norms as a regularization term in its objective function. To detect discriminatory features that may predict metastatic disease, 213 miRNA expression features measured in 86 samples (43 lymph node-positive and 43 lymph node-negative) were used as the training data. Applying the lone star algorithm to the training data, with 80 random cross-validations at each iteration, resulted in a set of 18 features. Afterwards, to compute a unique classifier, a single iteration of lone star was run with these 18 features, and the 20 classifiers with the best cross-validation error were retained (Supplemental Table 4). To obtain a more robust classifier, the weight vectors and thresholds of these 20 classifiers were averaged to arrive at the weight vector and threshold of the final classifier. This classifier was applied to the 86-tumor training cohort, and it classified all 86 tumors correctly.
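The final averaging step can be illustrated with a short sketch. It assumes each classifier is linear, defined by a weight vector and a threshold, and that a sample is called node-positive when its weighted score exceeds the threshold; the function names, the sign convention, and the toy numbers are assumptions for illustration and do not reproduce the lone star implementation.

```python
from typing import List, Sequence, Tuple


def average_classifiers(
    weights: Sequence[Sequence[float]], thresholds: Sequence[float]
) -> Tuple[List[float], float]:
    """Average the weight vectors and thresholds of several linear classifiers."""
    if not weights or len(weights) != len(thresholds):
        raise ValueError("need one threshold per weight vector, and at least one")
    dim = len(weights[0])
    if any(len(w) != dim for w in weights):
        raise ValueError("all weight vectors must have the same length")
    n = len(weights)
    mean_w = [sum(w[j] for w in weights) / n for j in range(dim)]
    mean_theta = sum(thresholds) / n
    return mean_w, mean_theta


def classify(x: Sequence[float], w: Sequence[float], theta: float) -> int:
    """Return +1 if the weighted score exceeds theta, else -1 (assumed convention)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > theta else -1


# Two toy classifiers over two features (illustrative values only).
w_final, theta_final = average_classifiers([[1.0, 0.0], [3.0, 2.0]], [0.5, 1.5])
label_a = classify([1.0, 0.0], w_final, theta_final)
label_b = classify([0.0, 0.5], w_final, theta_final)
```

Averaging several near-optimal classifiers reduces the dependence of the final weights on any single cross-validation split, which is the robustness argument made in the text.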

BIOLOGICAL SIGNIFICANCE OF SELECTED BIOMARKERS
We carried out an analysis of the various genes that are regulated by the 18 miRNAs in the final feature set. We retrieved data from the miRTarBase database, which comprises experimentally validated micro-RNA to target gene interactions in humans. A total of 740 genes were recovered, the vast majority of which are associated with the micro-RNA hsa-mir-155. A recent study suggests that hsa-mir-155 is over-expressed in endometrial cancer patients vis-à-vis normal patients [14].

CLASSIFIER VALIDATION WITH AN INDEPENDENT COHORT
To rigorously test the classifier developed using the lone star algorithm, an independent cohort of primary tumors with known metastatic state was collected. This comprised 28 endometrial cancer samples obtained between 2010 and 2012 under an IRB-approved Comprehensive Gynecologic Oncology Tumor Repository protocol.
The quality of the classification results was determined with a 2×2 contingency table, by computing the likelihood of arriving at the classifications purely by chance. P-values were computed using the Fisher exact test [8] and the Barnard exact test [2].
The P-value of the classification was 0.0004 with the Barnard exact test, and 0.0012 with the less powerful Fisher exact test.
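To make the evaluation concrete, the sketch below computes sensitivity, specificity, and false discovery rate from a 2×2 contingency table, together with a two-sided Fisher exact P-value. The counts are hypothetical and are not the contingency table of this study, which is not reproduced in this abstract; the function names are likewise assumptions, and the Barnard test is not implemented here.

```python
from math import comb
from typing import Tuple


def rates(tp: int, fn: int, fp: int, tn: int) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, false discovery rate)."""
    sensitivity = tp / (tp + fn)  # fraction of true positives detected
    specificity = tn / (tn + fp)  # fraction of true negatives detected
    fdr = fp / (fp + tp)          # fraction of positive calls that are wrong
    return sensitivity, specificity, fdr


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(k) for k in range(lo, hi + 1)]
    # Small relative tolerance so that ties with p_obs are included.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))


# Hypothetical table: rows are the true state, columns the predicted state.
tp, fn, fp, tn = 8, 2, 1, 9
sens, spec, fdr = rates(tp, fn, fp, tn)
p_value = fisher_exact_two_sided(tp, fn, fp, tn)
```

For real analyses, SciPy provides `scipy.stats.fisher_exact` and `scipy.stats.barnard_exact`; the hand-written version above only keeps the example free of dependencies.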

CONCLUSIONS
In this work, we have developed a novel sparse classification algorithm and applied it to predict risk of lymph node metastasis in endometrial cancer patients. The algorithm produced a weighted classifier, using 18 micro-RNAs, and achieved 100% accuracy on the training cohort. When applied to an independent testing cohort, the classifier correctly predicted 90% of node-positive cases and 80% of node-negative cases (FDR = 6.25%).
The classifier developed in this study was based on molecular measurements from excised tumors. If one could predict the risk of lymph node metastasis on the basis of a biopsy, then the decision to carry out lymphadenectomy or not could be made at the time of excision of the primary tumor. Therefore, a useful next step would be to repeat the present study on a cohort of biopsies. Pending the completion of such a study, it is worth noting that a prediction of the risk of metastasis is valuable even if lymphadenectomy is not performed, as it can inform choices for post-resection patient care.