
Selecting genes in high dimension: investigation of Lasso like methods in survival datasets and prioritizing gene selection with gene network a priori

Seminar given by Laurent Guyon (CEA BIG Grenoble) on May 9, 2019, at 14h00


09/05/2019 : 14h00
MSI seminar room, Sophia Antipolis
Speaker: Laurent Guyon (CEA BIG Grenoble)
Published: 09/05/2019

Abstract: The genomic era has produced large datasets of clinical interest, particularly in cancer. For example, the American TCGA database gathers tumor profiles together with clinical data for hundreds of patients with clear cell renal carcinoma (the KIRC dataset). However, the number p of parameters (gene product levels) is on the order of twenty thousand and far exceeds the number n of patients. This p >> n situation leads to the so-called ‘curse of dimensionality’ and raises a range of issues in data analysis.

The Cox model is a popular way to link genomic and survival data in order to predict the overall survival of patients from the genomic profile of their tumor. To keep only a few predictive genes in the model, Lasso penalization and variants such as Elastic Net and Adaptive Elastic Net are often applied.
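
To make the penalized objective concrete, here is a minimal NumPy sketch of a Cox negative log partial likelihood combined with an Elastic Net penalty. It is an illustration under stated assumptions, not the speaker's implementation: the function names, the averaging over patients, the `lam`/`alpha` parameterization, and the absence of tie handling are all choices made here. Adaptive Elastic Net, which reweights the L1 term per gene, is not shown.

```python
import numpy as np

def cox_neg_log_partial_likelihood(beta, X, time, event):
    """Negative Cox log partial likelihood, averaged over patients.

    Assumption: no tied event times (ties would need Breslow/Efron handling).
    """
    order = np.argsort(-time)              # sort by decreasing follow-up time
    X, event = X[order], event[order]
    eta = X @ beta                         # linear risk score of each patient
    # Running log-sum-exp: log of sum(exp(eta_j)) over patients still at risk,
    # i.e. those whose follow-up time is at least that of patient i.
    log_risk = np.logaddexp.accumulate(eta)
    return -np.sum(event * (eta - log_risk)) / len(time)

def elastic_net_penalty(beta, lam, alpha):
    """alpha = 1 gives the Lasso (pure L1), alpha = 0 gives ridge (pure L2)."""
    l1 = np.sum(np.abs(beta))
    l2 = 0.5 * np.sum(beta ** 2)
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

def penalized_objective(beta, X, time, event, lam=0.1, alpha=1.0):
    """Quantity to minimize over beta; genes with beta_j = 0 are not selected."""
    return (cox_neg_log_partial_likelihood(beta, X, time, event)
            + elastic_net_penalty(beta, lam, alpha))
```

The L1 term is what drives most coefficients exactly to zero, so that the genes with nonzero coefficients form the selected set. In practice this objective would be minimized with a dedicated solver (for example coordinate descent) rather than evaluated by hand.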

We benchmarked these three Lasso-like methods for identifying prognostic biomarkers and predicting individual patient survival on the KIRC dataset, using microRNA (p ~ n) and mRNA (p >> n) sequencing data separately. Although the results are very unstable, survival is still correctly predicted for one fifth of the patients with a poor prognosis. Such information could steer the clinician's decision towards novel therapies for these patients.

While the previous analysis is purely ‘data driven’, we also propose to incorporate prior biological knowledge into the model. We chose a network-based prior in which every node is a gene (or gene product) and two nodes are connected if there is a known biological interaction between the two gene products. We apply a Markov random field (MRF) procedure that selects genes from the data of interest (a list of p-values) while taking the gene network into account. As a result, two genes have a higher chance of being selected if they are connected in the network.
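
The abstract does not specify the MRF energy or the inference algorithm, so the following is only a hypothetical sketch of the general idea. It assumes a binary selection label per gene, a unary term derived from the p-value, a pairwise reward when two connected genes are both selected, and iterated conditional modes (ICM) as the optimizer. The function name `mrf_select` and the parameters `threshold`, `coupling`, and `n_sweeps` are invented for illustration.

```python
import numpy as np

def mrf_select(pvalues, edges, threshold=0.05, coupling=1.0, n_sweeps=20):
    """Hypothetical ICM on a binary MRF; returns a boolean selection mask.

    Assumed energy: E(z) = -sum_i evidence_i * z_i - coupling * sum_(i,j) z_i * z_j,
    with z_i in {0, 1} and the second sum running over network edges.
    """
    p = np.clip(np.asarray(pvalues, dtype=float), 1e-300, 1.0)
    n = len(p)
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # Unary evidence: positive when the p-value is below the threshold.
    evidence = np.log(threshold) - np.log(p)
    selected = evidence > 0                # start from plain thresholding
    for _ in range(n_sweeps):
        changed = False
        for i in range(n):
            # Each selected neighbor adds `coupling` in favor of selecting gene i.
            support = sum(1 for j in neighbors[i] if selected[j])
            new_state = evidence[i] + coupling * support > 0
            if new_state != selected[i]:
                selected[i] = new_state
                changed = True
        if not changed:                    # local minimum of the energy reached
            break
    return selected
```

With `coupling=0` this reduces to thresholding the p-values. With a positive coupling, a gene whose p-value falls just above the threshold can still be selected when it interacts with a strongly significant gene, which is the behavior described above.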