Non-parametric graph-based methods

The notion of similarity among data objects plays a fundamental role in many machine learning methods. Graph-based methods induce the similarity between data objects from (1) local similarities that are first complied into a similarity graph and (2) spectral decomposition of this graph, that aims to aggregate the effects of local similarities into a global (data-driven) similarity metric (or kernel) between the objects. A direct consequence of this is that machine learning algorithms that are designed to work with absolute distances can be now applied to problems with data-driven distances.

Our research on the graph-based methodology covers:

In the following we briefly describe the main ideas of these effors and include the related publications:

Graph-based text metrics

Graph-based methods can be used to induce similarities for an arbitrary set of objects from their local similarities. We pursued this approach by defining similarities among terms or concepts in a document corpus. Briefly, we used co-occurence counts of the two terms in the sentence or paragraph in the corpus to define their local similarity. The global similarity was then induced using spectral decomposition methods and smooth kernels derived from them. This let us build similarities among pairs of concepts. After that we have developed a novel way of defining similarity among sets of terms which let us compare sentences, paragraphs or even whole documents. To compute this metric efficiently, we have develop an approximation technique that relies on the precompiled term-term similarities. We demonstrated the benefits of the whole framework on two text inference tasks: (1) prediction of terms in the biomedical article from its abstract and (2) query expansion in information retrieval; and on the gene prioritization task, where our goal was to infer from the text genes that are most likely associated with Alzheimer disease. This approach (developed in 2010) is an alternative approach to popular skip-gram and CBOW models for defining the similairy of terms, words or concepts, that uses eignevectors of the spectral decomposition of the graph Laplacian to define the lower dimensional embeddings.

Related publications:

Approximation algorithms for large-scale and high-dimensional data.

In this research we studied ways of improving the methodology so that it can scale up and can be applied to high-dimensional, large-scale data. First, we have developed a variational dual-tree framework to effectively represent the transition matrix of the random walk on the graph (or equivalently the random walk Laplacian matrix). This transition matrix is specifically useful in applications such as diffusion analysis, semi-supervised learning and link analysis. We demonstrated an order of magnitudes speedup for label propagation tasks on real datasets with this framework. Second, we have proposed a novel approach for building the kernel for high-dimensional data by factoring it into to multiple kernels defined on indepenendent data subspaces.

Related publications:

Application of graph-based methods to clinical data

One of the research topics we currently pursue is how to define similarity among patients (clinical data) and how to use this similarity to support various inferences. We have developed a graph-based method and a new soft-label propagation algorithm to support inferences in conditional outlier detection. We applied the method to detect deviations in patient care.

Related publications:


The web page is updated by milos.