Non-parametric graph-based methods

The notion of similarity among data objects plays a fundamental role in many machine learning methods. Graph-based methods induce the similarity between data objects from (1) local similarities that are first complied into a similarity graph and (2) spectral decomposition of this graph, that aims to aggregate the effects of local similarities into a global (data-driven) similarity metric (or kernel) between the objects. A direct consequence of this is that machine learning algorithms that are designed to work with absolute distances can be now applied to problems with data-driven distances.

Our research on the graph-based methodology covers:

Development of a text similarity metric to support inferences in text based on the spectral graph-Laplacian based low dimensional embeddings for words, concepts
Development of approximation methods for large-scale and high-dimensional data
Applications of the methods to clinical data.

In the following we briefly describe the main ideas of these effors and include the related publications:

Graph-based text metrics

Graph-based methods can be used to induce similarities for an arbitrary set of objects from their local similarities. We pursued this approach by defining similarities among terms or concepts in a document corpus. Briefly, we used co-occurence counts of the two terms in the sentence or paragraph in the corpus to define their local similarity. The global similarity was then induced using spectral decomposition methods and smooth kernels derived from them. This let us build similarities among pairs of concepts. After that we have developed a novel way of defining similarity among sets of terms which let us compare sentences, paragraphs or even whole documents. To compute this metric efficiently, we have develop an approximation technique that relies on the precompiled term-term similarities. We demonstrated the benefits of the whole framework on two text inference tasks: (1) prediction of terms in the biomedical article from its abstract and (2) query expansion in information retrieval; and on the gene prioritization task, where our goal was to infer from the text genes that are most likely associated with Alzheimer disease. This approach (developed in 2010) is an alternative approach to popular skip-gram and CBOW models for defining the similairy of terms, words or concepts, that uses eignevectors of the spectral decomposition of the graph Laplacian to define the lower dimensional embeddings.

Related publications:

S. Wang, M. Hauskrecht.
Keyword Annotation of Biomedical Documents with Graph-based Similarity Methods.
IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Philadelphia, October 2012.
S. Amizadeh, S. Wang, and M. Hauskrecht.
An Efficient Framework for Constructing Generalized Locally-Induced Text Metrics, International Joint Conference on AI (IJCAI), Barcelona, Spain, July 2011.
S. Wang, and M. Hauskrecht.
Effective Query Expansion with the Resistance Based Term Similarity Metric,
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval , Geneva, Switzerland, July 2010, pp. 715-716.
S. Wang, M. Hauskrecht, S. Visweswaran.
Candidate Gene Prioritization Using Network Based Probabilistic Models.
AMIA Summit on Translation Bioinformatics , March 2010.

Approximation algorithms for large-scale and high-dimensional data.

In this research we studied ways of improving the methodology so that it can scale up and can be applied to high-dimensional, large-scale data. First, we have developed a variational dual-tree framework to effectively represent the transition matrix of the random walk on the graph (or equivalently the random walk Laplacian matrix). This transition matrix is specifically useful in applications such as diffusion analysis, semi-supervised learning and link analysis. We demonstrated an order of magnitudes speedup for label propagation tasks on real datasets with this framework. Second, we have proposed a novel approach for building the kernel for high-dimensional data by factoring it into to multiple kernels defined on indepenendent data subspaces.

Related publications:

A. Amizadeh, B. Thiesson, M. Hauskrecht.
The Bregman Variational Dual-Tree Framework.
The 29th International Conference on Uncertainty in Artificial Intelligence (UAI), Seattle, WA, July 2013.
S. Amizadeh, B. Thiesson, and M. Hauskrecht.
Variational Dual-Tree Framework for Large-Scale Transition Matrix Approximation.
The 28th International Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, CA, August 2012.
S. Amizadeh, H. Valizadegan, and M. Hauskrecht.
Factorized Diffusion Map Approximation.
15th International Conference on Artificial Intelligence and Statistics (AISTATS), La Palma, Canary Islands, April 2012. ( appendix)

Application of graph-based methods to clinical data

One of the research topics we currently pursue is how to define similarity among patients (clinical data) and how to use this similarity to support various inferences. We have developed a graph-based method and a new soft-label propagation algorithm to support inferences in conditional outlier detection. We applied the method to detect deviations in patient care.

Related publications:

M. Valko, H. Valizadegan, B. Kveton, GF. Cooper, and M. Hauskrecht.
Conditional Anomaly Detection with Soft Harmonic Functions,
IEEE International Conference on Data Mining, Vancouver, Canada, December 2011.
M. Valko, H. Valizadegan, B. Kveton, GF. Cooper, and M. Hauskrecht.
Conditional Anomaly Detection Using Soft Harmonic Functions: An Application to Clinical Alerting,
ICML Workshop For Global Challenges, Bellevue, Washington, June 2011.

Funding:

NIH. R01GM088224 Detecting deviations in clinical care in ICU data stream. PIs: Hauskrecht and Clermont , August 2009-June 2018.

The web page is updated by milos.