Data mining

Statistical learning
Clustering and classification methods based on mixture models for high-dimensional data and for non-vector data, such as discrete distributions with non-fixed supports (sets of unordered, weighted vectors).
  • Multilayer mixture models
  • Two-way mixture models for discrete and continuous data
  • D2-clustering for discrete distributions (sets of unordered, weighted vectors) under the Kantorovich-Wasserstein metric
  • Generalized mixture modeling for a metric (but not vector) space
  • Clustering via mode association
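
As a rough illustration of the last item, the sketch below clusters points by the density mode each one climbs to. It is an assumption-laden toy, not the author's algorithm: it uses a fixed-bandwidth Gaussian kernel density estimate and the Gaussian mean-shift update (an EM-style ascent step for equal-covariance Gaussian kernels). The names `climb_to_mode`, `cluster_by_modes`, `bandwidth`, and `merge_tol`, and the sample points, are all hypothetical.

```python
import math


def climb_to_mode(x, data, bandwidth, tol=1e-8, max_iter=500):
    """Hill-climb from x to a local maximum of a Gaussian kernel density estimate."""
    x = list(x)
    for _ in range(max_iter):
        # Weight of each kernel centre given the current position.
        weights = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * bandwidth ** 2))
            for p in data
        ]
        total = sum(weights)
        # With equal-covariance Gaussian kernels the ascent step is a weighted mean.
        new_x = [
            sum(w * p[d] for w, p in zip(weights, data)) / total
            for d in range(len(x))
        ]
        shift = math.dist(x, new_x)
        x = new_x
        if shift < tol:
            break
    return x


def cluster_by_modes(data, bandwidth, merge_tol=1e-3):
    """Assign the same label to points whose ascent paths end at the same mode."""
    modes, labels = [], []
    for p in data:
        m = climb_to_mode(p, data, bandwidth)
        for k, existing in enumerate(modes):
            if math.dist(m, existing) < merge_tol:
                labels.append(k)
                break
        else:
            # No known mode is close enough, so record a new one.
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels


# Hypothetical sample: two well-separated groups of 2-D points.
points = [
    (0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (0.1, -0.2),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3),
]
modes, labels = cluster_by_modes(points, bandwidth=1.0)
print(labels)
```

The number of clusters is not specified in advance; it follows from the number of modes, which in this sketch is controlled by the assumed `bandwidth`.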

Applications explored: document retrieval/classification, image annotation/retrieval/segmentation/compression, social networks, information visualization, genomics, etc.

Sample talk | Free software | Basics on data mining & learning
Stochastic modeling
Spatial stochastic models attempt to characterize the inherent dependence among image pixels. This dependence can then be exploited for tasks such as segmentation, compression, and classification. We have developed the 2-D hidden Markov model (spatial HMM), with extensions to a multiresolution model (MHMM) and to 3-D for volume data.

Applications explored: general-purpose photographs, satellite images, Chinese classical paintings, Van Gogh paintings, etc.

Sample talk | Tutorial on HMM
Image annotation
Image annotation means automatically tagging pictures with words using only pixel information. We have developed ALIPR, a real-time computerized image annotation system. The work is rooted in the ALIP system developed in 2002. Relevant methodologies: 2-D MHMM, D2-clustering, generalized mixture modeling.

Sample talk | In the news: MIT Tech Review ...
Image retrieval
Content-based image retrieval systems search for similar pictures using only pixel information. We have developed the SIMPLIcity retrieval system, which has been deployed at several real-world Web sites and requested for educational purposes by dozens of universities. We continue to work on image retrieval to bring in new aspects such as aesthetics, semantics learning, and story picturing.

Sample talk | Demo | Slashdot news
Social networks
Statistical modeling and learning techniques are used to discover E-communities and to study academic collaboration networks, with applications to CiteSeer.

Comparative genomics
Data mining and statistical modeling methods are used to study the evolution and functions of DNA segments based on aligned DNA sequences from multiple species.

Sample talk
Data compression/Source coding theory
Asymptotics of vector quantizers at high bit rates under perceptually based distortion measures.
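
To give a feel for what "high bit rate asymptotics" means, here is a minimal sketch of the simplest textbook case only: a uniform scalar quantizer, a uniform source on [0, 1), and squared-error distortion. It does not cover the vector or perceptually based setting described above. The function name `uniform_quantizer_mse` and the sampling grid are hypothetical choices for illustration.

```python
def uniform_quantizer_mse(bits, samples=2 ** 16):
    """Mean squared error of a uniform scalar quantizer on a uniform [0, 1) source.

    The source is approximated by a deterministic grid of sample midpoints.
    """
    levels = 2 ** bits
    step = 1.0 / levels
    total = 0.0
    for i in range(samples):
        x = (i + 0.5) / samples                 # grid point in [0, 1)
        idx = min(int(x / step), levels - 1)    # index of the quantization cell
        x_hat = (idx + 0.5) * step              # reproduce with the cell midpoint
        total += (x - x_hat) ** 2
    return total / samples


for bits in range(1, 7):
    step = 1.0 / 2 ** bits
    # Compare the measured distortion with the high-rate approximation step^2 / 12.
    print(bits, uniform_quantizer_mse(bits), step ** 2 / 12)
```

In this simplified setting the distortion tracks step²/12, so each additional bit divides it by about four (roughly 6 dB per bit). High-rate theory asks how such laws change for vector quantizers and for other distortion measures.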

Sample talk

        Software for the public (author: Jia Li)

@Jia Li, Updated August 2005