Statistical Methodology for the National Virtual Observatory

The NAS Taylor/McKee Decadal Report on astronomy for 2000-2010 recommends as a top priority the formation of a National Virtual Observatory (NVO) to link archival datasets and catalogues from many existing astronomical surveys. Effective use of such integrated massive datasets involves more than access to and extraction of information: scientific understanding requires sophisticated statistical modeling of the selected data. This effort falls under the rubric of statistical inference and includes the fields of multivariate analysis, nonparametrics, Bayesian analysis, spatial point processes, density estimation and data mining. Large-scale multiwavelength astronomical surveys present a variety of challenging new statistical and algorithmic problems that require methodological advances. The principal investigator and his colleagues address some of the critically important statistical challenges raised by the NVO. Specific approaches include low-storage percentile estimation for large datasets, multi-resolution k-dimensional (k-d) trees for clustering and outlier detection, and multi-dimensional goodness-of-fit tests for comparing multivariate astronomical datasets with astrophysical models and simulations. Such an endeavor requires close collaboration among statisticians, astronomers and NVO specialists who reside at different institutions. One of the central goals of this project is to develop a statistical toolkit within the NVO software environment that implements both new and existing methods.

As the volume and complexity of astronomical data have increased enormously in recent decades, a paradigm shift is underway in the very nature of observational astronomy. While in the past a single astronomer might observe a handful of objects, today data mining of large digital sky archives obtained at all wavelengths of light is becoming a major mode of study. The astronomical community thus faces a key task: to enable efficient and objective scientific exploitation of enormous multifaceted datasets. In recognition of this need, the National Virtual Observatory (NVO) initiative has recently emerged to federate numerous large digital sky archives and to develop tools for exploring and understanding these vast volumes of data. This investigation aims to develop statistical and computational methods to achieve these goals. The cross-disciplinary team of astronomers and statisticians brings advances in statistics and computation into the toolbox of observational astronomy. The project seeks not only to formulate effective techniques for addressing NVO problems, but also to code these methods into statistical toolkits within NVO software environments for use by the entire astronomical community. The collaboration includes two institutions skilled in astrostatistics (Penn State and Carnegie Mellon) and an institution at the center of the NVO effort (California Institute of Technology). Participation in the project gives graduate students and postdocs a rare opportunity to develop the skills needed for cross-disciplinary work.