Research Project Overview

Mining Petabytes of Data using Cloud Computing and Supercomputing

Petros Drineas, Bulent Yener, Chris Carothers, Mohammed Zaki and Angel Garcia
Rensselaer Polytechnic Institute

There is a growing need in many areas of science, engineering, and business for effective approaches to mining very large (i.e., petabyte-scale) data sets.

The project aims to design, analyze, and implement a number of fundamental matrix-mining and graph-mining operations that scale to petabyte-sized inputs. Such efforts are needed to sustain the phenomenal growth in analyzing, visualizing, and extracting information from massive matrices and graphs. The project leverages Rensselaer's unique computing platform: a massively parallel machine (a Blue Gene/Q) with access to approximately 1.2 petabytes of storage, together with a data-staging layer, named the RAM Storage Accelerator (RSA), with 512 computational nodes and a total of 8 TB of fast RAM. The platform is configurable so that the computational nodes at the RSA level can pre-process data from secondary storage in a cloud-like fashion. The project aims to design and analyze approximation algorithms for matrix and graph mining tasks that follow an iterative, two-step approach. First, given petabyte-scale data, computationally inexpensive methods run on the RSA layer, used as a "cloud", to produce compact data sketches that reduce the data from the petabyte scale to the terabyte scale. Second, the resulting sketches are processed using computationally demanding methods on the Blue Gene/Q. The process is then iterated, using the approximate solutions to improve the quality of the sketches and the approximation guarantees.
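The iterative, two-step approach can be illustrated at toy scale. The following is a minimal sketch under stated assumptions, not the project's software: it uses norm-squared row sampling as the inexpensive sketching step, power iteration on the sketch as a stand-in for the computationally demanding step, and a simple reweighting rule to refine the next sketch. All function names, the reweighting rule, and the synthetic data are hypothetical choices made for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sample_rows(rows, weights, sample_size, rng):
    """Cheap step: sample rows with probability proportional to `weights`,
    rescaling each pick so that the sketch S satisfies E[S^T S] = A^T A."""
    total = sum(weights)
    probs = [w / total for w in weights]
    picks = rng.choices(range(len(rows)), weights=probs, k=sample_size)
    sketch = []
    for i in picks:
        scale = 1.0 / math.sqrt(sample_size * probs[i])
        sketch.append([scale * x for x in rows[i]])
    return sketch


def top_right_singular_vector(rows, iterations=50):
    """Stand-in for the expensive step: power iteration on S^T S.
    Assumes the rows are not all orthogonal to the starting vector."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        proj = [dot(r, v) for r in rows]                      # S v
        w = [sum(p * r[j] for p, r in zip(proj, rows))        # S^T (S v)
             for j in range(dim)]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v


def iterative_sketch_and_solve(rows, sample_size, rounds=3, seed=0):
    """Alternate sketching and solving; each round reuses the current
    approximate solution to reweight the sampling (an assumed rule)."""
    rng = random.Random(seed)
    norms = [dot(r, r) for r in rows]
    weights = norms  # first round: plain norm-squared sampling
    v = None
    for _ in range(rounds):
        sketch = sample_rows(rows, weights, sample_size, rng)
        v = top_right_singular_vector(sketch)
        # Favor rows that align with the current estimate, while keeping
        # every nonzero row at a positive sampling probability.
        weights = [n + dot(r, v) ** 2 for n, r in zip(norms, rows)]
    return v


def make_demo_rows(n, dim, seed=1):
    """Synthetic data whose first coordinate carries most of the variance."""
    rng = random.Random(seed)
    return [[3.0 * rng.gauss(0, 1) + rng.gauss(0, 1)]
            + [rng.gauss(0, 1) for _ in range(dim - 1)]
            for _ in range(n)]


if __name__ == "__main__":
    data = make_demo_rows(2000, 5)
    approx = iterative_sketch_and_solve(data, sample_size=400)
    exact = top_right_singular_vector(data)
    print("alignment with full-data solution:", abs(dot(approx, exact)))
```

Here the sketch holds 400 of 2000 rows, so the solver touches a fifth of the data in each round; at the scales described above the same division of labor would correspond to sketching on the RSA layer and solving on the Blue Gene/Q.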

The research team expects to release software and libraries for matrix and graph mining algorithms that implement the two-step approach described above for petabyte-scale matrices and graphs. The resulting tools will be applied to the analysis of petabytes of data from computer simulations of the dynamics of biomolecular systems. The investigators plan to involve students and researchers from other institutions in the design, analysis, and development of the proposed methods through an internship program. The project also offers increased opportunities for research-based training in Data Analytics and High Performance Computing to graduate and undergraduate students at RPI. The results of the research will be made available to the academic community through the project web site.

The material presented on this web site is based in part upon work supported by the National Science Foundation, IIS/Information Integration and Informatics (III) division, through the Data Science Research Center under Grant No. 1302231.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Created by Chander Iyer: iyerc@rpi.edu