Data Sampling for Matrix Completion

In this project we discuss how to predict the locations of nonzero elements in a large, symmetric, sparse matrix using only a small subset of its rows. These elements are of interest in applications such as genome research, in which the genome of an organism is represented as a large, symmetric matrix in which only about 0.2% of the elements are nonzero. Accurate estimates of the locations of nonzero elements direct a biologist toward sampling those elements rather than spending resources confirming the overwhelming number of zero-valued elements. We use five methods for selecting the rows to be sampled: completely random selection, three different methods of weighting all the rows, and selection restricted to rows above a minimum eccentricity. We compare the methods using an F-measure calculation, which quantifies how well a subset of rows predicts the locations of nonzero elements. Our work shows that the intelligent methods of row selection perform approximately 7.4 times better than random row selection when sampling at a 98% accuracy level. These results demonstrate that better row selection techniques make better prediction of nonzero elements in large, symmetric, sparse matrices feasible.
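
The evaluation described above can be illustrated with a minimal sketch. It is not the project's implementation: the function names (`f_measure`, `sample_rows`, `observed_mask`) are hypothetical, the weighting scheme is left to the caller, and the "prediction" is assumed to be simply the nonzeros seen in the sampled rows, mirrored across the diagonal because the matrix is symmetric. The F-measure is the standard one, combining precision and recall over predicted nonzero locations.

```python
import numpy as np


def f_measure(true_mask, pred_mask, beta=1.0):
    """Standard F-measure between true and predicted nonzero locations.

    Both arguments are boolean arrays of the same shape; True marks a
    (predicted) nonzero element.
    """
    true_mask = np.asarray(true_mask, dtype=bool)
    pred_mask = np.asarray(pred_mask, dtype=bool)
    tp = np.count_nonzero(true_mask & pred_mask)   # correctly predicted nonzeros
    fp = np.count_nonzero(~true_mask & pred_mask)  # predicted nonzero, actually zero
    fn = np.count_nonzero(true_mask & ~pred_mask)  # missed nonzeros
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def sample_rows(matrix, n_rows, weights=None, rng=None):
    """Pick n_rows distinct row indices, uniformly or by caller-supplied weights.

    The weights are an assumption of this sketch; any nonnegative per-row
    score can be passed in.
    """
    rng = np.random.default_rng(rng)
    p = None
    if weights is not None:
        w = np.asarray(weights, dtype=float)
        p = w / w.sum()
    return rng.choice(matrix.shape[0], size=n_rows, replace=False, p=p)


def observed_mask(matrix, rows):
    """Nonzero locations revealed by the sampled rows, mirrored by symmetry."""
    mask = np.zeros(matrix.shape, dtype=bool)
    mask[rows, :] = matrix[rows, :] != 0
    return mask | mask.T


# Tiny symmetric example: sampling row 0 reveals (0, 1) and, by symmetry, (1, 0).
A = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
score = f_measure(A != 0, observed_mask(A, [0]))
```

In the example, two of the three nonzero locations are recovered with no false positives, so precision is 1, recall is 2/3, and the F-measure is 0.8. Comparing this score across row selection strategies at a fixed sampling budget is one way to rank them.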