مرکز منطقه ای اطلاع رساني علوم و فناوري - Estimation of singular values of very large matrices using random sampling

Abstract :

The singular value decomposition (SVD) has enjoyed a long and rich history. Although it was introduced in the 1870s by Beltrami and Jordan for its own intrinsic interest, it has become an invaluable tool in applied mathematics and mathematical modeling. Singular value analysis has been applied in a wide variety of disciplines, most notably for least squares fitting of data. More recently, it is being used in data mining applications and by search engines to rank documents in very large databases, including the Web. Recently, the dimensions of matrices which are used in many mathematical models are becoming so large that classical algorithms for computing the SVD cannot be used. We present a new method to determine the largest 10%–25% of the singular values of matrices which are so enormous that use of standard algorithms and computational packages will strain computational resources available to the average user. In our method, rows from the matrix are randomly selected, and a smaller matrix is constructed from the selected rows. Next, we compute the singular values of the smaller matrix. This process of random sampling and computing singular values is repeated as many times as necessary (usually a few hundred times) to generate a set of set of training data for neutral net analysis. Our method is a type of randomized algorithm, i.e., algorithms which solve problems using randomly selected samples of data which are too large to be processed by conventional means. These algorithms output correct (or nearly correct) answers most of the time as long as the input has certain desirable properties. We list these properties and show that matrices which appear in information retrieval are fairly well suited for processing using randomized algorithms. We note, however, that the probability of arriving at an incorrect answer, however small, is not naught since an unrepresentative sample may be drawn from the data.