Beta 1


Title Distributed Topic Modeling of Web-scale Text Corpora
Author Wahlgreen, Bjarne Ørum
Supervisor Hansen, Lars Kai (Cognitive Systems, Department of Informatics and Mathematical Modeling, Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark)
Institution Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark
Thesis level Master's thesis
Year 2010
Abstract The Internet is a vast resource of information which keeps growing every day. Automated tools for the analysis of Web information hence need to be scalable for growing problem sizes, both in terms of speed and memory capacities. We study how techniques for topic modeling of text corpora can be developed to encompass the data amounts generated by the online news media stream. In this thesis we propose a novel approach for automated topic modeling which scales in terms of both speed and memory capacities. We give a thorough background for the model we choose for topic modeling and document how Nonnegative Matrix Factorization (NMF) can be used to compute an approximation of the topic model. Various algorithms exist for computing the NMF approximation and we choose to focus our attention on two particular algorithms: "Multiplicative updates" and "projected gradient descent". While the NMF has been applied to large-scale datasets, such as the NetFlix challenge, we identify that it cannot be applied directly with input data exceeding memory capacities. Hence, we suggest a scheme for distribution of the input data in smaller segments and show how the NMF approximation can be extended on the algorithmic level to allow distributed computation of factors. We demonstrate the various NMF algorithms and document the results of simulating the proposed distributed NMF computation. Thereby we verify our design and obtain initial indications of the performance gains. A system is implemented which realizes distributed topic modeling by NMF computation. The system is tested with a dataset acquired from online news media during three weeks in February 2010 and analyze the results. The system is evaluated in terms of its ability to scale to larger problem sizes. We confirm this ability by demonstrating the system running with an input problem containing 6:9 billion nonzero elements, which we believe is the biggest problem to date that has been attempted with the investigated technique for topic modeling. We run the system in two configurations, where the second extends the NMF algorithm with sparsity constraints and adaptively tunes these in a framework known as Automatic Relevance Determination (ARD). The two solutions are evaluated to investigate the differences with and without ARD. The overall findings are, that while the relative importance of topics does change slightly, the topic features that span the latent space of the model are closely related between the solutions. Other qualitative results are provided for the differences between topics of the same solution and show that these are well separated. The results are applied in a simple application where topic dynamics are tracked over time. This provides visualizations from which we see good correspondence between topic activity and the topics represented by the features.
Imprint Technical University of Denmark (DTU) : Kgs. Lyngby, Denmark
Series IMM-M.Sc.-2010-45
Fulltext
Original PDF ep10_45_net.pdf (2.10 MB)
Admin Creation date: 2010-07-01    Update date: 2010-07-01    Source: dtu    ID: 264537    Original MXD