||Distributed Topic Modeling of Web-scale Text Corpora
||Wahlgreen, Bjarne Ørum
||Hansen, Lars Kai (Cognitive Systems, Department of Informatics and Mathematical Modeling, Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark)
||Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark
||The Internet is a vast resource of information which keeps growing every day.
Automated tools for the analysis of Web information hence need to be scalable
for growing problem sizes, both in terms of speed and memory capacities.
We study how techniques for topic modeling of text corpora can be developed
to encompass the data amounts generated by the online news media stream. In
this thesis we propose a novel approach for automated topic modeling which
scales in terms of both speed and memory capacities.
We give a thorough background for the model we choose for topic modeling
and document how Nonnegative Matrix Factorization (NMF) can be used to
compute an approximation of the topic model. Various algorithms exist for
computing the NMF approximation and we choose to focus our attention on
two particular algorithms: "Multiplicative updates" and "projected gradient
descent". While the NMF has been applied to large-scale datasets, such as the
NetFlix challenge, we identify that it cannot be applied directly with input data
exceeding memory capacities. Hence, we suggest a scheme for distribution of
the input data in smaller segments and show how the NMF approximation can
be extended on the algorithmic level to allow distributed computation of factors.
We demonstrate the various NMF algorithms and document the results of simulating
the proposed distributed NMF computation. Thereby we verify our
design and obtain initial indications of the performance gains.
A system is implemented which realizes distributed topic modeling by NMF
computation. The system is tested with a dataset acquired from online news
media during three weeks in February 2010 and analyze the results. The system
is evaluated in terms of its ability to scale to larger problem sizes. We confirm this ability by demonstrating the system running with an input problem
containing 6:9 billion nonzero elements, which we believe is the biggest problem
to date that has been attempted with the investigated technique for topic
We run the system in two configurations, where the second extends the NMF
algorithm with sparsity constraints and adaptively tunes these in a framework
known as Automatic Relevance Determination (ARD). The two solutions are
evaluated to investigate the differences with and without ARD. The overall
findings are, that while the relative importance of topics does change slightly,
the topic features that span the latent space of the model are closely related
between the solutions. Other qualitative results are provided for the differences
between topics of the same solution and show that these are well separated.
The results are applied in a simple application where topic dynamics are tracked
over time. This provides visualizations from which we see good correspondence
between topic activity and the topics represented by the features.
||Technical University of Denmark (DTU) : Kgs. Lyngby, Denmark
Creation date: 2010-07-01
Update date: 2010-07-01