Title Statistical text modelling - Towards modelling of matching problems
Author Birch, Sune
Supervisor Hansen, Lars Kai (Department of Informatics and Mathematical Modeling, Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark); Madsen, Rasmus Elsborg (Department of Informatics and Mathematical Modeling, Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark)
Institution Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark
Thesis level Master's thesis
Year 2003
Abstract Automatic systems that process text documents are an active area of research. Search engines are probably the most successful and best-known example of statistical analysis of unstructured text documents proving very useful. But there is a vast number of other potential applications for which statistical text modelling is suitable. One such application is the matching of job applicants with job offers. Employers spend substantial resources on finding the right applicant, and applicants spend correspondingly substantial resources on finding the right job. Systems that could match applicants and jobs automatically and reliably would be very welcome in this area. This master's thesis has been motivated by the job matching problem and related matching problems. The objective is to find suitable statistical models for matching problems and to perform a thorough analysis of these models, both theoretically and experimentally. Search engines belong to the area of Information Retrieval. In Information Retrieval there has been considerable focus on models dealing with link structures, that is, documents interconnected by links. Links are interesting in the context of matching problems because a link is very similar to a match. Probabilistic Latent Semantic Indexing (PLSI) is a model that can be extended to represent links in addition to words. This thesis shows that PLSI is capable of capturing semantics in text documents. Furthermore, when extended with link information, it is capable of predicting, to some degree, which links a document should have from its text content, provided the link information is not too sparse. It is hoped that the insight into statistical text modelling and the capabilities of the models presented here will prove useful when building automatic systems for matching problems or related problems.
Imprint Department of Informatics and Mathematical Modeling, Technical University of Denmark, DTU : DK-2800 Kgs. Lyngby, Denmark
Keywords Statistical text modelling; Probabilistic Latent Semantic Indexing; PLSI; PLINK; Supervised PLSI; Job matching
Fulltext
Original PDF imm2799.pdf (0.65 MB)
Admin Creation date: 2006-06-22    Update date: 2012-12-20    Source: dtu    ID: 58600    Original MXD