
Automatization of text categorization based on unlabeled documents

by Vlatko Duric




Institution: University of Oslo
Department:
Year: 1000
Keywords: VDP::420
Record ID: 1280622
Full text PDF: https://www.duo.uio.no/handle/10852/9643


https://www.duo.uio.no/bitstream/10852/9643/2/Duric.pdf


Abstract

The subject of this thesis is automatization of text categorization, with a focus on using unlabeled documents and domain ontologies to improve the performance of text categorization systems. The thesis is written in the context of the AKSIO project. This context has the following important characteristics:

- There is a large collection of unlabeled documents concerned with the domain of interest.
- Manually labeling a considerable number of these documents would be very expensive.
- There is an expert-made domain ontology.
- There is a considerable number of text categories (i.e. ontology classes).

Text categorization tasks with similar starting points are not rare. Currently, one of the major problems in automating text categorization is the scarcity of manually labeled documents. Standard methods for supervised machine learning of text classifiers require, for each class (i.e. category), a number of documents previously annotated by a domain expert. With more than a couple of hundred classes, this requirement becomes too difficult and too costly to satisfy.

This thesis considers some possible approaches to this current research problem. We propose a method that combines the available resources:

- a large collection of unlabeled documents,
- domain knowledge stored in the ontology, and
- external lexical sources, such as the Web, WordNet, standard dictionaries, and domain-specific glossaries.

Some basic experimental work has been done. It indicates that the method has some potential to improve automatic text categorization in cases where almost no manually labeled documents are available.
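
The abstract does not describe the proposed method in detail, so the following is only an illustrative sketch of the general idea of categorizing documents without manual labels, not the thesis's actual algorithm. It assumes that each ontology class can be given a small set of seed terms (for example, taken from class names or a glossary) and assigns each unlabeled document to the class whose seed terms it mentions most often. The class names, seed terms, example documents, and the function names `tokenize` and `pseudo_label` are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def pseudo_label(documents, class_terms):
    """Assign each document the class whose seed terms occur most often.

    Returns None for a document that contains no seed term of any class.
    """
    labels = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        # Score of a class = total occurrences of its seed terms in the document.
        scores = {
            cls: sum(counts[term] for term in terms)
            for cls, terms in class_terms.items()
        }
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] > 0 else None)
    return labels


# Hypothetical seed terms per ontology class (assumed, not from the thesis).
class_terms = {
    "Hardware": {"disk", "processor", "memory"},
    "Software": {"compiler", "library", "kernel"},
}

# Hypothetical unlabeled documents.
documents = [
    "The processor was replaced and more memory was added.",
    "The compiler failed to link against the library.",
    "Meeting notes from Tuesday.",
]

labels = pseudo_label(documents, class_terms)
for doc, label in zip(documents, labels):
    print(label, "<-", doc)
```

Documents labeled this way could then serve as noisy training data for a standard supervised classifier, while documents left as `None` would remain unlabeled. Seed terms could in principle be expanded with synonyms from an external lexical source, though how the thesis does this is not stated in the abstract.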