Reading the news through its structure: new hybrid connectivity based approaches

by David Manuel Rodrigues

Institution: RCAAP
Year: 2014
Keywords: Adaptive networks; Q-analysis; Community detection; Swarm intelligence; Hamiltonian path; Travelling salesman problem; Ant colony optimisation; Redes adaptativas; Detecção de comunidades; Caminhos hamiltonianos; Problema do caixeiro viajante; Optimização por colónias de formigas
Record ID: 1318844
Full text PDF: http://rcaap.pt/detail.jsp?id=oai:repositorio.iscte-iul.pt:10071/8803


Degree of Doctor of Philosophy in the field of Complexity Sciences In this thesis a solution for the problem of identifying the structure of news published by online newspapers is presented. This problem requires new approaches and algorithms that are capable of dealing with the massive number of online publications in existence (and that will grow in the future). The fact that news documents present a high degree of interconnection makes this an interesting and hard problem to solve. The identification of the structure of the news is accomplished both by descriptive methods that expose the dimensionality of the relations between different news, and by clustering the news into topic groups. To achieve this analysis this integrated whole was studied using different perspectives and approaches. In the identification of news clusters and structure, and after a preparatory data collection phase, where several online newspapers from different parts of the globe were collected, two newspapers were chosen in particular: the Portuguese daily newspaper Público and the British newspaper The Guardian. In the first case, it was shown how information theory (namely variation of information) combined with adaptive networks was able to identify topic clusters in the news published by the Portuguese online newspaper Público. In the second case, the structure of news published by the British newspaper The Guardian is revealed through the construction of time series of news clustered by a kmeans process. After this approach an unsupervised algorithm, that filters out irrelevant news published online by taking into consideration the connectivity of the news labels entered by the journalists, was developed. This novel hybrid technique is based on Qanalysis for the construction of the filtered network followed by a clustering technique to identify the topical clusters. Presently this work uses a modularity optimisation clustering technique but this step is general enough that other hybrid approaches can be used without losing generality. A novel second order swarm intelligence algorithm based on Ant Colony Systems was developed for the travelling salesman problem that is consistently better than the traditional benchmarks. This algorithm is used to construct Hamiltonian paths over the news published using the eccentricity of the different documents as a measure of distance. This approach allows for an easy navigation between published stories that is dependent on the connectivity of the underlying structure. The results presented in this work show the importance of taking topic detection in large corpora as a multitude of relations and connectivities that are not in a static state. They also influence the way of looking at multi-dimensional ensembles, by showing that the inclusion of the high dimension connectivities gives better results to solving a particular problem as was the case in the clustering problem of the news published online. Neste trabalho resolvemos o problema da identificação da estrutura das notícias publicadas em…