Title: Hierarchical classification of web documents by stratified discriminant analysis
Authors: Gomez, Juan Carlos; Moens, Marie-Francine
Issue Date: 2012
Publisher: Springer
Host Document: Lecture Notes in Computer Science, vol. 7356, pages 94-108
Conference: Vienna, Austria, 2-3 July 2012
Abstract: In this work we present and evaluate a methodology for classifying web documents into a predefined hierarchy using their textual content. Hierarchical classification with taxonomies of thousands of categories is a hard task because training data are scarce. It is one of the rare situations where, despite the large amount of available data, training data remain scarce: as more documents become available, more classes are also added to the hierarchy. This leaves most categories with too few training examples, which produces poor individual classification models and tends to bias classification toward dense categories. Here we propose a novel feature extraction technique called Stratified Discriminant Analysis (sDA) that reduces the dimensionality of the text content features of the web documents along the different levels of the hierarchy. The sDA model is intended to reduce the effects of data scarcity by better grouping and identifying the categories with few training examples, leading to more robust classification models for those categories. The results of classifying web pages from the Kids&Teens branch of the DMOZ directory show that our model extracts features that are well suited for grouping web pages by category and for representing categories with few training examples.
ISSN: 0302-9743
Publication status: published
KU Leuven publication type: IC
Appears in Collections: Informatics Section

Files in This Item:
File: GomezMoensIRF2012.pdf
Description: Main article
Status: Published
Size: 334Kb
Format: Adobe PDF