Frontiers in Artificial Intelligence vol:243 pages:615-625
International Conference on Knowledge-Based and Intelligent Information & Engineering Systems edition:16 location:San Sebastian, Spain date:10-12 September 2012
In this work we implement and evaluate a methodology for classifying multi-labeled web documents into large-scale taxonomies using their text content. Multi-label hierarchical classification over large-scale taxonomies is a hard task because training data is scarce at many nodes of the hierarchy, class content overlaps, and decision surfaces are complex. We propose a novel feature extraction model called Multilayered Class Discrimination (MCD), which reduces the dimensionality of the text-content features of web documents at the different levels of the hierarchy. This helps to discriminate each class from the other classes at the same level and reduces the effects of these problems. Results from categorizing web documents from the DMOZ directory show that our model improves categorization accuracy compared with using word features, and that the results are competitive with those presented in the Second LSHTC Challenge.