Title: PCA document reconstruction for email classification
Authors: Gomez, Juan Carlos ×
Moens, Marie-Francine #
Issue Date: 2012
Publisher: North-Holland Pub. Co.
Series Title: Computational Statistics & Data Analysis vol:56 issue:3 pages:741-751
Abstract: This paper presents a document classifier based on text content features and its application to email classification. We test the validity of a classifier that uses Principal Component Analysis Document Reconstruction (PCADR). The underlying idea is that principal component analysis (PCA) can optimally compress only the kind of documents (in our experiments, email classes) that are used to compute the principal components (PCs), and that for other kinds of documents the compression will not perform well using only a few components. Thus, the classifier computes the PCA separately for each document class. When a new instance arrives to be classified, it is projected onto each set of computed PCs, one set per class, and then reconstructed using the same PCs. The reconstruction error is computed, and the classifier assigns the instance to the class with the smallest error, that is, the smallest divergence from the class representation. We test this approach in email filtering by distinguishing between two message classes (e.g. spam from ham, or phishing from ham). The experiments show that PCADR obtains very good results on the different validation datasets employed, reaching better performance than the popular Support Vector Machine classifier.
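The procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean-centering step, the squared Euclidean reconstruction error, the number of components, and the toy three-feature vectors are all assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def fit_class_pcs(X, n_components):
    """Fit PCA on one class. Rows of X are documents, columns are features.

    Returns the class mean and the top principal components (as rows).
    Mean-centering is an assumption of this sketch.
    """
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def reconstruction_error(x, mean, components):
    """Project x onto the class PCs, reconstruct it, and measure the loss."""
    centered = x - mean
    coords = components @ centered          # coordinates in PC space
    reconstructed = components.T @ coords   # back in feature space
    # Squared Euclidean distance is an assumed error measure.
    return float(np.sum((centered - reconstructed) ** 2))


def classify(x, models):
    """Assign x to the class whose PCs reconstruct it with the smallest error."""
    errors = {
        label: reconstruction_error(x, mean, comps)
        for label, (mean, comps) in models.items()
    }
    return min(errors, key=errors.get), errors


# Hypothetical toy data: each row is a document with three features.
ham = np.array([[1.0, 0.1, 0.0], [2.0, 0.2, 0.0],
                [3.0, 0.3, 0.1], [4.0, 0.4, 0.0]])
spam = np.array([[0.0, 0.1, 1.0], [0.1, 0.2, 2.0],
                 [0.0, 0.3, 3.0], [0.1, 0.4, 4.0]])

# One PCA model per class, each keeping a single component.
models = {
    "ham": fit_class_pcs(ham, n_components=1),
    "spam": fit_class_pcs(spam, n_components=1),
}

label, errors = classify(np.array([2.5, 0.25, 0.0]), models)
print(label, errors)
```

Keeping only a few components per class is what makes the comparison meaningful: with as many components as features, every class would reconstruct every document without loss and the errors would no longer discriminate.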
Description: The publisher is Elsevier, not North-Holland
ISSN: 0167-9473
Publication status: published
KU Leuven publication type: IT
Appears in Collections: Informatics Section
× corresponding author
# (joint) last author

Files in This Item:
File                          Description    Status      Size    Format
COMSTA_5119_PCADR_Gomez.pdf   Main article   Published   126Kb   Adobe PDF


