Title: Highly discriminative statistical features for email classification
Authors: Gomez, Juan Carlos ×
Boiy, Erik
Moens, Marie-Francine #
Issue Date: 2012
Publisher: Springer-Verlag London Ltd.
Series Title: Knowledge and Information Systems vol:31 issue:1 pages:23-53
Abstract: This paper reports on email classification and filtering, more specifically on spam versus ham and phishing versus spam classification, based on content features. We test the validity of several novel statistical feature extraction methods. The methods rely on dimensionality reduction in order to retain the most informative and discriminative features.
We successfully test our methods under two schemas. The first one is a classic classification scenario using 10-fold cross-validation on several corpora, including four standard ground-truth corpora (Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus) and one proprietary corpus. In the second schema, we test the anticipatory properties of our extracted features and classification models with two proprietary datasets, formed by phishing and spam emails sorted by date, and with the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods in the context of email classification on different benchmarking corpora, and evidence that the technique of Biased Discriminant Analysis in particular offers more discriminative features for classification, gives stable classification results regardless of the number of features chosen, and yields features that robustly retain their discriminative value over time and across data setups. These findings are especially useful in a commercial setting, where short profile rules are built based on a limited number of features for filtering emails.
ISSN: 0219-1377
Publication status: published
KU Leuven publication type: IT
Appears in Collections: Informatics Section
× corresponding author
# (joint) last author

Files in This Item:
File                    Description   Status     Size     Format
GomezetalKAIS2011.pdf   Main article  Published  3516 Kb  Adobe PDF

These files are only available to some KU Leuven Association staff members
