Knowledge and Information Systems vol:31 issue:1 pages:23-53
This paper reports on email classification and filtering, more specifically on spam versus ham and phishing versus spam classification, based on content features. We test the validity of several novel statistical feature extraction methods. The methods rely on dimensionality reduction in order to retain the most informative and discriminative features.
We successfully test our methods under two schemas. The first is a classic classification scenario using 10-fold cross-validation on several corpora: four standard ground-truth corpora (Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus) and one proprietary corpus. In the second schema we test the anticipatory properties of our extracted features and classification models with two proprietary datasets, formed by phishing and spam emails sorted by date, and with the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods in the context of email classification on different benchmarking corpora, and the evidence that Biased Discriminant Analysis in particular yields more discriminative features for the classification, gives classification results that remain stable regardless of the number of features chosen, and produces features that robustly retain their discriminative value over time and across data setups. These findings are especially useful in a commercial setting, where short profile rules are built from a limited number of features for filtering emails.
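The two evaluation schemas described above differ only in how messages are divided into training and test sets. The following is a minimal sketch of that difference, not the procedure used in the paper: the function names `k_fold_indices` and `chronological_split`, the `train_fraction` parameter, and the ISO-formatted date strings are all assumptions made for illustration, and details such as shuffling or class stratification are left out.

```python
from typing import List, Sequence, Tuple


def k_fold_indices(n_items: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return (train, test) index lists for k contiguous, near-equal folds.

    Illustrative only: a real experiment would usually shuffle (and often
    stratify by class) before cutting the folds.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    base, extra = divmod(n_items, k)
    folds = []
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional item each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        folds.append((train, test))
        start += size
    return folds


def chronological_split(
    dates: Sequence[str], train_fraction: float = 0.5
) -> Tuple[List[int], List[int]]:
    """Train on the oldest messages and test on the newer ones.

    Assumes `dates` are ISO strings (YYYY-MM-DD), which sort correctly
    as plain text.
    """
    order = sorted(range(len(dates)), key=lambda i: dates[i])
    cut = int(len(order) * train_fraction)
    return order[:cut], order[cut:]


folds = k_fold_indices(25, k=10)
train_idx, test_idx = chronological_split(
    ["2007-03-01", "2007-01-15", "2007-02-10", "2007-04-20"]
)
```

In the cross-validation schema every message is tested exactly once, whereas in the date-sorted schema the model never sees messages newer than those it is tested on, which is what makes it a test of anticipatory behaviour.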