SpamAssassin Public Email Corpus
Basic information
- Name: SpamAssassin Public Email Corpus
- Alias: SpamAssassin Public Corpus, SA Corpus
- Domain: NLP, Cybersecurity, Email Security
- Type: Text (emails)
Source
- URL: https://spamassassin.apache.org/old/publiccorpus/
- Paper: Apache SpamAssassin Project documentation
- Organization: Apache SpamAssassin Project
- Year: 2002-2005 (various releases)
Characteristics
- Size: 6,047 emails total (standard distribution)
- 1,897 spam emails
- 4,150 ham emails
- Some versions include ~1,000 spam + ~5,051 ham
- Split: Pre-divided collections (easy_ham, hard_ham, spam)
- Classes/Categories: Binary, spam vs. ham (legitimate)
- Format: One raw message per file (RFC 822 style, full headers), distributed as .tar.bz2 archives
- License: Apache License 2.0 (permissive open-source)
Description
The SpamAssassin Public Corpus is a standard benchmark dataset for email spam filtering research, created by the Apache SpamAssassin Project as part of the development of its popular open-source spam filter. The dataset was curated to represent realistic spam and ham emails at varying levels of detection difficulty.
Structure:
- easy_ham/: Clearly legitimate emails (low false positive risk)
- hard_ham/: Legitimate emails resembling spam (newsletters, marketing)
- spam/: Phishing and spam emails
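A minimal loading sketch for the folder layout above, using only the Python standard library. It assumes the archives have already been unpacked so that each folder holds one raw message per file; the `load_corpus` name, the label encoding (0 = ham, 1 = spam), and the skipping of a helper file called `cmds` are illustrative assumptions, not part of the corpus documentation.

```python
from email import message_from_bytes, policy
from pathlib import Path

# Assumed label encoding: both ham folders map to 0, spam maps to 1.
LABELS = {"easy_ham": 0, "hard_ham": 0, "spam": 1}


def load_corpus(root):
    """Parse every message file under root/<folder> into a small dict."""
    rows = []
    for folder, label in LABELS.items():
        directory = Path(root) / folder
        if not directory.is_dir():
            continue  # tolerate partial downloads
        for path in sorted(directory.iterdir()):
            # Assumption: a non-message helper file named "cmds" may be present.
            if not path.is_file() or path.name == "cmds":
                continue
            msg = message_from_bytes(path.read_bytes(), policy=policy.default)
            body = msg.get_body(preferencelist=("plain", "html"))
            try:
                text = body.get_content() if body is not None else ""
            except (LookupError, ValueError):
                text = ""  # unknown charset or undecodable payload
            rows.append({
                "folder": folder,
                "label": label,
                "subject": str(msg["subject"] or ""),
                "text": text,
            })
    return rows
```

Keeping the folder name alongside the label makes it possible to evaluate easy_ham and hard_ham separately later on.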
Linguistic characteristics: Mixed personal and business emails in several languages (primarily English), with varied spam tactics (Nigerian scams, pharmaceutical spam, phishing).
Advantages as a benchmark: Well-documented, pre-divided, widely used (>5,000 citations), and supports reproducible research.
Applications
- Email spam detection algorithm development
- Machine learning classifier training and evaluation
- Feature engineering research for text classification
- Baseline benchmarking for new methods
- Email security research
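As a rough illustration of the classifier-training and baseline use cases above, the sketch below implements a tiny bag-of-words multinomial Naive Bayes with add-one smoothing in pure Python. Naive Bayes is chosen here only because it is compact; it is not the method of any publication cited in this note, and the `tokenize` and `NaiveBayes` names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for c, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[c][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label
```

A model like this is useful mainly as a sanity check: a new method that cannot beat it on the same split probably has a pipeline problem.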
Used in publications
- Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection - Part of merged 82,486 email dataset; combined with Enron, Ling, CEAS, Nazario, Nigerian Fraud; SVM+TF-IDF achieved 99.1% accuracy
- [18] Gangavarapu et al. 2020 - Random Forest 98.4% accuracy on SpamAssassin+Nazario
- [26] Halgaš et al. 2020 - RNN classifier, combined with Enron and Nazario
- [30] Hijawi et al. 2017 - Feature extraction tool (140 features), Random Forest 99.3% accuracy
Benchmarks
| Model | Metric | Score | Year | Publication |
|---|---|---|---|---|
| SVM + TF-IDF | F1-score | 99.0% | 2024 | Al-Subaiey (merged dataset) |
| Random Forest | Accuracy | 99.3% | 2017 | Hijawi et al. (140 features) |
| Random Forest | Accuracy | 98.4% | 2020 | Gangavarapu (with Nazario) |
| RNN | F1-score | 98.63% | 2020 | Halgaš (SA-JN subset) |
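The rows above mix accuracy and F1-score and were measured on different dataset combinations, so they are not directly comparable. The sketch below (hypothetical helper name `accuracy_and_f1`, spam assumed to be the positive class) shows how the two metrics can diverge on imbalanced data.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 for the positive class) from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# A classifier that labels everything as ham on a 20%-spam sample:
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
print(accuracy_and_f1(y_true, y_pred))  # accuracy 0.8, F1 0.0
```

When comparing against the table, it helps to report both metrics and to state exactly which subsets were used.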
Notes
Advantages:
- Well-established benchmark (enables comparison)
- Pre-divided sets (easy/hard ham distinction useful)
- Moderate size (suitable for rapid experimentation)
- Apache license (permissive, commercial-friendly)
- Maintained by reputable organization (Apache)
Limitations:
- Relatively small (~6k emails vs modern datasets 100k+)
- Dated (2002-2005; phishing evolved significantly since)
- Limited diversity (primarily English, Western spam tactics)
- Class imbalance (more ham than spam in standard distribution)
- May not represent modern phishing (sophisticated social engineering, AI-generated content)
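The class imbalance mentioned above can be quantified from the spam and ham counts listed under the corpus size. The sketch below computes each class share and an inverse-frequency weight, total / (n_classes * class_count). That weighting is one common heuristic and an assumption here, not something prescribed by the corpus or the cited papers; `class_balance` is a hypothetical helper name.

```python
def class_balance(counts):
    """Return (share, weight) dicts for a mapping of class name to message count."""
    total = sum(counts.values())
    n_classes = len(counts)
    share = {label: n / total for label, n in counts.items()}
    # Inverse-frequency weights: rarer classes receive proportionally larger weights.
    weight = {label: total / (n_classes * n) for label, n in counts.items()}
    return share, weight


# Counts as listed for the standard distribution.
share, weight = class_balance({"spam": 1897, "ham": 4150})
print(share)   # spam is roughly 31% of the messages
print(weight)  # spam weight is above 1, ham weight below 1
```

Such weights can be passed to most classifiers that accept per-class weights, or used to decide how much to down-sample ham.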
Recommendations:
- Combine with recent datasets (e.g., 2020+) for temporal robustness
- Consider the hard_ham subset for testing false positive rates
- Useful as a baseline/sanity check, but insufficient alone for production systems
- Good starting point for educational purposes and algorithm prototyping
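The hard_ham recommendation above can be sketched as a per-folder false positive rate: the fraction of legitimate messages in each ham folder that a classifier flags as spam. The function name, the label encoding (1 = spam), and the parallel-list input format are assumptions made for illustration.

```python
from collections import Counter


def false_positive_rate_by_folder(folders, predictions, spam_label=1,
                                  ham_folders=("easy_ham", "hard_ham")):
    """Return {folder: share of its ham messages predicted as spam}."""
    totals, flagged = Counter(), Counter()
    for folder, pred in zip(folders, predictions):
        if folder not in ham_folders:
            continue  # spam messages cannot be false positives
        totals[folder] += 1
        if pred == spam_label:
            flagged[folder] += 1
    return {folder: flagged[folder] / totals[folder] for folder in totals}


# Toy predictions: the classifier is clean on easy_ham but flags half of hard_ham.
folders = ["easy_ham"] * 4 + ["hard_ham"] * 4 + ["spam"] * 2
predictions = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
print(false_positive_rate_by_folder(folders, predictions))
```

Reporting the two ham folders separately keeps a low overall false positive rate from hiding poor behaviour on newsletter-style mail.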
Common Variants:
- SA-JN: SpamAssassin combined with the Nazario phishing corpus (Halgaš 2020)
- Various subsets (easy_ham_2, spam_2) from different collection periods
Tags
dataset spam-detection phishing email-security nlp apache benchmark legacy-dataset