SpamAssassin Public Email Corpus

Basic information

  • Name: SpamAssassin Public Email Corpus
  • Alias: SpamAssassin Public Corpus, SA Corpus
  • Domain: NLP, Cybersecurity, Email Security
  • Type: Text (emails)

Source

Characteristics

  • Size: 6,047 emails total (standard distribution)
    • 1,897 spam emails
    • 4,150 ham emails
    • Some versions include ~1,000 spam + ~5,051 ham (~6,051 total)
  • Split: Pre-divided collections (easy_ham, hard_ham, spam)
  • Classes/Categories: Binary, spam vs ham (legitimate)
  • Format: Unix mbox format (.mbox), individual .eml files
  • License: Apache License 2.0 (permissive open-source)

Description

The SpamAssassin Public Corpus is a standard benchmark dataset for email spam filtering research, created by the Apache SpamAssassin Project as part of developing its popular open-source spam filter. The dataset was carefully curated to represent realistic spam and ham emails at varying levels of detection difficulty.

Structure:

  • easy_ham/: Clearly legitimate emails (low false positive risk)
  • hard_ham/: Legitimate emails resembling spam (newsletters, marketing)
  • spam/: Phishing and spam emails
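
The directory layout above can be turned into labelled examples with the Python standard library. The following is a minimal sketch, not an official loader: the helper names `extract_text` and `load_corpus` and the `corpus_root` argument are hypothetical, and it assumes the archives have already been extracted so that each directory holds one raw message per file. A copy stored as mbox would need the standard `mailbox` module instead.

```python
from email import policy
from email.parser import BytesParser
from pathlib import Path

# Directory name -> binary label (1 = spam, 0 = ham). The names follow the
# structure listed in this card; adjust them if your extracted copy differs.
LABELS = {"easy_ham": 0, "hard_ham": 0, "spam": 1}


def extract_text(raw: bytes) -> str:
    """Return the subject plus the first text body part of a raw message."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    subject = str(msg["subject"] or "")
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = ""
    if body_part is not None:
        try:
            body = body_part.get_content()
        except (LookupError, UnicodeError, ValueError):
            # Old spam often declares unknown or broken charsets.
            payload = body_part.get_payload(decode=True) or b""
            body = payload.decode("latin-1", "replace")
    return f"{subject}\n{body}"


def load_corpus(corpus_root: str) -> list[tuple[str, int, str]]:
    """Return (text, label, source_dir) for every message file found."""
    rows = []
    for dirname, label in LABELS.items():
        folder = Path(corpus_root) / dirname
        if not folder.is_dir():
            continue  # tolerate partial downloads
        for path in sorted(folder.iterdir()):
            # Assumption: any helper file named "cmds" is not a message.
            if path.is_file() and path.name != "cmds":
                rows.append((extract_text(path.read_bytes()), label, dirname))
    return rows
```

Keeping the source directory next to the label makes it possible to report results on hard_ham separately from easy_ham later on.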

Linguistic characteristics: A mix of personal and business emails in several languages (primarily English), with varied spam tactics (Nigerian scams, pharmaceutical spam, phishing).

Advantages as a benchmark: Well-documented, pre-divided, widely used (>5,000 citations), and suitable for reproducible research.

Applications

  • Email spam detection algorithm development
  • Machine learning classifier training and evaluation
  • Feature engineering research for text classification
  • Baseline benchmarking for new methods
  • Email security research
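
For the baseline benchmarking use case, a very small reference classifier is often enough. The sketch below is a multinomial Naive Bayes with add-one smoothing written with the standard library only. It is an illustrative choice, not the method of any publication cited in this card, and the class name, tokenizer, and toy messages are all hypothetical. It assumes the training data contains at least one example of each class.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9$!']+")


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return TOKEN.findall(text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in (0, 1):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokenize(text):
                if token in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy data standing in for real corpus messages (1 = spam, 0 = ham).
train_texts = [
    "win free money now",
    "cheap pills free offer",
    "meeting agenda for monday",
    "please review the attached patch",
]
train_labels = [1, 1, 0, 0]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("free money offer"))
print(model.predict("agenda for the meeting"))
```

On real data the same interface would be trained on one part of the corpus and evaluated on a held-out part, so that reported scores are not measured on training messages.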

Used in publications

  • Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection - Part of merged 82,486 email dataset; combined with Enron, Ling, CEAS, Nazario, Nigerian Fraud; SVM+TF-IDF achieved 99.1% accuracy
  • [18] Gangavarapu et al. 2020 - Random Forest 98.4% accuracy on SpamAssassin+Nazario
  • [26] Halgaš et al. 2020 - RNN classifier, combined with Enron and Nazario
  • [30] Hijawi et al. 2017 - Feature extraction tool (140 features), Random Forest 99.3% accuracy

Benchmarks

Model | Metric | Score | Year | Publication
SVM + TF-IDF | F1-score | 99.0% | 2024 | Al-Subaiey (merged dataset)
Random Forest | Accuracy | 99.3% | 2017 | Hijawi et al. (140 features)
Random Forest | Accuracy | 98.4% | 2020 | Gangavarapu (with Nazario)
RNN | F1-score | 98.63% | 2020 | Halgaš (SA-JN subset)
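
The scores in this section mix two metrics, accuracy and F1-score, which are not directly comparable. As a reminder of how each is derived from the same predictions, here is a small sketch using the standard definitions; the function name `binary_metrics` and the toy labels are hypothetical and do not come from any of the listed publications.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 with `positive` as the spam label."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("need at least one prediction")
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Ten toy messages: two spam (1), eight ham (0); one spam message is missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for name, value in binary_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

In this toy case accuracy is 0.900 while F1 is about 0.667, because missing half of the spam barely moves accuracy when ham dominates. The same effect is why rows reporting different metrics should be compared with care.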

Notes

Advantages:

  • Well-established benchmark (enables comparison)
  • Pre-divided sets (easy/hard ham distinction useful)
  • Moderate size (suitable for rapid experimentation)
  • Apache license (permissive, commercial-friendly)
  • Maintained by reputable organization (Apache)

Limitations:

  • Relatively small (~6k emails vs modern datasets 100k+)
  • Dated (2002-2005; phishing evolved significantly since)
  • Limited diversity (primarily English, Western spam tactics)
  • Class imbalance (more ham than spam in standard distribution)
  • May not represent modern phishing (sophisticated social engineering, AI-generated content)
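
The class imbalance noted above sets a floor that any reported accuracy should be read against. A short worked calculation, using only the spam and ham counts quoted for the standard distribution in this card (the variable names are illustrative):

```python
# Counts quoted for the standard distribution in this card.
N_SPAM = 1_897
N_HAM = 4_150

total = N_SPAM + N_HAM
spam_ratio = N_SPAM / total

# A trivial classifier that labels every message as ham is correct on all
# ham messages and wrong on all spam messages.
majority_accuracy = N_HAM / total

print(f"total messages:    {total}")
print(f"spam ratio:        {spam_ratio:.3f}")
print(f"majority accuracy: {majority_accuracy:.3f}")
```

With these counts roughly 31% of messages are spam, so a filter that never flags anything already reaches about 68.6% accuracy while catching no spam at all. Precision, recall, or F1 on the spam class give a more informative picture than accuracy alone.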

Recommendations:

  • Combine with recent datasets (e.g., 2020+) for temporal robustness
  • Consider the hard_ham subset for testing false positive rates
  • Useful as a baseline/sanity check, but insufficient alone for production systems
  • Good starting point for educational purposes and algorithm prototyping
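
The recommendation to test false positive rates on hard_ham can be sketched as follows. The helper `false_positive_rate` is hypothetical, the keyword rule is a deliberately naive stand-in for a real classifier, and the example messages are invented rather than taken from the corpus.

```python
from typing import Callable, Iterable


def false_positive_rate(predict: Callable[[str], int],
                        ham_texts: Iterable[str]) -> float:
    """Share of legitimate messages that `predict` flags as spam (label 1)."""
    texts = list(ham_texts)
    if not texts:
        raise ValueError("need at least one ham message")
    flagged = sum(1 for text in texts if predict(text) == 1)
    return flagged / len(texts)


def keyword_filter(text: str) -> int:
    """Stand-in classifier: flags anything mentioning 'free' or 'offer'."""
    lowered = text.lower()
    return int("free" in lowered or "offer" in lowered)


# Invented examples in the spirit of the two ham collections.
easy_ham = [
    "Notes from Tuesday's meeting",
    "Patch for the build script attached",
]
hard_ham = [
    "Weekly newsletter: special offer for subscribers",
    "Your invoice is attached",
]

print("easy_ham FPR:", false_positive_rate(keyword_filter, easy_ham))
print("hard_ham FPR:", false_positive_rate(keyword_filter, hard_ham))
```

Reporting the rate per collection rather than over all ham shows whether a filter that looks clean on easy_ham starts blocking newsletters and marketing mail, which is the failure hard_ham is meant to expose.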

Common Variants:

  • SA-JN: SpamAssassin combined with the Nazario phishing corpus (Halgaš 2020)
  • Various subsets (easy_ham_2, spam_2) from different collection periods

Tags

dataset spam-detection phishing email-security nlp apache benchmark legacy-dataset