Enron Email Corpus

Basic information

  • Name: Enron Email Corpus
  • Alias: Enron Spam Dataset, Enron-Spam
  • Domain: NLP, Cybersecurity
  • Type: Text (emails)

Source

  • URL: http://www.aueb.gr/users/ion/data/enron-spam/ (hosts the labeled Enron-Spam subset, not the full corpus)
  • Paper: Klimt, B., & Yang, Y. (2004). The Enron Corpus: A New Dataset for Email Classification Research. ECML 2004.
  • Organization: Originally from Enron Corporation; prepared for research by CMU
  • Year: 2004 (original emails from 1999-2002)

Characteristics

  • Size: ~500,000 emails in the full corpus; spam/ham subsets vary by version
    • Typical research subset: 3,000-20,000 emails
    • Al-Subaiey 2024: included as part of a merged dataset
  • Split: Varies by version; typically user-created train/test splits
  • Classes/Categories: Binary - spam (phishing/fraud) vs ham (legitimate)
  • Format: Plain text (.txt), some versions in .eml
  • License: Public domain (released by FERC after the Enron bankruptcy)
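
Because splits are typically user-created, it helps to make the split deterministic. The following is a minimal Python sketch, assuming a hypothetical layout with `spam/` and `ham/` folders of .txt files under one root (real layouts differ by version). The function names `load_labeled_emails` and `stratified_split` are illustrative and not part of any official tooling.

```python
import random
from pathlib import Path


def load_labeled_emails(root):
    """Read every .txt file under root/spam and root/ham into (text, label) pairs.

    The spam/ and ham/ folder names are an assumption about the layout.
    """
    records = []
    for label in ("spam", "ham"):
        for path in sorted(Path(root, label).rglob("*.txt")):
            # Old mailboxes mix encodings; latin-1 decodes any byte sequence.
            records.append((path.read_text(encoding="latin-1"), label))
    return records


def stratified_split(records, test_fraction=0.2, seed=0):
    """Shuffle each class separately so train and test keep a similar class ratio."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for label in ("spam", "ham"):
        group = [r for r in records if r[1] == label]
        rng.shuffle(group)
        cut = int(len(group) * test_fraction)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test
```

Reporting the seed and `test_fraction` alongside the subset name lets others rebuild the same split.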

Description

The Enron Corpus was built from emails of Enron Corporation employees that were made public after the company's financial scandal and the FERC (Federal Energy Regulatory Commission) investigation. It is one of the oldest and most widely used datasets for email spam and phishing detection research.

The corpus contains authentic emails from real mailboxes, which provides realistic linguistic patterns and organizational context. For spam detection, researchers typically use a subset labeled as spam vs ham.

Linguistic characteristics: professional business emails, formal language, varied topics (energy trading, internal communications, business correspondence).

Historical significance: first large-scale public email corpus enabling reproducible research; widely cited in the literature (>10,000 citations).

Applications

  • Email spam/phishing detection
  • Email classification (topics, urgency, sentiment)
  • Social network analysis (organizational communication patterns)
  • Natural language processing benchmarks
  • Business email compromise (BEC) research

Used in publications

  • Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection (Al-Subaiey 2024) - Enron used as part of a merged dataset (82,486 emails total), combined with the Ling, CEAS, SpamAssassin, Nazario, and Nigerian Fraud datasets; 99.1% accuracy reported
  • Multiple prior works referenced in Al-Subaiey 2024: [25] Ma et al. 2020 (6,000 emails subset), [26] Halgaš et al. 2020, [27] Gibson et al. 2020, [28] Mohammad 2024 (17,171 spam + 16,545 ham)

Benchmarks

Model               Metric     Score   Year   Publication
SVM + TF-IDF        F1-score   99.0%   2024   Al-Subaiey (merged dataset)
SVM                 Accuracy   95.5%   2020   Ma et al. (6k subset)
ELCADP (ensemble)   F1-score   95.1%   2024   Mohammad (17k spam subset)
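
The benchmark rows report different metrics (F1-score and accuracy) on different subsets, so the scores are not directly comparable. As an illustration only, the sketch below computes both metrics in plain Python on invented toy labels; it is not the evaluation code of any cited paper, and treating spam as the positive class is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive="spam"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented toy data: 8 ham and 2 spam, with one spam message missed.
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 9 + ["spam"]
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

On this imbalanced toy example accuracy is 0.9 while F1 is about 0.67, which shows why the metric and the class balance of the subset should be stated together.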

Notes

Advantages:

  • Real-world authentic emails (not synthetic)
  • Large corpus enabling diverse research
  • Public domain - no licensing restrictions
  • Widely used benchmark (enables comparison across studies)

Limitations:

  • Emails from 1999-2002 (dated; phishing tactics evolved significantly)
  • Professional/corporate context (may not generalize to consumer emails)
  • Original corpus not specifically labeled for spam - researchers created various spam/ham subsets
  • Privacy concerns addressed by FERC redaction, but some sensitive info may remain

Best Practices:

  • Specify exact subset and version used (many variants exist)
  • Combine with recent datasets for temporal generalization
  • Acknowledge age of corpus when discussing modern phishing threats
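
The first two practices can be sketched in Python as follows. This is one possible approach under stated assumptions, not an established convention: `subset_fingerprint` and `merge_with_source` are hypothetical helper names, and exact-text hashing is only a simple stand-in for whatever deduplication a given study uses.

```python
import hashlib


def subset_fingerprint(emails):
    """Order-independent SHA-256 digest identifying an exact set of email texts."""
    digests = sorted(hashlib.sha256(e.encode("utf-8")).hexdigest() for e in emails)
    return hashlib.sha256("\n".join(digests).encode("ascii")).hexdigest()


def merge_with_source(datasets):
    """Merge {name: [(text, label), ...]} into one list, tagging each record
    with its source dataset and dropping exact duplicate texts."""
    seen = set()
    merged = []
    for name, records in datasets.items():
        for text, label in records:
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if key in seen:
                continue  # keep the first occurrence only
            seen.add(key)
            merged.append({"text": text, "label": label, "source": name})
    return merged
```

Publishing the fingerprint with the results pins down which variant was used, and keeping the `source` field makes it possible to train on the Enron-era records and evaluate on a more recent source separately.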

Tags

dataset nlp spam-detection phishing email-classification public-domain legacy-dataset