Enron Email Corpus
Basic information
- Name: Enron Email Corpus
- Alias: Enron Spam Dataset, Enron-Spam
- Domain: NLP, Cybersecurity
- Type: Text (emails)
Source
- URL: http://www.aueb.gr/users/ion/data/enron-spam/
- Paper: Klimt, B., & Yang, Y. (2004). The Enron Corpus: A New Dataset for Email Classification Research. ECML 2004.
- Organization: Originally from Enron Corporation; prepared for research by CMU
- Year: 2004 (original emails from 1999-2002)
Characteristics
- Size: ~500,000 emails in the full corpus; spam/ham subsets vary by version
  - Typical research subset: 3,000-20,000 emails
  - Al-Subaiey 2024: included as part of a merged dataset
- Split: Varies by version; typically user-created train/test splits
- Classes/Categories: Binary - spam (phishing/fraud) vs ham (legitimate)
- Format: Plain text (.txt), some versions in .eml
- License: Public domain (released by FERC after the Enron bankruptcy)
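Because the train/test split is left to the user, results are only comparable when the split is reproducible. The following is a minimal sketch, not an official loader: it assumes a local copy laid out as `ham/` and `spam/` folders of `.txt` files, and the names `load_labeled_emails` and `stratified_split` are hypothetical helpers introduced here for illustration.

```python
import random
from pathlib import Path


def load_labeled_emails(root):
    """Read every .txt file under root/ham and root/spam into (text, label) pairs.

    The ham/ and spam/ folder names are an assumption about the local layout.
    """
    samples = []
    for label in ("ham", "spam"):
        for path in sorted(Path(root, label).glob("*.txt")):
            # Old mail often contains bytes that are not valid UTF-8;
            # latin-1 maps every byte to a character, so decoding never fails.
            samples.append((path.read_text(encoding="latin-1"), label))
    return samples


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split per class so train and test keep roughly the same spam/ham ratio."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Reporting the seed and the test fraction alongside the subset name lets other researchers rebuild the same partition.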
Description
The Enron Corpus originated from emails of Enron Corporation employees that were made public after the company's financial scandal and the FERC (Federal Energy Regulatory Commission) investigation. It is one of the oldest and most widely used datasets for email spam and phishing detection research.
The corpus contains authentic emails from real mailboxes, which provides realistic linguistic patterns and organizational context. For spam detection, researchers typically use a subset labeled as spam vs ham.
Linguistic characteristics: professional business emails, formal language, varied topics (energy trading, internal communications, business correspondence).
Historical significance: the first large-scale public email corpus to enable reproducible research; widely cited in the literature (>10,000 citations).
Applications
- Email spam/phishing detection
- Email classification (topics, urgency, sentiment)
- Social network analysis (organizational communication patterns)
- Natural language processing benchmarks
- Business email compromise (BEC) research
Used in publications
- Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection - Part of merged dataset (82,486 emails total); combined with Ling, CEAS, SpamAssassin, Nazario, Nigerian Fraud datasets; 99.1% accuracy achieved
- Multiple prior works referenced in Al-Subaiey 2024: [25] Ma et al. 2020 (6,000 emails subset), [26] Halgaš et al. 2020, [27] Gibson et al. 2020, [28] Mohammad 2024 (17,171 spam + 16,545 ham)
Benchmarks
| Model | Metric | Score | Year | Publication |
|---|---|---|---|---|
| SVM + TF-IDF | F1-score | 99.0% | 2024 | Al-Subaiey (merged dataset) |
| SVM | Accuracy | 95.5% | 2020 | Ma et al. (6k subset) |
| ELCADP (ensemble) | F1-score | 95.1% | 2024 | Mohammad (17k spam subset) |
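The rows above report different metrics (accuracy vs F1-score) on different subsets, so the scores are not directly comparable. As a rough illustration of how far the two metrics can diverge on the same predictions, here is a small sketch; the function name `binary_metrics` and the toy labels are made up for this example and are not taken from any of the cited papers.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data (made up): 90 ham, 10 spam, and a classifier that always predicts ham.
y_true = ["ham"] * 90 + ["spam"] * 10
y_pred = ["ham"] * 100
print(binary_metrics(y_true, y_pred))  # accuracy is 0.9 while F1 for spam is 0.0
```

On an imbalanced subset a high accuracy can therefore hide poor spam recall, which is one reason to report the class balance together with the metric.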
Notes
Advantages:
- Real-world authentic emails (not synthetic)
- Large corpus enabling diverse research
- Public domain - no licensing restrictions
- Widely used benchmark (enables comparison across studies)
Limitations:
- Emails from 1999-2002 (dated; phishing tactics evolved significantly)
- Professional/corporate context (may not generalize to consumer emails)
- Original corpus not specifically labeled for spam - researchers created various spam/ham subsets
- Privacy concerns addressed by FERC redaction, but some sensitive info may remain
Best Practices:
- Specify exact subset and version used (many variants exist)
- Combine with recent datasets for temporal generalization
- Acknowledge age of corpus when discussing modern phishing threats
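One way to act on the first practice above is to publish a small manifest of the exact subset used. The sketch below is only a suggestion under stated assumptions: `describe_subset` is a hypothetical helper, the manifest fields are not a community standard, and the samples are assumed to be `(text, label)` pairs already loaded in memory.

```python
import hashlib
import json


def describe_subset(samples, name, source_url):
    """Summarize a list of (text, label) pairs as a small, reportable manifest."""
    counts = {}
    digest = hashlib.sha256()
    # Sort so the fingerprint does not depend on the order the files were read in.
    for text, label in sorted(samples):
        counts[label] = counts.get(label, 0) + 1
        digest.update(label.encode("utf-8"))
        digest.update(b"\0")  # separator so adjacent fields cannot run together
        digest.update(text.encode("utf-8"))
        digest.update(b"\0")
    return {
        "name": name,
        "source_url": source_url,
        "n_emails": len(samples),
        "class_counts": counts,
        "sha256": digest.hexdigest(),
    }


# Toy samples (made up) standing in for a real subset.
toy = [("Meeting moved to 3pm", "ham"), ("You won a prize", "spam")]
print(json.dumps(describe_subset(toy, "toy-subset", "local copy"), indent=2))
```

Two studies that report the same hash and class counts can be reasonably confident they evaluated on the same emails.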
Tags
dataset nlp spam-detection phishing email-classification public-domain legacy-dataset