PhishTank 2020 Dataset (Custom Collection)
Basic information
- Name: PhishTank 2020 Dataset
- Alias: PhishTank Custom Collection, PhishChain Evaluation Dataset
- Domain: Cybersecurity, URL Classification
- Type: URL data with verification metadata
Source
- URL: https://www.phishtank.com/ (original platform)
- Paper: PhishChain: A Decentralized and Transparent System to Blacklist Phishing URLs (WWW 2022)
- Organization: Custom collection scraped from the PhishTank platform by the PhishChain authors
- Year: 2020 (collection period: January 1 to December 23, 2020)
Characteristics
- Size: 23,000 URLs total
- Raw data: 17,000 phishing URLs, 6,000 non-phishing URLs
- Balanced evaluation set: 12,000 URLs (6,000 phishing + 6,000 non-phishing)
- Split: Custom balanced subset for evaluation (6k/6k split)
- Classes/Categories: Binary, phishing vs non-phishing (legitimate)
- Format: Scraped metadata including verifiers, verification order, timestamps
- License: PhishTank data; check the PhishTank terms of use
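The balanced evaluation set listed above can be illustrated with a simple per-class downsampling step. This is a minimal sketch only: the source does not say how the 6,000 phishing URLs were chosen from the 17,000 available, so uniform random sampling, the `balanced_subset` helper, the `label` field name, and the synthetic `raw` records are all assumptions.

```python
import random


def balanced_subset(records, per_class=6000, seed=0):
    """Downsample every class to `per_class` records (hypothetical helper)."""
    rng = random.Random(seed)

    # Group records by their class label.
    by_label = {}
    for rec in records:
        by_label.setdefault(rec["label"], []).append(rec)

    subset = []
    for label, group in sorted(by_label.items()):
        if len(group) < per_class:
            raise ValueError(f"class {label!r} has only {len(group)} records")
        # Assumption: uniform random sampling without replacement.
        subset.extend(rng.sample(group, per_class))

    rng.shuffle(subset)
    return subset


# Synthetic stand-in for the raw collection: 17,000 phishing, 6,000 non-phishing.
raw = (
    [{"url": f"http://phish{i}.example", "label": "phishing"} for i in range(17000)]
    + [{"url": f"http://legit{i}.example", "label": "non-phishing"} for i in range(6000)]
)
balanced = balanced_subset(raw)
print(len(balanced))
```

With these sizes the non-phishing class is kept whole and only the phishing class is actually reduced.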
Description
The PhishTank 2020 Dataset is a custom collection created by the authors of the PhishChain paper from the PhishTank platform. It contains URLs together with crowd-sourced verification metadata: the list of verifiers for each URL, the order of verification (temporal ordering), and the crowd-sourced labels (phishing/non-phishing).
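One way to picture a single record is sketched below. The dataset's real schema is not published, so the class names, the field names, and the presence of a per-verifier vote are illustrative assumptions rather than the authors' format.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Verification:
    verifier: str        # verifier identity (hypothetical field name)
    vote: str            # "phishing" or "non-phishing" (assumed to be available)
    timestamp: datetime  # when the vote was cast


@dataclass
class UrlRecord:
    url: str
    label: str  # final crowd-sourced label
    verifications: list[Verification] = field(default_factory=list)

    def ordered_verifiers(self) -> list[str]:
        """Verifier names in the order their votes were cast."""
        ordered = sorted(self.verifications, key=lambda v: v.timestamp)
        return [v.verifier for v in ordered]


# Made-up example record; the votes are deliberately stored out of order.
record = UrlRecord(
    url="http://login-update.example/verify",
    label="phishing",
    verifications=[
        Verification("bob", "phishing", datetime(2020, 3, 2, 10, 30)),
        Verification("alice", "phishing", datetime(2020, 3, 1, 9, 0)),
    ],
)
print(record.ordered_verifiers())  # ['alice', 'bob']
```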
Unique characteristic: Unlike standard PhishTank API access, this dataset additionally contains scraped temporal verification data: who verified each URL and in what order. This information was critical for testing the PageRank-based truth discovery algorithm in PhishChain.
Sparse verification problem: A retrospective analysis of PhishTank shows that only a handful of verifiers verify each URL, despite thousands of registered users. This was a key motivation for the PageRank-based approach over traditional truth discovery algorithms (EM, GLAD), which assume majority verification.
Collection period: The data was collected from the PhishTank platform for URLs submitted/verified between January 1 and December 23, 2020, so it represents the phishing landscape before widespread AI-generated phishing (2023+).
Applications
- Truth discovery algorithm evaluation for sparse crowd-sourcing scenarios
- Phishing URL detection benchmarking
- Crowd-sourced verification system research
- Temporal verification pattern analysis
- Verifier behavior modeling
Used in publications
- PhishChain: A Decentralized and Transparent System to Blacklist Phishing URLs - Evaluation dataset for the PageRank-based truth discovery algorithm; balanced 12k subset (6k phishing + 6k non-phishing); achieved 95.45% accuracy, outperforming EM (93.71%) and GLAD (93.98%)
Benchmarks
| Model | Metric | Score | Year | Publication |
|---|---|---|---|---|
| PageRank-based Truth Discovery | Accuracy | 95.45% | 2022 | PhishChain (WWW 2022) |
| PageRank-based Truth Discovery | Precision | 96.74% | 2022 | PhishChain (WWW 2022) |
| PageRank-based Truth Discovery | Recall | 94.31% | 2022 | PhishChain (WWW 2022) |
| EM (Expectation Maximization) | Accuracy | 93.71% | 2022 | PhishChain baseline |
| GLAD | Accuracy | 93.98% | 2022 | PhishChain baseline |
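For reference, the three metrics in the table follow from a binary confusion matrix with phishing as the positive class. The sketch below uses hypothetical counts on a 6,000 + 6,000 balanced set; they are not taken from the PhishChain paper and are not meant to reproduce its scores.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, and recall for the positive (phishing) class."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts, NOT results from the paper:
# 6,000 phishing URLs (tp + fn) and 6,000 non-phishing URLs (tn + fp).
metrics = binary_metrics(tp=5700, fp=200, tn=5800, fn=300)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```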
Notes
Advantages:
- Real-world crowd-sourced data (not synthetic)
- Temporal verification metadata (enables time-based analysis)
- Balanced evaluation set (avoids class imbalance bias)
- Sparse verification pattern (realistic scenario for crowd-sourcing research)
- Multiple verifiers per URL (enables truth discovery algorithm testing)
Limitations:
- Dated collection period (2020; phishing tactics evolved significantly 2020-2026)
- Not publicly available as standalone dataset (custom scrape by authors)
- PhishTank platform limitations acknowledged (inconsistent labeling, lack of transparency)
- Limited to URLs (no email content, no full webpage screenshots)
- Class imbalance in the raw data (17k phishing vs 6k non-phishing)
Comparison to standard PhishTank:
- Standard PhishTank API: binary labels (phishing/not phishing)
- This dataset: labels + temporal verification metadata + verifier identities
- Enables verifier-verifier graph construction for PageRank-based truth discovery
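The general idea of a verifier-verifier graph can be sketched as follows. This is not the PhishChain algorithm, whose exact graph construction and scoring are not described here. The edge rule (a later verifier endorses each earlier verifier who cast the same vote), the reputation-weighted score, and all function names are assumptions chosen only to make the idea concrete.

```python
from collections import defaultdict


def build_graph(url_votes):
    """url_votes maps url -> [(verifier, vote), ...] in temporal order.

    Assumed edge rule: later -> earlier whenever both cast the same vote.
    """
    edges = defaultdict(float)
    nodes = set()
    for votes in url_votes.values():
        for j, (later, vote_j) in enumerate(votes):
            nodes.add(later)
            for earlier, vote_i in votes[:j]:
                if vote_i == vote_j and earlier != later:
                    edges[(later, earlier)] += 1.0
    return sorted(nodes), edges


def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration."""
    n = len(nodes)
    out_weight = defaultdict(float)
    for (src, _), w in edges.items():
        out_weight[src] += w
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0.0)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for (src, dst), w in edges.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank


def phishing_score(votes, rank):
    """Reputation-weighted share of 'phishing' votes for one URL."""
    total = sum(rank[v] for v, _ in votes)
    return sum(rank[v] for v, vote in votes if vote == "phishing") / total


# Toy data with made-up verifiers; real URLs typically have only a few votes.
url_votes = {
    "http://a.example": [("alice", "phishing"), ("bob", "phishing"), ("carol", "phishing")],
    "http://b.example": [("alice", "non-phishing"), ("carol", "non-phishing")],
    "http://c.example": [("alice", "phishing"), ("dave", "non-phishing")],
}
nodes, edges = build_graph(url_votes)
rank = pagerank(nodes, edges)
scores = {url: phishing_score(votes, rank) for url, votes in url_votes.items()}
```

In this toy data `alice` is endorsed by both `bob` and `carol`, so her vote outweighs `dave`'s on the disputed URL even though the raw vote is tied one to one. A plain majority vote cannot break that tie, which is the kind of sparse case a reputation-based approach targets.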
PhishTank Transparency Issues (documented in the PhishChain paper):
- URL not marked as phishing despite all 5 verifiers agreeing it was phishing
- URL marked safe despite zero verification
- Decision process is opaque, which motivated a blockchain-based transparent alternative
Recommendation:
- Use for sparse crowd-sourcing algorithm research
- Combine with recent datasets (2024+) for temporal robustness testing
- Consider PhishTank platform evolution (API changes, verification process updates)
Tags
dataset phishing-detection url-classification cybersecurity crowd-sourcing truth-discovery phishtank legacy-dataset verification-metadata