PhishTank 2020 Dataset (Custom Collection)

Basic information

Name: PhishTank 2020 Dataset
Alias: PhishTank Custom Collection, PhishChain Evaluation Dataset
Domain: Cybersecurity, URL Classification
Type: URL data with verification metadata

Source

URL: https://www.phishtank.com/ (original platform)
Paper: PhishChain: A Decentralized and Transparent System to Blacklist Phishing URLs (WWW 2022)
Organization: Custom collection by the PhishChain authors from the PhishTank platform
Year: 2020 (collection period: January 1 - December 23, 2020)

Characteristics

Size: 23,000 URLs total
- Raw data: 17,000 phishing URLs, 6,000 non-phishing URLs
- Balanced evaluation set: 12,000 URLs (6,000 phishing + 6,000 non-phishing)
Split: Custom balanced subset for evaluation (6k/6k split)
Classes/Categories: Binary - phishing vs non-phishing (legitimate)
Format: Scraped metadata including verifiers, verification order, timestamps
License: PhishTank data (check the PhishTank terms of use)
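
A balanced set of this shape can be obtained by downsampling the larger class. The sketch below is only an illustration: this note does not say how the 6,000 phishing URLs were chosen, so uniform random sampling with a fixed seed is an assumption, and the function name `balanced_subset`, the `(url, label)` tuple layout, and the synthetic URLs are hypothetical.

```python
import random

def balanced_subset(records, seed=0):
    """Downsample the larger class so both classes have equal size.

    records: list of (url, label) pairs; label is "phishing" or "non-phishing".
    """
    phishing = [r for r in records if r[1] == "phishing"]
    legit = [r for r in records if r[1] == "non-phishing"]
    n = min(len(phishing), len(legit))
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = rng.sample(phishing, n) + rng.sample(legit, n)
    rng.shuffle(subset)
    return subset

# Synthetic stand-in with the raw class sizes listed above (17,000 / 6,000).
raw = [(f"http://p{i}.example", "phishing") for i in range(17000)]
raw += [(f"http://l{i}.example", "non-phishing") for i in range(6000)]
subset = balanced_subset(raw)
print(len(subset))  # 12000
```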

Description

The PhishTank 2020 Dataset is a custom collection created by the authors of the PhishChain paper from the PhishTank platform. It contains URLs together with crowd-sourced verification metadata: the list of verifiers for each URL, the verification order (temporal ordering), and the crowd-sourced labels (phishing/non-phishing).

Distinguishing feature: Unlike standard PhishTank API access, this dataset additionally contains scraped temporal verification data, that is, who verified each URL and in what order. This information was critical for testing the PageRank-based truth discovery algorithm in PhishChain.
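
One way to picture a single entry is as a URL with a final label and a list of timestamped votes. The layout below is a hypothetical sketch, not the authors' schema: the class names `Verification` and `UrlRecord`, every field name, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Verification:
    verifier_id: str     # hypothetical identifier of a PhishTank user
    vote: bool           # True = voted "phishing"
    timestamp: datetime  # when the vote was cast

@dataclass
class UrlRecord:
    url: str
    label: bool  # crowd-sourced final label, True = phishing
    verifications: list = field(default_factory=list)

    def ordered_verifiers(self):
        """Return verifier ids in the order in which they voted."""
        ranked = sorted(self.verifications, key=lambda v: v.timestamp)
        return [v.verifier_id for v in ranked]

record = UrlRecord(
    url="http://login.example.test/verify",
    label=True,
    verifications=[
        Verification("user_b", True, datetime(2020, 3, 2, 12, 0)),
        Verification("user_a", True, datetime(2020, 3, 2, 9, 30)),
    ],
)
print(record.ordered_verifiers())  # ['user_a', 'user_b']
```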

Sparse verification problem: A retrospective analysis of PhishTank shows that only a handful of verifiers check each URL, despite thousands of registered users. This was the key motivation for a PageRank-based approach instead of traditional truth discovery algorithms (EM, GLAD), which assume majority verification.

Collection period: The data was collected from the PhishTank platform for URLs submitted or verified between January 1 and December 23, 2020, so it represents the phishing landscape before widespread AI-generated phishing (2023+).

Applications

Truth discovery algorithm evaluation for sparse crowd-sourcing scenarios
Phishing URL detection benchmarking
Crowd-sourced verification system research
Temporal verification pattern analysis
Verifier behavior modeling

Used in publications

PhishChain: A Decentralized and Transparent System to Blacklist Phishing URLs - Evaluation dataset for the PageRank-based truth discovery algorithm; balanced 12k subset (6k phishing + 6k non-phishing); achieved 95.45% accuracy, outperforming EM (93.71%) and GLAD (93.98%)

Benchmarks

Model	Metric	Score	Year	Publication
PageRank-based Truth Discovery	Accuracy	95.45%	2022	PhishChain (WWW 2022)
PageRank-based Truth Discovery	Precision	96.74%	2022	PhishChain (WWW 2022)
PageRank-based Truth Discovery	Recall	94.31%	2022	PhishChain (WWW 2022)
EM (Expectation Maximization)	Accuracy	93.71%	2022	PhishChain baseline
GLAD	Accuracy	93.98%	2022	PhishChain baseline
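
For reference, the three metrics in the table can be computed from predicted and true labels as sketched below. This is a generic illustration, not the paper's evaluation code: treating phishing as the positive class is an assumption, and the function name `binary_metrics` and the six toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall with phishing (True) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)              # phishing, flagged
    fp = sum((not t) and p for t, p in pairs)        # legitimate, flagged
    fn = sum(t and (not p) for t, p in pairs)        # phishing, missed
    tn = sum((not t) and (not p) for t, p in pairs)  # legitimate, passed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy labels only; these are not values from the dataset.
y_true = [True, True, True, False, False, False]
y_pred = [True, True, False, False, False, True]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
```

In this toy case there are 2 true positives, 1 false positive, 1 false negative, and 2 true negatives, so all three metrics equal 2/3.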

Notes

Advantages:

Real-world crowd-sourced data (not synthetic)
Temporal verification metadata (enables time-based analysis)
Balanced evaluation set (avoids class imbalance bias)
Sparse verification pattern (realistic scenario for crowd-sourcing research)
Multiple verifiers per URL (enables truth discovery algorithm testing)

Limitations:

Dated collection period (2020; phishing tactics evolved significantly 2020-2026)
Not publicly available as standalone dataset (custom scrape by authors)
PhishTank platform limitations acknowledged (inconsistent labeling, lack of transparency)
Limited to URLs (no email content, no full webpage screenshots)
Class imbalance in the raw data (17k phishing vs 6k non-phishing)

Comparison to standard PhishTank:

Standard PhishTank API: binary labels (phishing/not phishing)
This dataset: labels + temporal verification metadata + verifier identities
Enables verifier-verifier graph construction for PageRank-based truth discovery
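
The idea of ranking verifiers on a verifier-verifier graph can be sketched as follows. This is not the PhishChain algorithm, whose exact graph construction is not reproduced in this note. The sketch assumes a hypothetical endorsement rule (a later verifier who casts the same vote as an earlier one adds an edge later -> earlier), runs standard power-iteration PageRank, and labels a URL by reputation-weighted vote. All function names and the toy votes are illustrative.

```python
from collections import defaultdict

def build_graph(votes_by_url):
    """votes_by_url: {url: [(verifier, vote), ...]} in verification order.

    Assumed rule: agreeing with an earlier verifier adds an edge later -> earlier.
    """
    edges = defaultdict(set)
    nodes = set()
    for votes in votes_by_url.values():
        for i, (later, later_vote) in enumerate(votes):
            nodes.add(later)
            for earlier, earlier_vote in votes[:i]:
                if earlier != later and earlier_vote == later_vote:
                    edges[later].add(earlier)
    return nodes, edges

def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Standard power-iteration PageRank; dangling mass is spread uniformly."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not edges.get(v))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in edges.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new[dst] += share
        rank = new
    return rank

def weighted_label(votes, rank):
    """Label a URL phishing if reputation-weighted phishing votes dominate."""
    score = sum(rank[v] if vote else -rank[v] for v, vote in votes)
    return score > 0

# Toy data: True = voted "phishing".
votes_by_url = {
    "u1": [("alice", True), ("bob", True), ("carol", False)],
    "u2": [("alice", True), ("carol", True)],
    "u3": [("bob", False), ("carol", True), ("alice", False)],
}
nodes, edges = build_graph(votes_by_url)
rank = pagerank(nodes, edges)
labels = {url: weighted_label(v, rank) for url, v in votes_by_url.items()}
```

In the toy data, alice is endorsed by both bob and carol and ends up with the highest score, so on u3 the two agreeing high-reputation votes outweigh carol's dissent even though only three verifiers are involved.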

PhishTank Transparency Issues (documented in the PhishChain paper):

URL whose published label contradicted the unanimous vote of all 5 verifiers
URL marked safe despite zero verification
Decision process opaque → motivation for a blockchain-based transparent alternative

Recommendation:

Use for sparse crowd-sourcing algorithm research
Combine with recent datasets (2024+) for temporal robustness testing
Consider PhishTank platform evolution (API changes, verification process updates)

Tags

dataset phishing-detection url-classification cybersecurity crowd-sourcing truth-discovery phishtank legacy-dataset verification-metadata
