PhishTank 2020 Dataset (Custom Collection)

Basic information

  • Name: PhishTank 2020 Dataset
  • Alias: PhishTank Custom Collection, PhishChain Evaluation Dataset
  • Domain: Cybersecurity, URL Classification
  • Type: URL data with verification metadata

Source

  • URL: https://www.phishtank.com/ (original platform)
  • Paper: PhishChain: A Decentralized and Transparent System to Blacklist Phishing URLs (WWW 2022)
  • Organization: Custom collection by the PhishChain authors from the PhishTank platform
  • Year: 2020 (collection period: January 1 to December 23, 2020)

Characteristics

  • Size: 23,000 URLs total
    • Raw data: 17,000 phishing URLs, 6,000 non-phishing URLs
    • Balanced evaluation set: 12,000 URLs (6,000 phishing + 6,000 non-phishing)
  • Split: Custom balanced subset for evaluation (6k/6k split)
  • Classes/Categories: Binary, phishing vs non-phishing (legitimate)
  • Format: Scraped metadata including verifiers, verification order, timestamps
  • License: PhishTank data (check PhishTank terms of use)
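
The 6k/6k evaluation split can be illustrated with a minimal sketch. The source does not say how the 6,000 phishing URLs were chosen from the 17,000 available, so uniform random downsampling is an assumption here. The `UrlRecord` schema and its field names are likewise hypothetical and are not the authors' file format.

```python
import random
from dataclasses import dataclass, field


@dataclass
class UrlRecord:
    # Hypothetical schema; field names are illustrative only.
    url: str
    label: str  # "phishing" or "non-phishing"
    verifiers: list = field(default_factory=list)  # verifier IDs in verification order


def balanced_subset(records, per_class, seed=0):
    """Draw `per_class` records uniformly at random from each class.

    Assumption: random downsampling; the source does not document
    the selection rule actually used.
    """
    rng = random.Random(seed)
    by_label = {}
    for rec in records:
        by_label.setdefault(rec.label, []).append(rec)

    subset = []
    for label, group in sorted(by_label.items()):
        if len(group) < per_class:
            raise ValueError(f"class {label!r} has only {len(group)} records")
        subset.extend(rng.sample(group, per_class))

    rng.shuffle(subset)  # avoid class-ordered output
    return subset
```

With the raw counts above, `balanced_subset(records, per_class=6000)` would keep all 6,000 non-phishing URLs and a random 6,000 of the 17,000 phishing URLs.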

Description

The PhishTank 2020 Dataset is a custom collection created by the authors of the PhishChain paper from the PhishTank platform. It contains URLs together with crowd-sourced verification metadata: the list of verifiers for each URL, the verification order (temporal ordering), and the crowd-sourced labels (phishing/non-phishing).

Unique characteristic: Unlike standard PhishTank API access, this dataset additionally contains scraped temporal verification data, that is, who verified each URL and in what order. This information was critical for testing the PageRank-based truth discovery algorithm in PhishChain.
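
To make the role of the verification order concrete, the sketch below builds a verifier-verifier graph and ranks verifiers with a plain PageRank power iteration. It is not the PhishChain algorithm: the edge rule (a later verifier who repeats an earlier verifier's vote endorses that verifier), the damping factor, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_endorsement_graph(verifications):
    """verifications: {url: [(verifier_id, vote), ...]} in verification order.

    Assumed edge rule: a later verifier who casts the same vote as an
    earlier verifier adds weight 1 to the edge later -> earlier.
    """
    edges = defaultdict(lambda: defaultdict(float))
    nodes = set()
    for votes in verifications.values():
        for i, (later, later_vote) in enumerate(votes):
            nodes.add(later)
            for earlier, earlier_vote in votes[:i]:
                if earlier != later and earlier_vote == later_vote:
                    edges[later][earlier] += 1.0
    return nodes, edges


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration; ranks sum to 1."""
    nodes = sorted(nodes)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for src in nodes:
            out = edges.get(src, {})
            total = sum(out.values())
            if total == 0:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[src] / n
                for v in nodes:
                    new_rank[v] += share
            else:
                for dst, weight in out.items():
                    new_rank[dst] += damping * rank[src] * weight / total
        rank = new_rank
    return rank


def weighted_label(votes, rank):
    """Pick the label with the largest total verifier rank."""
    score = defaultdict(float)
    for verifier, vote in votes:
        score[vote] += rank[verifier]
    return max(score, key=score.get)
```

Under this rule, a verifier whose early votes are repeatedly confirmed by later verifiers accumulates rank, so their vote counts for more when only a few people have looked at a URL.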

Sparse verification problem: A retrospective analysis of PhishTank shows that only a handful of verifiers verify each URL, despite thousands of registered users. This was the key motivation for a PageRank-based approach instead of traditional truth discovery algorithms (EM, GLAD), which assume majority verification.

Collection period: The data was collected from the PhishTank platform for URLs submitted/verified between January 1 and December 23, 2020. It represents the phishing landscape before widespread AI-generated phishing (2023+).

Applications

  • Truth discovery algorithm evaluation for sparse crowd-sourcing scenarios
  • Phishing URL detection benchmarking
  • Crowd-sourced verification system research
  • Temporal verification pattern analysis
  • Verifier behavior modeling

Used in publications

Benchmarks

Model                            Metric      Score    Year   Publication
PageRank-based Truth Discovery   Accuracy    95.45%   2022   PhishChain (WWW 2022)
PageRank-based Truth Discovery   Precision   96.74%   2022   PhishChain (WWW 2022)
PageRank-based Truth Discovery   Recall      94.31%   2022   PhishChain (WWW 2022)
EM (Expectation Maximization)    Accuracy    93.71%   2022   PhishChain baseline
GLAD                             Accuracy    93.98%   2022   PhishChain baseline
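
For reference, accuracy, precision, and recall can be computed from confusion-matrix counts as in the sketch below. It uses the standard binary-classification definitions and assumes phishing is the positive class, which the source does not state explicitly. The function name and the counts in the usage note are made up for illustration and are not taken from the paper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics, with phishing assumed to be the positive class.

    tp: phishing URLs labeled phishing
    fp: non-phishing URLs labeled phishing
    fn: phishing URLs labeled non-phishing
    tn: non-phishing URLs labeled non-phishing
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

For example, `binary_metrics(tp=90, fp=10, fn=5, tn=95)` gives an accuracy of 185/200, a precision of 90/100, and a recall of 90/95.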

Notes

Advantages:

  • Real-world crowd-sourced data (not synthetic)
  • Temporal verification metadata (enables time-based analysis)
  • Balanced evaluation set (avoids class imbalance bias)
  • Sparse verification pattern (realistic scenario for crowd-sourcing research)
  • Multiple verifiers per URL (enables truth discovery algorithm testing)

Limitations:

  • Dated collection period (2020; phishing tactics evolved significantly 2020-2026)
  • Not publicly available as standalone dataset (custom scrape by authors)
  • PhishTank platform limitations acknowledged (inconsistent labeling, lack of transparency)
  • Limited to URLs (no email content, no full webpage screenshots)
  • Class imbalance in the raw data (17k phishing vs 6k non-phishing)

Comparison to standard PhishTank:

  • Standard PhishTank API: binary labels (phishing/not phishing)
  • This dataset: labels + temporal verification metadata + verifier identities
  • Enables verifier-verifier graph construction for PageRank-based truth discovery

PhishTank Transparency Issues (documented in the PhishChain paper):

  • URL whose final label contradicted the unanimous vote of all 5 verifiers
  • URL marked safe despite zero verification
  • Decision process opaque → motivation for a blockchain-based transparent alternative

Recommendation:

  • Use for sparse crowd-sourcing algorithm research
  • Combine with recent datasets (2024+) for temporal robustness testing
  • Consider PhishTank platform evolution (API changes, verification process updates)

Tags

dataset phishing-detection url-classification cybersecurity crowd-sourcing truth-discovery phishtank legacy-dataset verification-metadata