Phishpedia 30k Phishing Benchmark

Metadata

Authors: Lin et al. (USENIX Security 2021)
Year: 2021 (data collected 2019-2020)
Source: https://sites.google.com/view/phishpedia-site/home
GitHub: https://github.com/lindsey98/Phishpedia
Status: ✅ Downloaded locally (18 GB ZIP)
License: Academic/research use
Category: Security / Visual Phishing / Benchmark

Contents

Set	Count	Format
Phishing pages	29,496	URL + HTML + screenshot + brand label
Annotations	one per page	target brand (one of 277 brands)

Each page is stored as {site_id}/shot.png (screenshot) and {site_id}/info.txt (URL).
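
The layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the Phishpedia tooling: the function name iter_sites is illustrative, and it assumes info.txt holds the page URL as plain text, which this note does not specify in detail. If the archive unpacks into an extra top-level folder, point root at that folder instead.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_sites(root: Path) -> Iterator[Tuple[str, str, Path]]:
    """Yield (site_id, url, screenshot_path) for each complete site folder."""
    for site_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        shot = site_dir / "shot.png"
        info = site_dir / "info.txt"
        if not (shot.is_file() and info.is_file()):
            continue  # skip folders missing the screenshot or the URL file
        # Assumption: info.txt contains the URL as plain text.
        url = info.read_text(encoding="utf-8", errors="replace").strip()
        yield site_dir.name, url, shot
```

Calling list(iter_sites(Path("benchmark_30k"))) after extraction gives a quick count of usable pages to compare against the 29,496 listed above.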

Local file

data/bank-brand-phishing-detection/phishpedia/phishpedia_30k_benchmark.zip

Extract:

cd data/bank-brand-phishing-detection/phishpedia
unzip phishpedia_30k_benchmark.zip -d benchmark_30k/

Notes on use in EXP-5

Important: Phishpedia and PhishIntention were trained on a subset of this dataset. To avoid data leakage:

Use a random 20% split as the test set (~5,900 pages)
Check URL overlap with the Phishpedia training set before evaluation
Alternatively, use it as a secondary benchmark alongside phishtank-crawl-2026

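# Check URL overlap between the test split and the Phishpedia training set
from pathlib import Path

test_urls = set(Path("phishpedia_test_urls.txt").read_text(encoding="utf-8").splitlines())
train_urls = set(Path("phishpedia_train_urls.txt").read_text(encoding="utf-8").splitlines())
overlap = test_urls & train_urls
pct = 100 * len(overlap) / len(test_urls) if test_urls else 0.0  # guard against an empty test file
print(f"Overlap: {len(overlap)} / {len(test_urls)} ({pct:.1f}%)")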
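
The random 20% split recommended above can be made reproducible with a fixed seed. The sketch below is one possible way to do it; the function name split_site_ids, the default seed, and the choice to split on site IDs rather than URLs are all assumptions, not something prescribed by the benchmark.

```python
import random
from typing import List, Sequence, Tuple


def split_site_ids(site_ids: Sequence[str], test_fraction: float = 0.2,
                   seed: int = 0) -> Tuple[List[str], List[str]]:
    """Return (rest_ids, test_ids) from a seeded random shuffle."""
    ids = sorted(site_ids)  # sort first so the result does not depend on input order
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]
```

For the 29,496 pages in this benchmark, a 0.2 fraction yields 5,899 test pages, matching the ~5,900 figure. Recording the seed next to the results keeps the split repeatable across runs.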

Use in the project

M2a Logo Localization (training):
  Source: Phishpedia Labelled Logo (a separate dataset with bounding boxes)

EXP-5 secondary benchmark:
  Phishing: 20% split → ~5,900 pages
  Benign: Tranco top-5k (crawl)
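
To make the composition of the EXP-5 secondary benchmark concrete, the sketch below merges the two sources into one labelled list and reports the class balance. The function names and the 1 = phishing, 0 = benign label convention are illustrative assumptions; the project may store labels differently.

```python
from typing import Iterable, List, Tuple


def build_eval_set(phishing_ids: Iterable[str],
                   benign_ids: Iterable[str]) -> List[Tuple[str, int]]:
    """Label phishing samples 1 and benign samples 0 in a single list."""
    samples = [(s, 1) for s in phishing_ids]
    samples += [(s, 0) for s in benign_ids]
    return samples


def phishing_share(samples: List[Tuple[str, int]]) -> float:
    """Fraction of samples labelled phishing (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return sum(label for _, label in samples) / len(samples)
```

With about 5,900 phishing pages and 5,000 benign pages the set would be roughly 54% phishing, assuming every Tranco domain yields exactly one crawled page. Failed crawls would shift that ratio, so it is worth recomputing on the actual crawl output before comparing precision figures.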

Used in publications

publications/with-pdf/lin-phishpedia-usenix-2021/ - original paper
publications/with-pdf/liu-phishintention-usenix-2022/ - PhishIntention (same dataset)
publications/references/ji-llm-phishing-detection-2025/ - Ji & Kim 2025 use it as a baseline
