Dataset Acquisition — FinPhishGuard
Date: 2026-05-12
Project: bank-brand-phishing-detection
Download status
| Dataset | Status | Local path |
|---|---|---|
| PhiUSIIL (~465k URLs, 54 MB CSV) | ✅ Downloaded | data/.../phiusiil/PhiUSIIL_Phishing_URL_Dataset.csv |
| Tranco top-1M (benign URLs) | ✅ Downloaded | data/.../alexa-benign/top-1m.csv |
| Phishpedia code (GitHub repo) | ✅ Cloned | data/.../phishpedia/ |
| Phishpedia 30k phishing benchmark (18 GB ZIP) | ✅ Downloaded | data/.../phishpedia/phishpedia_30k_benchmark.zip |
| LogoSENSE code + SVM imageset | ✅ Downloaded | data/.../gdrive-logosense/ (348 files, 21.5 MB) |
| gdrive-targetlist (T102_*.png brand logos) | ✅ Downloaded | data/.../gdrive-targetlist/ (21 files, 45.3 MB) |
| gdrive-emd (EMD threshold files) | ✅ Downloaded | data/.../gdrive-emd/ (9 files, 28.6 MB) |
| expand_targetlist.zip (per-brand logos, 277 brands) | ✅ Downloaded | data/.../phishpedia/models/expand_targetlist.zip |
| rcnn_bet365.pth (logo detector weights) | ✅ Downloaded | data/.../phishpedia/models/rcnn_bet365.pth |
| resnetv2_rgb_new.pth.tar (Siamese weights) | ✅ Downloaded | data/.../phishpedia/models/resnetv2_rgb_new.pth.tar |
| domain_map.pkl | ✅ Downloaded | data/.../phishpedia/models/domain_map.pkl (211 KB) |
| faster_rcnn.yaml | ✅ Downloaded | data/.../phishpedia/models/faster_rcnn.yaml (2 KB) |
| Phishpedia Targetlist (181 brands, logos + HTML) | ✅ Downloaded | data/.../phishpedia-targetlist-official/targetlist_fit_copy_half_rename/ (246 MB, 181 folders) |
| Phishpedia Benign 30k (restricted GDrive) | ⬇️ Manual download required | see links below |
| Phishpedia Labelled Logo (30,649 BBox) | ⬇️ Manual download required | see links below |
| APWG EvalPhishing (full dataset) | ✅ Downloaded | data/bank-brand-phishing-detection/evalphishing/ (152 GB, ~451,514 pages, 25 months, Jul 2021 to Jul 2023) |
| LogoSENSE images (~1.6 GB ZIP) | ✅ Downloaded | data/.../logosense/logosense_base_data.zip |
| Phishpedia 5-Brand (BoA/Chase/PayPal/DHL/MS) | ⬇️ Manual download required | see links below |
| PhishBlitz (13.8k pages, 2025) | ⏳ Download once disk space is free | see Link #3 |
Links for manual download (open in a browser; Google login required)
Priority 1: Phishpedia models (required to run the system)
Open each link → wait for the file to be prepared → download
| File | Link | Size |
|---|---|---|
| expand_targetlist.zip ⭐⭐⭐ | https://drive.google.com/uc?id=1fr5ZxBKyDiNZ_1B6rRAfZbAHBBoUjZ7I | ~500 MB |
| rcnn_bet365.pth ⭐⭐⭐ | https://drive.google.com/uc?id=1tE2Mu5WC8uqCxei3XqAd7AWaP5JTmVWH | ~300 MB |
| resnetv2_rgb_new.pth.tar ⭐⭐⭐ | https://drive.google.com/uc?id=1H0Q_DbdKPLFcZee8I14K62qV7TTy7xvS | ~150 MB |
| faster_rcnn.yaml | https://drive.google.com/uc?id=1Q6lqjpl4exW7q_dPbComcj0udBMDl8CW | ~5 KB |
| domain_map.pkl | https://drive.google.com/uc?id=1qSdkSSoCYUkZMKs44Rup_1DPBxHnEKl1 | ~1 MB |
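If the links above are ever collected into a download manifest, the Drive file ID can be pulled out of each `uc?id=...` URL. The sketch below is only an illustration: the helper name `drive_file_id` is hypothetical, and it assumes every link follows the `drive.google.com/uc?id=<ID>` pattern used in this table. It does not download anything, since the files require a Google login.

```python
from urllib.parse import urlparse, parse_qs


def drive_file_id(url):
    """Return the `id` query parameter of a drive.google.com/uc?id=... link."""
    parsed = urlparse(url)
    ids = parse_qs(parsed.query).get("id", [])
    # Folder links (drive/folders/...) carry no `id` parameter and are rejected.
    if parsed.netloc != "drive.google.com" or not ids:
        raise ValueError(f"not a Drive uc?id= link: {url}")
    return ids[0]


if __name__ == "__main__":
    link = "https://drive.google.com/uc?id=1tE2Mu5WC8uqCxei3XqAd7AWaP5JTmVWH"
    print(drive_file_id(link))
```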
After downloading → place all files in:
data/bank-brand-phishing-detection/phishpedia/models/
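A quick way to confirm that all five Priority 1 files landed in the models directory is a small presence check. This is a minimal sketch, not part of Phishpedia: the function name `missing_model_files` is hypothetical, the file names are the ones listed in the Priority 1 table, and treating a zero-byte file as missing is an assumption about what an interrupted download looks like.

```python
from pathlib import Path

# File names taken from the Priority 1 table.
REQUIRED_MODEL_FILES = [
    "expand_targetlist.zip",
    "rcnn_bet365.pth",
    "resnetv2_rgb_new.pth.tar",
    "faster_rcnn.yaml",
    "domain_map.pkl",
]


def missing_model_files(models_dir):
    """Return the required file names that are absent or empty in models_dir."""
    root = Path(models_dir)
    missing = []
    for name in REQUIRED_MODEL_FILES:
        candidate = root / name
        # Assumption: an empty file usually means an interrupted download.
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    models_dir = "data/bank-brand-phishing-detection/phishpedia/models"
    for name in missing_model_files(models_dir):
        print(f"missing or empty: {name}")
```

The check only looks at presence and size; it does not validate that the weights load.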
Priority 2: Phishpedia datasets (from the project site)
Home page: https://sites.google.com/view/phishpedia-site/home
| Item | Link | Est. size | Purpose |
|---|---|---|---|
| Benign Dataset (30,649) | https://drive.google.com/uc?id=1yORUeSrF5vGcgxYrsCoqXcpOUHt-iHq_ | ~15 GB | Negative examples |
| Labelled Logo Part 2 | https://drive.google.com/uc?id=1bH3Yp6K1B37B_sS_MNMz7yvYcOhOu-J8 | ~8 GB | BBox training |
| Labelled Logo Part 3 | https://drive.google.com/uc?id=1u56I0IHBgM9glNJl2wcLfaihp1L_U7eD | ~8 GB | BBox training |
| Targetlist (181 brands) | https://drive.google.com/uc?id=1zxvXFKpLx816VfaGFISL6tod-zSEc6hY | ~50 MB | Brand KB seed |
| 5-Brand Phishing (BoA/Chase/PayPal/DHL/MS) | https://drive.google.com/uc?id=1EJnx9oX9wQieF7UPQJeTVg850nZsuxTi | ~2 GB | Financial brands eval |
After downloading → extract to:
data/bank-brand-phishing-detection/phishpedia-labelled-logo/
data/bank-brand-phishing-detection/phishpedia-targetlist-official/
data/bank-brand-phishing-detection/phishpedia-5brand/
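The extraction step can be scripted with the standard `zipfile` module. The sketch below is one possible approach under stated assumptions: the helper name `extract_archive` is hypothetical, and the archive file name in the usage example is a placeholder, because the actual names of the downloaded ZIPs are not recorded here. Each entry is checked so it cannot be written outside the target directory.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir, refusing entries that escape dest_dir."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            target = (dest / member).resolve()
            # Reject paths such as "../evil" that would land outside dest.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member}")
        archive.extractall(dest)
    return dest


if __name__ == "__main__":
    # Placeholder archive name; replace with the file actually downloaded.
    archive_name = Path("downloads/phishpedia_5brand.zip")
    if archive_name.is_file():
        extract_archive(
            archive_name,
            "data/bank-brand-phishing-detection/phishpedia-5brand/",
        )
```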
Priority 3: Your GDrive (folder with EMD and targetlist)
Open → right-click the folder → “Download” → ZIP
| Item | Link | Est. size |
|---|---|---|
| EMD (per-brand logos, T*.png) | https://drive.google.com/drive/folders/1Dp6HDK0P9j51ojEuBkUZdQvdGBUxInIm | ~5 GB |
| merge_targetlist (extended logos) | https://drive.google.com/drive/folders/13IeT95fQkLcrfC1Du8BZ4VgoCVdFmCHs | ~3 GB |
| Visualphishnet (model weights) | https://drive.google.com/drive/folders/1yhnZu_G9oVCViG68-V-edZ2fWP0An_AI | ~1 GB |
After downloading the ZIP → extract to:
data/bank-brand-phishing-detection/gdrive-emd/
data/bank-brand-phishing-detection/gdrive-targetlist/
Message #1: Ji & Kim 2025 dataset (LinkedIn → Doowon Kim)
To: Doowon Kim (LinkedIn)
Profile: search for “Doowon Kim University of Tennessee” on LinkedIn
Paper: arXiv:2511.09606
Hi Doowon,
I'm Kamil Warpechowski, IT Director at NASK (National Research Institute, Poland) —
we work on national cybersecurity infrastructure including phishing and fraud detection.
I came across your paper "How Can We Effectively Use LLMs for Phishing Detection?"
(arXiv:2511.09606) and found it very relevant to our research on adversarially robust
visual phishing detection targeting financial brand impersonation.
Would it be possible to access the dataset of 19,131 phishing websites used in your
evaluation? We're building a hybrid detection system (URL analysis + visual logo
matching) and would like to benchmark against your LLM baselines on the same data.
Happy to share results and acknowledge your work in any publications.
Best,
Kamil Warpechowski
IT Director, NASK — National Research Institute
Note: writing on behalf of NASK, a government research institution, makes a reply more likely than a message from an anonymous PhD student.
Form #2: LogoSENSE dataset (images with bounding boxes)
Access form: https://docs.google.com/forms/d/e/1FAIpQLSev9KwRshAXPzq8MPB05J104IMNUDyyWCSDy0El1JIUB1JcoA/viewform
Fill in the form with:
- Name: Kamil Warpechowski
- Institution: NASK — National Research Institute, Poland
- Purpose: Training and evaluation of a deep learning logo localization model for financial brand phishing detection. The dataset will be used as a held-out test set to evaluate generalization of a logo detector trained on Phishpedia Labelled Logo data.
- Publication: Will cite “LogoSENSE: A Companion HOG based Logo Detection Scheme…” (Computers & Security, 2020)
Alternative: if there is no reply, LogoSENSE can be replaced with Phishpedia Labelled Logo (30k pages), which is larger and better.
Link #3: PhishBlitz dataset (13.8k pages, 2025)
Status: access granted; download once disk space is freed up.
Data link (OneDrive, IIT Dharwad): https://iitdh-my.sharepoint.com/personal/222011001_iitdh_ac_in/_layouts/15/onedrive.aspx?ga=1&id=%2Fpersonal%2F222011001%5Fiitdh%5Fac%5Fin%2FDocuments%2FPhishBlitz%5FDataset%2FREADME%2Etxt&parent=%2Fpersonal%2F222011001%5Fiitdh%5Fac%5Fin%2FDocuments%2FPhishBlitz%5FDataset
After downloading → extract to: data/bank-brand-phishing-detection/phishblitz/
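Because the PhishBlitz download is blocked on disk space, a free-space check before starting can avoid a half-finished transfer. This is a rough sketch: the helper name `has_free_space`, the 10% safety margin, and the 50 GB figure in the usage example are all assumptions, since the size of the PhishBlitz archive is not recorded in these notes.

```python
import shutil


def has_free_space(path, required_bytes, safety_margin=0.10):
    """Return True if `path` has room for required_bytes plus a safety margin."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes * (1 + safety_margin)


if __name__ == "__main__":
    # Placeholder size; replace with the real archive size once known.
    assumed_size = 50 * 1024**3
    if has_free_space(".", assumed_size):
        print("enough space to start the download")
    else:
        print("free up disk space first")
```

Extraction needs additional room on top of the archive itself, so the required size should cover both when the ZIP is kept.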
Action priorities
- Immediately: move the downloaded models (.pth, .zip) to data/.../phishpedia/models/
- Immediately: message Doowon Kim on LinkedIn (message #1) about the Ji & Kim 2025 dataset
- This week: fill in the LogoSENSE form (link in section #2; low priority if you already have Phishpedia Labelled Logo)
- Optional: download Phishpedia Labelled Logo + Benign 30k from the project site (required to train M2a)