DGraph-Fin
Basic information
- Name: DGraph-Fin
- Alias: DGraph Finance, DGraph Credit Default Dataset
- Domain: Fraud Detection, Credit Scoring, Financial Risk Assessment
- Type: Graph data (loan guarantor network)
Source
- URL: Publicly available
- Paper: DGraph: A Large-Scale Financial Dataset for Graph Anomaly Detection (Huang et al., 2022)
- Organization: Finvolution Group, Zhejiang University
- Year: 2022
Characteristics
- Size: 3,700,550 nodes, 4,300,999 edges
- Split: User-defined (typically 5-fold cross-validation)
- Classes/Categories: Binary (creditworthy users vs. credit defaulters)
- Format: Graph structure with node features
- License: Publicly available (NeurIPS Datasets and Benchmarks Track 2022)
- Feature dimension: 17 features (timestamps and user profile details)
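Because the split is left to the user and the labels are binary and imbalanced, a stratified 5-fold assignment is one reasonable way to build folds. The sketch below is only an illustration in plain Python: the function name `stratified_folds` and the toy label list are hypothetical and do not come from the dataset or its paper.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign node indices to folds so each fold keeps roughly the same class ratio.

    Hypothetical helper for illustration; not part of any DGraph-Fin tooling.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


# Toy labels: 10 defaulters (1) and 90 creditworthy users (0).
toy_labels = [1] * 10 + [0] * 90
toy_folds = stratified_folds(toy_labels)
```

Each fold serves once as the test set while the other four form the training set; with the toy labels every fold holds 20 nodes, 2 of them defaulters.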
Description
DGraph-Fin is a large-scale financial dataset for credit default detection, created by Finvolution Group, a leading fintech company in China. It represents a loan guarantor network: nodes are borrowers (users), and edges are built from guarantor contact information, so two borrowers are connected by an edge if they share the same guarantor.
Features include timestamps (loan application time, account creation), user profile details (age, income level, employment status), and loan-related information. The dataset is used to predict credit default risk from patterns in the graph structure.
Applications
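The shared-guarantor edge rule described above can be sketched as follows. This is a minimal illustration under assumed inputs, not the procedure used to build the released graph: the `guarantor_of` mapping, the IDs, and the function name are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def edges_from_shared_guarantors(guarantor_of):
    """Connect every pair of borrowers that list at least one common guarantor.

    guarantor_of: dict mapping borrower id -> iterable of guarantor ids
    (hypothetical schema, assumed for this example only).
    """
    borrowers_by_guarantor = defaultdict(set)
    for borrower, guarantors in guarantor_of.items():
        for guarantor in guarantors:
            borrowers_by_guarantor[guarantor].add(borrower)

    edges = set()
    for group in borrowers_by_guarantor.values():
        # Sorting makes each undirected edge appear once as (smaller, larger).
        for u, v in combinations(sorted(group), 2):
            edges.add((u, v))
    return sorted(edges)


toy_guarantors = {
    "a": ["g1"],
    "b": ["g1", "g2"],
    "c": ["g2"],
    "d": ["g3"],  # no borrower shares g3, so "d" stays isolated
}
toy_edges = edges_from_shared_guarantors(toy_guarantors)
# toy_edges == [("a", "b"), ("b", "c")]
```

Grouping borrowers by guarantor first avoids comparing every pair of borrowers directly, which would be impractical at the scale of millions of nodes.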
- Credit default prediction
- Financial risk assessment
- Loan default detection
- Guarantor network analysis
- Graph-based credit scoring
- Large-scale financial fraud detection
- Risk propagation analysis in financial networks
Used in publications
- Global Attribute-Association Pattern Aggregation for Graph Fraud Detection - GAAP achieved 7.73% Rec@K, the best reported result; the stated +0.21 pp improvement matches the gap over DGA-GNN (7.52%). This was the most challenging of the 7 datasets evaluated: every method scored below 8%, likely because of severe class imbalance or complex fraud patterns.
Benchmarks
| Model | Metric | Score | Year | Publication |
|---|---|---|---|---|
| GAAP | Rec@K | 7.73% | 2025 | Duan et al. AAAI-25 |
| BGNN | Rec@K | 7.70% | 2021 | Ivanov et al. |
| BWGNN | Rec@K | 7.57% | 2022 | Tang et al. |
| DGA-GNN | Rec@K | 7.52% | 2024 | Duan et al. |
| GAT | Rec@K | 7.14% | 2018 | Veličković et al. |
| GCN | Rec@K | 7.05% | 2017 | Kipf & Welling |
| XGBGraph | Rec@K | 6.96% | 2024 | Tang et al. GADBench |
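To make the Rec@K column concrete, here is a small sketch of the metric. It assumes the convention commonly used in graph anomaly detection benchmarks, where K defaults to the number of true positives in the evaluation set; the function name and toy data are hypothetical, and the exact evaluation protocol behind the scores above should be checked in the cited papers.

```python
def recall_at_k(labels, scores, k=None):
    """Share of true positives found among the k highest-scored nodes.

    labels: list of 0/1 ground-truth labels (1 = defaulter)
    scores: list of predicted anomaly scores, higher = more suspicious
    k:      cutoff; assumed to default to the number of positives
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("recall is undefined without positive labels")
    if k is None:
        k = n_pos

    # Rank node indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in ranked[:k])
    return hits / n_pos


toy_labels = [1, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.4]
toy_rec = recall_at_k(toy_labels, toy_scores)
# K = 2 positives; the top-2 nodes are indices 0 and 1, one of which is positive,
# so toy_rec == 0.5
```

Under this convention a score of 7.73% means that fewer than one in twelve of the top-K ranked nodes are actual defaulters.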
Notes
- Most challenging dataset: all methods score below 8% Rec@K
- Relation concept: Loan Guarantor; feature concept: timestamps and user profile details
- Largest dataset by node count (3.7M) in the GAAP experiments
- Sparse graph: 4.3M edges for 3.7M nodes (about 1.16 edges per node, or an average degree of about 2.32 if edges are counted as undirected)
- Low scores are likely due to severe class imbalance or very complex fraud patterns
- Real-world industrial dataset from Finvolution Group (production environment)
- Part of NeurIPS 2022 Datasets and Benchmarks Track
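The sparsity figures in the notes above follow directly from the node and edge counts. The short calculation below reproduces them; treating the edges as undirected when computing average degree and density is an assumption of this sketch, not something the counts themselves establish.

```python
num_nodes = 3_700_550
num_edges = 4_300_999

# Edges per node: the ~1.16 figure quoted in the notes.
edges_per_node = num_edges / num_nodes

# Average degree if each edge is undirected and so touches two nodes (assumption).
avg_degree_undirected = 2 * num_edges / num_nodes

# Density: fraction of all possible undirected node pairs that are connected.
density = num_edges / (num_nodes * (num_nodes - 1) / 2)

print(f"{edges_per_node:.2f}")         # 1.16
print(f"{avg_degree_undirected:.2f}")  # 2.32
print(f"{density:.2e}")
```

A density this small means most neighborhoods contain only one or two nodes, which limits how much signal message-passing models can draw from the graph structure.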
Tags
dataset fraud-detection credit-scoring financial-risk loan-default graph-data large-scale fintech benchmark neurips