Tolokers
Basic information
- Name: Tolokers
- Alias: Toloka Crowdsourcing Fraud Dataset
- Domain: Fraud Detection, Crowdsourcing, Platform Security
- Type: Graph data (work collaboration network)
Source
- URL: Publicly available
- Paper: A critical look at the evaluation of GNNs under heterophily: Are we really making progress? (Platonov et al., 2023)
- Organization: Yandex, HSE University
- Year: 2023
Characteristics
- Size: 11,758 nodes, 519,000 edges
- Split: User-defined (typically 5-fold cross-validation)
- Classes/Categories: Binary (legitimate users vs. fraudulent users)
- Format: Graph structure with node features
- License: Publicly available
- Feature dimension: 10 features (user profile with task performance statistics)
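The node and edge counts above imply how dense the graph is. The sketch below works through that arithmetic, assuming the graph is undirected and that the quoted (rounded) edge count refers to undirected edges; the function names are illustrative only.

```python
# Figures quoted in this card; the edge count is rounded, so results are approximate.
NUM_NODES = 11_758
NUM_EDGES = 519_000


def average_degree(num_nodes: int, num_edges: int) -> float:
    """Mean degree of an undirected graph: each edge touches two nodes."""
    return 2 * num_edges / num_nodes


def density(num_nodes: int, num_edges: int) -> float:
    """Fraction of all possible undirected node pairs that are connected."""
    possible_pairs = num_nodes * (num_nodes - 1) / 2
    return num_edges / possible_pairs


if __name__ == "__main__":
    print(f"average degree: {average_degree(NUM_NODES, NUM_EDGES):.1f}")
    print(f"density: {density(NUM_NODES, NUM_EDGES):.4f}")
```

Under these assumptions the average degree is roughly 88 neighbors per user, so although the node count is small, message passing over the full graph still touches a large number of edges.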
Description
The Tolokers dataset is designed for detecting fraudulent users on the Toloka crowdsourcing platform (operated by Yandex). It represents a work collaboration network in which nodes are users and edges are collaboration relationships based on working on similar tasks.
Node features cover user profile information and task performance statistics such as completion rates, accuracy, time spent, and behavioral patterns. The dataset is used to identify users who manipulate platform rewards or provide low-quality work.
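To make the edge definition concrete, here is a minimal sketch of how a collaboration graph of this kind could be derived from a log of (worker, task) assignments. It assumes that two users are linked whenever they worked on the same task; the function name and the toy log are hypothetical and are not taken from the dataset's actual construction pipeline.

```python
from collections import defaultdict
from itertools import combinations


def build_collaboration_edges(assignments):
    """Return the set of undirected edges between workers who share a task.

    `assignments` is an iterable of (worker_id, task_id) pairs.
    """
    workers_by_task = defaultdict(set)
    for worker, task in assignments:
        workers_by_task[task].add(worker)

    edges = set()
    for workers in workers_by_task.values():
        # Sorting gives every undirected edge one canonical (small, large) form.
        for u, v in combinations(sorted(workers), 2):
            edges.add((u, v))
    return edges


# Hypothetical toy log, not taken from the real dataset.
toy_assignments = [
    ("w1", "t1"), ("w2", "t1"),
    ("w2", "t2"), ("w3", "t2"), ("w4", "t2"),
    ("w5", "t3"),
]
toy_edges = build_collaboration_edges(toy_assignments)
```

In the toy log, worker `w5` shares no task with anyone and therefore ends up as an isolated node. Because every task contributes a clique over its workers, popular tasks generate many edges, which is one plausible reason such graphs become dense.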
Applications
- Crowdsourcing platform fraud detection
- Low-quality worker identification
- Platform abuse detection
- Work collaboration network analysis
- Behavioral anomaly detection
- Quality control in crowdsourcing systems
Used in publications
- Global Attribute-Association Pattern Aggregation for Graph Fraud Detection: GAAP reached 56.08% Rec@K, the best reported result and 0.94 pp above the runner-up (DGA-GNN, 55.14%). A relatively challenging dataset, with moderate scores across all methods.
Benchmarks
| Model | Metric | Score | Year | Publication |
|---|---|---|---|---|
| GAAP | Rec@K | 56.08% | 2025 | Duan et al. AAAI-25 |
| DGA-GNN | Rec@K | 55.14% | 2024 | Duan et al. |
| XGBGraph | Rec@K | 53.43% | 2024 | Tang et al. GADBench |
| RFGraph | Rec@K | 52.18% | 2024 | Tang et al. GADBench |
| BWGNN | Rec@K | 50.31% | 2022 | Tang et al. |
| GraphSAGE | Rec@K | 48.75% | 2017 | Hamilton et al. |
| GAS | Rec@K | 47.04% | 2019 | Li et al. |
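The table reports Rec@K without stating how K is chosen. The sketch below assumes the convention common in graph anomaly detection benchmarks, where K equals the number of true positives in the evaluation set; treat that choice, and the function name, as assumptions rather than the definition used by the cited papers.

```python
def recall_at_k(labels, scores):
    """Recall among the top-K scored nodes, with K = number of positives.

    labels: 0/1 ground truth (1 = fraudulent).
    scores: model outputs where a higher value means more suspicious.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    k = sum(labels)
    if k == 0:
        raise ValueError("Rec@K is undefined without positive labels")
    # Indices sorted by descending score; ties keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in ranked[:k])
    return hits / k


# Hypothetical toy example: 3 fraudulent users among 6.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
toy_recall = recall_at_k(toy_labels, toy_scores)
```

In the toy example two of the three fraudulent users appear among the three highest scores, so the metric is 2/3. With K fixed this way, Rec@K equals precision at K, which makes it a strict measure on imbalanced data.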
Notes
- Challenging dataset: even the best methods reach only moderate performance (about 56% Rec@K)
- Relation concept: Work Collaboration (user profile with task performance statistics)
- Smallest graph among 7 datasets (11,758 nodes, 519k edges)
- Represents crowdsourcing platform fraud - unique fraud type
- The dataset exhibits heterophily, which may explain the moderate performance
- Useful for testing GNN robustness on heterophilous graphs
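One simple way to quantify the heterophily mentioned in these notes is the edge homophily ratio: the share of edges whose endpoints carry the same label. The sketch below computes it on a hypothetical toy graph; it is one of several homophily measures and is not necessarily the one used in the cited paper.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    if not edges:
        raise ValueError("edge list is empty")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Hypothetical toy graph: node id -> class (1 = fraudulent).
toy_labels = {0: 0, 1: 0, 2: 1, 3: 1, 4: 0}
toy_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
toy_ratio = edge_homophily(toy_edges, toy_labels)
```

Here two of the five edges join same-class nodes, giving a ratio of 0.4. On a strongly imbalanced binary graph the raw ratio can be high purely because most nodes belong to the majority class, so it is worth reading alongside the class balance.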
Tags
dataset fraud-detection crowdsourcing platform-security graph-data heterophily quality-control benchmark