BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot Detection

Metadata

Authors: Daniel DeAlcala, Aythami Morales, Ruben Tolosana, Alejandro Acien, Julian Fierrez, Santiago Hernandez, Miguel A. Ferrer, Moises Diaz
Year: 2023
Source: arXiv:2207.13394v3 [cs.LG] (11 Apr 2023)
DOI/Link: https://arxiv.org/abs/2207.13394v3
Status: read
Tags: keystroke-dynamics synthetic-data bot-detection generative-models captcha biometrics gan kde

Summary

BeCAPTCHA-Type proposes three methods for generating synthetic keystroke biometric data to train bot detection systems. The paper compares two statistical approaches (Universal Model, User-dependent Model) with one based on a Generative Neural Network (GNN). The methods are validated on a bot detection task in which the synthetic samples are used to train classifiers (SVM, Random Forest, Gaussian Naive Bayes, LSTM).

Experiments were run on the Dhakal Dataset (136 million keystroke events from 168,000 users). The analysis covers the effect of: (1) the amount of available training data (20-500 users), (2) the type of synthetic data, (3) the availability of key-codes.

The results show that with plenty of labeled data (500 users), synthetic samples are detected with high accuracy (98-100% for SVM, LSTM, and RF on Universal and GNN samples; User-dependent samples are harder). In few-shot scenarios (20-100 users), however, bot detection is challenging, especially for the User-dependent and GNN models, which generate more realistic samples. The paper shows the potential of synthetic keystroke data for: (1) training bot detectors, (2) testing the robustness of biometric systems, (3) passive CAPTCHA applications.

Key Takeaways

Three keystroke dynamics synthesis methods: (1) Universal Model (KDE over the whole population), (2) User-dependent Model (per-user KDE), (3) Generative Neural Network (GNN, which learns distributions)
Realism of synthetic samples: a One-Class SVM (trained only on human data) reaches ~50% accuracy, so synthetic samples are hard to tell apart from human ones in a one-class setting
Binary classification with synthetic data: SVM and RF reach 77-100% accuracy when trained on human+synthetic samples (LSTM reaches 99-100% only with 500 training users), so synthetic samples are useful for training bot detectors
Realism hierarchy: User-dependent > GNN > Universal (measured by how hard the samples are to detect)
Few-shot learning challenge: with 20 training users, RF is best (88-95% accuracy), LSTM is at chance level (50-53%), GNB ~65-68%
Large-scale performance: with 500 users, Universal and GNN samples are classified near-perfectly (99-100% for SVM/RF/LSTM); User-dependent samples need LSTM + key-codes to reach 99%
Key-code dependency: the GNN model conditions synthesis on key-codes, so LSTM detects its samples better when key-codes are available; the Universal and User-dependent models are independent of key-codes
Generalization ability: detectors trained on GNN samples generalize better to unseen Universal samples (95-100% accuracy across generators for SVM/RF/LSTM)
Passive CAPTCHA feasibility: keystroke dynamics can serve as a passive CAPTCHA that requires no extra action from the user

Methodology

Dataset: Dhakal (2018)

Size: 168,000 users, 136 million keystroke events
Protocol: semi-fixed text. Each user memorizes a sentence and types it as fast as possible (15 sentences/user, 3-70 characters)
Split: 100,000 users training, 68,000 users test
Features (5 total):
1. Hold Latency (f¹): time between key press and key release
2. Inter-Press Latency (f²): time between two consecutive press events
3. Inter-Release Latency (f³): time between two consecutive release events
4. Inter-Key Latency (f⁴): time between release(j) and press(j+1); can be negative (rollover typing)
5. Key Code (f⁵): ASCII code normalized to 0-1
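
The five features can be derived directly from press/release timestamps. The following is a minimal sketch, not the authors' code: the `(key_code, press_ms, release_ms)` tuple layout, the function name `extract_features`, and the divisor 255 used to scale the key code are all assumptions made for illustration.

```python
from typing import List, Tuple

# (key_code, press_time_ms, release_time_ms); this layout is an assumption
# for the sketch, not the dataset's actual schema.
Keystroke = Tuple[int, float, float]

def extract_features(keys: List[Keystroke]) -> List[Tuple[float, float, float, float, float]]:
    """Return (f1, f2, f3, f4, f5) for every key that has a successor."""
    rows = []
    for (code, press, release), (_, next_press, next_release) in zip(keys, keys[1:]):
        hold = release - press                   # f1: hold latency
        inter_press = next_press - press         # f2: press-to-press
        inter_release = next_release - release   # f3: release-to-release
        inter_key = next_press - release         # f4: negative when keys overlap (rollover)
        key_norm = code / 255.0                  # f5: scaled to [0, 1]; divisor is assumed
        rows.append((hold, inter_press, inter_release, inter_key, key_norm))
    return rows

# Typing "the": 'h' is pressed 20 ms before 't' is released, so f4 is negative.
sample = [(ord("t"), 0.0, 90.0), (ord("h"), 70.0, 150.0), (ord("e"), 200.0, 280.0)]
features = extract_features(sample)
```

The first row illustrates rollover typing: the inter-key latency comes out negative because the two keys overlap in time.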

Method 1: Universal Model (Statistical KDE)

Idea: one set of KDE functions for the whole population (4 KDEs, for f¹-f⁴)

Process:

Collect all keystroke timestamps from the 100K training users
Convert them into the 4 time features f¹-f⁴
For each feature, fit a KDE function F^i (Gaussian kernel, σ=1.0):
- F^i(x) = (1/N) Σ K(x - f^i_j; σ)
Generating a synthetic sample:
- Input: sequence of K keys k=[k₀,…,k_K]
- Sample f¹, f⁴ from F¹, F⁴ (random sampling provides variability)
- Compute timestamps: t’₀=0, t’₁=t’₀+f¹₁, t’₂=t’₁+f⁴₁, t’₃=t’₂+f¹₂, …

Pros: simple, approximates the whole population
Cons: does not model intra-user dependencies or key-dependent features (e.g. ‘a’ after ‘q’ vs ‘a’ after ‘e’)
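
The sampling and timestamp steps of the Universal Model can be sketched with the standard library alone. Drawing from a Gaussian KDE is equivalent to picking a random training point and adding Gaussian noise with the kernel bandwidth, which is what `sample_kde` does. The function names, the toy latency values, and the 1 ms floor on the hold time are assumptions of this sketch, not details from the paper.

```python
import random
from typing import List, Tuple

def sample_kde(data: List[float], bandwidth: float, rng: random.Random) -> float:
    """Draw one value from a Gaussian KDE: pick a training point, add kernel noise."""
    return rng.gauss(rng.choice(data), bandwidth)

def synthesize(n_keys: int, hold_data: List[float], inter_key_data: List[float],
               bandwidth: float = 1.0, seed: int = 0) -> List[Tuple[float, float]]:
    """Return (press, release) timestamps for n_keys keys, with the first press at t = 0."""
    rng = random.Random(seed)
    events, t = [], 0.0
    for i in range(n_keys):
        press = t
        # f1 (hold latency); the 1 ms floor is a guard added for this sketch
        release = press + max(sample_kde(hold_data, bandwidth, rng), 1.0)
        events.append((press, release))
        if i < n_keys - 1:
            # f4 (inter-key latency); may be negative, which yields rollover
            t = release + sample_kde(inter_key_data, bandwidth, rng)
    return events

# Hypothetical population-level latencies in milliseconds.
events = synthesize(5, [80.0, 95.0, 110.0], [-15.0, 40.0, 60.0])
```

Because every call draws from the same population-wide data, two synthetic "users" produced this way differ only by sampling noise, which is the limitation noted in the cons above.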

Method 2: User-dependent Model (Statistical KDE)

Idea: separate KDE functions for each user (4 KDEs × M users)

Process:

Split the data by user (u)
For each user u, fit 4 KDE functions F^u,i on that user's keystroke features f^u,i
This yields M trained models (each represents one user)
Generation: sample from the model F^u’ of a specific user u’

Pros: models intra-user correlations (each user has a unique typing pattern)
Cons: limited data per user (less accurate KDEs than the Universal Model), still no key dependencies
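
A short sketch of the per-user idea: the KDE is fitted on one user's data only, so samples stay inside that user's typing profile. The user names and latency values below are hypothetical, and only the hold latency is shown.

```python
import random
from typing import Dict, List

# Hypothetical per-user hold latencies in ms; real values would come from the dataset.
hold_by_user: Dict[str, List[float]] = {
    "fast_typist": [60.0, 65.0, 70.0],
    "slow_typist": [140.0, 150.0, 160.0],
}

def sample_user_kde(user: str, bandwidth: float, rng: random.Random) -> float:
    """Sample from the Gaussian KDE built from a single user's observations."""
    return rng.gauss(rng.choice(hold_by_user[user]), bandwidth)

rng = random.Random(0)
fast = [sample_user_kde("fast_typist", 1.0, rng) for _ in range(200)]
slow = [sample_user_kde("slow_typist", 1.0, rng) for _ in range(200)]
```

With so few observations per user the estimated density is coarse, which mirrors the data limitation listed in the cons above.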

Method 3: Generative Neural Network (GNN)

Idea: a neural network that learns the parameters of the keystroke-time distribution for each key-code

Architecture (4 separate GNNs, for f¹, f², f³, f⁴):

Input: key-code k
2× Fully Connected (100 units, tanh activation)
1× Fully Connected (1 unit, linear) → output: distribution parameters (μ, σ for a Gaussian)
Sampling layer: builds a PDF from the parameters and samples from it

Loss function (negative log-likelihood):

Loss = -log(Prob_D(X^i = x^i | P^i))

where P^i = [μ^i, σ^i] are the parameters of the learned distribution and x^i is the real time feature

Process:

Training: key-codes are the input, and the network learns the time distribution for each key
Inference: input key-code → the network outputs the parameters → a random sample is drawn from the distribution

Pros: learns key-dependent distributions, non-deterministic output (variability)
Cons: needs more data than the statistical methods; Universal only (no User-dependent GNN, too few samples per user)
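
The loss and the sampling layer can be illustrated without the network itself. The sketch below assumes a Gaussian output distribution and replaces the trained network with the closed-form maximum-likelihood estimates of μ and σ for a single key; the function names and the latency values are hypothetical.

```python
import math
import random
from typing import List

def gaussian_nll(x: float, mu: float, sigma: float) -> float:
    """Negative log-likelihood of x under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)

def mean_nll(xs: List[float], mu: float, sigma: float) -> float:
    """Average loss over a batch of observed time features."""
    return sum(gaussian_nll(x, mu, sigma) for x in xs) / len(xs)

def sampling_layer(mu: float, sigma: float, rng: random.Random) -> float:
    """Draw one synthetic time feature from the predicted distribution."""
    return rng.gauss(mu, sigma)

# Hypothetical hold latencies (ms) observed for one key-code.
observed = [82.0, 90.0, 95.0, 101.0, 88.0]

# Parameters that minimize the loss for this key (what training would converge to
# if the network could fit this key perfectly).
mu_hat = sum(observed) / len(observed)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in observed) / len(observed))

synthetic = sampling_layer(mu_hat, sigma_hat, random.Random(0))
```

Any other choice of μ or σ gives a higher average loss on `observed`, which is why minimizing this loss pushes the predicted distribution toward the empirical one for each key.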

Bot Detection Framework

Experimental scenarios:

Closed Set: train and test on the same type of synthetic data (different samples, same generator)
Open Set: train on one type of synthetic data, test on another (e.g. train=Universal, test=GNN)

Classifiers:

One-Class SVM: trained only on human samples (baseline for assessing realism)
SVM (RBF kernel): binary classifier (human vs synthetic)
Random Forest: tree-based ensemble
Gaussian Naive Bayes: probabilistic, assumes feature independence
LSTM: recurrent network for sequences of keystroke features

Data scenarios (number of synthetic users used for training):

Limited: 20 users (300 synthetic + 300 human samples)
Medium: 100 users (1,500 + 1,500)
Large: 500 users (7,500 + 7,500)

Evaluation: test set of 500 bot + 500 human users (15,000 samples total at 15 samples per user)
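
The sample counts above follow from the number of users, assuming every user contributes 15 samples (the 15 sentences per user of the Dhakal protocol). The helper name `scenario_size` is hypothetical.

```python
SAMPLES_PER_USER = 15  # sentences per user, as listed in the dataset description

def scenario_size(n_users: int) -> dict:
    """Samples per class and in total for a balanced human/synthetic set."""
    per_class = n_users * SAMPLES_PER_USER
    return {"synthetic": per_class, "human": per_class, "total": 2 * per_class}

sizes = {name: scenario_size(n)
         for name, n in [("limited", 20), ("medium", 100), ("large", 500)]}
```

Under the same assumption, a balanced evaluation set built from 500 users per class contains 15,000 samples.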

Main Concepts

Keystroke Dynamics: behavioral biometrics based on typing patterns (timing between key press/release events); enables passive authentication
Kernel Density Estimation (KDE): non-parametric method for approximating probability distributions; it estimates a density from samples
Rollover Typing: the next key is pressed before the current one is released (negative Inter-Key Latency); natural in fast typing
Semi-Fixed Text Protocol: each user memorizes a sentence (fixed), but different users type different sentences (semi-fixed); reduces intra-class variability compared with free text
Generative Neural Network (GNN): a network that learns distribution parameters (μ, σ) instead of deterministic values; its loss is a likelihood function
One-Class Classification: training on a single class (human) and detecting outliers (bots); used here to test how realistic the synthetic samples are
Few-Shot Learning: a scenario with very little labeled training data (20-100 users); a challenge for deep learning
Passive CAPTCHA: bot detection without active user interaction, by monitoring keystroke dynamics in the background (vs. a traditional CAPTCHA, where the user solves a puzzle)
Web-Biometrics: behavioral biometrics in a web context (keystroke, mouse, touch dynamics); transparent to the user

Results

RQ1: Realism of Synthetic Samples (One-Class SVM)

One-Class SVM (trained only on human, test on bot detection):

Gen Model	# Train Users	Accuracy (K=0)	Accuracy (K=1)
User-dep	20	0.43	0.44
User-dep	100	0.54	0.54
User-dep	500	0.53	0.54
Universal	20	0.43	0.44
Universal	100	0.55	0.56
Universal	500	0.53	0.54
GNN	20	0.49	0.47
GNN	100	0.52	0.51
GNN	500	0.53	0.53

Conclusion: OC-SVM accuracy is ~50% (chance level), so the synthetic samples are very close to human ones and hard to distinguish without labeled synthetic data
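
Why realistic samples push a one-class detector to chance level can be shown with a toy simulation. The sketch below is not the paper's One-Class SVM: it uses a simple "within two standard deviations" rule as a stand-in, a single hypothetical feature, and a generator that matches the human distribution exactly.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical hold latencies (ms). The "bot" generator matches the human distribution.
human_train = [rng.gauss(95.0, 15.0) for _ in range(2000)]
human_test = [rng.gauss(95.0, 15.0) for _ in range(2000)]
bot_test = [rng.gauss(95.0, 15.0) for _ in range(2000)]

mu = statistics.fmean(human_train)
sd = statistics.stdev(human_train)

def is_human(x: float) -> bool:
    """One-class rule fitted on human data only: accept values within 2 standard deviations."""
    return abs(x - mu) <= 2.0 * sd

# Balanced test set: humans should be accepted, bots should be rejected.
correct = sum(is_human(x) for x in human_test) + sum(not is_human(x) for x in bot_test)
accuracy = correct / (len(human_test) + len(bot_test))
```

Almost all humans are accepted, but so are almost all bots, and on a balanced test set the two effects cancel out to roughly 50% accuracy.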

RQ2: Bot Detection Performance (Closed Set, K=1 with key-codes)

Large-scale (500 users):

Gen Model	SVM	GNB	RF	LSTM
User-dep	0.90	0.64	0.95	0.99
Universal	1.00	0.70	1.00	1.00
GNN	0.99	0.68	0.99	1.00

Medium-scale (100 users):

Gen Model	SVM	GNB	RF	LSTM
User-dep	0.79	0.63	0.92	0.51
Universal	0.98	0.67	0.98	0.79
GNN	0.97	0.64	0.98	0.60

Few-shot (20 users):

Gen Model	SVM	GNB	RF	LSTM
User-dep	0.77	0.65	0.88	0.50
Universal	0.82	0.68	0.94	0.53
GNN	0.80	0.68	0.95	0.52

Conclusions:

Large-scale: near-perfect classification (99-100%) for Universal/GNN with SVM/RF/LSTM
Medium-scale: RF is best (92-98%); LSTM is weaker, especially for User-dep/GNN (51-60%), and needs more data
Few-shot: RF is clearly best (88-95%); LSTM fails (~50%, chance level)
GNB: always ~63-70% (it does not account for correlations between features)
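
The three tables above can be cross-checked mechanically. The sketch below transcribes the K=1 closed-set accuracies and ranks the classifiers by their mean accuracy over the three generators; averaging over generators is a choice made for this illustration, not a metric reported in the paper.

```python
# Closed-set accuracies (K=1) transcribed from the tables above,
# keyed by number of training users, then generator, then classifier.
results = {
    500: {
        "User-dep":  {"SVM": 0.90, "GNB": 0.64, "RF": 0.95, "LSTM": 0.99},
        "Universal": {"SVM": 1.00, "GNB": 0.70, "RF": 1.00, "LSTM": 1.00},
        "GNN":       {"SVM": 0.99, "GNB": 0.68, "RF": 0.99, "LSTM": 1.00},
    },
    100: {
        "User-dep":  {"SVM": 0.79, "GNB": 0.63, "RF": 0.92, "LSTM": 0.51},
        "Universal": {"SVM": 0.98, "GNB": 0.67, "RF": 0.98, "LSTM": 0.79},
        "GNN":       {"SVM": 0.97, "GNB": 0.64, "RF": 0.98, "LSTM": 0.60},
    },
    20: {
        "User-dep":  {"SVM": 0.77, "GNB": 0.65, "RF": 0.88, "LSTM": 0.50},
        "Universal": {"SVM": 0.82, "GNB": 0.68, "RF": 0.94, "LSTM": 0.53},
        "GNN":       {"SVM": 0.80, "GNB": 0.68, "RF": 0.95, "LSTM": 0.52},
    },
}

CLASSIFIERS = ("SVM", "GNB", "RF", "LSTM")

def mean_accuracy(n_users: int, clf: str) -> float:
    """Average accuracy of one classifier over the three generators."""
    rows = results[n_users].values()
    return sum(row[clf] for row in rows) / len(rows)

# Best classifier per data scenario under this averaging.
best = {n: max(CLASSIFIERS, key=lambda c: mean_accuracy(n, c)) for n in results}
```

Under this averaging, LSTM comes out on top only in the 500-user scenario, while RF leads in the 100-user and 20-user scenarios.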

RQ3: Generalization (Open Set, 500 users, K=1)

Train=Universal, Test=GNN:

Classifier	Accuracy
OC-SVM	0.54
SVM	0.62
RF	0.54
LSTM	0.99

Train=GNN, Test=Universal:

Classifier	Accuracy
OC-SVM	0.54
SVM	0.98
RF	0.95
LSTM	1.00

Conclusion: training on GNN samples generalizes better to Universal samples than the reverse (95-100% vs 54-99% for SVM/RF/LSTM), which suggests GNN samples are more diverse/challenging
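
The asymmetry between the two open-set directions is easiest to see per classifier. The sketch below transcribes the binary-classifier rows of the two tables above (OC-SVM is left out because it never sees synthetic training data) and computes the accuracy gap between directions; the variable names are hypothetical.

```python
# Open-set accuracies (500 users, K=1) transcribed from the two tables above.
train_universal_test_gnn = {"SVM": 0.62, "RF": 0.54, "LSTM": 0.99}
train_gnn_test_universal = {"SVM": 0.98, "RF": 0.95, "LSTM": 1.00}

# Accuracy gained per classifier when training on GNN samples instead of Universal ones.
gap = {
    clf: round(train_gnn_test_universal[clf] - train_universal_test_gnn[clf], 2)
    for clf in train_universal_test_gnn
}
```

SVM and RF lose most of their accuracy when trained on Universal samples and tested on GNN samples, while LSTM is nearly unaffected by the direction.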

Comparison with the State of the Art (500 users, K=1, Open Set)

Method	Train=Univ, Test=GNN	Train=GNN, Test=Univ
[5] Euclidean	0.49	0.49
[30] SVM	0.62	0.98
Ours (LSTM)	0.99	1.00
Ours (RF)	0.54	0.95

Conclusion: the LSTM classifier outperforms the earlier methods (Alamri 2022, Stefan 2012), reaching near-perfect detection (99-100%)

Key Findings

Realism hierarchy: User-dependent (~50% OC-SVM) ≈ GNN ≈ Universal; all three generate realistic samples
Best detector: LSTM (large-scale), RF (medium/few-shot)
Key-codes impact: minimal for RF/GNB, helpful for LSTM (key-time associations), harmful for SVM in few-shot (the text is identical for human and bot)
Training data requirements:
- LSTM: needs 500+ users to converge
- RF: works from 20 users (tree-based, copes with sparse data)
- SVM: 100+ users for good performance
Passive CAPTCHA viability: with 500 training users, up to 100% detection is possible (keystroke dynamics as transparent bot detection)

Useful Quotes

“The use of Artificial Intelligence (AI) in cyberattacks is an important concern for our society. Along with the massive use of the Internet, the usage of bots to access digital services and platforms has grown, being the detection of these bots an open challenge with a high worldwide economical impact.” (p. 1)

“Biometric technologies appear as a solution to distinguish between human and synthetic behaviors.” (p. 1)

“Keystroke biometrics play an important role in bot detection due to its suitability in digital environments. Keyboards and touchscreens are among the most common human-machine interfaces nowadays.” (p. 1)

“The low performance of the OC SVM suggests that synthetic samples present realistic patterns which can not be differentiate from those obtained in real data (using a one-class classification algorithm).” (p. 6)

“The high bot detection accuracy obtained for the binary SVM classifier answers the question about the usefulness of including synthetic samples in training.” (p. 6)

“From the way in which the synthetic samples are generated and the models trained, it is more relevant a coherence between the different keystroke time features than the coherence between each keystroke time feature and the key-code.” (p. 7)

“In this work we have analyzed the feasibility of using a behavioral trait (dynamic typing) such as passive CAPTCHA where the subject has no need to perform any activity in order for the system to determine if this subject is a bot or a human.” (p. 8)

Datasets

Dhakal Dataset (2018) - dhakal-typing-observations-2018 - 168,000 users, 136M keystroke events, semi-fixed text protocol (15 sentences/user, 3-70 chars), 100K train / 68K test split

Related Topics

Synthetic biometric data generation (GANs for fingerprints, faces, iris)
Keystroke dynamics for user authentication (fixed-text vs free-text)
Adversarial attacks on biometric systems (presentation attacks, synthetic forgeries)
Deep learning for time series synthesis (Flow-GAN, Neural Autoregressive)
Few-shot learning for biometrics (limited enrollment data)
Passive authentication systems (transparent to user)
Behavioral biometrics (gait, mouse dynamics, touch dynamics)
CAPTCHA evolution (traditional → behavioral → passive)
Privacy-preserving biometrics (keystroke without key-codes)
Transfer learning in keystroke biometrics (cross-device, cross-language)

Notes

Limitations:

No User-dependent GNN: a User-dependent GNN was not implemented (too few samples per user in Dhakal: 15 sentences)
No key-sequence dependencies: none of the methods models dependencies between consecutive keys (e.g. ‘th’ vs ‘ql’ digraphs)
Semi-fixed text: the protocol assumes a memorized sentence (not fully free text), so results may not generalize to spontaneous typing
Single dataset: only Dhakal, no cross-dataset validation
Desktop keyboards only: no mobile/touchscreen keystroke dynamics

Key innovations:

Novel GNN architecture: a network that learns distribution parameters (μ, σ) instead of deterministic values, trained with a likelihood loss function
Comprehensive comparison: 3 synthesis methods × 5 classifiers × 3 data scenarios × 2 protocols (with/without key-codes)
Few-shot analysis: first work to analyze keystroke bot detection in few-shot settings (20-100 users)
Passive CAPTCHA validation: empirical evidence for the viability of keystroke dynamics as passive bot detection
Generalization study: cross-generator Open Set experiments (train=Universal, test=GNN and vice versa)

Potential extensions:

Key-sequence modeling: LSTM/Transformer for dependencies between consecutive keys (digraphs, trigraphs)
User-dependent GNN: needs more samples per user (50+ sentences); feasible with continuous authentication datasets
Multimodal synthesis: keystroke + mouse + scroll dynamics (full web-biometrics)
Adversarial training: GAN-based approach where the generator tries to fool the discriminator (bot detector)
Cross-device synthesis: generate mobile keystroke data from desktop patterns (transfer learning)
Language-aware synthesis: model language-specific patterns (English vs Spanish typing speeds)

Links to RESEARCH-TOPICS-FROM-OPEN-RECOMMENDATION.md:

Topic 1.2: Behavioral Biometrics for Bot Detection - synthetic keystroke data for training detectors
Topic 3.1: Transformer Models for Behavioral Analysis - possible extension from LSTM to Transformer for key sequences
Topic 4.1: Emotion Detection from Behavioral Events - typing speed/corrections may indicate frustration
Topic 5.2: GDPR-Compliant Behavioral Analytics - keystroke data without key-codes (privacy-preserving, K=0 experiments)

Practical applications:

Passive CAPTCHA: transparent bot detection (RF with 20 users = 88-95% accuracy, up to 100% with 500 users)
Biometric system testing: synthetic attacks for testing the robustness of keystroke authentication systems
Data augmentation: enlarging training data for keystroke authentication (especially in few-shot scenarios)
Privacy research: studying trade-offs between accuracy and privacy (with/without key-codes)
E-commerce bot detection: integration with web forms (login, search, checkout) through passive monitoring
