SECURITY · Python · scikit-learn · pandas

Threat Detection System

Supervised ML pipeline that classifies network and system logs as benign or malicious, combining a Random Forest classifier with anomaly scoring to flag intrusions in real time.

scikit-learn · Python · Random Forest · IsolationForest

Built a dual-signal detection pipeline: a Random Forest classifier, trained on labeled flows, scores each record as benign or malicious, while an IsolationForest flags structurally anomalous records that fall outside the learned distribution. This pairs supervised accuracy on known attacks with unsupervised coverage for unknown attack patterns.
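A minimal sketch of how the two signals could sit side by side, on synthetic data. The function names, the choice to fit the IsolationForest on benign rows only, and the "flag if either signal fires" rule are assumptions made for illustration; the project's actual combination logic is not described here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier


def fit_dual_detector(X_train, y_train, random_state=0):
    """Fit the supervised and unsupervised models on the same feature matrix."""
    clf = RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=random_state
    )
    clf.fit(X_train, y_train)
    # Assumption: the anomaly model learns only from benign rows (label 0).
    iso = IsolationForest(n_estimators=100, random_state=random_state)
    iso.fit(X_train[y_train == 0])
    return clf, iso


def score_records(clf, iso, X, threshold=0.5):
    """Return (malicious probability, anomaly flag, combined flag) per record."""
    malicious_col = list(clf.classes_).index(1)
    proba = clf.predict_proba(X)[:, malicious_col]
    anomalous = iso.predict(X) == -1  # IsolationForest marks outliers as -1
    # Assumption: a record is flagged when either signal fires.
    flagged = (proba >= threshold) | anomalous
    return proba, anomalous, flagged


# Synthetic stand-in for flow features: two well-separated clusters.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
y_train = np.array([0] * 200 + [1] * 200)

clf, iso = fit_dual_detector(X_train, y_train)
# One benign-looking point, one attack-looking point, one far outlier.
X_new = np.array([[0.0, 0.0], [4.0, 4.0], [50.0, -50.0]])
proba, anomalous, flagged = score_records(clf, iso, X_new)
```

Keeping the probability and the anomaly flag as separate outputs lets a downstream consumer tell "looks like a known attack" apart from "looks like nothing seen in training".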

Wrote parsers for two real-world log formats: CICIDS2017 CSV (handling column-name whitespace, 'Infinity' string values in rate columns, and label normalization) and Suricata eve.json (in both newline-delimited and JSON-array form), so the same trained model can run against captures from different toolchains without a separate pre-processing step.
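The parsing quirks listed above can be sketched roughly as follows. `load_cicids_csv` and `load_eve_json` are hypothetical names, the binary BENIGN/other label mapping is an assumption, and the rate-column names are the ones used in the public CICIDS2017 CSVs.

```python
import io
import json

import numpy as np
import pandas as pd


def load_cicids_csv(source):
    """Read a CICIDS2017-style CSV and clean the quirks named above."""
    df = pd.read_csv(source)
    # Header cells may carry stray whitespace, e.g. " Label".
    df.columns = df.columns.str.strip()
    # Rate columns may hold 'Infinity'; coerce to float, then treat
    # infinite values as missing.
    for col in ("Flow Bytes/s", "Flow Packets/s"):
        if col in df.columns:
            values = pd.to_numeric(df[col], errors="coerce")
            df[col] = values.replace([np.inf, -np.inf], np.nan)
    # Assumption: labels collapse to binary, BENIGN -> 0, anything else -> 1.
    label = df["Label"].astype(str).str.strip().str.upper()
    df["is_malicious"] = (label != "BENIGN").astype(int)
    return df


def load_eve_json(text):
    """Parse Suricata eve.json content given as NDJSON or as one JSON array."""
    stripped = text.strip()
    if stripped.startswith("["):
        events = json.loads(stripped)
    else:
        events = [json.loads(line) for line in stripped.splitlines() if line.strip()]
    # Flatten nested objects: {"flow": {"bytes_toserver": 1}} becomes a
    # "flow.bytes_toserver" column.
    return pd.json_normalize(events)
```

Both loaders return a DataFrame, so whatever feature-selection step follows can treat the two sources uniformly.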

Feature engineering separates numeric flow statistics (duration, packet counts, byte volumes, flow rates) from categorical protocol fields via a ColumnTransformer, with StandardScaler on numerics and OneHotEncoder on categoricals. The transformed features feed a class-balanced Random Forest trained in parallel across all cores.
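As a rough illustration of that layout, the sketch below wires a ColumnTransformer into a scikit-learn Pipeline. The column names, the toy data, `handle_unknown="ignore"`, and the estimator count are assumptions; `class_weight="balanced"` and `n_jobs=-1` are one plausible reading of "class-balanced" and "all cores".

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(numeric_cols, categorical_cols, random_state=0):
    """Scale numeric columns, one-hot encode categoricals, then classify."""
    preprocess = ColumnTransformer(
        [
            ("num", StandardScaler(), numeric_cols),
            # Unseen categories at predict time encode as all zeros.
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ]
    )
    model = RandomForestClassifier(
        n_estimators=100,
        class_weight="balanced",  # reweight classes inversely to frequency
        n_jobs=-1,                # use every available core
        random_state=random_state,
    )
    return Pipeline([("preprocess", preprocess), ("model", model)])


# Toy frame with hypothetical column names.
train = pd.DataFrame(
    {
        "duration": [1.0, 2.0, 50.0, 60.0, 1.5, 55.0],
        "packets": [3, 4, 400, 500, 2, 450],
        "protocol": ["tcp", "udp", "tcp", "tcp", "udp", "tcp"],
    }
)
labels = [0, 0, 1, 1, 0, 1]

pipe = build_pipeline(["duration", "packets"], ["protocol"])
pipe.fit(train, labels)

# "icmp" never appeared in training; the encoder ignores it instead of raising.
new = pd.DataFrame({"duration": [1.2], "packets": [3], "protocol": ["icmp"]})
proba = pipe.predict_proba(new)
```

Bundling preprocessing and model in one Pipeline means the fitted scaler and encoder are serialized with the classifier, so prediction applies exactly the transformations learned at training time.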

The CLI (`threatdetect train / predict / evaluate`) handles the full lifecycle: training with an 80/20 stratified split, serializing the model bundle and a metadata JSON with top-20 feature importances, and producing per-record threat probability reports in JSON for downstream alerting or SIEM ingestion.
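The train and predict stages might look something like the sketch below. The function names, the file names `model.joblib` and `metadata.json`, the bundle layout, and the report fields are all hypothetical; only the 80/20 stratified split, the top-20 importances, and the JSON outputs come from the description above.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def train_and_save(X, y, feature_names, out_dir, random_state=0):
    """Train on an 80/20 stratified split and write bundle plus metadata."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    model = RandomForestClassifier(
        n_estimators=50, class_weight="balanced", n_jobs=-1, random_state=random_state
    )
    model.fit(X_train, y_train)
    # Rank features by impurity-based importance and keep the top 20.
    ranked = sorted(
        zip(feature_names, model.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )
    metadata = {
        "n_train": int(len(y_train)),
        "n_test": int(len(y_test)),
        "top_features": [
            {"feature": name, "importance": float(value)} for name, value in ranked[:20]
        ],
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    joblib.dump({"model": model, "feature_names": list(feature_names)}, out / "model.joblib")
    (out / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return metadata


def predict_report(bundle_path, X):
    """Load a saved bundle and return a per-record JSON threat report."""
    model = joblib.load(bundle_path)["model"]
    malicious_col = list(model.classes_).index(1)
    proba = model.predict_proba(X)[:, malicious_col]
    records = [{"record": i, "threat_probability": float(p)} for i, p in enumerate(proba)]
    return json.dumps(records)


# Synthetic demo: 100 records, 25 features, label driven by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 25))
y = (X[:, 0] > 0).astype(int)
names = [f"f{i}" for i in range(25)]
with tempfile.TemporaryDirectory() as tmp:
    metadata = train_and_save(X, y, names, tmp)
    report = json.loads(predict_report(Path(tmp) / "model.joblib", X[:3]))
```

A flat list of per-record objects is easy to stream into an alerting rule or a SIEM ingest job, which is the downstream use named above.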

Trained on 2,264,594 records from the CICIDS2017 dataset (covering DoS, DDoS, PortScan, Brute Force, Web Attacks, Infiltration, and Botnet traffic) and evaluated on a held-out 566,149 records: 99.33% accuracy, 98.96% macro-F1, 99.89% ROC-AUC.
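The two record counts sum to 2,830,743, of which 566,149 is 20% rounded up, consistent with an 80/20 split. The three reported metrics could be computed along these lines with scikit-learn; the `evaluate` helper, the 0.5 decision threshold, and the four-record toy input are assumptions for illustration and have no connection to the reported figures.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score


def evaluate(y_true, threat_proba, threshold=0.5):
    """Compute accuracy, macro-F1, and ROC-AUC for binary threat scores."""
    # Accuracy and F1 need hard predictions; ROC-AUC uses the raw scores.
    y_pred = (np.asarray(threat_proba) >= threshold).astype(int)
    return {
        "accuracy": float(accuracy_score(y_true, y_pred)),
        "f1_macro": float(f1_score(y_true, y_pred, average="macro")),
        "roc_auc": float(roc_auc_score(y_true, threat_proba)),
    }


# Toy input: two benign (0) and two malicious (1) records.
metrics = evaluate([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Macro-averaged F1 weights the benign and malicious classes equally, which matters on imbalanced traffic where accuracy alone can look high even when the minority class is poorly detected.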