Supervised ML pipeline that classifies network and system logs as benign or malicious, combining a Random Forest classifier with anomaly scoring to flag intrusions in real time.
Built a dual-signal detection pipeline: a Random Forest classifier assigns each flow a threat probability learned from labeled training data, while an IsolationForest flags structurally anomalous records that fall outside the learned distribution. This pairs supervised accuracy on known attacks with unsupervised coverage for unknown attack patterns.
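A minimal sketch of how the two signals could sit side by side, assuming scikit-learn and NumPy. The synthetic data, the `score` helper, the 0.5 threshold, and the OR-combination rule are all illustrative assumptions; the project's actual fusion logic is not described here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

# Synthetic stand-in for flow features: two Gaussian clusters (assumption).
rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=(200, 4))
attack = rng.normal(4.0, 1.0, size=(40, 4))
X = np.vstack([benign, attack])
y = np.array([0] * 200 + [1] * 40)

# Supervised signal: probability of the malicious class.
clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
).fit(X, y)

# Unsupervised signal: fit on the same training matrix, labels unused.
iso = IsolationForest(n_estimators=100, random_state=0).fit(X)


def score(records, threshold=0.5):
    """Return (threat probability, anomaly score, combined flag) per record."""
    threat_prob = clf.predict_proba(records)[:, 1]
    # score_samples is lower for outliers, so negate it: higher = more anomalous.
    anomaly_score = -iso.score_samples(records)
    is_anomaly = iso.predict(records) == -1
    # Hypothetical combination rule: alert if either signal fires.
    flagged = (threat_prob >= threshold) | is_anomaly
    return threat_prob, anomaly_score, flagged


# A record far from both training clusters, standing in for an unseen pattern.
novel = np.full((1, 4), -8.0)
novel_prob, novel_anomaly, novel_flag = score(novel)
```

Keeping both scores in the output, rather than only the combined flag, lets a downstream consumer tell a confident classifier hit apart from an anomaly-only alert.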
Wrote parsers for two real-world log formats: CICIDS2017 CSV (handling column-name whitespace, 'Infinity' string values in rate columns, and label normalization) and Suricata eve.json (in both newline-delimited and JSON-array form), so the same trained model can run against captures from different toolchains without a separate pre-processing step.
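The parsing quirks listed above can be sketched with the standard library alone. The function names, the binary benign/malicious label mapping, and the choice to turn non-finite values into `None` are assumptions made for illustration, not the project's actual parser.

```python
import csv
import io
import json
import math


def clean_cicids_row(row):
    """Normalize one CICIDS2017 CSV row (hypothetical helper)."""
    cleaned = {}
    for key, value in row.items():
        if key is None:  # extra cells beyond the header; skip them
            continue
        name = key.strip()  # headers such as " Label" carry stray whitespace
        text = (value or "").strip()
        if name == "Label":
            # Assumed normalization: BENIGN -> benign, anything else -> malicious.
            cleaned["label"] = "benign" if text.upper() == "BENIGN" else "malicious"
            continue
        try:
            number = float(text)  # float() also accepts "Infinity" and "NaN"
        except ValueError:
            cleaned[name] = text  # keep non-numeric fields as strings
            continue
        # Non-finite rates become None so a later step can impute or drop them.
        cleaned[name] = number if math.isfinite(number) else None
    return cleaned


def read_cicids_csv(text):
    """Parse CSV text into a list of cleaned row dicts."""
    return [clean_cicids_row(row) for row in csv.DictReader(io.StringIO(text))]


def read_eve_json(text):
    """Accept eve.json content as a JSON array or as newline-delimited JSON."""
    stripped = text.lstrip()
    if stripped.startswith("["):
        return json.loads(stripped)
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

For example, `read_cicids_csv(" Flow Duration,Flow Bytes/s, Label\n10,Infinity,BENIGN\n")` yields one row whose `Flow Bytes/s` value is `None` and whose `label` is `"benign"`.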
Feature engineering separates numeric flow statistics (duration, packet counts, byte volumes, flow rates) from categorical protocol fields via a ColumnTransformer — StandardScaler on numerics, OneHotEncoder on categoricals — then feeds into a balanced-class Random Forest with parallelized training across all cores.
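The preprocessing and model layout described above might look roughly like this in scikit-learn. The column names, the toy DataFrame, `build_pipeline`, `n_estimators`, and `handle_unknown="ignore"` are illustrative assumptions; only the overall shape (scaler plus one-hot encoder feeding a balanced, parallel Random Forest) comes from the description.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names standing in for the real flow features.
NUMERIC = ["duration", "packets", "bytes", "bytes_per_s"]
CATEGORICAL = ["proto"]


def build_pipeline(n_estimators=100, random_state=0):
    """Scale numerics, one-hot encode categoricals, then fit a Random Forest."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        # Assumption: ignore protocol values not seen during training.
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    forest = RandomForestClassifier(
        n_estimators=n_estimators,
        class_weight="balanced",  # reweight classes inversely to frequency
        n_jobs=-1,                # train trees on all available cores
        random_state=random_state,
    )
    return Pipeline([("preprocess", preprocess), ("forest", forest)])


# Tiny made-up dataset, only to show the pipeline fitting end to end.
toy = pd.DataFrame({
    "duration": [1.0, 2.0, 1.5, 30.0, 28.0, 35.0],
    "packets": [10, 12, 9, 900, 850, 990],
    "bytes": [800, 950, 700, 90000, 88000, 99000],
    "bytes_per_s": [800.0, 475.0, 466.7, 3000.0, 3142.9, 2828.6],
    "proto": ["tcp", "udp", "tcp", "tcp", "tcp", "udp"],
})
labels = [0, 0, 0, 1, 1, 1]

model = build_pipeline(n_estimators=20).fit(toy, labels)
proba = model.predict_proba(toy)[:, 1]  # per-record threat probability
```

Wrapping both steps in one `Pipeline` means the fitted scaler and encoder travel with the classifier, so prediction on new logs applies the same transformations used in training.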
The CLI (`threatdetect train / predict / evaluate`) handles the full lifecycle: training with an 80/20 stratified split, serializing the model bundle and a metadata JSON with top-20 feature importances, and producing per-record threat probability reports in JSON for downstream alerting or SIEM ingestion.
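One way to wire up the three subcommands is `argparse` subparsers. Only the program name and the `train`, `predict`, and `evaluate` subcommand names come from the description above; every flag and default below (`--input`, `--model-out`, `--test-size`, `--model`, `--report`) is a hypothetical placeholder rather than the tool's real interface.

```python
import argparse


def build_parser():
    """Build a parser with train / predict / evaluate subcommands."""
    parser = argparse.ArgumentParser(prog="threatdetect")
    sub = parser.add_subparsers(dest="command", required=True)

    # Flags below are illustrative assumptions, not the documented CLI.
    train = sub.add_parser("train", help="fit a model on labeled logs")
    train.add_argument("--input", required=True, help="labeled log file")
    train.add_argument("--model-out", default="model.joblib")
    train.add_argument("--test-size", type=float, default=0.2,
                       help="held-out fraction for the stratified split")

    predict = sub.add_parser("predict", help="score unlabeled logs")
    predict.add_argument("--input", required=True)
    predict.add_argument("--model", required=True)
    predict.add_argument("--report", default="report.json",
                         help="per-record threat probabilities as JSON")

    evaluate = sub.add_parser("evaluate", help="score labeled logs and report metrics")
    evaluate.add_argument("--input", required=True)
    evaluate.add_argument("--model", required=True)

    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(["train", "--input", "flows.csv"])
```

Setting `dest="command"` lets the entry point dispatch on `args.command`, and `required=True` makes a bare `threatdetect` invocation fail with a usage message instead of silently doing nothing.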
Trained on 2,264,594 records from the CICIDS2017 dataset — covering DoS, DDoS, PortScan, Brute Force, Web Attacks, Infiltration, and Botnet traffic — and evaluated on a held-out 566,149 records: 99.33% accuracy, 98.96% F1-macro, 99.89% ROC-AUC.
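For readers unfamiliar with the metrics, here is a small pure-Python sketch of accuracy and macro-averaged F1, plus a check that the reported record counts match an 80/20 split. The helper names and the four-record example are illustrative; the project's own evaluation code is not shown here and may differ.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: one benign record misclassified as malicious.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
acc = accuracy(y_true, y_pred)
macro = f1_macro(y_true, y_pred)

# Share of the reported records that were held out for evaluation.
held_out_share = 566_149 / (2_264_594 + 566_149)
```

Because macro F1 averages the classes without weighting by size, it can sit below accuracy when the minority class is harder to classify, which is why both figures are worth reporting on imbalanced traffic data.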