This post takes a look at one of Microsoft’s interesting datasets for applying machine learning to cybersecurity analytics use cases.
We’ll look at some of the issues that have been found with it and use a corrected version of the data.
We’ll mainly use PySpark, the Python API for Apache Spark, running on Databricks to process large-scale, distributed data. We’ll apply it to building a machine-learning model that can distinguish between benign traffic and attack traffic in the dataset.
We’ll then see how to deploy this model as an API using the FastAPI library and run it within our Databricks environment.