With the recent progress in the development of low-budget sensors and machine-to-machine communication, the Internet of Things (IoT) has attracted considerable attention. Unfortunately, many of today’s smart devices are rushed to market with little consideration for basic security and privacy protection, making them easy targets for various attacks. Once a device has been compromised, it can become the starting point for accessing other elements of the network in the next stage of the attack, since the traditional castle-and-moat concept of IT security implies that nodes inside the private network trust each other. For these reasons, IoT will benefit from adopting a zero-trust networking model, which requires strict identity verification for every person and device trying to access resources on a private network, regardless of whether they are located within or outside of the network perimeter. Implementing such a model can, however, be challenging, as access policies have to be updated dynamically in a constantly changing network environment. There is thus a need for an intelligent enhancement of the zero-trust network that would not only detect an intrusion in time, but would also make an optimal real-time crisis-action decision on how the security policy should be modified in order to minimize the attack surface and the risk of subsequent attacks. In this research project, we aim to implement a prototype of such a defense framework, relying on advanced technologies that have recently emerged in the areas of software-defined networking and network function virtualization. The intelligent core of the proposed system is planned to employ several reinforcement learning agents which process the current network state and mitigate both external attacker intrusions and stealthy advanced persistent threats acting from inside the network environment.
The increasing computing and connectivity capabilities of smart devices, in conjunction with users and organizations prioritizing access convenience over security, make such devices a valuable asset for cyber criminals. Intrusion detection in IoT is limited by the lack of efficient malware signatures, caused by the diversity of processor architectures employed by different vendors. In addition, owners mostly use manual workflows to address malware-related incidents and can therefore prevent neither attack damage nor potential future attacks. Furthermore, since not all devices support over-the-air security updates, or updates without downtime, they might need to be physically accessed or temporarily pulled from production. Thus, many connected smart devices may remain vulnerable and potentially infected for a long time, resulting in a material loss of revenue and significant costs incurred not only by device owners, but also by users and organizations targeted by the attackers, as well as by network operators and service providers. A potential solution to these and other emerging challenges in IoT is a zero-trust networking model, which treats all generated data traffic as untrusted, regardless of whether it originates from the internal or the external network.
In this research, we aim to design and implement an intelligent zero-trust networking solution capable of detecting attacks initiated by both external attackers and smart devices from the inside, adapting detection models to a constantly changing network context caused by the addition of new applications and services or the discovery of new vulnerabilities and attack vectors, and making an optimal set of real-time crisis-action decisions on how the network security policy should be modified in order to reduce the ongoing attack surface and minimize the risk of subsequent attacks. These decisions, which may include permitting, denying, logging, redirecting, or instantiating certain traffic between the end-points under consideration, are based on behavioral patterns observed in the network and on log data obtained from multiple intrusion and anomaly detectors, and are deployed on the fly with the help of cutting-edge cloud computing technologies such as software-defined networking and network function virtualization. Our implementation of the decision-making mechanism in the proposed system is planned to rely on recent advances in reinforcement learning (RL), a machine learning paradigm in which software agents automatically determine the ideal behavior within a specific context by continually making value judgments to select good actions over bad ones. RL algorithms can be used to solve very complex problems that cannot be solved by conventional techniques, as they aim at long-term results and correct the errors that occur during the training process.
The recent advent of cutting-edge technologies such as cloud computing, mobile edge computing, network virtualization, software-defined networking (SDN) and network function virtualization (NFV) has changed the way in which network functions and devices are implemented, as well as the way in which network architectures are constructed. More specifically, with SDN, network equipment is changing from closed and vendor-specific to open and generic: SDN enables the separation of the control and data planes and allows networks to be programmed through open interfaces. With NFV, network functions previously placed in costly hardware platforms are now implemented as software appliances located on low-cost commodity hardware or running in a cloud computing environment. In this context, network security service provision has shifted toward replacing traditional proprietary middleboxes with virtualized and cloud-based network functions in order to enable automatic security service provision. The software-defined perimeter (SDP) is the second alternative architecture for zero trust and borrows concepts from SDN and NFV. An SDP controller functions as a broker of trust between a client and a gateway, which can flexibly establish a transport layer security tunnel terminating on the gateway inside the network perimeter, allowing access to applications. Each device establishes a unique VPN tunnel with the requested service, while the origin is cloaked from public view. SDP relies on the concepts of network access control in an attempt to minimize the impact of existing and emerging network threats by adding authentication of the hosts. Similar to micro-segmentation, SDP enforces the principle of only providing access to the services that are required.
Besides this authentication function, the SDP controller can enforce authorization policies that may include host type, malware checks, time-of-day access, and other parameters. The data plane will typically rely on an overlay network to connect hosts via VPN tunnels. However, the SDP approach has several drawbacks. For example, an SDP implementation usually requires specific hardware and software gateways and controller appliances. Gateways may be needed at each site where applications are located, making the deployment, management, and maintenance of this infrastructure challenging, especially in large, globally distributed, high-availability environments. In addition, security appliances have to be configured to accept connections and allow traffic from the SDP gateways. Intrusion detection system and firewall rules introduce complexity, holes in the perimeter, and added IT maintenance. In this research project, we focus on solving these drawbacks with the help of state-of-the-art machine learning techniques. Further in this subsection, we briefly overview the systems found in research articles whose functionality overlaps with our intelligent network defense approach.
Similarly to other zero-trust networking frameworks, the system is required to authenticate and authorize every user and device connected to the network under protection, using user and group databases and the device inventory. Certificates allow the system to identify the user and the device; however, they alone do not grant access privileges, in the sense that a network connection can be allowed by an access engine and then immediately overridden by the AI core through adding or removing certain SDN flows in the controller. The AI core takes into consideration the properties of each device, its known vulnerabilities, and its connectivity with other devices, as well as alerts and anomalous patterns related to its traffic flows, and selects one or several security appliances to which the traffic of the device should be forwarded. In addition, the AI core selects an optimal action not only for the current state of the environment, but also for all possible, or the most probable, outcomes of this action. Such a proactive approach allows launching additional VNFs in advance, reducing the security downtime, but at the same time requires more computing and storage resources to be allocated. Since those resources are limited, cost efficiency is an important goal when providing security services. In our implementation, the computing, storage and network resources are used by the core as a constraint the solution must satisfy when making the decision. As for the reward, there are three key metrics that can be used to implement this function, namely the attack damage cost, the countermeasure's positive effect and the countermeasure's negative impact. The RL agent responsible for forwarding the traffic is supposed to be trained to maximize the security performance and minimize the security impact on service quality. In our attack model, we assume that an attacker can be located either outside or inside the network under protection.
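The three reward metrics could, for instance, be combined linearly. The sketch below is a minimal illustration under our own assumptions; the function name, argument names and weights are hypothetical and do not represent the project's final reward definition:

```python
def security_reward(attack_damage_cost: float,
                    countermeasure_benefit: float,
                    countermeasure_impact: float) -> float:
    """Combine the three reward metrics mentioned above into a scalar.

    The linear weighting and the weight values are illustrative
    assumptions, not a finalized design.
    """
    W_DAMAGE, W_BENEFIT, W_IMPACT = 1.0, 1.0, 0.5
    return (W_BENEFIT * countermeasure_benefit
            - W_DAMAGE * attack_damage_cost
            - W_IMPACT * countermeasure_impact)
```

A reward of this shape rewards effective countermeasures while penalizing both the damage done by the attack and the countermeasure's side effects on service quality.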
The external attacker’s primary goal is to exploit and compromise vulnerable devices, whereas the internal attacker focuses on using infected devices as a tool to attack other services. Thus, the security performance is supposed to account both for the number of devices, routers and servers being compromised and for the volumes of malicious traffic sent towards external services. The security impact on service quality can be estimated by counting dropped connections and measuring jitter and latency in the network environment. The training procedure can be carried out similarly to the one for the detection modules: making multiple copies of the network environment under protection and initiating various attack vectors against its components. Training the AI to respond to attacks requires both modeling user and device interactions with the environment and imitating the attacker’s strategies. The latter should be taken into consideration as a part of the proposed AI engine, i.e. the RL agent is supposed to be trained to maximize the minimal possible outcome across various attack vectors. The former can be implemented by analyzing prerecorded sessions, taking into account both network and application protocol semantics.
The second task of the AI core is related to the reconfiguration of security middleboxes. We are planning to implement such middleboxes using various deep learning algorithms. Such algorithms usually rely on one or several hyper-parameters that drastically affect the detection accuracy. Models trained with different parameters result in different sensitivity and resilience of the detection engine. Another challenge related to the intrusion-detection-based approach is the static nature of the models used to describe normal user behavior. Such models cannot be used as a permanent attack detection solution, since traffic in any network system is constantly subject to changes, which may, for example, be related to the varying loads of network services during different times of the day, or to anomalous events caused by a service failure or malicious activity. An intelligent reinforcement learning agent is therefore required to select both a detection algorithm and its parameters that are expected to be optimal in the current state of the network environment. Such an agent observes features extracted from sessions of network traffic flows and sends new configuration parameters via a cloud orchestrator. The RL agent learns the behavioral patterns of benign devices in the network and how those patterns may change when one or several of them are compromised. This allows the agent to select the detection method that is optimal in terms of accuracy in the current state of the network environment.
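As a rough illustration, the agent's action space for middlebox reconfiguration could be enumerated as discrete (detector, hyper-parameter, value) combinations. The detector names and parameter grids below are purely hypothetical placeholders; the actual models and parameters are selected during the project:

```python
# Hypothetical detector families and hyper-parameter grids used only
# to illustrate how a discrete action space could be enumerated.
DETECTORS = {
    "autoencoder": {"threshold": [0.01, 0.05, 0.1]},
    "lstm_classifier": {"window": [16, 32, 64]},
}

# One discrete action per (detector, parameter, value) triple.
ACTIONS = [(name, param, value)
           for name, grid in DETECTORS.items()
           for param, values in grid.items()
           for value in values]

def apply_action(action_id: int) -> dict:
    """Translate the agent's discrete action into a reconfiguration
    request that could be passed to a cloud orchestrator."""
    name, param, value = ACTIONS[action_id]
    return {"detector": name, "params": {param: value}}
```

In this toy enumeration the agent chooses among six actions; a real deployment would likely use a much larger grid or a parameterized action space.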
The first objective is to implement and evaluate several state-of-the-art deep learning algorithms that are planned to be used in the proposed system. The main research task in this work package is implementing and evaluating existing methods as well as developing novel algorithms that can be used to solve various cyber security challenges, including intrusion and anomaly detection, ongoing attack mitigation and future attack prevention. The algorithms are to be adjusted for the network data gathered and generated during this project. The algorithms can be divided into the following two categories:
In order to test the algorithms, the following data sets can be used: network packet captures from the CICIDS2018 and UNSW-NB15 datasets. The former contains 560 GB of traffic generated during 10 days by 470 machines. In addition to benign samples, the dataset includes the following attacks: infiltration of the network from the inside, HTTP denial of service, web, SSH and FTP brute-force attacks, and attacks based on known vulnerabilities. The latter contains 100 GB of traffic generated during two days by three servers, two routers and multiple clients. This dataset includes the following types of attacks: fuzzers, backdoors, DoS, exploits, reconnaissance, shellcodes and worms. Since at this stage of the project we cannot produce a network environment to evaluate the proposed reinforcement learning approach, to evaluate the performance of different RL algorithms we are planning to use OpenAI Gym, which has recently emerged as a standardization effort. OpenAI Gym includes multiple environments to test methods in terms of sample efficiency as well as the ability of an intelligent agent to transfer its experience to novel situations.
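Environments following the classic OpenAI Gym interface expose `reset` and `step` methods that an evaluation loop drives. The toy environment below is our own minimal sketch of that interface, not one of the Gym environments mentioned above; its dynamics and reward are placeholders:

```python
class ToyNetworkEnv:
    """Minimal environment following the classic Gym-style interface
    (reset/step). Purely illustrative; dynamics are placeholders."""

    def __init__(self, horizon: int = 10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0                                 # initial observation

    def step(self, action: int):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0       # toy reward signal
        done = self.t >= self.horizon
        return float(self.t), reward, done, {}     # obs, reward, done, info

def run_episode(env, policy) -> float:
    """Roll out one episode and return the cumulative reward."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total
```

The same `run_episode` loop works unchanged against any environment that implements this interface, which is exactly what makes Gym useful as a standardization effort.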
We concentrate on intrusion detection based on the analysis of network traffic flows. For each such conversation, at each time window, or when a new packet arrives, we extract the following information: flow duration; total number of packets in the forward and backward direction; total size of the packets in the forward direction; minimum, mean, maximum and standard deviation of the packet size in the forward and backward direction and overall in the flow; number of packets and bytes per second; minimum, mean, maximum and standard deviation of the packet inter-arrival time in the forward and backward direction and overall in the flow; total number of bytes in packet headers in the forward and backward direction; number of packets per second in the forward and backward direction; number of packets with different TCP flags; backward-to-forward byte ratio; average number of packets and bytes transferred in bulk in the forward and backward direction; average number of packets in a sub-flow in the forward and backward direction; number of bytes sent in the initial window in the forward and backward direction; minimum, mean, maximum and standard deviation of the time the flow is active; and minimum, mean, maximum and standard deviation of the time the flow is idle. The features can have different scales and are therefore transformed into the range between zero and one with the help of min-max normalization. Performance evaluation of supervised deep learning algorithms for malicious traffic detection shows that they allow us to detect malicious connections with few false alarms. Results for the classification models vary in terms of true positive and false positive rates depending on the number of trainable parameters. The ROC curves below correspond to the detection of the following malicious traffic: FTP brute-force, SSH brute-force, web brute-force and botnet attacks.
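The min-max normalization step can be sketched as follows; mapping a constant feature to zero (to avoid division by zero) is our own assumption about the edge-case handling:

```python
def min_max_normalize(values):
    """Scale a list of raw feature values into the [0, 1] range.

    A constant feature (max == min) is mapped to 0.0 to avoid
    division by zero; this edge-case choice is an assumption.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]
```

In practice the minimum and maximum would be computed on the training split only and then reused to transform validation and test data, to avoid information leakage.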
Regarding reinforcement learning, we experimented with three state-of-the-art algorithms, A2C, ACKTR and PPO, evaluated in several virtualized environments. We concentrated on these algorithms as they can be applied to both discrete and continuous environments. In our experiments, PPO consistently provides good results in terms of both average reward and convergence speed (see the first two figures below). There is only one environment in which ACKTR outperforms PPO. We will therefore use PPO as our baseline algorithm, falling back to ACKTR in case PPO struggles to find any good policy. Furthermore, we ran several experiments with different network architectures (see the last two figures below). The results show that a network with one shared layer followed by two separate streams for the policy and value function is the most promising architecture variant. We also experimented with a shared LSTM layer for both the policy and value function, but the results showed that many more steps are required for the algorithm to converge in this case, which can be critical in a more complicated environment that requires more time per iteration. Code is available on Github; experiment reports can be found here and there.
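The most promising architecture, one shared layer followed by separate policy and value streams, can be sketched as a plain forward pass. The layer sizes and the use of NumPy instead of a deep learning framework are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedActorCritic:
    """Forward pass of a network with one shared hidden layer feeding
    two separate streams: a softmax policy head and a scalar value head.
    Layer sizes are illustrative, not the configuration we evaluated."""

    def __init__(self, obs_dim=8, hidden=64, n_actions=4):
        self.W_shared = rng.normal(0, 0.1, (obs_dim, hidden))
        self.W_policy = rng.normal(0, 0.1, (hidden, n_actions))
        self.W_value = rng.normal(0, 0.1, (hidden, 1))

    def forward(self, obs):
        h = np.tanh(obs @ self.W_shared)        # shared representation
        logits = h @ self.W_policy              # policy stream
        probs = np.exp(logits) / np.exp(logits).sum()
        value = float(h @ self.W_value)         # value stream
        return probs, value
```

Sharing the first layer lets the policy and value function reuse a common representation of the observation while keeping their output heads independent, which is the trade-off this architecture variant explores.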
We tested generative adversarial models for constructing network traffic flows. As mentioned in the introduction, we used a conditional GAN. Features extracted from a few previous packets of the flow are used as an input to a masked LSTM layer of the generator network. These features include direction (request or reply), inter-arrival time, payload size, TCP window size, and TCP flags. The second input of the generator is a random noise vector. The outputs of these two layers are concatenated and the result is fed to an MLP, the output of which is the feature vector of the next packet. The discriminator network also takes features extracted from the previous packets of the flow as an input; its second input is the feature vector produced by the generator. The generator learns to produce flows that are close to the real flows extracted from the datasets, while the discriminator network tries to determine the differences between real and fake flows. The ultimate goal is to obtain a generative network that produces flows indistinguishable from real ones. We trained such GANs separately for the different attacks present in the dataset and for the normal traffic. The figures below show the results of applying the classifiers trained in the previous stage to the traffic generated with the GANs. As one can see, the results for the models trained with flow features are roughly in line with the ones obtained using the real data.
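The generator's input wiring can be sketched as follows. This is a heavily simplified illustration of the conditioning scheme only: the masked LSTM over previous packets is replaced by a mean-pooled summary and the MLP by a single linear layer, both our own stand-in assumptions, and no adversarial training is shown:

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-packet features described above:
# direction, inter-arrival time, payload size, TCP window, TCP flags
FEATURES = 5

class ToyConditionalGenerator:
    """Sketch of the conditional generator's two inputs: a summary of
    previous packets (stand-in for the masked LSTM state) and a noise
    vector, concatenated and mapped to the next packet's features."""

    def __init__(self, noise_dim=8):
        self.noise_dim = noise_dim
        self.W = rng.normal(0, 0.1, (FEATURES + noise_dim, FEATURES))

    def next_packet(self, previous_packets: np.ndarray) -> np.ndarray:
        condition = previous_packets.mean(axis=0)   # stand-in for LSTM summary
        noise = rng.normal(size=self.noise_dim)     # second generator input
        x = np.concatenate([condition, noise])      # concatenated inputs
        return x @ self.W                           # next packet's feature vector
```

The key point illustrated is that each generated packet is conditioned on the history of the flow, so the generator produces coherent sequences rather than independent packets.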
We also tried to apply reinforcement learning for spoofing a neural-network-based intrusion detection system by manipulating the following flow parameters: the probability of a packet being sent in the current time window, the size of the packet (by padding it with zero bytes), the time interval between two adjacent packets, and the size of the socket's receive buffer (which, in theory, should affect the TCP window). The approach is to maintain a malicious TCP flow with the server for as long as possible without it being detected. The idea is to add a packet to a malicious flow in such a way that the flow is classified by the IDS as normal. If at some point the flow is classified as malicious, it is immediately blocked. In general, this approach works in the sense that the attacker becomes more and more efficient at keeping the connection alive, but the flow does not last forever. The figures below show some preliminary results for the HTTP password brute-force attack against the vulnerable application as used by the authors of the CICIDS2017 dataset. As can be seen from the figure, more than 80\% of the packets sent by the attacker go undetected after training the reinforcement learning agent, but sooner or later each malicious flow is classified as such and blocked: the average duration of a malicious flow is only 20 request-reply tuples, which is nevertheless enough for conducting the brute-force attack. As previously, code is available on Github and experiment reports can be found here and there.
We implemented our own network security environment using the OpenDaylight SDN controller and various virtualization technologies. There are two versions of the environment.
In order to evaluate the proposed framework, we designed a simple Python application that sends a random text message to one of the external data servers. At random intervals (between one and three seconds), each device connects to a randomly selected update server and requests several files from it. Some devices are accessed via SSH by external entities, imitating the administration process. DNS queries are resolved with the help of the internal DNS server. To generate malicious traffic, we implemented a simple Mirai-like malware with three attack capabilities. First, the malware is able to scan its local network looking for open SSH server ports and, in case such a server is found, attempts to log in to the server using a short list of user-password combinations. The password brute-forcing is carried out in multiple threads, with the list of credentials shuffled randomly at the beginning of the attack. If the correct password has been found, the malware initiates the download of a copy of itself to the compromised device from an external server. When the download is complete, the malware initiates an HTTP connection to its C\&C server to report that the attack has been successful. In case the C\&C server is not available via HTTP, the malware sends a query with a specific domain name to the DNS server. The domain name is essentially an encoded version of the same report that the malware sends to the C\&C server via HTTP. The DNS server is configured in advance to forward such queries to the domain that belongs to the attacker. The second attack type performed by the malware uses DNS tunneling to exfiltrate a randomly selected file found on the device to the attacker's server, using a scheme similar to the C\&C channel. Finally, once multiple devices are infected with the malware, the attacker performs Slowloris, an application-layer slow DDoS attack, against one of the data servers used by the legitimate application by sending never-ending HTTP requests.
We run multiple copies of the environment in parallel. The training process is divided into episodes. Each episode lasts a certain fixed amount of time, during which one of the attacks is performed. The RL agent is implemented using OpenAI Baselines. The agent selects one of the actions for each flow; the actions are sent to the environment back-end, where they are transformed into SDN rules. The reward for an action is proportional to the number of packets transferred during the most recent time window and is calculated for each flow separately. The proportionality coefficients are positive for legitimate traffic and negative for malicious traffic. The exact values of the coefficients are estimated by running the attack without the agent and counting the average number of packets sent by the application and by the attacker. The coefficients are then calculated in such a way that the total reward without the agent's intervention is equal to zero.
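The coefficient calibration described above amounts to solving a single linear constraint. The sketch below fixes the positive coefficient to 1.0, which is our own normalization convention:

```python
def reward_coefficients(avg_legit_packets: float,
                        avg_malicious_packets: float):
    """Choose per-packet reward coefficients so that the total reward
    without the agent's intervention is zero:

        c_pos * avg_legit_packets + c_neg * avg_malicious_packets = 0

    Fixing c_pos = 1.0 is a convention; only the ratio matters.
    """
    c_pos = 1.0
    c_neg = -c_pos * avg_legit_packets / avg_malicious_packets
    return c_pos, c_neg
```

With this calibration, any positive total reward the agent achieves reflects a genuine improvement over doing nothing, rather than an artifact of the traffic mix.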
We first train the RL agent using the DQN and PPO algorithms, with a multi-layer perceptron (MLP) as both the policy and the value function, to detect and mitigate the attacks mentioned. In the case of DQN, the first 80\% of the episodes are used for $\epsilon$-greedy exploration, with the $\epsilon$ value decreasing from one to zero. For PPO, we collect data from an entire episode to calculate cumulative rewards and advantages for each unique host-to-socket tuple, before dividing the resulting dataset into mini-batches which are used to train both the critic and the actor network. With DQN, for each such tuple, we store the state vector, the action taken by the agent, the reward, and the feature vector of the same tuple in the next time window, provided there is still an active connection; otherwise, the state is marked as terminal for this particular tuple. The figures below show the evolution of the reward function throughout the training episodes for both DQN (red) and PPO (blue) in the case of the three attacks mentioned. As one can notice, both algorithms are able to identify and block malicious connections, reducing the number of malicious flows to a minimum and subsequently increasing the reward value.
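The per-tuple transition bookkeeping used with DQN can be sketched as follows; the class and method names are hypothetical, and the stored tuples follow the standard (state, action, reward, next_state, done) replay format:

```python
class FlowTransitionBuffer:
    """Per-tuple transition storage for DQN as described above: for each
    host-to-socket tuple we emit (state, action, reward, next_state, done)
    once the next time window's state is known, and mark the state as
    terminal when the connection disappears."""

    def __init__(self):
        self.last = {}          # tuple_id -> (state, action, reward)
        self.transitions = []   # completed replay transitions

    def observe(self, tuple_id, state, action, reward):
        """Record the tuple's state in the current time window; this also
        completes the pending transition from the previous window."""
        if tuple_id in self.last:
            s, a, r = self.last[tuple_id]
            self.transitions.append((s, a, r, state, False))
        self.last[tuple_id] = (state, action, reward)

    def close_flow(self, tuple_id):
        """The connection is gone: mark the stored state as terminal."""
        s, a, r = self.last.pop(tuple_id)
        self.transitions.append((s, a, r, None, True))
```

Keeping one pending entry per tuple ensures that next-state vectors are matched to the same host-to-socket tuple rather than to whatever flow happens to be observed next.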