Feature Engineering for Networking

Sameer Mahajan
6 min read · Mar 9, 2024

Introduction

Machine Learning (ML) and networking are both mature fields in their own right, and their intersection is an interesting one. This somewhat dated article explains why it is hard to apply machine learning to networking:

https://www.sdxcentral.com/articles/news/machine-learning-hard-apply-networking/2017/01/

The article's first premise is that there is no single unifying theory behind networking. Something that fundamental does not change overnight. Things have improved since then, but slowly, so anything that helps apply ML to networks is welcome. The area is seeing a lot of research: existing tools and techniques are being enhanced, and many new ones are being developed.

One aspect of ML is dealing with the many attributes of data that characterize its nature. These attributes, extracted from the collected data, are called features in the ML world. Features are transformed and processed so that they can be used for visualization to gain insight, and by ML algorithms to build models and make predictions. This process is called feature engineering. It is an important step in any interdisciplinary application of ML, networking in this case, because it brings domain-specific knowledge into the analysis. This article focuses on feature engineering in networking.

Packet Capture

The first part of the ML pipeline is data. For networking, the data is the traffic itself: the packets flowing through the network. Tools such as Wireshark and tcpdump capture these packets. Once captured, they can be inspected to see all the information they contain. One way to do this is the Wireshark UI. The following is a screenshot of the Wireshark UI showing a network capture (pcap) file.

Wireshark UI

The top part is Packet List View summarizing all packets. It gives top-level fields for all packets like No, Time, Source, Destination, Protocol, Length and Info. The middle part is the Tree View for the packet selected in the list view above. It gives information about all the layers in the packet. You can click on any layer to expand it to get more information inside that layer. The picture shows an expanded view of the IPFIX part. It also marks the timestamp field within this frame as an illustration.

Converting to CSV

As ML practitioners know, many ML tools work most easily with data in Comma Separated Values (CSV) format. Wireshark can export it directly: click File → Export Packet Dissections → As CSV. The saved CSV is loaded and previewed in pandas below:

import pandas as pd
df1 = pd.read_csv('data1.csv')
df1.head()
Dataframe

As you can see this CSV shows the same fields as those in the top-level Packet List View.

What if you want to export fields in other layers like the timestamp in IPFIX? Don’t worry. You can use Wireshark’s command-line tshark utility.

First, you give tshark the input file with the -r option:

-r data.pcap

Let’s say we are interested only in IP traffic. In that case, we would like to filter out ICMP and ARP traffic in this particular example. You can do that with a display filter passed through the -Y option:

-Y "!(icmp or arp)"

Since we want a CSV file with a header row, we additionally specify

-E header=y -E separator=,

To output selected fields rather than full packet dissections, we set the output format with the -T option:

-T fields

Each field is then specified with its own -e option. The complete command line below extracts the timestamp and len fields from IPFIX and the source IP, destination IP and protocol fields from IP:

tshark -r data.pcap -Y "!(icmp or arp)" -E header=y -E separator=, -T fields -e cflow.timestamp -e ip.src -e ip.dst -e ip.proto -e cflow.len > data2.csv
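The same command can be assembled programmatically, which helps when the list of fields grows. The following is only a sketch: the helper name build_tshark_command and its parameters are made up for illustration, and it uses only the tshark options already discussed, plus the optional -E quote setting, which may be worth enabling if a field value can itself contain a comma.

```python
import shlex


def build_tshark_command(pcap_path, fields, display_filter=None, quote=None):
    """Return the tshark argument list that exports `fields` as CSV.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    cmd = ["tshark", "-r", pcap_path]
    if display_filter:
        cmd += ["-Y", display_filter]
    cmd += ["-T", "fields", "-E", "header=y", "-E", "separator=,"]
    if quote:
        # tshark accepts quote=d (double), s (single) or n (none)
        cmd += ["-E", f"quote={quote}"]
    for field in fields:
        cmd += ["-e", field]
    return cmd


cmd = build_tshark_command(
    "data.pcap",
    ["cflow.timestamp", "ip.src", "ip.dst", "ip.proto", "cflow.len"],
    display_filter="!(icmp or arp)",
)
print(shlex.join(cmd))

# To actually run it (requires tshark on the PATH):
#   import subprocess
#   with open("data2.csv", "w") as out:
#       subprocess.run(cmd, stdout=out, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the display filter.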

Here is a screenshot of the data2 preview loaded in a pandas data frame:

Preprocessed dataframe

It shows the timestamp field in the desired datetime format.
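If the timestamp column arrives as plain text, it can be converted after loading. Here is a minimal sketch, assuming the column names match the -e fields above. The rows are made-up sample data standing in for data2.csv, and the exact timestamp text that tshark emits may differ from this format.

```python
import io

import pandas as pd

# Made-up sample standing in for data2.csv; real values depend on the capture.
sample_csv = io.StringIO(
    "cflow.timestamp,ip.src,ip.dst,ip.proto,cflow.len\n"
    "2024-03-09 10:00:00,10.0.0.1,10.0.0.2,17,120\n"
    "2024-03-09 10:00:01,10.0.0.1,10.0.0.2,17,200\n"
    "not-a-date,10.0.0.3,10.0.0.2,6,80\n"
)

df2 = pd.read_csv(sample_csv)

# Unparseable timestamps become NaT instead of raising an error
df2["cflow.timestamp"] = pd.to_datetime(df2["cflow.timestamp"], errors="coerce")

# Drop the rows whose timestamp could not be parsed
df2 = df2.dropna(subset=["cflow.timestamp"])
print(df2.dtypes)
```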

Protocols enrich the information a packet carries. Some techniques instrument packets to add even more, for example a packet’s round trip time (RTT). RTT closely reflects the underlying network latency, which is a very useful metric for analyzing networks. Wireshark supports around 271000 fields across 3000 protocols, which makes captures a gold mine for ML features. See https://www.wireshark.org/docs/dfref/ for the complete reference of fields you can specify. You can also read more about filtering in tshark at: https://tshark.dev/analyze/packet_hunting/packet_hunting/

Connection-Oriented Analysis

A network consists of connections, each identified here by a unique pair of source and destination IP addresses. Captured packets can be grouped by these pairs to identify the unique connections in the network.

df1.groupby(["Source", 'Destination']).count().reset_index()

You can also perform aggregations across all packets for each connection and print a single aggregated value for every attribute as

df1.groupby(["Source", "Destination"]).agg({"Time": "sum", "Length": "mean"})

Here we add up values for Time and average those of Length.
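Per-connection aggregates like these can serve directly as features. The sketch below uses pandas named aggregation on a small made-up data frame with the same column names as the Wireshark export. The feature names (packet_count, total_bytes, mean_length, duration) are illustrative choices, not something the export provides.

```python
import pandas as pd

# Made-up packets using the column names of the Wireshark CSV export
packets = pd.DataFrame({
    "Source": ["10.0.0.1", "10.0.0.1", "10.0.0.3", "10.0.0.1"],
    "Destination": ["10.0.0.2", "10.0.0.2", "10.0.0.2", "10.0.0.4"],
    "Time": [0.0, 1.5, 2.0, 3.0],
    "Length": [100, 300, 60, 80],
})

# One row per (Source, Destination) pair, one column per feature
features = (
    packets.groupby(["Source", "Destination"])
    .agg(
        packet_count=("Length", "size"),
        total_bytes=("Length", "sum"),
        mean_length=("Length", "mean"),
        duration=("Time", lambda t: t.max() - t.min()),
    )
    .reset_index()
)
print(features)
```

Named aggregation keeps the output columns flat and self-describing, which is convenient when the table is fed to a model.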

At times we need to process packets for only one particular connection. In that case we first filter the packets down to that connection and then process them further. This is one way to do it in Python:

def process(df, connection, further_process):
    # Keep only the packets that belong to the given connection
    filtered_df = df[(df["Source"] == connection.Source) & (df["Destination"] == connection.Destination)]
    return further_process(filtered_df)
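A possible way to call this function is sketched below. The Connection namedtuple and the sample packets are assumptions made for illustration; any object with Source and Destination attributes would do.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical container; only the two attribute names matter
Connection = namedtuple("Connection", ["Source", "Destination"])


def process(df, connection, further_process):
    # Keep only the packets that belong to the given connection
    mask = (df["Source"] == connection.Source) & (df["Destination"] == connection.Destination)
    return further_process(df[mask])


# Made-up packets using the column names of the Wireshark CSV export
packets = pd.DataFrame({
    "Source": ["10.0.0.1", "10.0.0.1", "10.0.0.3"],
    "Destination": ["10.0.0.2", "10.0.0.2", "10.0.0.2"],
    "Length": [100, 300, 60],
})

# Mean packet length for a single connection
mean_len = process(
    packets,
    Connection("10.0.0.1", "10.0.0.2"),
    lambda d: d["Length"].mean(),
)
print(mean_len)
```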

Other Levels of Analysis

We can perform server-level analysis by filtering only on Destination (for requests to the server) or Source (for responses from the server). The same can be done at an aggregated service level: a service is a group of servers, so we filter on a group of Destination or Source values. Comparing servers within a service, or clients communicating with a service, can reveal outliers. This analysis extends to other levels as well, such as subnets.
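As a rough sketch of the service-level comparison, the snippet below filters packets to a group of servers, computes the mean packet length per server, and flags servers that sit far from the group median. The server names, the data and the threshold of 3 median absolute deviations are all assumptions chosen for illustration. Other outlier tests would work just as well.

```python
import pandas as pd

# Made-up packets; s1..s3 stand in for the server addresses of one service
packets = pd.DataFrame({
    "Source": ["c1", "c2", "c1", "c3", "c2", "c1"],
    "Destination": ["s1", "s1", "s2", "s2", "s3", "s3"],
    "Length": [100, 110, 105, 95, 900, 950],
})
service = ["s1", "s2", "s3"]

# Requests to the service: filter on a group of Destination values
requests = packets[packets["Destination"].isin(service)]

# One value per server: the mean packet length it receives
per_server = requests.groupby("Destination")["Length"].mean()

# Flag servers far from the median, measured in median absolute deviations
median = per_server.median()
mad = (per_server - median).abs().median()
outliers = per_server[(per_server - median).abs() > 3 * mad]
print(outliers)
```

Swapping Destination for Source in the filter and the groupby gives the same comparison for responses from the servers.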

Time Series Modelling

Using the timestamp field we can do some time series modelling. For this we will use the Prophet package (published as fbprophet before version 1.0 and as prophet since). We choose the Length field for modelling. Prophet requires the timestamp column to be named ‘ds’ and the modelled column to be named ‘y’. We keep only these two columns in the data frame used for modelling. This is how we do it in Python:

b = df2.rename(columns={'cflow.timestamp': 'ds', 'cflow.len': 'y'})[["ds", "y"]]
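Packet timestamps are irregular, so one optional refinement, not part of the steps above, is to aggregate the modelled field into fixed time bins before fitting. The sketch below sums the length per one-minute bin on made-up data. The bin size and the choice of sum are assumptions to adjust per capture.

```python
import pandas as pd

# Made-up stand-in for df2 with the two columns used for modelling
df2 = pd.DataFrame({
    "cflow.timestamp": pd.to_datetime([
        "2024-03-09 10:00:05", "2024-03-09 10:00:40",
        "2024-03-09 10:01:10", "2024-03-09 10:03:00",
    ]),
    "cflow.len": [120, 200, 80, 60],
})

# Rename to the ds / y convention, then total the length per one-minute bin
b = (
    df2.rename(columns={"cflow.timestamp": "ds", "cflow.len": "y"})[["ds", "y"]]
    .set_index("ds")
    .resample("1min")["y"]
    .sum()
    .reset_index()
)
print(b)
```

With binned data, passing a matching freq to make_future_dataframe (for example freq="min") keeps the forecast horizon on the same grid. The default frequency is daily.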

# This is how to build the time series model:

from prophet import Prophet
m = Prophet()
m.fit(b)

# This is how to make predictions for future and make a plot:

future = m.make_future_dataframe(periods=5)
forecast = m.predict(future)
fig1 = m.plot(forecast)

Conclusion

We started with why it is hard to apply ML to networking. We saw how to gather networking data using packet capture, and how captured packets are a gold mine of features for ML. We saw ways of extracting the information in packets into a flat CSV. Finally, we saw two use cases for processing the extracted features: connection-oriented analysis and time series modelling.

I hope these tips in feature engineering bring you a step closer to applying ML to networking.
