Feature identification and processing strategies of machine learning techniques in big data traffic analysis
Published online: 24 Sept. 2025
Submitted: 28 Dec. 2024
Accepted: 27 Apr. 2025
DOI: https://doi.org/10.2478/amns-2025-0997
© 2025 Ze Li, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
In the era of big data, network attacks take increasingly diverse forms, the volume of security data is growing explosively, and traditional network security analysis methods cannot keep up with the demand for massive data analysis [1-2]. In real-world scenarios such as network access logs, vehicle trajectory tracking, industrial defects, power loads, and medical intensive care, traffic is usually generated in real time and has pronounced temporal characteristics [3-5]. As new attack methods emerge and attack behaviors change, traffic anomalies change as well, which places higher demands on traffic anomaly detection, including feature recognition and processing strategies [6-8].
Anomaly detection is also known as outlier detection, noise detection, novelty detection, or deviation detection [9]. Earlier, the main motivation for detecting outliers in data traffic was to clean up anomalous data, i.e., to identify and remove outliers from the dataset so that the model could fit the training data [10-11]. Today the focus is increasingly on the anomalies themselves, i.e., on what type of anomaly is present, for early warning purposes [12]. In addition, large-scale data flows usually require real-time computation and online detection, which poses great challenges for conventional anomalous traffic detection methods [13-15]. Conventional methods repeatedly extract data in batches for processing and therefore cannot adequately capture fleeting anomalies in real-time data [16-17]. For anomaly detection on large-scale data streams, online detection that applies multiple machine learning algorithms can adapt to large-scale real-time traffic scenarios and improve online training and visual interaction capabilities, which helps strengthen the security protection of information systems.
In this paper, the features of network big data traffic are first analyzed, and the principles of traffic feature extraction and identification are determined according to the characteristics of those features. A stacked autoencoder (SAE) neural network is introduced into the twin (Siamese) neural network model and serves as its input stage, and convolutional and pooling layers are used to extract and identify the features of the input big data traffic. The extracted features are then fed into the twin neural network, and a distance function computes the distance between the features of each pair of traffic samples to determine whether they belong to the same category, thereby realizing feature recognition and processing of big data traffic. Finally, the architecture of the big data traffic recognition and processing model is designed by combining big data traffic preprocessing with the feature recognition and processing model. To evaluate the model, this study analyzes its performance on public datasets and compares it with other models to verify its effectiveness and accuracy. A simulation platform is also built to test the model's ability to detect normal and abnormal traffic in application.
Big data traffic in the network follows a fixed format when it is transmitted: a TCP message is framed by a MAC header, an IP header, a TCP header, and a MAC trailer. The characteristics of application-layer traffic, however, are concentrated in the payload portion of the message. The message is therefore first coarsely extracted: the MAC header, IP header, TCP header, and MAC trailer are removed so that only the payload is retained, and the data is divided by port number into HTTP data on port 80 and HTTPS data on port 443. An examination of the HTTP and HTTPS protocols shows that fields such as HOST and URL in HTTP act as unique identifiers and can be extracted as features of the traffic message.
HTTPS encrypts the message through the SSL/TLS layer on top of HTTP, which makes it difficult to extract useful plaintext information from the payload. Before HTTPS data is transmitted, the client and the server first perform a handshake, and the handshake protocol establishes a secure and reliable encrypted channel. The Server Name Indication (SNI) extension in the Extension field of the Client_hello request carries the HOST information, that is, the domain name that the client announces to the server in advance; the server selects the corresponding certificate according to this domain name and returns it to the client to complete the verification process. Many firewall products use the HOST information in the SNI field to identify HTTPS traffic.
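The extraction step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `extract_http_host_url`, `extract_tls_sni`, and `extract_features` are hypothetical, the payload is assumed to be the reassembled TCP payload with all lower-layer headers already removed, and the ClientHello is assumed to fit in a single TLS record.

```python
import struct
from typing import Optional, Tuple


def extract_http_host_url(payload: bytes) -> Optional[Tuple[str, str]]:
    """Return (host, url) from a plaintext HTTP request payload, or None."""
    head = payload.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None  # first line is not an HTTP request line
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return value.strip(), parts[1]
    return "", parts[1]  # request without a Host header


def extract_tls_sni(payload: bytes) -> Optional[str]:
    """Return the SNI host name from a TLS ClientHello record, or None."""
    try:
        # Record type 0x16 = handshake, handshake type 0x01 = ClientHello.
        if payload[0] != 0x16 or payload[5] != 0x01:
            return None
        pos = 5 + 4                                   # record + handshake headers
        pos += 2 + 32                                 # client version + random
        pos += 1 + payload[pos]                       # session id
        pos += 2 + struct.unpack_from("!H", payload, pos)[0]  # cipher suites
        pos += 1 + payload[pos]                       # compression methods
        ext_end = pos + 2 + struct.unpack_from("!H", payload, pos)[0]
        pos += 2
        while pos + 4 <= ext_end:
            ext_type, ext_len = struct.unpack_from("!HH", payload, pos)
            pos += 4
            if ext_type == 0:                         # server_name extension
                name_type = payload[pos + 2]
                name_len = struct.unpack_from("!H", payload, pos + 3)[0]
                if name_type != 0:                    # 0 = host_name
                    return None
                return payload[pos + 5:pos + 5 + name_len].decode("ascii")
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        return None
    return None


def extract_features(dst_port: int, payload: bytes) -> Optional[dict]:
    """Split traffic by destination port and extract the identifying fields."""
    if dst_port == 80:
        found = extract_http_host_url(payload)
        if found is not None:
            return {"protocol": "HTTP", "host": found[0], "url": found[1]}
    elif dst_port == 443:
        host = extract_tls_sni(payload)
        if host is not None:
            return {"protocol": "HTTPS", "host": host}
    return None
```

Splitting strictly on ports 80 and 443 mirrors the text; traffic on other ports is simply returned as `None` in this sketch.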
Once the presence of Host information in the Client_hello request has been confirmed, it is extracted as a feature of the HTTPS traffic. At this point the traffic has been split by port into HTTP and HTTPS traffic: for HTTP traffic, the HOST and URL information in the payload has been extracted, and for HTTPS traffic, the HOST value in the SNI field of the Client_hello request has been extracted. The traditional deep packet inspection (DPI) approach builds a feature library from these fields and then pattern-matches unknown traffic against the library. This approach has several defects. First, the feature library must be kept up to date, and updating it is laborious. Second, the library contains only a limited set of features, so any HOST or URL that is not identical to a library entry, or is absent from the library, can only be classified as negative, which makes the recall of the whole system very low. To address these problems, this paper proposes combining a machine learning model with DPI to identify traffic, and begins by further analyzing the extracted HOST and URL information.
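The recall problem of exact-match DPI can be made concrete with a small sketch. Everything here is hypothetical and for illustration only: the library entries, host names, and application labels are invented, and real DPI engines use richer pattern matching than a dictionary lookup.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical feature library: exact (host, url) pair -> application label.
FEATURE_LIBRARY: Dict[Tuple[str, str], str] = {
    ("api.example-app.com", "/v1/feed"): "ExampleApp",
    ("cdn.example-video.com", "/play/segment"): "ExampleVideo",
}


def dpi_exact_match(host: str, url: str) -> Optional[str]:
    """Return the label only if (host, url) is identical to a library entry."""
    return FEATURE_LIBRARY.get((host, url))


def recall(samples: List[Tuple[str, str, str]]) -> float:
    """Fraction of labelled samples (host, url, label) that the library finds."""
    hits = sum(1 for host, url, label in samples
               if dpi_exact_match(host, url) == label)
    return hits / len(samples)


samples = [
    ("api.example-app.com", "/v1/feed", "ExampleApp"),            # exact hit
    ("api.example-app.com", "/v1/feed/items/42", "ExampleApp"),   # longer path: miss
    ("api2.example-app.com", "/v1/feed", "ExampleApp"),           # new host: miss
    ("cdn.example-video.com", "/play/segment", "ExampleVideo"),   # exact hit
]
print(recall(samples))
```

Two of the four samples belong to known applications but differ slightly from the stored entries, so exact matching misses them and recall drops to one half in this toy setting.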
The above analysis shows that both the URL and HOST fields are highly distinctive for individual apps, which satisfies the principle of feature extraction. The path field makes it almost impossible for the same URL to appear in different apps, and the directory structure itself follows certain rules. Within one app, path components closer to the root directory are more distinctive, so the components farthest from the root can be trimmed accordingly.
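A minimal sketch of this trimming rule, assuming that "trimming" means keeping only the first few path components counted from the root. The function name `trim_url_path` and the default depth of 2 are assumptions made for illustration; the paper does not state a depth.

```python
from urllib.parse import urlsplit


def trim_url_path(url: str, keep_depth: int = 2) -> str:
    """Keep the keep_depth path components nearest the root; drop the rest."""
    path = urlsplit(url).path                 # discards query string and fragment
    segments = [s for s in path.split("/") if s]
    return "/" + "/".join(segments[:keep_depth])


print(trim_url_path("/v1/feed/items/12345.json?x=1"))  # /v1/feed
```

After trimming, URLs that differ only in deep, request-specific components (item identifiers, file names) map to the same feature string.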
Twin (Siamese) neural networks [18] were first used for face recognition. They are a type of machine learning model consisting of two neural network branches that share their weights; the outputs of the two branches are used to determine how similar the two inputs are. Because the input to a twin neural network is a sample pair, pairing increases the amount of training data and makes full use of a limited dataset, so the approach is well suited to small-sample learning. This paper adopts the fully convolutional twin network (SiamFC) algorithm [19], which uses a deep convolutional neural network in a twin architecture as the function for similarity learning. This greatly accelerates computation and simplifies the similarity calculation, which makes the twin neural network more practical for feature learning on big data traffic.
The SCNN (convolutional twin network) consists of two convolutional neural networks that share parameters and weights. First, a pair of samples, one from the support set and one from the query set of traffic data, is fed into the two identical convolutional neural networks. The traffic data passes through the convolutional and pooling layers, which map its features into the feature space. Finally, a distance function computes the feature distance between the two samples of the pair: if the distance is small, the samples are assigned to the same category; otherwise they are assigned to different categories. The samples are fed into the SCNN model, and the features extracted by the convolutional neural network (CNN) are as follows:
The output of the fully connected and SoftMax layers is expressed as follows:
The contrastive loss function is used in the SCNN to process the paired data of the twin neural network, with the expression:
where Equation (5) denotes the Euclidean distance (L2 norm) between the features of two big data traffic samples.
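For orientation, the contrastive loss and Euclidean distance can be sketched in their widely used textbook form. This is an assumption: the notation, the margin value, and the convention that the label is 1 for a same-category pair are not taken from the paper's equations.

```python
import math
from typing import Iterable, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float], int]


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L2 norm of the difference between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(pairs: Iterable[Pair], margin: float = 1.0) -> float:
    """Mean contrastive loss over (feat_a, feat_b, same) pairs.

    same = 1 pulls a same-category pair together (penalty d^2);
    same = 0 pushes a different-category pair apart until d >= margin.
    """
    total, n = 0.0, 0
    for feat_a, feat_b, same in pairs:
        d = euclidean_distance(feat_a, feat_b)
        total += same * d ** 2 + (1 - same) * max(margin - d, 0.0) ** 2
        n += 1
    return total / (2 * n)


def same_category(a: Sequence[float], b: Sequence[float], threshold: float) -> bool:
    """Decision rule: a small feature distance means the same category."""
    return euclidean_distance(a, b) < threshold
```

In practice the two feature vectors would come from the shared-weight CNN branches; here they are plain lists so that the loss itself can be inspected in isolation.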
The SAE-SCNN model consists of a stacked autoencoder (SAE) neural network [20] and a twin neural network (SCNN). The network built by the SAE serves as the input stage of the twin network: the SAE extracts the traffic features, and the SCNN classifies anomalous traffic by applying the Euclidean distance function to the traffic samples to measure the similarity of each sample pair. The SAE model structure flowchart is shown in Fig. 1. The SAE is a neural network made up of multiple layers of sparse autoencoders, in which the output of each layer is the input of the next. Each layer is initialized through unsupervised training of a sparse autoencoder, which learns the hidden features of that layer; the next layer then takes the output of the previous layer as its input for its own autoencoder training. The SAE is thus trained layer by layer without supervision. Finally, the whole network is validated with labeled traffic data, and the SAE parameters are fine-tuned by gradient descent.

Figure 1. SAE model structure flowchart
For feature extraction of big data traffic, a stacked autoencoder (SAE) neural network is trained. The SAE contains multiple hidden layers, which are composed of the encoding and decoding parts of multiple autoencoders (AEs). The encoding formula of the SAE is as follows:
Its decoding formula is:
Where
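A minimal sketch of stacked encoding and decoding in the standard autoencoder form, where each layer applies an affine map followed by an activation. The sigmoid activation and the names `W`, `b`, `sae_encode`, and `sae_decode` are assumptions chosen for illustration and are not taken from the paper's formulas; training is omitted.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Layer = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (W, b)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine_sigmoid(W: Sequence[Sequence[float]], b: Sequence[float],
                   x: Sequence[float]) -> Vector:
    """Compute sigmoid(W x + b) for one layer."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def sae_encode(encoder_layers: Sequence[Layer], x: Sequence[float]) -> Vector:
    """Stacked encoding: the output of each layer is the input of the next."""
    h = list(x)
    for W, b in encoder_layers:
        h = affine_sigmoid(W, b, h)
    return h


def sae_decode(decoder_layers: Sequence[Layer], h: Sequence[float]) -> Vector:
    """Stacked decoding: map the hidden code back toward the input space."""
    x_hat = list(h)
    for W, b in decoder_layers:
        x_hat = affine_sigmoid(W, b, x_hat)
    return x_hat


def reconstruction_error(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared error that unsupervised layer-wise training would minimise."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
```

The hidden code returned by `sae_encode` plays the role of the extracted traffic feature vector that is passed on to the twin network.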
The big data traffic detection model designed in this paper consists of three components: big data traffic collection, feature recognition and processing, and model construction. Its structural flow is shown in Figure 2. Big data traffic collection replicates network behavior in a virtual environment and captures the traffic with packet-capture software; the captured traffic serves as the sample dataset for subsequent analysis. The feature extraction part uses the SAE-SCNN model proposed above. Its main role is to preprocess the raw input packets and extract features according to fixed rules, so that representative features are separated at the level of the data flow, the input data is given a preliminary characterization, and the corresponding feature vectors are generated. In the big data traffic detection part, to improve the accuracy and efficiency of the classification step, the associations between feature vectors are learned with clustering algorithms, and associated features are generated. In the feature representation and processing part, a feature self-learning network is built by combining the clustering idea with the machine learning model. Through iteration, the network learns a low-dimensional embedded representation of the features and fuses it, as the sample correlation feature, with the original traffic features; the result serves as the final feature representation of the big data traffic. Finally, the classifier module is constructed, and the extended traffic features are fed into it to classify the input big data traffic.

Figure 2. Big data traffic identification and processing flow
Normal traffic on a network is usually uniform, continuous, and protocol compliant, whereas abnormal traffic such as DoS attack traffic is irregular, bursty, high in volume, and in violation of the protocol specification. The DoS attack [21] is a highly destructive kind of attack that threatens the Internet and is well concealed: DoS attack traffic closely resembles typical normal traffic and is not easy to detect. The overall performance of the big data traffic identification model is therefore evaluated using DoS attacks as the example of abnormal traffic. The experiments use the intrusion detection evaluation dataset CICIDS2017 from the Canadian Institute for Cybersecurity (CIC) and the public intrusion detection dataset UNSW-NB15 created by the laboratory of the Australian Centre for Cyber Security (ACCS). The CICIDS2017 and UNSW-NB15 datasets contain 2563416 and 174852 big data traffic instances, 75 and 45 features, and 14 and 11 categories, respectively. Both datasets reflect the characteristics of contemporary network intrusion detection traffic, and the DoS traffic data from the two datasets is used as the research instance in the overall performance analysis of the traffic identification model. The entire data is used in the step-by-step independent validation experiments.
The CICIDS2017 dataset is the largest dataset of its type available online and covers the most important characteristics of a valid IDS dataset, i.e., anonymity, attack diversity, complete capture, complete interaction, complete network configuration, available protocols, complete traffic, feature set, metadata, heterogeneity, and labeling. It also contains necessary and up-to-date attack examples such as port scanning, botnets, SQL injection, and distributed denial of service (DDoS). Earlier publicly available datasets suffered from limited traffic diversity and volume, anonymized packet information, and missing payloads. This dataset, in contrast, spans eight files with 79 characteristics per flow, including destination port, flow duration, and SYN flag count. The traffic types in the CICIDS2017 and UNSW-NB15 datasets are described in Table 1. The two datasets contain six and five types of traffic, respectively; the ratio of training set to test set is 7:3 for both, and normal traffic is the largest class in both.
Table 1. Traffic type statistics of the test datasets
| Dataset | Tag | Flow type | Training set | Test set |
|---|---|---|---|---|
| CICIDS2017 | 0 | Normal | 3256245 | 1395534 |
| CICIDS2017 | 1 | DoS GoldenEye | 7524 | 3225 |
| CICIDS2017 | 2 | DoS Hulk | 152634 | 65415 |
| CICIDS2017 | 3 | DoS SlowHTTPTest | 3526 | 1511 |
| CICIDS2017 | 4 | DoS SlowLoris | 4528 | 1941 |
| CICIDS2017 | 5 | Heartbleed | 15 | 6 |
| UNSW-NB15 | 0 | Normal | 54222 | 23238 |
| UNSW-NB15 | 1 | DoS GoldenEye | 5963 | 2556 |
| UNSW-NB15 | 2 | DoS Hulk | 6852 | 2937 |
| UNSW-NB15 | 3 | DoS SlowHTTPTest | 7724 | 3310 |
| UNSW-NB15 | 4 | DoS SlowLoris | 1524 | 653 |
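A 7:3 split that preserves the class proportions can be sketched as a per-class (stratified) split. This is an illustrative assumption about how such a split might be produced; the function name `stratified_split`, the seed, and the rounding rule are not taken from the paper.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(labels: Sequence[Hashable], train_ratio: float = 0.7,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split sample indices per class so each class keeps the train/test ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)          # fixed seed for a repeatable split
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(len(idxs) * train_ratio)
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return train, test


labels = ["Normal"] * 100 + ["Heartbleed"] * 21
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With 21 samples in a class, rounding at a ratio of 0.7 gives 15 training and 6 test samples, which is consistent with the Heartbleed row of Table 1.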
The traffic classification confusion matrices of the proposed big data traffic feature recognition and processing model on the two test datasets are shown in Fig. 3, where (a) and (b) are the confusion matrices for the CICIDS2017 and UNSW-NB15 datasets, respectively. The diagonal entries show that the model classifies all types of big data traffic in both datasets with high accuracy, except for the malicious Heartbleed traffic in the CICIDS2017 dataset, whose classification accuracy is lower (88.74%). The four types of malicious traffic DoS GoldenEye, DoS Hulk, DoS SlowHTTPTest, and DoS SlowLoris are identified with more than 95% accuracy, probably because these attacks have strong temporal correlation, which allows the model to extract their features well. Normal traffic is recognized with an accuracy of 99.68% and 97.84% on the two datasets, respectively. This indicates that these classes of traffic have high within-class consistency and that the model fully learns their features. The Heartbleed class has very little training data, so its recognition accuracy is lower.

Figure 3. Classification confusion matrices of the model on the test sets
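Reading per-class accuracy off the diagonal of a confusion matrix can be sketched as follows. The matrix orientation (rows are true classes, columns are predicted classes) is an assumption, and the numbers are invented for illustration, not taken from Fig. 3.

```python
from typing import List, Sequence


def per_class_accuracy(cm: Sequence[Sequence[int]]) -> List[float]:
    """Diagonal entry divided by the row sum, for each true class."""
    result = []
    for i, row in enumerate(cm):
        total = sum(row)
        result.append(row[i] / total if total else 0.0)
    return result


# Hypothetical 3-class matrix: rows = true class, columns = predicted class.
cm = [
    [95, 3, 2],
    [4, 90, 6],
    [1, 1, 6],   # a rare class: a few errors already cost many points
]
print(per_class_accuracy(cm))
```

The rare third class illustrates the effect described for Heartbleed: with only a handful of samples, each misclassification lowers the per-class accuracy sharply.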
Performance evaluation indices
To verify the accuracy and generalization ability of the proposed big data traffic recognition method, a series of experiments is conducted. In terms of hardware, the experiments use machines equipped with E5-2678 CPUs and NVIDIA TITAN RTX GPUs. In terms of software, version 3.8.16 of the Python interpreter is used together with several key Python libraries, including but not limited to numpy, pandas, scapy, scikit-learn, and torch, for data processing and machine-learning-based big data traffic identification. Stating this hardware and software configuration supports the reproducibility of the experiments. Model performance is evaluated with Precision, Recall, Accuracy, and F1-Score, which together give a comprehensive assessment of the model. In these four metrics, Positive refers to anomalous big data traffic samples and Negative refers to normal traffic samples. Accordingly, TP and TN denote the numbers of correctly predicted positive and negative samples, and FP and FN denote the numbers of samples wrongly predicted as positive and as negative. Precision measures the proportion of samples predicted as Positive whose true label is also Positive. The calculation is shown in (10):
$$\text{Precision} = \frac{TP}{TP + FP} \tag{10}$$
Recall measures the proportion of samples whose true label is Positive that are also predicted as Positive. The calculation is shown in (11):
$$\text{Recall} = \frac{TP}{TP + FN} \tag{11}$$
Accuracy measures the proportion of correct predictions across all samples. It is calculated as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$
The F1-Score is the harmonic mean of precision and recall and is suitable for evaluating how evenly the model identifies classes under an unbalanced category distribution. It is calculated as in (13):
$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{13}$$
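The four metrics can be computed directly from the confusion counts. The sketch below uses the standard definitions with anomalous traffic as the positive class; the function name `classification_metrics` and the example counts are hypothetical.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, recall, accuracy, and F1 from binary confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts: 110 anomalous samples, 890 normal samples.
m = classification_metrics(tp=90, fp=10, tn=880, fn=20)
print({name: round(value, 4) for name, value in m.items()})
```

In this toy case accuracy (0.97) is much higher than F1 (about 0.857), which shows why F1 is the more informative metric when normal traffic dominates the dataset.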
Analysis of results
BiAE-KNN, BiAE-MLP, BiAE-RF, GBDT, and AdaBoost are selected as comparison models. The big data traffic recognition model is first trained on the training sets of the CICIDS2017 and UNSW-NB15 datasets and then validated on their test sets. The performance evaluation results on the CICIDS2017 dataset are shown in Table 2, and those on the UNSW-NB15 dataset are shown in Table 3. The experimental results show that the proposed big data traffic feature recognition and processing model performs well on all performance indicators. Its F1-Score on the two datasets (95.63% and 96.35%) exceeds 95%, a clear improvement over the comparison models. On the CICIDS2017 dataset, precision and recall are improved by 7.04% and 5.04%, respectively, compared with the BiAE-KNN model, and on the UNSW-NB15 dataset, accuracy and F1-Score are improved by 4.52% and 2.89%, respectively, compared with the AdaBoost model.
Table 2. Experimental results of different models on CICIDS2017
| Model | Precision (%) | Recall (%) | Accuracy (%) | F1-Score (%) |
|---|---|---|---|---|
| BiAE-KNN | 90.01 | 92.84 | 94.78 | 91.21 |
| BiAE-MLP | 91.39 | 89.19 | 93.88 | 91.85 |
| BiAE-RF | 90.47 | 91.16 | 92.1 | 93.34 |
| GBDT | 91.19 | 92.27 | 91.29 | 93.56 |
| AdaBoost | 90.15 | 90.93 | 92.02 | 90.83 |
Table 3. Experimental results of different models on UNSW-NB15
| Model | Precision (%) | Recall (%) | Accuracy (%) | F1-Score (%) |
|---|---|---|---|---|
| BiAE-KNN | 92.06 | 92.17 | 92.54 | 93.65 |
| BiAE-MLP | 93.78 | 93.78 | 92.09 | 93.87 |
| BiAE-RF | 92.78 | 93.94 | 92.53 | 93.4 |
| GBDT | 93.07 | 93.57 | 92.93 | 92.09 |
| AdaBoost | 92.15 | 92.72 | 93.07 | 93.64 |
The previous section validated the effectiveness of the model experimentally by comparing its recognition performance on public datasets. This section builds a system in the OPNET simulation environment and applies the proposed traffic detection model to it, in order to reproduce and analyze the behavior and performance of traffic attack and defense under OPNET and to validate the reliability of the model in realistic network attack and defense scenarios. The network nodes in the simulation environment are managed from a virtual machine by connecting the OpenDayLight controller deployed in the virtual machine to the OpenFlow switch nodes in OPNET. The discrete event simulation (DES) engine used by OPNET simulates network events in an event-driven manner, and the controller, which also uses the DES engine, can control events by clock so that the controller stays consistent with the progress of the simulation scenario. The simulated attack scenario mainly comprises basic network nodes (OpenFlow switch nodes, SDN-Sitl interfaces, and external controllers) and network attack nodes (hacker nodes and infected machines). Modeling follows a three-layer mechanism covering the node layer, the process layer, and the network layer. The deployment of the defense model in the virtual machine is outside the modeling scope of this paper.
Five kinds of DDoS attacks are set up in the simulation test, with attack intensity increasing gradually from I to V. Using the IMPORT TRAFFIC FLOW module in OPNET and the northbound interface of the SDN controller, 1263 DDoS attack traffic samples and 12,452 normal traffic samples are collected alternately over a total of 20 simulation tests. The category label of normal traffic samples is set to C0, and the attack traffic categories are set to C1, C2, C3, C4, and C5 for intensities I to V, respectively. The detection results for the traffic data in the simulation experiments are shown in Figure 4. For normal data traffic, the detection accuracy reaches up to 99.99%, with an average of 97.54%. For abnormal attack traffic, the detection accuracy decreases with increasing intensity, from C1 (99.09%) through C2 (97.02%), C3 (94.94%), and C4 (92.20%) to C5 (90.03%); the highest value observed is 99.86%, and the overall accuracy of abnormal traffic detection is 94.66%. Most of the data traffic in the simulation test is thus recognized effectively, which verifies the effectiveness of the proposed traffic recognition model in application.

Figure 4. Online detection accuracy
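The reported overall accuracy of 94.66% for abnormal traffic is consistent with an unweighted mean of the five per-intensity accuracies. Whether the paper computes the figure in exactly this way is an assumption; the short check below only reproduces the arithmetic.

```python
# Per-intensity detection accuracies (%) for attack classes C1 to C5, as reported.
attack_accuracy = {"C1": 99.09, "C2": 97.02, "C3": 94.94, "C4": 92.20, "C5": 90.03}

# Unweighted mean over the five attack classes.
mean_accuracy = sum(attack_accuracy.values()) / len(attack_accuracy)
print(f"{mean_accuracy:.2f}")  # 94.66

# Drop from the weakest (I) to the strongest (V) attack intensity.
drop = attack_accuracy["C1"] - attack_accuracy["C5"]
print(f"{drop:.2f}")  # 9.06
```

The gap of about nine percentage points between C1 and C5 quantifies the decreasing trend described in the text.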
This section verifies the performance improvement that asynchronous model updating brings to the system. Novel attacks that did not appear in model training are replayed into the anomaly detection system to test the detection accuracy of the model for new attacks before and after the update. The results are shown in Table 4: the updated model improves the overall data traffic detection accuracy by 2.69%. By category, the detection accuracy for newly injected attack types rises from 91.00% to 97.16%, and the detection accuracy for normal traffic improves by 0.71%. The results also show that new attack types are already detected with 91.00% accuracy before the model update, which indicates that the proposed traffic recognition model has a certain degree of generalization and can detect unknown attacks.
Table 4. Detection accuracy before and after the model update (%)
| Test No. | Normal flow (before) | Normal flow (after) | New attack (before) | New attack (after) | Attack flow (before) | Attack flow (after) |
|---|---|---|---|---|---|---|
| 1 | 98.15 | 98.88 | 91.21 | 97.38 | 97.23 | 99.71 |
| 2 | 97.89 | 98.45 | 90.66 | 97.86 | 97.61 | 99.14 |
| 3 | 97.5 | 98.14 | 90.2 | 97.78 | 98.03 | 98.53 |
| 4 | 97.57 | 98 | 90.23 | 97.87 | 98.5 | 98.49 |
| 5 | 97.71 | 98.46 | 91.94 | 96.92 | 97.52 | 98.96 |
| 6 | 98.26 | 98.94 | 90.69 | 96.31 | 97.85 | 98.04 |
| 7 | 98.35 | 98.43 | 90.69 | 96.28 | 97.02 | 98.13 |
| 8 | 97.58 | 98.1 | 91.83 | 97.17 | 98.03 | 98.34 |
| 9 | 97.98 | 98.46 | 90.79 | 98 | 98.25 | 99 |
| 10 | 98.09 | 98.49 | 90.44 | 96.43 | 97.72 | 98.97 |
| 11 | 98.07 | 98.01 | 90.71 | 97.56 | 97.57 | 99.96 |
| 12 | 97.31 | 98.29 | 90.19 | 96.36 | 98.17 | 98.16 |
| 13 | 98.1 | 98.25 | 91.85 | 97.97 | 97.73 | 99.31 |
| 14 | 97.9 | 98.92 | 91.61 | 97.07 | 98.27 | 98.78 |
| 15 | 97.2 | 98.92 | 91.59 | 96.77 | 97.94 | 99.48 |
| 16 | 98.17 | 98.25 | 91.28 | 97.89 | 97.58 | 99.71 |
| 17 | 97.16 | 98.94 | 90.16 | 96.95 | 97.02 | 98.65 |
| 18 | 97.67 | 98.95 | 91.92 | 96.62 | 97.5 | 99.55 |
| 19 | 97.82 | 98.69 | 90.56 | 96.14 | 97.87 | 98.91 |
| 20 | 97.36 | 98.47 | 91.54 | 97.86 | 97.11 | 98.73 |
| Mean value | 97.79 | 98.50 | 91.00 | 97.16 | 97.73 | 98.93 |
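The 2.69% overall improvement quoted in the text is consistent with the unweighted mean of the three per-category gains in the mean-value row of the table. That the paper derives the figure this way is an assumption; the sketch only reproduces the arithmetic from the reported means.

```python
# Mean detection accuracy (%) before and after the update, from the mean-value row.
mean_before = {"normal": 97.79, "new_attack": 91.00, "attack": 97.73}
mean_after = {"normal": 98.50, "new_attack": 97.16, "attack": 98.93}

# Per-category gain in percentage points.
gains = {name: mean_after[name] - mean_before[name] for name in mean_before}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")

# Unweighted mean of the three gains.
overall_gain = sum(gains.values()) / len(gains)
print(f"overall: +{overall_gain:.2f}")  # overall: +2.69
```

The per-category gains are 0.71, 6.16, and 1.20 percentage points, so almost all of the improvement comes from the newly injected attack types.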
This paper builds a big data traffic analysis model on the stacked autoencoder (SAE) neural network and the twin neural network from machine learning, and uses it to realize feature extraction and recognition in big data traffic analysis. Performance evaluation experiments show that the model performs markedly better than the comparison models on the CICIDS2017 and UNSW-NB15 datasets; on CICIDS2017, precision and recall are improved by 7.04% and 5.04%, respectively, compared with the BiAE-KNN model. The simulation platform tests show that, with the proposed model, the detection accuracy for normal data traffic reaches up to 99.99%. Although the detection accuracy for high-intensity malicious traffic is slightly lower, the average accuracy of abnormal traffic detection still reaches 94.66%. In addition, the traffic identification model has a certain degree of generalization and can detect unknown attacks. The proposed machine-learning-based big data traffic detection model effectively improves the identification of anomalous traffic and provides a useful reference for maintaining network security.
