Research on Data Compression and Efficient Transmission Technology in the Framework of Big Data Processing
Published online: 17 Mar 2025
Received: 20 Oct 2024
Accepted: 13 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0339
Keywords
© 2025 Shuang Chen et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
In recent years, with the development of the Internet and the mobile Internet and the rise of social networks, the volume of data generated globally has been growing at a rate of about 40% per year. The storage of massive data, and the analysis built on it, will be one of the main driving forces behind future gains in productivity, innovation, and consumer demand [1-3]. Big data technologies can improve the allocation and coordination of human and material resources, reduce waste, increase transparency, and facilitate the generation of new ideas and insights [4-5].
Due to the 3V nature of big data, traditional data warehouse solutions that scale by stacking expensive hardware are increasingly unable to meet the demands of big data platforms. Instead, big data platforms built on open-source frameworks are becoming the industry's preferred solution [6-8]. In addition, massive data storage consumes a large amount of storage space, and because massive data grows rapidly, simply increasing storage capacity does not fundamentally solve the problem [9-10]. First, the procurement budget for storage equipment keeps rising, and most enterprises can hardly afford such a huge expenditure. Second, as the big data platform expands, management cost, floor space, cooling capacity, and energy consumption all rise, with energy consumption being particularly prominent [11-12].
Therefore, how to curb and manage the sprawling growth of massive data is another challenge to be faced. Data compression technology in the big data environment can effectively reduce the storage capacity required for massive data, slow its growth, and optimize the system expansion requirements of big data platforms while still meeting the required speed of data processing and analysis [13-14].
On the other hand, big data, with its massive volume, high velocity, diversity, and low value density, has become a key factor driving social progress and technological innovation. From user behavior analysis in social media to machine data monitoring in the industrial IoT, the application scenarios of big data are becoming increasingly rich, which places unprecedented demands on data processing and analysis capabilities. However, the realization of the value of data depends heavily on its effective and efficient transmission, which poses a great challenge to existing network transmission capacity. It is therefore particularly important to explore network technologies and optimization strategies that can effectively respond to the demands of big data transmission [15-17].
Cutting-edge research on data compression is broad, covering the evaluation of new data compression (DC) techniques as well as their practice and optimization; the main directions include data-driven image coding methods, the evaluation of DC techniques, and the use of machine learning algorithms. Literature [18] observes that traditional compression methods cannot exploit image homogeneity and similarity, introduces soft compression as a new data-driven image coding approach with excellent performance and characteristics such as moving from hard to soft and from pixel to shape, and carries out a comprehensive and practical analysis of soft compression. Literature [19] notes that scene production and other functions based on virtual reality technology require a large amount of data transmission that current transmission technology cannot satisfy, introduces the concept and development trend of content-chain efficient compression technology, and reviews the research process of open standards for compact representation of 3D point clouds adapted to it. Literature [20] describes how data compression technology effectively reduces data size during storage and transport, and analyzes current DC techniques in terms of data quality, coding schemes, and other aspects. Literature [21] proposed a lossy compression algorithm for large-scale scientific data based on error control, which improves data point prediction accuracy, introduced an adaptive error-controlled quantization encoder that improves it further, and confirmed through comparative experiments that the proposed compressor outperforms others of the same category. Literature [22] conceived an energy-efficient IoT data collection and analysis strategy that compresses data and reconstructs it at edge nodes before processing it with supervised machine learning algorithms; examples corroborated the feasibility of the method, with data transmission reduced by a factor of 103 while data quality was preserved. Literature [23] examined the discrete cosine transform (DCT) as a potential data compression technique that facilitates the recovery of undersampled vibration signals from structural systems, thereby easing the burden of analyzing large-volume data for structural health monitoring, and verified through numerical evaluations that the DCT is a well-performing data compression tool. Literature [24] notes that the development of medical technology makes medical data collection and processing urgently require high-performance data mining models, analyzes existing computational compression methods and the performance of medical data under them, and finally gives a comprehensive summary of the classification and performance of medical image compression.
Scholars have also studied the optimization of data transmission, along paths that include transmission efficiency, data loss prevention, and transmission security, with particular focus on protocol optimization for EAACK MANETs and the construction of data transmission frameworks. For example, Literature [25] conceived an energy-efficient two-tier data transmission scheme and evaluated it with the OMNeT++ network simulator, confirming that the proposed scheme outperforms other methods in reducing data loss rate and energy consumption. Literature [26] envisioned a protocol (FDCRP) that merges a three-layer framework into an EAACK MANET to ensure transmission efficiency and incorporates a three-layer filtering scheme to solve the problem of routing packets to their destinations; overall, the proposed protocol effectively improves network performance. Literature [27] builds a general DTSN structure for remote monitoring of railroad infrastructure adapted to C-ITS, designs a propagation framework to correct the deviation between model-predicted and actual data, and proposes a firefly algorithm to further improve data transmission efficiency; simulation experiments confirm that the simulation results of the proposed model match the predicted results. Literature [28] aims to create an intelligent media data transmission network, studied related open-source tools including a CC analyzer and simulator, built a preliminary network topology, and reported no duplicate packets in the experimental results, providing an important reference for media packet transmission. Literature [29] proposed a framework with discrete event simulation as the underlying logic and attempted to optimize medical data transmission through e-medical setups; a comparison of the traditional and CBEDE methods indicates that CBEDE has great application potential in medical data transmission. Literature [30] designed an energy- and service-level-agreement-efficient cyber-physical system for e-medical data transmission incorporating an improved ad hoc on-demand distance vector (AODV) protocol to ensure the security of the data transmission process.
This paper focuses on an improved LZW data compression algorithm to achieve efficient, high-quality data compression. Building on a study of existing compression algorithms, experiments are conducted on static, trigger, process, and network transmission data to explore the effectiveness of the improved LZW algorithm. On this basis, this paper designs an efficient transmission system for Internet big data comprising perception nodes, aggregation nodes, a sensing transmission management platform, and other components. A big data fusion algorithm combining gray correlation analysis and a BP neural network is used to achieve efficient big data transmission. Finally, the superiority of the algorithm is discussed through simulation experiments.
With the improvement of information technology, the amount of data people need to handle keeps growing. New applications represented by big data can mine large amounts of high-value information, which is of great significance to economic and social development. As data storage has grown in both time and space, data must be compressed sensibly in order to optimize storage capacity and improve storage efficiency, saving storage space and reducing the load on the processor and other hardware. Data compression removes redundant information on the one hand and, on the other, reduces the amount of data that must be retained to a minimum while guaranteeing data quality, thereby improving the efficiency and quality of data use. Data compression essentially represents the original data with as little coded information as possible, which is particularly important in today's information age.
Lossy compression means that the original data cannot be fully recovered from the compressed data; the result is an approximation with some distortion. Such coding is widely used in image transmission systems and video entertainment devices. Commonly used lossy compression techniques include discrete cosine coding, predictive coding, and the wavelet transform, with relatively high compression ratios, typically from 2:1 to 1000:1. Lossless compression means that the original data can be completely recovered from the compressed data; the basic idea is to represent low-probability input symbols with long codewords and high-probability input symbols with short codewords. Lossless compression is the method most widely used for medical and biological research data, aerospace survey data, and similar data types, and is usually used for database storage and file processing. Commonly used lossless techniques are Huffman coding, arithmetic coding, and the LZ family of codes.
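To make the "short codewords for frequent symbols" idea concrete, the following is a minimal Huffman coding sketch in Python; the input string is invented for the example, and real codecs add bit packing and canonical code handling.

```python
import heapq
from collections import Counter

def huffman_code(data: str) -> dict:
    """Build a Huffman code table: frequent symbols get short codewords."""
    freq = Counter(data)
    # Each heap entry: (subtree frequency, tie-breaker, {symbol: codeword}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate case: one distinct symbol
        return {s: "0" for s in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        # Prefix the codewords of the two merged subtrees with 0 / 1.
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

if __name__ == "__main__":
    print(huffman_code("aaaaabbbcc"))   # 'a' (most frequent) gets the shortest codeword
```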
The LZW algorithm is a dictionary coding method for lossless data compression. It exploits the repetitive patterns contained in the data itself: the code of an already-stored string is used in place of each later occurrence of the same string, thereby compressing the data. The dictionary used by LZW is created dynamically from the original file as it is read; moreover, the dictionary does not need to be transmitted for decompression, since the decoder can rebuild it on its own.
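As an illustration of this dictionary coding idea, here is a minimal byte-oriented LZW compression sketch in Python; it is a simplified teaching version, not the paper's improved algorithm, and real implementations add code-width management and dictionary reset policies.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Minimal LZW: replace repeated strings with the code of a dictionary entry."""
    # Initialize the dictionary with all single-byte strings (codes 0..255).
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    w = b""
    out = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc                      # keep extending the current match
        else:
            out.append(dictionary[w])   # emit the code of the longest match
            dictionary[wc] = next_code  # add the new string to the dictionary
            next_code += 1
            w = bytes([byte])
    if w:
        out.append(dictionary[w])
    return out
```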
Let the source symbols be $s_1, s_2, \ldots, s_q$ with probabilities $p_1, p_2, \ldots, p_q$, and let the codeword assigned to symbol $s_i$ have length $l_i$. The average code length is $\bar{L} = \sum_{i=1}^{q} p_i l_i$. Expanding the ceiling (upward rounding) of the symbol code lengths yields equation (4), and substituting into Eq. (6) and Eq. (7) bounds the average code length by the source entropy. Assuming the distribution probabilities of the stationary memoryless source symbols are given, taking the logarithm of both sides, ignoring the shorter segment types, and counting the total number of segments shows that the code rate of the dictionary code approaches the source entropy as the source sequence approaches infinity.
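The derivation above follows the standard variable-length coding argument. For reference, a textbook sketch of the bound it builds toward is given below; the notation here is generic and may differ from the original equations.

```latex
% Standard source-coding bound (textbook form; notation is illustrative).
% Source symbols s_1,...,s_q with probabilities p_1,...,p_q and codeword lengths l_i.
\begin{align}
  \bar{L} &= \sum_{i=1}^{q} p_i \, l_i
      && \text{average code length} \\
  l_i &= \lceil -\log_2 p_i \rceil
      \;\Rightarrow\; -\log_2 p_i \le l_i < -\log_2 p_i + 1
      && \text{expanding the ceiling} \\
  H(S) &\le \bar{L} < H(S) + 1
      && \text{bound on the average code length}
\end{align}
```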
When decompressing, the LZW algorithm loops through the codes, outputting the string corresponding to each code from the dictionary while adding a new entry to the table. In the standard formulation the loop maintains four working variables: the previous code, the current code, the string decoded for the current code, and its first character.
The LZW algorithm does not need to transmit the dictionary during transmission, and its decompression process is similar to the compression process, which generates the dictionary while decoding to avoid the waste of space.
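A matching decompression sketch in Python is shown below, illustrating how the decoder rebuilds the dictionary on the fly; the special "code not yet in the dictionary" case is handled with the usual w + w[0] rule. This is a generic sketch, not the paper's improved variant.

```python
def lzw_decompress(codes: list[int]) -> bytes:
    """Minimal LZW decoder: rebuilds the dictionary while decoding."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    prev = dictionary[codes[0]]
    out = bytearray(prev)
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                          # code was just created by the encoder
            entry = prev + prev[:1]
        out += entry
        # Add the new string: previous match + first character of current entry.
        dictionary[next_code] = prev + entry[:1]
        next_code += 1
        prev = entry
    return bytes(out)

# Round trip with the compressor above:
# lzw_decompress(lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT")) == b"TOBEORNOTTOBEORTOBEORNOT"
```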
This section systematically analyzes the LZW compression algorithm, addresses the reduction in compression ratio caused by the dictionary storage structure and the dictionary updating method, establishes a dictionary decay index model, analyzes and optimizes the influence of the initial value and decay rate in the model, and summarizes the advantages of this paper's model through comparative simulation experiments on different types of files. The improved LZW algorithm is implemented by integrating FSE, which is based on the ANS algorithm. FSE exploits the ability of ANS to keep all coding information in a single number, encodes output one character at a time, saves memory space, reduces time complexity, and improves the compression ratio.
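FSE belongs to the ANS (asymmetric numeral systems) family. The sketch below is not FSE itself but a minimal rANS variant in Python, shown only to illustrate how ANS keeps the entire coding state in a single number; the symbol model is invented for the example and must sum to a power of two, and Python big integers are used so that no renormalization or byte stream is needed.

```python
PROB_BITS = 4                 # total frequency = 2**PROB_BITS = 16
TOTAL = 1 << PROB_BITS

# Illustrative symbol model: frequencies must sum to TOTAL.
FREQ = {"a": 8, "b": 4, "c": 4}
CDF = {"a": 0, "b": 8, "c": 12}   # cumulative starts

def rans_encode(message: str) -> int:
    """Encode the whole message into a single integer state."""
    x = 1
    for s in reversed(message):           # encode in reverse so decoding is forward
        f, c = FREQ[s], CDF[s]
        x = (x // f) * TOTAL + (x % f) + c
    return x

def rans_decode(x: int, n: int) -> str:
    """Decode n symbols from the state."""
    out = []
    for _ in range(n):
        slot = x % TOTAL
        # Find the symbol whose [CDF, CDF+FREQ) interval contains the slot.
        s = next(sym for sym in FREQ if CDF[sym] <= slot < CDF[sym] + FREQ[sym])
        out.append(s)
        x = FREQ[s] * (x // TOTAL) + slot - CDF[s]
    return "".join(out)

if __name__ == "__main__":
    msg = "abacab"
    state = rans_encode(msg)
    assert rans_decode(state, len(msg)) == msg
```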
At present, the Internet usually introduces compressed sensing methods to improve transmission efficiency when transmitting big data. In this paper, we design an efficient transmission system for Internet big data consisting of perception nodes, aggregation nodes, relay nodes, a perception transmission management platform, and a cloud processing platform. There are multiple perception nodes, which automatically network to form a self-organizing perception-node network. The aggregation node is located in the coverage area of this self-organizing network, and the relay nodes are situated in the area covering the aggregation nodes. The sensing transmission management platform is connected to the relay nodes, and the cloud processing platform is linked to the sensing transmission management platform. The system block diagram is depicted in Fig. 1. The whole system contains multiple aggregation nodes and relay nodes carrying a huge amount of data. In the figure: 1 is the perception node, 2 is the aggregation node, 3 is the relay node, 4 is the perception transmission management platform, 5 is the cloud processing platform, 6 is the coverage area of the self-organized perception-node network, and 7 is the coverage area of the aggregation node.

System frame
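To make the hierarchy in Fig. 1 concrete, here is a small, purely illustrative Python sketch of the node structure described above; the class and field names are our own and are not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class PerceptionNode:
    node_id: int
    position: tuple[float, float]      # location inside the self-organizing network

@dataclass
class AggregationNode:
    node_id: int
    members: list[PerceptionNode] = field(default_factory=list)   # nodes it covers

@dataclass
class RelayNode:
    node_id: int
    aggregators: list[AggregationNode] = field(default_factory=list)

@dataclass
class TransmissionSystem:
    """Perception nodes -> aggregation nodes -> relay nodes -> management platform -> cloud."""
    relays: list[RelayNode]
    management_platform: str = "sensing-transmission-management"
    cloud_platform: str = "cloud-processing"
```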
Data convergence technology is a crucial technology for the efficient transmission of Internet data. The current Internet has problems such as high energy consumption of network models, short network survival periods, narrow scope of application, weak convergence, and poor practical applicability. Data correlation analysis fusion algorithms can eliminate redundant data and prolong the survival period of networks.
Multiple types of big data on a node often have correlated physical attributes and influence one another. To make better use of these correlations, the correlation between data must be measured. The gray correlation method measures the correlation between big data attributes according to the similarity and dissimilarity of data trends, and determines the strength of the correlation by analyzing how the data curves change. The method is usually used to analyze the degree to which a main target is influenced by other factors; in this section it is used to determine whether the data exhibit strong physical attribute correlation. Its core is to take the main target as the parent (reference) sequence, take the correlation factors to be analyzed as the sub (comparison) sequences, and then solve for the strength of the correlation between the sequences. The procedure for gray correlation analysis is as follows:
1) First, take the fused output big data and the other associated big data as the reference sequence and the comparison sequences, respectively: the reference sequence is $X_0 = \{x_0(1), x_0(2), \ldots, x_0(n)\}$ and the comparison sequences are $X_i = \{x_i(1), x_i(2), \ldots, x_i(n)\}$, $i = 1, 2, \ldots, m$, where $n$ is the length of the sequences.
2) Because the collected Internet big data have different attributes, their meanings are inconsistent and they cannot be quantified on a common scale, which hinders uniform comparison. To facilitate comparison, the perceived data must first be made dimensionless.
3) Calculate the absolute difference between each comparison sequence and the reference sequence, traverse the sequences, and take the maximum and minimum values of the absolute difference, as shown in equation (15): $\Delta_i(k) = |x_0(k) - x_i(k)|$, $\Delta_{\max} = \max_i \max_k \Delta_i(k)$, $\Delta_{\min} = \min_i \min_k \Delta_i(k)$. Based on the maximum and minimum absolute differences, the gray relational coefficient is obtained as shown in equation (16): $\xi_i(k) = \dfrac{\Delta_{\min} + \rho\,\Delta_{\max}}{\Delta_i(k) + \rho\,\Delta_{\max}}$, where $\rho \in (0,1)$ is the distinguishing coefficient.
4) Calculate the gray correlation between big data from the coefficients obtained in equation (16), as shown in equation (17): $r_i = \dfrac{1}{n}\sum_{k=1}^{n} \xi_i(k)$.
By analyzing the collected big data, the series with strong correlation are selected to ensure that the data used for neural network training have highly correlated physical attributes. A gray correlation below 0.7 indicates that the physical attributes of the two series are poorly correlated and that the data are not suitable for model training.
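A compact Python sketch of this gray correlation screening is shown below. The distinguishing coefficient ρ = 0.5 is the common convention (an assumption, not stated in the paper), the 0.7 threshold follows the text, the max/min differences are computed per pair of sequences for simplicity, and the function names are our own.

```python
import numpy as np

def grey_relational_grade(reference: np.ndarray, comparison: np.ndarray,
                          rho: float = 0.5) -> float:
    """Gray correlation between a reference sequence and one comparison sequence."""
    # 1) Dimensionless processing (here: divide each sequence by its mean).
    ref = reference / reference.mean()
    cmp_ = comparison / comparison.mean()
    # 2) Absolute differences and their max/min, cf. equation (15).
    delta = np.abs(ref - cmp_)
    d_max, d_min = delta.max(), delta.min()
    # 3) Gray relational coefficients, cf. equation (16).
    xi = (d_min + rho * d_max) / (delta + rho * d_max)
    # 4) Gray relational grade = mean coefficient, cf. equation (17).
    return float(xi.mean())

def filter_correlated(reference: np.ndarray, candidates: dict[str, np.ndarray],
                      threshold: float = 0.7) -> dict[str, float]:
    """Keep only series whose gray correlation with the reference exceeds the threshold."""
    grades = {name: grey_relational_grade(reference, x) for name, x in candidates.items()}
    return {name: g for name, g in grades.items() if g >= threshold}
```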
1) Training samples: The accuracy of BP neural network fusion results depends on the quality of the training dataset. In the experiments in this chapter, the three most common quantities in the Internet domain, namely temperature, humidity, and light intensity, are selected as training samples; these three kinds of big data are strongly physically correlated, and gray correlation analysis confirms their high correlation.
2) Input-output layer design. The gray correlation method is used to analyze the physical attribute correlation between big data, and the data with strong correlation are selected as the input layer of the BP neural network model; the big data whose accuracy needs to be improved serves as the output, so the output layer has one node. The number of input nodes should be set dynamically according to the actual operating situation, based on the correlation of the surrounding data and the type of input big data. At the same time, the resource limitations of the terminal itself must be considered: too many input nodes increase the resource consumption of the neural network, making it impossible to run on Internet terminals with limited resources, so the model overhead must be taken into account.
3) Hidden layer design. Experiments have shown that one or two hidden layers are sufficient to solve any nonlinear problem. Too many hidden layers increase the complexity and training time of the neural network as well as the resource consumption of Internet terminals, while too few hidden layers produce fusion results that do not meet the demand. The number of hidden layer neurons should not exceed twice the number of input layer neurons; a suitable hidden layer keeps the training time reasonable while reducing the model's resource overhead, so a single hidden layer is chosen. The number of hidden layer neurons is usually determined from empirical formulas combined with actual model testing, and the specific number of hidden nodes is given in the parameter settings of the simulation experiments.
4) Selection of activation function.
The ReLU function takes the maximum of its input and zero, so it is fast to compute: it only needs to determine whether the input is greater than 0. The expression of the function is given in equation (18): $f(x) = \max(0, x)$.
The Sigmoid function is a commonly used nonlinear activation function; it maps its input to an output between 0 and 1. Its expression is given in equation (19): $f(x) = \dfrac{1}{1 + e^{-x}}$.
In the plots of these functions, the horizontal coordinate represents the input to the input or hidden layer of the neural network, and the vertical coordinate represents the output of the hidden or output layer. The Sigmoid function was chosen because Internet data exhibit nonlinear correlation characteristics.
5) Selection of learning rate.
In neural network model training, the learning rate has a significant impact on the accuracy of the fusion results. If the learning rate is set too high, training oscillates and may fail to converge; if it is set too low, convergence is slow. Both have a negative impact on model training. Therefore, the learning rate is usually set as a constant and then adjusted through repeated training until the model achieves satisfactory results.
6) Neural network initial weights selection
BP neural network training requires initial weights to be selected, but there is no unified theoretical guidance for this choice, so an empirical method is usually used. A common approach ties the choice to the activation function; since the Sigmoid function is used here, the initial weights can be drawn as random numbers in the interval [-1, 1]. A minimal sketch of the resulting fusion network is given below.
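Putting the above design choices together (correlated inputs, one hidden layer, Sigmoid activation, a constant learning rate, and random initial weights in [-1, 1]), a minimal NumPy sketch of such a BP fusion network might look as follows. The layer sizes and training loop are illustrative, not the paper's exact configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BPFusionNet:
    """One-hidden-layer BP network: correlated big data in, one fused value out."""
    def __init__(self, n_in: int, n_hidden: int, lr: float = 0.1, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Initial weights drawn uniformly from [-1, 1], as discussed above.
        self.w1 = rng.uniform(-1, 1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.uniform(-1, 1, (n_hidden, 1))
        self.b2 = np.zeros(1)
        self.lr = lr

    def forward(self, x):
        self.h = sigmoid(x @ self.w1 + self.b1)        # hidden layer (Sigmoid)
        self.y = sigmoid(self.h @ self.w2 + self.b2)   # output layer (Sigmoid)
        return self.y

    def train_step(self, x, target):
        y = self.forward(x)
        # Backpropagation of the squared error, standard BP update rules.
        d_out = (y - target) * y * (1 - y)
        d_hid = (d_out @ self.w2.T) * self.h * (1 - self.h)
        self.w2 -= self.lr * self.h.T @ d_out
        self.b2 -= self.lr * d_out.sum(axis=0)
        self.w1 -= self.lr * x.T @ d_hid
        self.b1 -= self.lr * d_hid.sum(axis=0)
        return float(((y - target) ** 2).mean())

# Example: fuse temperature, humidity and light intensity (3 inputs) into one output.
# net = BPFusionNet(n_in=3, n_hidden=5); loss = net.train_step(X_batch, y_batch)
```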
The advantages and disadvantages of compression algorithms are mainly judged from three aspects: compression rate, absolute error and time, so the test of lossless compression algorithms mainly includes compression time, decompression time and compression rate, and the test of lossy compression algorithms mainly includes compression time, decompression time, compression rate and absolute error.
The experiment compresses static data with LZW, LZSS, and the improved LZW algorithm; the test results are shown in Table 1. As the results show, combining the LZW algorithm with a static dictionary removes the need to gradually build dictionary entries for noun strings that occur with high frequency, so these strings can be encoded and compressed directly without the adaptive process of adding characters one by one. Compared with the LZSS and LZW algorithms, the compression time is reduced and the compression rate rises to 50.75%, giving a better compression effect.
Compression efficiency under different compression algorithms
| Algorithm | Original size (B) | Compressed size (B) | Compression time (ms) | Decompression time (ms) | Compression rate (%) |
|---|---|---|---|---|---|
| Improved algorithm | 5600 | 2842 | 4 | 3 | 50.75 |
| LZSS | 5600 | 2423 | 6 | 4 | 43.27 |
| LZW | 5600 | 2240 | 3 | 2 | 40.00 |
Based on the characteristics of trigger class data, compression is tested with the improved LZW algorithm and the original LZW algorithm; the results are shown in Table 2. The test results show that because trigger class data contain long runs of consecutive characters, a high compression rate is obtained with the improved LZW algorithm. With the additional improvement for discontinuous characters, the compression rate is further improved over the original LZW algorithm, reaching 9.07%. The data obtained after difference preprocessing of the time values also have continuity characteristics, and compressing them with the improved LZW algorithm likewise achieves a good compression effect.
Trigger the compression result of the class data
| Algorithm | Original size (B) | Compressed size (B) | Compression time (ms) | Decompression time (ms) | Compression rate (%) |
|---|---|---|---|---|---|
| Improved algorithm | 3240 | 294 | 4 | 6 | 9.07 |
| LZW | 3240 | 252 | 3 | 6 | 7.78 |
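The "difference preprocessing of the time values" mentioned above can be illustrated in a few lines: replacing timestamps by their successive differences turns a monotonically growing sequence into a highly repetitive one that dictionary coding handles well. This is a generic sketch, not the paper's exact preprocessing.

```python
def delta_encode(values: list[int]) -> list[int]:
    """Store the first value, then successive differences."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas: list[int]) -> list[int]:
    out, acc = [], 0
    for i, d in enumerate(deltas):
        acc = d if i == 0 else acc + d
        out.append(acc)
    return out

# Timestamps sampled every 10 ms become a run of identical small values:
# delta_encode([1000, 1010, 1020, 1030]) -> [1000, 10, 10, 10]
```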
Experiments compress the coordinate displacement data collected from the CNC system with the revolving door algorithm and the improved revolving door algorithm, with deviation points set at t=150 and t=1000; the compression results are shown in Figure 2. The improved algorithm treats the first deviation point specially: it saves the deviation point that exceeds the threshold and also saves the adjacent data before and after it, so that decompression does not affect the recovery of the other data and the absolute error of the data stays within a fixed range. The second deviation point is not handled by the improved algorithm, and the data before and after it are affected when the data are recovered after decompression.

Process data compression efficiency
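For readers unfamiliar with the revolving door (swinging door trending) family used here, the following is a minimal generic sketch of the basic algorithm in Python. It is not the paper's improved variant, the deviation-point handling described above is not included, and timestamps are assumed to be strictly increasing.

```python
def swinging_door(points: list[tuple[float, float]], e: float) -> list[tuple[float, float]]:
    """Basic swinging-door trending: keep only the points needed so that every dropped
    point lies within +/- e of the line between consecutive stored points."""
    if len(points) <= 2:
        return list(points)
    stored = [points[0]]
    anchor_t, anchor_v = points[0]
    slope_lo, slope_hi = float("-inf"), float("inf")
    prev = points[0]
    for t, v in points[1:]:
        dt = t - anchor_t
        # Tighten the admissible slope corridor using the current point.
        slope_lo = max(slope_lo, (v - anchor_v - e) / dt)
        slope_hi = min(slope_hi, (v - anchor_v + e) / dt)
        if slope_lo > slope_hi:
            # Corridor closed: archive the previous point and restart from it.
            stored.append(prev)
            anchor_t, anchor_v = prev
            dt = t - anchor_t
            slope_lo = (v - anchor_v - e) / dt
            slope_hi = (v - anchor_v + e) / dt
        prev = (t, v)
    stored.append(prev)
    return stored
```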
The experiments compress the network transmission data with Huffman coding, LZW, LZSS, and the improved LZW algorithm; the test results are shown in Table 3. They show that the algorithm constructed in this paper shortens the compression time by 45 ms compared with Huffman coding. Although its compression and decompression times are somewhat worse than those of the LZW and LZSS algorithms, it has a clear advantage in compression rate, which is very helpful for saving network bandwidth.
Network transmission data compression efficiency
| Algorithm | Original size (B) | Compressed size (B) | Compression time (ms) | Decompression time (ms) | Compression rate (%) |
|---|---|---|---|---|---|
| Huffman | 1440 | 692 | 63 | 30 | 48.06 |
| LZW | 1440 | 1108 | 3 | 2 | 76.94 |
| LZSS | 1440 | 742 | 3 | 2 | 51.53 |
| Improved algorithm | 1440 | 285 | 18 | 12 | 19.79 |
To verify the performance of the algorithm, this paper conducts simulation experiments on the proposed algorithm on the MATLAB platform. MATLAB, short for Matrix Laboratory, is commercial mathematical software from an American company, used as a high-level technical computing language and interactive environment for algorithm development, data visualization, data analysis, and numerical computation. It consists of two main parts, MATLAB and Simulink.
The basic configuration of the simulation model is as follows: number of nodes: 250~550; transmission radius: 25, 30, 35, 40 m; packet length: 6 bytes; energy consumption for transmitting data: 0.5
In the experimental evaluation, the algorithm in this paper is first compared with the DAS algorithm and the IAS algorithm in terms of delay performance. The DAS and IAS algorithms first construct a data fusion tree over a connected dominating set and then apply different scheduling methods, whose analysis yields an upper bound on the delay of 24

Fusion scheduling delay contrast
The fusion scheduling delay of the algorithm as the number of sensor nodes and the transmission radius vary is shown in Fig. 4. As the number of sensor nodes increases, the number of layers in the fusion tree increases accordingly, and the time required for scheduling increases. As the transmission radius of the sensor nodes increases, the number of competing nodes increases; each node then has to compete with more nodes, which takes more scheduling time.

Fusion delay under different transmission radius
We deploy 400 and 500 nodes in the region, where the dots are the Sink nodes and the black lines show the connectivity between the sensor nodes (the transmission radius is 20 meters). The sensor network topology is modeled as a graph G(V,E), and the data fusion tree with balanced shortest paths is shown in Fig. 5, where (a) and (b) correspond to 400 and 500 nodes, respectively. It can be seen that as the number of nodes increases, the load distribution across the fusion nodes becomes more balanced, and a balanced load distribution helps prolong the network's lifecycle.

Random point
The variation of network lifetime with the number of nodes is shown in Fig. 6. The figure shows that the algorithm in this paper has the longest network lifetime, and that the lifetime increases gradually as the number of nodes grows. The fusion tree is constructed on the basis of the shortest path tree by balancing the allocation of nodes layer by layer; as the number of nodes increases, the connectivity of the network increases and the balanced load distribution can be fully exploited, so the network life cycle grows with the number of nodes. When the number of nodes is 500, the life cycle is about 1.8 × 10⁷.

Network life contrast
In this paper, an improved LZW compression algorithm is proposed to address the problem of compression ratio degradation, and the compression efficiency of the improved algorithm is tested using indicators such as compression ratio and time. For big data transmission, a data fusion algorithm combining gray correlation analysis and a BP neural network is used, and simulation experiments show that:
1) For static data (compression rate of 50.75%) and trigger class data, the improved LZW algorithm improves the compression rate. For process class data, decompression does not affect the recovery of the other data, and network bandwidth is saved.
2) Compared with other algorithms, the data fusion algorithm has the best delay performance and fusion scheduling delay, so the load distribution is more balanced and the network life cycle is extended.
