Research on the Utilization Pattern Mining and Impact Mechanism of Open Government Data Based on Deep Learning Algorithms

In the era of digitalization, data has become an important resource for promoting economic and social development. In order to support public monitoring and enhance data reuse in the public sector, more and more government and public sector organizations rely on their open data platforms to disclose information about data in government, science, and other fields and unify the data in the form of datasets posted on the Web [1-2]. The openness and sharing of data is continuously improving the effectiveness of government governance and the quality of public services, and is also a prerequisite for the modernization of the national governance system in the era of big data [3-5].

As the number of open government data platforms grows, the requirements for comprehensive and accurate search and recommendation of open datasets rise accordingly. From an overall perspective, a national-level unified open data platform has not yet been built, datasets remain scattered across provincial and municipal platforms, and unified search tools for open datasets are lacking, so data users cannot perform cross-platform, cross-level, and cross-departmental searches of open datasets [6-9]. In terms of details, although most local government open data platforms provide dataset search functions based on faceted search and keyword matching, the faceted search systems of different platforms differ in how they divide domains, departments, etc., which makes it difficult to align and aggregate open datasets from different platforms [10-14].

Nowadays, the new generation of information search and recommendation technologies, such as deep learning, has become a new trend in information acquisition and dissemination and is also the basic content and core task of the national big data strategy [15]. Deep learning techniques can automatically extract semantic information by learning the features and representations of input data and have been widely used in natural language processing, recommender systems, and other fields [16-19]. Its application in open dataset search and recommendation can enhance the discoverability, accessibility and reusability of open datasets, optimize the search and recommendation performance of datasets, and satisfy the user’s demand for efficient, accurate and intelligent search and recommendation of open datasets [20-22].

Based on ecosystem theory, the study examines the subjects and connecting links of open government data utilisation, builds an analytical framework for open government data utilisation, and summarises the open government data utilisation model. As an example, open government data is used to analyse and predict traffic accidents, and a prediction model is constructed using deep learning algorithms. Specifically, after the traffic road network structure is built, the spatio-temporal features of accidents are captured using the gated graph convolution module, and the attention mechanism is applied to obtain a dynamic weighted representation of the spatio-temporal features. Meanwhile, in order to address the data sparsity and spatial heterogeneity problems of accident prediction, the scale reduction module is introduced so that the accident risk of coarse-grained regions guides accident risk prediction at the road segment level. Traffic flow data, traffic accident data and weather data from a city’s open government data are used as the dataset for model comparison and case studies, in order to explore the model’s prediction performance on traffic accidents and the influence of different features on its prediction error. Then, based on the AMO-TPB theory, a research model of open government data utilisation is constructed, the influencing factors are identified through regression analysis, and the paths between them are determined with an interpretive structural model, so as to obtain the influence mechanism of open government data utilisation.

2

Open Government Data Utilisation Model

In the era of the digital economy, data has been formally included in the category of factors of production and plays an increasingly important role in enabling high-quality development of the economy, promoting the transformation of the mode of national governance, and facilitating the change of social lifestyles. The information and data resources in the hands of government departments contain enormous political, economic, and social values. Combined with the basic content of innovation ecosystem theory, we design an ecosystem analysis framework for open government data (OGD) and explore the utilisation patterns of OGD.

2.1

OGD utilisation framework

The OGD ecosystem analysis framework is shown in Figure 1. This analysis framework is mainly composed of ecosystem subjects and external environment, with openers, developers, and consumers as ecosystem subjects, and the subjects are connected with data chains, achieve interaction and feedback through open platforms, realise data utilisation and value creation and are affected by the economic, social and technological environments.

Openers are OGD management organizations, mainly government departments in China, which are responsible for data collection, release, and management. Established studies show that the organisational level of openers, their ability to set standards, policies and regulations, etc., have a significant impact on the level of OGD utilisation. The stronger the openness ability of data providers, the more they can provide huge amounts of high-quality data, thus creating a good foundation for data utilization.

Developers refer to the developers of OGDs, such as enterprises, scientific research groups, industry groups, etc. They are one of the main bodies of data demand and use OGDs to develop data products and services, which promote the utilisation of OGDs and at the same time, serve as a bridge between openers and consumers. The stronger the developer’s data development capability, the more they can explore the value contained in OGD and provide rich and high-quality data products and services, thus enhancing the level of OGD utilisation.

Consumers are the end-users of OGD products and services, and mainly include the public, social groups, and the government itself. Consumers benefit from OGD products and services, and they are also a significant source of feedback and data. Consumers’ ability to consume data products and services influences whether OGD can ultimately realise value-added utilisation.

Open platforms refer to OGD portals, which are data carriers and interaction channels. They are mainly online platforms, such as the OGD platforms built by Chinese provincial and municipal governments. The various subjects of the OGD ecosystem are connected through the open platform.

In the OGD ecosystem, data are integrated into the open platform and flow between the various ecosystem subjects throughout the whole process, from openness to value creation to data return, forming a data chain connecting the whole ecosystem. Specifically, data is initially released and managed by openers through the open platform, then flows to developers and is processed into corresponding data products and data services, and then flows to consumers to realise value creation in corresponding scenarios. Accompanied by the feedback and interaction between the actors in the OGD ecosystem, the flow process of each link also creates and accumulates new data, and the new data becomes the object of the next round of data opening to realise the return flow and cycle.

2.2

Analysis of utilisation patterns

Different government open data utilisation models can be divided into government-led open data utilisation models, enterprise-led open data utilisation models, and citizen-led open data utilisation models.

Government-led open data utilisation models can be divided into internal management utilisation models, social innovation application models, and commercial development co-creation models according to different value objectives. Enterprise-led open data utilisation models can be divided into product co-creation models, service co-creation models, and knowledge co-creation models. A citizen-led open data utilisation model is the practice of diversified exploitation of government open data by individual citizens or citizen groups, in collaboration with different stakeholders, based on their own or common interests.

Specifically, for example, in the internal management utilisation model, a municipal government needs to set up a city operation monitoring system for dynamic monitoring in order to achieve comprehensive management, command, decision-making and scheduling of the city. Although different government departments have opened up data reflecting the city’s situation, including traffic flow, population flow, public security, water, electricity and gas situation, etc., there is a lack of unified development and utilisation planning methodology and realisation technology. In order to realise the value of these open data, the municipal government takes the lead and provides financial and policy support, and sets up a data studio to research and put forward demands, and cooperates with data companies, internet enterprises, application developers to carry out open data-based development and utilisation. Developers are implementing a project to develop a city operation monitoring system based on open data. In this process, open data can be analysed using data mining and deep learning algorithms, resulting in the data products needed by the government, which can assist the government’s urban management agencies in conducting real-time remote monitoring of the city’s operation and provide accurate data support for decision-making. The following is an example of traffic accident risk prediction in the city, using deep learning algorithms to construct a risk prediction model for analysis, and mining the internal management and utilisation model of open government data.

3

Deep learning-based traffic accident risk prediction

This chapter uses open government data to divide urban areas into road segments and proposes a prediction model based on scale-reduced attention and graph convolution (SAGCN).

3.1

General structure of the model

The details of the model for the accident risk prediction part are shown in Fig. 2, where the historical long-term and short-term accident risks are fed into the gated graph convolution unit separately, and the weather features of the corresponding time slices are fused in its output results, and finally, the splicing is done. The attention mechanism is used in the time layer, and this result and the coarse-grained accident risk are fed into the feature layer, and the output result of the scale reduction module is fused with the output result of the feature layer.

3.2

Gated graph convolution module

The gated graph convolution module extracts spatial and temporal features simultaneously. For the historical long-term input, d_in ∈ R^{l×N×Z}, i.e., d_in is a tensor of dimension l×N×Z, where l denotes the first l time slices of the history, N denotes the number of segmented road sections, and Z denotes the feature dimension of the graph signal input; here the only feature is the accident risk value, so Z=1. The graph convolution (GCN) extracts spatio-temporal information over k time steps at a time, and in order to obtain temporal outputs of the same dimensionality, a splicing operation is performed at the beginning. By stacking multiple graph convolution extraction layers in the middle, the gated activation unit can extract features for all time steps of the history after $\frac{l-1}{k-1}$ operations. For the historical short-term input, the graph convolution operates on a similar principle.

The implementation of GCN is based on spectral graph theory: given an abstract graph G=(V,ξ,A) and the signal on the graph, i.e., d_in above, the Laplacian matrix is L=D–A, where D is the diagonal degree matrix with $D_{ii} = \sum_{j} A_{ij}$. The normalised Laplacian matrix is (I is the identity matrix): (1) $L = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$

As shown in Eq. (2), according to the convolution theorem, the signal d_in is mapped to the spectral domain, the convolution kernel y is also mapped to the spectral domain, and after the element-wise product is computed, the inverse transformation is applied to complete the convolution in the spectral domain, where U is the matrix of eigenvectors obtained from the decomposition of the Laplacian matrix and U^T denotes its transpose. Then: (2) $d_{in} *_{G} y = U \left( (U^{T} d_{in}) \otimes (U^{T} y) \right)$

Because the eigenvalue decomposition of the Laplacian matrix is computationally expensive, U is a dense matrix, and the influence in the node domain comes from all nodes rather than being localised, the convolution is approximated using the Chebyshev polynomial method (a third-order Chebyshev approximation is chosen in this paper, P=3), as shown in Eq. (3): (3) $d_{in} *_{G} y = U g_{\beta}(\Lambda) U^{T} d_{in} = \sum_{i=0}^{P-1} \beta_{i} L^{i} d_{in}$

In order to model the non-linear spatio-temporal correlations in accident prediction, the final output uses a gated linear unit activation as shown in Eq. (4), which selects part of the information in the linear transformation by multiplying the linear transformation element-wise with a non-linear activation: (4) $f(X) = (X * W_{1} + b_{1}) \otimes \delta (X * W_{2} + b_{2})$

Where X denotes the input to the graph convolution, W₁, W₂ denote the learned weight parameters, b₁, b₂ denote the bias parameters, ⊗ denotes the element-wise product, and δ denotes the activation function. Finally, the output of each layer is normalised by batch normalisation to speed up network convergence.
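The computations in Eqs. (1), (3) and (4) can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the SAGCN implementation: the function names are hypothetical, the sigmoid is assumed for the gate activation δ, and the convolution X*W in Eq. (4) is simplified to a matrix product.

```python
import numpy as np

def normalized_laplacian(A):
    """Eq. (1): L = I - D^{-1/2} A D^{-1/2} for an adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0                      # isolated nodes keep a zero scaling
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def poly_graph_conv(L, x, beta):
    """Polynomial form of Eq. (3): sum_i beta[i] * L^i x, with P = len(beta)."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    L_power_x = x.copy()                   # holds L^i x, starting with i = 0
    for b in beta:
        out += b * L_power_x
        L_power_x = L @ L_power_x
    return out

def glu(x, W1, b1, W2, b2):
    """Eq. (4) with a sigmoid gate (assumption) and matrix products."""
    gate = 1.0 / (1.0 + np.exp(-(x @ W2 + b2)))
    return (x @ W1 + b1) * gate

# Toy road network: three segments connected in a line.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
risk = np.array([[0.2], [0.0], [0.5]])     # N x Z signal with Z = 1
L = normalized_laplacian(A)
features = poly_graph_conv(L, risk, beta=[0.5, 0.3, 0.2])   # P = 3
```

Because the polynomial in L only mixes each node with neighbours at most P-1 hops away, this form avoids the eigendecomposition and keeps the filter localised.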

3.3

Attention mechanisms

The long-term features obtained by the gated graph convolution unit and their corresponding normalised weather features (W_{t–l}, W_{t–l+1}, ⋯, W_{t–1}) are fused in the time dimension to obtain the first fusion data, and the short-term features are fused in the time dimension with their corresponding normalised weather features (W_{t–s}, W_{t–s+1}, ⋯, W_{t–1}) to obtain the second fusion data. The first fusion data and the second fusion data are spliced to obtain the spliced fusion data Y_{l+s} ∈ R^{(l+s)×1×(N×d_out)}. In order to obtain a dynamic representation in the temporal and spatial feature dimensions, the attention mechanism is applied in the temporal layer and the spatial feature layer, respectively.

3.3.1

Time-Level Attention

The temporal feature E_t of the moment to be predicted (one-hot encoded) is taken as input, and the importance score of each time slice is computed at the time layer: (5) $\alpha = \mathrm{softmax} (\tanh (Y_{l+s} W_{Y} + E_{t} W_{E} + b_{1}))$

Where α ∈ R^{l+s}, and the dimensions of the learnable parameters are as follows: (6) $W_{Y} \in R^{(l+s) \times (N \times d_{out}) \times 1}$ (7) $W_{E} \in R^{d_{e} \times (l+s)}$ (8) $b_{1} \in R^{l+s}$

The above equation notation is to illustrate the dimension size of each parameter. d_out denotes the output dimension of the output convolutional features of the gated graph convolution module, W_Y denotes the weight parameter learnt for the input hidden layer vector Y_l+s (fused data), W_E denotes the weight parameter learnt for the moment to be predicted, and b₁ denotes the bias parameter learnt by the neural network.

The output of the temporal layer attention mechanism is accumulated from the time score results of each step, and is calculated as follows: (9) $Y_{\alpha} = \sum_{m=1}^{l+s} \alpha^{m} y_{m} \in R^{N \times d_{out}}$

Where y_m is the result of slicing Y_{l+s} in the time dimension, Y_{l+s}=(y₁,…,y_{l+s}), m refers to the index in the time dimension, and summation from 1 to l+s gives the result Y_α.
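The time-level attention of Eqs. (5)-(9) can be illustrated with a small NumPy sketch. It is an assumption-laden reading of the stated parameter dimensions rather than the authors' code: the function names are hypothetical, Y_{l+s} is stored as a (l+s)×(N·d_out) matrix, and the product Y_{l+s}W_Y is taken slice by slice so that it yields one score per time slice.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def temporal_attention(Y, E_t, W_Y, W_E, b1):
    """Sketch of Eqs. (5) and (9).

    Y   : (T, N*d_out) fused features, T = l + s time slices
    E_t : (d_e,)       one-hot temporal feature of the target moment
    W_Y : (T, N*d_out) one weight vector per time slice (assumed layout)
    W_E : (d_e, T)
    b1  : (T,)
    """
    scores = np.einsum("tf,tf->t", Y, W_Y) + E_t @ W_E + b1
    alpha = softmax(np.tanh(scores))       # Eq. (5): one weight per slice
    Y_alpha = alpha @ Y                    # Eq. (9): weighted sum over slices
    return alpha, Y_alpha

rng = np.random.default_rng(0)
T, N, d_out, d_e = 5, 4, 3, 24             # illustrative sizes only
Y = rng.normal(size=(T, N * d_out))
E_t = np.eye(d_e)[8]                       # e.g. hour 8 as a one-hot vector
alpha, Y_alpha = temporal_attention(
    Y, E_t,
    W_Y=rng.normal(size=(T, N * d_out)),
    W_E=rng.normal(size=(d_e, T)),
    b1=np.zeros(T),
)
Y_alpha = Y_alpha.reshape(N, d_out)        # back to the N x d_out layout
```

The feature-layer attention in the next subsection follows the same pattern, with the weighted sum running over the d_out feature columns instead of the time slices.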

3.3.2

Feature layer attention

On the result Y_α obtained from the temporal layer attention mechanism, the attention mechanism is applied again at the feature layer, analogously to the temporal attention approach, and is computed as follows: (10) $\beta = \mathrm{softmax} (\tanh (Y_{\alpha} W_{Y_{F}} + E_{t} W_{E_{F}} + b_{2}))$

Where β denotes the feature layer attention score, β ∈ R^{d_out}, W_{Y_F} denotes the weight parameter learnt at the feature layer for input Y_α, W_{E_F} denotes the weight parameter learnt at the feature layer for input E_t, and b₂ denotes the bias. The dimensions of each parameter are set as follows: (11) $W_{Y_{F}} \in R^{d_{out} \times N \times 1}$ (12) $W_{E_{F}} \in R^{d_{e} \times d_{out}}$ (13) $b_{2} \in R^{d_{out}}$

The preliminary results obtained from the final feature layer are shown in Eq. (14), where F_h denotes the vector of Y_α in the spatial feature dimension above, Y_α is an N×d_out matrix that can be expressed as Y_α=(F₁,…,F_{d_out}), h refers to the index in the feature dimension, and the sum over h runs from 1 to d_out. Then: (14) $Y = \sum_{h=1}^{d_{out}} \beta^{h} F_{h} \in R^{N}$

Due to the large number of accident risk values in the sample that are zero, normal outputs will tend to predict all-zero values in order to reduce the error, a phenomenon known as the zero-inflation problem. In order to solve this problem, the scale reduction module is designed in this paper to combine the above results with the output.

3.4

Scale reduction module

The scale reduction module takes as input the accident risk values of coarse-grained areas with large spatial scales. The structure of this module is a three-layer feed-forward fully connected network. The input layer of the scale reduction module is denoted by L_input, the hidden layer by L_hidden, and the output layer by L_out. The parameters are set as follows: (15) $L_{input} \in R^{1 \times C}$ (16) $L_{hidden} \in R^{C \times N}$ (17) $L_{out} \in R^{N \times 1}$

L_out represents the result obtained from the output layer, i.e., the coarse-grained accident risk after scale reduction; the dimension of the output is the number of road segments, and the final predicted value Ŷ fuses this result for output as shown in Eq. (18): (18) $\hat{Y} = Y + L_{out}$
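A minimal sketch of Eqs. (15)-(18) is given below. The paper only states the layer dimensions, so several details are assumptions made for illustration: the ReLU activation, the N×N output weight matrix, and all function and variable names.

```python
import numpy as np

def scale_reduction(coarse_risk, W_hidden, b_hidden, W_out, b_out):
    """Map C coarse-grained risk values to N road-segment values.

    coarse_risk : (C,)    input layer, Eq. (15)
    W_hidden    : (C, N)  hidden layer weights, Eq. (16)
    W_out       : (N, N)  output layer weights (assumed shape)
    returns     : (N,)    L_out, Eq. (17)
    """
    hidden = np.maximum(0.0, coarse_risk @ W_hidden + b_hidden)  # ReLU (assumed)
    return hidden @ W_out + b_out

def fuse(Y, L_out):
    """Eq. (18): final prediction = attention output + downscaled coarse risk."""
    return Y + L_out

# Toy example: C = 2 coarse regions, N = 3 road segments.
coarse = np.array([0.5, 0.2])
L_out = scale_reduction(coarse, np.ones((2, 3)), np.zeros(3), np.eye(3), np.zeros(3))
Y = np.array([0.1, 0.0, 0.0])              # sparse segment-level output
Y_hat = fuse(Y, L_out)
```

In this toy case the coarse regions contribute a non-zero baseline to every segment, which illustrates how the additive fusion can counteract all-zero predictions.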

3.5

Experimental design and validation

In this experimental design, the large volume of urban traffic accident data opened by the government is fused with multi-source data such as real-time traffic flow and weather. The effects of multi-source input data and of single-source traffic flow data on the predicted risk of traffic accidents in the relevant area are calculated separately, in order to compare how different models and different input data affect traffic accident risk prediction.

3.5.1

Data sets

Three datasets are used in this paper: traffic flow data, traffic accident data and weather data (rainfall and visibility). They are collected from the open government data of a city for June to December 2022. The processed dataset is split into training, validation and test data in the ratio 7:1:2 for model training and validation.

3.5.2

Evaluation indicators

In the experiments of this paper, the error functions of mean absolute error (MAE), mean squared error (MSE) and mean relative error (MRE) are used as the evaluation indexes of the algorithm. The smaller the value of the above three evaluation indexes, the more accurate the prediction results will be. This experiment predicts the risk of traffic accidents in the grid area, and the range of the predicted value is located between 0 and 1. When the predicted value is closer to 1, it means that the probability of traffic accidents in the area in the next time interval is greater.
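The three indicators can be computed as in the sketch below. MAE and MSE follow their standard definitions; the paper does not give a formula for MRE, so the ratio of total absolute error to total absolute true value is used here purely as an assumption (a per-sample ratio would be undefined for the many zero-risk samples).

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def mre(y_true, y_pred):
    """Mean relative error, assumed form: sum|y - y_hat| / sum|y|."""
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))

y_true = np.array([0.0, 0.5, 1.0])         # illustrative risk values in [0, 1]
y_pred = np.array([0.1, 0.5, 0.8])
scores = {"MAE": mae(y_true, y_pred), "MSE": mse(y_true, y_pred), "MRE": mre(y_true, y_pred)}
```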

3.5.3

Comparative Experimental Results

This paper employs Linear Regression (LinR), Logistic Regression (LogR), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Stacked Denoising Autoencoder (SDAE) as baseline algorithms for comparison.

The comparison of the experimental results of different models in the case where the input source is only traffic flow is shown in Figure 3. It can be found that (1) the prediction accuracy of deep learning models is overall higher than that of machine learning models. The reason for this is that the deep learning model is better at learning high-dimensional features and has a stronger learning ability than the machine learning model. (2) The prediction error of the SAGCN algorithm proposed in this paper is smaller than that of the SDAE model, and the values of its MAE, MSE, and MRE are 0.082, 0.038, and 0.808, respectively.

3.5.4

Utility analysis of factors

The proposed SAGCN model is combined with dynamic factors such as traffic accident data and weather data to carry out large-scale experiments to compare the impact of different influencing factors on the prediction of traffic accident risk in a city urban area.

A comparison of the experimental results for various factor combinations is shown in Table 1, where F denotes traffic flow, A the number of traffic accidents, R rainfall, and V visibility. The combination of all four factors (flow, number of historical accidents, rainfall, and visibility) does not yield the smallest error. The combination of flow, number of historical accidents, and rainfall has the smallest overall error, with MAE, MSE, and MRE of 0.075, 0.035, and 0.756, respectively, while the combination of flow, rainfall, and visibility has the largest error, with MAE, MSE, and MRE of 0.085, 0.044, and 0.851, respectively.

Table 1. Experimental results for different factor combinations

Input dimension	MAE	MSE	MRE
SAGCN(F)	0.081	0.041	0.784
SAGCN(F+A)	0.077	0.039	0.776
SAGCN(F+R)	0.078	0.040	0.768
SAGCN(F+V)	0.083	0.038	0.836
SAGCN(F+A+R)	0.075	0.035	0.756
SAGCN(F+A+V)	0.083	0.038	0.836
SAGCN(F+R+V)	0.085	0.044	0.851
SAGCN(F+A+R+V)	0.082	0.042	0.844

Considering single factors other than traffic flow, factors such as the number of historical accidents and rainfall are beneficial in reducing the error of the experimental results, while the visibility factor leads to an increase in the error of the experimental results. Therefore, it is not the case that the more combinations of factors are considered, the more favourable it is to reduce the prediction error of the model, and there are differences in the results of different combinations of factors. In this paper, rainfall and the number of accidents are favorable factors for reducing the prediction error of the risk of traffic accidents in the experiments.
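The ranking described above can be checked directly against the values in Table 1. The snippet below simply copies the table into a dictionary (the variable names are illustrative) and looks up the best and worst combination for each metric.

```python
# Table 1 values copied from the text: (MAE, MSE, MRE) per input combination.
table1 = {
    "F":       (0.081, 0.041, 0.784),
    "F+A":     (0.077, 0.039, 0.776),
    "F+R":     (0.078, 0.040, 0.768),
    "F+V":     (0.083, 0.038, 0.836),
    "F+A+R":   (0.075, 0.035, 0.756),
    "F+A+V":   (0.083, 0.038, 0.836),
    "F+R+V":   (0.085, 0.044, 0.851),
    "F+A+R+V": (0.082, 0.042, 0.844),
}
metrics = ("MAE", "MSE", "MRE")

# For each metric, find the combination with the smallest and largest error.
best = {m: min(table1, key=lambda k: table1[k][i]) for i, m in enumerate(metrics)}
worst = {m: max(table1, key=lambda k: table1[k][i]) for i, m in enumerate(metrics)}
```

On these values, F+A+R is the best and F+R+V the worst combination on all three metrics, which matches the discussion of Table 1.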

3.5.5

Case studies

Since different causal factors have different impacts in different scenarios, this section trains the model for different scenarios in turn: morning peak versus off-peak, weekdays versus holidays, and sunny versus rainy days.

The comparison of the experimental results (MRE) for the morning peak and the off-peak period is shown in Figure 4. The MRE of the prediction results for the morning peak is between 0.734 and 0.798, in all cases lower than that of the off-peak period. The prediction errors of the various factors on the risk of traffic accidents are generally in line with those in the utility analysis above. There are more vehicles on the road during the morning peak, so if a road traffic accident occurs, traffic congestion will be more serious than in the off-peak period. Therefore, it is particularly important to accurately predict the risk of traffic accidents during the morning peak. The model’s lower prediction error in the morning peak can help traffic management personnel deploy police to high-risk areas in advance and deal with accidents in a timely manner, and can also help drivers avoid high-risk areas.

The effect of each input on the model prediction error was compared for weekdays and holidays, and the comparison of the experimental results (MRE) is shown in Figure 5. The model prediction MRE values for weekdays and holidays are 0.736~0.813 and 0.769~0.839, respectively. There are more vehicles on weekdays, and traffic congestion caused by road traffic accidents is more serious than on holidays, so effective prediction of the risk of traffic accidents on weekdays is particularly important. The prediction errors of this paper’s model are lower on weekdays than on holidays, which can help citizens travelling on weekdays plan their routes to avoid high-risk areas.

The comparison of experimental results (MRE) between sunny and rainy days is shown in Figure 6. The model shows slightly better prediction results on rainy days than on sunny days: the MRE for the prediction of traffic accident risk is 0.745~0.781 on rainy days and 0.772~0.793 on sunny days.

In summary, open government data such as traffic flow, weather data and traffic accident data can be used to predict the risk of traffic accidents using deep learning models to help the relevant personnel make decisions and plans to reduce the occurrence of traffic accident incidents.

4

Analysis of the mechanisms influencing the use of open government data

Open government data itself does not bring economic value and social benefits, but the public, as developers and utilisers of data, can provide valuable information and collective wisdom to create economic value for society, and public participation largely influences the value realisation of government data. After applying deep learning algorithms to deduce the process of open government data utilisation, this chapter explores the influence mechanism of open government data utilisation based on the perspective of the paradox of public utilisation willingness and behaviour, combined with regression analysis.

4.1

Regression analysis methods

The main object of regression analysis is the statistical relationship between the variables of objective things, which is based on a large number of experiments and observations of objective things and is used to find the statistical regularity hidden in those phenomena that seem to be uncertain. Examining the interdependence of one variable (the dependent variable) with many other variables (the independent variables) is a multiple regression problem.

Once a multiple regression model has been determined, it is not prudent to apply it immediately for prediction, control and analysis, because whether the model reveals the relationship between the explained variable and the explanatory variables must be determined by testing the model. The testing of regression models generally requires the use of statistical tests.

Statistical tests are usually tests of the significance of the regression equation and the regression coefficients, as well as tests of goodness-of-fit and multicollinearity of the explanatory variables.

4.1.1

Multiple linear regression models

Let the linear regression model of the random variable y and the general variables x₁,x₂,⋯,x_p be: (19) $y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \cdots + \beta_{p} x_{p} + \varepsilon$

Where β₀,β₁,⋯,β_p are the p+1 unknown parameters, β₀ is the regression constant, and β₁,⋯,β_p are the regression coefficients. y is called the explained variable (dependent variable), and x₁,x₂,⋯,x_p are p general variables that can be precisely measured and controlled, called explanatory variables (independent variables).

For a practical problem, if n sets of observations (x_i1,x_i2,⋯,x_ip;y_i) (i=1,2,⋯,n) are obtained, the linear regression model can be expressed as: (20) $y = X \beta + \varepsilon$

Where $y = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix}$, $X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}$, $\beta = \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{bmatrix}$, and $\varepsilon = \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{bmatrix}$ is the vector of random errors.
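To make the matrix form in Eq. (20) concrete, the sketch below builds the design matrix X with a leading column of ones and estimates β by ordinary least squares. The data and the helper name fit_ols are hypothetical and serve only as an illustration.

```python
import numpy as np

def fit_ols(x, y):
    """Build X = [1, x_1, ..., x_p] and estimate beta by least squares."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones(len(x)), x])      # n x (p + 1) design matrix
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return X, beta

# Synthetic, noise-free data generated from y = 1 + 2*x1 + 3*x2.
x = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 1.0 + 2.0 * x[:, 0] + 3.0 * x[:, 1]
X, beta_hat = fit_ols(x, y)
```

Here np.linalg.lstsq is used instead of explicitly inverting X′X, which is numerically more robust when the columns of X are nearly collinear.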

4.1.2

Goodness of fit

Goodness of fit is used to test the fit of the regression equation to the sample observations, as reflected in the value of the sample coefficient of determination $R^{2} = \frac{S S R}{S S T}$ ${{R}^{2}}=\frac{SSR}{SST}$.

The sum of squares decomposition formula is: (21) $\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2} = \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2} + \sum_{i = 1}^{n} {({\hat{y}}_{i} - \bar{y})}^{2}$ \[\sum\limits_{i=1}^{n}{{{\left( {{y}_{i}}-\bar{y} \right)}^{2}}}=\sum\limits_{i=1}^{n}{{{\left( {{y}_{i}}-{{{\hat{y}}}_{i}} \right)}^{2}}}+\sum\limits_{i=1}^{n}{{{\left( {{{\hat{y}}}_{i}}-\bar{y} \right)}^{2}}}\]

The regression sum of squares, $S S R = \sum_{i = 1}^{n} {({\hat{y}}_{i} - \bar{y})}^{2}$ $SSR=\sum\limits_{i=1}^{n}{{{\left( {{{\hat{y}}}_{i}}-\bar{y} \right)}^{2}}}$, reflects the magnitude of fluctuations in the n estimates of ŷ₁,ŷ₂,…,ŷ_n which is due to the fact that there is indeed a linear relationship between Y and the independent variable x₁,x₂,…,x_m and through the variation of x₁,x₂,…,x_m.

The residual sum of squares $\sum_{j = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}$ $\sum\limits_{j=1}^{n}{{{\left( {{y}_{i}}-{{{\hat{y}}}_{i}} \right)}^{2}}}$, which is caused by everything other than the linear relationship between x₁,x₂,…x_n and Y (including the non-linear relationship between x₁,x₂,…x_m and Y and random error).

The total sum of squared deviations $S S T = \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}$ $SST=\sum\limits_{i=1}^{n}{{{\left( {{y}_{i}}-\bar{y} \right)}^{2}}}$ reflects the magnitude of the total fluctuations in Y the observations y₁,y₂,…y_n.

From the significance of the regression sum of squares and the residual sum of squares, it can be known that the greater the proportion of the regression sum of squares in the total sum of squares of deviations, the better the linear regression, which indicates that the regression straight line fits the sample observations better, and if the proportion of the residual sum of squares is large, the regression straight line does not fit the sample observations satisfactorily. The ratio of the sum of squared regressions SSR to the sum of squared total deviations SST is defined as the coefficient of determination: (22) $(c o e f f i c i e n t o f \det e r \min a t i o n) — — R^{2} = \frac{\sum_{i = 1}^{n} {({\hat{y}}_{i} - \bar{y})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}$ \[\text{ }\left( coefficient\text{ }of\text{ }\det er\min ation \right){{R}^{2}}=\frac{\sum\limits_{i=1}^{n}{{{\left( {{{\hat{y}}}_{i}}-\bar{y} \right)}^{2}}}}{\sum\limits_{i=1}^{n}{{{\left( {{y}_{i}}-\bar{y} \right)}^{2}}}}\]

The coefficient of determination R² is a relative indicator of the goodness of fit of a regression line to the sample observations, reflecting the proportion of the variation in the dependent variable that the independent variables can explain. If R² is close to 1, the regression equation explains the vast majority of the variation in the dependent variable and fits well; conversely, if R² is small, the regression equation fits poorly and should be modified.
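The decomposition in (21) and the ratio in (22) can be checked numerically. The following is a minimal sketch for a single predictor, using made-up data that are not taken from the paper's survey; the helper names `fit_line` and `sums_of_squares`, and the label SSE for the residual sum of squares, are assumptions introduced here for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sums_of_squares(x, y):
    """Return (SST, SSR, SSE, R^2) for the fitted line."""
    b0, b1 = fit_line(x, y)
    y_hat = [b0 + b1 * xi for xi in x]
    y_bar = mean(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # regression
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # residual
    return sst, ssr, sse, ssr / sst

# Illustrative data only (hypothetical values).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
sst, ssr, sse, r2 = sums_of_squares(x, y)

print(f"SST={sst:.4f}  SSR={ssr:.4f}  SSE={sse:.4f}  R^2={r2:.4f}")
```

Up to floating-point rounding, SST equals SSR plus SSE, which holds for any least squares fit that includes an intercept.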

4.1.3

Multicollinearity test

1) Effect of multicollinearity on regression models

Set up the regression model: (23) $y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \cdots + \beta_{p} x_{p} + \varepsilon$

Complete multicollinearity exists when, for the column vectors of the design matrix X, there is a set of numbers C₀,C₁,⋯,C_p, not all zero, such that: (24) $C_{0} + C_{1} x_{i1} + \cdots + C_{p} x_{ip} = 0, \quad i = 1, 2, \cdots, n$

In this case the rank of the design matrix satisfies rank(X)<p+1, so $|X'X| = 0$. The solution to the system of normal equations X′Xβ=X′y is then not unique, (X′X)⁻¹ does not exist, and the expression for the least squares estimate of the regression parameters, $\hat{\beta} = (X'X)^{-1} X'y$, no longer holds.
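A small numerical illustration of complete multicollinearity follows. It is a sketch on hypothetical integer data, with p = 2 so that X′X is 3×3; the helpers `transpose`, `matmul` and `det3` are written here for the example and are not part of the paper. The second predictor is exactly twice the first, so condition (24) holds with C = (0, 2, −1) and the determinant of X′X is exactly zero.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def gram_det(x1, x2):
    """|X'X| for the design matrix with an intercept column and two predictors."""
    X = [[1, u, v] for u, v in zip(x1, x2)]
    return det3(matmul(transpose(X), X))

x1 = [1, 2, 3, 4]
x2_collinear = [2 * v for v in x1]   # exact linear dependence on x1
x2_independent = [1, 0, 2, 1]        # no exact linear dependence

# Condition (24) with C0 = 0, C1 = 2, C2 = -1 holds for every observation.
condition_24 = all(0 + 2 * u - 1 * v == 0 for u, v in zip(x1, x2_collinear))

det_collinear = gram_det(x1, x2_collinear)
det_independent = gram_det(x1, x2_independent)

print(condition_24, det_collinear, det_independent)
```

Integer arithmetic is used so that the singular case yields an exact zero; with floating-point data a near-zero determinant or a rank check would be the practical signal.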

2) Diagnosis of multicollinearity

Centre and standardise the independent variables; then X^*′X^*=(r_ij) is the correlation matrix of the independent variables. Denote C=(c_ij)=(X^*′X^*)⁻¹. Its main diagonal element VIF_j=c_jj is called the variance inflation factor of the independent variable x_j. The variance of the j-th coefficient estimate is $\operatorname{var}(\hat{\beta}_{j}) = c_{jj} \sigma^{2} / L_{jj}$, j=1,2,⋯,p, where L_jj is the sum of squared deviations of x_j, so c_jj measures how much the variance of $\hat{\beta}_{j}$ is inflated.

It can be shown that $c_{jj} = 1 / (1 - R_{j}^{2})$, where $R_{j}^{2}$ is the coefficient of determination obtained by regressing x_j on the remaining p–1 independent variables. Since $R_{j}^{2}$ measures the degree of linear correlation between x_j and the remaining p–1 independent variables, the stronger this correlation, the more severe the multicollinearity among the independent variables, the closer $R_{j}^{2}$ is to 1, and the larger VIF_j is. A value of VIF_j≥10 indicates severe multicollinearity between the independent variable x_j and the remaining independent variables; a mean $\overline{VIF} = \frac{1}{p} \sum_{j=1}^{p} VIF_{j}$ much greater than 1 likewise indicates a severe multicollinearity problem.
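The diagnosis described above can be sketched in a few lines: build the correlation matrix of the predictors, invert it, and read the variance inflation factors off the main diagonal. The data below are hypothetical (x2 is constructed to be almost a multiple of x1), and the function names are illustrative assumptions rather than anything used in the paper; a small Gauss-Jordan routine stands in for a linear algebra library.

```python
from statistics import mean

def correlation_matrix(columns):
    """Pearson correlation matrix of equally long predictor columns."""
    centred = []
    for col in columns:
        m = mean(col)
        centred.append([v - m for v in col])
    norms = [sum(v * v for v in col) ** 0.5 for col in centred]
    p = len(columns)
    return [
        [sum(a * b for a, b in zip(centred[i], centred[j])) / (norms[i] * norms[j])
         for j in range(p)]
        for i in range(p)
    ]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: complete multicollinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]

# Hypothetical predictors: x2 is nearly 2 * x1, x3 is unrelated noise.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]

vifs = vif([x1, x2, x3])
mean_vif = sum(vifs) / len(vifs)
pair = vif([x1, x3])                       # two-predictor case
r13 = correlation_matrix([x1, x3])[0][1]   # so VIF should equal 1 / (1 - r^2)

print([round(v, 2) for v in vifs], round(mean_vif, 2))
```

In the two-predictor case the result reduces to 1/(1 − r²), which gives a quick sanity check of the routine against the relation between VIF_j and $R_{j}^{2}$.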

4.2

Study design

4.2.1

Research Modelling

AMO theory is the ‘ability-motivation-opportunity’ theory, which holds that an individual’s ability, motivation and opportunity together determine their behavioural tendencies. TPB (Theory of Planned Behaviour) holds that perceived behavioural control, behavioural attitudes and subjective norms together influence the emergence of behaviours. Combining the AMO and TPB theories, this paper constructs an AMO-TPB theoretical framework.

The research model of open government data utilisation is shown in Figure 7. The capability dimension contains perceived behavioural control and data literacy, i.e., the degree of control felt by the public when using government data and the public’s data awareness and capability. The motivation dimension contains behavioural attitudes and subjective norms, i.e., the public’s evaluation of the behaviour of utilising government data and the social pressure felt by the public to perform the utilisation behaviour. The opportunity dimension includes facility readiness, data quality and platform quality, i.e., the readiness of the infrastructure that the public has to support the utilisation behaviour, and the availability and ease of use of government data and open government data platforms. In addition, individual characteristics such as gender and age affect the consistency between individual willingness and actual behavior. Risk attitude is the individual’s tendency to choose between risky and safe options. In this paper, risk attitude refers to the public’s weighing and choosing between risky and safe options when facing whether to make use of government data or not, and it is also an important factor that affects the relationship between individual willingness and actual behaviour. Therefore, this paper argues that gender, age, and risk-taking attitude also have an impact.

4.2.2

Questionnaires

The independent variables in this paper are 10 variables that fall under the dimensions of ability, motivation, opportunity, and individual characteristics, and the dependent variable is the utilization of open government data. Considering that the research object is the general public, the questionnaire needs to be widely applicable, so the questions are set in accordance with the principle of simplicity and comprehensibility. A total of 525 questionnaires were returned, of which 483 were valid, a validity rate of 92%. The survey sample is representative of different provinces, age groups, and genders in the PRD region. The collected data passed the reliability and validity tests.

4.3

Analysis of causes

In this paper, the variance inflation factor (VIF) was applied to test the independent variables for multicollinearity. The estimation results of the research model are shown in Table 2, where *** denotes p<0.001 and ** denotes p<0.01. The maximum VIF across the variables is 4.537, which is below the threshold of 10, so multicollinearity among the variables is absent or weak. The regression model was screened in 7 steps using the backward LR strategy; the −2 log-likelihood at the final step was 309.239, and the regression equation passed the significance test. The Hosmer-Lemeshow statistic was 0.531, and the model fit was good.

Table 2.

Estimation results of the research model

Variable class	Variable	Regression coefficient	Standard deviation	Exp(B)
Ability	Perceived behavioral control	0.133***	0.034	0.912
Motivation	Behavioral attitude	0.322**	0.055	0.817
Opportunity	Facility completeness	0.208***	0.049	1.172
Opportunity	Data quality	0.156***	0.124	0.924
Individual characteristics	Risk attitude	0.695***	0.537	0.915
Constant		0.776**	0.489	3.226
Model (Sig.)		0.000	−2 log-likelihood	309.239
Hosmer-Lemeshow		0.531

Perceived behavioral control has a significant positive effect (0.133) on the utilization of open government data. A high level of perceived behavioural control helps to reduce uncertainty and anxiety when the public uses open government data, a finding that reveals the key role of public confidence and sense of self-control in utilisation behaviour. Meanwhile, behavioural attitude (0.322) in the motivation dimension, facility completeness (0.208) and data quality (0.156) in the opportunity dimension, and risk attitude (0.695) in individual characteristics also have a significant positive effect on open government data utilisation. The other variables did not have a significant effect on the utilization of open government data. Among the significant variables, risk attitude has the greatest effect on the use of open government data.

4.4

Analysis of impact mechanisms

Perceived behavioral control, behavioral attitudes, facility completeness, data quality, and risk attitudes are denoted by S1, S2, S3, S4, and S5, respectively, and S0 denotes the utilization of open government data. After the element hierarchy analysis through the adjacency multiplication matrix, the influencing factors of open government data utilisation are divided into 2 layers: layer 1 L1 = {S3, S4, S5} and layer 2 L2 = {S1, S2}. According to the factor level situation and the complex interrelationship between the factors, the explanatory structure model is obtained. The explanatory structure model of open government data utilisation is shown in Figure 8. Open government data utilisation includes both shallow and deep factors, which form three paths of action:

Path 1: Facility completeness → perceived behavioural control / behavioural attitude → open government data utilisation

Path 2: Data quality → perceived behavioural control / behavioural attitude → open government data utilisation

Path 3: Risk attitude → perceived behavioural control / behavioural attitude → open government data utilisation
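The layering behind these three paths can be reproduced with a standard level partition on a reachability matrix. The sketch below is an illustration under stated assumptions, not the authors' computation: the direct-influence edges are inferred here from the three paths (each of S3, S4, S5 influences both S1 and S2, which in turn influence S0), and the function names are hypothetical. Levels are numbered from the outcome S0 outward, which may differ from the layer labels used in the text, although the groupings {S1, S2} and {S3, S4, S5} are the same.

```python
def reachability(adj):
    """Boolean transitive closure of (adjacency + identity), Warshall's algorithm."""
    n = len(adj)
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

def partition_levels(reach, names):
    """Repeatedly peel off factors whose reachability set lies within their antecedent set."""
    remaining = set(range(len(names)))
    levels = []
    while remaining:
        level = sorted(
            i for i in remaining
            if {j for j in remaining if reach[i][j]}
            <= {j for j in remaining if reach[j][i]}
        )
        if not level:
            raise ValueError("no top-level factor found; check the reachability matrix")
        levels.append([names[i] for i in level])
        remaining -= set(level)
    return levels

names = ["S0", "S1", "S2", "S3", "S4", "S5"]
index = {name: k for k, name in enumerate(names)}

# Assumed direct influences, inferred from Paths 1 to 3 (hypothetical adjacency).
edges = [("S1", "S0"), ("S2", "S0")] + [
    (deep, mid) for deep in ("S3", "S4", "S5") for mid in ("S1", "S2")
]
adj = [[0] * len(names) for _ in names]
for src, dst in edges:
    adj[index[src]][index[dst]] = 1

levels = partition_levels(reachability(adj), names)
print(levels)
```

Under these assumed edges the partition separates the outcome, the two intermediate factors, and the three deep factors, matching the shallow and deep grouping described in the text.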

5

Conclusion

Effective utilization of open government data is beneficial for both promoting urban economic development and enhancing public services. In this study, we analyze the open government data utilization model, starting with the government-led internal management utilization model, taking traffic accident risk prediction as an example, constructing an intelligent model based on scale-reduced attention and graph convolution, and conducting experimental analysis. Then, using the regression analysis method, based on the constructed research framework on open government data utilization, the influence mechanism of open government data exploitation is explored.

In traffic accident risk prediction, the MAE (0.082), MSE (0.038), and MRE (0.808) of the SAGCN model in this paper are smaller than those of the comparison models and have better traffic accident prediction performance. For the feature analysis, the number of historical traffic accidents and rainfall help to reduce the prediction error of the model, while the visibility increases the model’s error, which is not conducive to improving the prediction accuracy. In addition, the SAGCN model in this paper has better performance during busy traffic times (morning rush hour, weekdays) and when there is an unusual event (rainfall), corresponding to MREs of 0.734-0.798, 0.736-0.813 and 0.745-0.781. The deep learning model based on open government data can predict the potential risks of traffic, which makes it easier for relevant people to make proper precautionary efforts to reduce casualties and property losses.

Perceived behavioral control, behavioral attitudes, facility completeness, data quality, and risk attitudes all have a significant positive effect on open government data utilization (p<0.01). Risk attitude, among the individual characteristics, has the most pronounced effect, with a regression coefficient of 0.695. Open government data utilization is shaped by three paths of action, which originate from facility completeness, data quality, and risk attitude respectively. Each path acts through perceived behavioral control and behavioral attitude, which play a facilitating role and promote the utilization of open government data.


Ying Zhang

Tianhao He

Published online: 17 March 2025

Received: 04 Nov. 2024

Accepted: 13 Feb. 2025

DOI: https://doi.org/10.2478/amns-2025-0183

Keywords: Graph convolutional networks, Deep learning, Attention mechanisms, Risk prediction, Regression analysis, Open government data

© 2025 Ying Zhang et al., published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.
