Application and Accuracy Improvement of Big Data Analytics in Market Demand Forecasting in Tourism Economy

Tourism is a large and growing industry that plays an important role in driving economic development. In the post-epidemic era, tourism activity has reached a new high.

The tourism economy contributes significantly to economic growth. First, it provides employment opportunities [1-4]. Tourism is a labor-intensive industry that creates a large number of jobs. Both people directly engaged in tourism services and employees in closely related fields, such as hotels, restaurants, and transportation, can obtain stable employment from the tourism industry. Second, it enhances the added value of the economy [5-7]. Tourism drives the development of many related industries, such as transportation, catering, and retail. The development of these industries increases the total output value and wealth creation of the national economy and promotes economic growth. Third, it increases foreign exchange earnings [8-10]. Tourism usually attracts a large number of foreign tourists, whose spending brings in substantial foreign exchange income and increases the country’s foreign reserves. Fourth, it promotes regional economic balance [11-13]. Tourism development attracts visitors to specific regions, promotes the development of the regional economy, and improves the economic pattern between regions. Finally, it promotes the protection of cultural heritage [14-16]. The development of tourism requires the protection and inheritance of rich cultural heritage. Through the promotion of tourism, cultural heritage can be better protected and passed on, further enriching the diversity of human civilization.

The tourism market is characterized by plurality of supply, diversity of demand, and uncertainty of market demand [17-20]. Supply in the tourism market spans transportation, accommodation, catering, scenic spots, and other fields, forming a complex industrial chain. Collaboration and competition among different suppliers promote the development of the tourism market. Travelers have different needs: some pursue leisure and relaxation, some like exploration and adventure, and some are fond of history and culture. This diversity prompts enterprises to provide a variety of tourism products and services to meet the needs of different travelers. Because tourism demand is affected by many factors, such as policy, weather, and holidays, tourism market demand is uncertain. Tourism enterprises need to respond flexibly to this uncertain environment in order to improve their market competitiveness.

For the dynamic and time-sensitive tourism industry and for the hospitality industry that accommodates tourists, market intelligence is essential for promotion, and predicting customer movements is the main direction for improving competitive advantage [21]. Literature [22] notes that, among Internet data sources, search engine data are the most widely used in tourism forecasting because search queries reveal what information users want at a given time. In addition, in 2017, literature [23] proposed building a big data-based tourism forecasting framework in the hope of transforming the tourism business. Literature [24] analyzed and predicted the daily passenger flow of an attraction by applying a long short-term memory network to multiple data sources, such as historical visitor numbers, search engine platforms, and weather. The method provided a reference for the attraction’s reception management and contingency planning, as well as for tourist safety and itinerary arrangement. Literature [25] showed that big data can effectively predict the demand for cruise tourism in China, reducing the financial risk for multiple parties such as ports, investors, promoters, and cruise lines. Literature [26] used big data to provide timely and granular data on the decline in tourist air travel during the epidemic, which offered ideas for the marketing strategies of local tourist airlines. Literature [27] used big data to predict visitor numbers at tourist sites, examined the mutual predictive ability between web search engine results and forecast results, predicted the final visitor numbers more accurately, and improved the management and marketing strategy of tourist sites.
Literature [28], on the other hand, predicted tourist numbers by analyzing data from three platforms, Baidu, Ctrip, and GoWhere.com, using both single-source and multi-source analysis, and the results showed that multi-source analysis outperformed single-source analysis. Both studies provide reference directions regarding the accuracy of big data in predicting market demand in the tourism economy.

In this paper, we first summarize four market demand forecasting analysis steps based on tourism consumption behavior to construct a forecasting framework for tourism market demand. For the time series data of tourist numbers, dynamic time warping (DTW) is used to measure time series similarity, and the absolute level distance and dynamic change distance are combined to measure the similarity between the numbers of travelers from each country. To address the shortcomings of ARIMA in extracting relevant information from seasonal time series data, a seasonal product tourism demand forecasting model is constructed, and the inbound tourism market of Xiamen City, Fujian Province, is modeled, analyzed, and forecast through difference operations and non-stationary time series analysis. The combined distance method is selected to cluster the tourism data, and the different characteristics of each source market are analyzed. The seasonal fluctuation of the tourism data is reduced by taking logarithms and seasonal differences, the forecasting model is selected according to the adjusted R-squared value, AIC, and MAPE, and the model is then used to forecast tourist arrivals in Xiamen City.

2

Analysis framework for market demand forecasting based on tourism big data

To study the relationship between tourism big data and tourism market demand, we can start from the research framework of tourism consumption behavior.

Tourism consumption behavior refers to the behaviors and activities of people consuming tourism products or services (mainly covering the six elements of food, accommodation, transportation, sightseeing, shopping, and entertainment). Its research framework can be derived by analogy with the research framework of general consumer behavior (i.e., people’s personal purchases of means of living). In general, the study of consumer behavior identifies the various factors that influence consumers’ decisions before and during the purchase of goods or services. It is generally believed that consumer behavior is formed by a combination of iterative, interacting activities carried out across a series of processes, from the confirmation of demand to the actual post-purchase evaluation.

The first part is information collection, in which the consumer gathers product-related information through recommendations from friends and relatives, mass media publicity, personal experience, and other means. Rational decision-making cannot be separated from information, and this is even more true of tourism purchase decisions. Tourism spending is generally large, so consumers want a comprehensive understanding of tourism goods and services and collect information from a variety of channels. Today, with the Internet and mobile Internet highly developed and the volume of information growing explosively, the difficulty has shifted from collecting information to finding effective information. In addition, the unavailability of tourism goods and services is a major factor, as are characteristics such as non-storability, non-transferability, inseparability of production and consumption, and high demand elasticity. Together these lead to high risk in tourism consumption, so consumers also prefer to reduce this risk through comprehensive information collection.

After comprehensive information collection, the second part is the assessment of choices, which refers to analyzing and weighing the information obtained and making preliminary choices. Consumers’ evaluation is based on the information collected in the previous part and mainly involves comparing and choosing among tourism goods or services and making a personal value judgment. This judgment varies between consumers because there are many influencing factors, including price, quality, time, and location. After the evaluation is completed and the most satisfactory tourism goods or services are selected, the next part, the purchase decision, follows naturally. It is the final expression of the consumer’s intention to purchase, for example joining a tour group, purchasing an air ticket or a boat ticket, or completing a hotel room reservation.

The third part, which comes at the end of a trip, is the evaluation of the post-purchase consumption effect, including post-purchase satisfaction and the attitude towards repurchase. A good experience of tourism goods or services leads consumers to give positive evaluations and word-of-mouth publicity, while a poor experience leads to negative evaluations. With the Internet and mobile Internet making it easier to share experiences, this publicity and evaluation effect is amplified.

The fourth part is the confirmation of demand, which means that consumers develop a demand because of their own feelings or external stimuli. Consumers’ demand for tourism is not created out of thin air but arises from short-term or long-term reasons or stimuli. Short-term stimuli include recommendations from friends and relatives, travel photos shared in one’s circle of friends on social media, and tourism information published in newspaper advertisements. Long-term reasons include planning for a future period of relaxation. With these intrinsic or extrinsic triggers of travel demand, the consumer’s travel purchase decision is formed.

To sum up, from the perspective of consumer behavior and people’s habit of using the Internet in general nowadays, all the five parts of tourism consumption behavior will leave traces on the Internet and form Internet search keywords. Therefore, this paper will construct an analytical framework for the interconnection between the keyword network search index and tourism market demand in tourism big data to elucidate the correlation between the two and lay the foundation for the next stage of tourism market demand forecasting. The analytical framework is shown in Figure 1.

3

Data processing and seasonal product modeling

Cluster analysis is an unsupervised classification process that can be applied without a priori knowledge, and it plays an extremely important role in data analysis, pattern recognition, outlier detection, and refined services. With a suitable clustering method, data can be analyzed at a deeper level to find the patterns hidden within them.

3.1

Data processing method based on K-means clustering

3.1.1

K-means clustering methods

The K-means algorithm is a classical clustering algorithm that minimizes the distance between points within each cluster (maximizing within-cluster similarity) while keeping different clusters as far apart as possible. The steps are as follows. First, the number of clusters K is set by the user, and K initial centroid positions are generated randomly. Using an appropriate distance measure, the distance from each point to each centroid is calculated, and each point is assigned to its nearest centroid according to the distance minimization rule, so that every point belongs to a cluster. Because the initial K centroid positions may not be optimal, the cluster centers must be updated repeatedly. Each centroid is moved to the mean of the points in its cluster, and assignment and updating are repeated until the centroids no longer change or the magnitude of the change meets the convergence criterion. The final clustering result is the K clusters formed around the K centroids that no longer change [29].
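The assign-and-update loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it assumes squared Euclidean distance, picks the initial centroids from the data points themselves, and the function name `kmeans` and its parameters are hypothetical.

```python
import random

def kmeans(points, k, max_iter=100, tol=1e-12, seed=0):
    """Plain K-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are K distinct data points chosen at random.
    centroids = rng.sample(points, k)

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def nearest(p):
        dists = [sq_dist(p, c) for c in centroids]
        return dists.index(min(dists))

    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
        shift = max(sq_dist(o, n) for o, n in zip(centroids, updated))
        centroids = updated
        if shift <= tol:  # stop once the centroids no longer move
            break

    labels = [nearest(p) for p in points]
    return centroids, labels
```

For time series, the `sq_dist` helper would be swapped for a sequence-aware measure such as DTW, in which case the mean-based centroid update is only an approximation.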

For the K-means clustering algorithm, the distance metric can take various forms, such as the Euclidean distance, Manhattan distance, and Minkowski distance. Which distance is used in practice should be chosen according to the characteristics of the application. Since this paper clusters time series, point-wise metrics such as the Euclidean distance are not well suited, so Dynamic Time Warping (DTW) is used to measure time series similarity.

Dynamic Time Warping (DTW) is an algorithm that uses the idea of dynamic programming to sequentially match the similarity points between two time series, and uses the sum of the distances of all the similarity points to determine the similarity of the two time series, and the smaller the total distance is, the higher the similarity is [30].

Suppose there are two time series T and R of lengths n and m respectively (T = t₁, t₂, …, t_i, …, t_n; R = r₁, r₂, …, r_j, …, r_m). Construct an n × m distance matrix D, whose element D(i, j) represents the distance between t_i and r_j. The warping path is represented as W = w₁, w₂, …, w_K, with max(n, m) ≤ K ≤ n + m − 1, where each w_k has the form (i, j) and i, j are the indices of the matched points in the two time series. DTW requires that the first and last points of the two time series are aligned, i.e., w₁ = (1, 1) and w_K = (n, m). In addition, the warping path has to be monotonically increasing, which ensures that every point of both sequences is visited and that the path does not fold back or cross itself. Monotonicity of the warping path means: (1) $$w_k = (i, j),\ w_{k+1} = (i', j'),\ i \le i' \le i + 1,\ j \le j' \le j + 1$$

Construct the cumulative distance matrix WD by dynamic programming, with element WD(i, j) given by: (2) $$WD(i, j) = D(i, j) + \min \left[ WD(i - 1, j),\ WD(i - 1, j - 1),\ WD(i, j - 1) \right]$$

Finally, the points w_k of the warping path can be found by backtracking on the cumulative distance matrix WD. WD(n, m) is the DTW distance used to compare the similarity of the two time series, and the smaller this distance, the higher the similarity.
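As a minimal sketch of the cumulative-distance recursion in Eq. (2) and the backtracking step, the function below computes WD(n, m) and one optimal path. It assumes D(i, j) = |t_i − r_j| for scalar observations and pads the matrix with an infinite border so that the path starts at (1, 1); the name `dtw_distance` is illustrative.

```python
def dtw_distance(t, r):
    """Return (WD(n, m), path) for two numeric sequences, using 1-based path indices."""
    n, m = len(t), len(r)
    inf = float("inf")
    # wd[i][j] holds WD(i, j); row 0 and column 0 are an infinite border.
    wd = [[inf] * (m + 1) for _ in range(n + 1)]
    wd[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(t[i - 1] - r[j - 1])  # assumed point-wise distance D(i, j)
            wd[i][j] = d + min(wd[i - 1][j], wd[i - 1][j - 1], wd[i][j - 1])

    # Backtrack from (n, m) to (1, 1), always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i, j))
        _, i, j = min(
            (wd[i - 1][j - 1], i - 1, j - 1),
            (wd[i - 1][j], i - 1, j),
            (wd[i][j - 1], i, j - 1),
        )
    path.reverse()
    return wd[n][m], path
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` gives a distance of 0 because the repeated 2 in the second series is absorbed by matching it twice to the single 2 in the first.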

3.1.2

Systematic clustering methods

In this paper, we construct the following statistics to measure similarity between individuals. 1)

The full-period “absolute level” distance between individual i and individual j, abbreviated as d_ij(AQED): (3) $$d_{ij}(AQED) = \sqrt{\sum_{t = 1}^{T} (x_{it} - x_{jt})^{2}}$$

2)

The full-period “dynamic change” distance between individual i and individual j, i.e., the growth rate distance, abbreviated as d_ij(ISED): (4) $$d_{ij}(ISED) = \sqrt{\sum_{t = 1}^{T} \left( \frac{\Delta x_{it}}{x_{i,t-1}} - \frac{\Delta x_{jt}}{x_{j,t-1}} \right)^{2}}$$

where Δx_it = x_i,t − x_i,t−1, and $\frac{\Delta x_{it}}{x_{i,t-1}}$ measures the relative rate of increase between two adjacent periods. The absolute level distance d_ij(AQED) portrays how far apart individuals are in terms of absolute level over the entire period, and the dynamic change distance d_ij(ISED) measures how much individuals differ in terms of rate of increase over the entire period. The larger the difference in absolute level, the larger d_ij(AQED); the larger the difference in dynamic change, the larger d_ij(ISED).

3)

The “combined” distance between individual i and individual j, abbreviated as d_ij(CED): (5) $$d_{ij}(CED) = \alpha \cdot d_{ij}(AQED) + \beta \cdot d_{ij}(ISED)$$

The combined distance d_ij(CED) is a weighted combination of the absolute level distance and the dynamic change distance. The weights α and β are chosen so that, over all data, the dispersion of the absolute level distances is the same as or similar to the dispersion of the dynamic change distances. Specifically, the standard deviations of d_ij(AQED) and d_ij(ISED) are used as the reference indicator, i.e., α and β are determined so that: (6) $$\alpha \cdot SD[d_{ij}(AQED)] = \beta \cdot SD[d_{ij}(ISED)], \quad i \ne j \in \{1, 2, \ldots, N\}$$ (7) $$\alpha + \beta = 1$$

Here SD[d_ij(AQED)] and SD[d_ij(ISED)] are the standard deviations of the absolute level distances and the dynamic change distances, respectively. Solving Eqs. (6) and (7) gives: (8) $$\alpha = \frac{SD[d_{ij}(ISED)]}{SD[d_{ij}(AQED)] + SD[d_{ij}(ISED)]}$$ (9) $$\beta = \frac{SD[d_{ij}(AQED)]}{SD[d_{ij}(AQED)] + SD[d_{ij}(ISED)]}$$
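The three distances and the weights in Eqs. (3) to (9) can be sketched as follows. This is an illustration under stated assumptions: the growth-rate sum starts at the second observation (the first has no predecessor), the sample standard deviation is used for SD, and all function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import stdev

def aqed(x, y):
    """Absolute level distance, Eq. (3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def growth_rates(x):
    """Period-on-period growth rates; defined from the second observation on."""
    return [(x[t] - x[t - 1]) / x[t - 1] for t in range(1, len(x))]

def ised(x, y):
    """Dynamic change (growth rate) distance, Eq. (4)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(growth_rates(x), growth_rates(y))))

def combined_distances(series):
    """Return (alpha, beta, {pair: CED}) for a dict mapping name -> list of values.

    Needs at least three series so that the standard deviations are defined.
    """
    pairs = list(combinations(series, 2))
    d_abs = {p: aqed(series[p[0]], series[p[1]]) for p in pairs}
    d_dyn = {p: ised(series[p[0]], series[p[1]]) for p in pairs}
    sd_abs = stdev(list(d_abs.values()))  # assumption: sample standard deviation
    sd_dyn = stdev(list(d_dyn.values()))
    alpha = sd_dyn / (sd_abs + sd_dyn)    # Eq. (8)
    beta = sd_abs / (sd_abs + sd_dyn)     # Eq. (9)
    ced = {p: alpha * d_abs[p] + beta * d_dyn[p] for p in pairs}  # Eq. (5)
    return alpha, beta, ced
```

Because visitor counts are in the hundreds of thousands while growth rates are fractions, α is typically tiny and β close to 1; the weighting exists precisely to stop the absolute level term from dominating the combined distance.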

The systematic (hierarchical) clustering method is one of the most commonly used cluster analysis methods, and its clustering process depends on the definition of the distance between individuals and the distance between classes. The steps are as follows: first, each sample is treated as a class of its own; then the similarity statistic between classes is determined and the two closest classes are merged into a new class; the merged classes are then merged again according to their interclass distances, so that the data samples are aggregated layer by layer until all samples belong to a single class. This type of method yields a clustering tree of the data samples, which allows the clustering to be cut off at any level [31]. The interclass distances commonly used in practice are the shortest distance method, the longest distance method, the middle distance method, the center of gravity method, the class average method, the variable class average method, and Ward’s method. In this paper, Ward’s method is used as the measure of interclass distance.
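A minimal sketch of the bottom-up merging procedure with Ward's criterion is given below. It uses the Lance-Williams update on squared distances, which reproduces Ward's method exactly only when the input distances are Euclidean; applying it to another distance (such as a combined distance) is an assumption of this sketch. The function name and return format are illustrative.

```python
def ward_clustering(dist):
    """Agglomerative clustering with Ward's criterion on a symmetric distance matrix.

    Returns the merge history as (members_a, members_b, merge_height) tuples.
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # every sample starts as its own class
    # Squared distances between active clusters, keyed by (smaller_id, larger_id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Merge the pair of classes with the smallest current distance.
        (a, b), h2 = min(d2.items(), key=lambda kv: kv[1])
        na, nb = len(clusters[a]), len(clusters[b])
        new_d2 = {}
        for k in clusters:
            if k in (a, b):
                continue
            nk = len(clusters[k])
            dak = d2[(min(a, k), max(a, k))]
            dbk = d2[(min(b, k), max(b, k))]
            # Lance-Williams update for Ward's method.
            new_d2[k] = ((na + nk) * dak + (nb + nk) * dbk - nk * h2) / (na + nb + nk)
        merges.append((sorted(clusters[a]), sorted(clusters[b]), h2 ** 0.5))
        merged = clusters.pop(a) + clusters.pop(b)
        d2 = {key: v for key, v in d2.items() if a not in key and b not in key}
        clusters[next_id] = merged
        for k, v in new_d2.items():
            d2[(k, next_id)] = v  # next_id is larger than every existing id
        next_id += 1
    return merges
```

Cutting the merge history at a chosen height (or after a chosen number of merges) gives a flat partition, which corresponds to reading the clustering tree at one level.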

3.2

Seasonal Product Tourism Demand Forecasting Model

In real life, most time series data, especially quarterly or monthly data, show strong seasonality. The long-term trend, seasonal fluctuations, and stochastic fluctuations in such series interact in complex ways, and the ordinary ARIMA model is not sufficient to extract the relevant information. Therefore, the ARIMA model is extended to construct the seasonal product model, also known as the seasonal autoregressive integrated moving average (SARIMA) model.

3.2.1

Stationarity test of time series data

For a time series {X_t, t ∈ T}, the series is said to be weakly stationary (wide-sense stationary) if the following three conditions are satisfied simultaneously: 1)

∀t ∈ T, $E X_{t}^{2} < \infty$.

2)

∀t ∈ T, which satisfies that the mean is constant, i.e.: EX_t = μ, where μ is a constant.

3)

∀t, s, k ∈ T with k + s − t ∈ T, the autocovariance function is independent of the starting and ending points in time and depends only on the length of the time shift, i.e., γ(t, s) = γ(k, k + s − t), where the autocovariance function is γ(t, s) = E(X_t − μ_t)(X_s − μ_s).

Stationarity in time series analysis generally refers to weak stationarity, which is also known as wide-sense or second-order stationarity. A series that does not satisfy the stationarity conditions is called a non-stationary series. The stationarity test is the basis and premise of time series modeling, and there are two main test methods. 1)

Graph test method

A time series plot is a two-dimensional chart with time on the horizontal axis and the series values on the vertical axis, which intuitively reflects the basic distribution characteristics of the time series. When the series always fluctuates randomly around a constant value and the range of fluctuation has fairly clear bounds, the series is usually stationary. Conversely, if the time series plot shows obvious periodic or trend characteristics, the series is usually non-stationary. To be on the safe side, autocorrelation plots should be used to assist identification after the time series plot has been examined.

An autocorrelation plot is a two-dimensional hanging-line plot in which the horizontal axis represents the autocorrelation coefficient and the vertical axis represents the number of delay periods, while the magnitude of the autocorrelation coefficient is represented by the hanging line. A stationary time series usually has only short-term correlation, which means that its autocorrelation coefficient $\hat{\rho}_{k}$ decays to zero quickly as the number of delay periods k increases, whereas the autocorrelation coefficient $\hat{\rho}_{k}$ of a non-stationary time series decays to zero relatively slowly.
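The decay behavior described here can be checked numerically with the sample autocorrelation coefficient. The sketch below uses the common estimator that divides the lag-k sum of cross-products by the lag-0 sum; the function name `sample_acf` is illustrative.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelation coefficients rho_hat_0 .. rho_hat_max_lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)  # lag-0 sum of squares
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]

# A trending series keeps a large autocorrelation at low lags (slow decay),
# which is the visual signature of non-stationarity.
trend_acf = sample_acf(list(range(50)), 5)
```

Plotting these coefficients against k gives the autocorrelation plot; a series whose coefficients stay far from zero over many lags should be differenced before modeling.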

2)

Hypothesis testing

Judging the stationarity of a time series from graphs is highly subjective, and hypothesis testing can overcome this limitation. Among the statistical methods for testing the stationarity of a series, the most widely used is the ADF unit root test. The principle is as follows:

For any pth-order autoregressive AR(p) process: (10) $$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + \varepsilon_t$$

Its characteristic equation is: (11) $$\lambda^p - \phi_1 \lambda^{p-1} - \phi_2 \lambda^{p-2} - \cdots - \phi_p = 0$$

The time series {x_t} is stationary if all the characteristic roots of Eq. (11) lie within the unit circle, i.e., |λ_i| < 1 for i = 1, 2, 3, ⋯, p. If a unit root exists, then {x_t} is non-stationary and the sum of the autoregressive coefficients is exactly equal to 1. To see this, substitute the unit root λ = 1 into Eq. (11), which gives 1 − ϕ₁ − ϕ₂ − ⋯ − ϕ_p = 0 and therefore: (12) $$\phi_1 + \phi_2 + \cdots + \phi_p = 1$$

Let ρ = ϕ₁ + ϕ₂ + ⋯ + ϕ_p − 1. The hypotheses of the ADF unit root test for the AR(p) process are then: (13) $$\begin{array}{l} H_0: \rho = 0 \ (\text{sequence } \{x_t\} \text{ is non-stationary}) \\ H_1: \rho < 0 \ (\text{sequence } \{x_t\} \text{ is stationary}) \end{array}$$

The test statistic is: (14) $$\tau = \hat{\rho} / S(\hat{\rho})$$

where $S(\hat{\rho})$ is the sample standard deviation of the estimate $\hat{\rho}$. The calculated value of the test statistic τ is compared with the table of critical values, and if the null hypothesis is rejected, the series is stationary.
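To make the statistic in Eq. (14) concrete, the sketch below computes τ for the simplest special case: an AR(1) process with no constant and no lagged difference terms, estimated by least squares from Δx_t = ρ·x_{t−1} + ε_t. The full ADF test adds lagged differences and, optionally, a constant or trend, so this is only an illustration; the function name `df_tau` is hypothetical. For this no-constant case the asymptotic 5% critical value is commonly tabulated as roughly −1.95.

```python
import math
import random

def df_tau(x):
    """Dickey-Fuller tau for dx_t = rho * x_{t-1} + e_t (no constant, no lags)."""
    lagged = x[:-1]
    dx = [b - a for a, b in zip(x[:-1], x[1:])]
    sxx = sum(v * v for v in lagged)
    rho_hat = sum(l * d for l, d in zip(lagged, dx)) / sxx   # OLS slope
    resid = [d - rho_hat * l for l, d in zip(lagged, dx)]
    s2 = sum(e * e for e in resid) / (len(dx) - 1)           # residual variance
    se_rho = math.sqrt(s2 / sxx)                             # S(rho_hat)
    return rho_hat / se_rho

# Hypothetical example: white noise is stationary, so tau is strongly negative.
rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(200)]
tau_noise = df_tau(white_noise)
```

A value of τ below the critical value rejects the unit-root hypothesis H₀; a steadily trending series such as 1, 2, …, 100 gives a positive τ and so fails to reject it.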

3.2.2

Differential operations

In time series analysis, the first step in analyzing time series observations, regardless of the method of analysis, is to take effective means to extract the deterministic information embedded in the time series observations.

From Cramer’s decomposition theorem, any time series can be decomposed into two parts, a deterministic trend given by a polynomial and a stationary zero-mean error: (15) $$x_t = \mu_t + \varepsilon_t = \sum_{j = 0}^{d} \beta_j t^j + \psi(B) a_t$$

where {a_t} is a white noise sequence with zero mean, B is the delay (lag) operator, β₀, β₁, …, β_d are constant coefficients, and d < ∞. Thus, {ε_t: ε_t = ψ(B)a_t} in Eq. (15) denotes the stochastic influence on the time series {x_t}, while the mean sequence $\left\{ \sum_{j = 0}^{d} \beta_j t^j \right\}$ reflects the deterministic influence in the time series {x_t}.

Performing a dth-order difference operation on a discrete series is analogous to taking the dth-order derivative of a continuous variable. From Eq. (15), the dth-order difference of the time series {x_t} can fully extract the deterministic information in {x_t}. Expanding the dth-order difference using the lag operator B yields Eq. (16): (16) $$\nabla^d x_t = (1 - B)^d x_t = \sum_{i = 0}^{d} (-1)^i C_d^i x_{t-i}$$

Rearranging gives: (17) $$x_t = \sum_{i = 1}^{d} (-1)^{i+1} C_d^i x_{t-i} + \nabla^d x_t$$
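The identities in Eqs. (16) and (17) can be verified numerically. The sketch below compares repeated first differencing with the binomial expansion and then rebuilds x_t from its d previous values plus the dth-order difference; the function names are illustrative.

```python
from math import comb

def diff(x, d=1):
    """d-th order difference by applying the first difference d times."""
    for _ in range(d):
        x = [b - a for a, b in zip(x[:-1], x[1:])]
    return x

def diff_binomial(x, d):
    """d-th order difference via the binomial expansion of (1 - B)^d, Eq. (16)."""
    return [
        sum((-1) ** i * comb(d, i) * x[t - i] for i in range(d + 1))
        for t in range(d, len(x))
    ]

def reconstruct(x, d):
    """Rebuild x_t from x_{t-1..t-d} and the d-th order difference, Eq. (17)."""
    nd = diff_binomial(x, d)
    return [
        sum((-1) ** (i + 1) * comb(d, i) * x[t - i] for i in range(1, d + 1)) + nd[t - d]
        for t in range(d, len(x))
    ]

# A quadratic trend is removed completely by second-order differencing.
quadratic = [t ** 2 for t in range(8)]
second_diff = diff(quadratic, 2)   # constant sequence of 2s
```

The quadratic example illustrates the link to Eq. (15): a degree-d polynomial trend becomes a constant after d differences and vanishes after d + 1.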

From Eq. (17), it can be seen that dth-order differencing is in essence a dth-order autoregressive process. The historical data $x_{t-1}, \ldots, x_{t-d}$ from the previous d periods are used as the independent variables to explain the movement of the current value $x_t$, i.e., the deterministic information is extracted using autoregression. 1)

pth order differencing

pth-order differencing is the operation of performing another 1st-order differencing operation on the time series after (p − 1)th-order differencing. It is denoted as: (18) $$\nabla^p x_t = \nabla^{p-1} x_t - \nabla^{p-1} x_{t-1}$$

For example, 1st order differencing is the operation of subtraction between two time series values separated by one period, i.e.: ∇x_t = x_t − x_t−1, where ∇x_t denotes the 1st order differencing of x_t. 2nd order differencing is the process of performing one 1st order differencing operation on a time series followed by another 1st order differencing operation, i.e.: ∇²x_t = ∇x_t − ∇x_t−1, where ∇²x_t denotes the 2nd order differencing of x_t. 2)

k-step differencing

The k-step differencing operation is a subtraction operation between two time series values that are separated by k periods. It is denoted as: (19) $$\nabla_k x_t = x_t - x_{t-k}$$

where ∇_kx_t denotes the k-step difference of x_t.
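As a small illustration of Eq. (19), the sketch below applies 4-step differencing to a hypothetical quarterly series made of a linear trend plus a fixed seasonal pattern. The data and the function name `k_step_diff` are invented for the example.

```python
def k_step_diff(x, k):
    """k-step difference: x_t - x_{t-k}, defined from the (k+1)-th observation on."""
    return [x[t] - x[t - k] for t in range(k, len(x))]

# Hypothetical quarterly data: trend of 10 per quarter plus a repeating seasonal offset.
season = [5, -3, 8, -10]
quarterly = [10 * t + season[t % 4] for t in range(16)]

seasonal_diff = k_step_diff(quarterly, 4)     # seasonal pattern cancels, leaving 40s
both_diffs = k_step_diff(seasonal_diff, 1)    # a further 1-step difference leaves 0s
```

With the step k set to the seasonal period, the repeating pattern cancels exactly and only the trend increment over one full period remains, which an ordinary first difference then removes.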

3.2.3

Non-stationary time series analysis

1)

ARIMA model

The difference operation has a strong ability to extract deterministic information, and most non-stationary time series show the properties of a stationary series after differencing. If the series $\{x_t\}$ is non-stationary but becomes stationary after differencing and passes the stationarity test, then $\{x_t\}$ is called a difference-stationary series, and an autoregressive integrated moving average (ARIMA) model can be established for it. In essence, the ARIMA(p, d, q) model is a combination of the difference operation and the autoregressive moving average ARMA(p, q) model [32].

The ARIMA model can be abbreviated as: (20) $$\phi(B) \nabla^d x_t = \vartheta(B) \varepsilon_t$$

where ϕ(B) is the autoregressive coefficient polynomial, i.e., ϕ(B) = 1 − ϕ₁B − ϕ₂B² − ⋯ − ϕ_pB^p, and ϑ(B) is the moving average coefficient polynomial, i.e., ϑ(B) = 1 − θ₁B − θ₂B² − ⋯ − θ_qB^q. {ε_t} is a white noise series with zero mean, d denotes the order of differencing, and ∇ denotes the difference operator, with ∇ = 1 − B and ∇x_t = x_t − x_t−1. 2)

Seasonal product model (SARIMA)

The seasonal product model is built by applying ordinary differencing and then seasonal differencing to the original non-stationary time series and fitting the model once the transformed series passes the stationarity test. Because seasonal effects and short-term correlations are related multiplicatively, the fitted model is in fact the product of ARMA(p, q) and ARMA(P, Q). Combining the dth-order trend differencing and the Dth-order seasonal differencing with the period S as the step size gives the complete seasonal product model: (21) $$\nabla^d \nabla_S^D x_t = \frac{\vartheta(B) \vartheta_S(B)}{\phi(B) \phi_S(B)} \varepsilon_t$$

In the formula: (22) $$\vartheta(B) = 1 - \theta_1 B - \cdots - \theta_q B^q, \quad \phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$$ (23) $$\vartheta_S(B) = 1 - \theta_1 B^S - \cdots - \theta_Q B^{QS}, \quad \phi_S(B) = 1 - \phi_1 B^S - \cdots - \phi_P B^{PS}$$

The seasonal product model can be written compactly as ARIMA(p, d, q)(P, D, Q)_S, where p is the autoregressive order, d is the number of differences needed to make the original series stationary, q is the moving average order, P is the seasonal autoregressive order, D is the seasonal difference order, and Q is the seasonal moving average order.
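The "product" in the model name can be made concrete by multiplying the lag polynomials. The sketch below represents a polynomial in B as a list of coefficients (index = power of B) and expands the differencing operator and the moving average side for the hypothetical case ARIMA(0, 1, 1)(0, 1, 1)₁₂. The coefficient values 0.4 and 0.6 are invented, and the seasonal coefficient is written `big_theta1` only to keep it apart from the non-seasonal `theta1`.

```python
def poly_mul(a, b):
    """Multiply two lag polynomials given as coefficient lists (index = power of B)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def difference_poly(d, D, S):
    """Coefficients of (1 - B)^d (1 - B^S)^D."""
    poly = [1.0]
    for _ in range(d):
        poly = poly_mul(poly, [1.0, -1.0])
    seasonal = [1.0] + [0.0] * (S - 1) + [-1.0]
    for _ in range(D):
        poly = poly_mul(poly, seasonal)
    return poly

# Left-hand side of the model for d = 1, D = 1, S = 12.
lhs = difference_poly(1, 1, 12)

# Right-hand side: (1 - theta1 * B)(1 - big_theta1 * B^12), hypothetical coefficients.
theta1, big_theta1 = 0.4, 0.6
rhs = poly_mul([1.0, -theta1], [1.0] + [0.0] * 11 + [-big_theta1])
```

The non-zero entries of `lhs` sit at powers 0, 1, 12, and 13, so this model relates x_t to x_{t−1}, x_{t−12}, and x_{t−13}; the cross term at power 13 in `rhs` (theta1 times big_theta1) is the multiplicative interaction between short-term and seasonal effects.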

4

Forecast analysis of demand in the tourism economy market

4.1

Fujian Province Inbound Tourism Market Cluster Analysis

In this paper, the inbound tourism source markets of Fujian Province are clustered and analyzed using the number of inbound tourists received in Fujian Province from 2014 to 2019, as reported in the Fujian Statistical Yearbook 2019, and the combined distance method is selected for clustering. Table 1 shows the number of inbound tourists to Fujian Province from each source market. Japan, South Korea, Hong Kong (China), and Taiwan (China) are at the top of the list, each exceeding 400,000 visitors in 2019.

Table 1.

Number of inbound tourists to Fujian Province by source market, 2014-2019 (persons)

Source market	2014	2015	2016	2017	2018	2019
Japan(1)	162902	202580	262992	223202	348256	422070
South Korea(2)	89334	121731	216002	202790	297193	454552
Malaysia(3)	40004	48655	99031	81375	151483	177845
United States(4)	65679	72984	90742	75071	129820	176730
Singapore(5)	32430	37698	69847	57267	99759	108600
Thailand(6)	20452	26852	48884	32991	67780	78972
Germany(7)	23737	27613	30073	27600	55078	71956
Italy(8)	15345	21770	25838	26992	44086	62361
France(9)	18096	22111	28512	22391	41139	55919
Indonesia(10)	14788	18034	22338	24002	37792	42593
Australia(11)	13068	16741	19630	20180	34600	45941
England(12)	14154	16336	20863	19936	30685	43617
Chinese Hongkong(13)	180728	221070	265659	289380	409393	451129
Chinese Taiwan(14)	271175	399666	524182	427180	533778	640391

The above data were entered into SPSS statistical software, and the clustering results are shown in Figure 2. Using the shortest distance method to cluster the inbound tourism source markets of Fujian Province, the following conclusions are drawn:

1)

In the clustering of the 14 source markets from 2014 to 2019, the Taiwan, China source market clearly forms a category of its own. Taiwan is geographically close to Fujian Province, many Taiwanese trace their ancestry to Fujian, and many have invested and done business in Fujian in recent years. Together these factors have laid a good foundation for Taiwan to become an important source of tourists for Fujian. At present, Taiwanese tourists come to Fujian mainly to visit relatives, for sightseeing, and on business.

2)

The Japanese source market and the Hong Kong, China source market form one category. Japan has long been a major overseas source market for Fujian Province, as it has for China as a whole. Japan's rapid post-war economic development and the increase in residents' leisure time have made outbound tourism a symbol of fashion and a high quality of life in Japan. Spatial proximity and cultural similarity have led more and more Japanese tourists to choose China as one of their major overseas destinations. Fujian Province, as one of the more economically developed regions in mainland China, is particularly favored by Japanese tourists for its attractive natural and cultural environment.

Hong Kong, China has also been one of the key inbound source markets for Fujian Province, and its growth trend is in line with that of Japan. The Mazu Cultural Festival held in Fujian has expanded Fujian's influence as a tourist destination. In addition, tourists from Hong Kong and Macao are interested in Gulangyu Island and other famous attractions.

3)

The South Korean source market forms a category of its own. South Korea has developed close relations with China, with very frequent economic and cultural exchanges, and its inbound tourism market has maintained rapid growth. Korean tourists come to Fujian mainly for business. In addition, as economic and cultural exchanges between Fujian and South Korea continue to increase, study and exchange visits have become a fast-growing segment. The statistics show that South Korea, as an emerging source country for inbound tourism in Fujian Province, has a high growth rate and great potential, and should be the top priority for Fujian Province's future publicity and marketing.

4)

The U.S. source market and the Malaysian source market form one category. The United States is one of the world's largest tourist-generating countries and, after Japan and South Korea, the third-largest source country for Fujian Province. The U.S. source market in Fujian is relatively stable and has maintained steady growth. U.S. tourists travel to China mainly for sightseeing and leisure and for business and conference tourism, as well as to visit friends and relatives. For U.S. tourists to Fujian, sightseeing remains the main purpose, followed by business activities, whose share has increased significantly.

The Malaysian source market is an emerging market for inbound tourism in Fujian Province, and its development potential should not be ignored. Malaysia has a large population base and a large middle class. Like the U.S. source market, it has important growth potential.

5)

The Thai source market and the Singaporean source market form one category. Southeast Asia is a traditional source of tourists for China. Singapore has a strong investment base in Fujian, and each year many Singaporeans visit Fujian on study tours, economic and trade visits, and science, technology, and cultural exchanges. The Singapore Buddhist Association also holds annual exchanges in Xiamen and other places, and the local customs of southern Fujian are quite attractive to Singaporeans. Thai tourists to Fujian span a wide range of ages and mainly tour the commodity markets of Xiamen and Putian. In recent years, trade activities have become an important motivation for their travel. These two source markets are grouped together because of their similar geographic location and similar trends in tourism to Fujian.

6)

The German, Italian, French, Indonesian, Australian, and British source markets form one category. Tourists from Western European countries come to China mainly for business and for sightseeing vacations. The number of Italian tourists to Fujian has grown steadily in recent years and the market outlook is favorable, with business and leisure visitors predominating. The number of French tourists to Fujian has also grown quickly in recent years. The United Kingdom sends more tourists to China than any other European country, but relatively few of them visit Fujian. Germany is also a market worthy of vigorous tourism promotion. The Australian source market is an emerging source of tourists for China. According to the forecast of the Australian Tourism Forecasting Council, the average annual growth rate of Australian travelers to China will be higher than that to any other destination, making it a market with great potential. The Western European and Australian source markets can thus be regarded as a newly emerging category of source markets for Fujian Province: the number of tourists from these markets will not grow rapidly in the short term, but there is potential to be developed in the longer term.
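The grouping above was produced in SPSS. As a rough sketch of the same idea, the Table 1 counts can be clustered with single-linkage (shortest distance) hierarchical clustering in Python. The use of SciPy, Euclidean distance on unscaled counts, and a cut at six clusters are assumptions made for illustration. The paper does not state its distance measure or scaling, so the groups printed here need not match Figure 2.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Annual inbound tourist counts for 2014-2019, copied from Table 1.
MARKETS = {
    "Japan": [162902, 202580, 262992, 223202, 348256, 422070],
    "South Korea": [89334, 121731, 216002, 202790, 297193, 454552],
    "Malaysia": [40004, 48655, 99031, 81375, 151483, 177845],
    "United States": [65679, 72984, 90742, 75071, 129820, 176730],
    "Singapore": [32430, 37698, 69847, 57267, 99759, 108600],
    "Thailand": [20452, 26852, 48884, 32991, 67780, 78972],
    "Germany": [23737, 27613, 30073, 27600, 55078, 71956],
    "Italy": [15345, 21770, 25838, 26992, 44086, 62361],
    "France": [18096, 22111, 28512, 22391, 41139, 55919],
    "Indonesia": [14788, 18034, 22338, 24002, 37792, 42593],
    "Australia": [13068, 16741, 19630, 20180, 34600, 45941],
    "England": [14154, 16336, 20863, 19936, 30685, 43617],
    "Hong Kong, China": [180728, 221070, 265659, 289380, 409393, 451129],
    "Taiwan, China": [271175, 399666, 524182, 427180, 533778, 640391],
}


def cluster_markets(n_clusters=6):
    """Group source markets by single-linkage hierarchical clustering."""
    names = list(MARKETS)
    counts = np.array([MARKETS[name] for name in names], dtype=float)
    # Assumption: Euclidean distance on raw counts, with no standardization.
    tree = linkage(counts, method="single", metric="euclidean")
    # Cut the dendrogram so that at most n_clusters groups remain.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    groups = {}
    for name, label in zip(names, labels):
        groups.setdefault(int(label), []).append(name)
    return list(groups.values())


if __name__ == "__main__":
    for members in cluster_markets():
        print(", ".join(members))
```

Under these assumptions Taiwan, China is the last market to merge and so remains a singleton at any cut, in line with conclusion 1). The composition of the other groups is sensitive to scaling and linkage choices.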

4.2

Tourism demand forecasting based on seasonal product modeling

4.2.1

Data processing

The data selected for this paper are monthly tourist trips to Xiamen from January 2014 to May 2019 (data source: Xiamen Tourism Network). The original series of tourist trips is shown in Figure 3.

Analysis of the time series characteristics shows that the series is clearly non-stationary and seasonal, with cyclical fluctuations and distinct peak seasons each year around April to May, July to August, and October.

To eliminate the trend and reduce the fluctuation of the series, the logarithm of the original series is taken and the resulting series is named ly. Its time series plot is shown in Fig. 4, and the series is still not stationary.

First-order differencing is applied to the series ly, and the resulting series is named ily. Its autocorrelation and partial autocorrelation analysis is shown in Fig. 5. The figure shows that the trend of the series is basically eliminated, but at lag k = 12 the sample autocorrelation and partial autocorrelation coefficients are significantly different from zero, which indicates seasonality.

Seasonal differencing is then applied to the series ily to obtain the new series sily, whose autocorrelation and partial autocorrelation plots are shown in Fig. 6. The sample autocorrelation and partial autocorrelation coefficients of sily quickly fall within the random interval, so the seasonality of the series has been basically eliminated.

The unit root test is a formal method for testing the stationarity of a time series. To further test whether the series sily is stationary, an ADF unit root test is performed on it. Table 2 shows the results. The t-statistic of the test is -7.86162, which is smaller than every critical value listed, including the one at the 1% significance level, so the null hypothesis is rejected. It is concluded that the series has no unit root and that sily is therefore stationary.

Table 2.

Unit root test of sily sequence

Item	t	p
ADF test statistic	-7.86162	0.000
Critical value (1%)	-3.51473
Critical value (5%)	-2.88265
Critical value (10%)	-2.59831

4.2.2

Predictive Model Identification and Selection

In the model identification stage, we find that after taking logarithms and first-order differencing the trend of the series is basically eliminated, so d = 1, and that after first-order seasonal differencing the seasonality is basically eliminated, so D = 1. We therefore initially choose the seasonal product model (p, d, q) × (P, D, Q)_12 = (p, 1, q) × (1, 1, 1)_12. In addition, the partial autocorrelation plot of the series sily suggests that p = 2 or 1 is appropriate, and the autocorrelation plot suggests that q = 1 or 0 is appropriate. Taken together, the candidate (p, q) combinations are (2, 0), (2, 1), (1, 0), and (1, 1).

The Akaike information criterion (AIC), a criterion-function method for order determination, was used together with related diagnostics to determine the order of the model. The test results of the four candidate models are compared in Table 3. All four models satisfy the stationarity and invertibility conditions of the ARMA process, so the model specification is reasonable. In addition, the p-values of the white noise test on the residual series show that the residuals of each model satisfy the independence assumption and that the models fit well. Comparing the results in Table 3, model (2, 1) has the largest adjusted R² (0.83085), the smallest MAPE (4.57), and the highest white noise test p-value (0.893), while its AIC (-3.24701) and SC (-3.29563) are of similar magnitude to those of the other candidates. The ARIMA(2, 1, 1)(1, 1, 1)_12 model is therefore selected.

Table 3.

Test results of each model

(p, q)	Adjusted R²	AIC	SC	p-Q	MAPE
(2,1)	0.83085	-3.24701	-3.29563	0.893	4.57
(2,0)	0.80135	-3.62486	-3.36381	0.736	6.01
(1,1)	0.80279	-3.36972	-3.13434	0.854	5.19
(1,0)	0.82164	-3.43296	-3.31292	0.892	4.95

4.2.3

Tourism Market Demand Forecasting Model Construction

Based on the identification and selection above, we choose ARIMA(2, 1, 1)(1, 1, 1)_12 as the best prediction model. The parameter estimates and the related test results are shown in Table 4. The results show that the parameter estimates of the ARIMA(2, 1, 1)(1, 1, 1)_12 model are statistically significant at the 10% level. The prediction model is: $$(1 - 0.2046B^{12})(1 - 0.51131B - 0.19991B^{2})(1 - B)(1 - B^{12})\ln(IP_{t}) = (1 + 0.88047B^{12})\varepsilon_{t}$$

Table 4.

Model parameter estimation and relevant test results

Variable	Coefficient	Std. error	t	p
AR(1)	-0.51131	0.19472	-3.11772	0.01832
AR(2)	-0.19991	0.21130	-1.71485	0.08535
MA(1)	-0.47858	0.19733	-2.59064	0.01897
SAR(12)	-0.20460	0.20054	-2.32730	0.04667
MA(12)	0.88047	0.09894	-9.10831	-0.00464
R²	0.87094	Mean of the dependent variable		-0.01129
Adjusted R²	0.80094	Standard deviation of the dependent variable		0.08314
Regression standard error	0.01898	Akaike information criterion (AIC)		-3.63257
Residual sum of squares	0.01663	Schwarz criterion (SC)		-3.37128
Log likelihood	78.67004	Durbin-Watson statistic		2.03874

4.2.4

Model predictions

The above model is used to forecast tourist trips to Xiamen. The fitted results for 2014 to 2019 are shown in Figure 7. The predicted values are basically consistent with the actual values, which shows that the model fits well.

Using the above model, predicted tourist arrivals to Xiamen for June to December 2019 are given in Table 5. Compared with the actual data, the predictions are again basically consistent. The model therefore has clear reference value for predicting and analyzing tourist reception at tourist destinations.

Table 5.

Forecast results for the number of visitors from June to December 2019

Month	Real value	Predicted value
June	113.77	113.91
July	149.09	149.04
August	167.51	167.51
September	147.64	147.46
October	210.73	210.81
November	115.19	115.18
December	102.37	102.40
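The agreement in Table 5 can be quantified with the mean absolute percentage error (MAPE). The short sketch below computes it from the seven rows of Table 5. The function name `mape` is illustrative, and the values are used in whatever unit the table reports.

```python
# Real and predicted values for June to December 2019, copied from Table 5.
REAL = [113.77, 149.09, 167.51, 147.64, 210.73, 115.19, 102.37]
PREDICTED = [113.91, 149.04, 167.51, 147.46, 210.81, 115.18, 102.40]


def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


if __name__ == "__main__":
    print(f"MAPE for June to December 2019: {mape(REAL, PREDICTED):.4f}%")
```

For these seven months the MAPE comes out at roughly 0.05%, well below 1%.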

5

Conclusion

In view of the current difficulty of predicting demand in the tourism market, this paper performs systematic clustering of tourism big data using the comprehensive distance method and, on the basis of the clustering results, constructs a seasonal product model to predict market demand. Taking the tourism data of Fujian Province from 2014 to 2019 as an example, Taiwan, China forms the most distinct and independent category of source markets for Fujian Province, while the Australian source market has great potential. A stationary time series (p < 0.01) was obtained by logarithmic transformation, first-order differencing, and seasonal differencing, and the ARIMA(2, 1, 1)(1, 1, 1)_12 model was chosen for tourism demand forecasting. Between January 2014 and May 2019, the predicted number of tourist trips to Xiamen is basically consistent with the real value, and the model also predicts the June to December 2019 figures for Xiamen well. It can therefore be concluded that clustering of tourism data combined with the seasonal product model can improve the accuracy of market demand forecasts in the tourism economy.


Yonghe Yang

Published: 21 Mar 2025

Received: 09 Nov 2024

Accepted: 14 Feb 2025

DOI: https://doi.org/10.2478/amns-2025-0586

Keywords: Seasonal product model, ARIMA, Systematic clustering, Tourism demand forecasting

© 2025 Yonghe Yang, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
