Optimization of used price assessment model for new energy vehicles using machine learning
Published online: 17 Mar 2025
Received: 12 Nov 2024
Accepted: 20 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0226
© 2025 Han Wang, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
With the popularity of social media, the transaction volume of new energy vehicles in the used car market has been increasing steadily. Used car trading platforms continue to emerge, giving consumers more choices but also intensifying competition in the market. In this environment, pricing new energy vehicles appropriately is especially important [1-4]. Too high a price reduces the likelihood of closing a deal, while too low a price sacrifices profit and may even arouse unnecessary suspicion and doubt. Accurate used car price prediction therefore benefits both transaction participants and consumers in the market, and machine learning algorithms can solve this problem [5-8].
Machine learning studies how computers can simulate or implement human learning behaviors in order to acquire new knowledge or skills and reorganize the existing knowledge structure so as to continuously improve performance [9-10]. It is the core of artificial intelligence and the fundamental way to give computers intelligence; its applications run throughout the various fields of artificial intelligence, and it mainly relies on induction and synthesis rather than deduction [11-13]. Because machine learning allows computers to learn automatically from data and make predictions, it can be applied to many types of data, including used car prices. Its advantage is the ability to quickly and accurately extract useful information from massive amounts of data, which makes it popular in used car trading [14-17]. At present, machine learning applications in used car trading are still relatively narrow, but the potential of this market is huge. Through continuous innovation in data, detection, search and other technologies, machine learning is likely to become an important part of used car trading solutions [18-21].
The article uses machine learning algorithms to evaluate the second-hand price of new energy vehicles. After collecting and pre-processing data from the new energy used car trading market, it screens the characteristic variables for new energy used car valuation from three aspects, physical, functional and social factors, and establishes an index system for new energy used car valuation. The Stacking algorithm is fused with other machine learning algorithms to construct a new energy used car valuation model based on Stacking fusion, and the mean absolute error (MAE), root mean square error (RMSE), R-squared value (R2) and mean absolute percentage error (MAPE) are selected as measures of valuation performance. The price assessment results of the Stacking valuation model are compared with those of the traditional market method and the SVM model to test its valuation effect.
The data in this paper were not collected manually or through questionnaires; they come from public data on the AliCloud website, recorded as new energy used car transaction data generated by a trading platform. The original sample exceeds 400,000 records, each with 31 feature variables. From the original data, 150,000 records are randomly extracted as the training set, and 50,000 records each are randomly extracted as test dataset A and test dataset B. In line with the purpose of the study, price is selected as the dependent variable, and after data cleaning and variable reconstruction a total of 32 feature variables are finally used for modeling. Non-public variables have been anonymized and desensitized: for example, the vehicle transaction name contains a certain amount of information, so this string-type feature variable is replaced with a corresponding numerical code, and variables containing personal information such as name, brand, regionCode and model are desensitized. The final data are used for model building and model comparison, and the analysis tool used to process the data in this paper is Python.
To make the feature variables of the used automobiles in the dataset easier to understand, 16 of the variables are categorized in the following four ways:
Marker-type variables: saleID (transaction ID), name (vehicle transaction name), regionCode (area code), seller, offertype (offer type), createdate (vehicle listing date).
Time-type variables: regdate (vehicle registration date).
Categorical variables: brand (car brand), bodytype (body type), gearbox, fueltype (fuel type), notRepairedDamge (whether the car has unrepaired damage).
Numeric variables: model (model code), power (engine power), kilometer (kilometers driven), v-series features.
In real life, data generally contain missing and duplicate values, which greatly affect the results of data mining and cannot be used directly for model training. First of all, the data are examined and processed for missing values.
Missing values refer to attributes that exist in reality but were omitted during data collection. For continuous variables, there are several main methods for filling missing values:
(1) Filling using the mean. The missing value is replaced with the average of all other sample points for that feature variable. However, when there are extreme values in the data, the mean method produces large errors and cannot be used.
(2) Filling using regression equations. A regression equation is established from the complete data, or a regression model from machine learning is used. For a variable with missing values, the non-missing feature values are substituted into the regression equation, and the predicted value is used in place of the missing value. However, when the variables are not linearly correlated, the regression equation method can lead to estimation bias.
(3) K-nearest-neighbor filling. According to Euclidean distance, the K samples closest to the sample with missing data are found, and the weighted average of these K samples' values is used to fill the missing value.
There are several main methods to fill the missing values of subtype variables:
a) Filling using the mode, i.e., the value with the clearest concentration trend; this value replaces the missing data, although the result is not very precise.
b) Filling using the median, i.e., the value in the middle position, which replaces the missing data.
c) Multiple imputation. A filling model is trained for each missing value, generating M possible filling values and hence M complete datasets; each complete dataset is statistically analyzed to obtain a parameter set, and the M parameter sets are finally combined to obtain the filled value.
d) Treating missing values as a new category, i.e., all missing values are regarded as a special class of their own.
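The filling strategies above can be sketched with pandas and scikit-learn. The column names ("power", "fueltype") and toy values below are illustrative stand-ins, not the paper's actual preprocessing code:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "power":     [75.0, np.nan, 120.0, 150.0, np.nan, 90.0],  # continuous
    "kilometer": [12.5, 15.0,   3.0,   8.0,   9.0,    6.0],
    "fueltype":  [1.0,  2.0,    np.nan, 1.0,  1.0,    2.0],   # categorical
})

# (1) Mean filling for a continuous variable
df["power_mean"] = df["power"].fillna(df["power"].mean())

# (3) K-nearest-neighbor filling: the K closest rows (Euclidean
# distance over observed features) are averaged for each gap.
knn = KNNImputer(n_neighbors=2)
df[["power_knn", "km_knn", "fuel_knn"]] = knn.fit_transform(
    df[["power", "kilometer", "fueltype"]]
)

# a) Mode filling for a categorical variable
df["fueltype_mode"] = df["fueltype"].fillna(df["fueltype"].mode()[0])

# d) Treating missing values as a new category (here coded -1)
df["fueltype_cat"] = df["fueltype"].fillna(-1)
print(df[["power_mean", "power_knn", "fueltype_mode", "fueltype_cat"]])
```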
Outliers refer to sample points whose values deviate significantly from those of the remaining sample points, so they are also called abnormal values. Before handling outliers, one must first define them, and the criteria for judging outliers differ across datasets.
Standardization (normalization) is the most basic step of data cleaning. Because the evaluation indicators are measured on different scales, the data should be standardized to weaken the influence of differing scales on one another, so that the resulting data can be comprehensively compared across models. The following are commonly used data normalization methods:
Decimal scaling normalization is achieved by shifting the position of the decimal point, which ultimately maps the variable values into the range [-1, 1]. The normalization formula is as follows:
$$x' = \frac{x}{10^{j}}$$
where $x$ is the original value, $x'$ is the normalized value, and $j$ is the smallest integer such that $\max\left(|x'|\right) \le 1$.
Also known as deviation normalization, min-max normalization linearly maps the data into the range [0, 1]. The conversion formula is as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the variable, respectively.
Z-score standardization is applicable when the original data approximately follow a normal distribution. The method subtracts the data mean and divides by the standard deviation:
$$x' = \frac{x - \mu}{\sigma}$$
where $\mu$ is the mean of the variable and $\sigma$ is its standard deviation; the standardized data then have zero mean and unit variance.
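As a minimal sketch, the three normalization methods can be applied to a toy column as follows (the "power" values are illustrative only):

```python
import numpy as np

x = np.array([60.0, 110.0, 150.0, 245.0])  # illustrative engine powers

# Decimal scaling: shift the decimal point by j digits so that
# every value falls within [-1, 1].
j = int(np.ceil(np.log10(np.abs(x).max())))
x_decimal = x / 10**j

# Min-max (deviation) normalization: linear map onto [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: subtract the mean, divide by the
# standard deviation (assumes roughly normal data).
x_zscore = (x - x.mean()) / x.std()

print(x_decimal, x_minmax, x_zscore)
```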
Most scholars use multiple linear regression models, hierarchical analysis or the replacement cost method for new energy used car valuation, and few build their research on mathematical models and machine learning algorithms. In the past two years, a few scholars have explored used car valuation models using random forests and neural networks; machine learning models not only perform well in terms of accuracy but also excel at processing big data, which demonstrates the feasibility and applicability of machine learning algorithms in the valuation of new energy used cars.
The price of a new energy used car is affected by many factors, and more accurate price predictions require considering several of them comprehensively. This paper establishes a used car valuation index system from physical factor indicators, functional factor indicators and social factor indicators. The physical factor indicators include three variables: the time of first registration, the mileage, and the repair and maintenance condition of the vehicle. The functional factor indicators include three variables: vehicle brand, vehicle power, and vehicle transmission type. The social factor indicators include two variables: vehicle type and battery life.
The commonly used performance metrics in regression modeling are the mean absolute error (MAE), root mean square error (RMSE), R-squared (R2) and mean absolute percentage error (MAPE):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
where $y_i$ is the actual price of sample $i$, $\hat{y}_i$ is the predicted price, $\bar{y}$ is the mean of the actual prices, and $n$ is the number of samples.
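The four metrics can be computed directly from their definitions. The sketch below checks hand-rolled NumPy versions against scikit-learn's implementations on a tiny illustrative sample:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative actual and predicted prices (units: 10,000 yuan)
y_true = np.array([4.31, 14.99, 20.17, 8.63])
y_pred = np.array([5.22, 14.53, 20.33, 7.15])

mae  = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
r2   = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
mape = np.mean(np.abs((y_true - y_pred) / y_true))

# Cross-check the manual formulas against scikit-learn
assert np.isclose(mae,  mean_absolute_error(y_true, y_pred))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(r2,   r2_score(y_true, y_pred))
print(mae, rmse, r2, mape)
```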
In this paper, the dataset is randomly divided into 80% training data and 20% test data. The training data are used to train each model, the test data are used to make predictions, and the prediction results serve as the main basis for evaluating model performance. Based on the theoretical foundation of machine learning in the previous section, six single models are constructed: K-nearest-neighbor regression, support vector machine regression, random forest regression, Xgboost, LightGBM, and BP neural network.
K-nearest-neighbor regression, in essence, finds the K neighbors around a given sample and takes the average of those neighbors' attribute values as the corresponding attribute value of that sample [22].
The KNeighborsRegressor function in the sklearn module can be used to invoke the K-neighbors regression model. The score of the model test set, i.e., the R-squared value, is 0.9405 when the functions all use the given default parameters.
Support vector machine regression (SVR) is the application of the support vector machine (SVM) to regression problems. Fundamentally, it finds a regression plane that minimizes the distance from that plane to all the data by training on a large amount of data [23].
The support vector machine regression model is invoked using the LinearSVR function in the sklearn.svm module, and the score of the model test set, i.e., the R-squared value, is 0.7946 when the functions are all using the given default parameters.
Random forest regression integrates multiple trees into a forest through an ensemble learning approach; its basic unit is the decision tree, and the regression result is the average of the trained trees' outputs [24].
The RandomForestRegressor function in sklearn.ensemble is utilized to invoke the random forest regression model, and the score of the model test set, i.e., the R-squared value, is 0.9719 when the function uses the default parameters.
Xgboost is essentially a tree model, an ensemble of decision trees in which each new tree is fitted to correct the errors of the existing trees until the prediction target is reached [25].
The XGBRegressor function in the xgboost.sklearn module was utilized to call the Xgboost regression model, and the score of the model test set, i.e., the R-squared value, was 0.9666 when the functions were all using the given default parameters.
LightGBM, a histogram-based decision tree algorithm, corrects the weights at each iteration based on the predictions from previous iterations in order to minimize the prediction error [26].
The LGBMRegressor function in the lightgbm module is utilized to call the LightGBM regression model, and the score of the model test set, i.e., the R-squared value, is 0.9587 when the functions all use the given default parameters.
The BP neural network consists of three parts, which are the input layer, hidden layer and output layer, and the model compares the results obtained from each training with the real values, and gradually corrects the weights and thresholds according to the error after training, and finally obtains the model with the smallest error [27].
The MLPRegressor function in the sklearn.neural_network module is utilized to call the BP neural network, and the score of the model test set is 0.8842 when the functions all use the given default parameters.
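The workflow above, fitting each single model with its sklearn default parameters and reading the test-set R-squared from `.score()`, can be sketched as follows. Synthetic data stand in for the car dataset, and Xgboost and LightGBM are omitted here since they live in separate packages:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# Synthetic regression data in place of the used car dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "KNN":           KNeighborsRegressor(),
    "SVR":           LinearSVR(),
    "Random forest": RandomForestRegressor(random_state=0),
    "BP network":    MLPRegressor(max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # .score() returns the R-squared value on the test set
    print(f"{name}: R2 = {model.score(X_te, y_te):.4f}")
```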
In order to further improve the model accuracy, the algorithm parameters are tuned for the best-performing Random Forest, Xgboost and LightGBM. All three models have some degree of improvement in prediction for the test set after tuning, with MAPE reduced to between 1.85% and 2.05%.
Model fusion generates a set of individual learners and then combines them with some strategy to enhance the modeling effect. In regression problems, the common strategies are averaging and the Stacking algorithm.
(1) Simple averaging method
The simple averaging method assumes that all models are equally important: each model's prediction results are given the same weight and simply averaged for model fusion.
(2) Weighted Average Method
The weighted average method gives better-performing models more weight and poorer-performing models less weight. In this paper, the inverse of each model's absolute error is used as its weight, and the models' results are fused by weighted averaging.
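Both fusion strategies can be sketched in a few lines; the model predictions below are illustrative numbers, not the paper's results:

```python
import numpy as np

# Illustrative predictions of three single models for four cars,
# together with the true prices (units: 10,000 yuan).
y_true = np.array([10.0, 15.0, 20.0, 12.0])
preds = {
    "rf":  np.array([10.2, 14.8, 19.5, 12.4]),
    "xgb": np.array([10.5, 15.3, 20.6, 11.8]),
    "lgb": np.array([ 9.6, 15.1, 19.9, 12.1]),
}
P = np.vstack(list(preds.values()))  # one row per model

# (1) Simple averaging: every model gets the same weight
simple = P.mean(axis=0)

# (2) Weighted averaging: weight = 1 / MAE, normalized to sum to 1
mae = np.array([np.mean(np.abs(p - y_true)) for p in P])
w = (1 / mae) / (1 / mae).sum()
weighted = w @ P

print(simple, weighted)
```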
Firstly, data cleaning and preprocessing are carried out on the new energy used car data, which are randomly divided into a training dataset and a test dataset in a fixed proportion. Then six single evaluation models, LightGBM, random forest, Xgboost, KNN, BP neural network and support vector machine (SVR), are selected as base learners, and five-fold cross-validation is used to obtain a new training set matrix A and test set matrix B, which are combined into a new matrix $H = (A, B)^{T}$.
In this paper, the prediction results of the six base learners (three after model selection) on the training and test sets are input into the meta-model as a new training set and test set, respectively, and a simple multiple linear regression model is set up as the meta-learner. The StackingCVRegressor function of mlxtend.regressor is used to implement Stacking model fusion. The structure of Stacking model fusion is shown in Figure 1.

Stacking model fusion structure
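The paper implements this with mlxtend's `StackingCVRegressor`; a structurally equivalent sketch using scikit-learn's built-in `StackingRegressor` (cross-validated base learners whose out-of-fold predictions feed a linear-regression meta-learner) is shown below, on synthetic stand-in data and with only a subset of the base learners:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=800, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

stack = StackingRegressor(
    estimators=[                          # base learners (subset shown)
        ("rf",  RandomForestRegressor(random_state=0)),
        ("knn", KNeighborsRegressor()),
    ],
    final_estimator=LinearRegression(),   # multiple linear regression meta-learner
    cv=5,                                 # five-fold cross-validation
)
stack.fit(X_tr, y_tr)
print("Stacking R2:", stack.score(X_te, y_te))
```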
This paper is based on 50,000 new energy used car transaction records, divided into a training set and a test set at a ratio of 4:1. The 40,000 training records are used to train the model and determine the best parameters, and the remaining 10,000 test records are used to evaluate the accuracy and credibility of the model. Through ten-fold cross-testing, the valuation results for new energy used cars on the test set are obtained as shown in Fig. 2, where the horizontal coordinate represents the test sample number and the vertical coordinate represents the transaction price, with red indicating the real value and blue indicating the predicted value.

Stacking price estimation fusion model prediction results
Observing Figure 2, the prices predicted by this paper's Stacking fusion valuation model for the 10,000 samples are close to the actual transaction prices, and the difference between the two is relatively small. From the comparison of predicted and actual prices of the first 100 samples in Figure 3, the predictions of the Stacking fusion valuation model basically match the actual prices of the new energy used cars, with a maximum price difference of 45,800 yuan. The Stacking fusion valuation model in this paper therefore evaluates the prices of new energy used cars well.

Stacking fusion model price estimation results of the first 100 samples
Based on the data obtained above, this paper also appraises the price of new energy used cars using the traditional market approach. In the market approach, the transaction conditions, transaction dates and other influencing factors of comparable objects are adjusted so as to more accurately calculate the market value of the valuation object on the valuation benchmark date. The six transaction examples collected in the article all occurred in May-July 2023, close to the appraisal base date, so the transaction date has little impact on price and no further adjustment is made for it.
The candidate indicators for new energy used car valuation in Chapter 2 are taken as the characterization factors and quantified; the quantified influencing factors are shown in Table 1.
| Perspective | Influencing factor | Assessment case | Comparable case 1 | Comparable case 2 | Comparable case 3 | Comparable case 4 | Comparable case 5 | Comparable case 6 |
|---|---|---|---|---|---|---|---|---|
| Physical factor | First registration time | 100 | 105 | 99 | 91 | 100 | 94 | 100 |
| | Vehicle mileage | 100 | 100 | 95 | 94 | 93 | 97 | 95 |
| | Maintenance condition | 100 | 100 | 108 | 100 | 95 | 100 | 100 |
| Functional factor | Brand | 100 | 100 | 108 | 100 | 95 | 100 | 105 |
| | Power | 100 | 104 | 105 | 105 | 100 | 93 | 100 |
| | Gearbox type | 100 | 95 | 100 | 100 | 100 | 95 | 95 |
| Social factor | Vehicle type | 100 | 95 | 95 | 100 | 100 | 105 | 95 |
| | Battery life expectancy | 100 | 100 | 100 | 105 | 100 | 95 | 105 |
| | Transaction price (10,000 yuan) | 23 | 22 | 26 | 12 | 15 | 16 | 20 |
When the cumulative multiplier approach is used for valuation, the traditional market approach corrects each comparable price as follows:
$$V = P_c \times \prod_{i=1}^{8}\frac{A_i}{B_i}$$
where $P_c$ is the transaction price of the comparable case, $A_i$ is the index of the appraisal case on factor $i$ (here always 100), and $B_i$ is the index of the comparable case on factor $i$.
From the above formula, substituting the data from the comparison table gives the corrected result for Example 1:
$$V_1 = 22 \times \frac{100}{105}\times\frac{100}{100}\times\frac{100}{100}\times\frac{100}{100}\times\frac{100}{104}\times\frac{100}{95}\times\frac{100}{95}\times\frac{100}{100} \approx 22.32 \text{ (10,000 yuan)}$$
i.e., approximately 223,200 yuan.
Based on the same calculation, the revised values for Examples 2-6 are 237,600 yuan, 127,200 yuan, 178,700 yuan, 199,100 yuan, and 211,600 yuan, respectively (the Example 1 value works out to about 223,200 yuan). The appraised value is then obtained using the simple arithmetic average method:
$$V = \frac{22.32 + 23.76 + 12.72 + 17.87 + 19.91 + 21.16}{6} \approx 19.62 \text{ (10,000 yuan)}$$
Through the traditional market approach, the calculated appraised value of the new energy used car to be appraised is 196,200 yuan, which is 33,800 yuan different from the actual, and the percentage of absolute error of the traditional market approach is calculated to be 14.70% on the case of the appraisal to be appraised.
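The cumulative-multiplier correction and the arithmetic averaging can be reproduced directly from the data in Table 1; the sketch below recovers the corrected comparable values and the 196,200-yuan appraisal (prices in 10,000 yuan):

```python
import math

# Appraisal-case index is 100 on every factor (see Table 1)
subject = [100, 100, 100, 100, 100, 100, 100, 100]
comparables = [
    # (firstReg, mileage, maint, brand, power, gearbox, type, battery), price
    ([105, 100, 100, 100, 104,  95,  95, 100], 22),
    ([ 99,  95, 108, 108, 105, 100,  95, 100], 26),
    ([ 91,  94, 100, 100, 105, 100, 100, 105], 12),
    ([100,  93,  95,  95, 100, 100, 100, 100], 15),
    ([ 94,  97, 100, 100,  93,  95, 105,  95], 16),
    ([100,  95, 100, 105, 100,  95,  95, 105], 20),
]

# Each comparable's price is multiplied by the cumulative ratio of
# the appraisal-case index to the comparable-case index.
corrected = []
for indices, price in comparables:
    factor = math.prod(s / c for s, c in zip(subject, indices))
    corrected.append(price * factor)

appraised = sum(corrected) / len(corrected)  # simple arithmetic mean
print([round(v, 2) for v in corrected], round(appraised, 2))
```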
The Stacking fusion valuation model is compared with the support vector machine (SVM) model, which is often used in price evaluation, to examine the valuation performance of the model in this paper. Twenty samples are randomly selected from the training set for the valuation test, and the results are shown in Table 2.
| Number | Real price | SVM predict price | SVM fitting degree | SVM absolute error | Stacking predict price | Stacking fitting degree | Stacking absolute error |
|---|---|---|---|---|---|---|---|
1 | 4.31 | 7.33 | 1.70 | 3.02 | 5.22 | 1.21 | 0.91 |
2 | 4.57 | 4.10 | 0.90 | -0.47 | 4.40 | 0.96 | -0.17 |
3 | 14.99 | 15.66 | 1.04 | 0.67 | 14.53 | 0.97 | -0.46 |
4 | 13.61 | 17.41 | 1.28 | 3.80 | 13.88 | 1.02 | 0.27 |
5 | 20.17 | 26.26 | 1.30 | 6.09 | 20.33 | 1.01 | 0.16 |
6 | 10.59 | 15.23 | 1.44 | 4.64 | 10.77 | 1.02 | 0.18 |
7 | 8.63 | 5.03 | 0.58 | -3.60 | 7.15 | 0.83 | -1.48 |
8 | 15.67 | 16.57 | 1.06 | 0.90 | 15.69 | 1.00 | 0.02 |
9 | 4.29 | 9.49 | 2.21 | 5.20 | 4.09 | 0.95 | -0.20 |
10 | 13.64 | 10.01 | 0.73 | -3.63 | 13.96 | 1.02 | 0.32 |
11 | 4.80 | 6.34 | 1.32 | 1.54 | 4.68 | 0.98 | -0.12 |
12 | 11.75 | 15.33 | 1.30 | 3.58 | 10.37 | 0.88 | -1.38 |
13 | 10.47 | 10.14 | 0.97 | -0.33 | 10.64 | 1.02 | 0.17 |
14 | 17.39 | 21.56 | 1.24 | 4.17 | 17.11 | 0.98 | -0.28 |
15 | 16.32 | 15.11 | 0.93 | -1.21 | 16.78 | 1.03 | 0.46 |
16 | 12.23 | 16.63 | 1.36 | 4.40 | 12.91 | 1.06 | 0.68 |
17 | 22.83 | 18.67 | 0.82 | -4.16 | 22.56 | 0.99 | -0.27 |
18 | 20.52 | 23.27 | 1.13 | 2.75 | 20.72 | 1.01 | 0.20 |
19 | 15.20 | 12.79 | 0.84 | -2.41 | 15.22 | 1.00 | 0.02 |
20 | 18.88 | 16.80 | 0.89 | -2.08 | 18.00 | 0.95 | -0.88 |
| R2 | | 0.875 | | | 0.989 | | |
| MAE | | 2.933 | | | 0.432 | | |
| RMSE | | 3.357 | | | 0.597 | | |
| MAPE | | 1.152 | | | 0.995 | | |
Looking at the models overall, the goodness-of-fit of SVM is 0.875, while that of this paper's Stacking fusion valuation model is 0.989, indicating that the Stacking fusion model learns the sample patterns better and generalizes well across samples with similar characteristics. In terms of mean absolute error, root mean square error and mean absolute percentage error, the Stacking fusion model also improves clearly on the SVM model.
To further illustrate the effectiveness of the Stacking fusion valuation model in the valuation of new energy used cars, it is compared with the traditional market valuation method. The market comparison method and the Stacking fusion valuation model are each used to predict new energy used car prices.
The market comparison method and the Stacking fusion valuation model use the same dataset for new energy used car price prediction, with the goal of minimizing the training error. Figure 4 shows the prediction results of the trained models on 200 randomly selected samples.

Training data error
Observing Figure 4, the error range of the traditional market comparison method in predicting new energy used car prices is much larger than that of this paper's Stacking fusion valuation model. When assessing new energy used car prices, the error of the traditional market comparison method falls in the range of -80,000 to 80,000 yuan, while the error of this paper's Stacking fusion valuation model is concentrated in the range of -20,000 to 20,000 yuan. The Stacking fusion valuation model in this paper can therefore provide more accurate price assessments for new energy used cars.
This paper uses machine learning algorithms to optimize the traditional used price assessment model for new energy vehicles, incorporates the influencing factors of used car valuation into the model's index system, and constructs a new energy used car valuation model based on Stacking fusion. The model is applied to new energy used car valuation, and its results are compared with those of other valuation methods to examine the effect of the Stacking fusion valuation model.
The price difference between the predictions of the Stacking fusion valuation model and the actual transaction prices of the samples is small, with a maximum difference of 45,800 yuan among the first 100 samples. The appraised price of the case valued by the traditional market approach is 196,200 yuan against an actual price of 230,000 yuan, an absolute error percentage of 14.70%. The goodness-of-fit of the Stacking fusion valuation model (0.989) is better than that of the SVM model (0.875). In new energy used car valuation, the price error interval of the traditional market comparison method is [-80,000, 80,000] yuan, while that of this paper's Stacking fusion valuation model is [-20,000, 20,000] yuan. The valuation results of the Stacking fusion valuation model in this paper are thus more accurate.