Research on Dynamic Monitoring and Intelligent Early Warning of Community Correctional Recidivism Risk Based on Multidimensional Data Mining

As socialism with Chinese characteristics enters a new era, giving the people a sense of gain, happiness and security with respect to democracy, the rule of law, fairness, justice and safety has become the historical mission and responsibility that the new era entrusts to the judicial administrative organs. As criminal punishment has moved from a stage dominated by corporal punishment, through a stage dominated by imprisonment, and into a stage dominated by community-based penalties, community corrections, as a socialized form of sentence execution, has become an international trend [1]. To adapt to changes in the crime situation and in economic and social development, to follow the international trend toward socialized execution of punishment, and to implement the criminal policy of combining leniency with severity, and under the influence of penal concepts such as penal humanitarianism, China first explored community corrections in some regions, then ran pilot programs, and finally implemented it in full. Community corrections practice in China thus started from scratch and has continued to develop and improve, spreading from individual pilot sites to wider areas [2].

In recent years, problems have appeared in the re-socialization of some community correctional clients and ex-offenders. Individual correctional clients have reoffended with cruel behavior and serious social impact, which harms social stability [3]. The repeated cases of failed re-socialization are a reminder that reoffending by community correctional clients has become an urgent problem for the further development of community corrections work [4]. Multidimensional data mining technology offers a way to realize dynamic monitoring and intelligent early warning of recidivism risk in community corrections. Data mining is the process of extracting useful information, knowledge and patterns from large and complex data sets. It combines statistics, machine learning, artificial intelligence and other methods to process, summarize and mine raw data, giving decision makers a valuable reference [5]. With this technology, community correctional workers can assess the reoffending risk of community correctional clients more accurately, make correctional programs better targeted, adjust program content in a timely manner, and prevent clients from returning to crime [6].

Scientific and effective monitoring and early warning of the recidivism risk and social dangerousness of persons serving community sentences lays the foundation for controlling risk in community corrections. Literature [7] tested the validity of an intermediate measure of recidivism risk scored by community supervisors, using a sample of 632 males convicted of sexually motivated offenses. The results showed that recidivism risk changes over time, that community supervisors can use the most recent information about a client's community adjustment to predict recidivism, and that the most recent score or a rolling average of scores is the best predictor of recidivism. Literature [8] indicated that the most common model in community supervision determines a client's initial supervision level through static risk assessment and/or stable dynamic assessment, and proposed integrating dynamic risk assessment into community supervision practice, a proposal that is significant for community corrections staff working to prevent and reduce recidivism. Literature [9] took 2,877 Canadian females under community supervision as the study population and used logistic regression to explore how dynamic risk and strength factors change over time and how these changes are associated with three recidivism outcomes: technical violations, any new charges, and new violent charges. The results indicated that reassessing dynamic risk and strength scores over time is necessary to help predict recidivism outcomes. Literature [10] predicted recidivism using mobile neurocognitive assessment software designed to assess risk in a community probation population, with an empirical analysis of probationers in Houston, Texas, U.S.A. The results supported the validity of the proposed methodology, which can determine different levels of recidivism risk based on the type of offense, age, or gender. Literature [11] suggests that reassessment of dynamic risk is a better predictor of violent recidivism than the initial risk assessment at release. This was validated by dynamic risk rating and static risk scoring experiments on a sample of individuals under community supervision in New Zealand. The study emphasizes the importance of periodically reassessing dynamic risk factors after release from prison when predicting violent recidivism, and states that supervisory staff should consider overall interpersonal hostility and reactive patterns. Literature [12] used a group-based trajectory model to identify different trajectories of dynamic risk and protective factors in youth on probation and to explore the individual and neighborhood-level factors associated with these trajectories. The results point to a large number of risk/need trajectory combinations among youth on probation, a finding with important implications for the use of dynamic risk assessment in this group and for research on youth recidivism more broadly. Literature [13] notes that risk factors can increase the likelihood of reoffending, attempts to use dynamic risk to predict violent recidivism in “real time”, and finds that repeated assessments of dynamic risk improve the prediction of violent recidivism. The findings provide clear support for reassessing risk during the reentry process.

In addition, literature [14] designed a big-data-based recidivism prediction model for a specific population using heterogeneous database exchange technology, feature engineering and a linear regression algorithm. Through implementation and simulation experiments, it showed that the model could predict people's tendency to reoffend, serving the purpose of reducing the recidivism rate. Literature [15] proposes a machine-learning-based model for recidivism prediction and risk factor detection under different time windows. Several tests verified the superior performance of the model, and its interpretable results can be integrated into prison and court management systems as a risk assessment tool for the likelihood of recidivism among inmates who have completed their sentences. Literature [16] combined a rough set attribute reduction algorithm based on probability distributions, an improved k-prototype clustering algorithm, and a back-propagation neural network to design a recidivism early warning model. A large number of experiments on public data from the state of Iowa in the U.S. verified the feasibility and practicability of the model, which predicts the recidivism of offenders after completion of sentence and release more accurately. Literature [17] constructed a machine learning model for predicting offenders' recidivism tendency based on artificial intelligence algorithms and blockchain technology, taking Ukrainian prison inmates as the research object. Experiments verified the validity of the model and identified the number of convictions resulting in actual sentences and the number of probation terms as important factors affecting recidivism tendency. Literature [18] systematically reviewed studies on advanced methods for recidivism risk prediction, revealing the potential utility of machine learning (ML) in criminal justice and providing a theoretical basis for research on better techniques for predicting offenders' recidivism risk. Literature [19] reviewed five data mining techniques for predicting recidivism, namely clustering, association rules, decision trees, support vector machines, and neural networks, and pointed out that data mining offers another accurate tool for predicting recidivism alongside standard parametric (logistic regression) and actuarial risk assessment models. Literature [20] developed a random forest model for predicting recidivism in juvenile offenders, with the aim of using machine learning to augment professional judgment. Experiments verified that its predictive performance was superior to that of existing models that analyze aggregated data with traditional statistical methods, and the model provides a basis for direct intervention with those more likely to reoffend.

In this paper, the MApriori algorithm, the logistic regression model and the density-based clustering (DBSCAN) algorithm are used to mine multidimensional data features of community correctional recidivism and to construct a recidivism risk monitoring and early warning model for community corrections. First, the traditional Apriori algorithm is improved to obtain the MApriori algorithm based on the Mondrian platform, which is used to mine the multidimensional data of community correctional recidivism and obtain its basic features. Then, logistic regression is used to derive the main factors affecting recidivism among community correctional first-time offenders, and an early warning regression model of drug-related recidivism is constructed from these factors to monitor the recidivism risk of community correctional personnel. Finally, to make the monitoring and warning model more dynamic and intelligent, criminal behavior is predicted with the density-based clustering (DBSCAN) algorithm, taking State L as an example, which provides a reference for applying the approach to recidivism risk monitoring and warning in community corrections.

2

Multidimensional data mining processes in community corrections risk monitoring and early warning

Multidimensional data mining uses semi-automated or automated methods to analyze multidimensional data and extract knowledge from it. When it is applied to recidivism risk monitoring and early warning for community correctional personnel, the correctional personnel are the central object of study, and the complete mining process is organized around them. In other words, a series of steps centered on correctional personnel is needed to extract practical, previously unknown and usable knowledge from the database of community correctional personnel, which stores a large amount of elemental information, and then to use this knowledge to make reasonable judgments or scientific decisions. A complete multidimensional data mining process in community correctional risk monitoring and early warning has four main steps: defining the problem, multidimensional data preparation, multidimensional data mining, and analyzing the application results.

2.1

Defining the problem

Defining the goals of multidimensional data mining is the first step of the whole process and a prerequisite for the mining work itself. Although the results of mining cannot be predicted, the specific problems to be explored must be identified in advance. For recidivism risk assessment, for example, one should become familiar beforehand with the basic situation of the persons being assessed, their types of behavior and their background material, roughly estimate the overall risk of the different types of correctional personnel, and make an initial selection of the risk factors to be examined with the mining techniques.

2.2

Multidimensional data preparation

Multidimensional data preparation mainly includes the following steps:

1) Multidimensional data selection. Collect the information on the assessed persons held in the database of the community correction management agency of the Judicial Bureau, including personal, background and environmental information (such as age, psychological condition, type of crime and years of incarceration), and select from it the data that are suitable for the mining task.

2) Data pre-processing. Once the data are collected, incomplete, noisy (with incorrect attribute values) and inconsistent data are cleaned: vacant values are filled in, isolated points are identified and deleted, and inconsistent attribute naming and redundancy are resolved. The quality of each type of data is analyzed, the dataset is reduced in size, irrelevant attributes are removed with the help of relevance analysis, and the specific methods to be used in the mining operation are selected, so that the data are adequately prepared for further analysis.

3) Data conversion. Data conversion puts the types and values of the original data into a unified format that is convenient for subsequent analysis. In community correctional risk monitoring and early warning, this means organizing the original datasets of community correctional personnel and converting them into structured data. For example, quantitative (continuous) data can be categorized by defining boundaries (e.g., age can be divided into four categories: below 20 years old, 20-40 years old, 40-60 years old, and above 60 years old), which yields structured data that are convenient for statistics.
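The age example in the data conversion step can be sketched in a few lines of Python. This is only an illustration: the cut points come from the text, while the function name `discretize_age` and the choice to treat each interval as left-closed (so that an age of exactly 20 falls into "20-40") are assumptions, since the text does not say how boundary values are assigned.

```python
from bisect import bisect_right

# Cut points taken from the age example in the text. Treating every
# interval as left-closed is an assumption made for this sketch.
AGE_CUTS = [20, 40, 60]
AGE_LABELS = ["below 20", "20-40", "40-60", "above 60"]


def discretize_age(age: float) -> str:
    """Map a continuous age to one of the four categories."""
    # bisect_right counts how many cut points are <= age,
    # which is exactly the index of the matching label.
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]


ages = [19, 20, 35, 59, 60, 72]
categories = [discretize_age(a) for a in ages]
print(categories)
```

The same pattern applies to any other continuous attribute once its boundaries have been chosen.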

2.3

Multidimensional data mining

After getting the transformed structured multidimensional data, the next step is to enter the core mining process. The process of multidimensional data mining is shown in Figure 1.

Specifically, the prior probability density function built from the static factors of the cohort to be updated (the initial information, i.e., factors that do not change, such as the gender, crime type and prior background of community corrections personnel) is combined with the dynamic factors (the observed data, i.e., factors that change over time, such as the work or study situation, social ties and sentence execution of community corrections personnel) through Bayesian decision-making to obtain the posterior probability function of the cohort. This completes the update of the base model with the observed data. To ensure reliability and validity, attention should be paid to verifying the different effects of the dynamic factors identified in previous studies (including the antisocial patterns, social support, substance abuse, family relationships and leisure activities of community corrections personnel).
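A minimal discrete sketch of this Bayesian update is given below. It simplifies the density-based formulation in the text to a single prior probability, and every number in it is hypothetical: the prior of 0.20 stands in for a risk estimate derived from static factors, and the likelihood pairs stand in for how probable each dynamic observation is among those who do and do not reoffend. The observations are also assumed to be conditionally independent, which the text does not state.

```python
def bayes_update(prior: float, lik_if_reoffend: float, lik_if_not: float) -> float:
    """Return P(reoffend | observation) from a prior and two likelihoods."""
    evidence = prior * lik_if_reoffend + (1.0 - prior) * lik_if_not
    return prior * lik_if_reoffend / evidence


# Hypothetical prior from static factors (gender, crime type, prior record).
prior = 0.20

# Hypothetical dynamic observations:
# (label, P(obs | reoffend), P(obs | does not reoffend))
observations = [
    ("unemployed", 0.60, 0.30),
    ("weak social support", 0.50, 0.25),
]

posterior = prior
for label, lik_1, lik_0 in observations:
    # Each posterior becomes the prior for the next observation.
    posterior = bayes_update(posterior, lik_1, lik_0)
    print(f"after '{label}': {posterior:.4f}")
```

Each new round of dynamic observations can be folded in the same way, which is what makes the assessment dynamic rather than a one-off score.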

2.4

Analyze application results

By interpreting and evaluating the results of the multidimensional data mining process above, the following conclusions can be drawn:

1) The impact of different risk predictors on the likelihood of recidivism can be quantified through technical means.

2) The impact of dynamic factors on recidivism risk assessment is clarified. Changes in the cognition and skill development of community corrections personnel, together with their recidivism risk, are assessed regularly, and the results of the dynamic assessment are used to monitor the correctional process and its changes.

3) The form and quantity of correctional services and supervision are adjusted according to the results of risk monitoring. At present China has a wide variety and a large number of community correctional programs, but a considerable number of them do not reduce recidivism among correctional personnel and are therefore ineffective. Based on the results of risk assessment, the content of correctional programs should be limited to what can reduce the risk of recidivism.

3

Characterization of recidivism in community corrections based on multidimensional association rules

3.1

Multidimensional association rules

3.1.1

Definitions related to multidimensional association rules

With the study of multidimensional data cubes, the focus of association rule research has gradually shifted from single-dimensional data to multidimensional data. One of the most important concepts in multidimensional association rules is the dimension: each perspective from which a data cube is analyzed is called a dimension. A multidimensional association rule is a rule that involves multiple attributes or predicates. The relevant definitions are given below:

1) Predicates of a multidimensional data cube: given a multidimensional data cube Cube, a dimensional predicate d_i is a predicate of the form d_i ∈ D_i, where D_i is a dimension of the cube.

2) Interdimensional predicates of a multidimensional data cube: let d₁, d₂, ⋯, d_n be dimensional predicates that belong to different dimensions D₁, D₂, ⋯, D_n; then d₁, d₂, ⋯, d_n are interdimensional predicates.

3) Multidimensional association rule: a rule R of the form d₁ ∧ d₂ ∧ d₃ ∧ ⋯ ∧ d_k ⇒ d_n, where d₁, d₂, d₃, ⋯, d_k, d_n are predicates from different dimensions of the multidimensional data cube. (Note: this is the definition of an inter-dimensional association rule. Multidimensional association rules generally include inter-dimensional and mixed-dimensional association rules; this paper studies only inter-dimensional association rules.)

4) Support of a multidimensional association rule: the support of d₁ ∧ d₂ ∧ d₃ ∧ ⋯ ∧ d_k ⇒ d_n is denoted Support(R) and is compared against the minimum support threshold minsup:

$$Support(R) = \frac{COUNT(Cube(d_{1}, d_{2}, d_{3}, \dots, d_{k}, d_{n}))}{COUNT(Cube(ALL, ALL, ALL, \dots, ALL))} \tag{1}$$

5) Confidence of a multidimensional association rule: the confidence of d₁ ∧ d₂ ∧ d₃ ∧ ⋯ ∧ d_k ⇒ d_n is denoted Confidence(R) and is compared against the minimum confidence threshold minconf:

$$Confidence(R) = \frac{COUNT(Cube(d_{1}, d_{2}, d_{3}, \dots, d_{k}, d_{n}))}{COUNT(Cube(d_{1}, d_{2}, d_{3}, \dots, d_{k}, ALL))} \tag{2}$$
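The two ratios can be illustrated on a toy cube. The sketch below is an assumption-laden illustration rather than the paper's implementation: the five records are invented, the dimension names `education`, `offense` and `sentence` and the member codes are chosen only to resemble the later example, and any dimension omitted from a call to `count` is treated as ALL.

```python
# Invented fact records; each maps a dimension name to one member.
records = [
    {"education": "A1", "offense": "4002", "sentence": "C1"},
    {"education": "A1", "offense": "4002", "sentence": "C1"},
    {"education": "A1", "offense": "4002", "sentence": "C2"},
    {"education": "A2", "offense": "4002", "sentence": "C1"},
    {"education": "A2", "offense": "4101", "sentence": "C1"},
]


def count(**predicates):
    """COUNT of the cube cell; dimensions not named are treated as ALL."""
    return sum(all(r[d] == m for d, m in predicates.items()) for r in records)


def support(lhs, rhs):
    """Cell count for all predicates of the rule over the grand total."""
    return count(**lhs, **rhs) / count()


def confidence(lhs, rhs):
    """Cell count for all predicates over the count for the antecedent."""
    return count(**lhs, **rhs) / count(**lhs)


# Rule: offense=4002 AND sentence=C1  =>  education=A1
lhs = {"offense": "4002", "sentence": "C1"}
rhs = {"education": "A1"}
print(support(lhs, rhs), confidence(lhs, rhs))
```

In this toy cube two of five records satisfy the whole rule, and two of the three records matching the antecedent also match the consequent.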

3.1.2

Apriori algorithm

The Apriori algorithm is the most influential method for mining association rules. It uses prior knowledge about frequent itemsets, the Apriori property, to generate candidate (K+1)-itemsets from the frequent K-itemsets through a layer-by-layer iterative search. First, the algorithm finds the set of frequent 1-itemsets, denoted L1. It then uses L1 to find the set of frequent 2-itemsets, L2, and uses L2 to find L3. The iteration continues until no further frequent K-itemsets can be found [21].

The flow of the Apriori algorithm is shown in Fig. 2. The algorithm can be decomposed into the following two steps:

1) Mine all frequent itemsets in the relational database (R). A frequent itemset is an itemset whose support is greater than the minimum support MinSupport. First, the relational database is scanned to find the frequent 1-itemsets. Then, the frequent k-itemsets (k > 1) are derived layer by layer from the frequent (k-1)-itemsets: once the candidate k-itemsets (C_k) have been generated, the support of each candidate is calculated and the candidates are filtered against MinSupport to yield the frequent k-itemsets. All frequent k-itemsets (k > 0) found in this way are finally merged.

2) Find all strong association rules from the frequent itemsets obtained in step 1). A strong association rule is an association rule whose confidence is greater than a given minimum confidence MinConfidence. Starting from the frequent itemsets of the previous step, all possible (candidate) association rules are first enumerated, and the valuable strong association rules are then selected based on MinConfidence.
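The two steps above can be sketched as follows. This is a compact textbook-style illustration, not the code behind Fig. 2: the function names `apriori` and `strong_rules` and the four toy transactions are assumptions, and supports are recomputed by scanning the transaction list, which is exactly the cost the later MApriori discussion aims to reduce.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Step 1: return {frequent itemset: support}, found level by level."""
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Frequent 1-itemsets (L1).
    items = {i for t in transactions for i in t}
    level = {}
    for i in items:
        s = sup(frozenset([i]))
        if s > min_support:
            level[frozenset([i])] = s
    frequent = dict(level)

    k = 2
    while level:
        prev = list(level)
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {}
        for c in candidates:
            s = sup(c)
            if s > min_support:
                level[c] = s
        frequent.update(level)
        k += 1
    return frequent


def strong_rules(frequent, min_confidence):
    """Step 2: return (lhs, rhs, support, confidence) for strong rules."""
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = s / frequent[lhs]  # subsets of frequent sets are frequent
                if conf > min_confidence:
                    rules.append((lhs, itemset - lhs, s, conf))
    return rules


transactions = [
    {"A1", "4002", "C1"},
    {"A1", "4002", "C1"},
    {"A1", "4002"},
    {"A2", "C1"},
]
frequent = apriori(transactions, min_support=0.4)
rules = strong_rules(frequent, min_confidence=0.9)
for lhs, rhs, s, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), round(s, 2), round(conf, 2))
```

The thresholds are applied with a strict "greater than", following the definitions in the text; many implementations use "greater than or equal to" instead.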

3.1.3

MApriori based multidimensional association rule mining

The traditional Apriori algorithm was proposed for single-dimensional databases and has several shortcomings. Its running time increases greatly when the database is very large and the support threshold is very small, and the search space of candidate itemsets grows geometrically when frequent itemsets are generated. Calculating the support of each itemset requires a significant amount of computation. When generating association rules, the algorithm recalculates itemset supports instead of caching the intermediate results obtained while finding the frequent itemsets. Finally, it offers no clear solution for multidimensional data. To address these shortcomings in dealing with multidimensional data, this paper proposes the MApriori algorithm based on the Mondrian platform.

The core improvement of the MApriori algorithm is that it queries the Mondrian OLAP engine with constructed MDX statements to obtain the support of any multidimensional predicate set, and thereby determines whether that predicate set is frequent. MDX is a syntax specification designed for retrieval on multidimensional data cubes; it can flexibly construct query expressions and dynamically generate multidimensional query results in the returned result set. In the MApriori algorithm, multidimensional datasets can therefore be manipulated with MDX statements to obtain multidimensional query results.

The complete flow of the MApriori algorithm is shown in Fig. 3, and its operational steps are as follows:

1) Call the corresponding API provided by Mondrian to get each predicate of each dimension and its support, save each predicate with its support in a hash table, and put the frequent predicates into L1.

2) Construct the basic MDX syntax tree from the set of frequent 1-predicates obtained in step 1).

3) Generate candidate predicate sets by iterating layer by layer.

4) Construct MDX statements so that the Mondrian OLAP engine computes the support of each candidate.

5) Generate the sets of frequent predicates by iterating layer by layer and save them in the hash table.

6) Obtain the support of each frequent predicate subset from the hash table saved in the steps above and calculate the confidence of each rule.
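The control flow of these steps can be sketched without a running OLAP server. In the sketch below, `query_support` is a hypothetical stand-in for one MDX query against Mondrian: it counts over an invented in-memory list instead of calling any real Mondrian API, and no MDX text is generated. What the sketch does keep is the two ideas the steps rely on: candidates combine predicates from different dimensions only, and every support is stored in a hash table so that rule confidence needs no further queries.

```python
from itertools import combinations

# Invented fact rows standing in for the cube (dimension -> member).
FACTS = [
    {"education": "A1", "offense": "4002", "sentence": "C1"},
    {"education": "A1", "offense": "4002", "sentence": "C1"},
    {"education": "A1", "offense": "4002", "sentence": "C2"},
    {"education": "A2", "offense": "4002", "sentence": "C1"},
    {"education": "A2", "offense": "4101", "sentence": "C1"},
]


def query_support(predicates):
    """Hypothetical replacement for one MDX support query."""
    hits = sum(all(row[d] == m for d, m in predicates) for row in FACTS)
    return hits / len(FACTS)


def mapriori(min_support):
    """Return the hash table {frequent predicate set: support}."""
    cache = {}
    # Step 1: supports of single predicates; frequent ones form L1.
    singles = {(d, m) for row in FACTS for d, m in row.items()}
    level = []
    for p in singles:
        key = frozenset([p])
        s = query_support(key)
        if s > min_support:
            cache[key] = s
            level.append(key)
    # Steps 3-5: grow predicate sets layer by layer.
    k = 2
    while level:
        candidates = set()
        for a, b in combinations(level, 2):
            c = a | b
            dims = {d for d, _ in c}
            # Inter-dimensional only: one predicate per dimension.
            if len(c) == k and len(dims) == k:
                candidates.add(c)
        level = []
        for c in candidates:
            s = query_support(c)  # step 4: one query per candidate
            if s > min_support:
                cache[c] = s
                level.append(c)
        k += 1
    return cache


def rule_confidence(cache, lhs, rhs):
    """Step 6: confidence from cached supports, with no new queries."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    return cache[lhs | rhs] / cache[lhs]


cache = mapriori(min_support=0.3)
conf = rule_confidence(
    cache,
    lhs=[("offense", "4002"), ("sentence", "C1")],
    rhs=[("education", "A1")],
)
print(len(cache), round(conf, 4))
```

For brevity the sketch omits subset-based pruning of candidates, so it may issue a few queries that a full implementation would skip.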

3.2

Example analysis

3.2.1

Analytical process

This example takes as its sample the persons released and rehabilitated from a community correctional institution between 2018 and 2020. First, {ethnicity, literacy, place of origin, marital status, offense, sentence term, age at release} are selected from the base data as condition attributes, and whether the person recidivated is used as the decision attribute. Some attributes were derived by merging (e.g., age at release = sentence termination date - date of birth) and others were deleted (e.g., irrelevant or similar attributes such as arresting authority, date of arrest, sentencing authority, and date of sentence).

Second, the attribute values in the dataset were discretized and categorized. For example, literacy level is discretized as {1: Elementary school and below, 2: Middle school, 3: High school (intermediate), 4: Specialized (high school), 5: Bachelor’s degree and above}. Charges are recorded as codes and are therefore already discrete, but some categorization is still needed to facilitate the analysis (e.g., the codes 4002, 400201, 400202 and 400203 denote “theft”, “habitual theft”, “attempted theft” and “preparation for theft” respectively, and are unified under 4002). Sentences are discretized as {1: less than 36 months, 2: 36-72 months, 3: 72-108 months, 4: more than 108 months}. The age at first offense is discretized as {1: under 25 years old, 2: 25 to 35 years old, 3: 35 to 45 years old, 4: over 45 years old}.

The discretized data were imported into Rosetta and the attributes were reduced using Johnson’s reduction algorithm, resulting in the reduct R = {education, offense, sentence, age at release}, whose attributes are denoted A, B, C, and D, respectively.

3.2.2

Analyzing results

When the minimum support of MApriori algorithm is set to 0.3, the minimum confidence is set to 0.8, the minimum lift is 1, and the number of rule entries is 13, the results of multidimensional association rule mining are shown in Table 1, in which the code 4002 is the crime of theft.

Table 1.

The first multi-dimensional association rule mining results

Serial number	Rules	Support	Confidence	Lift
1	(4002-C1)-A1	0.3971	0.9463	2.1641
2	(D1-4002-C1)-A1	0.2879	0.9372	3.0472
3	4002-A1	0.4598	0.9392	1.7959
4	D1-A1	0.5977	0.9475	1.3547
5	(D1-4002)-A1	0.3459	0.9619	2.5961
6	(D1-C1)-A1	0.5130	0.9441	1.7715
7	C1-A1	0.6908	0.9308	1.2436
8	(D1-4002-A1)-C1	0.2898	0.9074	2.5924
9	(D1-4002)-C1	0.3286	0.8353	2.5903
10	D1-C1	0.5399	0.3102	1.5659
11	(D1-A1)-C1	0.4907	0.8178	1.3577
12	(4002-A1)-C1	0.4317	0.8348	1.4991
13	4002-C1	0.4647	0.8543	1.6635

In Table 1, Rule 1 is interpreted as follows: offenders who committed theft, received short sentences (less than 3 years) and have low literacy (elementary school and below) account for 39.71% of the total sample. At the same time, 94.63% of the theft offenders with short sentences have a literacy level of elementary school or below. The lift, which is greater than 1, indicates that the antecedent and the consequent are positively correlated: low literacy occurs more often among short-sentence theft offenders than would be expected if the two were independent.

In addition, with the minimum support set to 0.3, the minimum confidence set to 0.6, and the minimum lift set to 1, all charges of community correctional personnel are mined individually. The results of this multidimensional association rule mining are shown in Table 2. The “*” in parentheses in the table marks a charge code that appears in the prior criminal record.

Table 2.

The second multi-dimensional association rule mining results

Serial number	Rules	Support	Confidence	Lift
1	4002-4002(*)	0.3597	0.6972	1.3024
2	4002(*)-4002	0.3597	0.6038	1.0549

As can be seen from Table 2, cases in which both the current offense and the prior criminal record contain theft account for 35.97% of the total sample. Among the community correctional personnel whose current offense is theft, 69.72% also have theft in their prior criminal records. Among those whose prior criminal records contain theft, 60.38% committed theft as the current offense.

3.2.3

Evaluation of rules

With lift used as the measure of correlation, only positively correlated rules were retained. For Rule 1 in Table 1, for example, Lift = 2.1641 indicates that low literacy occurs together with short-sentence theft more often than would be expected if they were independent. Similarly, both rules in Table 2 have a lift greater than 1, which indicates that neither association is accidental: theft in the prior record and theft as the current offense occur together more often than independence would predict. In addition, Table 1 contains rules that stand in an inclusion relationship: they have the same consequent (RHS) and their antecedents (LHS) are subsets of one another. For example, the LHS of Rule 2 contains the LHS of Rules 1, 3, 4, 5, 6, and 7, and the lift of Rule 2 is greater than that of the other rules, so rules of this type can all be merged into Rule 2. Finally, the results show that a low education level (elementary school and below), a short sentence (less than 36 months), young age (less than 25 years old), and a previous theft offense are the main characteristics of reoffending, which further supports the findings of the existing literature on the causes of crime. Table 2 shows that theft tends to be committed repeatedly and frequently. In view of these characteristics, community correctional institutions should strengthen knowledge and skill training and psychological correction for community correctional personnel, and adopt scientific, objective and evidence-based correctional methods to prevent and reduce the recurrence of theft and to improve the confidence and adaptability of correctional personnel in returning to society.

4

Logistic regression-based community corrections recidivism risk monitoring

4.1

Logistic regression model

4.1.1

Basic Theory of Logistic Regression Models

Logistic regression model is commonly used to describe the relationship between categorical response variables and explanatory variables, which belongs to a class of generalized linear regression models, and can be specifically divided into dichotomous regression models and multicategorical regression models [22]. Its explanatory variables can be continuous and discrete variables, such as attribute data, count data, etc., and the response variables are two or more discrete variables. In this paper, a bicategorical logistic regression model is used for monitoring recidivism risk in community corrections and early warning.

As a dichotomous logistic regression model, the response variable Y has only two outcomes such as “yes” or “no”, which are denoted by 1 and 0, respectively. Assuming that Y depends on p independent variables (or explanatory variables), which are denoted as X₁, X₂,…, X_p. Under the effect of p independent variables, the conditional probability of Y taking the value of “yes” is denoted as P = P_r {Y = 1 | X₁, X₂,…, X_p} = π(X), and the conditional probability of Y taking the value of “no” can be denoted as 1–P = P_r {Y = 0 | X₁, X₂,…, X_p} = 1–π(X), that is: 3 $\begin{array}{l} P & = π (X) = \frac{\exp (β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{p} X_{p})}{1 + \exp (β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{p} X_{p})} \\ = \frac{1}{1 + \exp [- (β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{p} X_{p})]} \end{array}$ 4 $1 - P = 1 - π (X) = \frac{1}{1 + \exp (β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{p} X_{p})}$

Eq. (3) or Eq. (4) is also known as the general form of the Logistic regression model, where β₁, β₂,…, β_p is called the regression coefficient of the Logistic regression model and β₀ is called the constant term or intercept. From equation (3), it can be seen that the Logistic regression model is a typical probabilistic nonlinear regression model, where the independent variable X_j (j = 1,2,⋯, p) can be a continuous variable, a categorical variable, or a dummy variable. Notice that when the independent variable X_j (j = 1,2,⋯, p) takes on any value, β₀ +β₁ X₁ + β₂ X₂ +⋯+ β_p X_p takes on a range of (-∞,+∞), and therefore P always varies between 0 and 1, which is exactly what probability means.

Applying the logit transformation to Eq. (3), the logistic regression model can be written in the following linear form:

$\operatorname{logit}(P) = \ln\left(\frac{P}{1 - P}\right) = β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{p} X_{p}$ (5)

where $\frac{P}{1 - P}$ is called the odds of the event and is numerically equal to the ratio of the probability that the event occurs to the probability that it does not occur.
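The relationship between Eq. (3) and Eq. (5) can be checked numerically. The following is a minimal sketch rather than part of the original method; the function names `sigmoid`, `odds` and `logit` are illustrative choices.

```python
import math

def sigmoid(eta):
    """Eq. (3): map the linear predictor eta to a probability P."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Odds of the event: P / (1 - P)."""
    return p / (1.0 - p)

def logit(p):
    """Eq. (5): the log-odds, which recovers the linear predictor."""
    return math.log(odds(p))

eta = 1.25                # an arbitrary value of the linear predictor
p = sigmoid(eta)          # probability implied by Eq. (3)
eta_back = logit(p)       # applying Eq. (5) returns the same linear predictor
```

Because `logit` inverts `sigmoid`, a coefficient β_j can be read as the change in the log-odds for a one-unit change in X_j.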

4.1.2

Parameter Estimation of Logistic Regression Models

Logistic regression models are nonlinear regression models, and their parameters can be estimated by the maximum likelihood method. Since the response variable Y follows a Bernoulli distribution, Y ~ b(1, π(X)), we have E(Y) = π(X), and the probability of a single sample observation can be written as:

$P(y_{i} \mid X_{i}) = π(X_{i})^{y_{i}} [1 - π(X_{i})]^{1 - y_{i}}$ (6)

Here X_i = [x_i1, x_i2, ⋯, x_ip], i = 1, 2, ⋯, n. The likelihood function for the n sample observations is then:

$L(β \mid X, y) = \prod_{i=1}^{n} p(y_{i} \mid X_{i}; β) = \prod_{i=1}^{n} [π(X_{i})]^{y_{i}} [1 - π(X_{i})]^{1 - y_{i}}$ (7)

The log-likelihood function is:

$l(β) = \ln L(β \mid X, y) = \sum_{i=1}^{n} \left\{ y_{i} \ln[π(X_{i})] + (1 - y_{i}) \ln[1 - π(X_{i})] \right\}$ (8)

The Levenberg-Marquardt algorithm combines the advantages of Newton’s method and the gradient descent method and converges quickly. The specific procedure for iteratively solving the maximum likelihood estimate of β with the Levenberg-Marquardt algorithm is as follows:

Let the sample be arranged as:

$[X, Y] = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1p} & y_{1} \\ x_{21} & x_{22} & \dots & x_{2p} & y_{2} \\ \vdots & \vdots & & \vdots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} & y_{n} \end{bmatrix}$ (9)

Then, the log-likelihood function of the logistic regression model can be rewritten as:

$\begin{array}{l} l(β) = \sum_{i=1}^{n} \left\{ y_{i}(β_{0} + β_{1} x_{i1} + \dots + β_{p} x_{ip}) - \ln[1 + \exp(β_{0} + β_{1} x_{i1} + \dots + β_{p} x_{ip})] \right\} \\ \quad = \sum_{i=1}^{n} \left\{ y_{i}(β_{0} x_{i0} + β_{1} x_{i1} + \dots + β_{p} x_{ip}) - \ln[1 + \exp(β_{0} x_{i0} + β_{1} x_{i1} + \dots + β_{p} x_{ip})] \right\} \\ \quad = \sum_{i=1}^{n} y_{i} β X_{i} - \sum_{i=1}^{n} \ln[1 + \exp(β X_{i})] \end{array}$ (10)

where $x_{i0} = 1$. Let $f(β) = l(β) = \sum_{i=1}^{n} y_{i} β X_{i} - \sum_{i=1}^{n} \ln[1 + \exp(β X_{i})]$; then:

$\nabla f(β) = \frac{\partial f}{\partial β}, \quad H(β) = \begin{bmatrix} \frac{\partial^{2} f}{\partial β_{0}^{2}} & \dots & \frac{\partial^{2} f}{\partial β_{0} \partial β_{p}} \\ \vdots & \ddots & \vdots \\ \frac{\partial^{2} f}{\partial β_{p} \partial β_{0}} & \dots & \frac{\partial^{2} f}{\partial β_{p}^{2}} \end{bmatrix}$ (11)

Based on the principle of the Levenberg-Marquardt algorithm, the iterative form of β is:

$β_{i+1} = β_{i} - (H + λD)^{-1} \nabla f(β_{i})$ (12)

where H is the Hessian matrix, D is the diagonal matrix associated with H, and λ is the adjustable damping parameter. Given the initial value β₀ = {β₀₀, β₀₁, ⋯, β₀p} and the computational accuracy ε, the iteration stops when $\|β_{i+1} - β_{i}\| \leq ε$, and the resulting $β_{i+1}$ gives the parameter estimates of the logistic regression model.
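The iteration in Eq. (12) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the paper: D is taken to be the diagonal of H, λ is divided by 10 after a step that increases the log-likelihood and multiplied by 10 otherwise, and the names `fit_logistic_lm`, `solve` and `log_likelihood` as well as the toy data are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def log_likelihood(beta, X, y):
    """Eq. (10): sum of y_i * eta_i - ln(1 + exp(eta_i))."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        total += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total

def fit_logistic_lm(X, y, lam=1e-3, eps=1e-8, max_iter=200):
    X = [[1.0] + list(row) for row in X]            # prepend x_i0 = 1
    d = len(X[0])
    beta = [0.0] * d                                # initial value
    ll = log_likelihood(beta, X, y)
    for _ in range(max_iter):
        grad = [0.0] * d
        H = [[0.0] * d for _ in range(d)]
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for j in range(d):
                grad[j] += (yi - p) * xi[j]                 # gradient of l(beta)
                for k in range(d):
                    H[j][k] -= p * (1.0 - p) * xi[j] * xi[k]  # Hessian, Eq. (11)
        # Damped system H + lam * D with D = diag(H) (assumption)
        A = [[H[j][k] + (lam * H[j][j] if j == k else 0.0) for k in range(d)]
             for j in range(d)]
        step = solve(A, grad)
        new_beta = [b - s for b, s in zip(beta, step)]      # Eq. (12)
        new_ll = log_likelihood(new_beta, X, y)
        if new_ll >= ll:                                    # accept the step
            small = math.sqrt(sum(s * s for s in step)) <= eps
            beta, ll = new_beta, new_ll
            lam = max(lam / 10.0, 1e-12)
            if small:
                break
        else:                                               # reject, damp more
            lam *= 10.0
    return beta

# Toy data with overlapping classes so that a finite estimate exists
X_toy = [[0], [1], [2], [3], [4], [5]]
y_toy = [0, 0, 1, 0, 1, 1]
beta_hat = fit_logistic_lm(X_toy, y_toy)
```

With a small λ the update behaves like Newton's method, while a large λ shortens the step, which is the trade-off the damping parameter is meant to control.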

4.1.3

Early warning of recidivism in community corrections based on logistic regression

In constructing a community corrections recidivism early warning model with the logistic regression algorithm, it is first necessary to preprocess the original dataset, divide it into a training set and a test set in an appropriate proportion, and determine the independent variables (i.e., basic information attributes of community corrections personnel, crime information attributes, etc.) and the dependent variable (i.e., the recidivism status of community corrections personnel, where “recidivism” is denoted as 1 and “non-recidivism” as 0). Second, the training set is used to obtain the specific structure and regression coefficients of the regression model, and the test set is used to test its classification accuracy.

In the logistic regression model, the output of Eq. (3) lies between 0 and 1 and indicates the probability that a person under community correction reoffends. In practical applications it is therefore necessary to select an acceptable probability threshold, for example 0.5: when the output probability of the model is greater than 0.5, the person is judged to be in a recidivism state; otherwise, the person is judged to be in a non-recidivism state.
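The threshold rule and the accuracy check on the test set can be illustrated with a short sketch. The function names and the sample probabilities below are hypothetical and serve only to show the decision rule.

```python
def classify(probabilities, threshold=0.5):
    """Label 1 (recidivism) when the model probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(predicted, actual):
    """Share of test samples whose predicted label matches the true label."""
    hits = sum(1 for a, b in zip(predicted, actual) if a == b)
    return hits / len(actual)

# Made-up model outputs and true labels for four test samples
test_probabilities = [0.12, 0.50, 0.77, 0.91]
test_labels = [0, 1, 1, 1]
predicted = classify(test_probabilities)
acc = accuracy(predicted, test_labels)
```

A probability exactly equal to the threshold is assigned to the non-recidivism class here, following the "greater than 0.5" wording; lowering the threshold would flag more cases at the cost of more false alarms.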

4.2

Community Corrections Recidivism Early Warning Modeling

4.2.1

Data sample selection

The data samples used in this chapter were obtained by collecting, from the China Judgment Network, cases in which offenders were sentenced to community corrections in State L between 2018 and 2022, and, with the support of the State L community corrections agencies, cross-checking them against the statewide community corrections personnel control database and the statewide offender management system. From these sources, the information of 600 offenders sentenced to community corrections in State L between 2018 and 2022 was extracted as the sample data. According to whether they were first-time offenders or recidivists, the samples were divided into two groups for comparison: 294 samples in the recidivist group and 306 samples in the first-time offender group. A model fit test was carried out on the samples, and there were no missing values.

4.2.2

Model construction and analysis

When the chi-square test was conducted on the factors affecting the likelihood of recidivism of community corrections personnel, the variable of criminal organization was deleted because its p-value was greater than 0.05 and it was therefore not statistically significant. In addition, because the calculation started from the full model and used the backward stepwise method, in which all independent variables are evaluated one by one and the explanatory variables that contribute nothing to the model are removed, the variable of marital status was also deleted during the analysis. Seven factors that may have a more significant impact on the recidivism of community corrections personnel, including gender, cultural level and criminal experience, were then extracted. Combined with the results of the logistic regression analysis of the response variable and the explanatory variables, it can be concluded that these seven factors influence the possible recidivism of drug-involved personnel to different degrees. The results of the logistic regression analysis are summarized in the multi-factor analysis table of factors influencing the recidivism of community corrections personnel, shown in Table 3.

In Table 3, the indicators have the following meanings: β is the standardized partial regression coefficient, SE is the standard error of the partial regression coefficient, Wald χ² is the Wald chi-square test statistic, OR is the odds ratio, and P is the significance probability of the test statistic. Three significance thresholds are commonly used for the P-value, namely 0.05, 0.01 and 0.001: a P-value below 0.05 is generally taken to indicate a statistically significant difference, a P-value below 0.01 a highly significant difference, and a P-value below 0.001 (shown in the table as 0.000) an extremely significant difference. In these cases, the probability that the difference between the samples is caused by sampling error is lower than 0.05, 0.01 and 0.001, respectively.

Based on the selected sample data, each parameter was estimated with the logistic regression model, and the variables were screened by statistical significance (P<0.05). This led to the model equation of the recidivism early warning model, which is used for risk monitoring and early warning of recidivism among community corrections personnel.

The results in Table 3 show directly whether the coefficients of the regression equation are significant, and they also make it possible to check for collinearity and other problems. As can be seen from Table 3, the meaningful variables are: gender $χ_{1}$, age 30-39 group $χ_{2(2)}$, 40-49 group $χ_{2(3)}$, 50-59 group $χ_{2(4)}$, 60 and above group $χ_{2(5)}$, place of residence $χ_{3}$, state of residence $χ_{4}$, junior high school educational level $χ_{5(2)}$, high school educational level $χ_{5(3)}$, occupation $χ_{6}$, and experience with drug use $χ_{7(1)}$. The final equation of the community corrections recidivism early warning model is:

$\begin{array}{l} \operatorname{Logit}(p) = 2.61 - 0.98 χ_{1} + 0.33 χ_{2(2)} + 0.31 χ_{2(3)} - 4.41 χ_{2(4)} - 17.95 χ_{2(5)} \\ \quad - 0.38 χ_{3} - 3.04 χ_{4} - 0.91 χ_{5(2)} - 0.82 χ_{5(3)} + 0.92 χ_{6} + 1.36 χ_{7(1)} \end{array}$ (13)

The constructed early warning model of community correction is analyzed, and the content and results of the analysis are as follows: 1)

From the analysis of the gender variable, the β-value for female gender is -0.98<0, with P=0.005<0.05. This shows that, among first-time offenders under community correction who may reoffend, the female group has a lower risk than the male group.

2)

From the analysis of the age variable, the β-values of the 30-39 and 40-49 age groups are 0.33 and 0.31, respectively, with P<0.05. This shows that, among first-time offenders who may reoffend, the 30-39 and 40-49 age groups have a higher risk of recidivism than the other two age groups.

3)

From the analysis of the residence variable, the β-value for residence in the city is -0.38<0, with P=0.007<0.05. This shows that, among first-time offenders who may reoffend, the urban group has a lower risk than the group living in rural and remote areas. It indicates that most community corrections personnel who reoffend live in townships, rural areas and other areas where the population is relatively concentrated and the economy is relatively backward, which is mainly determined by the geographical location and economic conditions of State L.

4)

From the analysis of the state-of-residence variable, the β-value for having a fixed abode is -3.04<0, with P<0.001. This shows that, among first-time offenders who may reoffend, the group without a fixed abode has a higher risk than the group with a fixed abode.

5)

From the analysis of the cultural level variable, the β-values of the junior high school and high school cultural levels are -0.91 and -0.82, respectively, with β<0 and P<0.001. This shows that, among first-time offenders who may reoffend, the group with a lower cultural level has a higher risk of recidivism than the group with a higher cultural level.

6)

From the analysis of the occupation type variable, the regression coefficient β of the unemployed is 0.92, with P=0.017<0.05. This shows that, among first-time offenders who may reoffend, the more unstable the occupation, the higher the probability of recidivism.

7)

From the analysis of the drug abuse experience variable, the regression coefficient β of those with drug abuse experience is 1.36, with P<0.001. This shows that, among first-time offenders who may reoffend, those with drug abuse experience have a higher probability of recidivism.

Table 3.

Logistic analysis of community correctional personnel’s recidivism behavior

Influencing factor	β	SE	Wald χ²	OR	95%CI	P
Gender
Male				1.02
Female	-0.98	0.29	10.32	0.41	0.24~0.71	0.005
Age			22.44			0.000
18-29 years old				1.05
30-39 years old	0.33	0.16	5.04	1.43	1.12~1.85	0.019
40-49 years old	0.31	0.14	4.28	1.14	0.98~1.82	0.025
50-59 years old	-4.41	1.12	15.96	0.15	0.02~0.11	0.000
60 years old and above	-17.95	3105.68	0.01	0.02		0.963
Domicile				1.05
City	-0.38	0.16	7.94	0.72	0.49~0.93	0.007
Countryside
Living condition
Unfixed residence				1.09
Fixed home	-3.04	0.35	75.27	0.06	0.02~0.08	0.000
Cultural level
Primary school and below
Junior high school	-0.91	0.17	35.14	0.39	0.31~0.56	0.000
High school and above	-0.82	0.21	2.17	0.78	0.52~1.16	0.144
Occupation
Fixed occupation	-4.35	1.06	16.36	0.09	0.01~0.09	0.000
Unemployed	0.92	0.48	3.85	2.61	0.99~6.38	0.017
Criminal experience
Have drug use experience	1.36	0.95	1.88	0.27	0.05~1.67	0.162
No drug use experience	-0.88	0.18	32.95	0.41	0.35~0.56	0.000
Constant	2.61	0.87	9.24	14.24		0.004

Based on the above analysis, the closer the probability p obtained from the community corrections recidivism early warning model equation is to 1, the higher the probability that the person under community correction will reoffend. Conversely, the closer the result is to 0, the lower the probability of reoffending.

Based on the community correctional personnel recidivism early warning model equation derived from the analysis of this paper, this paper selects two community correctional personnel for the following examples: 1)

M is male, 42 years old, lives in a rural area, has a fixed residence and a junior high school education, is unemployed, and has experience of drug abuse. Substituting M’s personal information into the early warning model equation gives Logit(p)=2.59+0.31-3.04-0.91+0.92+1.36=1.23, which corresponds to p≈0.77 by Eq. (3). The result is relatively close to 1, which indicates that the recidivism risk of this person is relatively high.

2)

N is female, 46 years old, lives in the city, has no fixed abode, has a high school education, has no fixed occupation, and has no drug experience. Substituting N’s personal information into the early warning model equation gives Logit(p)=2.59-0.98-0.31-0.38-0.82+0.92-0.88=0.14, which corresponds to p≈0.53 by Eq. (3). This value is markedly lower than that of M, which indicates that the recidivism risk of this person is relatively low.
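A calculation of this kind can be sketched by evaluating Eq. (13) for a set of indicator variables and converting the result to a probability with Eq. (3). The dictionary keys below are hypothetical labels for the variables of Eq. (13), and the intercept is left as a parameter because Eq. (13) prints the constant 2.61 while the worked examples use 2.59.

```python
import math

# Coefficients as printed in Eq. (13); the key names are illustrative labels
COEFFICIENTS = {
    "female": -0.98,
    "age_30_39": 0.33,
    "age_40_49": 0.31,
    "age_50_59": -4.41,
    "age_60_plus": -17.95,
    "urban_residence": -0.38,
    "fixed_abode": -3.04,
    "junior_high": -0.91,
    "high_school": -0.82,
    "unemployed": 0.92,
    "drug_use": 1.36,
}

def logit_score(active_indicators, intercept=2.61):
    """Sum the intercept and the coefficients of all indicators equal to 1."""
    return intercept + sum(COEFFICIENTS[name] for name in active_indicators)

def probability(score):
    """Eq. (3): convert the logit value to a probability."""
    return 1.0 / (1.0 + math.exp(-score))

# Profile of M: male, 42, rural, fixed abode, junior high, unemployed, drug use
profile_m = ["age_40_49", "fixed_abode", "junior_high", "unemployed", "drug_use"]

score_eq13 = logit_score(profile_m)                   # constant from Eq. (13)
score_text = logit_score(profile_m, intercept=2.59)   # constant used in the text
p_eq13 = probability(score_eq13)
```

Reference categories (male, age 18-29, rural residence, no fixed abode, primary school and below, no drug use) contribute nothing to the sum, which is why they do not appear in the profile list.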

5

Crime prediction based on the DBSCAN algorithm

5.1

Principles and Improvements of Multi-density Density Clustering

5.1.1

Principles of Density Clustering (DBSCAN)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that identifies clusters by discovering high-density regions in a dataset and is effective in dealing with noise and outliers [23]. The main concepts of DBSCAN are as follows:

Core point: a data point whose neighborhood of a given radius (Eps) contains at least MinPts data points. In general, a core point (core object) is a high-density point with a sufficient number of neighboring points and can be regarded approximately as the center of a cluster.

Directly density-reachable: if data point A lies in the Eps neighborhood of data point B and B is a core object, then A is directly density-reachable from B, and A belongs to the cluster that has B as a core object.

Density-reachable: for data points A and B, if there exists a sequence of data points P₁, P₂, …, P_n with P₁ = B and P_n = A such that each P_{i+1} (1 ≤ i < n) is directly density-reachable from P_i, then A is density-reachable from B and lies in the cluster that has B as a core object.

Density-connected: if there exists a core point C such that data points A and B are both density-reachable from C, then A and B are density-connected and belong to the same cluster.

The basic steps of DBSCAN algorithm are as follows: Step 1:

Initialization

Each object in the dataset needs to be given a category tag with an initial value of zero. When the category tag of an object is zero, it means that the object has not been categorized yet. Also initialize the parameters of the algorithm such as neighborhood radius (Eps) and neighborhood density threshold (MinPts).

Step 2:

Iterate over unclassified objects

Traverse the objects in the dataset sequentially. If an object’s category tag is zero, the object has not yet been categorized, and the algorithm proceeds to the next step with it.

Step 3:

Expand clusters based on density reachability

Examine the number of points in the object’s neighborhood to determine whether the conditions for a core object are met. Starting with a core object that has not been visited, recursively group the points in its neighborhood into the same cluster using the density reachability principle. If a point is a core object, then all points in its neighborhood must also belong to the same cluster. This process continues until no new core object can be found.

Step 4:

Marking Noisy Points

If the object does not fulfill the conditions of a core object, mark it as a noise point. Such a point may later be reassigned to a cluster as a boundary point if it lies in the neighborhood of a core object.

Step 5:

End Scan

All objects in the dataset are categorized, so the clustering ends, and each object is marked with the category tag to which it belongs.

The principle of the DBSCAN algorithm is shown in Fig. 4, assuming that Eps is 3 and MinPts is 3, i.e., if the number of elements in the circle with Eps as the radius is at least 3, then the point is a core point. The dots within the circles in the figure indicate the core points, the triangles indicate the boundary points, and the squares indicate the noise points.

The basic idea is to fix the neighborhood radius of the elements in the sample dataset and the minimum density threshold. For any element, a circle with radius Eps is drawn around it; if the number of elements in the circle is not less than the minimum density threshold, this element is a core point, the other elements in the circle are called boundary points, and the remaining elements are called noise points. According to the set parameters, the core points are found and the clusters are expanded from them until all the elements in the dataset have been traversed.
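The five steps above can be sketched in a few lines of Python. This is a minimal illustration of standard DBSCAN rather than the code used in the paper; it assumes that the Eps neighborhood of a point includes the point itself, and the names `dbscan` and `region_query` and the sample points are hypothetical.

```python
import math

UNCLASSIFIED = 0   # Step 1: initial category tag
NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [UNCLASSIFIED] * len(points)
    cluster = 0
    for i in range(len(points)):                 # Step 2: scan unclassified objects
        if labels[i] != UNCLASSIFIED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:             # Step 4: not a core object
            labels[i] = NOISE
            continue
        cluster += 1                             # Step 3: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:               # former noise becomes a boundary point
                labels[j] = cluster
            if labels[j] != UNCLASSIFIED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:      # j is a core object, keep expanding
                queue.extend(j_neighbors)
    return labels                                # Step 5: every object carries a tag

sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
sample_labels = dbscan(sample, eps=1.5, min_pts=3)
```

The explicit queue replaces the recursion described in Step 3, which avoids deep call stacks on large clusters while visiting the same points.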

5.1.2

Improvements in density clustering

The performance of DBSCAN is directly affected by two key parameters, the neighborhood radius Eps and the neighborhood density threshold MinPts, and unreasonable values of these parameters may cause large errors. The traditional manual parameter adjustment method is subject to subjective factors and may fail to locate the optimal parameter combination. In addition, DBSCAN may be less effective on clusters with large density differences, because it easily misidentifies clusters of lower density as noise points, which biases the clustering results.

For this reason, this paper proposes a method for multi-density adaptive determination of the parameters of the DBSCAN algorithm. The algorithm is effective in handling community corrections recidivism datasets with large differences in density distribution. By identifying community corrections recidivism types with different densities, it can better adapt to the distribution characteristics of the community corrections recidivism dataset itself. Its adaptive nature makes the algorithm more flexible and stable in the process of clustering community correctional recidivism.

The multi-density adaptive DBSCAN is based on the self-attenuating K-mean nearest neighbor algorithm and the mathematical expectation method, which together generate the clustering parameters automatically from the characteristics of the community corrections recidivism dataset itself. Its specific steps are as follows: Step 1:

Calculate the Euclidean distance from each community corrections recidivism type to every other community corrections recidivism type and generate the distance matrix $Dist_{m \times m}$:

$Dist_{m \times m} = \{ dist(a, b) \mid 1 \leq a \leq m,\ 1 \leq b \leq m \}$ (14)

where m is the number of objects in the dataset, dist(a, b) is the Euclidean distance between points a and b, and $Dist_{m \times m}$ is a real symmetric matrix of order m.

Step 2:

Generate a new matrix $Dist_{m \times m}^{new}$ by sorting the elements of each row of $Dist_{m \times m}$ in ascending order.

Step 3:

Average each column of $Dist_{m \times m}^{new}$ to obtain $\bar{D}_{k}$ $(1 \leq k \leq m)$ and subtract the self-attenuation term to find the candidate Eps parameter:

$Eps_{k} = \bar{D}_{k} (1 - λ^{2})$ (15)

where λ is the self-attenuation function and 0 ≤ λ ≤ 1.

Step 4:

The list of Eps parameters is obtained after solving for all values of k:

$Eps_{list} = \{ Eps_{k} \mid 1 \leq k \leq m \}$ (16)

Step 5:

Generate the list of MinPts values with the mathematical expectation method:

$MinPts_{k} = \frac{(1 - λ)}{m} \sum_{t=1}^{m} P_{t}$ (17)

where $P_{t}$ is the number of community corrections recidivism types in the Eps neighborhood of the t-th element. This gives the list of MinPts parameters:

$MinPts_{list} = \{ MinPts_{k} \mid 1 \leq k \leq m \}$ (18)

Step 6:

Define the multi-density threshold $Density_{i}$ as the density obtained when $MinPts_{i}$ data points lie in a circle of radius $Eps_{i}$:

$Density_{i} = \frac{MinPts_{i}}{π Eps_{i}^{2}}$ (19)
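Steps 1 to 6 can be sketched as follows. This is an illustrative sketch under several assumptions that the text does not fix: the first sorted column (the zero distance of each point to itself) is skipped because it would give Eps = 0, the neighborhood count P_t includes the point itself, the average in the MinPts formula is taken over the m points, and λ is passed in as a constant. The function name `candidate_parameters` is hypothetical.

```python
import math

def candidate_parameters(points, lam=0.0):
    """Return candidate Eps, MinPts and density lists for k = 1 .. m-1."""
    m = len(points)
    # Step 1: Euclidean distance matrix, Eq. (14)
    dist = [[math.dist(a, b) for b in points] for a in points]
    # Step 2: sort every row in ascending order
    dist_sorted = [sorted(row) for row in dist]
    eps_list, minpts_list, density_list = [], [], []
    for k in range(1, m):                        # column 0 holds the self-distance
        d_bar = sum(row[k] for row in dist_sorted) / m
        eps_k = d_bar * (1.0 - lam ** 2)         # Step 3: Eq. (15)
        if eps_k == 0.0:                         # degenerate radius, skip
            continue
        # P_t: number of points in the Eps_k neighborhood of point t
        counts = [sum(1 for d in row if d <= eps_k) for row in dist]
        minpts_k = (1.0 - lam) / m * sum(counts)             # Step 5: Eq. (17)
        density_k = minpts_k / (math.pi * eps_k ** 2)        # Step 6: Eq. (19)
        eps_list.append(eps_k)
        minpts_list.append(minpts_k)
        density_list.append(density_k)
    return eps_list, minpts_list, density_list

square = [(0, 0), (1, 0), (0, 1), (1, 1)]
eps_c, minpts_c, density_c = candidate_parameters(square)
```

Each index of the three returned lists forms one candidate parameter pair together with its density, which can then be tried in turn when searching for a stable number of clusters.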

After the lists of neighborhood radius and density threshold parameters have been obtained, the parameter pairs corresponding to the different k values (1 ≤ k ≤ m) are substituted into the clustering in turn. When the number of clusters N stays the same or converges several times in a row, that number is taken as the optimal number of clusters.

When the clustering result tends to be stable, its corresponding optimal parameters are selected in reverse order. In this paper, the number of consecutive stable times m is introduced to indicate the situation that the clustering results keep the same number of clusters: when the same number of clusters occurs m times consecutively, it means that the clustering results tend to be stable, and the number of clusters corresponding to the first stable interval will be labeled as the optimal number of clusters. If we cannot find the same number of clusters for m consecutive times, we will continue to search for the same number of clusters for m–1 consecutive times until we find a stable interval where the fluctuation of the number of clusters is within the range of one.

After the stable interval has been determined, its beginning is denoted startK and its end endK. To discriminate the noise level accurately, three noise levels are used in this paper, namely L (less noise), M (more noise) and N (normal noise). The following formula is used:

$K = \begin{cases} startK, & level = M \\ \frac{startK + endK}{2}, & level = N \\ endK, & level = L \end{cases}$ (20)
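The selection rule in Eq. (20) can be written as a small function. This is only a sketch; the function name `choose_k` is hypothetical, and rounding the midpoint down to an integer index is an assumption, since the text does not say how a fractional midpoint is handled.

```python
def choose_k(start_k, end_k, level):
    """Pick K inside the stable interval [start_k, end_k] by noise level."""
    if level == "M":                     # more noise: start of the interval
        return start_k
    if level == "N":                     # normal noise: midpoint (rounded down)
        return (start_k + end_k) // 2
    if level == "L":                     # less noise: end of the interval
        return end_k
    raise ValueError("level must be 'L', 'M' or 'N'")

k_more_noise = choose_k(4, 9, "M")
k_normal = choose_k(4, 9, "N")
k_less_noise = choose_k(4, 9, "L")
```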

Under the premise of ensuring correct clustering results, the lower the density threshold, the better the clustering effect. Calculate the density threshold under different k values, and under the premise of stabilizing the number of clusters, the minimum density threshold is obtained and recorded as the initial density threshold. The noise data under the initial density threshold is extracted and the same operation is performed to obtain a new density threshold until the number of noise data does not meet the minimum number of noise, and the clustering results of different density thresholds are combined.

5.2

Experimentation and Analysis

5.2.1

Data pre-processing

In order to predict the recidivism behavior of community corrections personnel, this chapter explores the prediction mechanism of criminal behavior. The dataset used in the experiment is reliable and authentic data obtained from the public security authorities in District H. It contains a total of eight attributes, of which the experiment used five for the study of crime prediction: crime type, location, date, latitude, and longitude. The data for the year 2022 were selected for a systematic one-year analysis. During processing, missing and doubtful records were removed, leaving 1,926 valid records. For the time attribute, the time of occurrence of a case is represented by its month and day; for example, June 15, 2022 is encoded as 6.15, which facilitates the later DBSCAN experiments.
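The cleaning and date encoding described above can be sketched as follows. The field names and the two sample records are invented for illustration and are not taken from the District H dataset, and a two-digit day after the decimal point (so that October 5 becomes 10.05) is an assumption about the encoding.

```python
from datetime import date

REQUIRED_FIELDS = ("crime_type", "location", "date", "latitude", "longitude")

def clean(records):
    """Keep only records in which all five attributes are present."""
    return [r for r in records
            if all(r.get(field) is not None for field in REQUIRED_FIELDS)]

def encode_month_day(d):
    """Encode a date as month.day, e.g. June 15 becomes 6.15."""
    return round(d.month + d.day / 100, 2)

# Invented sample records; the second one has a missing location
records = [
    {"crime_type": "theft", "location": "Township c", "date": date(2022, 6, 15),
     "latitude": 31.72, "longitude": 122.45},
    {"crime_type": "fraud", "location": None, "date": date(2022, 7, 2),
     "latitude": 31.74, "longitude": 122.48},
]
valid_records = clean(records)
encoded_dates = [encode_month_day(r["date"]) for r in valid_records]
```

Encoding the date as a single number lets it be used as a third coordinate next to latitude and longitude, although the three axes then have different scales and may need rescaling before distance-based clustering.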

5.2.2

Data visualization

The relationship between the type of crime and the time of crime can be obtained intuitively through preliminary descriptive statistical analysis. Among the 1,926 crimes that occurred in District H in 2022, the main types of crimes are theft, fighting, drugs, gambling, obstruction, intentional injury, traffic collision, prostitution, extortion, dangerous driving, sexual assault, provocation and fraud in 13 major categories. The number of occurrences of different crime types is shown in Table 4.

Table 4.

Number of different types of crime

Crime type (by month)	1	2	3	4	5	6	7	8	9	10	11	12	Total
Theft	75	41	42	72	60	73	66	61	67	68	51	35	711
Fighting	2	4	3	3	2	1	1	1	3	0	2	3	25
Drugs	7	6	7	3	1	5	1	4	4	2	0	2	42
Gambling	6	3	5	2	3	4	5	2	7	4	3	1	45
Obstruction	1	0	2	3	1	1	2	3	4	2	0	1	20
Intentional injury	0	2	2	7	3	0	2	6	1	4	5	3	35
Traffic collision	8	0	5	4	11	5	4	6	4	5	0	7	59
Prostitution	2	1	0	3	4	2	1	1	1	1	1	1	18
Extortion	1	0	0	2	1	3	1	1	1	2	1	1	14
Dangerous driving	10	11	13	14	22	26	15	10	16	18	16	19	190
Sexual assault	0	3	1	2	4	1	0	4	2	0	3	2	22
Provocation	3	3	1	4	5	5	3	2	3	3	0	2	34
Fraud	51	20	47	45	69	69	87	53	32	47	46	43	609
Other	13	5	8	12	9	9	13	11	5	10	3	4	102
Total	179	99	136	176	195	204	201	165	150	166	131	124	1926

According to the data in Table 4, from the perspective of crime types, theft, fraud and dangerous driving are the three major crimes, accounting for 36.92%, 31.62% and 9.87% of the total number of crimes in the region for the whole year, respectively, and the local public security authorities should pay attention to these three types. From the perspective of the time of occurrence, May, June and July are the periods of high crime incidence. The high-incidence months were January, April, June and October for theft, May, June and July for fraud, and May and June for dangerous driving. Combining the characteristics of crime type and crime time, the public security authorities should strengthen the control of theft, fraud and dangerous driving in May, June and July, so as to effectively reduce the total number of crimes in District H.
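The shares and peak months quoted above follow directly from the totals in Table 4, as the short check below shows. The variable names are illustrative; the numbers are the row and column totals of Table 4.

```python
# Annual totals of the three most frequent crime types (Table 4)
annual_totals = {"theft": 711, "fraud": 609, "dangerous driving": 190}
grand_total = 1926

# Percentage share of each type in all recorded crimes
shares = {name: round(100 * count / grand_total, 2)
          for name, count in annual_totals.items()}

# Monthly totals for January to December (last row of Table 4)
monthly_totals = [179, 99, 136, 176, 195, 204, 201, 165, 150, 166, 131, 124]

# The three months with the highest number of crimes, in calendar order
ranked = sorted(range(1, 13), key=lambda month: monthly_totals[month - 1],
                reverse=True)
peak_months = sorted(ranked[:3])
```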

Visualization techniques allow the presentation of statistical results. The different types of crimes are plotted in both temporal and spatial dimensions on a three-dimensional map, as shown in Figure 5.

As can be seen from Figure 5, theft has the widest spatial and temporal distribution in the district, and it is clear that economic crime has become the type of crime that seriously affects the sense of security of the people in the district. Theft crimes are more concentrated in a specific spatio-temporal area, with latitude from 31°70′N to 31°75′N, longitude from 122°40′E to 122°50′E, and a time range from June to October. Preventive measures can therefore be taken based on the types of crimes that frequently occur in an area during a certain time period. For example, if burglary cases are frequent from April to June and dangerous driving cases are frequent from August to October in Township c, District H, then community patrols can be strengthened from April to June, and more traffic police can be assigned to the roads from August to October, together with a focus on promoting safe driving.

5.2.3

Intelligent Crime Monitoring and Early Warning Modeling

The correlation pattern of crime type and crime time can be found in the previous descriptive statistics, but more factors need to be considered for crime prediction. In this section, a crime prediction model based on the DBSCAN algorithm is constructed, and the crime type, crime location, and crime time are clustered and analyzed.

The results of the clustering experiment with the DBSCAN algorithm are shown in Figure 6. As can be seen from Fig. 6, five clusters are formed after clustering the crime data, indicating that the points in each of these five clusters are closely related in space and time. The higher the density, the higher the risk of crime in that area during that time period, and the more police force should be allocated. There are thus five crime hotspots in District H in terms of crime time and space, and they are mainly concentrated in one area: latitude 31°75′N~30°90′N, longitude 122°40′E~122°60′E. The public security organs should deploy police force with this area as the core and extend outward to the periphery.

If a series of crimes occurs in an area, the exclusion and attenuation zones around the offender’s actual residence can be identified by means of density clustering, which yields the distribution of possible residences of the suspects and helps the police to narrow down the scope of investigation. In addition, the model can monitor crime dynamics in real time, update crime hotspots, support tracking investigations and research on long-term crime problems, and assess the effectiveness of policing prevention and control through changes in density. For community correctional institutions, this model can also be used to predict crime among community corrections personnel, using the community where they serve their sentences as the space in which recidivism is predicted.

6

Conclusion

In this paper, the MApriori algorithm is used to mine the multidimensional association rules of community correctional recidivism, logistic regression analysis is used to construct the community correctional risk monitoring and early warning model, and the density-based clustering (DBSCAN) algorithm is used to predict criminal behavior. The conclusions are as follows: 1)

In the multidimensional association rule mining samples, cases in which both the current crime and the previous criminal record include theft accounted for 35.97% of the total sample data. Among the community ex-prisoners whose current offense was theft, 69.72% had a previous criminal record that also included theft. Among those whose prior record contained a theft offense, 60.38% committed another theft offense this time. The association rules show that low literacy, short sentence, young age and previous burglary are the main characteristics of reoffending, i.e., burglary tends to be repeated and frequent. Community correctional institutions should strengthen knowledge and skill training and psychological correction for community corrections personnel, and adopt a set of scientific and objective evidence-based correctional methods to prevent and reduce the likelihood of repeated burglary offenses and to improve offenders’ confidence and adaptability in returning to society.

2)

An early warning equation based on the logistic regression model was established from the personal characteristics and case characteristics of community corrections personnel. It is found that the risk of recidivism of first-time offenders is relatively high because of personal and external factors, such as changes in family and marital relationships. Those at risk are mostly young and strong males with a generally low cultural level, without a fixed place of residence or stable living conditions, and without suitable employment channels or stable sources of income, and most of them have had drug abuse experience and are addicted to drugs.

3)

Three-dimensional visualization technology and the DBSCAN clustering algorithm were used to analyze the spatial and temporal distribution of different crime types in District H. The results show that theft has the widest distribution and is more concentrated in a specific spatial and temporal range. There are five crime hotspots in the district in both space and time, and the police force should be deployed with these five hotspots as centers. The prediction method provides a basis for predicting recidivism behavior in community corrections.

Manna Xie

Published Online: Mar 19, 2025

Received: Oct 26, 2024

Accepted: Feb 02, 2025

DOI: https://doi.org/10.2478/amns-2025-0442

Keywords: Multidimensional data mining, MApriori algorithm, Logistic regression model, Density clustering

© 2025 Manna Xie, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
