Effectiveness of imbalanced data handling methods in credit scoring: The case of Vietnamese commercial banks


    ICYREB 2020


    Nguyen Thi Lien, MS; Nguyen Thi Thu Trang, MS; Nguyen Thi Dung
    National Economics University

    This article investigates the effectiveness of methods for handling imbalanced data in the
    problem of customer classification at commercial banks. Class imbalance, where observations
    of one class outnumber those of the other, is a common issue in customer classification. We
    apply widely used methods, including undersampling, oversampling, bothsampling, and SMOTE
    (Synthetic Minority Oversampling Technique), to deal with the imbalance. The logit model is
    then fitted to the datasets processed by these methods to classify customers. Using 7501
    transaction records from individual customers, we find that the classification results on data
    processed with these techniques all improve significantly compared with those on untreated
    data. The results also show that the most efficient method is SMOTE combined with the logit
    model using variables transformed by Weight of Evidence (WOE).
    Keywords: Bothsampling, credit scoring, oversampling, SMOTE, undersampling, WOE.




    1. Introduction
    In classification problems, we aim to predict the class of each observation with the highest
    possible accuracy. However, the accuracy of classification algorithms can be degraded by
    imbalanced data (López et al., 2013), where observations of one class outnumber those of the
    other class(es) by at least 5:1 in the binary case (He & Garcia, 2009). To improve prediction
    accuracy and lower computational cost, it is also advisable to identify and quantify the most
    relevant input variables of the model beforehand. Imbalanced data appears in many fields,
    such as fault detection (Gong & Qiao, 2012; Silva et al., 2006; Wei et al., 2013), toxin
    detection (Harley et al., 2020), medical diagnosis (Ertekin et al., 2007), and customer churn
    prediction (Bing Zhu et al., 2017). In medicine, for example, if cancer patients form the
    positive class and all other persons form the negative class, the two groups differ greatly in
    their number of observations. Other examples of rare events include software defects
    (Rodriguez et al., 2014), cancer gene expressions (Wu et al., 2012), fraudulent credit card
    transactions (Kundu et al., 2009), fraud detection in telecommunications (Augustin, 2012),
    and natural disasters (Kim et al., 2016). The most prevalent class is called the majority
    class, while the rarest class is called the minority class (Huang et al., 2016).
    Some techniques for handling imbalanced data work at the data level, while others work at the
    algorithmic level. Approaches to the imbalance problem fall into four categories:
    preprocessing strategies, cost-sensitive learning methods, adaptations of machine learning
    techniques, and combinations of the previous three. Preprocessing strategies include
    resampling techniques and/or variable importance analysis. Resampling algorithms comprise
    oversampling and undersampling methods. They rebalance the sample space of an imbalanced
    dataset to alleviate the effect of the skewed class distribution on the learning process, and
    they are versatile because they are independent of the selected classifier (López et al.,
    2013). Oversampling methods either create synthetic minority samples or randomly duplicate
    existing minority samples. Several studies (Chawla et al., 2002; Estabrooks et al., 2004;
    Tahir et al., 2009) used random undersampling to manage imbalanced datasets. The
    undersampling method drops non-default observations to reduce the data imbalance. However,
    these methods have limitations such as overfitting or losing essential information. The SMOTE
    (Synthetic Minority Oversampling Technique) method (Chawla et al., 2002; Batista et al.,
    2004) adds instances to the minority class using the KNN (K Nearest Neighbor) algorithm. The
    added observations have properties close to those of the original observations, which reduces
    the level of imbalance in the dataset. Several algorithms have been developed from the
    original SMOTE method, including SMOTE-N (Synthetic Minority Oversampling Technique Nominal),
    SMOTE-NC (Synthetic Minority Oversampling Technique Nominal Continuous), and MWMOTE (Majority
    Weighted Minority Oversampling Technique; Barua et al., 2012). Han et al. (2005) introduced
    two methods, Borderline-SMOTE 1 and Borderline-SMOTE 2, which only oversample the
    observations near the class border to reduce the misclassification rate. The SMOTEBoost
    method (Chawla et al., 2003) improved SMOTE by incorporating the AdaBoost.M2 algorithm. The
    EasyEnsemble method (Liu et al., 2009) increased the efficiency of the undersampling
    algorithm by adding useful information to the clustering method to reduce the noise level of
    the generated sample. The BalanceCascade method (Han et al., 2009) guides the deletion of
    observations instead of deleting them at random. Finally, bothsampling combines the above
    methods (Dubey et al., 2014).
    Traditional classification methods for predicting the probability of default of a loan
    include logistic regression (Hosmer et al., 2000; Maalouf and Trafalis, 2011; Shu et al.,
    2014; Maratea et al., 2014), neural networks, and decision trees (Quinlan, 1998; Sarkar et
    al., 2016). Other methods that have been applied include gradient boosting and least squares
    support vector machines (Jin et al., 2012). These models follow either a statistical or an
    artificial intelligence (AI) orientation.
    Input variable analysis for these algorithms has gained attention in many practical
    applications (Ferretti et al., 2016) because of the complexity of interactions among
    variables in large datasets. Variable analysis is a crucial task: it improves model
    interpretability, reduces computational cost, optimizes data storage, and yields a smaller
    number of relevant input variables. Several approaches have been considered that group input
    variables into new categories or values without losing explanatory power. Principal Component
    Analysis (PCA) is only suitable for data that are normally distributed, near-standard, or
    have linearly related features (Jolliffe and Cadima, 2016). Other analyses construct a model
    based on the Weight of Evidence (WOE) and Information Value (IV) in clustering algorithms
    (Polykretis and Chalkias, 2018).
    In Vietnam, an undersampling algorithm that randomly removes elements on the data boundary
    has been introduced for medical data (Phuong et al., 2015). Research on SMOTE techniques
    combined with machine learning for credit card fraud detection (Lien et al., 2018) shows that
    the algorithm is suitable for detecting credit card fraud in Vietnam.
    Currently, commercial banks in Vietnam face many difficulties in building internal risk
    management models that comply with Basel II capital standards (issued by the Basel Committee
    on Banking Supervision (BIS, 2004)) and Circular No. 41/2016/TT-NHNN (issued by the State
    Bank of Vietnam). Specific difficulties arise in data collection, data processing, and the
    selection and evaluation of the internal risk management model. Moreover, according to
    VietstockFinance (2020), the bad debt ratios of Vietnamese commercial banks were all below
    3.42% in 2019. This means that the number of non-performing loans in the datasets is much
    smaller than that of performing loans. As a result, imbalanced datasets affect the efficiency
    of the algorithms that classify good and bad loans at Vietnamese commercial banks: they can
    bias predictions in favor of the majority group (Ganganwar, 2012).
    In practice, there is a need for imbalanced data processing algorithms and debt
    classification models that suit the Vietnamese context. Earlier studies have yet to
    investigate the effectiveness of variable analysis based on WOE grouping after the
    imbalanced datasets have been processed. This article aims to handle an imbalanced credit
    dataset with oversampling, undersampling, bothsampling, and SMOTE algorithms in the context
    of Vietnamese banks. We then apply the processed data to logistic regression with input
    variables transformed by WOE in order to evaluate the importance of these algorithms and
    select an efficient credit scoring method for commercial banks in Vietnam.



    2. Credit scoring method at commercial banks
    The credit scoring process at commercial banks consists of several steps. The first step is
    data processing. One of the challenging problems when applying regression models in practice
    is data quality. Data processing is also time-consuming, accounting for nearly 80% of the
    total time. It includes segmenting, sampling, data partitioning, and handling missing data
    and outliers. The next step is scorecard construction. After data cleaning, the values of
    each variable are grouped by binning. The new value assigned to each group is its WOE (Weight
    of Evidence), which expresses the predictive power of an independent variable in relation to
    the dependent variable:

    $$\mathrm{WOE}_i = \ln\left(\frac{\text{Distribution of good}_i}{\text{Distribution of bad}_i}\right)$$

    Distribution of good is the percentage of good customers in a particular group i.
    Distribution of bad is the percentage of bad customers in a particular group i.
    The WOE approach is suitable for the logistic regression model because it makes scoring
    convenient and removes the need to deal with missing data. Binning reduces the number of
    attributes: if the original variables were used in the regression model, a great many dummy
    variables would be created. Binning is also useful for variables whose relationship to the
    dependent variable is nonlinear, because it helps capture nonlinear effects in a linear
    model. The next step is to remove some variables before running the model, namely those with
    a poor ability to differentiate between good and bad accounts. Eliminating them before
    running the model can improve model quality and shorten processing time afterwards. The
    Information Value (IV) is used to remove weak variables (Siddiqi, 2012). The IV is
    calculated using the following formula, where n is the number of groups:

    $$\mathrm{IV} = \sum_{i=1}^{n} \left(\text{Distribution of good}_i - \text{Distribution of bad}_i\right) \times \mathrm{WOE}_i$$

    Criteria for evaluating the predictiveness of a variable according to IV:

    Table 1. The assessment of IV

    IV            Assessment
    < 0.02        Not useful for prediction
    0.02 – 0.1    Weak predictive power
    0.1 – 0.3     Medium predictive power
    0.3 – 0.5     Strong predictive power
    > 0.5         Suspicious predictive power

    (Source: Siddiqi, 2012)



    Based on the IV, variables that are not useful for prediction or have weak predictive power
    are removed from the model immediately. For the remaining variables, expert opinion is needed
    to drop unnecessary variables without losing good factors.
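    The WOE and IV calculations described above can be sketched as follows. This is a minimal
    illustration, not the authors' implementation: the function name woe_iv, the bin labels, and
    the counts are all hypothetical, and the sketch assumes that every bin contains at least one
    good and one bad customer.

```python
import math

def woe_iv(groups):
    """groups maps a bin label to (n_good, n_bad) counts for one binned variable.

    Returns (woe_by_bin, iv). Assumes no bin has a zero count; real data
    would need smoothing or merging of such bins.
    """
    total_good = sum(g for g, _ in groups.values())
    total_bad = sum(b for _, b in groups.values())
    woe, iv = {}, 0.0
    for label, (n_good, n_bad) in groups.items():
        dist_good = n_good / total_good  # share of good customers in this bin
        dist_bad = n_bad / total_bad     # share of bad customers in this bin
        woe[label] = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe[label]
    return woe, iv

# Hypothetical income bins with made-up counts (not from the paper's data).
bins = {"low": (200, 60), "medium": (500, 30), "high": (300, 10)}
woe, iv = woe_iv(bins)
```

    With these made-up counts the "low" bin receives a negative WOE (relatively more bad
    customers) and the other bins a positive one. The resulting IV is above 0.5, which Table 1
    would flag as suspicious.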
    Logit regression method
    The dependent variable is the default status of the customer, denoted by D (D = 0 if the
    customer pays on time, D = 1 if the customer pays late). The relationship between the
    customer characteristics X_1, ..., X_k and repayment ability is modelled through the logit
    function as follows:

    $$P(D = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)}}$$

    The maximum likelihood method is applied to estimate the coefficients β, from which the
    default probability of each customer is calculated.
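    As an illustration of the estimation step, the sketch below fits a logit model by plain
    gradient ascent on the log-likelihood. It is a toy under stated assumptions: the data are
    made up, the names default_probability and fit_logit are hypothetical, and statistical
    packages normally use faster solvers such as Newton-type methods.

```python
import math

def default_probability(beta, x):
    """P(D = 1 | x) under the logit model; beta[0] is the intercept."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(X, d, lr=0.1, n_iter=5000):
    """Maximise the log-likelihood by gradient ascent (illustrative only)."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for x, di in zip(X, d):
            err = di - default_probability(beta, x)  # observed minus predicted
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

# Hypothetical single-feature data (for example a WOE-coded variable).
X = [[-1.0], [-0.5], [0.0], [0.5], [1.0], [1.5]]
d = [1, 1, 0, 1, 0, 0]
beta = fit_logit(X, d)
p_low = default_probability(beta, [-1.0])
p_high = default_probability(beta, [1.5])
```

    In this toy data, defaults are concentrated at low feature values, so the fitted slope is
    negative and p_low exceeds p_high.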
    The criteria for selecting variables for the logit model are statistical significance and
    expert judgement. Operational efficiency is evaluated at the same time: this covers the
    resources needed for data collection and whether the use of each variable is consistent with
    the provisions of law.
    3. Algorithms for handling imbalanced datasets
    The common method of increasing the number of instances in the group of bad accounts is to
    repeat randomly chosen data from this group. A set of examples is selected at random from
    the minority group, replicated, and added back to the group. As a result, the total number
    of examples in the bad group increases and the class distribution becomes more balanced.
    This random method simply replicates a portion of the minority class in order to increase
    the weight of those examples. Because the replication process is entirely random, the method
    only recreates examples that already exist in the original minority class. Its main drawback
    is therefore that overfitting can occur. This method is the most fundamental oversampling
    technique, and many other oversampling algorithms used in practice are developed from it
    (Peng Jun Huang, 2015).
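    A minimal sketch of random oversampling, assuming the goal is to bring the minority class up
    to the size of the majority class. The function name random_oversample, the balancing target,
    and the toy data are assumptions made for illustration.

```python
import random

def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_majority = len(labels) - len(minority_idx)
    n_extra = max(0, n_majority - len(minority_idx))
    # Sampling with replacement: the same minority row may be copied many times.
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data: 8 good (0) and 2 bad (1) observations.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
new_rows, new_labels = random_oversample(rows, labels)
```

    Every added row is an exact copy of an existing bad observation, which is why the method can
    lead to overfitting.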
    The undersampling method randomly removes observations from the good group. The downside of
    this technique is that it may eliminate useful observations from the majority group.
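    Random undersampling can be sketched in the same style. The function name
    random_undersample and the choice to keep exactly as many good observations as there are bad
    ones are assumptions for illustration.

```python
import random

def random_undersample(rows, labels, majority=0, seed=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    n_keep = min(len(majority_idx), len(minority_idx))
    kept = rng.sample(majority_idx, n_keep)  # sampling without replacement
    keep = sorted(kept + minority_idx)
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data: 8 good (0) and 2 bad (1) observations.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
small_rows, small_labels = random_undersample(rows, labels)
```

    Here six of the eight good observations are discarded, which shows how much information the
    method can lose when the imbalance is strong.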
    A common technique in undersampling is weighted sampling, which works as follows. First,
    calculate the weighted Euclidean distance of each negative instance from each of the
    positive instances. Each feature is weighted by its Fisher's discriminant score (F1
    measure), which measures the class overlap per attribute. Next, for each positive sample,
    sort the negative instances in ascending order of distance from that positive sample.
    Finally, for each positive instance, select a user-defined number of negative instances;
    this number sets the desired ratio of negative to positive samples. At this stage, special
    care is taken to avoid selecting the same negative sample twice: if a particular negative
    sample has already been selected, the next available negative sample is selected instead
    (Fernández et al., 2018).
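    The weighted sampling steps can be sketched as below. The text does not give the formula for
    the Fisher score, so this sketch assumes the common per-feature definition (squared gap
    between class means divided by the sum of class variances). The function names and toy data
    are hypothetical.

```python
import math

def fisher_scores(pos, neg):
    """Per-feature score: (mean gap)^2 / (sum of class variances). Assumed definition."""
    scores = []
    for j in range(len(pos[0])):
        p = [x[j] for x in pos]
        n = [x[j] for x in neg]
        mp, mn = sum(p) / len(p), sum(n) / len(n)
        vp = sum((v - mp) ** 2 for v in p) / len(p)
        vn = sum((v - mn) ** 2 for v in n) / len(n)
        scores.append((mp - mn) ** 2 / (vp + vn + 1e-12))  # guard against zero variance
    return scores

def weighted_undersample(pos, neg, per_positive=1):
    """For each positive, keep its nearest not-yet-selected negatives."""
    w = fisher_scores(pos, neg)

    def dist(a, b):
        return math.sqrt(sum(wj * (aj - bj) ** 2 for wj, aj, bj in zip(w, a, b)))

    taken, chosen = set(), []
    for p in pos:
        ranked = sorted(range(len(neg)), key=lambda i: dist(p, neg[i]))
        picked = 0
        for i in ranked:
            if picked == per_positive:
                break
            if i not in taken:  # skip negatives already selected for another positive
                taken.add(i)
                chosen.append(i)
                picked += 1
    return [neg[i] for i in chosen]

pos = [[0.0, 0.0], [1.0, 1.0]]
neg = [[0.1, 0.1], [5.0, 5.0], [1.2, 0.9], [9.0, 9.0]]
kept = weighted_undersample(pos, neg, per_positive=1)
```

    With per_positive=1, each positive keeps only its closest negative, so the two distant
    negatives are dropped.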
    Because oversampling produces many repeated bad observations and undersampling can lose
    original data, the two methods can be combined in the hope of mitigating both problems.
    However, the combination may not yield good results either, because it simultaneously
    multiplies bad observations and randomly removes good ones.
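    One possible reading of this combination is sketched below: the dataset keeps its original
    size, about half of it drawn from the good group without replacement and the rest drawn from
    the bad group with replacement. The source does not specify the exact scheme, so the 50/50
    target and the name both_sample are assumptions.

```python
import random

def both_sample(rows, labels, minority=1, seed=0):
    """Undersample the majority and oversample the minority to a roughly 50/50 mix."""
    rng = random.Random(seed)
    mino = [i for i, y in enumerate(labels) if y == minority]
    majo = [i for i, y in enumerate(labels) if y != minority]
    n_major = min(len(labels) // 2, len(majo))
    picked = rng.sample(majo, n_major)  # undersampling, without replacement
    # Oversampling, with replacement, fills the rest of the original size.
    picked += [rng.choice(mino) for _ in range(len(labels) - n_major)]
    rng.shuffle(picked)
    return [rows[i] for i in picked], [labels[i] for i in picked]

# Toy data: 18 good (0) and 2 bad (1) observations.
rows = [[i] for i in range(20)]
labels = [0] * 18 + [1] * 2
mix_rows, mix_labels = both_sample(rows, labels)
```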
    SMOTE (Synthetic Minority Over-sampling Technique)
    The SMOTE algorithm is an oversampling approach, but the key difference is that it
    introduces synthetic examples instead of replicating existing instances. To create these new
    data points, the KNN algorithm is applied within the minority group to find the k nearest
    neighbors of each minority observation. New observations are then created between an
    observation and its neighbors. For this reason, the procedure is said to operate in the
    “feature space” rather than in the “data space”. For example, a positive instance xi is
    selected as the basis for creating new synthetic data points. Based on a distance metric,
    several nearest neighbors of the same class (points xi1 to xi4) are chosen from the training
    set. Finally, a randomized interpolation is carried out to obtain the new instances r1 to r4
    (Fernández et al., 2018).
    SMOTE treats nominal and continuous attributes in different ways. When determining the
    nearest neighbors, it uses the Euclidean distance for continuous attributes and the Value
    Distance Metric for nominal attributes.
    For continuous attributes:
    - Take the difference between the feature vector under consideration (a minority class
      sample) and one of its k nearest neighbors (also a minority class sample).
    - Multiply this difference by a random number between 0 and 1.
    - Add the result to the original feature vector to obtain a new feature vector.
    For nominal attributes:
    - Take the majority vote between the feature vector under consideration and its k nearest
      neighbors. In the case of a tie, choose at random.
    - Assign that value to the new synthetic minority class sample.
    Using SMOTE creates additional regions for the minority class, thereby allowing
    classification methods to predict more minority cases.
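    The steps for continuous attributes can be sketched as follows. This is an illustration
    only: it ignores nominal attributes, assumes at least two minority samples, and uses the
    hypothetical name smote. Production work would normally rely on a maintained library rather
    than this sketch.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the point itself.
        others = [x for x in minority if x is not base]
        neighbours = sorted(others, key=lambda x: math.dist(base, x))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random number in [0, 1)
        # New point = base + gap * (neighbour - base), feature by feature.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Four hypothetical minority observations at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

    Each synthetic point lies on the segment between a minority observation and one of its
    neighbors, so it stays within the region already occupied by the minority class.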
    SMOTEBoost algorithm
    The SMOTEBoost algorithm integrates the SMOTE algorithm into the standard boosting process
    and thus gains the benefits of both. It applies to data with medium and high levels of
    imbalance, improves the prediction results for the minority class, and improves the F-value.
    Boosting improves the predictive ability of classifiers by recalculating the weights of
    misclassified observations, while SMOTE only improves the classification of minority cases.
    Attaching SMOTE to boosting makes boosting focus more on minority cases than on majority
    cases.
    SMOTEBoost implicitly increases the weight of false negatives in the distribution because
    the SMOTE algorithm increases the number of minority observations. So, in the next
    iteration, SMOTEBoost can create a wider decision region for the minority class. It combines
    SMOTE's Recall-improving power with boosting's Precision-improving power. Altogether, it
    improves the F-value.
    4. Validation
    In classification problems, two tools are used to evaluate model performance: the confusion
    matrix and the AUROC curve.
    Confusion Matrix
    In these problems, it is common to define the more critical class as the Positive class (P)
    and the other as the Negative class (N). In a good-bad classification problem, bad is
    Positive and good is Negative. From there, we define True Positive (TP), False Positive
    (FP), True Negative (TN), and False Negative (FN) to create the unnormalized confusion
    matrix as follows:
    Table 2. Confusion matrix

                     Predicted “bad”         Predicted “good”
    Actual “bad”     True Positives (TP)     False Negatives (FN)
    Actual “good”    False Positives (FP)    True Negatives (TN)

    The evaluation criteria derived from the confusion matrix are as follows:

    Table 3. Evaluation criteria

                     Predicted “bad”         Predicted “good”
    Actual “bad”     TPR = TP/(TP+FN)        FNR = FN/(TP+FN)
    Actual “good”    FPR = FP/(FP+TN)        TNR = TN/(FP+TN)

    FPR (False Positive Rate) is also known as the false alarm rate, and FNR (False Negative
    Rate) is also known as the omission rate.
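    The counts in Table 2 and the rates in Table 3 can be computed as in the sketch below, with
    bad coded as 1 (Positive) and good as 0 (Negative). The function name confusion_rates and
    the toy labels are assumptions for illustration.

```python
def confusion_rates(actual, predicted, positive=1):
    """Return the confusion-matrix counts and the four rates of Table 3."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return {
        "TP": tp, "FN": fn, "FP": fp, "TN": tn,
        "TPR": tp / (tp + fn), "FNR": fn / (tp + fn),
        "FPR": fp / (fp + tn), "TNR": tn / (fp + tn),
    }

# Toy example: 3 bad and 7 good customers.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
rates = confusion_rates(actual, predicted)
```

    In this toy example two of the three bad customers are detected (TPR = 2/3) and one of the
    seven good customers is wrongly flagged (FPR = 1/7).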



    For classification problems in which the class sizes differ greatly, a commonly used
    performance measure is Precision-Recall.
    Precision = TP/(TP+FP)
    Recall = TP/(TP+FN) = TPR
    High precision means that most of the observations predicted as bad are truly bad. Precision
    = 1, or FP = 0, means that all observations predicted to be bad are actually bad. However,
    this does not guarantee that all bad observations are found.
    High recall means a high True Positive Rate, that is, a low rate of missed bad observations.
    Recall = 1, or FN = 0, means that all bad observations are found. However, it does not
    guarantee that all observations predicted as bad are actually bad.
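    Therefore, a good classification model has both Precision and Recall as close to 1 as
    possible. To measure the quality of the classifier based on both Precision and Recall, we
    use the F1 score, which is the special case β = 1 of the F_β score:

    $$F_\beta = (1 + \beta^2)\,\frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

    Choosing β greater than 1 values Recall over Precision; conversely, β less than 1 means that
    Precision is more important than Recall. Two commonly used values of β are 2 and 0.5.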
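    A short worked example of the F_β score, using the standard definition and made-up counts
    (TP = 30, FP = 10, FN = 20). The function name f_beta is hypothetical.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up counts: Precision = 30/40 = 0.75, Recall = 30/50 = 0.60.
f1 = f_beta(30, 10, 20)              # balances Precision and Recall
f2 = f_beta(30, 10, 20, beta=2.0)    # leans towards Recall
f05 = f_beta(30, 10, 20, beta=0.5)   # leans towards Precision
```

    Because Recall is lower than Precision in this example, the score falls as β grows: F_0.5 is
    the largest of the three values and F_2 the smallest.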
    AUROC curve
    The values of FPR and TPR change when the threshold separating good from bad changes. When
    the threshold decreases, both FPR and TPR increase, which means more false alarms and fewer
    omissions. Conversely, when the threshold increases, both FPR and TPR decrease, which means
    more omissions and fewer false alarms.

    (Source: Sarang Narkhede, 2018)

    Figure 1. Receiver Operating Characteristic curve or ROC curve



    For each threshold, there is one pair of values (FPR, TPR). Plotting the points (FPR, TPR)
    on a graph while the threshold changes from 0 to 1 gives a line called the Receiver
    Operating Characteristic curve, or ROC curve.
    The graph has FPR on the horizontal axis (denoted 1 – Specificity in some graphs) and TPR on
    the vertical axis (denoted Sensitivity in some graphs).
    The ROC curve shows whether a model is useful. An efficient model has low FPR and high TPR,
    so there is a point on its ROC curve close to the point (0, 1) in the upper left corner of
    the graph. The closer the curve is to this point, the more efficient the model is.
    Another parameter used to evaluate a model is the Area Under the Curve, or AUROC, which is
    the area under the ROC curve. This value is a positive number less than or equal to 1. The
    larger the value, the better the model. The AUROC criteria are interpreted as follows:
    Table 4. Assessment of model by AUROC

    AUROC Assessment of model
    > 0.90 Excellent
    0.80 – 0.90 Good
    0.70 – 0.80 Fair
    0.60 – 0.70 Poor
    0.50 – 0.60 Fail

    (Source: D’Agostino et al., 2013)

    Gini coefficient
    The Gini coefficient is calculated according to the formula (Schechtman, 2016):
    Gini = 2 × AUROC – 1
    This coefficient is also used to evaluate the relevance or significance of models. Its values
    lie between 0 and 1, where a score of 1 means that the model is 100% accurate in predicting
    the outcome. The closer the Gini is to 1, the better the model is. On the other hand, a Gini
    score of 0 (AUROC = 0.5) means that the model does no better than random guessing.
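    The ROC curve, AUROC, and Gini coefficient can be computed from predicted scores as in the
    sketch below, where a higher score means "more likely bad". The scores and labels are made
    up, the function names are hypothetical, and the area is obtained with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the threshold over every distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted bad
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def gini(auc):
    return 2 * auc - 1

# Made-up scores for 4 bad (1) and 4 good (0) customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
points = roc_points(scores, labels)
auc = auroc(points)
```

    The same formula links the two columns of a results table: an AUROC of 0.8208, for example,
    corresponds to a Gini of about 0.642.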
    5. Results of experimental research
    The article uses a credit dataset from a commercial bank in Vietnam, comprising 7052
    observations, of which 456 are bad. With a bad rate of 7.13%, the data is imbalanced between
    the good and bad groups. The customer information in the data is encrypted. After outlier
    processing, the imbalanced data is handled by oversampling, undersampling, bothsampling, and
    SMOTE. The processed datasets are then divided into a training set (70%) and a testing set
    (30%).
    For each generated dataset, we run the logistic model with the original variables and with
    WOE-grouped variables for credit scoring. The model evaluation results are as follows:



    Table 5. Results of logistic models

    Model  Logistic variables  Method         AUROC   Gini    Precision  Recall  F_score
    [1]    Original            Original       0.5521  0.1042  0.008      0.999   0.023
    [2]    Original variables  Oversampling   0.6271  0.254   0.581      0.499   0.308
    [3]    Original variables  Undersampling  0.6089  0.3166  0.628      0.784   0.448
    [4]    Original variables  Bothsampling   0.6296  0.2646  0.595      0.627   0.372
    [5]    Original variables  SMOTE          0.6105  0.221   0.628      0.580   0.353
    [6]    (WOE) grouping      Oversampling   0.7051  0.4196  0.656      0.639   0.385
    [7]    (WOE) grouping      Undersampling  0.6228  0.244   0.599      0.634   0.376
    [8]    (WOE) grouping      Bothsampling   0.6978  0.3424  0.636      0.618   0.373
    [9]    (WOE) grouping      SMOTE          0.8208  0.642   0.757      0.742   0.447

    (Source: author’s calculation)
    Using the original imbalanced dataset, the logistic model has an AUROC of 0.5521 and a Gini
    coefficient of 0.1042. This result shows that the model is not able to distinguish between
    good and bad observations. The low Precision and high Recall (0.008 and 0.999, respectively)
    mean that the accuracy of the bad observations found is low. As a result, the predictions
    are biased towards the majority group.
    After processing with the oversampling, undersampling, bothsampling, and SMOTE methods, the
    validation results of the logit regression models on the processed datasets improve compared
    with model [1]. The results show that handling the imbalance increases the efficiency of the
    classification method. The AUROC values of these logistic models rise above 0.6, showing
    that the models have some discriminatory power. In addition, the undersampling method shows
    better results than the other three methods: its Gini, Recall, and F_score are the highest,
    and its Precision is tied for the highest. However, the Gini coefficients remain in the range
    of roughly 0.2 – 0.3, showing that the classification ability of these models is still quite
    weak.



    To improve the discriminatory power of the model, we combine logistic regression with
    variable analysis. Validation results of the models on the processed datasets show a marked
    improvement in the evaluation criteria of classification ability. The AUROC results show that
    using WOE-binned variables improves the ability to classify good and bad customers. In
    particular, model [9], which combines SMOTE with the logistic model on WOE-grouped
    variables, improves significantly, with a high AUROC (0.821) and Gini coefficient (0.642).
    This model also has the highest Precision (0.757) and Recall (0.742) among models [6], [7],
    [8], and [9]. The same holds for the F_score.
    6. Conclusion
    The imbalanced data problem occurs in many fields and reduces the efficiency of
    classification regression. Several algorithms have been introduced to handle this problem at
    the data preprocessing step, and this study examines a number of them. Each method has both
    advantages and drawbacks. Banks must consider the rationality, stability, strength, and
    complexity of each method. The oversampling algorithm is easy to implement. However,
    although the data size increases, the added observations are repeats of the original ones,
    so in some cases oversampling classifies ineffectively. The undersampling technique, which
    randomly removes data from the good observation group, is also easy to perform, but it can
    eliminate useful observations from the majority group. The SMOTE method overcomes the
    disadvantages of oversampling by supplementing the minority class with entirely new,
    synthetically generated observations, and it gives better results in most classification
    cases.
    We have tested the effectiveness of the oversampling, undersampling, bothsampling, and SMOTE
    algorithms on specific data from Vietnam. Based on the AUROC, Gini, Recall, Precision, and
    F-score, the results show that imbalanced datasets have a negative effect on the ability to
    classify good and bad customers. Furthermore, the undersampling technique is the most
    efficient when used in the logit model with the original variables. When variable analysis
    based on WOE grouping is added to improve the performance of the classification regression,
    the SMOTE algorithm shows outstanding efficiency compared with the other methods.
    Appendix: The Figures represent the AUROC

    Figure 2. AUROC of the logistic model with original variables
    Figure 3. AUROC of the logistic model with oversampling data
    Figure 4. AUROC of the logistic model with undersampling data
    Figure 5. AUROC of the logistic model with bothsampling data
    Figure 6. AUROC of the logistic model with SMOTE data
    Figure 7. AUROC of the logistic model clustering variables by WOE with oversampling data
    Figure 8. AUROC of the logistic model clustering variables by WOE with undersampling data
    Figure 9. AUROC of the logistic model clustering variables by WOE with bothsampling data
    Figure 10. AUROC of the logistic model clustering variables by WOE with SMOTE data

    References
    1. Augustin, S., Gaißer, C., Knauer, J., Massoth, M., Piejko, K., Rihm, D., & Wiens, T.
    (2012). Telephony fraud detection in next generation networks. Proceedings of the AICT,
    203-207.
    2. Barua, S., Islam, M. M., Yao, X., & Murase, K. (2012). MWMOTE - majority weighted
    minority oversampling technique for imbalanced data set learning. IEEE Transactions on
    Knowledge and Data Engineering, 26(2), 405-425.
    3. Basel II: Revised international capital framework. bis.org. 2004-06-10.
    4. Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several
    methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter,
    6(1), 20-29.
    5. Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003, September). SMOTEBoost:
    Improving prediction of the minority class in boosting. In European conference on principles
    of data mining and knowledge discovery (pp. 107-119). Springer, Berlin, Heidelberg.
    6. Dubey, R., Zhou, J., Wang, Y., Thompson, P. M., Ye, J., & Alzheimer’s Disease
    Neuroimaging Initiative. (2014). Analysis of sampling techniques for imbalanced data: An n =
    648 ADNI study. NeuroImage, 87, 220-241.


    ICYREB 2020

    7. Ertekin, S., Huang, J., Bottou, L., & Giles, L. (2007, November). Learning on the border:
    active learning in imbalanced data classification. In Proceedings of the sixteenth ACM conference
    on Conference on information and knowledge management (pp. 127-136).
    8. Fernández, A., Garcia, S., Herrera, F., & Chawla, N. V. (2018). SMOTE for learning
    from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of ar-
    tificial intelligence research, 61, 863-905.
    9. Ferretti, F., Saltelli, A., & Tarantola, S. (2016). Trends in sensitivity analysis practice in
    the last decade. Science of the total environment, 568, 666-670.
    10. Ganganwar, V. (2012). An overview of classification algorithms for imbalanced
    datasets. International Journal of Emerging Technology and Advanced Engineering, 2(4), 42-47.
    11. Gong, X., & Qiao, W. (2012). Imbalance fault detection of direct-drive wind turbines
    using generator current signals. IEEE Transactions on energy conversion, 27(2), 468-476.
    12. Han, H., Wang, W.Y. and Mao, B.H., 2005, August. Borderline-SMOTE: a new over-
    sampling method in imbalanced data sets learning. In International conference on intelligent
    computing (pp. 878-887). Springer, Berlin, Heidelberg.
    13. Han, S., Yuan, B. and Liu, W., 2009, November. Rare class mining: progress and
    prospect. In 2009 Chinese Conference on Pattern Recognition (pp. 1-5). IEEE.
    14. Harley, J. R., Lanphier, K., Kennedy, E., Whitehead, C., & Bidlack, A. (2020). Random
    forest classification to determine environmental drivers and forecast paralytic shellfish toxins in
    Southeast Alaska with high temporal resolution. Harmful Algae, 99, 101918.
    15. He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on
    knowledge and data engineering, 21(9), 1263-1284.
    16. Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X., 2000. Introduction to the logistic
    regression model. Applied Logistic Regression, 15, pp.1-30.
    17. https://vietstock.vn/2020/02/buc-tranh-no-xau-ngan-hang-nam-2019-757-730262.htm
    18. Huang, C., Li, Y., Loy, C. C., & Tang, X. (2016). Learning deep representation for im-
    balanced classification. In Proceedings of the IEEE conference on computer vision and pattern
    recognition (pp. 5375-5384).
    19. Huang, P.J., 2015. Classication of Imbalanced Data Using Synthetic Over-Sampling
    Techniques (Doctoral dissertation, UCLA).
    20. Jin, Y., Yang, K., Wu, Y. J., Liu, X. S., & Chen, Y. (2012). Application of particle swarm
    optimization based least square support vector machine in quantitative analysis of extraction so-
    lution of safflower using near-infrared spectroscopy. Fenxi Huaxue, 40(6), 925-931.
    21. Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: a review and recent
    developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and
    Engineering Sciences, 374(2065), 20150202.


    ICYREB 2020

    22. Kim, S., Kim, H., & Namkoong, Y. (2016). Ordinal classification of imbalanced data
    with application in emergency and disaster information services. IEEE Intelligent Systems, 31(5),
    23. Kundu, A., Panigrahi, S., Sural, S., & Majumdar, A. K. (2009). Blast-ssaha hybridiza-
    tion for credit card fraud detection. IEEE transactions on dependable and Secure Computing, 6(4),
    24. Liu, T.Y., 2009, August. Easyensemble and feature selection for imbalance data sets.
    In 2009 international joint conference on bioinformatics, systems biology and intelligent com-
    puting (pp. 517-520). IEEE.
    25. López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into
    classification with imbalanced data: Empirical results and current trends on using data intrinsic
    characteristics. Information sciences, 250, 113-141.
    25. Maalouf, M., & Trafalis, T. B. (2011). Robust weighted kernel logistic regression in
    imbalanced and rare events data. Computational Statistics & Data Analysis, 55(1), 168-183.
    27. 2Maratea, A., Petrosino, A., & Manzo, M. (2014). Adjusted F-measure and kernel scal-
    ing for imbalanced data learning. Information Sciences, 257, 331-341.
    28. Naeem Siddiqi. (2012). Credit Risk Scorecards: Developing and Implementing Intel-
    ligent Credit Scoring. DOI:10.1002/9781119201731
    29. Narkhede, S. (2018). Understanding AUC-ROC Curve. Towards Data Science, 26.
    30. Nguyen Thi Lien, nguyen thi thu Trang, Nguyen Chien Thang. (2018). Machine learn-
    ing techniques to detect credit card fraud. Journal of economics and development.
    31. Phương, N.M., Tuyết, T.T.Á., Hồng, N.T. and Thọ, Đ.X.. (2016). Random border un-
    dersampling. Science and Technology.
    32. Polykretis, C., & Chalkias, C. (2018). Comparison and evaluation of landslide suscep-
    tibility maps obtained from weight of evidence, logistic regression, and artificial neural network
    models. Natural hazards, 93(1), 249-274.
    33. Quinlan, J.R., 1998. Miniboosting decision trees. Journal of Artificial Intelligence Re-
    search, pp.1-15.
    34. Rodriguez, D., Herraiz, I., Harrison, R., Dolado, J., & Riquelme, J. C. (2014, May).
    Preliminary comparison of techniques for dealing with imbalance in software defect prediction.
    In Proceedings of the 18th International Conference on Evaluation and Assessment in Software
    Engineering (pp. 1-10).
    35. Sarkar, S., Raj, R., Vinay, S., Maiti, J., & Pratihar, D. K. (2019). An optimization-based
    decision tree approach for predicting slip-trip-fall accidents at work. Safety science, 118, 57-69.
    36. Schechtman, E., & Schechtman, G. (2016). The Relationship between Gini Methodol-
    ogy and the ROC curve. Available at SSRN 2739245.


    ICYREB 2020

    37. Shu, B., Zhang, H., Li, Y., Qu, Y., & Chen, L. (2014). Spatiotemporal variation analysis
    of driving forces of urban land spatial expansion using logistic regression: A case study of port
    towns in Taicang City, China. Habitat international, 43, 181-190.
    38. Silva, K. M., Souza, B. A., & Brito, N. S. (2006). Fault detection and classification in
    transmission lines based on wavelet transform and ANN. IEEE Transactions on Power Deliv-
    ery, 21(4), 2058-2063.
    39. Tahir, M. A., Kittler, J., Mikolajczyk, K., & Yan, F. (2009, June). A multiple expert ap-
    proach to the class imbalance problem using inverse random under sampling. In International
    workshop on multiple classifier systems (pp. 82-91). Springer, Berlin, Heidelberg.
    40. The State Bank of Vietnam, 2016, Circular No.41/2016/TT-NHNN dated 30/12/2016
    regulates the capital adequacy ratio for foreign banks and branches in Vietnam.
    41. Wei, W., Li, J., Cao, L., Ou, Y., & Chen, J. (2013). Effective detection of sophisticated
    online banking fraud on extremely imbalanced data. World Wide Web, 16(4), 449-475.
    42. Wu, H., Liu, X., You, L., Zhang, L., Zhou, D., Feng, J., … & Yu, J. (2012). Effects of
    salinity on metabolic profiles, gene expressions, and antioxidant enzymes in halophyte Suaeda
    salsa. Journal of Plant Growth Regulation, 31(3), 332-341.
    43. Zhu, B., Baesens, B., & vanden Broucke, S. K. (2017). An empirical comparison of
    techniques for the class imbalance problem in churn prediction. Information sciences, 408,

