6 Classification
6.1 Summary
6.1.1 Classification and Regression Trees (CART)
classification trees: classify data into two or more discrete categories
impure: a leaf is impure when it contains a mixture of categories among the final leaves
Gini impurity: measures the impurity of a group containing different classes (one minus the sum of the squared class proportions)
the feature whose split gives the lowest impurity goes at the top (root) of the tree
the same criterion is used at each branch to split the nodes further
regression trees: predict a continuous dependent variable
residuals: used to decide where to place the breaks (split points) in the data
sum of squared residuals (SSR): the candidate split with the lowest SSR becomes the root of the tree
overfitting: a tree whose leaves each contain only one observation (e.g. a single pixel value or a single person) has high variance and does not generalize well
underfitting: high bias; the model is oversimplified
solutions for overfitting:
limit how trees grow, e.g. by pruning (removing) leaves from each tree
weakest link pruning with a tree score (the SSR plus a penalty alpha times the number of leaves)
alpha: the value of alpha that gives the lowest SSR on the testing data is chosen as the final value
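To make the tree-building and pruning ideas above concrete, here is a minimal sketch using scikit-learn's DecisionTreeRegressor with cost-complexity (weakest link) pruning. The synthetic data and parameter values are placeholders, and in practice the final alpha is usually chosen by cross-validation rather than a single test split.

```python
# A minimal sketch of CART with cost-complexity ("weakest link") pruning.
# The data are random placeholders, not a real remote-sensing sample.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))           # single predictor
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)   # noisy continuous response

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alphas from the pruning path, then keep the one with the
# lowest SSR on held-out data.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)
scores = {}
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    scores[alpha] = ((tree.predict(X_test) - y_test) ** 2).sum()   # SSR on test data

best_alpha = min(scores, key=scores.get)
print(f"alpha with lowest test SSR: {best_alpha:.4f}")
```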
6.1.2 Random Forests
grow many classification decision trees, which together comprise the forest
two techniques:
bootstrapping (bagging): resampling with replacement; on average roughly two-thirds of the training data appears in each bootstrap sample, and the remaining third is left out of the bag (OOB)
- OOB error: the proportion of OOB samples that are incorrectly classified
random feature selection: only a random subset of the predictor variables is considered at each split
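A small sketch of these two techniques using scikit-learn's RandomForestClassifier on a built-in toy dataset; the parameter values are illustrative, not recommendations.

```python
# Random forest with bagging, OOB error and random feature selection.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=500,      # number of trees in the forest
    bootstrap=True,        # bagging: each tree sees a bootstrap resample
    oob_score=True,        # evaluate on the samples left out of each bag
    max_features="sqrt",   # random feature selection at every split
    random_state=0,
).fit(X, y)

# OOB error = proportion of OOB samples misclassified
print(f"OOB error: {1 - forest.oob_score_:.3f}")
```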
6.1.3 Image Classification
Assign every pixel in the image to one of a set of pre-defined categories (classes)
6.1.3.1 Unsupervised
DBSCAN: density-based clustering controlled by a search radius and a minimum number of points
ISODATA: like k-means, but clusters with too few pixels are discarded as meaningless, close clusters can be merged, and elongated clusters can be split
Cluster Busting: re-clustering the confused or mixed clusters, to compensate for the difficulty of assigning meaning to ISODATA clusters
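A quick sketch of unsupervised clustering of pixel values with scikit-learn. Note that scikit-learn has no ISODATA implementation, so k-means stands in for it here, while DBSCAN illustrates the radius and minimum-points parameters; the pixel array is a random placeholder.

```python
# Unsupervised clustering of (placeholder) pixel values.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)
pixels = rng.random((1000, 3))   # placeholder: 1000 pixels x 3 bands

# k-means used here as a stand-in for ISODATA
kmeans_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(pixels)

# DBSCAN: eps is the search radius, min_samples the minimum points; -1 marks noise
dbscan_labels = DBSCAN(eps=0.2, min_samples=10).fit_predict(pixels)

print(np.unique(kmeans_labels), np.unique(dbscan_labels))
```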
6.1.3.2 Supervised
parametric
- Maximum Likelihood: assigns each pixel to the class with the highest probability, assuming each class's spectral values follow a normal distribution
non-parametric
- Support Vector Machine (SVM): a linear binary classifier
support vectors: the training points lying on (or closest to) the margin boundary
separating hyperplane: the decision boundary in the middle of the margin
maximum margin classifier: chooses the hyperplane that maximizes the margin between the two classes
soft margin: allows some misclassifications to occur so the classifier generalizes better
underlying theory: structural risk minimization
selectable parameters: type of kernel, C (controls the trade-off between a wide margin and misclassification errors, i.e. the softness of the margin), and gamma (controls how far the influence of a single training point reaches)
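A hedged sketch of an SVM with the selectable parameters above (kernel, C, gamma), using scikit-learn's SVC on a toy dataset; the values shown are defaults rather than tuned choices.

```python
# SVM classifier showing the kernel, C and gamma parameters.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

svm = SVC(
    kernel="rbf",   # kernel type: "linear", "poly", "rbf", ...
    C=1.0,          # soft-margin penalty: higher C tolerates fewer misclassifications
    gamma="scale",  # reach of a single training point for the RBF kernel
).fit(X_train, y_train)

print(f"test accuracy: {svm.score(X_test, y_test):.3f}")
print(f"support vectors per class: {svm.n_support_}")
```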
6.2 Application
6.2.1 Selection of Classifier
- Experiment More
We have learned a lot about classification methods and have a basic understanding of their theories. However, in practice it is hard to choose which method to use when there are so many algorithms available, and little of the literature identifies a single optimal classifier, because the optimum algorithm is usually case-specific, depending on the classes mapped, the nature of the training data, and the predictor variables (Maxwell, Warner, and Fang 2018). We should therefore experiment with multiple algorithms to determine which is optimal for the specific classification task (Lawrence and Moran 2015).
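One simple way to run such an experiment is to score several candidate classifiers on the same data with cross-validation, as in this sketch (the built-in wine dataset stands in for real training samples):

```python
# Compare several classifiers on the same data with cross-validation.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "SVM (RBF)": SVC(kernel="rbf"),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:15s} mean CV accuracy = {scores.mean():.3f}")
```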
- Do not just consider Overall Accuracy
It is also emphasized that overall accuracy is not the only thing to consider, particularly when the focus is on mapping rare classes. Rare classes have little impact on overall accuracy, but can be important in determining the usefulness of the classification (Maxwell, Warner, and Fang 2018).
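The point is easy to demonstrate: in the made-up example below, every pixel of a rare wetland class is misclassified, yet overall accuracy is still 0.90; per-class precision and recall (user's and producer's accuracy) expose the problem.

```python
# Overall accuracy can hide a completely missed rare class.
# The labels below are placeholder predictions for illustration only.
from sklearn.metrics import accuracy_score, classification_report

y_true = ["forest"] * 45 + ["water"] * 45 + ["wetland"] * 10   # wetland is the rare class
y_pred = ["forest"] * 45 + ["water"] * 45 + ["forest"] * 10    # every wetland pixel missed

print(f"overall accuracy: {accuracy_score(y_true, y_pred):.2f}")   # still 0.90
print(classification_report(y_true, y_pred, zero_division=0))      # wetland recall is 0.00
```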
6.2.2 What affects the performance of the classification methods?
- Number of training samples and quality of sample data
Huang, Davis, and Townshend (2002) found that training sample size has a large effect on classifier performance, especially for Maximum Likelihood, Decision Trees (DT) and SVM. They further concluded that the required training sample size depends mainly on the algorithm, the number of input variables, and the size and spatial variability of the mapped area. In a broader conclusion, Li et al. (2014) found that, whatever algorithm is used, large and accurate training datasets are preferable (a learning-curve sketch of the sample-size effect follows below).
In terms of data quality, it is not easy to collect a repository of high-quality training data, owing to limited time and access. We should therefore select a method that is less sensitive to data-quality issues such as mislabeled training samples.
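To see the effect of training sample size empirically, scikit-learn's learning_curve can be used, as in this sketch on a toy dataset (the classifier and sample sizes are arbitrary choices for illustration):

```python
# How accuracy changes as the training sample grows.
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="rbf"), X, y,
    train_sizes=[0.1, 0.3, 0.5, 0.7, 1.0],   # fractions of the available training data
    cv=5,
)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:4d} training samples -> mean CV accuracy {score:.3f}")
```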
- The balance of classes
The performance of algorithms is also affected by class imbalance. In random sampling, the probability of selecting a class is proportional to its area, so relatively rare classes will likely comprise a smaller proportion of the training set (Maxwell, Warner, and Fang 2018). In this case, producer's accuracy and user's accuracy become key measures. They also conclude that there are several solutions for balancing the training data (two of which are sketched in code after this list):
- use an equalized stratified random sampling design
- randomly undersample the majority class, or reduce the overall number of samples used in the training.
- produce synthetic examples of the minority class that are similar to the original minority examples in the feature space.
- implement a cost-sensitive method
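Two of these strategies, cost-sensitive class weighting and random undersampling of the majority class, can be sketched with scikit-learn alone; synthetic oversampling such as SMOTE would need the separate imbalanced-learn package. The data here are synthetic placeholders.

```python
# Handling class imbalance: cost-sensitive weights and undersampling.
import numpy as np
from sklearn.svm import SVC
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_major = rng.normal(0, 1, (900, 2))   # common class
y_major = np.zeros(900)
X_minor = rng.normal(3, 1, (100, 2))   # rare class
y_minor = np.ones(100)

# (1) cost-sensitive method: weight errors on the rare class more heavily
weighted_svm = SVC(class_weight="balanced").fit(
    np.vstack([X_major, X_minor]), np.concatenate([y_major, y_minor])
)

# (2) randomly undersample the majority class to match the minority class
X_major_down, y_major_down = resample(
    X_major, y_major, replace=False, n_samples=len(X_minor), random_state=0
)
balanced_X = np.vstack([X_major_down, X_minor])
balanced_y = np.concatenate([y_major_down, y_minor])
```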
- Predefined Parameters
Predefined (user-set) parameters are also an important factor in the performance of classification methods. Default values are usually suggested, but empirical testing is still needed to determine the optimum values and ensure the best performance. Notably, some research avoids predefined parameters altogether in order to reduce the influence of user-set values (Suel et al. 2019).
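A common way to do this empirical testing is a cross-validated grid search over the parameter values, sketched here for an SVM's C and gamma (the grid itself is an arbitrary example):

```python
# Tune C and gamma empirically instead of trusting the defaults.
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1]},
    cv=5,
).fit(X, y)

print(search.best_params_, f"best CV accuracy: {search.best_score_:.3f}")
```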
6.3 Reflection
- Finding a classifier that generalizes well to new data can be a challenging task. The most effective way to find a suitable classifier is to experiment with different approaches and evaluate their performance on a test dataset. It is also essential to keep the risks of overfitting and underfitting in mind during this process. I think that is why researchers continually explore new methods and techniques to improve the accuracy and generalization performance of classifiers.