Gaussian mixture model
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Although it is an unsupervised learning method, the number of components needs to be specified before fitting.
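A minimal sketch of this, assuming scikit-learn's GaussianMixture and some synthetic two-blob data (illustrative only, not from the original post):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two synthetic Gaussian blobs (illustrative data only).
X = np.vstack([rng.randn(100, 2) + [0, 0],
               rng.randn(100, 2) + [5, 5]])

# n_components must be chosen up front, even though no labels are used.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                    # unsupervised: fit() only sees X
labels = gmm.predict(X)       # hard cluster assignments
probs = gmm.predict_proba(X)  # soft (probabilistic) assignments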
Supervised learning evaluation
Rand score / Rand index: a measure of the similarity between two clusterings, computed by considering all pairs of samples and counting pairs that are assigned to the same or different clusters in the predicted and true clusterings. The Rand index has a value between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of points and 1 indicating that the clusterings are exactly the same. For smaller sample sizes or a larger number of clusters, it is safer to use an adjusted index such as the Adjusted Rand Index (ARI), which measures the similarity of the two assignments, ignoring permutations.
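As a small sketch (assuming sklearn.metrics and made-up label vectors), the chance-adjusted ARI can be computed directly from a predicted and a ground-truth labeling:

from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# ARI ignores permutations of the label values and is symmetric:
# roughly 0 for random labelings, 1 for a perfect match.
print(adjusted_rand_score(labels_true, labels_pred))
print(adjusted_rand_score(labels_pred, labels_true))  # same value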
Homogeneity can trivially reach 1.0: the more clusters you use, the higher the homogeneity, because in the extreme case the number of points in each cluster is reduced to 1, and a single-point cluster is always perfectly homogeneous.
Tip: swapping the predicted and true labels swaps homogeneity and completeness (the homogeneity of one direction is the completeness of the other).
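A short sketch of both points, assuming sklearn.metrics and toy label vectors:

from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

labels_true = [0, 0, 1, 1]
labels_pred = [0, 1, 2, 3]   # extreme over-clustering: one point per cluster

# Homogeneity is trivially 1.0 here, but completeness drops, so the
# V-measure (their harmonic mean) is pulled down.
print(homogeneity_score(labels_true, labels_pred))    # 1.0
print(completeness_score(labels_true, labels_pred))   # 0.5
print(v_measure_score(labels_true, labels_pred))      # ~0.67

# Swapping pred and true swaps the two scores.
print(homogeneity_score(labels_true, labels_pred))
print(completeness_score(labels_pred, labels_true))   # same value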
-- Although homogeneity and completeness alone are not meaningful, as we have seen, their harmonic mean (the V-measure) is very good at capturing the optimal cluster number (3 in that example).

Unsupervised learning evaluation
For the silhouette coefficient, negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

K-Means
The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares. The K-Means algorithm aims to choose centroids that minimise this within-cluster sum-of-squares criterion. However, the K-Means method has some drawbacks: for example, the number of clusters must be chosen in advance, and inertia assumes that clusters are convex and isotropic, which is not always the case.
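A minimal sketch tying the two together, assuming scikit-learn and synthetic three-blob data (everything below is illustrative, not from the original post):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [0, 0],
               rng.randn(50, 2) + [6, 6],
               rng.randn(50, 2) + [0, 6]])

for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia always decreases as k grows; the silhouette score needs no
    # ground-truth labels and typically peaks near the right k (3 here).
    print(k, km.inertia_, silhouette_score(X, km.labels_))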
Hierarchical clustering
Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, and the leaves are the clusters with only one sample.

Novelty & Outlier Detection
One-class SVM is an unsupervised algorithm that learns a decision function for novelty detection: classifying new data as similar or different to the training set. In this case, since it is a type of unsupervised learning, the fit method only takes an array X as input, as there are no class labels.

Ensemble models
The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator. So it is a system-level methodology for reducing the variance of a model. There are two families of ensemble models: averaging methods (such as bagging and random forests) and boosting methods.
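A brief sketch of the two unsupervised fits mentioned above (AgglomerativeClustering as one hierarchical method, OneClassSVM for novelty detection); the data and parameter values are made up for illustration:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(100, 2)            # "normal" training data

# Hierarchical clustering: builds nested clusters by successive merging.
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X_train)
print(agg.labels_[:10])

# One-class SVM: fit() only takes X, since there are no class labels.
oc_svm = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1).fit(X_train)
X_new = np.array([[0.0, 0.0], [4.0, 4.0]])   # inlier-like and outlier-like points
print(oc_svm.predict(X_new))                 # +1 = similar to training data, -1 = novelty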
Bagging method
A Bagging regressor is an ensemble meta-estimator that fits base regressors on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it. This algorithm encompasses several works from the literature:
- When random subsets of the dataset are drawn as random subsets of the samples (without replacement), the algorithm is known as Pasting [R15].
- If samples are drawn with replacement, the method is known as Bagging [R16].
- When random subsets of the dataset are drawn as random subsets of the features, the method is known as Random Subspaces [R17].
- Finally, when base estimators are built on subsets of both samples and features, the method is known as Random Patches [R18].
So bagging keeps the set of features the same but trains each base model on a different bootstrap sample of the data and then averages the results; it is basically bootstrapping followed by averaging. Pasting, in contrast, only takes subsets of the samples without replacement, so each subset is a combination of the total samples (no sample repeated) rather than a resampling. Choosing the features randomly instead suits the situation where you are not confident that all of your chosen features are useful, or where some features may be interconnected with others, so it can be better to select only a few of them at a time and then average the resulting models. Also, please note that in Sklearn the default base estimator for Bagging is a decision tree.
In many cases, bagging methods constitute a very simple way to improve with respect to a single model, without making it necessary to adapt the underlying base algorithm. As they provide a way to reduce overfitting, bagging methods work best with strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods, which usually work best with weak models (e.g., shallow decision trees).
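The four variants above map directly onto options of scikit-learn's BaggingRegressor; the parameter values here are only illustrative, and each estimator would still need a .fit(X, y) call:

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor()   # decision tree: the default base estimator in sklearn

# Bagging: random subsets of samples, drawn with replacement.
bagging = BaggingRegressor(tree, n_estimators=50, max_samples=0.8, bootstrap=True)

# Pasting: random subsets of samples, drawn without replacement.
pasting = BaggingRegressor(tree, n_estimators=50, max_samples=0.8, bootstrap=False)

# Random Subspaces: every sample, but a random subset of features per estimator.
random_subspaces = BaggingRegressor(tree, n_estimators=50, bootstrap=False, max_features=0.5)

# Random Patches: random subsets of both samples and features.
random_patches = BaggingRegressor(tree, n_estimators=50, max_samples=0.8,
                                  max_features=0.5, bootstrap=True, bootstrap_features=True)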
Random forest
In random forests (see the RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the chosen split is no longer the best split among all features; instead, it is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually increases slightly (with respect to the bias of a single non-random tree), but, due to averaging, its variance also decreases, usually by more than enough to compensate for the increase in bias, hence yielding an overall better model.
Bagging has a single parameter, the number of trees. All trees are fully grown binary trees (unpruned), and at each node in the tree one searches over all features to find the feature that best splits the data at that node. Random forests have two parameters: the number of trees and the number of features to consider at each node split.
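A minimal sketch of those two parameters in scikit-learn (values are illustrative; X_train, y_train, and X_test are placeholders):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200,     # number of trees
                            max_features="sqrt")  # features considered at each node split
# rf.fit(X_train, y_train)
# rf.predict(X_test)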