Why dimension reduction
Dimension is the number of features. With too many dimensions, a model has to resolve too many details to avoid overfitting, so most of the time the data is simply not enough. Fewer dimensions also mean less computation time. If the information that distinguishes the different samples is preserved while some features are removed, nothing essential is lost; dimension reduction is therefore about deleting the most unwanted or unimportant features as a whole.

Random projection
Dimensionality reduction techniques generally use linear transformations to determine the intrinsic dimensionality of the manifold and to extract its principal directions. There are various related techniques for this purpose, including principal component analysis, linear discriminant analysis, canonical correlation analysis, the discrete cosine transform, random projection, etc. Random projection is a simple and computationally efficient way to reduce the dimensionality of data by trading a controlled amount of error for faster processing times and smaller model sizes. The dimensions and distribution of random projection matrices are controlled so as to approximately preserve the pairwise distances between any two samples of the dataset. The core idea behind random projection is given by the Johnson–Lindenstrauss lemma, which states that if points in a vector space are of sufficiently high dimension, they may be projected into a suitable lower-dimensional space in a way that approximately preserves the distances between the points.

The mathematical background for random projection: the Johnson–Lindenstrauss lemma
In mathematics, the Johnson–Lindenstrauss lemma is a result named after William B. Johnson and Joram Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.

Weakness: though this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired. In a random projection, it is likely that the more interesting structure within the data will be lost.
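As a rough illustration (not from the original notes), here is a minimal sketch of a Gaussian random projection in scikit-learn; the dataset, dimensions and eps value are made up for the example:

import numpy as np
from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.randn(100, 5000)                               # 100 samples in 5000 dimensions

# JL bound: dimension needed to keep pairwise distances within roughly 30 percent
print(johnson_lindenstrauss_min_dim(n_samples=100, eps=0.3))

rp = GaussianRandomProjection(n_components=1000, random_state=0)
X_low = rp.fit_transform(X)                            # project down to 1000 dimensions

d_high = pairwise_distances(X)
d_low = pairwise_distances(X_low)
mask = ~np.eye(len(X), dtype=bool)
print(np.mean(d_low[mask] / d_high[mask]))             # ratio of distances stays close to 1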
PCA
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest possible variance under the constraint that it is orthogonal to the preceding components. The resulting vectors form an uncorrelated orthogonal basis set; the principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric. PCA is sensitive to the relative scaling of the original variables. PCA can be thought of as fitting an n-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small, and by omitting that axis and its corresponding principal component from our representation of the dataset, we lose only a commensurately small amount of information. To find the axes of the ellipsoid, we first subtract the mean of each variable from the dataset to center the data around the origin. Then we compute the covariance matrix of the data and calculate its eigenvalues and corresponding eigenvectors. We then orthogonalize the set of eigenvectors and normalize each to a unit vector. Once this is done, each of the mutually orthogonal unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. The proportion of variance represented by each eigenvector is the corresponding eigenvalue divided by the sum of all eigenvalues. It is important to note that this procedure is sensitive to the scaling of the data, and there is no consensus on how best to scale the data to obtain optimal results. The data science notes show examples of scaling effects, so normalizing with preprocessing is important!

Weakness: these linear methods can be powerful, but they often miss important non-linear structure in the data.
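A minimal sketch of this in scikit-learn (the correlated toy data is invented for illustration):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(200, 2) @ np.array([[2.0, 0.5], [0.5, 1.0]])   # correlated 2-D data

X_std = StandardScaler().fit_transform(X)    # scaling matters, as noted above
pca = PCA(n_components=2).fit(X_std)

print(pca.components_)                 # eigenvectors of the covariance matrix (the ellipsoid axes)
print(pca.explained_variance_ratio_)   # each eigenvalue divided by the sum of all eigenvalues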
Manifold learning - Isomap
Because of the weaknesses of PCA and random projection, nonlinear methods were developed to learn nonlinear data structure. Manifold learning can be thought of as an attempt to generalize linear frameworks like PCA so that they are sensitive to non-linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications. Isomap seeks a lower-dimensional embedding which maintains geodesic distances between all points.

Model evaluation in general
For a specific question, to determine which model is better: randomly divide the dataset into 1) training, 2) validation and 3) test sets. Cross-validation should be considered if data is limited.

Evaluate a classifier - confusion matrix
Definition: entry i, j in a confusion matrix is the number of observations actually in group i, but predicted to be in group j. Command:
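Presumably the command in question is scikit-learn's confusion_matrix; a minimal sketch with made-up labels:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)      # cm[i, j]: observations actually in class i, predicted as class j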
ROC curve
Sensitivity: correctly predicted positives / total actual positives.
Specificity: correctly predicted negatives / total actual negatives.
The ROC curve plots the true positive rate (tpr, i.e. sensitivity) against the false positive rate (fpr, i.e. 1 - specificity). For a given Y_true and Y_score, the ROC curve is traced out by changing the threshold of the mapping from Y_score to Y_predict (remember that we are working on classification). It begins with a high threshold and ends with a low one, so at first almost every point is predicted as negative; both tpr and fpr are then zero, which is why the ROC curve starts at (0, 0). Similarly, it ends at (1, 1).

Bias and variance trade-off
The basic question in machine learning: given y = f(x) + random error and data (x_i, y_i), estimate a model f_hat(x). Minimizing the expected squared difference between the prediction and a new observation, the error decomposes into Bias^2 + variance of the model itself + irreducible error (the chaotic part that cannot be removed).
Do a thought experiment with a simple model such as linear regression: given N points, fit a straight line. On one hand, the fitted line depends on which training dataset you happen to have in hand, so there is variability in the linear model itself; this is the variance of the model. On the other hand, the systematic deviation of the data from the straight line (an L2-type distance between the regression line and the data that does not go away with more data) is the bias error. According to Wikipedia, bias can be thought of as the error caused by the simplifying assumptions built into the method; e.g., when approximating a non-linear function f(x) using a learning method for linear models, there will be error in the estimates of f(x) due to this assumption. How to measure the bias error and the variance error?
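One practical (hedged) answer: on synthetic data where the true f(x) is known, refit the model on many independently drawn training sets, then compare the average prediction with the truth (bias) and look at the spread of the predictions (variance). A minimal sketch with an invented sine example:

import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 1, 50)
f_true = np.sin(2 * np.pi * x)              # known true function

preds = []
for _ in range(200):                        # many independent training sets
    y = f_true + 0.3 * rng.randn(len(x))    # noisy observations
    coef = np.polyfit(x, y, 1)              # a deliberately simple linear model
    preds.append(np.polyval(coef, x))
preds = np.array(preds)

bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)   # squared bias, averaged over x
variance = np.mean(preds.var(axis=0))                   # spread of the model across datasets
print(bias_sq, variance)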
Be careful! Even if the underlying relation is independent of the training data, and even if it is extremely simple, the method you choose for estimating the true model (the ML method) determines what kind of variance you should expect.

Comments on R squared
Do we need to get more data? If you have, say, 200 points, it is good to try cross-validation with a varying training sample size (this is what sklearn's learning_curve does; a small sketch is given at the end of this section). If the predicted results have already reached a plateau, stop there: more data is not really needed, and the model is good enough for the current data.

Adjusted R squared and R squared
People realized that R squared in linear regression analysis can be misleading: adding more and more factors will ALWAYS increase R squared, whether the factor is useful or not. So which version of R squared will guide us to add only useful variables? Use the adjusted R squared:
adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
where n is the total number of samples and k is the number of factors in your regression model. The mechanism is that adjusted R squared only increases when you add the right variable: the resulting increase in R^2 has to outweigh the penalty from the larger k. For more information, visit: http://www.statisticshowto.com/adjusted-r2/
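The learning-curve check mentioned above, as a minimal sketch (Ridge and the synthetic data are just placeholders):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(200)

sizes, train_scores, val_scores = learning_curve(
    Ridge(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

print(sizes)
print(val_scores.mean(axis=1))   # if this has flattened out, more data will not help much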
Underfitting and overfitting

Ridge regression + Lasso regression
Both ridge and lasso are regularizations of linear regression that reduce the variance of the model.
Ridge
Ridge regression puts a penalty on the L2 norm of the coefficients in order to reduce the variance of the model, based on the belief that smaller coefficients correlate closely with a lower variance of the model itself.
Lasso
The Lasso model puts an L1-norm penalty on the coefficients, based on the same argument as for ridge. However, if there is a group of highly correlated variables, the Lasso tends to select one variable from the group and ignore the others. The alpha is usually selected automatically to minimize the cross-validation error.
Difference between Lasso and Ridge
Lasso achieves both goals by forcing the sum of the absolute values of the regression coefficients to be less than a fixed value, which forces certain coefficients to be set to exactly zero, effectively choosing a simpler model that does not include those coefficients. The idea is similar to ridge regression, in which the sum of the squares of the coefficients is forced to be less than a fixed value; in the case of ridge, however, this only shrinks the size of the coefficients, it does not set any of them to zero. So in general, Lasso is an aggressive method that can select features, while Ridge is more conservative and can only suppress overfitting. If you are confident about the independence of your features, Ridge is safer to use than Lasso.
Used as feature selection
However! Only Lasso can be used as a tool for feature selection. If you try different values of alpha and plot coef_ from the fitted linear regressor, you will see how the coefficients vary as alpha increases and whether some features can be eliminated right away (see the small sketch after the SGD notes below).
Elastic Net
For a group of highly correlated variables, Lasso tends to eliminate all but one of them, essentially at random. Elastic Net combines Ridge and Lasso: the penalty contains both an L1 and an L2 term. So Elastic Net can do feature selection and coefficient shrinkage, and the quadratic penalty term makes the loss function strictly convex, so it has a unique minimum, with none of the random behaviour of Lasso.
Stochastic Gradient Descent - SGDRegressor
The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems Ridge, Lasso, or ElasticNet are recommended. So basically it is still a linear regression model, just fitted with a much faster routine.
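The alpha sweep mentioned under "Used as feature selection", as a minimal sketch (the data and the alpha grid are made up):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(100)   # only two informative features

for alpha in [0.01, 0.1, 0.5, 1.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coef, 3))   # more coefficients hit exactly zero as alpha grows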
However, some practical tips have to be mentioned. Tip 1: standardize the features.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)                  # don't cheat - fit only on training data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)    # apply the same transformation to the test data

For more information, please refer to http://scikit-learn.org/stable/modules/sgd.html

Preprocessing
To generate features from raw data before performing a regression, use preprocessing to generate those features. For example, for polynomial features, use the following command:

pp_pf = preprocessing.PolynomialFeatures(degree=3)

Also, if the input has more than one column, the result will contain interaction features. From the help documentation: generate polynomial and interaction features, i.e. a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

Classification

Logistic regression
It sounds like a regression method, but it is not. Then again, regression is not that far away from classification; to some extent they are almost the same. The threshold is the interesting part of classification, and it also reflects how humans classify things. Logistic regression maps the output into [0, 1] using an exponential (logistic) function; a linear fit of the log-odds appears in the derivation, and with the threshold taken as 0.5 a classification result comes into being. (Some of the comments in the Wikipedia article are also helpful.)
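A minimal sketch of the probability-then-threshold view (toy data invented for the example):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]        # logistic function maps the linear score into [0, 1]
y_pred = (proba >= 0.5).astype(int)       # threshold at 0.5 turns probabilities into classes
print(y_pred[:10], clf.predict(X)[:10])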
Decision tree
Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.

SVM
More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.
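A minimal sketch of fitting both classifiers in scikit-learn (synthetic data and hyperparameters are placeholders):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)   # branches = feature tests, leaves = class labels
svm = SVC(kernel='rbf', C=1.0).fit(X_tr, y_tr)                               # large-margin separating hyperplane (kernelized)
print(tree.score(X_te, y_te), svm.score(X_te, y_te))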
Basic git commands
1. git init: creates a local repo.
2. git clone: if you already have a repo on GitHub, just use git clone with the address of your repo.
3. git add: adds the files to be tracked, putting them into the staging area (after I realized the structure of git, see Figure 1).
4. git remote add origin: connects the remote, named origin (used when you want to link a local repo to a repo on GitHub).
5. git push -u origin master: pushes the commits to origin, onto the master branch (usually git push --all will push all branches to the remote).
6. git checkout -b branchname: creates a new branch AND switches to it (git branch branchname does the same thing without checking out).
7. git rm: removes git tracking for a certain file AND deletes the file.

Data dredging
I understand data dredging as taking more from the data than it actually contains. The idea is also named data fishing, data snooping, equation-fitting and p-hacking (I don't know the latter two). The cause is usually drawing conclusions from non-representative data.

Prediction vs. inference
The difference is that a prediction is usually about something in the future and can be validated easily, while an inference is about something that can perhaps never be verified.

Python: pass by reference or pass by value?
I ran into this problem when I tried to make a list in which each element is a new list containing a certain number of vectors.
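As a hypothetical illustration of the kind of aliasing that bites here (not the original code):

row = [0.0] * 3
grid = [row, row, row]                 # three references to the SAME list, not three copies
grid[0][0] = 1.0
print(grid)                            # every "row" changes

grid = [[0.0] * 3 for _ in range(3)]   # build a fresh list each time instead
grid[0][0] = 1.0
print(grid)                            # only the first row changes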
The problem is solved by using a 3D numpy array and working the way we would in MATLAB. The cause is that Python automatically decides what is passed by reference and what by value; what we learned from MATLAB is essentially pass-by-value, but list.append can behave as pass-by-reference, since it stores a reference to the object rather than a copy.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, metrics, model_selection
%matplotlib inline

np.random.seed(1)
x = np.linspace(0, 6.14, 101)
y = np.sin(x) + 0.3 * np.random.randn(len(x))

N = 10
X = np.zeros((N, N, len(x)))     # X[n] holds the polynomial features x, x^2, ..., x^(n+1)
trscore = np.zeros((N, 1))
tescore = np.zeros((N, 1))
trerr = np.zeros((N, 1))
teerr = np.zeros((N, 1))
aselect = np.add(range(N), 1)

for n in range(N):
    for i in range(n + 1):
        if i == 0:
            X[n, i, :] = x
        else:
            X[n, i, :] = x**(i + 1)

for n_select in aselect:
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        np.transpose(X[n_select - 1])[:, 0:n_select], y, test_size=0.33, random_state=48)
    lrg = linear_model.LinearRegression(n_jobs=-1)
    # print(n_select, np.transpose(X[n_select - 1])[:, 0:n_select])
    lrg.fit(X_train, y_train)
    print('when feature number =', n_select, 'abs(coef)_max =', max(abs(lrg.coef_)))
    print('when feature number =', n_select, 'abs(intercept)_max =', max(abs(lrg.intercept_.ravel())))
    print('-------------------')
    tescore[n_select - 1] = lrg.score(X_test, y_test)
    trscore[n_select - 1] = lrg.score(X_train, y_train)
    trerr[n_select - 1] = metrics.mean_squared_error(lrg.predict(X_train), y_train)   # training MSE
    teerr[n_select - 1] = metrics.mean_squared_error(lrg.predict(X_test), y_test)     # test MSE

fig = plt.figure(figsize=(16, 4))
ax1 = fig.add_subplot(1, 2, 1)
ax1.plot(aselect, tescore, c='blue', lw=3, label='test data')
ax1.plot(aselect, trscore, c='red', lw=3, label='training data')
ax1.legend(loc='best')
ax1.set_xlabel('number of features')
ax1.set_ylabel('R2 score')
ax2 = fig.add_subplot(1, 2, 2)
ax2.plot(aselect, teerr, c='blue', lw=3, label='test data')
ax2.plot(aselect, trerr, c='red', lw=3, label='training data')
ax2.legend(loc='best')
ax2.set_xlabel('number of features')
ax2.set_ylabel('MSE')