User:Akela1101/Supervised Learning Algorithms

Linear Regression (Approximation)


Let's have a set of m samples $(x^{(i)}, y^{(i)})$, where $x^{(i)}$ is a vector $(x_1, \ldots, x_n)$ and $y^{(i)} \in \mathbb{R}$.

For n = 1 it's a set of points on the XY plane.

We need to find $h(x) = w_0 + w_1 x_1 + \ldots + w_n x_n$ — a linear function that approximates y.

Or simply $h(x) = W^T x$, if we assume $x = (x_0, x_1, \ldots, x_n)$ with $x_0 = 1$ — the bias element.

 

It's done by minimizing the error function (or cost function):

$J(W) = \dfrac{1}{2m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2$

Gradient Descent


An approximate, iterative method of minimizing $J(W)$.

Let $W$ start from some initial value (e.g. zeros) and α be a small positive number (the learning rate). Then, if ε is the precision, while $|\Delta J(W)| > \varepsilon$, repeat:

$w_j := w_j - \alpha\,\dfrac{\partial J(W)}{\partial w_j}$ (simultaneously for all j)

The gradient $\dfrac{\partial J}{\partial w_j}$ can be thought of as the growth angle (slope) of J, so this makes sense.

$\dfrac{\partial J(W)}{\partial w_j} = \dfrac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$, for each $j = 0 \ldots n$

 

For n = 1 it's only two actions in a step:

$w_0 := w_0 - \alpha\,\dfrac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)$

$w_1 := w_1 - \alpha\,\dfrac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)x^{(i)}$
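A minimal NumPy sketch of this n = 1 loop (the learning rate alpha, the stopping threshold eps, and the synthetic data are illustrative assumptions, not values from these notes):

```python
import numpy as np

def gradient_descent_1d(x, y, alpha=0.01, eps=1e-9, max_steps=100000):
    """Fit h(x) = w0 + w1*x by repeating the two updates above until J barely changes."""
    m = len(x)
    w0, w1 = 0.0, 0.0
    prev_cost = float("inf")
    for _ in range(max_steps):
        h = w0 + w1 * x                          # predictions for all m samples
        cost = np.sum((h - y) ** 2) / (2 * m)    # J(W)
        if abs(prev_cost - cost) < eps:          # change in J is below the precision: stop
            break
        prev_cost = cost
        grad0 = np.sum(h - y) / m                # dJ/dw0
        grad1 = np.sum((h - y) * x) / m          # dJ/dw1
        w0, w1 = w0 - alpha * grad0, w1 - alpha * grad1   # simultaneous update
    return w0, w1

# Noisy points around y = 2x + 1; the fit should recover roughly w0 = 1, w1 = 2.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 0.5, size=100)
print(gradient_descent_1d(x, y))
```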

Logistic Regression (Classification)


As usual, a set of m samples $(x^{(i)}, y^{(i)})$, where $x^{(i)}$ is a vector $(x_1, \ldots, x_n)$ and $y^{(i)} \in \{0, 1\}$.

We need to classify whether y is 0 or 1.

 

Logistic function: $g(z) = \dfrac{1}{1 + e^{-z}}$

Let $h(x) = g(W^T x)$, where $x_0 = 1$ as before and $0 < h(x) < 1$.

If h < 0.5, prediction is 0, and if h ≥ 0.5, prediction is 1.

 

The cost function

$J(W) = -\dfrac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h(x^{(i)})\right)\right]$

is called cross-entropy.

 

Gradient descent looks the same as in linear regression:

$w_j := w_j - \alpha\,\dfrac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
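A small sketch of the same loop for logistic regression, assuming X already contains the bias column $x_0 = 1$ (the data and hyperparameters below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.5, steps=5000):
    """Gradient descent on the cross-entropy cost; X is m x (n+1) with x0 = 1."""
    W = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(steps):
        h = sigmoid(X @ W)                # h(x) = g(W^T x) for every sample
        W -= alpha * X.T @ (h - y) / m    # same gradient form as in linear regression
    return W

def predict(X, W):
    return (sigmoid(X @ W) >= 0.5).astype(int)   # threshold at 0.5

# Tiny separable example: class 1 roughly when x1 + x2 > 1.
X = np.array([[1, 0.1, 0.2], [1, 0.4, 0.3], [1, 0.9, 0.8], [1, 0.7, 0.9]])
y = np.array([0, 0, 1, 1])
print(predict(X, logistic_regression(X, y)))   # expected [0 0 1 1]
```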

Regularization


A technique that helps to avoid overfitting. An extra term is added to the cost function:

$J_{reg}(W) = J(W) + \dfrac{\lambda}{2m}\sum_{j=1}^{n} w_j^2$

The additional term with λ shrinks the coefficients in W, making them less precise for the initial data, but more adequate for new data.
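A sketch of how the λ term might be added in code, using the linear-regression cost from above (the name lam and the choice to leave w0 unpenalized are conventional assumptions):

```python
import numpy as np

def cost_regularized(W, X, y, lam):
    """Ordinary cost J(W) plus the lambda term; the bias weight w0 is not penalized."""
    m = len(y)
    err = X @ W - y
    return np.sum(err ** 2) / (2 * m) + lam * np.sum(W[1:] ** 2) / (2 * m)

def gradient_regularized(W, X, y, lam):
    """Gradient of the regularized cost: the extra term shrinks every w_j except w0."""
    m = len(y)
    grad = X.T @ (X @ W - y) / m
    grad[1:] += lam * W[1:] / m
    return grad
```

Larger λ means smaller weights and a smoother fit; λ = 0 gives back the original cost.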

Multi-value classification (One vs All)


Logistic Regression can also be used when y can take more than 2 values: $y \in \{1, 2, \ldots, r\}$.

 

Let $y_k = 1$ if $y = k$ and $y_k = 0$ otherwise, for $k = 1 \ldots r$. I.e. for each k divide all samples into 2 sets: one with y = k and one with all the others.

So we need to build r functions $h_k(x)$ in the same manner as $h(x)$.

 

Classification is as simple as finding the k for which $h_k(x)$ is maximum.
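A sketch of one-vs-all on top of any binary trainer, e.g. the logistic_regression function sketched earlier (the train_binary parameter and the dict-of-weights layout are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all(X, y, classes, train_binary):
    """Train one weight vector W_k per class k against 'all the others'."""
    return {k: train_binary(X, (y == k).astype(int)) for k in classes}

def predict_one_vs_all(X, models):
    """For each sample pick the k whose h_k(x) = g(W_k^T x) is maximal."""
    classes = list(models.keys())
    scores = np.column_stack([sigmoid(X @ models[k]) for k in classes])
    return np.array([classes[i] for i in np.argmax(scores, axis=1)])
```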

 

Neural Networks


Actually, what we've got above is somewhat similar to a one-layer neural network, where each $h_k(x)$ is a neuron representation.

$h(x) = g(Wx)$, where W is a weights matrix.

Note that applying g() is not necessary, as it's monotonic and y is the last layer.

A multilayer NN is built similarly, by applying the g() function to each layer.

 

 

Instead of using $h_k$ one by one, W for one layer can be found with a sum of their costs:

$J(W) = -\dfrac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{r}\left[y_k^{(i)}\log h_k(x^{(i)}) + \left(1 - y_k^{(i)}\right)\log\left(1 - h_k(x^{(i)})\right)\right]$

For multiple layers the algorithm is similar:

$h(x) = g\left(W^{(L)}\, g\left(W^{(L-1)} \cdots\, g\left(W^{(1)} x\right)\right)\right)$

where $W^{(1)}, \ldots, W^{(L)}$ is the set of weight matrices for each layer, and $x$ — the input parameters.
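A sketch of the forward pass this describes, with g applied at every layer (the layer sizes and random weights are arbitrary; bias terms are omitted to keep it short):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """h(x) = g(W_L ... g(W_1 x)): apply each weight matrix, then g(), layer by layer."""
    a = x
    for W in weights:
        a = sigmoid(W @ a)
    return a

# Two layers: 3 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
print(forward(np.array([0.5, -1.0, 2.0]), weights))
```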

 

Support Vector Machines (Linear Classification)


A technique similar to Logistic Regression that can sometimes be faster and more extensible.

Given the cost terms of logistic regression: $-\log g(W^T x)$ for y = 1 and $-\log\left(1 - g(W^T x)\right)$ for y = 0.

Change them to similar piecewise-linear functions: $\mathrm{cost}_1(z) = a\,\max(0,\, 1 - z)$ and $\mathrm{cost}_0(z) = a\,\max(0,\, 1 + z)$,

where a is some positive number and $z = W^T x$.

 

With this simple approach samples are also divided linearly, but with some margin.

The hypothesis: $h(x) = 1$ if $W^T x \geq 0$, otherwise $h(x) = 0$.
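A sketch of these piecewise-linear costs and the hypothesis. Here the constant C plays the role of a, and the 0.5 * ||W||^2 term is the usual margin penalty; both are assumptions about the exact formulation, which the notes leave implicit:

```python
import numpy as np

def cost1(z):
    """Cost for y = 1: zero once z >= 1, linear to the left of that."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Cost for y = 0: zero once z <= -1, linear to the right of that."""
    return np.maximum(0.0, 1.0 + z)

def svm_cost(W, X, y, C=1.0):
    """C * sum of per-sample costs + 0.5 * ||W||^2 (bias w0 excluded from the penalty)."""
    z = X @ W
    data_term = np.sum(y * cost1(z) + (1 - y) * cost0(z))
    return C * data_term + 0.5 * np.sum(W[1:] ** 2)

def hypothesis(W, x):
    """Predict 1 when W^T x >= 0, else 0."""
    return int(x @ W >= 0)
```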

 

 

Support Vector Machines (Kernels)


Kernels are used to get classification that is more complex than linear.

Gaussian Kernel: $f_i = \exp\left(-\dfrac{\lVert x - x^{(i)}\rVert^2}{2\sigma^2}\right)$, for each training sample $x^{(i)}$.

Thus x is replaced with $f = (f_1, \ldots, f_m)$.

 

$f_i \approx 1$ when $x \approx x^{(i)}$, i.e. close to one of the samples it gives 1.
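A sketch of computing the kernel features f for one point x against all training samples (sigma is a free parameter here):

```python
import numpy as np

def gaussian_features(x, X_train, sigma=1.0):
    """Replace x with f = (f_1, ..., f_m): one similarity per training sample.
    f_i is close to 1 when x is near x_i and close to 0 when it is far away."""
    sq_dist = np.sum((X_train - x) ** 2, axis=1)   # ||x - x_i||^2 for every sample
    return np.exp(-sq_dist / (2 * sigma ** 2))

# Three training samples in 2D; x equals the second one, so f is roughly (small, 1, small).
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
print(gaussian_features(np.array([1.0, 1.0]), X_train))
```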

 

 

Decision Tree


It is another approach to data classification: simpler to describe and visualize, but harder to implement properly.

Again, a set of m samples $(x^{(i)}, y^{(i)})$, where $x^{(i)}$ is a vector $(x_1, \ldots, x_n)$ and $y^{(i)}$ is a class label.

y can take multiple values: $y \in \{1, 2, \ldots, r\}$.

While Logistic Regression splits the space by distance to samples, i.e. with circles, a Decision Tree does the splitting with planes parallel to the axes, i.e. with rectangles.

 

  • Quantization. All continuous parameters should be converted to discrete (categorical) ones. I.e. if $x_j$ can take N values, and N is big, those values should be grouped into a smaller set.

     

  • Choose some parameter $x_j$ and split the dataset based on its values.
  • For each resulting subset:
    • if a single class $y = k$ dominates with probability p, let's say 95%, then stop splitting this subset;
    • else go to the previous step (a sketch of this procedure follows the list).
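A minimal sketch of this recursive splitting, assuming the features are already categorical and taking the choice of splitting parameter (problem 2 below) as a caller-supplied function:

```python
from collections import Counter

def build_tree(samples, labels, features, choose_feature, p=0.95):
    """samples: list of dicts of categorical feature values; labels: the class of each sample.
    choose_feature(samples, labels, features) picks the splitting parameter."""
    counts = Counter(labels)
    top_label, n = counts.most_common(1)[0]
    # Stop splitting when one class dominates with probability >= p,
    # or when there are no features left to split on.
    if n / len(labels) >= p or not features:
        return {"leaf": top_label}
    j = choose_feature(samples, labels, features)
    children = {}
    for value in {s[j] for s in samples}:
        idx = [i for i, s in enumerate(samples) if s[j] == value]
        children[value] = build_tree([samples[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [f for f in features if f != j],
                                     choose_feature, p)
    return {"feature": j, "children": children}
```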

 

There are at least 3 implementation problems here:

  1. How to convert continuous parameters?
  2. How to choose splitting parameter?
  3. When to stop splitting? How to avoid overfitting?

There will be no specific answers to these for the moment.

But in any approach the idea is to select the purest subsets, i.e. subsets where almost all samples have the same y.

  • Gini impurity: $E = 1 - \sum_{k=1}^{r} p_k^2$, where $p_k$ is the fraction of samples with y = k in the subset.

  If some $p_k = 1$, E = 0. If $p_1 = p_2 = 0.5$ (two equally frequent classes), E = 0.5.

  • Information gain: $IG = H(\text{parent}) - \sum_{s}\dfrac{m_s}{m}H(s)$, where entropy $H = -\sum_{k=1}^{r} p_k\log_2 p_k$ and $m_s$ is the size of subset s. The best split is the one that has the most information gain. (Both measures are sketched below.)
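Minimal sketches of both measures, where plain lists of labels stand in for subsets (weighting the subsets by size in the information gain is a standard assumption):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2); 0 for a pure subset, 0.5 for a 50/50 binary one."""
    m = len(labels)
    return 1.0 - sum((n / m) ** 2 for n in Counter(labels).values())

def entropy(labels):
    """H = -sum(p_k * log2(p_k))."""
    m = len(labels)
    return -sum((n / m) * math.log2(n / m) for n in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """H(parent) minus the size-weighted entropy of the subsets after the split."""
    m = len(parent_labels)
    return entropy(parent_labels) - sum(len(s) / m * entropy(s) for s in subsets)

print(gini([0, 1, 0, 1]))                                # 0.5
print(information_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))  # 1.0: a perfect split
```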

Random forest


The idea is to generate multiple randomly initialized trees, and then select the most popular result.

Random initialization can involve:

  • Learning on random subset of samples.
  • Learning on random subset of features.
  • Random selection of the tree splitting parameter (all three are combined in the sketch below).
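A sketch combining these kinds of randomness with majority voting. It borrows scikit-learn's DecisionTreeClassifier for the individual trees, which is an assumption — the notes don't prescribe any particular tree implementation:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, n_trees=10, n_features=None, seed=0):
    """Each tree learns on a random bootstrap of the samples and a random subset of features."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    n_features = n_features or max(1, int(np.sqrt(n)))
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, m, size=m)                     # random subset of samples
        cols = rng.choice(n, size=n_features, replace=False)  # random subset of features
        # splitter="random" also randomizes the choice of splitting parameter.
        tree = DecisionTreeClassifier(splitter="random", random_state=0)
        tree.fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, x):
    """Select the most popular result among the trees."""
    votes = [tree.predict(x[cols].reshape(1, -1))[0] for tree, cols in forest]
    return Counter(votes).most_common(1)[0][0]
```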