Linear Regression (Approximation)
An approximate method of minimizing $J(\bar{w})$.
Let $\bar{w}^0 = 0$ and let $\alpha$ be a small positive number. Then, given a precision $\epsilon$, while $J(\bar{w}) > \epsilon$:

$\bar{w} := \bar{w} - \alpha \nabla J$
The gradient $\nabla J$ points in the direction in which J grows fastest, so stepping against it decreases J.
$w_i := w_i - \alpha\frac{\partial J}{\partial w_i} = w_i - \frac{\alpha}{m}\sum_{j=1}^{m}\bigl(h(\bar{x}^j)-y^j\bigr)x_i^j = w_i - \frac{\alpha}{m}\sum_{j=1}^{m}\bigl(w_0 + w_1 x_1^j + \dots + w_n x_n^j - y^j\bigr)x_i^j$, for each $i \in 0..n$
◃ For n = 1 there are only two updates per step:
$w_0 := w_0 - \frac{\alpha}{m}\sum_{j=1}^{m}\bigl(w_0 + w_1 x_1^j - y^j\bigr)$

$w_1 := w_1 - \frac{\alpha}{m}\sum_{j=1}^{m}\bigl(w_0 + w_1 x_1^j - y^j\bigr)x_1^j$
▹
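For illustration, here is a minimal NumPy sketch of the two-update step above; the toy data, the learning rate `alpha`, the precision `eps` and the iteration cap are illustrative choices, not values from the text.

```python
import numpy as np

# Toy data: m samples of a single feature x1 with an exact linear target y = 1 + 2*x1.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x1
m = len(y)

w0, w1 = 0.0, 0.0          # start from w = 0, as above
alpha, eps = 0.01, 1e-6    # small positive learning rate and precision (assumed values)

def J(w0, w1):
    """Squared-error cost J(w) = 1/(2m) * sum_j (h(x^j) - y^j)^2."""
    return np.sum((w0 + w1 * x1 - y) ** 2) / (2 * m)

for _ in range(100_000):               # cap on iterations, in case eps is never reached
    if J(w0, w1) <= eps:
        break
    err = w0 + w1 * x1 - y             # h(x^j) - y^j for every sample
    w0, w1 = (w0 - alpha * np.sum(err) / m,
              w1 - alpha * np.sum(err * x1) / m)

print(w0, w1)                          # approaches (1, 2)
```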
Logistic Regression (Classification)
As usual, a set of m samples $\{\bar{x}^j \rightarrow y^j\}$ is given, where $\bar{x}^j$ is a vector $\{x_1, x_2, \dots, x_n\}^j$ and $j \in 1..m$.
We need to classify whether y is 0 or 1.
◻

Logistic function: $g(z) = \frac{1}{1+e^{-z}} \in (0,1)$
Let $h(\bar{x}) = g(\bar{w}\cdot\bar{x}) = \frac{1}{1+e^{-\bar{w}\cdot\bar{x}}}$, where $\bar{x} = \{1, x_1, x_2, \dots, x_n\}$ and $\bar{w} = \{w_0, w_1, w_2, \dots, w_n\}$.
If h < 0.5 the prediction is 0, and if h ≥ 0.5 the prediction is 1.
$J(\bar{w}) = -\frac{1}{m}\sum_{j=1}^{m}\bigl(y^j\ln(h(x^j)) + (1-y^j)\ln(1-h(x^j))\bigr) \xrightarrow[w]{} \min$
The per-sample expression $-\bigl(y^j\ln(h(x^j)) + (1-y^j)\ln(1-h(x^j))\bigr)$ is called cross-entropy.
◼
Gradient descent looks the same as in linear regression:
$w_i := w_i - \alpha\frac{\partial J}{\partial w_i} = w_i - \frac{\alpha}{m}\sum_{j=1}^{m}\bigl(h(\bar{x}^j)-y^j\bigr)x_i^j = w_i - \frac{\alpha}{m}\sum_{j=1}^{m}\Bigl(\frac{1}{1+e^{-\bar{w}\cdot\bar{x}^j}}-y^j\Bigr)x_i^j$
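A minimal NumPy sketch of this update, run for a fixed number of steps instead of the precision test; the toy data, learning rate and step count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: a column of ones is prepended so that w[0] plays the role of w0.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])  # shape (m, n+1), illustrative
y = np.array([0, 0, 1, 1])
m = len(y)

w = np.zeros(X.shape[1])
alpha = 0.1                            # assumed learning rate

for _ in range(5000):                  # fixed number of steps
    h = sigmoid(X @ w)                 # h(x^j) for all samples
    grad = X.T @ (h - y) / m           # dJ/dw_i = 1/m * sum_j (h(x^j) - y^j) x_i^j
    w -= alpha * grad

pred = (sigmoid(X @ w) >= 0.5).astype(int)   # threshold at 0.5, as above
print(w, pred)
```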
Regularization

Regularization is a technique that helps to avoid overfitting.
$J(\bar{w}) = \frac{1}{2m}\sum_{j=1}^{m}\bigl(h(x^j)-y^j\bigr)^2 + \frac{\lambda}{2m}\sum_{i=1}^{n}w_i^2 \xrightarrow[w]{} \min$
The additional term with λ shrinks the coefficients in w, making the fit less exact on the training data but better behaved on new data.
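As a sketch, the regularized squared-error cost and its gradient might look as follows in NumPy; excluding w0 from the penalty mirrors the sum over i = 1..n above, and the data, α and λ are illustrative.

```python
import numpy as np

def ridge_cost_and_grad(w, X, y, lam):
    """Regularized squared-error cost and its gradient.

    X has a leading column of ones; the intercept w[0] is left out of the
    penalty, matching the sum over i = 1..n above.
    """
    m = len(y)
    err = X @ w - y
    cost = np.sum(err ** 2) / (2 * m) + lam / (2 * m) * np.sum(w[1:] ** 2)
    grad = X.T @ err / m
    grad[1:] += lam / m * w[1:]
    return cost, grad

# Usage: plain gradient descent with the regularized gradient (illustrative values).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
w, alpha, lam = np.zeros(2), 0.1, 1.0
for _ in range(2000):
    _, g = ridge_cost_and_grad(w, X, y, lam)
    w -= alpha * g
print(w)   # with lam > 0 the slope is shrunk towards zero
```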
Multi-value classification (One vs All)
Logistic Regression can also be used when y takes more than 2 values: $y \in 1..r$.
◻
Let $y_k := (y = k)\ ?\ 1 : 0$ for $k \in 1..r$. I.e. for each k, divide all samples into 2 sets: one with y = k and one with all the others.
So we need to build r functions $h_k(\bar{x})$ in the same manner as $h(\bar{x})$.
$J_k(\bar{w}) = -\frac{1}{m}\sum_{j=1}^{m}\bigl(y_k^j\ln(h_k(x^j)) + (1-y_k^j)\ln(1-h_k(x^j))\bigr) + \frac{\lambda}{2m}\sum_{i=1}^{n}w_{k,i}^2 \xrightarrow[w_k]{} \min$
Classification is then as simple as finding the k for which $h_k(\bar{x})$ is maximal.
◼
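Putting the pieces together, here is a sketch of one-vs-all training and prediction with NumPy; the hyperparameters (α, λ, number of steps) are assumed values and the labels are taken to be 1..r.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, r, alpha=0.1, lam=1.0, steps=3000):
    """Train r regularized logistic regressions, one per class k in 1..r.

    X is (m, n+1) with a leading column of ones; y holds labels 1..r.
    Returns W of shape (r, n+1), one weight vector w_k per class.
    """
    m, n1 = X.shape
    W = np.zeros((r, n1))
    for k in range(1, r + 1):
        yk = (y == k).astype(float)          # y_k := (y == k) ? 1 : 0
        w = np.zeros(n1)
        for _ in range(steps):
            grad = X.T @ (sigmoid(X @ w) - yk) / m
            grad[1:] += lam / m * w[1:]      # regularization term, skipping w_0
            w -= alpha * grad
        W[k - 1] = w
    return W

def predict_one_vs_all(W, X):
    """Pick the k for which h_k(x) is maximal."""
    return np.argmax(sigmoid(X @ W.T), axis=1) + 1
```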
Actually, what we've got above is somewhat similar to a one-layer neural network, where each $h_k(\bar{x})$ represents one neuron.
$\bar{y} = g(W\cdot\bar{x})$, where W is a weights matrix.
Note that applying g() is not strictly necessary here, since g is monotonic and y is the last layer.
A multilayer NN is built similarly, by applying the g() function at each layer:

$\bar{y}^{|L+1|} = g(W^{|L|}\cdot\bar{y}^{|L|})$
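A sketch of this forward pass in NumPy; the layer sizes and random weights are illustrative, and a bias input of 1 is prepended at every layer, mirroring $\bar{x} = \{1, x_1, \dots\}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Propagate an input through the layers: y_{L+1} = g(W_L . y_L).

    `weights` is a list of matrices, one per layer; a bias term of 1 is
    prepended to the activation before every layer.
    """
    y = np.asarray(x, dtype=float)
    for W in weights:
        y = sigmoid(W @ np.concatenate(([1.0], y)))
    return y

# Usage with arbitrary shapes: 3 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 4)), rng.normal(size=(2, 5))]  # (units, prev_units + 1)
print(forward([0.2, -1.0, 0.5], weights))
```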
◻
Instead of using each $J_k(\bar{w})$ one by one, W for one layer can be found by minimizing their sum:
$J(W) = -\frac{1}{m}\sum_{j=1}^{m}\sum_{k=1}^{r}\bigl(y_k^j\ln(h_k(x^j)) + (1-y_k^j)\ln(1-h_k(x^j))\bigr) + \frac{\lambda}{2m}\sum_{k=1}^{r}\sum_{i=1}^{n}w_{k,i}^2 \xrightarrow[W]{} \min$
For multiple layers the algorithm is similar:
$J(\{W^{|L|}\}) = -\frac{1}{m}\sum_{j=1}^{m}\sum_{k=1}^{r^{|q|}}\bigl(y_k^j\ln(h_k^{|q|}(y^{|q-1|})) + (1-y_k^j)\ln(1-h_k^{|q|}(y^{|q-1|}))\bigr) + \frac{\lambda}{2m}\sum_{L=1}^{q}\sum_{k=1}^{r^{|L|}}\sum_{i=1}^{n^{|L|}}(w_{k,i}^{|L|})^2 \xrightarrow[\{W^{|L|}\}]{} \min$
where $\{W^{|L|}\}$ is the set of weight matrices, one per layer, and $y^{|1|} = x^j$ are the input parameters.
◼
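To make the formula concrete, here is a sketch that only evaluates $J(\{W^{|L|}\})$ for given weight matrices: a forward pass per sample, the cross-entropy sum over the output units, and an L2 penalty over all non-bias weights. The shapes and data are illustrative, and no training (e.g. backpropagation) is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Forward pass with a bias of 1 prepended before every layer."""
    y = np.asarray(x, dtype=float)
    for W in weights:
        y = sigmoid(W @ np.concatenate(([1.0], y)))
    return y

def nn_cost(weights, X, Y, lam):
    """J({W}): cross-entropy over the output layer plus an L2 penalty.

    X is (m, n); Y is (m, r) with entries y_k in {0, 1} (one-hot labels).
    The bias column W[:, 0] of every matrix is excluded from the penalty.
    """
    m = len(X)
    cost = 0.0
    for x, y in zip(X, Y):
        h = forward(x, weights)
        cost -= np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    cost /= m
    cost += lam / (2 * m) * sum(np.sum(W[:, 1:] ** 2) for W in weights)
    return cost

# Usage with arbitrary shapes and random weights (illustrative only).
rng = np.random.default_rng(1)
weights = [rng.normal(size=(4, 4)), rng.normal(size=(3, 5))]
X = rng.normal(size=(5, 3))
Y = np.eye(3)[rng.integers(0, 3, size=5)]       # one-hot labels
print(nn_cost(weights, X, Y, lam=1.0))
```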
Support Vector Machines (Linear Classification)
A Support Vector Machine is a technique similar to Logistic Regression that can sometimes be faster and more extensible.
Start from the per-sample cost functions of Logistic Regression:

$cost_1^* = -\ln\Bigl(\frac{1}{1+e^{-\bar{w}\cdot\bar{x}}}\Bigr),\quad cost_0^* = -\ln\Bigl(1 - \frac{1}{1+e^{-\bar{w}\cdot\bar{x}}}\Bigr)$
Replace them with similar piecewise-linear functions:
$cost_1 = (z < 1)\ ?\ a - az : 0,\quad cost_0 = (z > -1)\ ?\ a + az : 0$
where a is some positive number and $z = \bar{w}\cdot\bar{x}$.
◻
With this simple approach the samples are still separated linearly, but with some margin.
The hypothesis: $h(x) = (\bar{w}\cdot\bar{x} < 0)\ ?\ 0 : 1$
$J(\bar{w}) = \frac{1}{m}\sum_{j=1}^{m}\bigl(y^j\,cost_1(\bar{w}\cdot\bar{x}^j) + (1-y^j)\,cost_0(\bar{w}\cdot\bar{x}^j)\bigr) \xrightarrow[w]{} \min$
◼
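A minimal sketch that minimizes this cost by subgradient descent with a = 1; the toy data, learning rate and iteration count are arbitrary choices.

```python
import numpy as np

# Toy linearly separable data; X has a leading column of ones (illustrative).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.5], [1.0, 2.5]])
y = np.array([0, 0, 1, 1])
m = len(y)

a, alpha = 1.0, 0.05          # slope of the linear costs and learning rate (assumed)
w = np.zeros(X.shape[1])

for _ in range(2000):
    z = X @ w
    grad = np.zeros_like(w)
    for j in range(m):
        if y[j] == 1 and z[j] < 1:        # cost_1 active: a - a*z
            grad += -a * X[j]
        elif y[j] == 0 and z[j] > -1:     # cost_0 active: a + a*z
            grad += a * X[j]
    w -= alpha * grad / m

print(w, (X @ w >= 0).astype(int))        # hypothesis: 1 if w.x >= 0, else 0
```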
Support Vector Machines (Kernels)
Kernels are used to obtain decision boundaries more complex than linear ones.
Gaussian Kernel: $K^j(x) = e^{-\frac{||x - x^j||^2}{2\sigma^2}}$, one for each training sample.
Thus x is replaced with $\bar{K} = \{1, K^1(x), \dots, K^m(x)\}$.
◻
$h(x) = (\bar{w}\cdot\bar{K} < 0)\ ?\ 0 : 1$, i.e. close to one of the samples it gives 1.
$J(\bar{w}) = \frac{1}{m}\sum_{j=1}^{m}\bigl(y^j\,cost_1(\bar{w}\cdot\bar{K}^j) + (1-y^j)\,cost_0(\bar{w}\cdot\bar{K}^j)\bigr) \xrightarrow[w]{} \min$
◼
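A sketch of the kernel feature map $\bar{K}$ in NumPy; `sigma` and the landmark points are illustrative, and the resulting vectors can be fed to the same linear training loop as above.

```python
import numpy as np

def gaussian_features(x, landmarks, sigma=1.0):
    """Map x to K_bar = {1, K^1(x), .., K^m(x)} using Gaussian kernels.

    `landmarks` are the training samples x^j; sigma is an assumed bandwidth.
    """
    d2 = np.sum((landmarks - x) ** 2, axis=1)          # ||x - x^j||^2 per sample
    return np.concatenate(([1.0], np.exp(-d2 / (2 * sigma ** 2))))

# Usage: train the linear cost above on K_bar instead of x_bar to get a
# decision boundary that is non-linear in the original space.
landmarks = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])   # illustrative samples
print(gaussian_features(np.array([0.5, 0.5]), landmarks))
```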
Decision Trees

A decision tree is another approach to data classification. It is simpler to describe and visualize, but harder to implement properly.
Again, a set of m samples $\{\bar{x}^j \rightarrow y^j\}$ is given, where $\bar{x}^j$ is a vector $\{x_1, x_2, \dots, x_n\}^j$ and $j \in 1..m$.
y can take multiple values: $y \in 1..r$.
While the kernel approach above splits the space by distance to samples, i.e. with circles, a Decision Tree splits it with planes parallel to the axes, i.e. with rectangles.
◻
Quantization. All continuous parameters should be converted to discrete (categorical) ones. I.e. if $x_i$ can take N values, and N is big, those $x_i$ values should be grouped into a smaller set.
◃ $\{10,10,12,15\},\ \{21,22,25,26\},\ \{34,35,36\},\dots$ ▹
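A small NumPy sketch of such a grouping using equal-width bins; the number of bins is an arbitrary choice (quantiles or clustering would work just as well).

```python
import numpy as np

# Illustrative continuous values of one feature x_i.
xi = np.array([10, 10, 12, 15, 21, 22, 25, 26, 34, 35, 36])

# Group them into a small number of categories by equal-width binning.
edges = np.linspace(xi.min(), xi.max(), num=4)   # 3 bins
categories = np.digitize(xi, edges[1:-1])        # 0, 1 or 2 for each value
print(categories)                                # reproduces the grouping above
```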
Choose some parameter $i \in 1..n$ and split the dataset based on $x_i$.
For each subset $A_i$: if some class u satisfies $y^j = u$ for $j \in A_i$ with probability p, let's say 95%, then stop splitting this subset;
else go back to the previous step.
◼
There are at least 3 implementation problems here:
How to convert continuous parameters?
How to choose splitting parameter?
When to stop splitting? How to avoid overfitting?
There will be no specific answer to these for the moment.
But in any approach the idea is to select the most pure subsets, i.e. $p \rightarrow 100\%$.
Gini impurity: $E = 1 - \sum_{j \in 1..r} p_j^2$
◃ If $p_j = 1$, E = 0. If $p_j = \frac{1}{2},\ j \in 1|2$, E = 0.5. ▹
Information gain: $E = H - H_{subset}$, where entropy $H = -\sum_{j \in 1..r} p_j \log_2 p_j$
The best split is one that has the most information gain.
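A sketch of both impurity measures in NumPy; the subsets are weighted by their size when computing $H_{subset}$, which is an assumption about the intended definition, and the labels are illustrative.

```python
import numpy as np

def gini(y):
    """Gini impurity E = 1 - sum_j p_j^2 over the classes present in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Entropy H = -sum_j p_j log2 p_j."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, subsets):
    """E = H - H_subset, with H_subset the size-weighted entropy after the split."""
    m = len(y)
    h_subset = sum(len(s) / m * entropy(s) for s in subsets)
    return entropy(y) - h_subset

# Usage: compare two candidate splits of the same labels and keep the better one.
y = np.array([1, 1, 1, 2, 2, 2])
split_a = [y[:3], y[3:]]          # pure subsets -> gain 1.0
split_b = [y[::2], y[1::2]]       # mixed subsets -> much smaller gain
print(information_gain(y, split_a), information_gain(y, split_b))
```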
Random Forest

The idea is to generate multiple randomly initialized trees and then select the most popular result (see the sketch after the list below).
Random initialization can concern:
Learning on random subset of samples.
Learning on random subset of features.
Random selection of tree splitting parameter.
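A sketch of this idea covering the first two kinds of randomization, using scikit-learn's DecisionTreeClassifier as a stand-in for the tree-building procedure described above; labels are assumed to be small non-negative integers so that a bincount majority vote works.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, n_trees=25, seed=0):
    """Train trees on random subsets of samples and features, as listed above."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    forest = []
    for _ in range(n_trees):
        rows = rng.choice(m, size=m, replace=True)                 # random subset of samples
        cols = rng.choice(n, size=max(1, n // 2), replace=False)   # random subset of features
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1 << 30)))
        tree.fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    """Select the most popular result among the trees (majority vote)."""
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])
```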