Linear regression algorithms are used to predict a continuous target value. An affine linear model makes a prediction by computing a weighted sum of the input features plus a constant called the bias term (or intercept term):
$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta \cdot x = h_\theta(x), \qquad \text{where } \theta := (\theta_0,\dots,\theta_n)^T,\; x := (x_0, x_1,\dots,x_n)^T,\; x_0 = 1.$$
The MSE of hypothesis $h_\theta$ on a training set $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\} \subset \mathbb{R}^n \times \mathbb{R}$ is given by
$$J(\theta) = \mathrm{MSE}(X, h_\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2.$$
We have $\nabla J(\theta) := \left(\frac{\partial J}{\partial \theta_0}(\theta), \dots, \frac{\partial J}{\partial \theta_n}(\theta)\right)^T$ and the partial derivative of $J$ with respect to $\theta_j$ is
$$\frac{\partial J}{\partial \theta_j}(\theta) = \frac{1}{m}\sum_{i=1}^{m} 2\, x^{(i)}_j\left(h_\theta(x^{(i)}) - y^{(i)}\right).$$
In matrix form, $\nabla J(\theta) = \frac{2}{m}\, X^T (X\theta - Y)$.
The Batch gradient descent algorithm is given by
$$\theta^{(0)} \text{ initial guess}, \qquad \theta^{(k+1)} = \theta^{(k)} - \eta\, \nabla J(\theta^{(k)}), \quad k \ge 0.$$
The whole batch $X$ of training data is used at every step.
The loss function is quadratic in $\theta$ (hence convex) and clearly $L$-smooth with
$$L = \frac{2}{m}\,\|X^T X\|_{2\to 2} = \frac{2}{m}\, s_1(X)^2,$$
where $s_1(X)$ is the largest singular value of $X$. Batch gradient descent with learning rate $\eta < 1/L$ converges to $\theta^*$, which minimizes $J$.
Observe also that $J$ is minimized at $\theta^*$ for which $\nabla J(\theta^*) = 0$. Hence
$\theta^*$ is a solution of the normal equation $X^T X\,\theta = X^T Y$, i.e. (when $X^T X$ is invertible)
$$\theta^* = (X^T X)^{-1} X^T Y.$$
## Direct computation using normal equation
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
n,m = 1,200
x = 2* np.random.rand(m, n)
y = 2 + 5 * x + 0.5* np.random.randn(m, n) ## thus theta0= 2 , theta1=5
plt.scatter(x, y)
X_b = np.c_[np.ones((m, 1)), x] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
plt.plot([0,2],[theta_best[0],theta_best[0]+2*theta_best[1]],color='r')
print ('theta_best=',theta_best)
plt.show()
theta_best= [[1.98331783] [4.97826472]]
## gradient descent applied to the function J: theta -> J (theta) ##
def val_f(theta):
    # value tracked during the iterations: (1/m)*||X_b.theta - y|| (the MSE itself would square the norm)
    return (1./m)*np.linalg.norm(X_b.dot(theta) - y)
def grad_f(theta):
    # gradient of the MSE: (2/m) * X^T (X.theta - y)
    return (2./m)*X_b.T.dot(X_b.dot(theta) - y)
x0 = np.zeros((2, 1)) # The algorithm starts at x0=(0,0)
s1 = np.linalg.norm(X_b, ord=2) ## Largest singular value of X_b
L = (2/m)*s1*s1 # L for L-smoothness
rate = 1./L # Learning rate
precision = 0.001 # This tells us when to stop the algorithm
step_size = 1 # step being taken
max_iters = 4000 # maximum number of iterations
iters = 0 #iteration counter
vec_x,vec_f = [],[]
cur_x = x0
while step_size > precision and iters < max_iters:
    vec_x.append(cur_x)                            # keep the iterates for later inspection
    vec_f.append(val_f(cur_x))                     # and the corresponding objective values
    prev_x = cur_x                                 # store the current iterate in prev_x
    cur_x = cur_x - rate * grad_f(prev_x)          # gradient descent step
    step_size = np.linalg.norm(cur_x - prev_x)     # change in the iterate
    iters = iters + 1                              # iteration count
print("The local minimum occurs at", vec_x[-1])
print("The local minimum is equal to", vec_f[-1])
The local minimum occurs at [[1.99344927] [4.96976563]] The local minimum is equal to 0.035080866913785244
## Animation based on gradient descent applied for linear regression ##
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.pyplot import figure
from matplotlib import animation, rc
#%matplotlib inline
#matplotlib notebook
#%matplotlib nbagg
%matplotlib tk
# First set up the figure, the axis, and the plot element we want to animate
fig = plt.figure()
fig.set_dpi(100)
xmin,xmax = 0, 2
fmin,fmax = 1,14
ax = plt.axes(xlim=(xmin, xmax), ylim=(fmin, fmax))
line, = ax.plot([], [], lw=2,color='r')
# initialization function: plot the background of each frame
def init():
    plt.scatter(x, y, color='b')
    line.set_data([], [])
    return line,
# animation function. This is called sequentially
def animate(i):
    a = [0, 2]
    theta = vec_x[i]                               # i-th gradient descent iterate
    b = [theta[0], theta[0] + 2 * theta[1]]        # regression line for this iterate
    line.set_data(a, b)
    return line,
# call the animator. blit=True means only re-draw the parts that have changed.
anim = animation.FuncAnimation(fig,
animate,
init_func=init,
frames=50,
interval=50,
blit=True)
# save the animation as an mp4. This requires ffmpeg or mencoder to be
# installed. The extra_args ensure that the x264 codec is used, so that
# the video can be embedded in html5. You may need to adjust this for
# your system: for more information, see
# http://matplotlib.sourceforge.net/api/animation_api.html
#HTML(anim.to_html5_video())
from IPython.display import HTML
anim.save('GD_converge.mp4', fps=5, extra_args=['-vcodec', 'h264', '-pix_fmt', 'yuv420p'])
plt.show()
## Gradient Descent algorithms
We are interested in minimizing the additive loss function
$$J(\theta) = \sum_{i=1}^{m} J_i(\theta), \qquad J_i(\theta) := \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2.$$
One random instance $(x^{(\omega)}, y^{(\omega)})$ of the training set is picked at every step $k$ and the gradient is computed based only on this instance. This gradient is $\nabla J_\omega(\theta) = \nabla_\theta J_\omega(\theta)$ and is equal to
$$\nabla J_\omega(\theta) = 2\left(h_\theta(x^{(\omega)}) - y^{(\omega)}\right) x^{(\omega)}.$$
The Stochastic Gradient Descent (SGD) algorithm for MSE minimization in linear regression is given by
$$\theta^{(0)} \text{ initial guess}, \qquad \begin{cases} \omega \text{ randomly selected in } \{1,2,\dots,m\} \\ \theta^{(k+1)} = \theta^{(k)} - \eta\, \nabla J_\omega(\theta^{(k)}) \end{cases} \qquad k \ge 0.$$
Only one random instance $(x^{(\omega)}, y^{(\omega)})$ of the training data $X$ is used at every step.
n_epochs = 50
t0, t1 = 5, 50   # learning schedule hyperparameters
def learning_schedule(t):
    return t0 / (t + t1)
theta = np.random.randn(2, 1)   # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]           # one random instance
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)    # gradient based on this instance only
        eta = learning_schedule(epoch * m + i)          # decreasing learning rate
        theta = theta - eta * gradients
print(theta)
[[1.97779774] [4.97149486]]
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(x, y.ravel())
sgd_reg.intercept_, sgd_reg.coef_
(array([1.97771021]), array([4.960105]))
from IPython.display import IFrame
IFrame(src='https://arxiv.org/pdf/1609.04747.pdf', width=700, height=600)
Logistic Regression (also called Logit Regression) algorithms are used for binary classification. The model estimates the probability that an instance belongs to one of two classes ({0,1} or {Y,N} or {+,-} etc) and predicts for the instance the class with the highest probability.
For classes $y$ in $\{0,1\}$, the model estimates a parameter $\theta = (\theta_0,\theta_1,\dots,\theta_n)^T \in \mathbb{R}^{n+1}$ and classifies (i.e. predicts the class) according to
$$\hat{y} = \begin{cases} 0 & \text{if } \sigma(\theta\cdot x) < 0.5 \\ 1 & \text{if } \sigma(\theta\cdot x) \ge 0.5 \end{cases}$$
where $\sigma(t) = \dfrac{1}{1+e^{-t}}$ is the sigmoid function.
The number $\hat{p} := \sigma(\theta\cdot x) \in\, ]0,1[$ is the estimated probability that $x$ belongs to class 1.
Observe:
$$\sigma'(t) = \sigma(t)\left(1 - \sigma(t)\right).$$
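As a quick sanity check (a minimal sketch, not part of the original notes), the identity $\sigma'(t)=\sigma(t)(1-\sigma(t))$ can be verified numerically with a central finite difference:

import numpy as np
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))
# compare the analytic derivative sigma*(1-sigma) with a central finite difference
t = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(t + eps) - sigmoid(t - eps)) / (2 * eps)
analytic = sigmoid(t) * (1 - sigmoid(t))
print(np.max(np.abs(numeric - analytic)))   # of the order of 1e-10: the two expressions agree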
Here $\hat{p} = \sigma(\theta\cdot x)$ and we introduce the misclassification error
$$c(\theta) = \begin{cases} -\log(\hat{p}) & \text{if } y = 1 \\ -\log(1-\hat{p}) & \text{if } y = 0 \end{cases} \qquad \equiv \qquad c(\theta) = -\left[\,y\log(\hat{p}) + (1-y)\log(1-\hat{p})\,\right].$$
This error is intuitive:
- $-\log(\hat{p}) = \log(1/\hat{p})$ grows very large when $\hat{p}$ approaches 0: the loss (or penalty) is large if the model estimates a probability close to 0 for an instance in class 1.
- $-\log(1-\hat{p})$ is close to 0 when $\hat{p}$ approaches 0: the loss (or penalty) is small if the model estimates a probability close to 0 for an instance in class 0. (The same observations hold if $\hat{p}$ approaches 1.)
Given a training set $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\} \subset \mathbb{R}^n \times \{0,1\}$, the loss function is given by
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} c^{(i)}(\theta), \qquad c^{(i)}(\theta) = -\left[\,y^{(i)}\log(\hat{p}^{(i)}) + (1-y^{(i)})\log(1-\hat{p}^{(i)})\,\right].$$
This loss function is called negative log-likelihood / binary cross entropy / log loss [a visual explanation].
Given $c(\theta) = -\left[\,y\log(\hat{p}) + (1-y)\log(1-\hat{p})\,\right]$, observe
$$\frac{\partial c}{\partial \theta_j} = \frac{\partial c}{\partial \hat{p}}\,\frac{\partial \hat{p}}{\partial (\theta\cdot x)}\,\frac{\partial (\theta\cdot x)}{\partial \theta_j} = \frac{\partial c}{\partial \hat{p}}\,\sigma'(\theta\cdot x)\, x_j.$$
We have
$$\frac{\partial c}{\partial \hat{p}} = -\left[\frac{y}{\hat{p}} - \frac{1-y}{1-\hat{p}}\right], \qquad \sigma'(\theta\cdot x) = \sigma(\theta\cdot x)\left(1-\sigma(\theta\cdot x)\right) = \hat{p}\,(1-\hat{p}).$$
Finally
$$\frac{\partial c}{\partial \theta_j} = -\left[(1-\hat{p})\,y - \hat{p}\,(1-y)\right] x_j = \left[\hat{p} - y\right] x_j.$$
The partial derivatives of $J$ are given by
$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial c^{(i)}(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left[\sigma(\theta\cdot x^{(i)}) - y^{(i)}\right] x^{(i)}_j.$$
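To make these formulas concrete, here is a minimal NumPy sketch (not part of the original notes) of the log loss and its gradient; it assumes a design matrix X_b of shape (m, n+1) with a leading column of ones, as earlier, and a column vector y of 0/1 labels:

import numpy as np
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))
def log_loss(theta, X_b, y):
    # J(theta) = -(1/m) sum [ y log(p_hat) + (1-y) log(1-p_hat) ]
    p_hat = sigmoid(X_b.dot(theta))
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
def log_loss_gradient(theta, X_b, y):
    # dJ/dtheta_j = (1/m) sum [ sigma(theta.x) - y ] x_j, i.e. (1/m) X^T (p_hat - y)
    p_hat = sigmoid(X_b.dot(theta))
    return (1.0 / X_b.shape[0]) * X_b.T.dot(p_hat - y)
# one batch gradient descent step with learning rate eta:
# theta = theta - eta * log_loss_gradient(theta, X_b, y)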
Chain rule: the seed of back-propagation algorithms.
from sklearn import datasets
iris = datasets.load_iris()
list(iris.keys())
['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']
iris['feature_names']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
from sklearn.linear_model import LogisticRegression
X = iris["data"][:, (2, 3)] # petal length, petal width
y = (iris["target"] == 2).astype(int)   # 1 if Iris-Virginica, else 0
log_reg = LogisticRegression(C=10**10, random_state=42)
log_reg.fit(X, y)
x0, x1 = np.meshgrid(np.linspace(2.9, 7, 500).reshape(-1, 1),
np.linspace(0.8, 2.7, 200).reshape(-1, 1),)
X_new = np.c_[x0.ravel(), x1.ravel()]
y_proba = log_reg.predict_proba(X_new)
plt.figure(figsize=(10, 4))
plt.plot(X[y==0, 0], X[y==0, 1], "bs")
plt.plot(X[y==1, 0], X[y==1, 1], "g^")
zz = y_proba[:, 1].reshape(x0.shape)
contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)
left_right = np.array([2.9, 7])
boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1]
plt.clabel(contour, inline=1, fontsize=12)
plt.plot(left_right, boundary, "k--", linewidth=3)
plt.text(3.5, 1.5, "Not Iris-Virginica", fontsize=14, color="b", ha="center")
plt.text(6.5, 2.3, "Iris-Virginica", fontsize=14, color="g", ha="center")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.axis([2.9, 7, 0.8, 2.7])
save_fig("logistic_regression_contour_plot")
plt.show()
In logistic regression, the model finds the "best" parameter $\theta$ for which the hyperplane $\theta \cdot x = 0$ (dashed black line above) best divides the training set into the two classes.
Softmax regression generalizes logistic regression in order to support $K$ classes. The model estimates a parameter matrix
$$\Theta = [\Theta_1 | \Theta_2 | \dots | \Theta_K]^T \in \mathbb{R}^{K\times(n+1)}, \qquad \Theta_k = (\Theta_{0,k}, \Theta_{1,k}, \dots, \Theta_{n,k})^T.$$
Given $x \in \mathbb{R}^n$ and $\bar{x} = (1, x^T)^T \in \mathbb{R}^{n+1}$, we compute $z = \Theta\bar{x}$ (i.e. $z_j = \Theta_j \cdot \bar{x}$ for every $j = 1,\dots,K$), then compute the $K$ probabilities
$$\hat{p}_k = \frac{\exp(z_k)}{\sum_{j=1}^{K}\exp(z_j)}, \qquad k = 1,\dots,K.$$
The class predicted for $x$ is the class $k$ with the highest probability $\hat{p}_k$.
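As an illustration (a minimal sketch, not part of the original notes), the probabilities $\hat{p}_k$ can be computed as follows for a parameter matrix Theta of shape (K, n+1) and a single instance x_b = (1, x) including the bias coordinate; subtracting max(z) before exponentiating is a standard stabilization that leaves the probabilities unchanged:

import numpy as np
def softmax_probabilities(Theta, x_b):
    z = Theta.dot(x_b)              # scores z_k = Theta_k . x_b, shape (K,)
    z = z - z.max()                 # numerical stabilization: does not change the probabilities
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()      # p_hat_k = exp(z_k) / sum_j exp(z_j)
# the predicted class is the one with the highest probability:
# y_hat = np.argmax(softmax_probabilities(Theta, x_b))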
Here $\hat{p}_k$ is as above and we introduce the misclassification error for $(x,y)$ by
$$c(\Theta) = -\log(\hat{p}_k) \quad \text{if } y = k.$$
Intuitively, given $(x,y)$ with $y$ in class $k$, the penalty $-\log(\hat{p}_k)$ is large when the estimated probability $\hat{p}_k$ is close to 0 and small when it is close to 1.
Given a training set $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\} \subset \mathbb{R}^n \times \{1,2,\dots,K\}$, the loss function is given by
$$J(\Theta) = \frac{1}{m}\sum_{i=1}^{m} c^{(i)}(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\log\left(\hat{p}^{(i)}_{y^{(i)}}\right).$$
We will view the function $J$ as a function of $(\Theta_1, \Theta_2,\dots,\Theta_K)$.
We let $\theta = \Theta_j \in \mathbb{R}^{n+1}$ for some $j \in \{1,\dots,K\}$: we have the following function dependency diagram
$$\Theta \longrightarrow \left(z^{(i)}_l\right)_{i=1,\dots,m,\ l=1,\dots,K} \longrightarrow J.$$
Using the chain rule with respect to these intermediate variables and $\nabla_\theta z_l = \delta_{j,l}\,\bar{x}$, which follows from $z_l = \Theta_l \cdot \bar{x}$,
$$\nabla_\theta J(\Theta) = \sum_{i=1}^{m}\sum_{l=1}^{K} \frac{\partial J(\Theta)}{\partial z^{(i)}_l}\,\nabla_\theta z^{(i)}_l = \sum_{i=1}^{m} \frac{\partial J(\Theta)}{\partial z^{(i)}_j}\, \bar{x}^{(i)}.$$
In view of the formula for $J$, for $i$ such that $y^{(i)} = k$ holds,
$$\frac{\partial J(\Theta)}{\partial z^{(i)}_j} = -\frac{1}{m}\,\frac{\partial \log(\hat{p}^{(i)}_k)}{\partial z^{(i)}_j}.$$
By differentiation of the usual functions, one has
$$\frac{\partial \log(\hat{p}_k)}{\partial z_j} = \frac{1}{\hat{p}_k}\,\frac{\partial \hat{p}_k}{\partial z_j} = \frac{1}{\hat{p}_k}\begin{cases} \hat{p}_k - (\hat{p}_k)^2 & \text{if } j = k \\ -\hat{p}_k\,\hat{p}_j & \text{if } j \ne k \end{cases} = \begin{cases} 1 - \hat{p}_k & \text{if } j = k \\ -\hat{p}_j & \text{if } j \ne k. \end{cases}$$
The combination of the previous formulas implies that $J$, as a function of $(\Theta_1,\Theta_2,\dots,\Theta_K)$, satisfies
$$\nabla_{\Theta_j} J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\left(\mathbf{1}_{y^{(i)}=j} - \hat{p}^{(i)}_j\right) \bar{x}^{(i)}, \qquad \nabla_{\Theta_j} J(\Theta) \in \mathbb{R}^{n+1}.$$
Finally
$$\nabla_{\Theta} J(\Theta) = \left[\nabla_{\Theta_1} J(\Theta)\,|\,\nabla_{\Theta_2} J(\Theta)\,|\,\dots\,|\,\nabla_{\Theta_K} J(\Theta)\right]^T \in \mathbb{R}^{K\times(n+1)},$$
and gradient descent and its variations can all be implemented.
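In vectorized form, the gradient above is $\frac{1}{m}(P - Y)^T \bar{X}$ with $P$ the matrix of probabilities and $Y$ the one-hot encoded labels. Here is a minimal NumPy sketch (not part of the original notes), assuming a design matrix X_b with a bias column, integer labels y in {0, ..., K-1} and Theta of shape (K, n+1):

import numpy as np
def softmax_gradient(Theta, X_b, y, K):
    m = X_b.shape[0]
    Z = X_b.dot(Theta.T)                                    # scores, shape (m, K)
    Z = Z - Z.max(axis=1, keepdims=True)                    # numerical stabilization
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)    # probabilities p_hat, shape (m, K)
    Y = np.eye(K)[y]                                        # one-hot encoding of the labels, shape (m, K)
    return (1.0 / m) * (P - Y).T.dot(X_b)                   # gradient, shape (K, n+1)
# one batch gradient descent step:
# Theta = Theta - eta * softmax_gradient(Theta, X_b, y, K)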
X = iris["data"][:, (2, 3)] # petal length, petal width, mainly for visualisation
y = iris["target"]
softmax_reg = LogisticRegression(multi_class="multinomial",solver="lbfgs", C=10, random_state=42)
softmax_reg.fit(X, y)
#lbfgs: Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm.
x0, x1 = np.meshgrid( np.linspace(0, 8, 500).reshape(-1, 1),
np.linspace(0, 3.5, 200).reshape(-1, 1),)
X_new = np.c_[x0.ravel(), x1.ravel()]
y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)
zz1 = y_proba[:, 1].reshape(x0.shape)
zz = y_predict.reshape(x0.shape)
plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris-Virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris-Versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris-Setosa")
from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
plt.contourf(x0, x1, zz, cmap=custom_cmap)
contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
save_fig("softmax_regression_contour_plot")
plt.show()
How do we learn the "best" weight vector $(\theta^*_0,\dots,\theta^*_n)$ or matrix $[\Theta^*_1|\dots|\Theta^*_K]^T$ in linear/logistic/softmax regression?
This is the essence of the entire learning process behind deep learning algorithms.
We consider the classification problem for the iris dataset using a neural network with one hidden layer (5 nodes)
and an activation function $\varphi$ (sigmoid, ReLU, etc.):
- Input layer: $x = (x_1, x_2, x_3, x_4)^T$.
- Hidden layer: $h = (h_1, h_2, h_3, h_4, h_5)^T$.
- Output layer: $z = (z_1, z_2, z_3)^T$.

With bias units $x_0 = 1$ and $h_0 = 1$ in the input and hidden layers, we denote
$$\bar{x} = (1, x^T)^T, \qquad \bar{h} = (1, h^T)^T.$$
As for softmax regression, the model minimizes the negative log-likelihood loss function associated with the training set $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\} \subset \mathbb{R}^n \times \{1,2,\dots,K\}$ ($n=4$ and $K=3$ for the iris dataset),
$$J(\tilde{\Theta},\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\log\left(\hat{p}^{(i)}_k\right) \quad (k \text{ is such that } y^{(i)} \text{ belongs to class } k),$$
in order to find the best parameter matrices $\tilde{\Theta}^*$ and $\Theta^*$.
If $n_h$ is the number of nodes in the hidden layer (here $n_h = 5$), we write
$$\tilde{\Theta} = [\tilde{\Theta}_1 | \tilde{\Theta}_2 | \dots | \tilde{\Theta}_{n_h}]^T \in \mathbb{R}^{n_h\times(n+1)}, \qquad \Theta = [\Theta_1 | \Theta_2 | \dots | \Theta_K]^T \in \mathbb{R}^{K\times(n_h+1)}.$$
We will view the function $J$ as a function of $(\tilde{\Theta}_1,\tilde{\Theta}_2,\dots,\tilde{\Theta}_{n_h},\Theta_1,\Theta_2,\dots,\Theta_K)$.
We recall the mapping diagram
$$\bar{x} \Longrightarrow h^{-} = \tilde{\Theta}\,\bar{x} \Longrightarrow h = \varphi(h^{-}) \Longrightarrow z = \Theta\,\bar{h}$$
and the function dependency diagram
$$(\tilde{\Theta},\Theta) \longrightarrow \left(h^{-(i)}_j\right)_{i=1,\dots,m,\ j=1,\dots,n_h} \longrightarrow \left(h^{(i)}_j\right)_{i=1,\dots,m,\ j=1,\dots,n_h} \longrightarrow \left(z^{(i)}_l\right)_{i=1,\dots,m,\ l=1,\dots,K} \longrightarrow J.$$
- For $j=1,\dots,K$, we proceed as in softmax regression (but as if $h$ is the input):
$$\nabla_{\Theta_j} J = \sum_{i=1}^{m}\frac{\partial J}{\partial z^{(i)}_j}\,\bar{h}^{(i)} = -\frac{1}{m}\sum_{i=1}^{m}\left(\mathbf{1}_{y^{(i)}=j} - \hat{p}^{(i)}_j\right)\bar{h}^{(i)}.$$
Observe the dependence on $\tilde{\Theta}$ through the $\bar{h}^{(i)}$, then on $\Theta$ through the $\hat{p}^{(i)}_k$.
- For $j=1,\dots,n_h$ and $\tilde{\theta} = \tilde{\Theta}_j$, we proceed as in softmax regression: the chain rule with the variables $\left(h^{-(i)}_l\right)_{i=1,\dots,m,\ l=1,\dots,n_h}$ and $h^{-} = \tilde{\Theta}\,\bar{x}$ implies
$$\nabla_{\tilde{\theta}} J = \sum_{i=1}^{m}\sum_{l=1}^{n_h}\frac{\partial J}{\partial h^{-(i)}_l}\,\nabla_{\tilde{\theta}} h^{-(i)}_l = \sum_{i=1}^{m}\frac{\partial J}{\partial h^{-(i)}_j}\,\bar{x}^{(i)}.$$
Then $\dfrac{\partial J}{\partial h^{-(i)}_j} = \dfrac{\partial J}{\partial h^{(i)}_j}\,\varphi'\!\left(h^{-(i)}_j\right)$ and
$$\frac{\partial J}{\partial h^{(i)}_j} = \sum_{l=1}^{K}\frac{\partial J}{\partial z^{(i)}_l}\,\frac{\partial z^{(i)}_l}{\partial h^{(i)}_j} = \sum_{l=1}^{K}\frac{\partial J}{\partial z^{(i)}_l}\,\Theta_{j,l}.$$
The gradients $\nabla_{\Theta} J$ and $\nabla_{\tilde{\Theta}} J$ are completely determined if we know (compute)
$$\frac{\partial J}{\partial z^{(i)}_j},\ i=1,\dots,m,\ j=1,\dots,K, \qquad \frac{\partial J}{\partial h^{(i)}_j},\ i=1,\dots,m,\ j=1,\dots,n_h.$$
Gradient descent and its variations can then be implemented:
- Forward pass: given $\tilde{\Theta}$ and $\Theta$, compute $h^{-(i)}, h^{(i)}, z^{(i)}$, $i=1,\dots,m$. Then compute $J(\tilde{\Theta},\Theta)$.
- Backpropagate errors: compute $\nabla_{z^{(i)}} J,\ \nabla_{h^{(i)}} J,\ \nabla_{h^{-(i)}} J$, $i=1,\dots,m$. Then compute $\nabla_{\Theta} J,\ \nabla_{\tilde{\Theta}} J$ (see the NumPy sketch after this list).
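The two passes can be written in a few lines of NumPy. The following is a minimal sketch (not part of the original notes) for this one-hidden-layer network with a sigmoid activation $\varphi$; it assumes a design matrix X_b of shape (m, n+1) with a bias column, integer labels y in {0, ..., K-1}, and parameter matrices Theta_tilde of shape (n_h, n+1) and Theta of shape (K, n_h+1):

import numpy as np
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))
def forward_backward(Theta_tilde, Theta, X_b, y, K):
    m = X_b.shape[0]
    # ---- forward pass ----
    H_minus = X_b.dot(Theta_tilde.T)                        # h^- = Theta_tilde x_bar, shape (m, n_h)
    H = sigmoid(H_minus)                                    # h = phi(h^-)
    H_b = np.c_[np.ones((m, 1)), H]                         # h_bar = (1, h), shape (m, n_h+1)
    Z = H_b.dot(Theta.T)                                    # z = Theta h_bar, shape (m, K)
    Z = Z - Z.max(axis=1, keepdims=True)                    # numerical stabilization
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)    # softmax probabilities, shape (m, K)
    Y = np.eye(K)[y]                                        # one-hot labels, shape (m, K)
    J = -np.mean(np.log(P[np.arange(m), y]))                # negative log-likelihood
    # ---- backpropagate errors ----
    dZ = (P - Y) / m                                        # dJ/dz, shape (m, K)
    grad_Theta = dZ.T.dot(H_b)                              # gradient w.r.t. Theta, shape (K, n_h+1)
    dH = dZ.dot(Theta[:, 1:])                               # dJ/dh (bias column dropped), shape (m, n_h)
    dH_minus = dH * H * (1 - H)                             # dJ/dh^- = dJ/dh * phi'(h^-) for the sigmoid
    grad_Theta_tilde = dH_minus.T.dot(X_b)                  # gradient w.r.t. Theta_tilde, shape (n_h, n+1)
    return J, grad_Theta, grad_Theta_tilde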
The generalization to multiple hidden layers is straightforward; see the free online book.
The same approach applies if regularization terms are added to the loss, in which case one wants to minimize
$$J(\Theta_1,\Theta_2,\dots) + \lambda_1\,\mathrm{tr}(\Theta_1^T\Theta_1) + \lambda_2\,\mathrm{tr}(\Theta_2^T\Theta_2) + \dots$$

## Training a Neural Network
import tensorflow as tf
# TensorFlow 1.x-style graph construction. neuron_layer is a user-defined helper
# (a fully connected layer); X, n_hidden1, n_hidden2 and n_outputs are defined elsewhere.
with tf.name_scope("dnn"):
    hidden1 = neuron_layer(X, n_hidden1, "hidden1", activation="relu")
    hidden2 = neuron_layer(hidden1, n_hidden2, "hidden2", activation="relu")
    logits = neuron_layer(hidden2, n_outputs, "outputs")