Why am I getting a huge cost in my stochastic gradient descent implementation?

I've run into a problem while implementing stochastic gradient descent: my cost blows up, and I have no idea why.

MSE implementation:

import numpy as np

def mse(x,y,w,b):
    predictions = x @ w
    summed = (np.square(y - predictions - b)).mean(0)
    cost = summed / 2  # half of the mean squared error
    return cost

Gradients:

def grad_w(y,x,w,b,n_samples):
    return -y @ x / n_samples + x.T @ x @ w / n_samples + b * x.mean(0)
def grad_b(y,x,w,b,n_samples):
    return -y.mean(0) + x.mean(0) @ w + b

SGD implementation:

def stochastic_gradient_descent(X,y,w,b,learning_rate=0.01,iterations=500,batch_size =100):
    
    length = len(y)
    cost_history = np.zeros(iterations)
    n_batches = int(length/batch_size)
    
    for it in range(iterations):
        cost =0
        indices = np.random.permutation(length)
        X = X[indices]
        y = y[indices]
        for i in range(0,length,batch_size):
            X_i = X[i:i+batch_size]
            y_i = y[i:i+batch_size]

            w -= learning_rate*grad_w(y_i,X_i,w,b,length)
            b -= learning_rate*grad_b(y_i,X_i,w,b,length)
            
            cost = mse(X_i,y_i,w,b)
        cost_history[it]  = cost
        if cost_history[it] <= 0.0052: break
        
    return w, cost_history[:it]

Random variables:

w_true = np.array([0.2, 0.5,-0.2])
b_true = -1
first_feature = np.random.normal(0,1,1000)
second_feature = np.random.uniform(size=1000)
third_feature = np.random.normal(1,2,1000)
arrays = [first_feature,second_feature,third_feature]
x = np.stack(arrays,axis=1) 
y = x @ w_true + b_true + np.random.normal(0,0.1,1000)
w = np.asarray([0.0,0.0,0.0], dtype='float64')
b = 1.0

After running this:

theta,cost_history = stochastic_gradient_descent(x,y,w,b)

print('Final cost/MSE:  {:0.3f}'.format(cost_history[-1]))

I get:

Final cost/MSE:  3005958172614261248.000

And here is the plot.

Dawid_C 11.11.2020

Answers (2)


Here are a few suggestions:

Your learning rate is too large: changing it to something like 1e-3 should work.
Your update step can be modified slightly, as follows:

def stochastic_gradient_descent(X,y,w,b,learning_rate=0.01,iterations=500,batch_size =100):
    
    length = len(y)
    cost_history = np.zeros(iterations)
    n_batches = int(length/batch_size)
    
    for it in range(iterations):
        cost =0
        indices = np.random.permutation(length)
        X = X[indices]
        y = y[indices]
        for i in range(0,length,batch_size):
            X_i = X[i:i+batch_size]
            y_i = y[i:i+batch_size]

            w -= learning_rate*grad_w(y_i,X_i,w,b,len(X_i)) # the denominator should be the actual batch size
            b -= learning_rate*grad_b(y_i,X_i,w,b,len(X_i))
            
            cost += mse(X_i,y_i,w,b)*len(X_i) # add batch loss
        cost_history[it]  = cost/length # this is a running average of your batch losses, which is statistically more stable
        if cost_history[it] <= 0.0052: break
        
    return w, b, cost_history[:it+1] # include the last recorded epoch
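
To see why the denominator matters, the analytic gradient can be compared against a finite-difference estimate of the cost. The following is only a minimal sketch: `mse` and `grad_w` repeat the definitions from the question, while `numeric_grad_w`, the random batch, and the starting point `w0`, `b0` are hypothetical and exist purely for illustration.

```python
import numpy as np

def mse(x, y, w, b):
    # Half mean squared error, as defined in the question
    return np.square(y - x @ w - b).mean(0) / 2

def grad_w(y, x, w, b, n_samples):
    # Gradient of the cost with respect to w, as defined in the question
    return -y @ x / n_samples + x.T @ x @ w / n_samples + b * x.mean(0)

def numeric_grad_w(x, y, w, b, eps=1e-6):
    # Hypothetical helper: central finite differences of mse for each weight
    g = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (mse(x, y, w + step, b) - mse(x, y, w - step, b)) / (2 * eps)
    return g

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
x_batch = rng.normal(size=(100, 3))           # one batch of 100 samples
y_batch = x_batch @ np.array([0.2, 0.5, -0.2]) - 1.0
w0 = np.array([0.1, -0.3, 0.4])
b0 = 1.0

numeric = numeric_grad_w(x_batch, y_batch, w0, b0)
with_batch_size = grad_w(y_batch, x_batch, w0, b0, len(x_batch))
with_full_length = grad_w(y_batch, x_batch, w0, b0, 1000)

# Compare each analytic variant with the numerical estimate
print(np.allclose(numeric, with_batch_size, atol=1e-6))
print(np.allclose(numeric, with_full_length, atol=1e-6))
```

Dividing by the full data length shrinks the two data terms by a factor of ten for a batch of 100 out of 1000, while the `b * x.mean(0)` term keeps its scale, so the resulting vector is no longer the gradient of the batch cost.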

Final results:

w_true = np.array([0.2, 0.5, -0.2])
b_true = -1
first_feature = np.random.normal(0,1,1000)
second_feature = np.random.uniform(size=1000)
third_feature = np.random.normal(1,2,1000)
arrays = [first_feature,second_feature,third_feature]
x = np.stack(arrays,axis=1) 
y = x @ w_true + b_true + np.random.normal(0,0.1,1000)
w = np.asarray([0.0,0.0,0.0], dtype='float64')
b = 0.0
theta,bias,cost_history = stochastic_gradient_descent(x,y,w,b,learning_rate=1e-3,iterations=3000)

print("Final epoch cost/MSE:  {:0.3f}".format(cost_history[-1]))
print("True final cost/MSE: {:0.3f}".format(mse(x,y,theta,bias)))
print(f"Final coefficients:\n{theta,bias}")
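
As a sanity check on the coefficients SGD reports, linear regression with a bias also has a closed-form least-squares solution. This is a different technique from SGD and is shown only as a reference point. The sketch assumes the same synthetic data setup as above; the fixed seed and the names `x_aug`, `w_ls`, `b_ls`, and `cost_ls` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
w_true = np.array([0.2, 0.5, -0.2])
b_true = -1.0
x = np.stack([
    rng.normal(0, 1, 1000),
    rng.uniform(size=1000),
    rng.normal(1, 2, 1000),
], axis=1)
y = x @ w_true + b_true + rng.normal(0, 0.1, 1000)

# Append a column of ones so the bias is estimated together with the weights
x_aug = np.hstack([x, np.ones((len(x), 1))])
solution, *_ = np.linalg.lstsq(x_aug, y, rcond=None)
w_ls, b_ls = solution[:3], solution[3]

# Half mean squared error, matching the cost convention used in this thread
cost_ls = np.square(y - x @ w_ls - b_ls).mean() / 2
print(w_ls, b_ls, cost_ls)
```

Because the noise has standard deviation 0.1, the half-MSE at the true parameters is about 0.1² / 2 = 0.005 in expectation, so a stopping threshold of 0.0052 sits very close to the best cost any fit can reach on this data.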

TQCH 12.11.2020


Hi @TQCH, and thanks for this. I came up with a different approach to implementing SGD, without the inner loop, and the results were also quite good.

def stochastic_gradient_descent(X,y,w,b,learning_rate=0.35,iterations=3000,batch_size =100):
    
    length = len(y)
    cost_history = np.zeros(iterations)
    n_batches = int(length/batch_size)
    marker = 0
    cost = mse(X,y,w,b)
    print(cost)
    for it in range(iterations):
        cost =0
        indices = np.random.choice(length, batch_size)
        X_i = X[indices]
        y_i = y[indices]

        w -= learning_rate*grad_w(y_i,X_i,w,b,batch_size) # grad_w from the question takes n_samples; pass the batch size
        b -= learning_rate*grad_b(y_i,X_i,w,b,batch_size)
            
        cost = mse(X_i,y_i,w,b)
        cost_history[it]  = cost
        if cost_history[it] <= 0.0075 and cost_history[it] > 0.0071: marker = it
        if cost <= 0.0052: break
    print(f'{w}, {b}')
    return w, cost_history, marker, cost

w = np.asarray([0.0,0.0,0.0], dtype='float64')
b = 1.0
theta,cost_history, marker, cost = stochastic_gradient_descent(x,y,w,b)

print(f'Number of iterations: {marker}')
print('Final cost/MSE:  {:0.3f}'.format(cost))

which gave me these results:

1.9443112664859845
[ 0.19592532  0.31735225 -0.20044424], -0.9059800816290591
Number of iterations: 68
Final cost/MSE:  0.005

But you're right: I missed that I was dividing by the total length of the vector y rather than by the batch size, and I forgot to accumulate the batch loss!

Thanks for this!
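
Two details of the single-loop version may be worth noting. `np.random.choice(length, batch_size)` samples with replacement by default, and the cost of a single batch is a noisy stopping signal, which may be why the second weight above (about 0.317) is still some way from 0.5 when the loop stops. A minimal sketch of one possible variant follows, assuming the same cost convention; `minibatch_sgd`, its parameters, the learning rate of 0.1, and the seeds are hypothetical choices, not the code used above.

```python
import numpy as np

def mse(x, y, w, b):
    # Half mean squared error, as defined in the question
    return np.square(y - x @ w - b).mean(0) / 2

def minibatch_sgd(x, y, w, b, learning_rate=0.1, iterations=3000,
                  batch_size=100, tol=0.0052, seed=0):
    # Hypothetical variant: one random batch per iteration, drawn without
    # replacement, with the stopping test evaluated on the full data set.
    rng = np.random.default_rng(seed)
    w = w.copy()  # avoid mutating the caller's array in place
    for it in range(iterations):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        x_i, y_i = x[idx], y[idx]
        residual = x_i @ w + b - y_i
        # Both gradients are averaged over the actual batch size
        w -= learning_rate * (x_i.T @ residual) / batch_size
        b -= learning_rate * residual.mean()
        if mse(x, y, w, b) <= tol:
            break
    return w, b, it + 1

# Same synthetic setup as in the question, with a fixed seed
rng = np.random.default_rng(1)
w_true = np.array([0.2, 0.5, -0.2])
x = np.stack([rng.normal(0, 1, 1000),
              rng.uniform(size=1000),
              rng.normal(1, 2, 1000)], axis=1)
y = x @ w_true - 1.0 + rng.normal(0, 0.1, 1000)

w_init = np.zeros(3)
w_fit, b_fit, n_iter = minibatch_sgd(x, y, w_init, 1.0)
print(n_iter, w_fit, b_fit, mse(x, y, w_fit, b_fit))
```

Evaluating the full-data cost on every iteration costs an extra pass over the data, so in practice it could be checked every few iterations instead.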

Dawid_C 23.11.2020
