{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": " Линейная_регрессия,_градиентный_спуск.ipynb", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "8S0l1cfKEdcQ" }, "source": [ "# **Plan**\n", "\n", "- Implementing the stochastic gradient descent algorithm:\n", "  - loading the data\n", "  - building and training the algorithm\n", "- Comparing our implementation with the one in the Sklearn library.\n", "\n", "- Recommended reading" ] }, { "cell_type": "markdown", "metadata": { "id": "WBIx7HpJwCKa" }, "source": [ "Notebook in Colab: https://drive.google.com/file/d/1aqRvKK0WRq8hn89bm1O__khM5cg1wuW7/view?usp=sharing" ] }, { "cell_type": "markdown", "metadata": { "id": "xvGoXa3CFAeV" }, "source": [ "# Implementing the stochastic gradient descent algorithm" ] }, { "cell_type": "code", "metadata": { "id": "8JSYdc4UJD6s" }, "source": [ "import numpy as np\n", "\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "4oW8ZZWAXN2E" }, "source": [ "## Loading and preparing the data\n", "\n", "### **Problem statement:** \n", "Predict the chance that a university application is accepted (a probability from 0 to 1) from a set of features: exam scores, research experience, and so on." ] }, { "cell_type": "code", "metadata": { "id": "vW1tcon9JTEQ" }, "source": [ "df = pd.read_csv('https://raw.githubusercontent.com/tixonsit/mmdad_materials/master/datasets_14872_228180_Admission_Predict_Ver1.1.csv')\n", "del df['Serial No.']" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "oUQRj1lsJlFR", "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "outputId": "f39b113b-4be6-4892-ac0b-6ee50245296b" }, "source": [ "df.head()" ], "execution_count": null, "outputs": [ { "output_type": 
"execute_result", "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>GRE Score</th><th>TOEFL Score</th><th>University Rating</th><th>SOP</th><th>LOR</th><th>CGPA</th><th>Research</th><th>Chance of Admit</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>337</td><td>118</td><td>4</td><td>4.5</td><td>4.5</td><td>9.65</td><td>1</td><td>0.92</td></tr>\n",
"    <tr><th>1</th><td>324</td><td>107</td><td>4</td><td>4.0</td><td>4.5</td><td>8.87</td><td>1</td><td>0.76</td></tr>\n",
"    <tr><th>2</th><td>316</td><td>104</td><td>3</td><td>3.0</td><td>3.5</td><td>8.00</td><td>1</td><td>0.72</td></tr>\n",
"    <tr><th>3</th><td>322</td><td>110</td><td>3</td><td>3.5</td><td>2.5</td><td>8.67</td><td>1</td><td>0.80</td></tr>\n",
"    <tr><th>4</th><td>314</td><td>103</td><td>2</td><td>2.0</td><td>3.0</td><td>8.21</td><td>0</td><td>0.65</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " GRE Score TOEFL Score University Rating ... CGPA Research Chance of Admit \n", "0 337 118 4 ... 9.65 1 0.92\n", "1 324 107 4 ... 8.87 1 0.76\n", "2 316 104 3 ... 8.00 1 0.72\n", "3 322 110 3 ... 8.67 1 0.80\n", "4 314 103 2 ... 8.21 0 0.65\n", "\n", "[5 rows x 8 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 7 } ] }, { "cell_type": "code", "metadata": { "id": "TG3eV_4dJmZT", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "e16d8189-acfd-41be-c8cb-944f8cd3cb94" }, "source": [ "# not many rows :(\n", "len(df)" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "500" ] }, "metadata": { "tags": [] }, "execution_count": 8 } ] }, { "cell_type": "code", "metadata": { "id": "kw2Nz3P1XgMH", "colab": { "base_uri": "https://localhost:8080/", "height": 747 }, "outputId": "0724c481-bfc6-493b-e7a5-07e8bfed8718" }, "source": [ "import seaborn as sns\n", "\n", "sns.clustermap(df.corr())" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 9 }, { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "code", "metadata": { "id": "tDzaJx_AJoSZ" }, "source": [ "# shuffle\n", "df = df.sample(frac=1).reset_index(drop=True)\n", "# train/test split\n", "df_train = df[:400]\n", "df_test = df[400:]\n", "# mean and standard deviation, estimated on the training part only so the test rows do not leak into the scaling\n", "mean = df_train.mean(axis=0)\n", "std = df_train.std(axis=0)\n", "# scale to zero mean and unit variance\n", "df_train = (df_train - mean)/std\n", "\n", "X_train = df_train.drop(columns=['Chance of Admit ']).values\n", "y_train = df_train['Chance of Admit '].values\n", "df_test = (df_test - mean)/std\n", "X_test = df_test.drop(columns=['Chance of Admit ']).values\n", "y_test = df_test['Chance of Admit '].values" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "qvctLNtwKBzh", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "f9a2acac-6e2f-4072-c09e-e16510b6def6" }, "source": [ "X_train[:5]" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([[-0.21885503, -0.3604156 , -0.09969289, 0.12714383, -1.06326701,\n", " -0.45706705, -1.12702343],\n", " [ 1.55181671, 1.94150887, 1.64930524, 1.13622188, 1.63812275,\n", " 1.59315411, 0.88551841],\n", " [ 1.02061519, 0.46170028, 0.77480617, 0.63168286, -1.06326701,\n", " 0.73338395, 0.88551841],\n", " [-0.21885503, 0.2972771 , 0.77480617, 0.12714383, 0.55756685,\n", " 0.32003291, 0.88551841],\n", " [ 1.19768236, 1.11939299, 0.77480617, 1.13622188, -0.52298906,\n", " 0.98139457, 0.88551841]])" ] }, "metadata": { "tags": [] }, "execution_count": 11 } ] }, { "cell_type": "code", "metadata": { "id": "aKV4jwOhKDpO", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "e186abb7-ee75-461d-c93b-29fbcfaabfe3" }, "source": [ "y_train[:5]" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([-1.28765396, 1.5464034 , 0.62533476, 0.69618619, 0.97959192])" ] }, 
"metadata": { "tags": [] }, "execution_count": 12 } ] }, { "cell_type": "markdown", "metadata": { "id": "S8PPlFzDXSlU" }, "source": [ "## Building the algorithm" ] }, { "cell_type": "code", "metadata": { "id": "yG3D8nw8JC2B", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "bb0e5293-10aa-4336-b5e6-78456631cc00" }, "source": [ "print('number of features:', X_train.shape[1])\n", "# initialize one weight per feature\n", "w = np.ones(X_train.shape[1])\n", "# pick a random feature index along which we would take the partial derivative\n", "ind = np.random.randint(X_train.shape[1])\n", "print('Random index', ind)\n", "# multiply the chosen column by its weight (the result has one entry per training row)\n", "len(np.dot(X_train[:,ind], w[ind]))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "number of features: 7\n", "Random index 5\n" ], "name": "stdout" }, { "output_type": "execute_result", "data": { "text/plain": [ "400" ] }, "metadata": { "tags": [] }, "execution_count": 13 } ] }, { "cell_type": "markdown", "metadata": { "id": "tfobTgP5LHtZ" }, "source": [ "$$MSE = \\frac{1}{n}\\sum_{i = 1}^{n}(y_i - \\hat{y}_i)^2$$\n", "\n", "We implement the error using this formula:" ] }, { "cell_type": "code", "metadata": { "id": "QWlm6k1ULb0X" }, "source": [ "mse = lambda y, y_pred:((y-y_pred)**2).sum()/len(y_pred) " ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "KTe3AIgPQ6Vc" }, "source": [ "# clearing the cell output\n", "from google.colab import output\n", "# coefficient of determination\n", "from sklearn.metrics import r2_score" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "VAfKZM45XyT2" }, "source": [ "Stochastic gradient descent (SGD) is an optimization algorithm that differs from ordinary gradient descent in one respect: at each step the gradient of the objective is computed not as the sum of the gradients over all samples, but as the gradient at a single, randomly chosen sample." ] }, { "cell_type": "markdown", "metadata": { "id": "0QVx-8CgXc_h" }, "source": [ "![dssmall](https://github.com/m9psy/neural_network_habr_guide/blob/master/Part%203/images/stochastic.gif?raw=true)" ] }, { "cell_type": "markdown", "metadata": { "id": "7LMlKFjzMgdi" }, "source": [ "$$w_{t+1} = w_{t} - \\frac{2\\alpha}{n}X_i(\\langle X_i, w_t \\rangle - y_i) $$\n", "\n", "We implement the step using this formula" ] }, { "cell_type": "code", "metadata": { "id": "x1JnXq6a_HRH" }, "source": [ "gradient_step_stah = lambda X, y, w, alpha, ind: w - (alpha* 2.0 / X.shape[0]) * X[ind] * (np.dot(X[ind], w) - y[ind])" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "uz-WNufqNIpJ", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "ed21396e-204c-4038-9f99-66e390a2aa91" }, "source": [ "# pick a random sample (row) index\n", "ind = np.random.randint(X_train.shape[0])\n", "# take one step (w = [1, 1, ...])\n", "gradient_step_stah(X_train, y_train, np.ones(X_train.shape[1]), 1e-4, ind)" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([0.99999378, 0.99999473, 0.99999747, 0.99999958, 0.99999642,\n", " 0.99999523, 0.99999711])" ] }, "metadata": { "tags": [] }, "execution_count": 17 } ] }, { "cell_type": "code", "metadata": { "id": "NAK0BcE6gpnP", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "acf2591d-c4dd-48be-ed08-c50af935a8cb" }, "source": [ "X_train.shape[1]" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "7" ] }, "metadata": { "tags": [] }, "execution_count": 14 } ] }, { "cell_type": "markdown", "metadata": { "id": "6jG4O7DRgStd" }, "source": [ "## Training the algorithm" ] }, { "cell_type": "code", "metadata": { "id": "wYMc6vvtJQVY", "colab": { "base_uri": "https://localhost:8080/", "height": 385 }, "outputId": "f2c98f38-957e-4380-db84-15041b6a9177" }, "source": [ "# stochastic gradient descent\n", "def sgd(X, y, w, alpha = 1e-4, max_it = 10e6):\n", "    # iteration counter\n", "    iter_num = 0\n", "    # train errors\n", "    errors = []\n", "    # test errors\n", "    errors_test = []\n", "    # R^2 on test\n", "    r2 = []\n", "    while (iter_num < max_it):\n", "        # pick a random sample\n", "        ind = np.random.randint(X.shape[0])\n", "        # update the weights with one gradient step\n", "        w = gradient_step_stah(X, y, w, alpha, ind)\n", "        # report progress every 1%\n", "        if iter_num%(int(max_it/100))==0:\n", "            # clear the output\n", "            output.clear()\n", "            print('Done:', int(iter_num/max_it * 100), '%')\n", "            # mse train\n", "            error = mse(y,np.dot(X, w))\n", "            errors.append(error)\n", "            print('Mse train:', error)\n", "            # mse test (X_test and y_test are read from the notebook globals)\n", "            error = mse(y_test,np.dot(X_test, w))\n", "            errors_test.append(error)\n", "            print('Mse test:', error)\n", "            # r2 test\n", "            R = r2_score(y_test,np.dot(X_test, w))\n", "            r2.append(R)\n", "            print('R2:', R)\n", "        iter_num += 1\n", "\n", "    return w, errors, errors_test, r2\n", "\n", "w, mse_train, mse_test, r2 = sgd(X_train, y_train, np.ones(X_train.shape[1]))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Done: 23 %\n", "Mse train: 0.22755895661893744\n", "Mse test: 0.17505321808539098\n", "R2: 0.7868441725625581\n" ], "name": "stdout" }, { "output_type": "error", "ename": "KeyboardInterrupt", "evalue": "ignored", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 35\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mw\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0merrors\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merrors_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mr2\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 36\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 37\u001b[0;31m \u001b[0mw\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmse_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmse_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mr2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msgd\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mones\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_train\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36msgd\u001b[0;34m(X, y, w, alpha, max_it)\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;32mwhile\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0miter_num\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0mmax_it\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;31m# pick a random sample\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0mind\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrandom\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrandint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0;31m# update the weights with one gradient step\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m 
\u001b[0mw\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgradient_step_stah\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mw\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0malpha\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mind\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyboardInterrupt\u001b[0m: " ] } ] }, { "cell_type": "code", "metadata": { "id": "oBcedhprSn3t", "colab": { "base_uri": "https://localhost:8080/", "height": 609 }, "outputId": "7fdf8871-321f-4aa4-8c9f-ce5e178e7ae1" }, "source": [ "from matplotlib.pyplot import figure\n", "\n", "plt.figure(figsize=(15,10))\n", "plt.grid()\n", "\n", "plt.plot(mse_train, label = 'train')\n", "plt.plot(mse_test, label = 'test')\n", "plt.legend()" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 16 }, { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "code", "metadata": { "id": "etXEv-uWVAvm", "colab": { "base_uri": "https://localhost:8080/", "height": 609 }, "outputId": "92efb630-0f21-4fd1-ea4c-88d28512815c" }, "source": [ "from matplotlib.pyplot import figure\n", "\n", "plt.figure(figsize=(15,10))\n", "plt.grid()\n", "\n", "plt.plot(r2, label = 'r2')\n", "plt.legend()" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 17 }, { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "code", "metadata": { "id": "8ACfnz1FRV8E", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "123ca904-521d-4bd7-944f-787a2e8346e3" }, "source": [ "print('weights', w)\n", "print('R^2 = ', r2_score(y_test, np.dot(X_test,w)))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "weights [0.17588524 0.17370687 0.07285169 0.04868733 0.15837534 0.3460636\n", " 0.08444862]\n", "R^2 = 0.8374580182018496\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "xLZZmoD_QQyB", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "185043e1-bf2d-4661-d5be-2fa9a511a85a" }, "source": [ "# collected results\n", "r2_shuffles = []\n", "# check whether the result depends on the initial shuffle\n", "for i in range(20):\n", "    print(f'Iteration {i+1}')\n", "    # shuffle\n", "    df = df.sample(frac=1).reset_index(drop=True)\n", "    # train/test split\n", "    df_train = df[:400]\n", "    df_test = df[400:]\n", "    # mean and standard deviation, estimated on the training part only\n", "    mean = df_train.mean(axis=0)\n", "    std = df_train.std(axis=0)\n", "    # scale to zero mean and unit variance\n", "    df_train = (df_train - mean)/std\n", "    X_train = df_train.drop(columns=['Chance of Admit ']).values\n", "    y_train = df_train['Chance of Admit '].values\n", "    df_test = (df_test - mean)/std\n", "    X_test = df_test.drop(columns=['Chance of Admit ']).values\n", "    y_test = df_test['Chance of Admit '].values\n", "\n", "    w, mse_train, mse_test, r2 = sgd(X_train, y_train, np.ones(X_train.shape[1]))\n", "    score = r2_score(y_test, np.dot(X_test,w))\n", "    print(f'Iteration {i+1} | R^2 = ', score)\n", "    r2_shuffles.append(score)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Done: 36 %\n", "Mse train: 0.17666929498762216\n", "Mse test: 0.3019622514937877\n", "R2: 0.7218339837286702\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "jlFxViMwXAjH" }, 
"source": [ "fig1, ax1 = plt.subplots()\n", "ax1.set_title('R^2 across shuffles')\n", "ax1.boxplot(r2_shuffles)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "GiNTmE6ngdhx" }, "source": [ "Conclusion: the result depends strongly on the shuffle -> the data are not representative -> collect more data or augment" ] }, { "cell_type": "markdown", "metadata": { "id": "tvxt53hHFQZJ" }, "source": [ "# Comparison with Sklearn" ] }, { "cell_type": "markdown", "metadata": { "id": "q8GU1nh3mj7n" }, "source": [ "## Training with sklearn" ] }, { "cell_type": "code", "metadata": { "id": "aIrbw9wGmjRc", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "15673994-49f5-4f7c-c617-6737c89020c1" }, "source": [ "from sklearn.linear_model import SGDRegressor\n", "# initialize and fit\n", "reg = SGDRegressor()\n", "reg.fit(X_train, y_train)" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,\n", " eta0=0.01, fit_intercept=True, l1_ratio=0.15,\n", " learning_rate='invscaling', loss='squared_loss', max_iter=1000,\n", " n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,\n", " shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,\n", " warm_start=False)" ] }, "metadata": { "tags": [] }, "execution_count": 21 } ] }, { "cell_type": "code", "metadata": { "id": "4dBg3GWs0Jq_", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "6d751e4b-df2b-409a-800b-d98dbddcd5a1" }, "source": [ "print('R^2 sgd (sklearn): ', r2_score(y_test, reg.predict(X_test)))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "R^2 sgd (sklearn): 0.8099229508452628\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "i181uCF21Bs4" }, "source": [ "Fine-tuning the hyperparameters" ] }, { "cell_type": "code", "metadata": { "id": "lmS4w6U8nHA7", "colab": { "base_uri": 
"https://localhost:8080/" }, "outputId": "8d927a25-b103-4b57-f491-02cee4c29d56" }, "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "grid = {'penalty': ['l1', 'l2'],\n", "        'alpha': [1e-4, 1e-5, 1e-6, 1e-7]}\n", "\n", "reg = SGDRegressor()\n", "gs = GridSearchCV(reg, grid, cv=5)\n", "\n", "# fit the grid search\n", "gs.fit(X_train, y_train)\n", "gs.best_params_, gs.best_score_" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "({'alpha': 0.0001, 'penalty': 'l2'}, 0.8097119654612956)" ] }, "metadata": { "tags": [] }, "execution_count": 23 } ] }, { "cell_type": "code", "metadata": { "id": "buEMyrmz0n58", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "42529b2f-a940-4e9e-c935-08307e4056d7" }, "source": [ "from sklearn.linear_model import SGDRegressor\n", "# initialize and fit\n", "reg = SGDRegressor(alpha = 1e-05, penalty = 'l2')\n", "reg.fit(X_train, y_train)" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "SGDRegressor(alpha=1e-05, average=False, early_stopping=False, epsilon=0.1,\n", " eta0=0.01, fit_intercept=True, l1_ratio=0.15,\n", " learning_rate='invscaling', loss='squared_loss', max_iter=1000,\n", " n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,\n", " shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,\n", " warm_start=False)" ] }, "metadata": { "tags": [] }, "execution_count": 24 } ] }, { "cell_type": "code", "metadata": { "id": "PWG_X64r0o1V", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "c792db64-b042-4b5e-de24-25af6bfc76f4" }, "source": [ "print('R^2 sgd (sklearn): ', r2_score(y_test, reg.predict(X_test)))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "R^2 sgd (sklearn): 0.8085748782624143\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "s5PlFmWT1Lb3", "colab": { "base_uri": "https://localhost:8080/" }, 
"outputId": "e3a0ca9d-01e9-4668-b068-6dd43ae94d1c" }, "source": [ "fin_score = []\n", "\n", "for i in range(30):\n", "    print(f'Iteration {i+1}')\n", "    # shuffle\n", "    df = df.sample(frac=1).reset_index(drop=True)\n", "    # train/test split\n", "    df_train = df[:400]\n", "    df_test = df[400:]\n", "    # mean and standard deviation, estimated on the training part only\n", "    mean = df_train.mean(axis=0)\n", "    std = df_train.std(axis=0)\n", "    # scale to zero mean and unit variance\n", "    df_train = (df_train - mean)/std\n", "    X_train = df_train.drop(columns=['Chance of Admit ']).values\n", "    y_train = df_train['Chance of Admit '].values\n", "    df_test = (df_test - mean)/std\n", "    X_test = df_test.drop(columns=['Chance of Admit ']).values\n", "    y_test = df_test['Chance of Admit '].values\n", "\n", "    # training\n", "    grid = {'penalty': ['l1', 'l2'],\n", "            'alpha': [1e-4, 1e-5, 1e-6, 1e-7]}\n", "\n", "    reg = SGDRegressor()\n", "    gs = GridSearchCV(reg, grid, cv=5, scoring = 'r2')\n", "\n", "    # fit the grid search\n", "    gs.fit(X_train, y_train)\n", "    print(gs.best_score_)\n", "    fin_score.append(gs.best_score_)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Iteration 1\n", "0.8109177706435201\n", "Iteration 2\n", "0.8198261986327733\n", "Iteration 3\n", "0.8092944066597569\n", "Iteration 4\n", "0.8078472515506094\n", "Iteration 5\n", "0.8119252273501584\n", "Iteration 6\n", "0.7969072937292928\n", "Iteration 7\n", "0.7956026082377654\n", "Iteration 8\n", "0.8036936558068124\n", "Iteration 9\n", "0.8060383820051505\n", "Iteration 10\n", "0.8032334665712701\n", "Iteration 11\n", "0.8063744122411365\n", "Iteration 12\n", "0.814178695781138\n", "Iteration 13\n", "0.8113545721445821\n", "Iteration 14\n", "0.8000055833354516\n", "Iteration 15\n", "0.7936631228843594\n", "Iteration 16\n", "0.8128613722777185\n", "Iteration 17\n", "0.8174970129769239\n", "Iteration 18\n", "0.7978412281418048\n", "Iteration 19\n", "0.8081217313303494\n", "Iteration 20\n", "0.8204904791626919\n", "Iteration 21\n", "0.8159335375063641\n", "Iteration 22\n", 
"0.8187848150067663\n", "Iteration 23\n", "0.8072786430431321\n", "Iteration 24\n", "0.8033467813126485\n", "Iteration 25\n", "0.8249546871718272\n", "Iteration 26\n", "0.8244556550113333\n", "Iteration 27\n", "0.8117539794599405\n", "Iteration 28\n", "0.7904734177162176\n", "Iteration 29\n", "0.8014776460898485\n", "Iteration 30\n", "0.8105552181734911\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "tlUhmuP74t8G", "colab": { "base_uri": "https://localhost:8080/", "height": 417 }, "outputId": "0bcca010-aebe-4481-d985-0ec7de53c3ae" }, "source": [ "fig1, ax1 = plt.subplots()\n", "ax1.set_title('R^2 across shuffles (+ regularization)')\n", "ax1.boxplot(fin_score)" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'boxes': [],\n", " 'caps': [,\n", " ],\n", " 'fliers': [],\n", " 'means': [],\n", " 'medians': [],\n", " 'whiskers': [,\n", " ]}" ] }, "metadata": { "tags": [] }, "execution_count": 27 }, { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "markdown", "metadata": { "id": "WAJIdMiW3fpS" }, "source": [ "# Conclusion\n", "\n", "- Implemented stochastic gradient descent (SGD) for linear regression\n", "\n", "- Compared our SGD implementation with the one in the sklearn library\n" ] }, { "cell_type": "markdown", "metadata": { "id": "NUmxK0kjNwXe" }, "source": [ "# Recommended reading\n", "\n", "\n", "- [Linear regression in detail](https://habr.com/ru/company/ods/blog/322076/)\n", "\n", "\n", "- [Error functions in regression problems](https://alexanderdyakonov.files.wordpress.com/2018/10/book_08_metrics_12_blog1.pdf)\n", "\n", "- Gradient descent:\n", "  - [Part 1](https://habr.com/ru/post/307312/)\n", "  - [Part 2](https://habr.com/ru/post/308604/)\n" ] } ] }