{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Линейные_методы,_аналитическое_решение.ipynb", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "PNK_0ITKDx_c" }, "source": [ "# **Plan**\n", "\n", "- Linear methods. Definition and problem statement of linear regression. The *Lp* norm\n", "\n", "- Ridge and Lasso regression. *Lp* regularization\n", "\n", "- Recommended reading" ] }, { "cell_type": "markdown", "metadata": { "id": "qCBVKNE-FeP9" }, "source": [ "Link to the notebook in Colab: https://drive.google.com/file/d/1au2UdCsZKCHwYEKvdH95AQig_g4B9izr/view?usp=sharing" ] }, { "cell_type": "markdown", "metadata": { "id": "2paKyvORD_EA" }, "source": [ "# Linear methods. Regression." ] }, { "cell_type": "markdown", "metadata": { "id": "DW6oRt_HKcpE" }, "source": [ "Matrix derivatives:\n", "\n", "http://www.machinelearning.ru/wiki/images/archive/9/93/20170127140036!MO17_seminar3.pdf" ] }, { "cell_type": "markdown", "metadata": { "id": "fRVu5ANBLGGo" }, "source": [ "Linear methods assume that the target variable depends linearly on the features of an object, that is:\n", "$$ y = w_1 x_1 + w_2 x_2 + ... + w_k x_k + b $$,\n", "where $y$ is the target variable (what we want to predict), $x_i$ is the i-th feature of the object $x$, $w_i$ is the weight of the i-th feature, and $b$ is the bias (the intercept).\n", "\n", "It is common to assume that the object $x$ contains a dummy feature that always equals 1; the bias is then simply the weight of that feature. 
In this case the formula takes a simple form:\n", "$$ y = \\langle x, w \\rangle $$,\n", "where $\\langle \\cdot, \\cdot \\rangle$ denotes the dot product of vectors.\n", "\n", "In matrix form, when we have n objects, the formula can be rewritten as follows:\n", "$$ y = Xw $$,\n", "where y is a vector of size n, X is the object-feature matrix of size $n \\times k$, and w is the weight vector of size k.\n", "\n", "The least-squares solution is \n", "$$ w = (X^TX)^{-1}X^Ty $$" ] }, { "cell_type": "markdown", "metadata": { "id": "xx_zFZBo6iKb" }, "source": [ "**Definition (Lp norm):**\n", "\n", "$$\n", " \\|\\cdot\\|_{p}: \\mathbb{R}^{d} \\to \\mathbb{R}\\\\\n", " \\forall p \\geq 1: \\forall x \\in \\mathbb{R}^{d}: \\|x\\|_{p} = \\sqrt[p]{\\sum_{i=1}^{d} |x_{i}|^{p}}\n", "$$\n", "\n", "**Derivation of the least-squares solution**\n", "\n", "Recall what the optimization problem looks like:\n", "\n", "$$\n", " \\frac{1}{n} \\sum_{i=1}^{n} (y_i - \\langle x_i, w \\rangle)^2 \\to \\min\\limits_{w}\n", "$$\n", "\n", "This optimization problem admits a more convenient form:\n", "\n", "$$\n", " \\frac{1}{n} \\| Xw - y \\|_{2}^{2} \\to \\min\\limits_{w}\n", "$$\n", "\n", "We claim that:\n", "\n", "$$\n", " \\frac{1}{n} \\| Xw - y \\|_{2}^{2} = \\frac{1}{n} (Xw - y)^{\\top} (Xw - y)\n", "$$\n", "\n", "(because $\\| x \\|_{2}^{2} = \\langle x, x\\rangle$)\n", "\n", "Let us expand this expression:\n", "\n", "$$\n", "\\begin{align*}\n", " & (Xw - y)^{\\top} (Xw - y) =\\\\\n", " &= (w^{\\top} X^{\\top} - y^{\\top}) (Xw - y) =\\\\\n", " &= (w^{\\top} X^{\\top} X w - w^{\\top} X^{\\top} y) - (y^{\\top} X w - y^{\\top} y) =\\\\\n", " &= w^{\\top} X^{\\top} X w - 2 y^{\\top} X w + y^{\\top} y\n", "\\end{align*}\n", "$$\n", "\n", "The last step uses $w^{\\top} X^{\\top} y = y^{\\top} X w$, since a scalar equals its own transpose.\n", "\n", "Let us find the gradient of this function, that is, all partial derivatives with respect to the weights (i.e. 
with respect to $w_1, \\ldots, w_d$).\n", "\n", "$$\n", "\\begin{align*}\n", " &\\frac{\\partial}{\\partial w} (w^{\\top} X^{\\top} X w - 2 y^{\\top} X w + y^{\\top} y) =\\\\\n", " &= 2 X^{\\top} X w - 2 X^{\\top} y = 0\n", "\\end{align*}\n", "$$\n", "\n", "From this we obtain the final answer (assuming $X^{\\top} X$ is invertible):\n", "\n", "$$\n", "\\begin{align*}\n", " X^{\\top} X w &= X^{\\top} y\\\\\n", " w &= (X^{\\top} X)^{-1} X^{\\top} y\n", "\\end{align*}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "id": "xOt0ORJluEhQ" }, "source": [ "A useful article on solving linear regression: https://habr.com/ru/post/474602/" ] }, { "cell_type": "code", "metadata": { "id": "e9dsBCqULS3F" }, "source": [ "import numpy as np\n", "import pandas as pd\n", "from matplotlib import pyplot as plt\n", "%matplotlib inline" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Ss3Xirk33T5x" }, "source": [ "# Generate points on the line y = 10x - 7 and add Gaussian noise\n", "X = np.linspace(-5, 5, 100)\n", "y = 10 * X - 7\n", "\n", "\n", "X_train = X[0::2].reshape(-1, 1)\n", "y_train = y[0::2] + np.random.randn(50) * 10\n", "\n", "X_test = X[1::2].reshape(-1, 1)\n", "y_test = y[1::2] + np.random.randn(50) * 10" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "MTCJ6Jc2LP3h", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "3297fb6e-dfc1-4ae3-8ea0-e04ebe6f5f8e" }, "source": [ "X_train[1], y_train[1]" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(array([-4.7979798]), -45.68415850975976)" ] }, "metadata": { "tags": [] }, "execution_count": 3 } ] }, { "cell_type": "code", "metadata": { "id": "RV_yTmiAMcZc", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "b1bd2b17-9b33-4d3a-e3a5-83999cde21b2" }, "source": [ "def fit(X, y):\n", " n, k = X.shape\n", " X = np.hstack((X, np.ones((n, 1))))\n", " w = np.linalg.inv(X.T @ X) @ X.T @ y\n", " return w\n", "\n", "def predict(X, w):\n", " n, k = X.shape\n", 
" X = np.hstack((X, np.ones((n, 1))))\n", " y_pred = X @ w\n", " return y_pred\n", "\n", "weights = fit(X_train, y_train)\n", "weights" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([ 9.60296716, -8.10179328])" ] }, "metadata": { "tags": [] }, "execution_count": 4 } ] }, { "cell_type": "code", "metadata": { "id": "HDM9jvhOyFp8" }, "source": [ "y_hat = predict(X_test, weights)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "lSr7D-CsyRXY", "colab": { "base_uri": "https://localhost:8080/", "height": 418 }, "outputId": "a3e338e6-6843-4a25-d9cd-04e1d550db0b" }, "source": [ "plt.hist((y_test - y_hat)**2, bins = 20)" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(array([27., 7., 3., 2., 3., 0., 0., 1., 2., 0., 1., 1., 1.,\n", " 1., 0., 0., 0., 0., 0., 1.]),\n", " array([3.92997287e-02, 3.71952408e+01, 7.43511818e+01, 1.11507123e+02,\n", " 1.48663064e+02, 1.85819005e+02, 2.22974946e+02, 2.60130887e+02,\n", " 2.97286828e+02, 3.34442769e+02, 3.71598710e+02, 4.08754651e+02,\n", " 4.45910592e+02, 4.83066533e+02, 5.20222474e+02, 5.57378415e+02,\n", " 5.94534356e+02, 6.31690297e+02, 6.68846238e+02, 7.06002179e+02,\n", " 7.43158120e+02]),\n", " )" ] }, "metadata": { "tags": [] }, "execution_count": 6 }, { "output_type": "display_data", "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXAAAAD4CAYAAAD1jb0+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAANZUlEQVR4nO3db4xldX3H8fenLGgLhD8y2WyAOGAIDQ/ahUxQgyFWqgFsRBPSsGnsPtCsaSWR1KRZbdLaZ7Sp2jZpsGuh8kCpVqAQ0CpFEmPTrJ3FFRa2FLRrhCzsWKOQPmnBbx/cMzAdZpjZuXdm79e+X8nNPed3zp3zyZzdz5577jl3U1VIkvr5hRMdQJK0MRa4JDVlgUtSUxa4JDVlgUtSU9u2cmPnnHNOzc7ObuUmJam9AwcO/KiqZpaPb2mBz87OMj8/v5WblKT2kvxgpXFPoUhSUxa4JDVlgUtSUxa4JDVlgUtSUxa4JDVlgUtSUxa4JDVlgUtSU1t6J+Y4Zvfev+HXHrn53RNMIknTwSNwSWrKApekpixwSWrKApekpixwSWrKApekpixwSWrKApekpixwSWrKApekpixwSWpqzQJPcn6Sh5I8nuSxJB8Zxj+R5JkkB4fHtZsfV5K0aD1fZvUi8NGqejjJ6cCBJA8Myz5dVX+2efEkSatZs8Cr6ihwdJh+Iclh4NzNDiZJem3HdQ48ySxwKbB/GLoxySNJbkty1iqv2ZNkPsn8wsLCWGElSa9Yd4EnOQ24E7ipqp4HbgHeBOxkdIT+yZVeV1X7qmququZmZmYmEFmSBOss8CQnMyrvz1fVXQBV9VxVvVRVPwM+C1y+eTElScut5yqUALcCh6vqU0vGdyxZ7X3AocnHkyStZj1XoVwBvB94NMnBYezjwK4kO4ECjgAf2pSEkqQVrecqlG8BWWHRVyYfR5K0Xt6JKUlNWeCS1JQFLklNWeCS1JQFLklNWeCS1JQFLklNWeCS1JQFLklNWeCS1JQFLklNWeCS1JQFLklNWeCS1JQFLklNWeCS1JQFLklNWeCS1JQFLklNWeCS1JQFLklNWeCS1JQFLklNWeCS1JQFLklNWeCS1JQFLklNWeCS1JQFLklNrVngSc5P8lCSx5M8luQjw/jZSR5I8uTwfNbmx5UkLVrPEfiLwEer6hLgLcCHk1wC7AUerKqLgAeHeUnSFlmzwKvqaFU9PEy/ABwGzgWuA24fVrsdeO9mhZQkvdpxnQNPMgtcCuwHtlfV0WHRs8D2VV6zJ8l8kvmFhYUxokqSllp3gSc5DbgTuKmqnl+6rKoKqJVeV1X7qmququZmZmbGCitJesW6CjzJyYzK+/NVddcw/FySHcPyHcCxzYkoSVrJeq5CCXArcLiqPrVk0b3A7mF6N3DP5ONJklazbR3rXAG8H3g0ycFh7OPAzcCXknwA+AHwm5sTUZK0kjULvKq+BWSVxVdNNo4kab28E1OSmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJampNQs8yW1JjiU5tGTsE0meSXJweFy7uTElScut5wj8c8DVK4x/uqp2Do+vTDaWJGktaxZ4VX0T+PEWZJEkHYdxzoHfmOSR4RTLWautlGRPkvkk8wsLC2NsTpK01EYL/BbgTcBO4CjwydVWrKp9VTVXVXMzMzMb3JwkabkNFXhVPVdVL1XVz4DPApdPNpYkaS0bKvAkO5bMvg84tNq6kqTNsW2tFZLcAbwdOCfJ08AfAW9PshMo4AjwoU3MKElawZoFXlW7Vhi+dROySJKOg3diSlJTFrgkNWWBS1JTFrgkNWWBS1JTFrgkNWWBS1JTFrgkNWWBS1J
TFrgkNWWBS1JTFrgkNWWBS1JTFrgkNWWBS1JTFrgkNWWBS1JTFrgkNWWBS1JTFrgkNWWBS1JTFrgkNWWBS1JTFrgkNWWBS1JTFrgkNWWBS1JTFrgkNWWBS1JTaxZ4ktuSHEtyaMnY2UkeSPLk8HzW5saUJC23niPwzwFXLxvbCzxYVRcBDw7zkqQttGaBV9U3gR8vG74OuH2Yvh1474RzSZLWsNFz4Nur6ugw/SywfbUVk+xJMp9kfmFhYYObkyQtN/aHmFVVQL3G8n1VNVdVczMzM+NuTpI02GiBP5dkB8DwfGxykSRJ67HRAr8X2D1M7wbumUwcSdJ6recywjuAfwEuTvJ0kg8ANwPvTPIk8OvDvCRpC21ba4Wq2rXKoqsmnEWSdBy8E1OSmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpC1ySmlrzf+T5eTC79/6xXn/k5ndPKIkkTY5H4JLUlAUuSU1Z4JLUlAUuSU1Z4JLUlAUuSU1Z4JLUlAUuSU1Z4JLUlAUuSU1Z4JLU1FjfhZLkCPAC8BLwYlXNTSKUJGltk/gyq1+rqh9N4OdIko6Dp1AkqalxC7yAryc5kGTPSisk2ZNkPsn8wsLCmJuTJC0at8DfVlWXAdcAH05y5fIVqmpfVc1V1dzMzMyYm5MkLRqrwKvqmeH5GHA3cPkkQkmS1rbhAk9yapLTF6eBdwGHJhVMkvTaxrkKZTtwd5LFn/OFqvrHiaSSJK1pwwVeVd8HfnWCWSRJx8HLCCWpKQtckpqywCWpKQtckpqywCWpKQtckpqywCWpKQtckpqywCWpKQtckpqaxP/I83Nvdu/9G37tkZvf3W67knrwCFySmrLAJakpC1ySmrLAJakpC1ySmrLAJakpLyPcZONcCvj/kZdOSuvnEbgkNWWBS1JTFrgkNWWBS1JTFrgkNWWBS1JTFrgkNeV14Jq4jte+d8w8Lq+bPz7j/hnZjN+3R+CS1JQFLklNWeCS1JQFLklNjVXgSa5O8kSSp5LsnVQoSdLaNlzgSU4C/gq4BrgE2JXkkkkFkyS9tnGOwC8Hnqqq71fVfwN/B1w3mViSpLWkqjb2wuR64Oqq+uAw/37gzVV147L19gB7htmLgSc2mPUc4EcbfO1WMeNkmHEyzDgZ05DxjVU1s3xw02/kqap9wL5xf06S+aqam0CkTWPGyTDjZJhxMqY54zinUJ4Bzl8yf94wJknaAuMU+L8CFyW5IMkpwA3AvZOJJUlay4ZPoVTVi0luBL4GnATcVlWPTSzZq419GmYLmHEyzDgZZpyMqc244Q8xJUknlndiSlJTFrgkNdWiwKfllv0ktyU5luTQkrGzkzyQ5Mnh+axhPEn+csj8SJLLtiDf+UkeSvJ4kseSfGQKM74+ybeTfHfI+MfD+AVJ9g9Zvjh8ME6S1w3zTw3LZzc745KsJyX5TpL7pjFjkiNJHk1yMMn8MDY1+3rY7plJvpzk35IcTvLWacqY5OLh97f4eD7JTdOU8TVV1VQ/GH1A+j3gQuAU4LvAJScoy5XAZcChJWN/CuwdpvcCfzJMXwt8FQjwFmD/FuTbAVw2TJ8O/DujrzmYpowBThumTwb2D9v+EnDDMP4Z4HeG6d8FPjNM3wB8cQv39+8BXwDuG+anKiNwBDhn2djU7Othu7cDHxymTwHOnLaMS7KeBDwLvHFaM74q84nc+Dp/qW8FvrZk/mPAx05gntllBf4EsGOY3gE8MUz/NbBrpfW2MOs9wDunNSPwS8DDwJsZ3em2bfk+Z3SV01uH6W3DetmCbOcBDwLvAO4b/sJOW8aVCnxq9jVwBvAfy38X05RxWa53Af88zRmXPzqcQjkX+OGS+aeHsWmxvaqODtPPAtuH6ROae3gbfymjI9ypyjicmjgIHAMeYPQO6ydV9eIKOV7OOCz/KfCGzc4I/Dnw+8DPhvk
3TGHGAr6e5EBGX1kB07WvLwAWgL8dTkX9TZJTpyzjUjcAdwzT05rx/+hQ4G3U6J/kE35dZpLTgDuBm6rq+aXLpiFjVb1UVTsZHeVeDvzyicyzXJLfAI5V1YETnWUNb6uqyxh9I+iHk1y5dOEU7OttjE453lJVlwL/xeh0xMumICMAw+cZ7wH+fvmyacm4kg4FPu237D+XZAfA8HxsGD8huZOczKi8P19Vd01jxkVV9RPgIUanI85Msnhj2dIcL2cclp8B/OcmR7sCeE+SI4y+ZfMdwF9MWUaq6pnh+RhwN6N/DKdpXz8NPF1V+4f5LzMq9GnKuOga4OGqem6Yn8aMr9KhwKf9lv17gd3D9G5G550Xx397+NT6LcBPl7wl2xRJAtwKHK6qT01pxpkkZw7Tv8joHP1hRkV+/SoZF7NfD3xjOCLaNFX1sao6r6pmGf15+0ZV/dY0ZUxyapLTF6cZnb89xBTt66p6FvhhkouHoauAx6cp4xK7eOX0yWKWacv4aifq5PtxfrhwLaMrKr4H/MEJzHEHcBT4H0ZHFx9gdK7zQeBJ4J+As4d1w+g/vPge8CgwtwX53sbord4jwMHhce2UZfwV4DtDxkPAHw7jFwLfBp5i9Db2dcP464f5p4blF27xPn87r1yFMjUZhyzfHR6PLf69mKZ9PWx3JzA/7O9/AM6awoynMnrHdMaSsanKuNrDW+klqakOp1AkSSuwwCWpKQtckpqywCWpKQtckpqywCWpKQtckpr6X+XyW3Gh8vL5AAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "code", "metadata": { "id": "4JIVQm0J1bde", "colab": { "base_uri": "https://localhost:8080/", "height": 269 }, "outputId": "9288306c-51bb-489e-8f6e-5014c1dc65c3" }, "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.plot(X, y, label = 'data')\n", "plt.scatter(X_train, y_train, label ='train')\n", "plt.scatter(X_test, y_test, label ='test')\n", "plt.legend()\n", "plt.show()" ], "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "code", "metadata": { "id": "rwSr7vva14-I", "colab": { "base_uri": "https://localhost:8080/", "height": 269 }, "outputId": "58c396f5-d340-4a3f-e15b-213425b6be5c" }, "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.plot(X, y, label = 'data')\n", "plt.scatter(X_train, y_train, label ='train')\n", "plt.scatter(X_test, y_test, label ='test')\n", "plt.plot(X[1::2], X[1::2].reshape(-1, 1).dot(weights[:-1]) + weights[-1], label = 'preds')\n", "plt.legend()\n", "plt.show()" ], "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "markdown", "metadata": { "id": "3PLeQb5xGGI2" }, "source": [ "**Определение (R2-score или коэффициент детерминации):**\n", "\n", "$$ R^2 = 1 - \\frac{MSE(y, \\widehat{y})}{D y} $$" ] }, { "cell_type": "code", "metadata": { "id": "VykDMvHG2zHE", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "c8684e8c-b94a-460e-ff5e-4275c6b74f40" }, "source": [ "from sklearn.metrics import r2_score\n", "\n", "train_preds = predict(X_train, weights)\n", "test_preds = predict(X_test, weights)\n", "\n", "print('train r2', r2_score(y_train, train_preds))\n", "print('test r2', r2_score(y_test, test_preds))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "train r2 0.9179964244553839\n", "test r2 0.8985775336147465\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "7VEiNra22QmH", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "931ba05f-9d79-4e2b-fae8-066b6753b34e" }, "source": [ "from sklearn.metrics import mean_squared_error\n", "\n", "train_preds = predict(X_train, weights)\n", "test_preds = predict(X_test, weights)\n", "\n", "print('train mse', mean_squared_error(y_train, train_preds))\n", "print('test mse', mean_squared_error(y_test, test_preds))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "train mse 77.84201506126992\n", "test mse 97.77251242777237\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "pgQ_SIEREWeV" }, "source": [ "# Ridge и Lasso регрессия. 
*Lp* regularization" ] }, { "cell_type": "markdown", "metadata": { "id": "G276_whHL02N" }, "source": [ "### Ridge&Lasso\n", "\n", "In practice, the closed-form formula for the linear regression coefficients is often not used on large problems. Instead, one uses gradient descent, which computes the derivatives of the loss and takes a step in the direction of steepest descent (recall that we want to minimize the loss function). For large feature matrices this is faster than explicitly inverting and multiplying matrices.\n", "Moreover, in many problems this is the only way to train a model, because it is not always (in fact, almost never) possible to write out an exact formula for the minimum of a complicated loss functional.\n", "\n", "Let us look at the implementations of linear regressors in the sklearn library.\n", "\n", "But first, let us understand why regularization is needed at all. Consider the problem of multicollinearity. Put simply, it means that the features are linearly dependent. Let us see what this leads to." ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cLhFPkULyke-", "outputId": "d092bcca-b0b0-40b5-cf84-6c9e849d257e" }, "source": [ "X_train" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([[-5. 
],\n", " [-4.7979798 ],\n", " [-4.5959596 ],\n", " [-4.39393939],\n", " [-4.19191919],\n", " [-3.98989899],\n", " [-3.78787879],\n", " [-3.58585859],\n", " [-3.38383838],\n", " [-3.18181818],\n", " [-2.97979798],\n", " [-2.77777778],\n", " [-2.57575758],\n", " [-2.37373737],\n", " [-2.17171717],\n", " [-1.96969697],\n", " [-1.76767677],\n", " [-1.56565657],\n", " [-1.36363636],\n", " [-1.16161616],\n", " [-0.95959596],\n", " [-0.75757576],\n", " [-0.55555556],\n", " [-0.35353535],\n", " [-0.15151515],\n", " [ 0.05050505],\n", " [ 0.25252525],\n", " [ 0.45454545],\n", " [ 0.65656566],\n", " [ 0.85858586],\n", " [ 1.06060606],\n", " [ 1.26262626],\n", " [ 1.46464646],\n", " [ 1.66666667],\n", " [ 1.86868687],\n", " [ 2.07070707],\n", " [ 2.27272727],\n", " [ 2.47474747],\n", " [ 2.67676768],\n", " [ 2.87878788],\n", " [ 3.08080808],\n", " [ 3.28282828],\n", " [ 3.48484848],\n", " [ 3.68686869],\n", " [ 3.88888889],\n", " [ 4.09090909],\n", " [ 4.29292929],\n", " [ 4.49494949],\n", " [ 4.6969697 ],\n", " [ 4.8989899 ]])" ] }, "metadata": { "tags": [] }, "execution_count": 7 } ] }, { "cell_type": "code", "metadata": { "id": "ClvBLotDHCVt", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "c5491929-ca1e-4f8e-e9f1-015b88aad08f" }, "source": [ "X_adversary = X_train.copy()\n", "X_adversary[:, 0] = 2\n", "print(X_train.shape, X_adversary.shape)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "(50, 1) (50, 1)\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_u6CJateyofQ", "outputId": "4bcacc73-7003-4697-808c-b957cd3041ec" }, "source": [ "X_adversary" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([[2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " 
[2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.],\n", " [2.]])" ] }, "metadata": { "tags": [] }, "execution_count": 13 } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 358 }, "id": "X-xlOw0FyhlP", "outputId": "567355a1-2ba7-426f-c1cc-349e4181987d" }, "source": [ "w_adversary = fit(X_adversary, y_train)\n", "w_adversary" ], "execution_count": null, "outputs": [ { "output_type": "error", "ename": "LinAlgError", "evalue": "ignored", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mLinAlgError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mw_adversary\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_adversary\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mw_adversary\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m\u001b[0m in \u001b[0;36mfit\u001b[0;34m(X, y)\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mk\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhstack\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mones\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mw\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlinalg\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mT\u001b[0m \u001b[0;34m@\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m@\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mT\u001b[0m \u001b[0;34m@\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 5\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mw\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m<__array_function__ internals>\u001b[0m in \u001b[0;36minv\u001b[0;34m(*args, **kwargs)\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/linalg/linalg.py\u001b[0m in \u001b[0;36minv\u001b[0;34m(a)\u001b[0m\n\u001b[1;32m 544\u001b[0m \u001b[0msignature\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'D->D'\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misComplexType\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mt\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0;34m'd->d'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 545\u001b[0m \u001b[0mextobj\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mget_linalg_error_extobj\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_raise_linalgerror_singular\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 546\u001b[0;31m \u001b[0mainv\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0m_umath_linalg\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msignature\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msignature\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mextobj\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mextobj\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 547\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mwrap\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mainv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult_t\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 548\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/dist-packages/numpy/linalg/linalg.py\u001b[0m in \u001b[0;36m_raise_linalgerror_singular\u001b[0;34m(err, flag)\u001b[0m\n\u001b[1;32m 86\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 87\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_raise_linalgerror_singular\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mflag\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 88\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mLinAlgError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Singular matrix\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 89\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 90\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_raise_linalgerror_nonposdef\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mflag\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mLinAlgError\u001b[0m: Singular matrix" ] } ] }, { 
"cell_type": "markdown", "metadata": { "id": "R1iQ4tbeyuN0" }, "source": [ "**ВОПРОС** Почему так произошло??" ] }, { "cell_type": "code", "metadata": { "id": "6Pic637Cy4nm" }, "source": [ "#TODO" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "GTUfDWr4Iwxr" }, "source": [ "**Что произошло**\n", "\n", "Ранг матрицы $\\mathrm{X_{adversary}}$ равен 1, а размерность признакового пространства равна 2. Из линейной алгебры известно, что ранг произведения матриц не превосходит минимального ранга этих двух матриц: \n", "\n", "$$\n", " \\mathrm{rk} AB \\leq \\min \\{\\mathrm{rk} A, \\mathrm{rk} B\\}\n", "$$\n", "\n", "К чему это здесь приводит? Посмотрим на аналитическое решение\n", "$$\n", " w = (X^{\\top} X)^{-1} X^{\\top} y\n", "$$\n", "\n", "Нас интересует компонента $(X^{\\top} X)^{-1}$. Здесь обратная не определена, потому что ранг матрицы $X^{\\top} X$ не превосходит единицы. При этом матрица $X^{\\top} X$ -- квадратная, размера 2x2. Из линейной алгебры мы знаем, что квадратная матрица обратима только тогда, когда она полноранговая. Именно об этом Вам сигналит ошибка LinAlgError: Вы пытаетесь обратить матрицу, которую обращать нельзя. 
What can be done?\n", "\n", "**Definition ($L_{p}$ regularization):**\n", "Suppose we have a linear regression with weight vector $w$ and loss function $\\mathcal{L}(y, \\widehat{y}(w))$ (for example, the mean squared error). Then $L_{p}$ regularization can be added to it by changing the loss functional as follows:\n", "\n", "$$\n", " \\mathcal{L}(y, \\widehat{y}(w)) + \\alpha \\|w\\|_{p}^{p} \\to \\min\\limits_{w}\n", "$$\n", "\n", "If $p$ equals 2, this is called **Ridge** regularization, and if $p$ equals 1, **Lasso** regularization.\n", "\n", "**Claim:** $L_{2}$ regularization avoids this problem: for the squared loss, the matrix to be inverted becomes $X^{\\top} X + \\lambda I$ with some $\\lambda > 0$ proportional to $\\alpha$, which is positive definite and therefore always invertible. ($L_{p}$ norms for arbitrary $p$ in fact avoid it as well, but that is nontrivial to prove.)\n", "\n", "**Definition (reminder):** The eigenvalues of a matrix $A$ are the numbers $\\lambda$ for which there exists a nonzero vector $x$ such that $Ax = \\lambda x$\n", "\n", "How can we tell how badly a matrix will behave when we try to invert it?\n", "Compute the condition number: for the symmetric positive semidefinite matrix $X^{\\top} X$, it is the ratio of the largest eigenvalue to the smallest one. The larger it is, the worse things are." 
] }, { "cell_type": "code", "metadata": { "id": "DW79B9eNPHYf", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "b6061c73-c5ca-45bb-e130-c666fc7502f1" }, "source": [ "eigenvals, eigenvectors = np.linalg.eig(X.reshape(1,-1).T @ X.reshape(1,-1))\n", "eigenvals.max() / eigenvals.min()" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(-3.02413172630763e+16-0j)" ] }, "metadata": { "tags": [] }, "execution_count": 16 } ] }, { "cell_type": "code", "metadata": { "id": "m5x9U-gELRjY" }, "source": [ "from sklearn.linear_model import LinearRegression, Ridge, Lasso" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "UgaA6qihMav8" }, "source": [ "from sklearn.datasets import load_wine" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "zTf8-WxNMhF7", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "27e1b7c7-d56c-494e-db8d-ca62689df18e" }, "source": [ "wine_data = load_wine()\n", "wine_data" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'DESCR': '.. 
_wine_dataset:\\n\\nWine recognition dataset\\n------------------------\\n\\n**Data Set Characteristics:**\\n\\n :Number of Instances: 178 (50 in each of three classes)\\n :Number of Attributes: 13 numeric, predictive attributes and the class\\n :Attribute Information:\\n \\t\\t- Alcohol\\n \\t\\t- Malic acid\\n \\t\\t- Ash\\n\\t\\t- Alcalinity of ash \\n \\t\\t- Magnesium\\n\\t\\t- Total phenols\\n \\t\\t- Flavanoids\\n \\t\\t- Nonflavanoid phenols\\n \\t\\t- Proanthocyanins\\n\\t\\t- Color intensity\\n \\t\\t- Hue\\n \\t\\t- OD280/OD315 of diluted wines\\n \\t\\t- Proline\\n\\n - class:\\n - class_0\\n - class_1\\n - class_2\\n\\t\\t\\n :Summary Statistics:\\n \\n ============================= ==== ===== ======= =====\\n Min Max Mean SD\\n ============================= ==== ===== ======= =====\\n Alcohol: 11.0 14.8 13.0 0.8\\n Malic Acid: 0.74 5.80 2.34 1.12\\n Ash: 1.36 3.23 2.36 0.27\\n Alcalinity of Ash: 10.6 30.0 19.5 3.3\\n Magnesium: 70.0 162.0 99.7 14.3\\n Total Phenols: 0.98 3.88 2.29 0.63\\n Flavanoids: 0.34 5.08 2.03 1.00\\n Nonflavanoid Phenols: 0.13 0.66 0.36 0.12\\n Proanthocyanins: 0.41 3.58 1.59 0.57\\n Colour Intensity: 1.3 13.0 5.1 2.3\\n Hue: 0.48 1.71 0.96 0.23\\n OD280/OD315 of diluted wines: 1.27 4.00 2.61 0.71\\n Proline: 278 1680 746 315\\n ============================= ==== ===== ======= =====\\n\\n :Missing Attribute Values: None\\n :Class Distribution: class_0 (59), class_1 (71), class_2 (48)\\n :Creator: R.A. Fisher\\n :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\\n :Date: July, 1988\\n\\nThis is a copy of UCI ML Wine recognition datasets.\\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\\n\\nThe data is the results of a chemical analysis of wines grown in the same\\nregion in Italy by three different cultivators. There are thirteen different\\nmeasurements taken for different constituents found in the three types of\\nwine.\\n\\nOriginal Owners: \\n\\nForina, M. 
et al, PARVUS - \\nAn Extendible Package for Data Exploration, Classification and Correlation. \\nInstitute of Pharmaceutical and Food Analysis and Technologies,\\nVia Brigata Salerno, 16147 Genoa, Italy.\\n\\nCitation:\\n\\nLichman, M. (2013). UCI Machine Learning Repository\\n[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,\\nSchool of Information and Computer Science. \\n\\n.. topic:: References\\n\\n (1) S. Aeberhard, D. Coomans and O. de Vel, \\n Comparison of Classifiers in High Dimensional Settings, \\n Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of \\n Mathematics and Statistics, James Cook University of North Queensland. \\n (Also submitted to Technometrics). \\n\\n The data was used with many others for comparing various \\n classifiers. The classes are separable, though only RDA \\n has achieved 100% correct classification. \\n (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) \\n (All results using the leave-one-out technique) \\n\\n (2) S. Aeberhard, D. Coomans and O. de Vel, \\n \"THE CLASSIFICATION PERFORMANCE OF RDA\" \\n Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of \\n Mathematics and Statistics, James Cook University of North Queensland. 
\\n (Also submitted to Journal of Chemometrics).\\n',\n", " 'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,\n", " 1.065e+03],\n", " [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,\n", " 1.050e+03],\n", " [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,\n", " 1.185e+03],\n", " ...,\n", " [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,\n", " 8.350e+02],\n", " [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,\n", " 8.400e+02],\n", " [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,\n", " 5.600e+02]]),\n", " 'feature_names': ['alcohol',\n", " 'malic_acid',\n", " 'ash',\n", " 'alcalinity_of_ash',\n", " 'magnesium',\n", " 'total_phenols',\n", " 'flavanoids',\n", " 'nonflavanoid_phenols',\n", " 'proanthocyanins',\n", " 'color_intensity',\n", " 'hue',\n", " 'od280/od315_of_diluted_wines',\n", " 'proline'],\n", " 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,\n", " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n", " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n", " 2, 2]),\n", " 'target_names': array(['class_0', 'class_1', 'class_2'], dtype='" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "code", "metadata": { "id": "Z5WwzMmiMfyr", "colab": { "base_uri": "https://localhost:8080/", "height": 539 }, "outputId": "32704791-e43b-437e-cc66-12371cdbb657" }, "source": [ "import seaborn as sns\n", "\n", "plt.figure(figsize = (10,6))\n", "sns.heatmap(X.corr())" ], "execution_count": null, "outputs": [ { "output_type": 
"execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 72 }, { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "code", "metadata": { "id": "tn3C6aLWNg-E", "colab": { "base_uri": "https://localhost:8080/", "height": 747 }, "outputId": "245cd141-bee6-4011-ec8d-b57605caa0bc" }, "source": [ "sns.clustermap(X.corr())" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 73 }, { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "code", "metadata": { "id": "ZewQhvWnMipr" }, "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=99\n", ")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "yjfmaMKIMlWT" }, "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler()\n", "\n", "X_train = scaler.fit_transform(X_train)\n", "X_test = scaler.transform(X_test)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "VFdldjDfMsGD" }, "source": [ "from sklearn.metrics import mean_squared_error" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "oaslBvEsMnKj", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "47443722-864f-4b09-f384-4304803ebf9d" }, "source": [ "regressor = LinearRegression()\n", "\n", "regressor.fit(X_train, y_train)\n", "test_predictions = regressor.predict(X_test)\n", "\n", "print('test mse: ', mean_squared_error(y_test, test_predictions))\n", "print('r2 score: ', r2_score(y_test, test_predictions))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "test mse: 0.07475703133417169\n", "r2 score: 0.8738475096235853\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "1eGcAHp4UiMq", "colab": { "base_uri": "https://localhost:8080/", "height": 439 }, "outputId": "b2e60d30-1b26-422d-9ebc-af79d5d21144" }, "source": [ "X" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>alcohol</th><th>malic_acid</th><th>ash</th><th>alcalinity_of_ash</th><th>magnesium</th><th>total_phenols</th><th>flavanoids</th><th>nonflavanoid_phenols</th><th>proanthocyanins</th><th>color_intensity</th><th>hue</th><th>od280/od315_of_diluted_wines</th><th>proline</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>14.23</td><td>1.71</td><td>2.43</td><td>15.6</td><td>127.0</td><td>2.80</td><td>3.06</td><td>0.28</td><td>2.29</td><td>5.64</td><td>1.04</td><td>3.92</td><td>1065.0</td></tr>\n",
"    <tr><th>1</th><td>13.20</td><td>1.78</td><td>2.14</td><td>11.2</td><td>100.0</td><td>2.65</td><td>2.76</td><td>0.26</td><td>1.28</td><td>4.38</td><td>1.05</td><td>3.40</td><td>1050.0</td></tr>\n",
"    <tr><th>2</th><td>13.16</td><td>2.36</td><td>2.67</td><td>18.6</td><td>101.0</td><td>2.80</td><td>3.24</td><td>0.30</td><td>2.81</td><td>5.68</td><td>1.03</td><td>3.17</td><td>1185.0</td></tr>\n",
"    <tr><th>3</th><td>14.37</td><td>1.95</td><td>2.50</td><td>16.8</td><td>113.0</td><td>3.85</td><td>3.49</td><td>0.24</td><td>2.18</td><td>7.80</td><td>0.86</td><td>3.45</td><td>1480.0</td></tr>\n",
"    <tr><th>4</th><td>13.24</td><td>2.59</td><td>2.87</td><td>21.0</td><td>118.0</td><td>2.80</td><td>2.69</td><td>0.39</td><td>1.82</td><td>4.32</td><td>1.04</td><td>2.93</td><td>735.0</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>173</th><td>13.71</td><td>5.65</td><td>2.45</td><td>20.5</td><td>95.0</td><td>1.68</td><td>0.61</td><td>0.52</td><td>1.06</td><td>7.70</td><td>0.64</td><td>1.74</td><td>740.0</td></tr>\n",
"    <tr><th>174</th><td>13.40</td><td>3.91</td><td>2.48</td><td>23.0</td><td>102.0</td><td>1.80</td><td>0.75</td><td>0.43</td><td>1.41</td><td>7.30</td><td>0.70</td><td>1.56</td><td>750.0</td></tr>\n",
"    <tr><th>175</th><td>13.27</td><td>4.28</td><td>2.26</td><td>20.0</td><td>120.0</td><td>1.59</td><td>0.69</td><td>0.43</td><td>1.35</td><td>10.20</td><td>0.59</td><td>1.56</td><td>835.0</td></tr>\n",
"    <tr><th>176</th><td>13.17</td><td>2.59</td><td>2.37</td><td>20.0</td><td>120.0</td><td>1.65</td><td>0.68</td><td>0.53</td><td>1.46</td><td>9.30</td><td>0.60</td><td>1.62</td><td>840.0</td></tr>\n",
"    <tr><th>177</th><td>14.13</td><td>4.10</td><td>2.74</td><td>24.5</td><td>96.0</td><td>2.05</td><td>0.76</td><td>0.56</td><td>1.35</td><td>9.20</td><td>0.61</td><td>1.60</td><td>560.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>178 rows × 13 columns</p>\n",
"</div>\n", "
" ], "text/plain": [ " alcohol malic_acid ash ... hue od280/od315_of_diluted_wines proline\n", "0 14.23 1.71 2.43 ... 1.04 3.92 1065.0\n", "1 13.20 1.78 2.14 ... 1.05 3.40 1050.0\n", "2 13.16 2.36 2.67 ... 1.03 3.17 1185.0\n", "3 14.37 1.95 2.50 ... 0.86 3.45 1480.0\n", "4 13.24 2.59 2.87 ... 1.04 2.93 735.0\n", ".. ... ... ... ... ... ... ...\n", "173 13.71 5.65 2.45 ... 0.64 1.74 740.0\n", "174 13.40 3.91 2.48 ... 0.70 1.56 750.0\n", "175 13.27 4.28 2.26 ... 0.59 1.56 835.0\n", "176 13.17 2.59 2.37 ... 0.60 1.62 840.0\n", "177 14.13 4.10 2.74 ... 0.61 1.60 560.0\n", "\n", "[178 rows x 13 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 78 } ] }, { "cell_type": "code", "metadata": { "id": "nUrN2soBMoxb", "colab": { "base_uri": "https://localhost:8080/", "height": 501 }, "outputId": "4d0bb0c1-323c-4db2-95ff-69f3aabdef17" }, "source": [ "plt.figure(figsize=(20, 8))\n", "plt.bar(X.columns, regressor.coef_)" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 79 }, { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "markdown", "metadata": { "id": "BBT8gghkM0aI" }, "source": [ "Теперь обратимся к методам с регуляризацией.\n", "\n", "Ridge (L2-регуляризация) сильно штрафует за слишком большие веса и не очень за малые. При увеличении коэффициента перед регуляризатором веса меняются плавно" ] }, { "cell_type": "code", "metadata": { "id": "a4OT2jMXMzz8", "colab": { "base_uri": "https://localhost:8080/", "height": 279 }, "outputId": "7256ddb9-3686-4e18-e216-6f0f4872a751" }, "source": [ "alphas = np.linspace(1, 1000, 100)\n", "\n", "weights = np.empty((len(X.columns), 0))\n", "for alpha in alphas:\n", " ridge_regressor = Ridge(alpha)\n", " ridge_regressor.fit(X_train, y_train)\n", " weights = np.hstack((weights, ridge_regressor.coef_.reshape(-1, 1)))\n", "plt.plot(alphas, weights.T)\n", "plt.xlabel('regularization coef')\n", "plt.ylabel('weight value')\n", "plt.show()" ], "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "markdown", "metadata": { "id": "dRt8FhcHM4f_" }, "source": [ "Lasso одинаково сильно штрафует малые и большие веса, поэтому при достаточно большом коэффициенте регуляризации многие признаки становятся равными нулю, при этом остаются только наиболее инфромативные. Этот факт можно использовать для решения задачи отбора признаков." ] }, { "cell_type": "code", "metadata": { "id": "cBqkQ24lM3lG", "colab": { "base_uri": "https://localhost:8080/", "height": 334 }, "outputId": "519c61ac-beaa-48c9-f408-be182f68dcb9" }, "source": [ "alphas = np.linspace(0.1, 1, 100)\n", "\n", "plt.figure(figsize=(10, 5))\n", "weights = np.empty((len(X.columns), 0))\n", "for alpha in alphas:\n", " lasso_regressor = Lasso(alpha)\n", " lasso_regressor.fit(X_train, y_train)\n", " weights = np.hstack((weights, lasso_regressor.coef_.reshape(-1, 1)))\n", "plt.plot(alphas, weights.T)\n", "plt.xlabel('regularization coef')\n", "plt.ylabel('weight value')\n", "plt.grid()\n", "plt.show()" ], "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "code", "metadata": { "id": "wNcemWQ4My2y", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "b3e60240-d4d5-4cb8-b135-89dc70daf70e" }, "source": [ "ridge = Ridge(0.1)\n", "ridge.fit(X_train, y_train)\n", "print('\\n r2 score ridge: ', r2_score(y_test, ridge.predict(X_test)))\n", "print('test mse ridge: ', mean_squared_error(y_test, ridge.predict(X_test)))\n", "\n", "lasso = Lasso(0.1)\n", "lasso.fit(X_train, y_train)\n", "print('\\n r2 score lasso: ', r2_score(y_test, lasso.predict(X_test)))\n", "print('test mse lasso: ', mean_squared_error(y_test, lasso.predict(X_test)))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "\n", " r2 score ridge: 0.8740521897249454\n", "test mse ridge: 0.07463573942225457\n", "\n", " r2 score lasso: 0.8096651433986044\n", "test mse lasso: 0.11279102613416034\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "0XRm91YX2D7i" }, "source": [ "# Выводы\n", "\n", "- Реализовано аналитическое решение задачи линейной регрессии\n", "- Дано определение Lp регуляризации\n", "- Приведено сравнение Ridge и Lasso регуляризаций модели\n" ] }, { "cell_type": "markdown", "metadata": { "id": "7rS3LA4QME_2" }, "source": [ "# Рекомендованная литература\n", "\n", "- [Матричные производные](http://www.machinelearning.ru/wiki/images/archive/9/93/20170127140036!MO17_seminar3.pdf)\n", "- [Решение уравнения простой линейной регрессии](https://habr.com/ru/post/474602/)\n", "\n" ] } ] }