Content prepared by: Berk Hakbilen

Logistic Regression Theory

Logistic Regression for Classification

Despite its name, logistic regression is an essential technique for binary and multiclass classification rather than a regression algorithm. It falls under the category of linear classifiers. It is a fast and simple model, which makes its results easy to interpret.

As its name suggests, logistic regression is still built on a regression analysis. However, unlike linear regression (which is not suitable for classification), the calculated result is the probability of an event.

Let's have a look at the linear regression model and then derive the logistic regression function from this.

Linear regression formula: $$y^i = a_0 + a_1x_1^i + \dots + a_nx_n^i$$

On the other hand, the so-called sigmoid function is: $$P(x) = \frac{1}{1 + \exp(-x)}$$

and its curve: [Figure: logistic curve, 480px-Logistic-curve.svg.png]

Here we can see that its output is always a value between 0 and 1, which is exactly the behaviour we want for binary classification.
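
To make the range of the sigmoid concrete, here is a minimal sketch in plain Python. The function name `sigmoid` and the sample inputs are chosen for illustration only.

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) function: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 maps to exactly 0.5.
for x in (-6, -2, 0, 2, 6):
    print(x, round(sigmoid(x), 4))
```

The function is also symmetric around 0.5: `sigmoid(x) + sigmoid(-x)` equals 1 for any input.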

As just mentioned, for a classification problem we want probabilities between 0 and 1 as results. We can achieve that by substituting our linear regression function into the sigmoid function, which gives our logistic regression function: $$P(y^i) = \frac{1}{1 + \exp(-(a_0 + a_1x_1^i + \dots + a_nx_n^i))}$$

Looking at the formula, we can see that the regression coefficients now appear in the exponent of the exponential term (e). This way, the regression coefficients from the linear regression function still affect the probability output of the logistic function.
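
As a small illustration of this substitution, the sketch below plugs a linear combination into the sigmoid and checks that the log-odds of the resulting probability give back the linear term. The coefficient values and the helper name `predict_probability` are made up for this example and are not fitted to any data.

```python
import math

def predict_probability(coefficients, features):
    """P(y) = 1 / (1 + exp(-(a_0 + a_1*x_1 + ... + a_n*x_n)))."""
    a_0, weights = coefficients[0], coefficients[1:]
    z = a_0 + sum(a * x for a, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients a_0, a_1, a_2 and one sample with two features
coefficients = [-1.0, 0.8, 0.5]
sample = [2.0, 1.0]

p = predict_probability(coefficients, sample)

# The log-odds log(p / (1 - p)) recover the linear term a_0 + a_1*x_1 + a_2*x_2,
# which is why logistic regression counts as a linear classifier.
log_odds = math.log(p / (1.0 - p))

# A common (assumed) decision rule: predict class 1 when p >= 0.5
predicted_class = 1 if p >= 0.5 else 0
print(p, log_odds, predicted_class)
```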

Exploratory Data Analysis

from sklearn.datasets import load_breast_cancer
import pandas as pd

cancer = load_breast_cancer()

df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['result'] = pd.Series(cancer.target)
df.head()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension radius error texture error perimeter error area error smoothness error compactness error concavity error concave points error symmetry error fractal dimension error worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension result
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 1.0950 0.9053 8.589 153.40 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.01860 0.01340 0.01389 0.003532 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 0.7456 0.7869 4.585 94.03 0.006150 0.04006 0.03832 0.02058 0.02250 0.004571 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 0.4956 1.1560 3.445 27.23 0.009110 0.07458 0.05661 0.01867 0.05963 0.009208 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 0.7572 0.7813 5.438 94.44 0.011490 0.02461 0.05688 0.01885 0.01756 0.005115 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         569 non-null    float64
 15  compactness error        569 non-null    float64
 16  concavity error          569 non-null    float64
 17  concave points error     569 non-null    float64
 18  symmetry error           569 non-null    float64
 19  fractal dimension error  569 non-null    float64
 20  worst radius             569 non-null    float64
 21  worst texture            569 non-null    float64
 22  worst perimeter          569 non-null    float64
 23  worst area               569 non-null    float64
 24  worst smoothness         569 non-null    float64
 25  worst compactness        569 non-null    float64
 26  worst concavity          569 non-null    float64
 27  worst concave points     569 non-null    float64
 28  worst symmetry           569 non-null    float64
 29  worst fractal dimension  569 non-null    float64
 30  result                   569 non-null    int64  
dtypes: float64(30), int64(1)
memory usage: 137.9 KB
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))

sns.countplot(x=df['result'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fd242d02950>
cancer.target_names
array(['malignant', 'benign'], dtype='<U9')
# value_counts() is indexed by class label: 1 = benign, 0 = malignant (see cancer.target_names)
counts = df['result'].value_counts()
benign, malignant = counts[1], counts[0]
print('Number of benign results: {} ({} percent)'.format(benign, round(benign / len(df) * 100, 2)))
print('Number of malignant results: {} ({} percent)'.format(malignant, round(malignant / len(df) * 100, 2)))
Number of benign results: 357 (62.74 percent)
Number of malignant results: 212 (37.26 percent)
cols = ['result',
        'mean radius', 
        'mean texture', 
        'mean perimeter', 
        'mean area', 
        'mean smoothness', 
        'mean compactness', 
        'mean concavity',
        'mean concave points', 
        'mean symmetry', 
        'mean fractal dimension']
sns.pairplot(data=df[cols], hue='result')
<seaborn.axisgrid.PairGrid at 0x7fd2423c4990>
f, ax = plt.subplots(figsize=(20, 20))

corr = df.corr().round(2)

# Create a mask that hides the upper triangle (including the diagonal)
import numpy as np
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# create the heatmap
sns.heatmap(corr, mask=mask,
            square=True, annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7fd2348f62d0>

Logistic Regression Model

from sklearn.model_selection import train_test_split

y = df['result'].values
X = df.drop('result',axis=1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape
(455, 30)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=10000)
logisticregression = model.fit(X_train,y_train)
print("training set score: %f" % logisticregression.score(X_train, y_train))
print("test set score: %f" % logisticregression.score(X_test, y_test))
training set score: 0.960440
test set score: 0.956140

Because our training and test scores are close to each other, the model is likely underfitting. Let's try a higher C value, which means weaker regularization and therefore a more complex model that fits the data more closely.

model = LogisticRegression(max_iter=10000,C=1000)
logisticregression_1000 = model.fit(X_train,y_train)
print("training set score: %f" % logisticregression_1000.score(X_train, y_train))
print("test set score: %f" % logisticregression_1000.score(X_test, y_test))
training set score: 0.986813
test set score: 0.991228

With C=1000, both the training and the test score improve. For comparison, let's also try a lower C value, which means stronger regularization and a simpler model.

model = LogisticRegression(max_iter=10000,C=0.1)
logisticregression_0_1 = model.fit(X_train,y_train)
print("training set score: %f" % logisticregression_0_1.score(X_train, y_train))
print("test set score: %f" % logisticregression_0_1.score(X_test, y_test))
training set score: 0.949451
test set score: 0.964912

A lower C does not make a big difference, since our model was already underfitting the data.
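
To illustrate what C controls, the following sketch fits a one-feature logistic regression by plain gradient descent on a tiny made-up dataset. It assumes an L2-penalized objective of the form "sum of log-losses plus w^2 / (2C)", which mirrors the role of C in scikit-learn's LogisticRegression as the inverse of the regularization strength. The data, the function name `fit_logistic_1d`, the learning rate, and the iteration count are all illustrative assumptions, and this is not scikit-learn's actual solver.

```python
import math

def fit_logistic_1d(xs, ys, C, learning_rate=0.005, n_iter=20000):
    """Minimise sum(log-loss) + w**2 / (2 * C) with gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(n_iter):
        # Gradient of the penalty term; the intercept b is not penalised
        grad_w, grad_b = w / C, 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up toy data that is not perfectly separable
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

weights = {C: fit_logistic_1d(xs, ys, C)[0] for C in (0.01, 1, 100)}
for C, w in weights.items():
    print("C =", C, "-> weight =", round(w, 3))
```

In this toy setting, a smaller C pulls the weight toward zero (a simpler model), while a larger C lets it grow toward the unregularized fit.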

from sklearn.metrics import confusion_matrix
# Note: `model` is the most recently fitted estimator, i.e. the one with C=0.1
y_pred = model.predict(X_test)
cf_matrix = confusion_matrix(y_test, y_pred,labels=[0,1])
cf_matrix
array([[40,  3],
       [ 1, 70]])
sns.heatmap(cf_matrix, annot=True, cmap='Blues')
<matplotlib.axes._subplots.AxesSubplot at 0x7f895589abd0>
from sklearn.metrics import ConfusionMatrixDisplay
# Class 0 is malignant and class 1 is benign (see cancer.target_names)
disp = ConfusionMatrixDisplay.from_estimator(model, X_test, y_test,
                                             display_labels=['Malignant', 'Benign'])
from sklearn import metrics
# target_names follow the label order: 0 = malignant, 1 = benign (see cancer.target_names)
print(metrics.classification_report(y_test, y_pred, target_names=['Malignant','Benign']))
              precision    recall  f1-score   support

   Malignant       0.98      0.93      0.95        43
      Benign       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
# precision_score and recall_score use pos_label=1 by default, so they refer to the benign class
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
Accuracy: 0.9649122807017544
Precision: 0.958904109589041
Recall: 0.9859154929577465
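
As a cross-check, the scores above can be recomputed by hand from the confusion matrix shown earlier. This sketch hard-codes that matrix and assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with class 1 treated as the positive class.

```python
# Confusion matrix from above: rows = true class, columns = predicted class
cf_matrix = [[40, 3],
             [1, 70]]

tn, fp = cf_matrix[0]  # samples whose true class is 0
fn, tp = cf_matrix[1]  # samples whose true class is 1 (positive class)

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all correct predictions
precision = tp / (tp + fp)                  # correct among predicted positives
recall = tp / (tp + fn)                     # correct among actual positives

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```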