Logistic Regression
Content prepared by: Berk Hakbilen
Logistic Regression for Classification
Despite its name, logistic regression is an essential technique for binary and multiclass classification rather than a regression algorithm. It falls under the category of linear classifiers. It is a fast and simple model, which also makes its results easy to interpret.
As its name suggests, logistic regression is still a regression analysis at its core. However, unlike linear regression (which is not suitable for a classification analysis), the calculated result is the probability of an event.
Let's have a look at the linear regression model and then derive the logistic regression function from this.
Linear regression formula: $$y^i = a_0 + a_1x_1^i + \dots + a_nx_n^i $$
On the other hand the so called sigmoid function is: $$P(x) = \frac{1}{1 + exp(-x)} $$
Its curve has the characteristic S-shape: the output is always a value between 0 and 1, which is exactly the behaviour we want for binary classification.
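To visualize this, here is a minimal sketch that plots the sigmoid using numpy and matplotlib (both of which we also use later in this notebook):
import numpy as np
import matplotlib.pyplot as plt
# Plot the sigmoid function P(x) = 1 / (1 + exp(-x))
x = np.linspace(-10, 10, 200)
p = 1 / (1 + np.exp(-x))
plt.plot(x, p)
plt.axhline(0.5, linestyle='--', color='gray')  # decision threshold at 0.5
plt.xlabel('x')
plt.ylabel('P(x)')
plt.title('Sigmoid function')
plt.show()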
As just mentioned, for a classification problem we want probabilities between 0 and 1 as results. We can achieve this by substituting our linear regression function into the sigmoid function, which gives us the logistic regression function: $$P(y^i) = \frac{1}{1 + exp(-(a_0 + a_1x_1^i + \dots + a_nx_n^i))} $$
Looking at the formula, we can see that the regression coefficients now appear in the exponent of the exponential term. This way the coefficients from the linear regression function still influence the probability produced by the logistic function.
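As a quick worked example with made-up coefficients (purely to illustrate the formula, not fitted to any data): with a_0 = -1, a_1 = 2 and a single feature x_1 = 1.5, the linear part is -1 + 2 * 1.5 = 2, so the logistic function gives P(y) = 1 / (1 + exp(-2)) ≈ 0.88.
import numpy as np
# Hypothetical coefficients, only to illustrate the formula
a0, a1 = -1.0, 2.0
x1 = 1.5
linear_part = a0 + a1 * x1                    # -1 + 2 * 1.5 = 2.0
probability = 1 / (1 + np.exp(-linear_part))  # sigmoid of the linear part
print(probability)                            # ~0.88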
from sklearn.datasets import load_breast_cancer
import pandas as pd
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['result'] = pd.Series(cancer.target)
df.head()
df.info()
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
sns.countplot(x='result', data=df)
cancer.target_names
# Count classes explicitly by label: 0 = malignant, 1 = benign
malignant, benign = df['result'].value_counts().sort_index()
print('Number of benign results {} corresponding to {} percent: '.format(benign,round(benign / len(df) * 100, 2)))
print('Number of malignant results {} corresponding to {} percent: '.format(malignant,round(malignant / len(df) * 100, 2)))
cols = ['result',
'mean radius',
'mean texture',
'mean perimeter',
'mean area',
'mean smoothness',
'mean compactness',
'mean concavity',
'mean concave points',
'mean symmetry',
'mean fractal dimension']
sns.pairplot(data=df[cols], hue='result')
f, ax = plt.subplots(figsize=(20, 20))
corr = df.corr().round(2)
# Mask the upper triangle so the heatmap shows only the lower triangle
import numpy as np
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# create the heatmap
sns.heatmap(corr, mask=mask,
square=True, annot=True)
from sklearn.model_selection import train_test_split
y = df['result'].values
X = df.drop('result',axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=10000)
logisticregression = model.fit(X_train,y_train)
print("training set score: %f" % logisticregression.score(X_train, y_train))
print("test set score: %f" % logisticregression.score(X_test, y_test))
Because our training and test scores are very close to each other, we are likely underfitting. Let's try a higher C value, which means less regularization and a more complex model that fits the training data more closely.
model = LogisticRegression(max_iter=10000,C=1000)
logisticregression_1000 = model.fit(X_train,y_train)
print("training set score: %f" % logisticregression_1000.score(X_train, y_train))
print("test set score: %f" % logisticregression_1000.score(X_test, y_test))
Now let's also try a lower C value, which means stronger regularization and a simpler, more constrained model.
model = LogisticRegression(max_iter=10000,C=0.1)
logisticregression_0_1 = model.fit(X_train,y_train)
print("training set score: %f" % logisticregression_0_1.score(X_train, y_train))
print("test set score: %f" % logisticregression_0_1.score(X_test, y_test))
A lower C does not make a big difference either, since our model was already underfitting the data.
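To make the effect of C more concrete, here is a minimal sketch (assuming the three fitted models above are still in memory) that compares the learned coefficients for the different regularization strengths:
# Compare learned coefficients for the different C values
plt.figure(figsize=(12, 5))
plt.plot(logisticregression.coef_.T, 'o', label='C=1 (default)')
plt.plot(logisticregression_1000.coef_.T, '^', label='C=1000')
plt.plot(logisticregression_0_1.coef_.T, 'v', label='C=0.1')
plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
plt.axhline(0, color='gray', linewidth=0.5)
plt.xlabel('Feature')
plt.ylabel('Coefficient magnitude')
plt.legend()
plt.show()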
from sklearn.metrics import confusion_matrix
y_pred = model.predict(X_test)
cf_matrix = confusion_matrix(y_test, y_pred,labels=[0,1])
cf_matrix
sns.heatmap(cf_matrix, annot=True, cmap='Blues')
from sklearn.metrics import ConfusionMatrixDisplay
# Plot the confusion matrix directly from the fitted estimator
# (class 0 = malignant, class 1 = benign)
disp = ConfusionMatrixDisplay.from_estimator(model, X_test, y_test,
                                             display_labels=['Malignant', 'Benign'])
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))