Influence of Feature Selection and PCA on a Small Dataset

November 23, 2017

scikit-learn machine learning feature selection PCA cross-validation

This study covers the influence of feature selection and PCA on the Titanic Survivors dataset. Most of the preprocessing code such as data cleaning, encoding and transformation is adapted from the Scikit-Learn ML from Start to Finish work by Jeff Delaney.

Import Data

Load the train and test CSV files into pandas DataFrames and print the first five rows to see a sample of the data. Also print descriptive statistics for each numeric feature.

import pandas as pd

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

train_df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

train_df.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Visualise Data

To become familiar with the data and discover underlying patterns that the machine learning models can exploit later, create some distribution, bar, and scatter plots. For a complete visual analysis, see the Scikit-Learn ML from Start to Finish work and my previous work here.

Engineer Features

Aside from ‘Sex’, ‘Age’ is the second most important feature. To reduce the risk of overfitting, group passengers into age bands.
Each ‘Cabin’ value starts with a letter. This letter is probably more informative than the number that follows, so keep the letter and discard the rest.
‘Fare’ is another continuous feature that should be simplified by placing its values into bins that approximate the quartiles.
Extract information from the ‘Name’ feature. Rather than use the full name, extract the last name and the title prefix and add them as separate features.
Lastly, drop the features that are not used (‘Ticket’, ‘Name’, and ‘Embarked’).

# Code adapted from https://www.kaggle.com/jeffd23/scikit-learn-ml-from-start-to-finish
def simplify_ages(df):
    df.Age = df.Age.fillna(-0.5)
    bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
    group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
    categories = pd.cut(df.Age, bins, labels=group_names)
    df.Age = categories
    return df

def simplify_cabins(df):
    df.Cabin = df.Cabin.fillna('N')
    df.Cabin = df.Cabin.apply(lambda x: x[0])
    return df

def simplify_fares(df):
    df.Fare = df.Fare.fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.Fare, bins, labels=group_names)
    df.Fare = categories
    return df

def format_name(df):
    df['Lname'] = df.Name.apply(lambda x: x.split(' ')[0])
    df['NamePrefix'] = df.Name.apply(lambda x: x.split(' ')[1])
    return df    
    
def drop_features(df):
    return df.drop(['Ticket', 'Name', 'Embarked'], axis=1)

def transform_features(df):
    df = simplify_ages(df)
    df = simplify_cabins(df)
    df = simplify_fares(df)
    df = format_name(df)
    df = drop_features(df)
    return df

train_df = transform_features(train_df)
test_df = transform_features(test_df)
train_df.head()

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Fare	Cabin	Lname	NamePrefix
0	1	0	3	male	Student	1	1_quartile	N	Braund,	Mr.
1	2	1	1	female	Adult	1	4_quartile	C	Cumings,	Mrs.
2	3	1	3	female	Young Adult	0	1_quartile	N	Heikkinen,	Miss.
3	4	1	1	female	Young Adult	1	4_quartile	C	Futrelle,	Mrs.
4	5	0	3	male	Young Adult	0	2_quartile	N	Allen,	Mr.
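The fare bin edges used above (0, 8, 15, 31) are hand-picked values close to the quartiles reported by describe(). As a minimal sketch of an alternative, pd.qcut can derive the edges from the data itself. The fares below are made-up illustration values, not rows from train.csv.

```python
import pandas as pd

# Hypothetical fares for illustration only (not taken from train.csv).
fares = pd.Series([7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08])

# qcut computes the bin edges from the sample quantiles (q=4 gives quartiles).
labels = ['1_quartile', '2_quartile', '3_quartile', '4_quartile']
fare_bins, edges = pd.qcut(fares, q=4, labels=labels, retbins=True)

print(edges)                                   # five edges delimit four bins
print(fare_bins.value_counts().sort_index())   # two fares per quartile here
```

With qcut the bins hold roughly equal numbers of passengers by construction, whereas fixed edges keep the bins identical between the train and test files.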

Encode Data

Convert the categorical, non-numerical features to integer codes with the LabelEncoder tool from scikit-learn, making our data usable by a wider range of algorithms. The encoder is fitted on the combined train and test values so that both sets share the same codes.

from sklearn import preprocessing

def encode_features(df_train, df_test):
    features = ['Fare', 'Cabin', 'Age', 'Sex', 'Lname', 'NamePrefix']
    df_combined = pd.concat([df_train[features], df_test[features]])
    
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test
    
train_df, test_df = encode_features(train_df, test_df)
train_df.head()

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Fare	Cabin	Lname	NamePrefix
0	1	0	3	1	4	1	0	7	100	19
1	2	1	1	0	0	1	3	2	182	20
2	3	1	3	0	7	0	0	7	329	16
3	4	1	1	0	7	1	3	2	267	20
4	5	0	3	1	7	0	1	7	15	19
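To see why the encoder is fitted on the train and test values together, here is a small sketch with made-up cabin letters (the lists are hypothetical, not the real columns). LabelEncoder assigns codes in sorted order of the labels it has seen, and it refuses to transform a label it was not fitted on.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical cabin letters; 'T' appears only in the test values.
train_cabins = ['N', 'C', 'N', 'E']
test_cabins = ['N', 'T']

# Fitted on both sets: every label gets a code.
le = LabelEncoder().fit(train_cabins + test_cabins)
print(list(le.classes_))          # sorted unique labels
encoded = le.transform(test_cabins)
print(list(encoded))

# Fitted on the training values only: the unseen label raises ValueError.
train_only = LabelEncoder().fit(train_cabins)
try:
    train_only.transform(test_cabins)
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)
```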

Machine Learning

Split Data to Train/Test Sets

Create train/test sets using the train_test_split function. Setting test_size=0.2 holds out 20% of the data for testing.

from sklearn.model_selection import train_test_split

# Define the independent variables as features.
Xs = train_df.drop(['PassengerId', 'Survived'], axis=1)

# Define the target (dependent) variable as labels.
Ys = train_df['Survived']

# Create a train/test split using 20% test size.
X_train, X_test, y_train, y_test = train_test_split(Xs,
                                                    Ys,
                                                    test_size=0.2,
                                                    random_state=23)

# Check the split printing the shape of each set.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(712, 9) (712,)
(179, 9) (179,)
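The shapes printed above can be reproduced by hand. This sketch assumes that train_test_split rounds the test share up and gives the remainder to the training set, which is consistent with the 712/179 split shown.

```python
import math

n_samples = 891    # rows in train.csv, per the describe() output
test_size = 0.2

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```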

Evaluate Algorithms

Test and evaluate six algorithms:
1. Naive Bayes
2. Support Vector Machines
3. Decision Trees
4. K Nearest Neighbors
5. Random Forest
6. AdaBoost

Validate

Measure the effectiveness of each algorithm with KFold cross-validation. Split the data into 50 folds, then run the algorithm using a different fold as the test set on each iteration. Set shuffle=True to shuffle the data points before splitting them into folds.
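With 891 rows and 50 folds, each test fold is small. The arithmetic below is a sketch of how KFold distributes the samples: the first n_samples % n_splits folds get one extra sample.

```python
n_samples = 891   # rows in train.csv, per the describe() output
n_splits = 50

# The first `remainder` folds hold one sample more than the others.
base_size, remainder = divmod(n_samples, n_splits)
fold_sizes = [base_size + 1 if i < remainder else base_size
              for i in range(n_splits)]

print(base_size, remainder)
print(min(fold_sizes), max(fold_sizes), sum(fold_sizes))
```

Each fold is scored on only 17 or 18 passengers, so a single misclassification moves that fold's accuracy by more than five percentage points. Only the mean over all folds is meaningful.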

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold
from time import time

# Create the classifiers.
nb = GaussianNB()
svc = SVC()
dtc = DecisionTreeClassifier()
knc = KNeighborsClassifier()
rfc = RandomForestClassifier()
abc = AdaBoostClassifier()

# Create a dictionary of classifiers to choose from.
classifiers = {"GaussianNB": nb, "SVM": svc, "Decision Trees": dtc, 
               "KNN": knc, "Random Forest": rfc, "AdaBoost": abc}

# Create a function that runs and evaluates the classifiers.
def test_clfs(clf):
    
    # Create the KFold cross validation iterator.
    kf = KFold(n_splits=50, shuffle=True, random_state=23)
    
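    outcomes = []
    fold = 0
    for train_index, test_index in kf.split(Xs):
        # Note: t0 is reset on every iteration, so the time printed
        # after the loop covers only the last fold.
        t0 = time()
        fold += 1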
        X_train, X_test = Xs.values[train_index], Xs.values[test_index]
        y_train, y_test = Ys.values[train_index], Ys.values[test_index]
           
        # Fit the classifier to the data.
        clf.fit(X_train, y_train)
            
        # Create a set of predictions.
        predictions = clf.predict(X_test)
        # Evaluate predictions with accuracy score.
        accuracy = clf.score(X_test, y_test)
            
        outcomes.append(accuracy)
            
    mean_outcome = np.mean(outcomes)
        
    # Print the results.
    print("\nMean Accuracy: {0}".format(mean_outcome))
    print("\nTime passed: ", round(time() - t0, 3), "s\n")

for name, clf in classifiers.items():
    print("#"*55)
    print(name)
    test_clfs(clf)

#######################################################
GaussianNB

Mean Accuracy: 0.7601960784313725

Time passed:  0.002 s

#######################################################
SVM

Mean Accuracy: 0.6271895424836601

Time passed:  0.03 s

#######################################################
Decision Trees

Mean Accuracy: 0.7644444444444445

Time passed:  0.002 s

#######################################################
KNN

Mean Accuracy: 0.5996078431372549

Time passed:  0.002 s

#######################################################
Random Forest

Mean Accuracy: 0.8041176470588235

Time passed:  0.02 s

#######################################################
AdaBoost

Mean Accuracy: 0.8263398692810456

Time passed:  0.088 s

Select AdaBoost as the best algorithm, since it gives the highest mean accuracy (0.826).
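The manual fold loop in test_clfs can also be written with cross_val_score, which returns one score per fold. This is a minimal sketch on synthetic data: X_demo and y_demo are stand-ins generated with make_classification, not the Titanic features, and the number of folds is reduced to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for Xs and Ys (assumption: not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=9,
                                     random_state=23)

# Same kind of iterator as in test_clfs, with fewer folds.
kf = KFold(n_splits=10, shuffle=True, random_state=23)

# One accuracy value per fold.
scores = cross_val_score(GaussianNB(), X_demo, y_demo,
                         cv=kf, scoring='accuracy')

print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean helps judge whether the gap between two classifiers is larger than the fold-to-fold noise.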

Influence of Feature Selection & PCA

Investigate whether feature selection and PCA can improve the performance of AdaBoost.

Feature Selection

Select the best features with SelectKBest, which removes all but the k highest scoring features.
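As a small illustration of what SelectKBest does, the sketch below builds a synthetic dataset (an assumption, not the Titanic data) in which only the first column carries information about the label. The default score function f_classif is passed explicitly.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(23)
y_demo = rng.randint(0, 2, size=200)

# Column 0 tracks the label closely; columns 1 and 2 are pure noise.
X_demo = np.column_stack([y_demo + 0.1 * rng.randn(200),
                          rng.randn(200),
                          rng.randn(200)])

selector = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)
X_selected = selector.transform(X_demo)

print(selector.scores_)         # one ANOVA F-score per column
print(selector.get_support())   # boolean mask of the kept columns
```

Each feature is scored on its own against the target, so the selection is supervised but ignores interactions between features.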

PCA

Reduce the dimensionality of the data using Principal Component Analysis (PCA).
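The sketch below shows PCA on a synthetic matrix (an assumption, not the Titanic features) with two nearly collinear columns and one independent column. Two components are enough to keep almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(23)
base = rng.randn(200, 1)

# Columns 0 and 1 are nearly collinear; column 2 is independent noise.
X_demo = np.hstack([base,
                    2 * base + 0.01 * rng.randn(200, 1),
                    rng.randn(200, 1)])

pca = PCA(n_components=2).fit(X_demo)
X_reduced = pca.transform(X_demo)

print(pca.explained_variance_ratio_)   # share of variance per component
print(X_reduced.shape)
```

Unlike SelectKBest, PCA never looks at the target: it keeps the directions of largest variance, whether or not they help predict survival.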

Dimensionality Reduction

Use GridSearchCV and Pipeline to optimize over different classes of estimators. Compare unsupervised PCA dimensionality reduction to univariate feature selection SelectKBest during the grid search.
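The grid keys used for this search follow the Pipeline naming convention: the step name, a double underscore, then the parameter name. A whole step can also be swapped by using the step name alone as the key. A minimal sketch with set_params, using the same step names as the pipeline in this section:

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline

pipe = Pipeline([('reduce_dim', PCA()),
                 ('clf', AdaBoostClassifier())])

# '<step>__<parameter>' reaches into a step.
pipe.set_params(reduce_dim__n_components=3)
n_components = pipe.named_steps['reduce_dim'].n_components
print(n_components)

# '<step>' alone replaces the whole step with another transformer.
pipe.set_params(reduce_dim=SelectKBest(k=5))
step_name = type(pipe.named_steps['reduce_dim']).__name__
print(step_name)
```

GridSearchCV applies the same mechanism for every parameter combination, which is how one grid can compare PCA and SelectKBest in the same pipeline slot.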

from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, accuracy_score, classification_report, confusion_matrix
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Create the classifier.
clf = AdaBoostClassifier()

# Create the pipeline.
pipeline = Pipeline([('reduce_dim', PCA()),
                     ('clf', clf)])

# Create the parameters.
n_feature_options = [1, 2, 3, 4, 5, 6, 7, 8, 9]
n_estimators = [50]
parameters = [{'reduce_dim': [PCA(iterated_power=7)],
               'reduce_dim__n_components': n_feature_options,
               'clf__n_estimators': n_estimators},
              {'reduce_dim': [SelectKBest()],
               'reduce_dim__k': n_feature_options,
               'clf__n_estimators': n_estimators}]

reducer_labels = ['PCA', 'KBest()']

# Create a function to get the best estimator and print the reports.
def compare_estimators():
    t0 = time()

    # Create the KFold cross-validator.
    kf = KFold(n_splits=50, shuffle=True, random_state=23)

    # Create accuracy score to compare each combination.
    scoring = {'Accuracy': make_scorer(accuracy_score)}

    # Create the grid search.
    grid = GridSearchCV(estimator=pipeline,
                        param_grid=parameters,
                        scoring=scoring,
                        cv=kf, refit='Accuracy')

    # Fit grid search combinations.
    grid.fit(X_train, y_train)

    # Make predictions.
    predictions = grid.predict(X_test)

    # Evaluate using sklearn.classification_report().
    report = classification_report(y_test, predictions)

    # Get the best parameters and scores.
    best_parameters = grid.best_params_
    best_score = grid.best_score_
    
    mean_scores = np.array(grid.cv_results_['mean_test_Accuracy'])
    # Scores are in the order of param_grid iteration, which is alphabetical.
    mean_scores = mean_scores.reshape(len(n_estimators), -1, len(n_feature_options))
    # Select the best score across the n_estimators values.
    mean_scores = mean_scores.max(axis=0)
    bar_offsets = (np.arange(len(n_feature_options)) *
                   (len(reducer_labels) + 1) + .5)

    plt.figure(figsize=(10, 5))
    for i, (label, reducer_scores) in enumerate(zip(reducer_labels, mean_scores)):
        plt.bar(bar_offsets + i, reducer_scores, label=label)

    plt.title("Comparing feature reduction techniques")
    plt.xlabel('Reduced number of features')
    plt.xticks(bar_offsets + len(reducer_labels) / 2, n_feature_options)
    plt.ylabel('Accuracy')
    plt.ylim((0, 1))
    plt.legend(loc='upper left')

    # Print the results.
    print("\nAccuracy score: ", accuracy_score(y_test, predictions))
    print("\nReport:\n")
    print(report)
    print("\nBest Mean Accuracy score: ", best_score)
    print("\nBest parameters:\n")
    print(best_parameters)
    print(confusion_matrix(y_test, predictions))
    print("Time passed: ", round(time() - t0, 3), "s")
    
    return grid.best_estimator_

compare_estimators()

Accuracy score:  0.821229050279

Report:

             precision    recall  f1-score   support

          0       0.85      0.87      0.86       115
          1       0.76      0.73      0.75        64

avg / total       0.82      0.82      0.82       179


Best Mean Accuracy score:  0.817415730337

Best parameters:

{'clf__n_estimators': 50, 'reduce_dim': SelectKBest(k=9, score_func=<function f_classif at 0x116dfee18>), 'reduce_dim__k': 9}
[[100  15]
 [ 17  47]]
Time passed:  101.808 s





Pipeline(memory=None,
     steps=[('reduce_dim', SelectKBest(k=9, score_func=<function f_classif at 0x116dfee18>)), ('clf', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None))])

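[Figure: bar chart "Comparing feature reduction techniques", showing accuracy against the reduced number of features for PCA and SelectKBest]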

SelectKBest appears to outperform PCA for every number of retained features, but dimensionality reduction did not improve the accuracy noticeably. In fact, the best model found by the grid search keeps all nine features (k=9), which amounts to no feature selection at all.

Tune

To improve performance, tune the hyperparameters that affect the model using GridSearchCV. Validate the algorithm with StratifiedShuffleSplit and evaluate it with multiple metrics such as accuracy, precision, and recall.
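StratifiedShuffleSplit draws random train/test splits that preserve the class proportions of the target. The sketch below uses a made-up label vector with 40% positives (an assumption for illustration) and checks the share of positives in each test split.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels: 60 negatives and 40 positives.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.zeros((100, 1))   # features are irrelevant for the split

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=3)

positive_rates = []
for train_idx, test_idx in sss.split(X_demo, y_demo):
    # Each test split has 20 samples with the same 40% positive share.
    positive_rates.append(y_demo[test_idx].mean())

print(positive_rates)
```

This matters for the Titanic data because survivors are the minority class (about 38% of the training rows), so an unstratified split of a small set could drift from that ratio.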

from sklearn.model_selection import StratifiedShuffleSplit

# Create the classifier.
clf = AdaBoostClassifier()

# Create the parameters.
parameters = {'n_estimators': [10, 25, 50, 75],
              'algorithm': ['SAMME', 'SAMME.R'],
              'random_state': [3]}

# Find the best estimator and print the reports.
t0 = time()

# Create the Stratified ShuffleSplit cross-validator.
sss = StratifiedShuffleSplit(n_splits=50, test_size=0.2, random_state=3)

# Create multiple evaluation metrics to compare each combination.
scoring = {'AUC': 'roc_auc',
           'Accuracy': make_scorer(accuracy_score),
           'Precision': 'precision',
           'Recall': 'recall',
           'f1': 'f1'}

# Create the grid search.
grid = GridSearchCV(estimator=clf,
                    param_grid=parameters,
                    scoring=scoring,
                    cv=sss, refit='Accuracy')

# Fit grid search combinations.
grid.fit(X_train, y_train)

# Make predictions.
predictions = grid.predict(X_test)

# Evaluate using sklearn.classification_report().
report = classification_report(y_test, predictions)

# Get the best parameters and scores.
best_parameters = grid.best_params_
best_score = grid.best_score_

# Print the results.
print("\nAccuracy score: ", accuracy_score(y_test, predictions))
print("\nReport:\n")
print(report)
print("\nBest Accuracy score: ", best_score)
print("\nBest parameters:\n")
print(best_parameters)
print(confusion_matrix(y_test, predictions))
print("Time passed: ", round(time() - t0, 3), "s")

best_clf = grid.best_estimator_

Accuracy score:  0.821229050279

Report:

             precision    recall  f1-score   support

          0       0.85      0.87      0.86       115
          1       0.76      0.73      0.75        64

avg / total       0.82      0.82      0.82       179


Best Accuracy score:  0.819300699301

Best parameters:

{'algorithm': 'SAMME.R', 'n_estimators': 25, 'random_state': 3}
[[100  15]
 [ 17  47]]
Time passed:  42.804 s
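The numbers in the classification report follow directly from the confusion matrix printed above. As a worked check, with rows as true classes and columns as predicted classes:

```python
# Confusion matrix from the output above: rows = true, columns = predicted.
tn, fp = 100, 15   # true class 0 (did not survive)
fn, tp = 17, 47    # true class 1 (survived)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # predicted survivors that did survive
recall_1 = tp / (tp + fn)      # actual survivors that were found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(round(accuracy, 4))
print(round(precision_1, 2), round(recall_1, 2))
print(round(precision_0, 2), round(recall_0, 2))
```

The model misses 17 of the 64 survivors in the held-out set, which is why recall for class 1 is the weakest figure in the report.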

Predict the Actual Test Data

Finally, make the predictions and export them to a csv file.

passenger_ids = test_df['PassengerId']
predictions = best_clf.predict(test_df.drop('PassengerId', axis=1))

output = pd.DataFrame({ 'PassengerId' : passenger_ids, 'Survived': predictions })
output.to_csv('titanic_predictions.csv', index = False)
output.head()

	PassengerId	Survived
0	892	0
1	893	0
2	894	0
3	895	0
4	896	1
