18 minute read

Banner photo by Christian Stahl

Context

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able rent a bike from a one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

Goal

Forecast use of a city bikeshare system


Exploratory Data Analysis

import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.display.max_columns = 100

import warnings
warnings.filterwarnings("ignore")
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
import xgboost as xgb
import lightgbm as lgbm

Let’s see the head of the data set

df = pd.read_csv("../input/train.csv")
df.head()
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1

The dataset contains the following columns:

datetime - hourly date + timestamp  
season -  1 = spring, 2 = summer, 3 = fall, 4 = winter 
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather - 1: Clear, Few clouds, Partly cloudy, Partly cloudy 
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 
4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog 
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals

‘count’ is the target, all other columns are features.

The value of ‘count’ is equal to casual plus register

df.shape
(10886, 12)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null object
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
count         10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB

There are basiclly 3 types of data:

  • data concerning the weather
  • for time infos
  • linked to the number of users

No Nan, it seems that there isn’t any missing value. All types seem to be corresponding to the meaning of the data except the ‘datetime’ column wich is an object not a pandas’ datetime Let’s see the basic statistics:

df.describe()
season holiday workingday weather temp atemp humidity windspeed casual registered count
count 10886.000000 10886.000000 10886.000000 10886.000000 10886.00000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000
mean 2.506614 0.028569 0.680875 1.418427 20.23086 23.655084 61.886460 12.799395 36.021955 155.552177 191.574132
std 1.116174 0.166599 0.466159 0.633839 7.79159 8.474601 19.245033 8.164537 49.960477 151.039033 181.144454
min 1.000000 0.000000 0.000000 1.000000 0.82000 0.760000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 2.000000 0.000000 0.000000 1.000000 13.94000 16.665000 47.000000 7.001500 4.000000 36.000000 42.000000
50% 3.000000 0.000000 1.000000 1.000000 20.50000 24.240000 62.000000 12.998000 17.000000 118.000000 145.000000
75% 4.000000 0.000000 1.000000 2.000000 26.24000 31.060000 77.000000 16.997900 49.000000 222.000000 284.000000
max 4.000000 1.000000 1.000000 4.000000 41.00000 45.455000 100.000000 56.996900 367.000000 886.000000 977.000000

At first glance, when we look at the temp / atemp min, it seems strange that no negative temperature is recorded…

Weather informations analysis

plt.figure(figsize=(8, 4))
sns.distplot(df.temp, bins=10, label='real temp.')
sns.distplot(df.atemp, bins=10, label='feels like temp.')
plt.legend()
plt.show()

png

One can see an offset between the real temperature and the “feels like” temperature. This can probably explained by the fact that temperature on a bike is different. But the distributions looks the same. More, there is a clear correlation between the 2 features.

plt.figure(figsize=(4, 4))
sns.scatterplot(df.atemp, df.temp)
plt.show()

png

Let’s see if the difference between the 2 features is quite the same…

df['Delta'] = df.temp - df.atemp
sns.lineplot(x=df.index, y=df.Delta)
<matplotlib.axes._subplots.AxesSubplot at 0x7fa54c4cf588>

png

There are few abnormal values corresponding to the spike in the above plot.

df[df.Delta > 5].head()
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count Delta
8991 2012-08-17 00:00:00 3 0 1 1 27.88 12.12 57 11.0014 21 67 88 15.76
8992 2012-08-17 01:00:00 3 0 1 1 27.06 12.12 65 7.0015 16 38 54 14.94
8993 2012-08-17 02:00:00 3 0 1 1 27.06 12.12 61 8.9981 4 15 19 14.94
8994 2012-08-17 03:00:00 3 0 1 1 26.24 12.12 65 7.0015 0 6 6 14.12
8995 2012-08-17 04:00:00 3 0 1 1 26.24 12.12 73 11.0014 0 9 9 14.12
df[(df.atemp >= 12) & (df.atemp <= 12.5)].head()
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count Delta
59 2011-01-03 14:00:00 1 0 1 1 10.66 12.12 30 19.0012 11 66 77 -1.46
60 2011-01-03 15:00:00 1 0 1 1 10.66 12.12 30 16.9979 14 58 72 -1.46
61 2011-01-03 16:00:00 1 0 1 1 10.66 12.12 30 16.9979 9 67 76 -1.46
109 2011-01-05 18:00:00 1 0 1 1 9.84 12.12 38 8.9981 3 166 169 -2.28
115 2011-01-06 00:00:00 1 0 1 1 7.38 12.12 55 0.0000 0 11 11 -4.74

All those measures have been recorded the same day. During that day ‘temp’ changed but not ‘atemp’ The ‘atemp’ value seems strange even during other few days… Because those 2 features are correlated, we only need to keep one. This is more safe to keep temp instead of atemp

Is there also a correlation between humidity and windspeed ?

plt.figure(figsize=(8, 4))
sns.distplot(df.humidity, bins=10, label='humidity')
sns.distplot(df.windspeed, bins=10, label='windspeed')
plt.legend()
plt.show()

png

sns.pairplot(df[['temp', 'atemp', 'humidity', 'windspeed']])
<seaborn.axisgrid.PairGrid at 0x7fa547823a58>

png

Except for the temperature, there is no clear correlation between other features. Furthermore:

  • humidity vs atemp there is something strange with missing value like a void in the scatter
  • for all windspeed plots there are null values and beginning values start at 5, that’s probably because the sensor can detect low values…this needs to be investigated further.

Modification of time data

df['casual_percentage'] = df['casual'] / df['count']
df['registered_percentage'] = df['registered'] / df['count']
def change_datetime(df):
    """ Modify the col datetime to create other cols: dow, month, week..."""
    df["datetime"] = pd.to_datetime(df["datetime"])
    df["dow"] = df["datetime"].dt.dayofweek
    df["month"] = df["datetime"].dt.month
    df["week"] = df["datetime"].dt.week
    df["hour"] = df["datetime"].dt.hour
    df["year"] = df["datetime"].dt.year
    df["season"] = df.season.map({1: "Winter", 2 : "Spring", 3 : "Summer", 4 :"Fall" })
    df["month_str"] = df.month.map({1: "Jan ", 2 : "Feb", 3 : "Mar", 4 : "Apr", 5: "May", 6: "Jun", 7: "Jul", 8: "Aug", 9: "Sep", 10: "Oct", 11: "Nov", 12: "Dec" })
    df["dow_str"] = df.dow.map({5: "Sat", 6 : "Sun", 0 : "Mon", 1 :"Tue", 2 : "Wed", 3 : "Thu", 4: "Fri" })
    df["weather_str"] = df.weather.map({1: "Good", 2 : "Normal", 3 : "Bad", 4 :"Very Bad"})
    return df
    
    
df = change_datetime(df)
df.head()
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count Delta casual_percentage registered_percentage dow month week hour year month_str dow_str weather_str
0 2011-01-01 00:00:00 Winter 0 0 1 9.84 14.395 81 0.0 3 13 16 -4.555 0.187500 0.812500 5 1 52 0 2011 Jan Sat Good
1 2011-01-01 01:00:00 Winter 0 0 1 9.02 13.635 80 0.0 8 32 40 -4.615 0.200000 0.800000 5 1 52 1 2011 Jan Sat Good
2 2011-01-01 02:00:00 Winter 0 0 1 9.02 13.635 80 0.0 5 27 32 -4.615 0.156250 0.843750 5 1 52 2 2011 Jan Sat Good
3 2011-01-01 03:00:00 Winter 0 0 1 9.84 14.395 75 0.0 3 10 13 -4.555 0.230769 0.769231 5 1 52 3 2011 Jan Sat Good
4 2011-01-01 04:00:00 Winter 0 0 1 9.84 14.395 75 0.0 0 1 1 -4.555 0.000000 1.000000 5 1 52 4 2011 Jan Sat Good

Rentals / target analysis

sns.kdeplot(data=df['count'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5472d17b8>

png

df['y_log'] = np.log(df['count'])
sns.kdeplot(data=df['y_log'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5452d5e80>

png

plt.figure(figsize=(12, 6))
sns.pointplot(x=df["hour"], y=df["count"], hue=df["season"])
plt.xlabel("Hour Of The Day")
plt.ylabel("Users Count") 
plt.title("Rentals Across Hours")
plt.show()

png

plt.figure(figsize=(12, 6))
sns.lineplot(x="hour", y="count", hue="season", data=df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5452ca518>

png

# ---------------------------------------------------------
plt.figure(figsize=(6,3))
plt.stackplot(range(1,25),
              df.groupby(['hour'])['casual_percentage'].mean(), 
              df.groupby(['hour'])['registered_percentage'].mean(), 
              labels=['Casual','Registered'])
plt.legend(loc='upper left')
plt.margins(0,0)
plt.title("Evolution of casual /registered bikers' share over hours of the day")

# ---------------------------------------------------------
plt.figure(figsize=(6,6))
df_hours = pd.DataFrame(
    {"casual" : df.groupby(['hour'])['casual'].mean().values,
    "registered" : df.groupby(['hour'])['registered'].mean().values},
    index = df.groupby(['hour'])['casual'].mean().index)
df_hours.plot.bar(rot=0)
plt.title("Evolution of casual /registered bikers numbers over hours of the day")

# ---------------------------------------------------------
plt.show()

png

<Figure size 432x432 with 0 Axes>

png

plt.figure(figsize=(12, 6))
sns.pointplot(x=df["dow"], y=df["count"], hue=df["season"])
plt.xlabel("Day of the week")
plt.ylabel("Users Count") 
plt.title("Rentals Across week days")
plt.show()

png

# ---------------------------------------------------------
plt.figure(figsize=(6,3))
plt.stackplot(range(1,8),
              df.groupby(['dow'])['casual_percentage'].mean(), 
              df.groupby(['dow'])['registered_percentage'].mean(), 
              labels=['Casual','Registered'])
plt.legend(loc='upper left')
plt.margins(0,0)
plt.title("Evolution of casual /registered bikers' share over weekdays")

# ---------------------------------------------------------
plt.figure(figsize=(6,6))
df_hours = pd.DataFrame(
    {"casual" : df.groupby(['dow'])['casual'].mean().values,
    "registered" : df.groupby(['dow'])['registered'].mean().values},
    index = df.groupby(['dow'])['casual'].mean().index)
df_hours.plot.bar(rot=0)
plt.title("Evolution of casual /registered bikers numbers over weekdays")

# ---------------------------------------------------------
plt.show()

png

<Figure size 432x432 with 0 Axes>

png

fig, ax = plt.subplots()
fig.set_size_inches(10, 8)
sns.boxplot(data=df, y="count", x="month_str", orient="v")
ax.set(xlabel="Months" , ylabel="Count", title="Count Across Month");

png

sns.swarmplot(x='hour', y='temp', data=df, hue='season')
plt.show()

png

# ---------------------------------------------------------
plt.figure(figsize=(6,3))
plt.stackplot(range(1,13),
              df.groupby(['month'])['casual_percentage'].mean(), 
              df.groupby(['month'])['registered_percentage'].mean(), 
              labels=['Casual','Registered'])
plt.legend(loc='upper left')
plt.margins(0,0)
plt.title("Evolution of casual /registered bikers' share over months of the year")

# ---------------------------------------------------------
plt.figure(figsize=(6,6))
df_hours = pd.DataFrame(
    {"casual" : df.groupby(['month'])['casual'].mean().values,
    "registered" : df.groupby(['month'])['registered'].mean().values},
    index = df.groupby(['month'])['casual'].mean().index)
df_hours.plot.bar(rot=0)
plt.title("Evolution of casual /registered bikers numbers over months of the year")

# ---------------------------------------------------------
plt.show()

png

<Figure size 432x432 with 0 Axes>

png

plt.figure(figsize=(10, 5))

bars = ['casual not on working days', 'casual on working days',\
        'registered not on working days', 'registered on working days',\
        'casual not on holidays', 'casual on holidays',\
        'registered not on holidays', 'registered on holidays']

qty = [df.groupby(['workingday'])['casual'].mean()[0], df.groupby(['workingday'])['casual'].mean()[1],\
      df.groupby(['workingday'])['registered'].mean()[0], df.groupby(['workingday'])['registered'].mean()[1],\
      df.groupby(['holiday'])['casual'].mean()[0], df.groupby(['holiday'])['casual'].mean()[1],\
      df.groupby(['holiday'])['registered'].mean()[0], df.groupby(['holiday'])['registered'].mean()[1]]

y_pos = np.arange(len(bars))
plt.barh(y_pos, qty, align='center')

plt.yticks(y_pos, labels=bars)
#plt.invert_yaxis()  # labels read top-to-bottom
plt.xlabel('Mean nb of bikers')
plt.title("Number of bikers on holidays / working days")
plt.show()

png

# ---------------------------------------------------------
plt.figure(figsize=(6,3))
plt.stackplot(range(1,5),
              df.groupby(['season'])['casual_percentage'].mean(), 
              df.groupby(['season'])['registered_percentage'].mean(), 
              labels=['Casual','Registered'])
plt.legend(loc='upper left')
plt.margins(0,0)
plt.title("Evolution of casual /registered bikers' share over seasons")

# ---------------------------------------------------------
plt.figure(figsize=(6,6))
df_hours = pd.DataFrame(
    {"casual" : df.groupby(['season'])['casual'].mean().values,
    "registered" : df.groupby(['season'])['registered'].mean().values},
    index = df.groupby(['season'])['casual'].mean().index)
df_hours.plot.bar(rot=0)
plt.title("Evolution of casual /registered bikers numbers over seasons")

# ---------------------------------------------------------
plt.show()

png

<Figure size 432x432 with 0 Axes>

png

Generally speaking, the number of registered user increase between non working and working days where as the opposite can be seen when it comes to casual users (their number decrease between non working to working days). The spikes on the previous plots may corroborate this assumption because the registered users tend to use bike to go to work while casul users rent bike in the middle of the day…

Correlations

sns.set(style="white")

# Compute the correlation matrix
corr = df[["temp","atemp","casual","registered","humidity","windspeed","count"]].corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(7, 6))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
<matplotlib.axes._subplots.AxesSubplot at 0x7fa544b6bbe0>

png

plt.figure(figsize=(5, 5))
sns.scatterplot(df.registered, df['count'])
plt.show()

png

Data preparation for models

# target 
y = (df["count"])

# drop irrelevant cols and target
cols_dropped = ["count", "datetime", "atemp", "month_str", "season", "dow_str", "weather_str",\
                "casual", "registered", "casual_percentage", "registered_percentage", "y_log", "Delta"] 
X = df.drop(columns=cols_dropped)
            
X.shape, y.shape
((10886, 11), (10886,))
y.head()
0    16
1    40
2    32
3    13
4     1
Name: count, dtype: int64
X.head()
holiday workingday weather temp humidity windspeed dow month week hour year
0 0 0 1 9.84 81 0.0 5 1 52 0 2011
1 0 0 1 9.02 80 0.0 5 1 52 1 2011
2 0 0 1 9.02 80 0.0 5 1 52 2 2011
3 0 0 1 9.84 75 0.0 5 1 52 3 2011
4 0 0 1 9.84 75 0.0 5 1 52 4 2011

Split the data set into a training part and one for testing purpose.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Features importance

def get_rmse(reg, model_name):
    """Print the score for the model passed in argument and retrun scores for the train/test sets"""
    
    y_train_pred, y_pred = reg.predict(X_train), reg.predict(X_test)
    rmse_train, rmse_test = np.sqrt(mean_squared_error(y_train, y_train_pred)), np.sqrt(mean_squared_error(y_test, y_pred))
    print(model_name, f'\t - RMSE on Training  = {rmse_train:.2f} / RMSE on Test = {rmse_test:.2f}')
    
    return rmse_train, rmse_test
rf = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
_, _ = get_rmse(rf, 'rondom forrest')

features = pd.DataFrame()
features["features"] = X_train.columns
features["coefficient"] = rf.feature_importances_

features.sort_values(by=["coefficient"], ascending=False, inplace=True)
fig,ax= plt.subplots()
fig.set_size_inches(20,5)
sns.barplot(data=features, x="coefficient", y="features");
rondom forrest 	 - RMSE on Training  = 14.94 / RMSE on Test = 42.95

png

gb = GradientBoostingRegressor(n_estimators=100).fit(X_train, y_train)
_, _ = get_rmse(gb, 'gb')

features = pd.DataFrame()
features["features"] = X_train.columns
features["coefficient"] = gb.feature_importances_

features.sort_values(by=["coefficient"], ascending=False, inplace=True)
fig,ax= plt.subplots()
fig.set_size_inches(20,5)
sns.barplot(data=features, x="coefficient", y="features");
gb 	 - RMSE on Training  = 68.02 / RMSE on Test = 72.10

png

xgb_reg = xgb.XGBRegressor(n_estimators=100).fit(X_train, y_train)
_, _ = get_rmse(xgb_reg, 'xgb_reg')

features = pd.DataFrame()
features["features"] = X_train.columns
features["coefficient"] = xgb_reg.feature_importances_

features.sort_values(by=["coefficient"], ascending=False, inplace=True)
fig,ax= plt.subplots()
fig.set_size_inches(20,5)
sns.barplot(data=features, x="coefficient", y="features");
xgb_reg 	 - RMSE on Training  = 68.91 / RMSE on Test = 72.77

png

lgbm_reg = lgbm.LGBMRegressor(n_estimators=100).fit(X_train, y_train)
_, _ = get_rmse(lgbm_reg, 'lgbm_reg')

features = pd.DataFrame()
features["features"] = X_train.columns
features["coefficient"] = lgbm_reg.feature_importances_

features.sort_values(by=["coefficient"], ascending=False, inplace=True)
fig,ax= plt.subplots()
fig.set_size_inches(20,5)
sns.barplot(data=features, x="coefficient", y="features");
lgbm_reg 	 - RMSE on Training  = 31.57 / RMSE on Test = 40.24

png

The same feature come first, sometimes the order vary a little bit…


Models training and predictions

Metric - Root Mean Squared Error

Using logarithmic is an indirect way of measuring the performance of a loss function in terms of something more easily understandable

def rmsle(y, y_,convertExp=True):
    if convertExp:
        y = np.exp(y),
        y_ = np.exp(y_)
    log1 = np.nan_to_num(np.array([np.log(v + 1) for v in y]))
    log2 = np.nan_to_num(np.array([np.log(v + 1) for v in y_]))
    calc = (log1 - log2) ** 2
    return np.sqrt(np.mean(calc))

Please, also refer to the get_rmse function coded previously.

Base line & basic models

# list of all the basic models used at first
model_list = [
    LinearRegression(), Lasso(), Ridge(), ElasticNet(),
    RandomForestRegressor(), GradientBoostingRegressor(), ExtraTreesRegressor(),
    xgb.XGBRegressor(), lgbm.LGBMRegressor()
             ]

# creation of list of names and scores for the train / test
model_names = [str(m)[:str(m).index('(')] for m in model_list]
rmse_train, rmse_test = [], []

# fit and predict all models
for model, name in zip(model_list, model_names):
    model.fit(X_train, y_train)
    sc_train, sc_test = get_rmse(model, name)
    rmse_train.append(sc_train)
    rmse_test.append(sc_test)
LinearRegression 	 - RMSE on Training  = 140.53 / RMSE on Test = 146.52
Lasso 	 - RMSE on Training  = 140.57 / RMSE on Test = 146.59
Ridge 	 - RMSE on Training  = 140.53 / RMSE on Test = 146.52
ElasticNet 	 - RMSE on Training  = 143.30 / RMSE on Test = 149.20
RandomForestRegressor 	 - RMSE on Training  = 18.81 / RMSE on Test = 45.08
GradientBoostingRegressor 	 - RMSE on Training  = 68.02 / RMSE on Test = 72.10
ExtraTreesRegressor 	 - RMSE on Training  = 0.00 / RMSE on Test = 44.67
XGBRegressor 	 - RMSE on Training  = 68.91 / RMSE on Test = 72.77
LGBMRegressor 	 - RMSE on Training  = 31.57 / RMSE on Test = 40.24

Polynomial regression

from sklearn.preprocessing import PolynomialFeatures

poly_lin_reg = Pipeline([
    ("poly_feat", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("linear_reg", LinearRegression())
])

poly_lin_reg.fit(X_train, y_train)

sc_train, sc_test = get_rmse(poly_lin_reg, "Poly Linear Reg")

model_names.append('Poly Linear Reg')
rmse_train.append(sc_train)
rmse_test.append(sc_test)
Poly Linear Reg 	 - RMSE on Training  = 107.29 / RMSE on Test = 111.72

Hyperparameters tuning

Lasso

rd_cv = Ridge()
rd_params_ = {'max_iter':[1000, 2000, 3000],
                 'alpha':[0.1, 1, 2, 3, 4, 10, 30, 100, 200, 300, 400, 800, 900, 1000]}

rmsle_scorer = metrics.make_scorer(rmsle, greater_is_better=False)
rd_cv = GridSearchCV(rd_cv,
                  rd_params_,
                  scoring = rmsle_scorer,
                  cv=5)

rd_cv.fit(X_train, y_train)
GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_iter': [1000, 2000, 3000], 'alpha': [0.1, 1, 2, 3, 4, 10, 30, 100, 200, 300, 400, 800, 900, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=make_scorer(rmsle, greater_is_better=False), verbose=0)

Lasso

sc_train, sc_test = get_rmse(rd_cv, "Ridge CV")

model_names.append('Ridge CV')
rmse_train.append(sc_train)
rmse_test.append(sc_test)
Ridge CV 	 - RMSE on Training  = 140.53 / RMSE on Test = 146.52
la_cv = Lasso()

alpha  = 1/np.array([0.1, 1, 2, 3, 4, 10, 30, 100, 200, 300, 400, 800, 900, 1000])
la_params = {'max_iter':[1000, 2000, 3000],'alpha':alpha}

la_cv = GridSearchCV(la_cv, la_params, scoring = rmsle_scorer, cv=5)
la_cv.fit(X_train, y_train).best_params_
{'alpha': 10.0, 'max_iter': 1000}
sc_train, sc_test = get_rmse(la_cv, "Lasso CV")

model_names.append('Lasso CV')
rmse_train.append(sc_train)
rmse_test.append(sc_test)
Lasso CV 	 - RMSE on Training  = 142.24 / RMSE on Test = 148.05

Knn regressor

knn_reg = KNeighborsRegressor()
knn_params = {'n_neighbors':[1, 2, 3, 4, 5, 6]}

knn_reg = GridSearchCV(knn_reg, knn_params, scoring = rmsle_scorer, cv=5)
knn_reg.fit(X_train, y_train).best_params_
{'n_neighbors': 1}
sc_train, sc_test = get_rmse(knn_reg, "kNN Reg")

model_names.append('kNN Reg')
rmse_train.append(sc_train)
rmse_test.append(sc_test)
kNN Reg 	 - RMSE on Training  = 0.00 / RMSE on Test = 140.08

LinearSVR

svm_reg = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svr", LinearSVR())
])

svm_reg.fit(X_train, y_train)

sc_train, sc_test = get_rmse(svm_reg, "SVM Reg")

model_names.append('SVM Reg')
rmse_train.append(sc_train)
rmse_test.append(sc_test)
SVM Reg 	 - RMSE on Training  = 146.86 / RMSE on Test = 153.43

Just for fun: a MLP (Multi Layer Perceptron)

I don’t expect this model to be reall efficient because it’s probably far too complex for our needs… but that’s just for fun!

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train_, X_test_, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)
import tensorflow as tf
def model_five_layers(input_dim):

    model = tf.keras.models.Sequential()

    # Add the first Dense layers of 100 units with the input dimension
    model.add(tf.keras.layers.Dense(100, input_dim=input_dim, activation='sigmoid'))

    # Add four more layers of decreasing units
    model.add(tf.keras.layers.Dense(100, activation='sigmoid'))
    model.add(tf.keras.layers.Dense(100, activation='sigmoid'))
    model.add(tf.keras.layers.Dense(100, activation='sigmoid'))
    model.add(tf.keras.layers.Dense(100, activation='sigmoid'))

    # Add finally the output layer with one unit: the predicted result
    model.add(tf.keras.layers.Dense(1, activation='relu'))
    
    return model
model = model_five_layers(input_dim=X_train.shape[1])

# Compile the model with mean squared error (for regression)
model.compile(optimizer='SGD', loss='mean_squared_error')

# Now fit the model on XXX epoches with a batch size of XXX
# You can add the test/validation set into the fit: it will give insights on this dataset too
model.fit(X_train_, y_train, validation_data=(X_test_, y_test), epochs=200, batch_size=8)
WARNING:tensorflow:From /home/sunflowa/Anaconda/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/sunflowa/Anaconda/lib/python3.7/site-packages/tensorflow/python/keras/utils/losses_utils.py:170: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 8708 samples, validate on 2178 samples
WARNING:tensorflow:From /home/sunflowa/Anaconda/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/200
8708/8708 [==============================] - 1s 127us/sample - loss: 38212.8034 - val_loss: 37116.1185
Epoch 2/200
8708/8708 [==============================] - 1s 101us/sample - loss: 36155.7200 - val_loss: 30416.4880
Epoch 3/200
8708/8708 [==============================] - 1s 102us/sample - loss: 34061.7786 - val_loss: 33251.2080
Epoch 4/200
8708/8708 [==============================] - 1s 104us/sample - loss: 34448.0311 - val_loss: 33964.8975
Epoch 5/200
8708/8708 [==============================] - 1s 102us/sample - loss: 34450.1169 - val_loss: 35355.5311
7us/sample - loss: 35475.3280 - val_loss: 38654.4456

...

Epoch 195/200
8708/8708 [==============================] - 1s 95us/sample - loss: 35762.5541 - val_loss: 36520.5384
Epoch 196/200
8708/8708 [==============================] - 1s 97us/sample - loss: 35349.9739 - val_loss: 39536.0127
Epoch 197/200
8708/8708 [==============================] - 1s 95us/sample - loss: 35585.8599 - val_loss: 33362.7076
Epoch 198/200
8708/8708 [==============================] - 1s 95us/sample - loss: 35124.1187 - val_loss: 42101.3971
Epoch 199/200
8708/8708 [==============================] - 1s 96us/sample - loss: 35766.2078 - val_loss: 34425.0247
Epoch 200/200
8708/8708 [==============================] - 1s 93us/sample - loss: 35553.1240 - val_loss: 35824.4136





<tensorflow.python.keras.callbacks.History at 0x7fa53c534860>
y_train_pred, y_pred = model.predict(X_train_), model.predict(X_test_)
rmse_train_, rmse_test_ = np.sqrt(mean_squared_error(y_train, y_train_pred)), np.sqrt(mean_squared_error(y_test, y_pred))
print("MLP reg", f'\t - RMSE on Training  = {rmse_train_:.2f} / RMSE on Test = {rmse_test_:.2f}')

#sc_train, sc_test = get_rmse(model, "MLP reg")

model_names.append('MLP reg')
rmse_train.append(rmse_train_)
rmse_test.append(rmse_test_)
MLP reg 	 - RMSE on Training  = 188.90 / RMSE on Test = 189.27

Conclusion and submission

Before making a submission we have to choose the best model i.e with the smallest RMSE.

df_score = pd.DataFrame({'model_names' : model_names,
                         'rmse_train' : rmse_train,
                         'rmse_test' : rmse_test})
df_score = pd.melt(df_score, id_vars=['model_names'], value_vars=['rmse_train', 'rmse_test'])
df_score.head(10)
model_names variable value
0 LinearRegression rmse_train 140.527242
1 Lasso rmse_train 140.566345
2 Ridge rmse_train 140.527244
3 ElasticNet rmse_train 143.301234
4 RandomForestRegressor rmse_train 18.810298
5 GradientBoostingRegressor rmse_train 68.017857
6 ExtraTreesRegressor rmse_train 0.001515
7 XGBRegressor rmse_train 68.907661
8 LGBMRegressor rmse_train 31.570179
9 Poly Linear Reg rmse_train 107.292639
plt.figure(figsize=(12, 10))
sns.barplot(y="model_names", x="value", hue="variable", data=df_score)
<matplotlib.axes._subplots.AxesSubplot at 0x7fa544c9af98>

png

  • The MLP model is indeed useless
  • All linear regression have the same poor result, regularization using L1 or L2 with Lasso & Ridge doesn’t change anything which is quite obvious because those models are underfiting and need to be complexified (not regularized!). The same argument applies to ElasticNet which is a combinaison of the 2 above.
  • Gridsearch don’t change anything because tuning regularization hyperparameters is not interesting for the same reason.
  • The polynomial use help to improve results a little bit.
  • Gradient Boosting & XGBoost Regressors are performing well without too overfitting.
  • ExtraTree Regressor is out
  • Random Forrest & LGBM Regressors are the best but the first one is clearly overfitting, so it would be better to choose the second one : LGBM Regressor

Now let’s see how do we have to sumbit our answer

y_sample = pd.read_csv("../input/sampleSubmission.csv")
y_sample.head()
datetime count
0 2011-01-20 00:00:00 0
1 2011-01-20 01:00:00 0
2 2011-01-20 02:00:00 0
3 2011-01-20 03:00:00 0
4 2011-01-20 04:00:00 0
df_test = pd.read_csv("../input/test.csv")
df_test = change_datetime(df_test)

# keep this col for the submission
datetimecol = df_test["datetime"]

test_cols_dropped = ['datetime',
 'atemp',
 'month_str',
 'season',
 'dow_str',
 'weather_str']

df_test = df_test.drop(columns=test_cols_dropped)
df_test.head()
holiday workingday weather temp humidity windspeed dow month week hour year
0 0 1 1 10.66 56 26.0027 3 1 3 0 2011
1 0 1 1 10.66 56 0.0000 3 1 3 1 2011
2 0 1 1 10.66 56 0.0000 3 1 3 2 2011
3 0 1 1 10.66 56 11.0014 3 1 3 3 2011
4 0 1 1 10.66 56 11.0014 3 1 3 4 2011

We train our model on the whole data set (not only the splitted part). The more data we use, the better would be the results.

lgbm_reg = lgbm.LGBMRegressor()
lgbm_reg.fit(X, y)
y_pred_final = lgbm_reg.predict(df_test)
submission = pd.DataFrame({
        "datetime": datetimecol,
        "count": [max(0, x) for x in y_pred_final]
    })
submission.to_csv('bike_prediction_output.csv', index=False)

submission.head()
datetime count
0 2011-01-20 00:00:00 12.423966
1 2011-01-20 01:00:00 6.649667
2 2011-01-20 02:00:00 3.565470
3 2011-01-20 03:00:00 0.704959
4 2011-01-20 04:00:00 0.704959

Submit to kaggle, this model scores 0.41233. Not bad :)