
Photo by israel palacio on Unsplash

Introduction

Context

  • Planète OUI offers green electricity supply with prices adapted to the consumption profiles of its clients.
  • Electricity prices are highly variable over time (depending on the market, the overall consumption and its clients’ needs).
  • The consumption profile of an installation has to be appraised to compute the best estimate of supply tariffs:
    • Most sites have a consumption that varies strongly with temperature because of electrical heating systems.
    • Industrial installations are an exception, because their consumption is mostly related to their specific uses.

The first objective is to analyze thermosensitivity, usage patterns and/or other factors (time series…) affecting consumption.

Based on:

  • the potential client historical consumption data (i.e its profile)
  • electricity prices
  • a given percentile used to cover supply costs for different scenarios,

the second objective of Planète OUI is to compute a distribution of supply costs in €/MWh.

Planète OUI needs

Extrapolation of one or several years of consumption data, rebuilt from a single year of measured data supplied by the client, in order to be combined with electricity prices and to get a larger data set for analysis.

Data

  • The client’s data is often incomplete and spread over a relatively short period.
  • 3 datasets:
    • x_train : input data of the training set
    • y_train : output data of the training set
    • x_test : input data of the testing set
  • Features:
    • “ID”: Data point ID;
    • “timestamp”: Complete timestamps with year, month, day and hour, in local time (CET and CEST);
    • “temp_1”, “temp_2”, “mean_national_temp”: Local and mean national temperatures (°C);
    • “humidity_1”, “humidity_2”: Local relative humidities (%);
    • “loc_1”, “loc_2”, “loc_secondary_1”, “loc_secondary_2”, “loc_secondary_3”: the coordinates of the studied and secondary sites, in decimal degrees and of the form (latitude, longitude);
    • “consumption_secondary_1”, “consumption_secondary_2”, “consumption_secondary_3”: the consumption data of three secondary sites, whose correlations with the studied sites may be of use (kWh). Indeed, the two studied sites and the three secondary sites are used for the same purposes.
  • The output data of the model to be developed takes the following form:
    • “ID”: Data point ID;
    • “consumption_1”, “consumption_2”: the consumption data of the two studied sites (kWh).

Relative humidities are provided with the temperature data because they are important variables for electricity consumption: humidity strongly influences thermal comfort. To replicate operational conditions, some temperature and humidity data points will be missing, so the imputation method must be carefully considered. The “consumption_secondary_i” variables are the consumption data of several sites of Planète OUI’s portfolio with a metering power higher than 250 kVA.

The correlation between the various sites’ consumptions should be studied to refine data completion or interpolation. Timestamps may be expressed as month or day of year, day of week and hour, to study the impact of annual, weekly and daily seasonalities. Particular attention should be paid to the processing of national holidays.

Challenge goal

Prediction of the consumption of two given sites during a year, based on measured data of other profiles.


First data insight

Basic infos

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.set_option('display.max_columns', 100)
from sklearn.preprocessing import MinMaxScaler
import folium
input_dir = os.path.join('..', 'input')
output_dir = os.path.join('..', 'output')
print(os.listdir(input_dir))
['input_training_ssnsrY0.csv', 'Screenshot from 2019-05-28 10-48-59.png', 'input_test_cdKcI0e.csv', 'vacances-scolaires.csv', 'jours_feries_seuls.csv', 'BCM_custom_metric.py', 'output_training_Uf11I9I.csv']
# each id is unique so we can use this column as index
df = pd.read_csv(os.path.join(input_dir, 'input_training_ssnsrY0.csv'), index_col='ID')
df.iloc[23:26]
timestamp temp_1 temp_2 mean_national_temp humidity_1 humidity_2 loc_1 loc_2 loc_secondary_1 loc_secondary_2 loc_secondary_3 consumption_secondary_1 consumption_secondary_2 consumption_secondary_3
ID
23 2016-11-01T23:00:00.0 10.8 NaN 11.1 87.0 NaN (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 143 67 162
24 2016-11-02T00:00:00.0 10.8 NaN 11.1 86.0 NaN (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 144 75 161
25 2016-11-02T01:00:00.0 10.5 NaN 11.0 81.0 NaN (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 141 65 161
df_out = pd.read_csv(os.path.join('..', 'input', 'output_training_Uf11I9I.csv'), index_col='ID')
df_out.head()
consumption_1 consumption_2
ID
0 100 93
1 101 94
2 100 96
3 101 95
4 100 100
df.shape, df_out.shape
((8760, 14), (8760, 2))
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8760 entries, 0 to 8759
Data columns (total 14 columns):
timestamp                  8760 non-null object
temp_1                     8589 non-null float64
temp_2                     8429 non-null float64
mean_national_temp         8760 non-null float64
humidity_1                 8589 non-null float64
humidity_2                 8428 non-null float64
loc_1                      8760 non-null object
loc_2                      8760 non-null object
loc_secondary_1            8760 non-null object
loc_secondary_2            8760 non-null object
loc_secondary_3            8760 non-null object
consumption_secondary_1    8760 non-null int64
consumption_secondary_2    8760 non-null int64
consumption_secondary_3    8760 non-null int64
dtypes: float64(5), int64(3), object(6)
memory usage: 1.0+ MB

Missing or irrelevant values

df.isnull().sum()
timestamp                    0
temp_1                     171
temp_2                     331
mean_national_temp           0
humidity_1                 171
humidity_2                 332
loc_1                        0
loc_2                        0
loc_secondary_1              0
loc_secondary_2              0
loc_secondary_3              0
consumption_secondary_1      0
consumption_secondary_2      0
consumption_secondary_3      0
dtype: int64
df.duplicated().sum()
0
#sns.pairplot(df_viz.select_dtypes(['int64', 'float64']))#.apply(pd.Series.nunique, axis=0)

Transforming time information

def transform_datetime_infos(data_frame):
    data_frame['datetime'] = pd.to_datetime(data_frame['timestamp'])
    data_frame['month'] = data_frame['datetime'].dt.month
    data_frame['week of year'] = data_frame['datetime'].dt.weekofyear
    data_frame['day of year'] = data_frame['datetime'].dt.dayofyear
    data_frame['day'] = data_frame['datetime'].dt.weekday_name
    data_frame['hour'] = data_frame['datetime'].dt.hour
    
    # for merging purposes
    data_frame['date'] = data_frame['datetime'].dt.strftime('%Y-%m-%d')
    
    return data_frame
df = transform_datetime_infos(df)
df.head(3)
timestamp temp_1 temp_2 mean_national_temp humidity_1 humidity_2 loc_1 loc_2 loc_secondary_1 loc_secondary_2 loc_secondary_3 consumption_secondary_1 consumption_secondary_2 consumption_secondary_3 datetime month week of year day of year day hour date
ID
0 2016-11-01T00:00:00.0 8.3 NaN 11.1 95.0 NaN (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 143 74 168 2016-11-01 00:00:00 11 44 306 Tuesday 0 2016-11-01
1 2016-11-01T01:00:00.0 8.0 NaN 11.1 98.0 NaN (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 141 60 162 2016-11-01 01:00:00 11 44 306 Tuesday 1 2016-11-01
2 2016-11-01T02:00:00.0 6.8 NaN 11.0 97.0 NaN (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 142 60 164 2016-11-01 02:00:00 11 44 306 Tuesday 2 2016-11-01
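A quick caveat on the accessors used above: dt.weekofyear and dt.weekday_name were fine at the time of writing, but have since been deprecated or removed in recent pandas versions. If you rerun this on a newer environment, an equivalent version of the function could look like this (a sketch, same features, only the accessors change):

def transform_datetime_infos_v2(data_frame):
    data_frame['datetime'] = pd.to_datetime(data_frame['timestamp'])
    data_frame['month'] = data_frame['datetime'].dt.month
    # isocalendar().week replaces the deprecated weekofyear accessor
    data_frame['week of year'] = data_frame['datetime'].dt.isocalendar().week.astype(int)
    data_frame['day of year'] = data_frame['datetime'].dt.dayofyear
    # day_name() replaces the deprecated weekday_name accessor
    data_frame['day'] = data_frame['datetime'].dt.day_name()
    data_frame['hour'] = data_frame['datetime'].dt.hour
    
    # for merging purposes
    data_frame['date'] = data_frame['datetime'].dt.strftime('%Y-%m-%d')
    
    return data_frame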

Considered sites

# Number of unique classes in each object column # Check the nb of sites
df.select_dtypes('object').apply(pd.Series.nunique, axis=0)
timestamp          8759
loc_1                 1
loc_2                 1
loc_secondary_1       1
loc_secondary_2       1
loc_secondary_3       1
day                   7
date                365
dtype: int64
for i in ['loc_1', 'loc_2', 'loc_secondary_1', 'loc_secondary_2', 'loc_secondary_3']:
    print('site : ', i, 'coordinates : ', df[i][0])
site :  loc_1 coordinates :  (50.633, 3.067)
site :  loc_2 coordinates :  (43.530, 5.447)
site :  loc_secondary_1 coordinates :  (44.838, -0.579)
site :  loc_secondary_2 coordinates :  (47.478, -0.563)
site :  loc_secondary_3 coordinates :  (48.867, 2.333)
# Create a map centered at the given latitude and longitude
france_map = folium.Map(location=[47,1], zoom_start=6)

# Add markers with labels
for i in ['loc_1', 'loc_2', 'loc_secondary_1', 'loc_secondary_2', 'loc_secondary_3']:
    temp_str = df[i][0].strip('(').strip(')').strip(' ')
    temp_str1, temp_str2 = temp_str.split(', ')
    folium.Marker([float(temp_str1), float(temp_str2)], popup=None, tooltip=i).add_to(france_map) 
    
# Display the map
display(france_map)

png

  • loc_1 is in the north near Lille.
  • loc_2 is in the south east near Marseille.
  • loc_secondary_1 is in the south west near Bordeaux.
  • loc_secondary_2 is in the west near Le Mans.
  • loc_secondary_3 is in the north near Paris.

Exploratory Data Analysis

Distribution of consumption

plt.figure(figsize=(16, 8))
sns.kdeplot(df_out['consumption_1'], shade=True) #, label = 'consumption_1')
sns.kdeplot(df_out['consumption_2'], shade=True) #, label = 'consumption_2')
sns.kdeplot(df['consumption_secondary_1'], shade=True) #, label = 'consumption_1')
sns.kdeplot(df['consumption_secondary_2'], shade=True) #, label = 'consumption_1')
sns.kdeplot(df['consumption_secondary_3'], shade=True) #, label = 'consumption_1')
plt.show()

png

Consumption variation over time

df.groupby(['hour'])['consumption_secondary_1'].mean().values
array([160.77534247, 160.8739726 , 160.5260274 , 160.83835616,
       161.46027397, 168.88767123, 191.30136986, 210.9369863 ,
       228.49863014, 248.32328767, 256.64657534, 256.21369863,
       240.23287671, 238.76438356, 244.09041096, 242.99178082,
       237.61643836, 227.66027397, 208.08767123, 195.21917808,
       182.35068493, 166.50958904, 161.92054795, 161.87671233])
df_viz = pd.concat((df, df_out), axis=1)
plt.figure(figsize=(16, 8))
plt.title("Consumption Evolution for each site over hours of the day")
for c in ['consumption_1', 'consumption_2', 'consumption_secondary_1', 'consumption_secondary_2', 'consumption_secondary_3']:
    sns.lineplot(x="hour", y=c, data=df_viz, label=c)
plt.show()

png

The graph above clearly shows that the electricity consumption is higher during working hours of the day.

plt.figure(figsize=(16, 8))
plt.title("Consumption Evolution for each site over weekday")
for c in ['consumption_1', 'consumption_2', 'consumption_secondary_1', 'consumption_secondary_2', 'consumption_secondary_3']:
    sns.lineplot(x="day", y=c, data=df_viz, label=c)
plt.show()

png

The consumption is lower during the weekend, so we can deduce that these sites are not residential buildings.

plt.figure(figsize=(16, 8))
plt.title("Consumption Evolution for each site over a year")
for c in ['consumption_1', 'consumption_2', 'consumption_secondary_1', 'consumption_secondary_2', 'consumption_secondary_3']:
    sns.lineplot(x="day of year", y=c, data=df_viz, label=c)
plt.show()

png

We can see that globally the mean consumption decreases until the summer, then increases until the end of the year.

plt.figure(figsize=(16, 8))
plt.title("Consumption Evolution for each site over a year during WORKING DAYS")
for c in ['consumption_1', 'consumption_2', 'consumption_secondary_1', 'consumption_secondary_2', 'consumption_secondary_3']:
    sns.lineplot(x="day of year", y=c, data=df_viz[~df_viz['day'].isin(['Saturday', 'Sunday'])], label=c)
plt.show()

png

plt.figure(figsize=(16, 8))
plt.title("Mean Consumption Evolution for each site over months")
for c in ['consumption_1', 'consumption_2', 'consumption_secondary_1', 'consumption_secondary_2', 'consumption_secondary_3']:
    sns.lineplot(data=df_viz.groupby(['month'])[c].mean(), label=c)
plt.show()

png

plt.figure(figsize=(16, 8))
plt.title("Mean Consumption Evolution for each site over weeks")
for c in ['consumption_1', 'consumption_2', 'consumption_secondary_1', 'consumption_secondary_2', 'consumption_secondary_3']:
    sns.lineplot(data=df_viz.groupby(['week of year'])[c].mean(), label=c)
plt.show()

png

temp_consumption_cols = ['consumption_1', 'consumption_2', 'temp_1', 'temp_2']
df_viz[temp_consumption_cols] = MinMaxScaler().fit_transform(df_viz[temp_consumption_cols])

plt.figure(figsize=(16, 8))
plt.title("Correlation between Mean Consumption & Mean Temperatures / for each site over months")
for c in temp_consumption_cols:
    sns.lineplot(data=df_viz.groupby(['month'])[c].mean(), label=c)
plt.show()
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/preprocessing/data.py:334: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by MinMaxScaler.
  return self.partial_fit(X, y)

png

Electricity consumption and temperatures seem to be negatively correlated.
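To put a number on this visual impression, one can compute the Pearson correlation between each site’s consumption and its local temperature (a quick check, not in the original notebook; correlation is unaffected by the MinMax scaling applied above):

for cons, temp in [('consumption_1', 'temp_1'), ('consumption_2', 'temp_2')]:
    # negative coefficients confirm the inverse relationship seen on the plot
    print(cons, 'vs', temp, ':', round(df_viz[cons].corr(df_viz[temp]), 2))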


Feature Engineering

Special dates - non working days “Jours fériés”

(credits: Antoine Augusti https://github.com/AntoineAugusti/jours-feries-france)

jf = pd.read_csv(os.path.join(input_dir, 'jours_feries_seuls.csv')).drop(columns=['nom_jour_ferie'])
# the date column is kept as a string for merging purposes
jf['est_jour_ferie'] = jf['est_jour_ferie'].astype('int')
jf.tail()
date est_jour_ferie
1102 2050-07-14 1
1103 2050-08-15 1
1104 2050-11-01 1
1105 2050-11-11 1
1106 2050-12-25 1
jf.est_jour_ferie.unique()
array([1])

Holidays

(credits: Antoine Augusti https://www.data.gouv.fr/fr/datasets/vacances-scolaires-par-zones/)

holidays = pd.read_csv(os.path.join(input_dir, 'vacances-scolaires.csv')).drop(columns=['nom_vacances'])
# the date column is kept as a string for merging purposes
for col in ['vacances_zone_a', 'vacances_zone_b', 'vacances_zone_c']:
    holidays[col] = holidays[col].astype('int')
holidays.tail()
date vacances_zone_a vacances_zone_b vacances_zone_c
11318 2020-12-27 0 0 0
11319 2020-12-28 0 0 0
11320 2020-12-29 0 0 0
11321 2020-12-30 0 0 0
11322 2020-12-31 0 0 0

Sunlight hours

work in progress, more to come in the next days…stay tuned :)

Merging all infos

def merge_infos(data_frame):
    data_frame = pd.merge(data_frame, holidays, on='date', how='left')
    data_frame = pd.merge(data_frame, jf, on='date', how='left')
    return data_frame
df = merge_infos(df)
df.vacances_zone_a.value_counts()/24
0    243.958333
1    121.041667
Name: vacances_zone_a, dtype: float64
df.est_jour_ferie.value_counts()/24
1.0    11.0
Name: est_jour_ferie, dtype: float64

Cleaning Data

def cleaning_data(data_frame):

    # The NaN values of the column "est_jour_ferie" correspond to working days,
    # because the public-holiday file merged above only contains non-working days
    data_frame['est_jour_ferie'] = data_frame['est_jour_ferie'].fillna(0)
    
    # For now, missing values in the temperature and humidity columns are replaced by the median;
    # later, another approach based on open data (e.g. monthly means) could be used
    for c in ['temp_1', 'temp_2', 'humidity_1', 'humidity_2']:
        data_frame[c] = data_frame[c].fillna(data_frame[c].median())
        
    return data_frame
df = cleaning_data(df)
df.isnull().sum()
timestamp                  0
temp_1                     0
temp_2                     0
mean_national_temp         0
humidity_1                 0
humidity_2                 0
loc_1                      0
loc_2                      0
loc_secondary_1            0
loc_secondary_2            0
loc_secondary_3            0
consumption_secondary_1    0
consumption_secondary_2    0
consumption_secondary_3    0
datetime                   0
month                      0
week of year               0
day of year                0
day                        0
hour                       0
date                       0
vacances_zone_a            0
vacances_zone_b            0
vacances_zone_c            0
est_jour_ferie             0
dtype: int64
df.head()
timestamp temp_1 temp_2 mean_national_temp humidity_1 humidity_2 loc_1 loc_2 loc_secondary_1 loc_secondary_2 loc_secondary_3 consumption_secondary_1 consumption_secondary_2 consumption_secondary_3 datetime month week of year day of year day hour date vacances_zone_a vacances_zone_b vacances_zone_c est_jour_ferie
0 2016-11-01T00:00:00.0 8.3 14.5 11.1 95.0 65.0 (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 143 74 168 2016-11-01 00:00:00 11 44 306 Tuesday 0 2016-11-01 1 1 1 1.0
1 2016-11-01T01:00:00.0 8.0 14.5 11.1 98.0 65.0 (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 141 60 162 2016-11-01 01:00:00 11 44 306 Tuesday 1 2016-11-01 1 1 1 1.0
2 2016-11-01T02:00:00.0 6.8 14.5 11.0 97.0 65.0 (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 142 60 164 2016-11-01 02:00:00 11 44 306 Tuesday 2 2016-11-01 1 1 1 1.0
3 2016-11-01T03:00:00.0 7.5 14.5 10.9 99.0 65.0 (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 139 60 162 2016-11-01 03:00:00 11 44 306 Tuesday 3 2016-11-01 1 1 1 1.0
4 2016-11-01T04:00:00.0 6.1 14.5 10.8 98.0 65.0 (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 154 60 164 2016-11-01 04:00:00 11 44 306 Tuesday 4 2016-11-01 1 1 1 1.0
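As hinted in the comments of cleaning_data, the median is a rough first pass: temperatures and humidities are smooth hourly signals, so a time-based interpolation is usually closer to reality. A possible variant (a sketch; not the version used for the submissions below):

def cleaning_data_interp(data_frame):
    # working days are simply the days absent from the public-holiday file
    data_frame['est_jour_ferie'] = data_frame['est_jour_ferie'].fillna(0)
    for c in ['temp_1', 'temp_2', 'humidity_1', 'humidity_2']:
        # interpolate along the datetime index, then fall back to the median
        # for any gap left at the very beginning or end of the series
        s = data_frame.set_index('datetime')[c]
        data_frame[c] = s.interpolate(method='time').fillna(s.median()).values
    return data_frame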

Correlations

# after dummies
# sns.pairplot(df_viz.select_dtypes(['int64', 'float64']))#.apply(pd.Series.nunique, axis=0)
df_viz = pd.concat((df, df_out), axis=1)
corr = df_viz.corr()

# makes all correlations positive for the heatmap
corr = np.sqrt(corr * corr)
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
plt.title('Correlations - beware all are made positive, in reality some are negative ones')
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7fce98470f28>

png

There are medium correlations between the weather variables. The consumptions of the different sites are also positively correlated with each other, which makes sense because all the sites are used for the same purposes. The targets seem to be only weakly correlated with the time features…


Comparison between X_train & X_test

X_test = pd.read_csv(os.path.join(input_dir, 'input_test_cdKcI0e.csv'), index_col='ID')
X_test.iloc[23:26]
timestamp temp_1 temp_2 mean_national_temp humidity_1 humidity_2 loc_1 loc_2 loc_secondary_1 loc_secondary_2 loc_secondary_3 consumption_secondary_1 consumption_secondary_2 consumption_secondary_3
ID
8783 2017-11-01T23:00:00.0 9.0 9.0 9.4 78.0 95.0 (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 172 131 170
8784 2017-11-02T00:00:00.0 8.8 8.6 9.4 79.0 95.0 (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 172 136 172
8785 2017-11-02T01:00:00.0 7.6 8.8 9.4 82.0 96.0 (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 174 126 171
df.shape, X_test.shape
((8760, 25), (8736, 14))

There are 24 more rows (hours), i.e. one day: the training set covers a full year, from 2016-11-01 to 2017-10-31 (365 × 24 = 8760 hours), whereas the test period runs from 2017-11-01 to 2018-10-30 only, one day short of a full year (364 × 24 = 8736 hours).

X_test.iloc[[0, -1]]
timestamp temp_1 temp_2 mean_national_temp humidity_1 humidity_2 loc_1 loc_2 loc_secondary_1 loc_secondary_2 loc_secondary_3 consumption_secondary_1 consumption_secondary_2 consumption_secondary_3
ID
8760 2017-11-01T00:00:00.0 6.5 7.1 8.8 91.0 82.0 (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 190 126 177
17495 2018-10-30T23:00:00.0 7.5 11.2 6.7 85.0 77.0 (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 228 114 178
df.iloc[[0, -1]]
timestamp temp_1 temp_2 mean_national_temp humidity_1 humidity_2 loc_1 loc_2 loc_secondary_1 loc_secondary_2 loc_secondary_3 consumption_secondary_1 consumption_secondary_2 consumption_secondary_3 datetime month week of year day of year day hour date vacances_zone_a vacances_zone_b vacances_zone_c est_jour_ferie
0 2016-11-01T00:00:00.0 8.3 14.5 11.1 95.0 65.0 (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 143 74 168 2016-11-01 00:00:00 11 44 306 Tuesday 0 2016-11-01 1 1 1 1.0
8759 2017-10-31T23:00:00.0 7.0 7.9 8.8 90.0 78.0 (50.633, 3.067) (43.530, 5.447) (44.838, -0.579) (47.478, -0.563) (48.867, 2.333) 198 128 189 2017-10-31 23:00:00 10 44 304 Tuesday 23 2017-10-31 1 1 1 0.0

There is a difference in the length of the two dataframes X_train & X_test
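A quick way to check this is to look at the time span actually covered by each dataframe (a small verification snippet, not in the original notebook):

for name, frame in [('train', df), ('test', X_test)]:
    ts = pd.to_datetime(frame['timestamp'])
    # number of distinct calendar days covered by each period
    print(name, ':', ts.min(), '->', ts.max(), '|', ts.dt.date.nunique(), 'days')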


Data preparation

feat_to_drop = [
'timestamp',
'loc_1',
'loc_2',
'loc_secondary_1',
'loc_secondary_2',
'loc_secondary_3',
'datetime',
'date']
feat_to_scale = [
'temp_1',
'temp_2',
'mean_national_temp',
'humidity_1',
'humidity_2',
'consumption_secondary_1',
'consumption_secondary_2',
'consumption_secondary_3']
feat_to_dummies = [
'day',
'month',
'week of year',
#'day of year',
'hour']

Side note: let’s try to run the models without dummification of “day of year”.
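Another option, not tested here, would be to encode the cyclic features (hour, day of year) as sine/cosine pairs instead of dummies, so that 23h and 0h, or December 31st and January 1st, end up close to each other in feature space. A sketch of what that could look like:

def add_cyclic_features(data_frame, col, period):
    # map a cyclic integer feature onto the unit circle
    data_frame[col + '_sin'] = np.sin(2 * np.pi * data_frame[col] / period)
    data_frame[col + '_cos'] = np.cos(2 * np.pi * data_frame[col] / period)
    return data_frame

# e.g. add_cyclic_features(df, 'hour', 24) and add_cyclic_features(df, 'day of year', 365)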

def prepare_feat_df(data_frame):
    data_frame = data_frame.drop(columns=feat_to_drop)
    data_frame[feat_to_scale] = MinMaxScaler().fit_transform(data_frame[feat_to_scale])
    data_frame = pd.get_dummies(data=data_frame, columns=feat_to_dummies, drop_first=True)
    return data_frame
df = prepare_feat_df(df)
df.head(2)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/preprocessing/data.py:334: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by MinMaxScaler.
  return self.partial_fit(X, y)
temp_1 temp_2 mean_national_temp humidity_1 humidity_2 consumption_secondary_1 consumption_secondary_2 consumption_secondary_3 day of year vacances_zone_a vacances_zone_b vacances_zone_c est_jour_ferie day_Monday day_Saturday day_Sunday day_Thursday day_Tuesday day_Wednesday month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 month_11 month_12 week of year_2 week of year_3 week of year_4 week of year_5 week of year_6 week of year_7 week of year_8 week of year_9 week of year_10 week of year_11 week of year_12 week of year_13 week of year_14 week of year_15 week of year_16 week of year_17 week of year_18 week of year_19 week of year_20 week of year_21 ... week of year_26 week of year_27 week of year_28 week of year_29 week of year_30 week of year_31 week of year_32 week of year_33 week of year_34 week of year_35 week of year_36 week of year_37 week of year_38 week of year_39 week of year_40 week of year_41 week of year_42 week of year_43 week of year_44 week of year_45 week of year_46 week of year_47 week of year_48 week of year_49 week of year_50 week of year_51 week of year_52 hour_1 hour_2 hour_3 hour_4 hour_5 hour_6 hour_7 hour_8 hour_9 hour_10 hour_11 hour_12 hour_13 hour_14 hour_15 hour_16 hour_17 hour_18 hour_19 hour_20 hour_21 hour_22 hour_23
0 0.356234 0.466667 0.428571 0.936709 0.609195 0.155263 0.208451 0.155462 306 1 1 1 1.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0.348601 0.466667 0.428571 0.974684 0.609195 0.150000 0.169014 0.142857 306 1 1 1 1.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

2 rows × 104 columns

Data Preparation

def prepare_data(input_file_name, output_file_name1, output_file_name2):
    data_frame = pd.read_csv(os.path.join(input_dir, input_file_name), index_col='ID')
    
    data_frame = transform_datetime_infos(data_frame)
    data_frame = merge_infos(data_frame)
    data_frame = cleaning_data(data_frame)
    data_frame = prepare_feat_df(data_frame)
    
    df_1 = data_frame.drop(columns=['temp_2', 'humidity_2'])
    df_2 = data_frame.drop(columns=['temp_1', 'humidity_1'])
    
    df_1.to_csv(os.path.join(output_dir, output_file_name1), index=False)
    df_2.to_csv(os.path.join(output_dir, output_file_name2), index=False)
    
    #return df_1, df_2
prepare_data('input_training_ssnsrY0.csv', 'X_train_1.csv', 'X_train_2.csv')
prepare_data('input_test_cdKcI0e.csv', 'X_test_1.csv', 'X_test_2.csv')
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/preprocessing/data.py:334: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by MinMaxScaler.
  return self.partial_fit(X, y)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/preprocessing/data.py:334: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by MinMaxScaler.
  return self.partial_fit(X, y)
pd.read_csv(os.path.join(output_dir, 'X_train_1.csv')).head()
temp_1 mean_national_temp humidity_1 consumption_secondary_1 consumption_secondary_2 consumption_secondary_3 day of year vacances_zone_a vacances_zone_b vacances_zone_c est_jour_ferie day_Monday day_Saturday day_Sunday day_Thursday day_Tuesday day_Wednesday month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 month_11 month_12 week of year_2 week of year_3 week of year_4 week of year_5 week of year_6 week of year_7 week of year_8 week of year_9 week of year_10 week of year_11 week of year_12 week of year_13 week of year_14 week of year_15 week of year_16 week of year_17 week of year_18 week of year_19 week of year_20 week of year_21 week of year_22 week of year_23 ... week of year_26 week of year_27 week of year_28 week of year_29 week of year_30 week of year_31 week of year_32 week of year_33 week of year_34 week of year_35 week of year_36 week of year_37 week of year_38 week of year_39 week of year_40 week of year_41 week of year_42 week of year_43 week of year_44 week of year_45 week of year_46 week of year_47 week of year_48 week of year_49 week of year_50 week of year_51 week of year_52 hour_1 hour_2 hour_3 hour_4 hour_5 hour_6 hour_7 hour_8 hour_9 hour_10 hour_11 hour_12 hour_13 hour_14 hour_15 hour_16 hour_17 hour_18 hour_19 hour_20 hour_21 hour_22 hour_23
0 0.356234 0.428571 0.936709 0.155263 0.208451 0.155462 306 1 1 1 1.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0.348601 0.428571 0.974684 0.150000 0.169014 0.142857 306 1 1 1 1.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0.318066 0.425249 0.962025 0.152632 0.169014 0.147059 306 1 1 1 1.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0.335878 0.421927 0.987342 0.144737 0.169014 0.142857 306 1 1 1 1.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0.300254 0.418605 0.974684 0.184211 0.169014 0.147059 306 1 1 1 1.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 102 columns

df_out['consumption_1'].to_csv(os.path.join('..', 'output', 'y_train_1.csv'), index=False)
df_out['consumption_2'].to_csv(os.path.join('..', 'output', 'y_train_2.csv'), index=False)

Conclusion

In this first part, we explored the data, so you can see:

  • how the weather (and particularly the temperature) can influence electricity consumption, along with the other correlations;
  • how cyclic the time features are, and how the consumption varies depending on the hour of the day, the day of the week and the week of the year.

Then we cleaned and prepared the datasets that will be fed to the machine learning models in the following part.

Predictions with various types of ML models

In this second part, I’ll use machine learning models to make predictions and submissions, to see my score for this challenge.

  • First, I’ll use linear regressors, then SVR with two different kernels.
  • Then I’ll see whether random forest and gradient boosting libraries are more effective.
  • Finally, I’ll try deep learning models such as RNNs, and especially GRUs.

Data preparation

Some parts of the data set have already been prepared in the first notebook / blog post and saved as separate CSV files. I’ll nevertheless reuse the same functions to prepare the data for the deep learning models (see below).

First, let’s import all the needed libraries, set the notebook parameters, create the file / folder paths and load the data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

pd.set_option('display.max_columns', 100)
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import Lasso, ElasticNet, ridge_regression, LinearRegression, Ridge
from sklearn.svm import SVR
input_dir = os.path.join('..', 'input')
output_dir = os.path.join('..', 'output')
# each id is unique so we can use this column as index
X_train_1 = pd.read_csv(os.path.join(output_dir, "X_train_1.csv"))
X_train_2 = pd.read_csv(os.path.join(output_dir, "X_train_2.csv"))
X_test_1 = pd.read_csv(os.path.join(output_dir, "X_test_1.csv"))
X_test_2 = pd.read_csv(os.path.join(output_dir, "X_test_2.csv"))
X_train_1.head()
temp_1 mean_national_temp humidity_1 consumption_secondary_1 consumption_secondary_2 consumption_secondary_3 day of year vacances_zone_a vacances_zone_b vacances_zone_c est_jour_ferie day_Monday day_Saturday day_Sunday day_Thursday day_Tuesday day_Wednesday month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 month_11 month_12 week of year_2 week of year_3 week of year_4 week of year_5 week of year_6 week of year_7 week of year_8 week of year_9 week of year_10 week of year_11 week of year_12 week of year_13 week of year_14 week of year_15 week of year_16 week of year_17 week of year_18 week of year_19 week of year_20 week of year_21 week of year_22 week of year_23 ... week of year_26 week of year_27 week of year_28 week of year_29 week of year_30 week of year_31 week of year_32 week of year_33 week of year_34 week of year_35 week of year_36 week of year_37 week of year_38 week of year_39 week of year_40 week of year_41 week of year_42 week of year_43 week of year_44 week of year_45 week of year_46 week of year_47 week of year_48 week of year_49 week of year_50 week of year_51 week of year_52 hour_1 hour_2 hour_3 hour_4 hour_5 hour_6 hour_7 hour_8 hour_9 hour_10 hour_11 hour_12 hour_13 hour_14 hour_15 hour_16 hour_17 hour_18 hour_19 hour_20 hour_21 hour_22 hour_23
0 0.356234 0.428571 0.936709 0.155263 0.208451 0.155462 306 1 1 1 1.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0.348601 0.428571 0.974684 0.150000 0.169014 0.142857 306 1 1 1 1.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0.318066 0.425249 0.962025 0.152632 0.169014 0.147059 306 1 1 1 1.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0.335878 0.421927 0.987342 0.144737 0.169014 0.142857 306 1 1 1 1.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0.300254 0.418605 0.974684 0.184211 0.169014 0.147059 306 1 1 1 1.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 102 columns

X_test_1.head()
temp_1 mean_national_temp humidity_1 consumption_secondary_1 consumption_secondary_2 consumption_secondary_3 day of year vacances_zone_a vacances_zone_b vacances_zone_c est_jour_ferie day_Monday day_Saturday day_Sunday day_Thursday day_Tuesday day_Wednesday month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 month_11 month_12 week of year_2 week of year_3 week of year_4 week of year_5 week of year_6 week of year_7 week of year_8 week of year_9 week of year_10 week of year_11 week of year_12 week of year_13 week of year_14 week of year_15 week of year_16 week of year_17 week of year_18 week of year_19 week of year_20 week of year_21 week of year_22 week of year_23 ... week of year_26 week of year_27 week of year_28 week of year_29 week of year_30 week of year_31 week of year_32 week of year_33 week of year_34 week of year_35 week of year_36 week of year_37 week of year_38 week of year_39 week of year_40 week of year_41 week of year_42 week of year_43 week of year_44 week of year_45 week of year_46 week of year_47 week of year_48 week of year_49 week of year_50 week of year_51 week of year_52 hour_1 hour_2 hour_3 hour_4 hour_5 hour_6 hour_7 hour_8 hour_9 hour_10 hour_11 hour_12 hour_13 hour_14 hour_15 hour_16 hour_17 hour_18 hour_19 hour_20 hour_21 hour_22 hour_23
0 0.315315 0.368750 0.884615 0.202381 0.380665 0.189583 305 1 1 1 1.0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0.322072 0.365625 0.858974 0.200000 0.350453 0.179167 305 1 1 1 1.0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0.322072 0.365625 0.846154 0.214286 0.353474 0.185417 305 1 1 1 1.0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0.299550 0.362500 0.871795 0.219048 0.347432 0.177083 305 1 1 1 1.0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0.281532 0.359375 0.871795 0.221429 0.347432 0.183333 305 1 1 1 1.0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 102 columns

Let’s verify the shape (number of rows and columns) of the various data sets:

X_train_1.shape, X_train_2.shape, X_test_1.shape, X_test_2.shape
((8760, 102), (8760, 102), (8736, 102), (8736, 102))

An overview of the training targets used to fit the models:

y_train_1 = pd.read_csv("../output/y_train_1.csv", header=None)
y_train_2 = pd.read_csv("../output/y_train_2.csv", header=None)
y_train_1.head()
0
0 100
1 101
2 100
3 101
4 100
y_train_1.shape, y_train_2.shape
((8760, 1), (8760, 1))

Metric / Benchmark

For this challenge, the metric used is a weighted mean absolute error.

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer another function. For example, when there are many outliers, you may consider using the Mean Absolute Error.

Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms, are possible:

  • Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: it is the notion of distance you are familiar with. It is also called the ℓ2 norm, noted ∥·∥₂ (or just ∥·∥).

  • Computing the sum of absolutes (MAE) corresponds to the ℓ1 norm, noted ∥·∥₁. It is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.

The RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
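In formulas, for $m$ samples with predictions $\hat{y}^{(i)}$ and true values $y^{(i)}$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}
\qquad
\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|\hat{y}^{(i)} - y^{(i)}\right|$$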

def weighted_mean_absolute_error(y_true, y_pred):
    """ Simplified version without loading csv for testing purposes on train sets"""
    c12 = np.array([1136987, 1364719])
    return 2 * mean_absolute_error(y_true*c12[0], y_pred*c12[1]) / np.sum(c12)

One could assume that a safety reserve of 20% is needed to cover the supply (see the commented-out 1.2 factor in the submission function below).

Functions to print/show predictions and make submissions

Please refer to the docstring of each function to know its purpose; the names should be self-explanatory:

def print_metric_on_train(fitted_model_1, fitted_model_2):
    """ prints and returns the metric on the train datasets"""
    
    y_train_pred_1, y_train_pred_2 = fitted_model_1.predict(X_train_1), fitted_model_2.predict(X_train_2)
    # arg = dataframe_1_y_true, dataframe_2_y_pred
    
    wmae_1 = weighted_mean_absolute_error(y_train_1, y_train_pred_1)
    wmae_2 = weighted_mean_absolute_error(y_train_2, y_train_pred_2)
    
    print(f'weighted_mean_absolute_error on X_train_1 : {wmae_1}')
    print(f'weighted_mean_absolute_error on X_train_2 : {wmae_2}')
    
    return wmae_1, wmae_2, y_train_pred_1, y_train_pred_2
def display_pred_on_train(y_train, y_train_pred):
    """ plots the prediction and the target of the ONE train data sets in order to see how the model has learnt"""
      
    plt.figure(figsize=(10, 5))
    plt.title("Real & Predicted Consumption Evolution for ONE site over a year")
    sns.lineplot(x=y_train.index, y=y_train[0], label='truth')
    #sns.lineplot(x=y_train.index, y=y_train_pred[:, 0], label='prediction')
    sns.lineplot(x=y_train.index, y=y_train_pred, label='prediction')
    plt.show()
def create_submission(fitted_model_1, fitted_model_2, model_name):
    """ make the prediction on the test set and craft the specific csv file to be submitted"""

    y_pred_1 = pd.DataFrame(fitted_model_1.predict(X_test_1).astype(int))
    y_pred_2 = pd.DataFrame(fitted_model_2.predict(X_test_2).astype(int))

    #y_pred_1, y_pred_2 = 1.2 * y_pred_1, 1.2 * y_pred_2 # no need of 20% more
    res = pd.concat((y_pred_1, y_pred_2), axis=1)
    res.columns = ['consumption_1', 'consumption_2']
    res = res.set_index(pd.Index(range(8760, 17496)))
    res.index.name = 'ID'
    
    name = 'y_pred_' + model_name + '_.csv' 
    res.to_csv(os.path.join(output_dir, name), sep=',', index=True)
    
    return y_pred_1, y_pred_2
def general_wrapper(fitted_model_1, fitted_model_2, model_name, y_train_1, y_train_2):
    """ wrapper of the 3 functions above, so you only have to call this function instead"""
    
    wmae_1, wmae_2, y_train_pred_1, y_train_pred_2 = print_metric_on_train(fitted_model_1, fitted_model_2)    
    display_pred_on_train(y_train_1, y_train_pred_1)
    display_pred_on_train(y_train_2, y_train_pred_2)
    create_submission(fitted_model_1, fitted_model_2, model_name)
    
    return wmae_1, wmae_2

Side note

Remember that there is a difference in length between the two dataframes X_train & X_test: the training set has 24 more rows (one day), since the test period stops one day short of a full year.


Linear regression and Support Vector Machine

Here is a kind of decision tree for choosing the right model:

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

  • number of samples > 100k:
    • SGD regressor
  • number of samples < 100k:
    • few features should be important: YES
      • Lasso
      • ElasticNet
    • few features should be important: NO
      • RidgeRegression
      • SVR(kernel=’linear’)
    • if not working:
      • SVR(kernel=’rbf’)
      • Ensemble regressors
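Before going model by model, here is the kind of quick comparison loop one could run on the first site to shortlist candidates (a sketch reusing the helpers defined above; the detailed per-model runs and submissions follow):

candidates = {
    'linear': LinearRegression(),
    'lasso': Lasso(alpha=0.01, max_iter=int(10e5)),
    'elasticnet': ElasticNet(alpha=0.01, max_iter=int(10e5)),
    'ridge': Ridge(alpha=0.1),
}
for name, model in candidates.items():
    # fit on the first site only and score with the challenge-like metric on the train set
    model.fit(X_train_1, y_train_1.values.ravel())
    wmae = weighted_mean_absolute_error(y_train_1.values.ravel(), model.predict(X_train_1))
    print(f'{name:12s} train WMAE: {wmae:.2f}')

The SVR and tree-based ensembles tried below could be appended to the same dictionary.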

Linear reg

linear_base_1, linear_base_2 = LinearRegression(), LinearRegression()
linear_base_1.fit(X_train_1, y_train_1)
linear_base_2.fit(X_train_2, y_train_2)
res_lin_reg_1, res_lin_reg_2 = general_wrapper(linear_base_1, linear_base_2, 'linear_reg', y_train_1, y_train_2)
weighted_mean_absolute_error on X_train_1 : 26.05531062095511
weighted_mean_absolute_error on X_train_2 : 28.69959036370805

png

png

Your submission score is : 19.24

We can see that the linear regression model doesn’t predict the spikes well… this is more obvious for the second site. And that’s why the MAE was chosen!

Lasso

lasso_base_1, lasso_base_2 = Lasso(alpha=0.01, max_iter=10e5), Lasso(alpha=0.01, max_iter=10e5)
lasso_base_1.fit(X_train_1, y_train_1)
lasso_base_2.fit(X_train_2, y_train_2)
res_lasso_1, res_lasso_2 = general_wrapper(lasso_base_1, lasso_base_2, 'lasso', y_train_1, y_train_2)
weighted_mean_absolute_error on X_train_1 : 26.061965525312893
weighted_mean_absolute_error on X_train_2 : 28.710534100722768

png

png

Your submission score is : 23.70

ElasticNet

elast_base_1, elast_base_2 = ElasticNet(alpha=0.01, max_iter=10e8), ElasticNet(alpha=0.1, max_iter=10e8)
elast_base_1.fit(X_train_1, y_train_1)
elast_base_2.fit(X_train_2, y_train_2)
res_elasticnet_1, res_elasticnet_2 = general_wrapper(elast_base_1, elast_base_2, 'elastic_net', y_train_1, y_train_2)
weighted_mean_absolute_error on X_train_1 : 26.16342615122434
weighted_mean_absolute_error on X_train_2 : 31.322618027652748

png

png

Your submission score is : 23.70

RidgeRegression

ridge_base_1, ridge_base_2 = Ridge(alpha=0.1, max_iter=10e10), Ridge(alpha=0.1, max_iter=10e10)
ridge_base_1.fit(X_train_1, y_train_1)
ridge_base_2.fit(X_train_2, y_train_2)
create_submission(ridge_base_1, ridge_base_2, 'ridge_reg')
res_ridge_1, res_ridge_2 = general_wrapper(ridge_base_1, ridge_base_2, 'ridge_reg', y_train_1, y_train_2)
weighted_mean_absolute_error on X_train_1 : 26.055361466167927
weighted_mean_absolute_error on X_train_2 : 28.695385473745706

png

png

Your submission score is : 23.70

SVR(kernel=’linear’)

svr_base_1 = SVR(kernel='linear')
svr_base_2 = SVR(kernel='linear')
svr_base_1.fit(X_train_1, y_train_1)
svr_base_2.fit(X_train_2, y_train_2)
res_svr_1, res_svr_2 = general_wrapper(svr_base_1, svr_base_2, 'svr_lin', y_train_1, y_train_2)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/utils/validation.py:761: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/utils/validation.py:761: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)


weighted_mean_absolute_error on X_train_1 : 24.78463504218071
weighted_mean_absolute_error on X_train_2 : 26.999353814493194

png

png

Your submission score is : 21.906794341911045

SVR(kernel=’rbf’)

#svr_base_1 = SVR(kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)
svr_base_rbf_1 = SVR(kernel='rbf')
svr_base_rbf_2 = SVR(kernel='rbf')
svr_base_rbf_1.fit(X_train_1, y_train_1)
svr_base_rbf_2.fit(X_train_2, y_train_2)
res_svr_base_rbf_1, res_svr_base_rbf_2 = general_wrapper(svr_base_rbf_1, svr_base_rbf_2, 'svr_base_rbf', y_train_1, y_train_2)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/utils/validation.py:761: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/utils/validation.py:761: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)


weighted_mean_absolute_error on X_train_1 : 39.22806743679658
weighted_mean_absolute_error on X_train_2 : 33.98994529863917

png

png

Your submission score is : 23.70


Using other models (random forest type)

from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor

Adaboost Regressor

adaboost_base_1, adaboost_base_2 = AdaBoostRegressor(), AdaBoostRegressor()
adaboost_base_1.fit(X_train_1, y_train_1)
adaboost_base_2.fit(X_train_2, y_train_2)
res_adaboost_1, res_adaboost_2 = general_wrapper(adaboost_base_1, adaboost_base_2, 'adaboost_reg', y_train_1, y_train_2)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/utils/validation.py:761: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/utils/validation.py:761: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)


weighted_mean_absolute_error on X_train_1 : 42.93679688417296
weighted_mean_absolute_error on X_train_2 : 46.10227435920769

png

png

Your submission score is : 30.29205877610584

GradientBoosting Regressor

gbreg_base_1, gbreg_base_2 = GradientBoostingRegressor(), GradientBoostingRegressor()
gbreg_base_1.fit(X_train_1, y_train_1)
gbreg_base_2.fit(X_train_2, y_train_2)
res_gbreg_1, res_gbreg_2 = general_wrapper(gbreg_base_1, gbreg_base_2, 'gradientboost_reg', y_train_1, y_train_2)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/utils/validation.py:761: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/utils/validation.py:761: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)


weighted_mean_absolute_error on X_train_1 : 25.6597272086074
weighted_mean_absolute_error on X_train_2 : 27.55942042756231

png

png

Your submission score is : ??????

RandomForestRegressor

rfreg_base_1, rfreg_base_2 = RandomForestRegressor(), RandomForestRegressor()
rfreg_base_1.fit(X_train_1, y_train_1)
rfreg_base_2.fit(X_train_2, y_train_2)
res_rfreg_1, res_rfreg_2 = general_wrapper(rfreg_base_1, rfreg_base_2, 'randomforrest_reg', y_train_1, y_train_2)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/ipykernel_launcher.py:2: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  
/home/sunflowa/Anaconda/lib/python3.7/site-packages/ipykernel_launcher.py:3: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  This is separate from the ipykernel package so we can avoid doing imports until


weighted_mean_absolute_error on X_train_1 : 25.072702321232793
weighted_mean_absolute_error on X_train_2 : 27.064664489582444

png

png

Your submission score is : ??????

ExtraTreesRegressor

etreg_base_1, etreg_base_2 = ExtraTreesRegressor(), ExtraTreesRegressor()
etreg_base_1.fit(X_train_1, y_train_1)
etreg_base_2.fit(X_train_2, y_train_2)
res_etreg_1, res_etreg_2 = general_wrapper(etreg_base_1, etreg_base_2, 'extratree_reg', y_train_1, y_train_2)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/ipykernel_launcher.py:2: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  
/home/sunflowa/Anaconda/lib/python3.7/site-packages/ipykernel_launcher.py:3: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  This is separate from the ipykernel package so we can avoid doing imports until


weighted_mean_absolute_error on X_train_1 : 25.042457117491335
weighted_mean_absolute_error on X_train_2 : 27.041557371295493

png

png

Your submission score is : ??????


import other libraries

import xgboost as xgb
import lightgbm as lgbm

XGBoost Regressor

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

https://xgboost.readthedocs.io/en/latest/

xgbreg_base_1, xgbreg_base_2 = xgb.XGBRegressor(), xgb.XGBRegressor()
xgbreg_base_1.fit(X_train_1, y_train_1)
xgbreg_base_2.fit(X_train_2, y_train_2)
res_xgbreg_1, res_xgbreg_2 = general_wrapper(xgbreg_base_1, xgbreg_base_2, 'xgb_reg', y_train_1, y_train_2)
weighted_mean_absolute_error on X_train_1 : 25.670042893699705
weighted_mean_absolute_error on X_train_2 : 27.57397787858485

png

png

Your submission score is : ??????

Light GBM Regressor

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:

- Faster training speed and higher efficiency.
- Lower memory usage.
- Better accuracy.
- Support of parallel and GPU learning.
- Capable of handling large-scale data.

For more details, please refer to the documentation: https://lightgbm.readthedocs.io/en/latest/

lgbmreg_base_1, lgbmreg_base_2 = lgbm.LGBMRegressor(), lgbm.LGBMRegressor()
lgbmreg_base_1.fit(X_train_1, y_train_1)
lgbmreg_base_2.fit(X_train_2, y_train_2)
res_lgbmreg_1, res_lgbmreg_2 = general_wrapper(lgbmreg_base_1, lgbmreg_base_2, 'lgbm_reg', y_train_1, y_train_2)
weighted_mean_absolute_error on X_train_1 : 25.169500507539915
weighted_mean_absolute_error on X_train_2 : 27.088282579499793

png

png

Your submission score is : ??????


Using Recurrent Neural Networks

let’s import all the needed libs from tensorflow / keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, GRU

A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.

The term “recurrent neural network” is used indiscriminately to refer to two broad classes of networks with a similar general structure, where one is finite impulse and the other is infinite impulse. Both classes of networks exhibit temporal dynamic behavior. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feedforward neural network, while an infinite impulse recurrent network is a directed cyclic graph that can not be unrolled.

Both finite impulse and infinite impulse recurrent networks can have additional stored state, and the storage can be under direct control by the neural network. The storage can also be replaced by another network or graph, if that incorporates time delays or has feedback loops. Such controlled states are referred to as gated state or gated memory, and are part of long short-term memory networks (LSTMs) and gated recurrent units. This is also called a Feedback Neural Network.

https://en.wikipedia.org/wiki/Recurrent_neural_network

Data preparation

The data preparation is the same as in the first part: see the intro & EDA above.

def transform_datetime_infos(data_frame):
    data_frame['datetime'] = pd.to_datetime(data_frame['timestamp'])
    data_frame['month'] = data_frame['datetime'].dt.month
    data_frame['week of year'] = data_frame['datetime'].dt.weekofyear
    data_frame['day of year'] = data_frame['datetime'].dt.dayofyear
    data_frame['day'] = data_frame['datetime'].dt.weekday_name
    data_frame['hour'] = data_frame['datetime'].dt.hour
    # for merging purposes
    data_frame['date'] = data_frame['datetime'].dt.strftime('%Y-%m-%d')
    return data_frame


def merge_infos(data_frame):
    data_frame = pd.merge(data_frame, holidays, on='date', how='left')
    data_frame = pd.merge(data_frame, jf, on='date', how='left')
    return data_frame


def cleaning_data(data_frame):
    # The NaN values of the column "est_jour_ferie" correspond to working days,
    # because the public-holiday file merged above only contains non-working days
    data_frame['est_jour_ferie'] = data_frame['est_jour_ferie'].fillna(0)
    # For now, missing values in the temperature and humidity columns are replaced by the median;
    # later, another approach based on open data (e.g. monthly means) could be used
    for c in ['temp_1', 'temp_2', 'humidity_1', 'humidity_2']:
        data_frame[c] = data_frame[c].fillna(data_frame[c].median())
    return data_frame


feat_to_drop = [
'timestamp',
'loc_1',
'loc_2',
'loc_secondary_1',
'loc_secondary_2',
'loc_secondary_3',
'datetime',
'date']


feat_to_scale = [
'temp_1',
'temp_2',
'mean_national_temp',
'humidity_1',
'humidity_2',
'consumption_secondary_1',
'consumption_secondary_2',
'consumption_secondary_3']


feat_to_dummies = [
'day',
'month',
'week of year',
#'day of year',
'hour']


def prepare_feat_df(data_frame):
    data_frame = data_frame.drop(columns=feat_to_drop)
    data_frame[feat_to_scale] = StandardScaler().fit_transform(data_frame[feat_to_scale])
    data_frame = pd.get_dummies(data=data_frame, columns=feat_to_dummies, drop_first=True)
    return data_frame


holidays = pd.read_csv(os.path.join(input_dir, 'vacances-scolaires.csv')).drop(columns=['nom_vacances'])
# the date column is kept as a string for merging purposes
for col in ['vacances_zone_a', 'vacances_zone_b', 'vacances_zone_c']:
    holidays[col] = holidays[col].astype('int')

    
jf = pd.read_csv(os.path.join(input_dir, 'jours_feries_seuls.csv')).drop(columns=['nom_jour_ferie'])
# the date column is kept as a string for merging purposes
jf['est_jour_ferie'] = jf['est_jour_ferie'].astype('int')
input_dir = os.path.join('..', 'input')
output_dir = os.path.join('..', 'output')

X_train = pd.read_csv(os.path.join(input_dir, 'input_training_ssnsrY0.csv'), index_col='ID')
X_test  = pd.read_csv(os.path.join(input_dir, 'input_test_cdKcI0e.csv'), index_col='ID')
y_train = pd.read_csv(os.path.join(input_dir, 'output_training_Uf11I9I.csv'), index_col='ID')

X_train, X_test = transform_datetime_infos(X_train), transform_datetime_infos(X_test)
X_train, X_test = merge_infos(X_train), merge_infos(X_test)
X_train, X_test = cleaning_data(X_train), cleaning_data(X_test)
X_train, X_test = prepare_feat_df(X_train), prepare_feat_df(X_test)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/preprocessing/data.py:645: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/base.py:464: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/preprocessing/data.py:645: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/base.py:464: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
X_train.shape, X_test.shape, y_train.shape
((8760, 104), (8736, 104), (8760, 2))

Simple RNN

# Build sliding windows of 48 hourly steps: the window made of hours t-48 .. t-1
# is used to predict the consumption at hour t (same layout as for the test set below)
X_train_rnn = np.stack(
    [np.array(X_train.iloc[(i - 48):i, :]) for i in range(48, X_train.shape[0])],
    axis=0)

# Drop the first 48 targets so that y_train[j] lines up with the window X_train_rnn[j]
y_train = y_train.iloc[48:]
X_train_rnn.shape, y_train.shape
((8712, 48, 104), (8712, 2))
shape = X_train_rnn.shape
shape
(8712, 48, 104)
# Flatten each window to 2D, fit the scaler on the training windows,
# then restore the (samples, timesteps, features) shape
X_train_rnn = X_train_rnn.reshape(shape[0], -1)
scaler = StandardScaler()
X_train_rnn = scaler.fit_transform(X_train_rnn)

X_train_rnn = X_train_rnn.reshape(shape[0], shape[1], shape[2])
X_train_rnn.shape
(8712, 48, 104)
# Prepend the last 48 training hours so that the first test windows have a full history
X_test_init = X_train[-48:]
X_test_init.shape
(48, 104)

X_test_temp = pd.concat([X_test_init, X_test])
X_test_temp.shape
(8784, 104)
# Same sliding-window construction for the test period (windows of 48 hours)
X_test_rnn = np.stack(
    [np.array(X_test_temp.iloc[(i - 48):i, :]) for i in range(48, X_test_temp.shape[0])],
    axis=0)

X_test_rnn.shape
(8736, 48, 104)
shape = X_test_rnn.shape
shape
(8736, 48, 104)
# Reuse the scaler fitted on the training windows (transform only, no refit)
X_test_rnn = X_test_rnn.reshape(shape[0], -1)
X_test_rnn = scaler.transform(X_test_rnn)
X_test_rnn = X_test_rnn.reshape(shape[0], shape[1], shape[2])
X_test_rnn.shape
(8736, 48, 104)
# Keras imports (assuming tf.keras; these may already be imported earlier in the notebook)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, GRU, Dense

def my_simple_RNN():
    # two stacked SimpleRNN layers, then a single linear unit (one model per site)
    model = Sequential()
    model.add(SimpleRNN(units=32, return_sequences=True))
    model.add(SimpleRNN(units=32, return_sequences=False))
    model.add(Dense(units=1, activation='linear'))
    return model


simple_rnn_1, simple_rnn_2 = my_simple_RNN(), my_simple_RNN()
simple_rnn_1.compile(optimizer='SGD', loss='mean_squared_error')
simple_rnn_2.compile(optimizer='SGD', loss='mean_squared_error')


X_train_rnn.shape, y_train.shape
((8712, 48, 104), (8712, 2))
simple_rnn_1.fit(x=X_train_rnn, y=np.array(y_train.iloc[:,0]), epochs=20, batch_size=32)
Epoch 1/20
8712/8712 [==============================] - 3s 391us/sample - loss: 4892.4747
Epoch 2/20
8712/8712 [==============================] - 2s 280us/sample - loss: 3132.7046
Epoch 3/20
8712/8712 [==============================] - 2s 281us/sample - loss: 3124.5193
Epoch 4/20
8712/8712 [==============================] - 3s 291us/sample - loss: 3087.9552
Epoch 5/20
8712/8712 [==============================] - 2s 281us/sample - loss: 2996.4712
Epoch 6/20
8712/8712 [==============================] - 3s 306us/sample - loss: 3059.7507
Epoch 7/20
8712/8712 [==============================] - 3s 300us/sample - loss: 3109.1124
Epoch 8/20
8712/8712 [==============================] - 2s 285us/sample - loss: 3041.8741
Epoch 9/20
8712/8712 [==============================] - 2s 284us/sample - loss: 2947.8813
Epoch 10/20
8712/8712 [==============================] - 2s 284us/sample - loss: 2876.6954
Epoch 11/20
8712/8712 [==============================] - 2s 286us/sample - loss: 2844.9381
Epoch 12/20
8712/8712 [==============================] - 2s 284us/sample - loss: 2847.9984
Epoch 13/20
8712/8712 [==============================] - 2s 284us/sample - loss: 2920.4317
Epoch 14/20
8712/8712 [==============================] - 2s 287us/sample - loss: 2921.8935
Epoch 15/20
8712/8712 [==============================] - 2s 287us/sample - loss: 2920.0885
Epoch 16/20
8712/8712 [==============================] - 2s 286us/sample - loss: 2901.7930
Epoch 17/20
8712/8712 [==============================] - 2s 285us/sample - loss: 2922.2052
Epoch 18/20
8712/8712 [==============================] - 2s 285us/sample - loss: 2872.7533
Epoch 19/20
8712/8712 [==============================] - 2s 286us/sample - loss: 2823.4115
Epoch 20/20
8712/8712 [==============================] - 2s 285us/sample - loss: 2872.8206





<tensorflow.python.keras.callbacks.History at 0x7fe376d67828>
simple_rnn_2.fit(x=X_train_rnn, y=np.array(y_train.iloc[:,1]), epochs=20, batch_size=32)
Epoch 1/20
8712/8712 [==============================] - 3s 308us/sample - loss: 4618.0560
Epoch 2/20
8712/8712 [==============================] - 2s 280us/sample - loss: 3643.0039
Epoch 3/20
8712/8712 [==============================] - 2s 280us/sample - loss: 4173.2772
Epoch 4/20
8712/8712 [==============================] - 2s 280us/sample - loss: 5285.1180
Epoch 5/20
8712/8712 [==============================] - 2s 279us/sample - loss: 4142.8450
Epoch 6/20
8712/8712 [==============================] - 3s 303us/sample - loss: 3779.4054
Epoch 7/20
8712/8712 [==============================] - 3s 290us/sample - loss: 3443.7070
Epoch 8/20
8712/8712 [==============================] - 3s 299us/sample - loss: 3273.2908
Epoch 9/20
8712/8712 [==============================] - 3s 287us/sample - loss: 4005.3873
Epoch 10/20
8712/8712 [==============================] - 3s 302us/sample - loss: 3682.9635
Epoch 11/20
8712/8712 [==============================] - 3s 299us/sample - loss: 3561.5160
Epoch 12/20
8712/8712 [==============================] - 3s 292us/sample - loss: 3362.9157
Epoch 13/20
8712/8712 [==============================] - 3s 288us/sample - loss: 4069.3645
Epoch 14/20
8712/8712 [==============================] - 2s 284us/sample - loss: 3971.3454
Epoch 15/20
8712/8712 [==============================] - 2s 287us/sample - loss: 3869.3355
Epoch 16/20
8712/8712 [==============================] - 2s 284us/sample - loss: 3743.0655
Epoch 17/20
8712/8712 [==============================] - 2s 285us/sample - loss: 3666.5237
Epoch 18/20
8712/8712 [==============================] - 2s 285us/sample - loss: 3797.3478
Epoch 19/20
8712/8712 [==============================] - 2s 284us/sample - loss: 3826.5725
Epoch 20/20
8712/8712 [==============================] - 2s 285us/sample - loss: 3646.0687





<tensorflow.python.keras.callbacks.History at 0x7fe35c443f60>
y_train_pred_1 = simple_rnn_1.predict(X_train_rnn)
y_train_pred_2 = simple_rnn_2.predict(X_train_rnn)
y_test_pred_1  = simple_rnn_1.predict(X_test_rnn)
y_test_pred_2  = simple_rnn_2.predict(X_test_rnn)


plt.figure(figsize=(16, 6))
plt.plot(y_train.iloc[:, 0], color='blue')
plt.plot(y_train_pred_1, color='green')
plt.show()

png (Simple RNN, site 1: actual training consumption in blue vs. predictions in green)

plt.figure(figsize=(16, 6))
plt.plot(y_train.iloc[:, 1], color='red')
plt.plot(y_train_pred_2, color='green')
plt.show()

png (Simple RNN, site 2: actual training consumption in red vs. predictions in green)

plt.figure(figsize=(16, 6))
plt.plot(y_test_pred_1, color='green')
plt.plot(y_test_pred_2, color='orange')
plt.show()

png (Simple RNN test predictions: site 1 in green, site 2 in orange)

GRU

Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al. The GRU is like a long short-term memory (LSTM) with a forget gate, but has fewer parameters than the LSTM, as it lacks an output gate. GRU performance on certain tasks of polyphonic music modeling and speech signal modeling was found to be similar to that of the LSTM, and GRUs have been shown to perform even better on certain smaller datasets.

However, as shown by Gail Weiss, Yoav Goldberg and Eran Yahav, the LSTM is “strictly stronger” than the GRU, as it can easily perform unbounded counting while the GRU cannot; this is why the GRU fails to learn simple languages that are learnable by the LSTM.

Similarly, as shown by Denny Britz, Anna Goldie, Minh-Thang Luong and Quoc Le of Google Brain, LSTM cells consistently outperformed GRU cells in “the first large-scale analysis of architecture variations for Neural Machine Translation”.

https://en.wikipedia.org/wiki/Gated_recurrent_unit
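
To make the “fewer parameters” point concrete, here is a small sketch (assuming tf.keras; the n_params helper is mine, not part of the notebook) comparing a GRU layer with an LSTM layer of the same width on this dataset’s input shape (48 time steps × 104 features).

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, LSTM

def n_params(layer):
    # wrap a single recurrent layer in a model and count its weights
    model = Sequential()
    model.add(layer)
    model.build(input_shape=(None, 48, 104))
    return model.count_params()

n_params(GRU(units=32)), n_params(LSTM(units=32))
# the GRU has 3 gates vs. 4 for the LSTM, hence roughly 3/4 of the weights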

def my_GRU():
    model = Sequential()
    model.add(GRU(units=32, return_sequences=True))
    model.add(GRU(units=32, return_sequences=False))
    model.add(Dense(units=1, activation='linear'))
    return model


gru_1, gru_2 = my_GRU(), my_GRU()
gru_1.compile(optimizer='SGD', loss='mean_squared_error')
gru_2.compile(optimizer='SGD', loss='mean_squared_error')


gru_1.fit(x=X_train_rnn, y=np.array(y_train.iloc[:,0]), epochs=20, batch_size=32)
Epoch 1/20
8712/8712 [==============================] - 8s 946us/sample - loss: 4140.4470
Epoch 2/20
8712/8712 [==============================] - 7s 829us/sample - loss: 2775.6613
Epoch 3/20
8712/8712 [==============================] - 7s 834us/sample - loss: 2582.9657
Epoch 4/20
8712/8712 [==============================] - 7s 834us/sample - loss: 2586.7794
Epoch 5/20
8712/8712 [==============================] - 7s 836us/sample - loss: 2607.1739
Epoch 6/20
8712/8712 [==============================] - 7s 835us/sample - loss: 2584.2904
Epoch 7/20
8712/8712 [==============================] - 7s 838us/sample - loss: 2497.3655
Epoch 8/20
8712/8712 [==============================] - 7s 838us/sample - loss: 2488.9214
Epoch 9/20
8712/8712 [==============================] - 7s 842us/sample - loss: 2455.1357
Epoch 10/20
8712/8712 [==============================] - 7s 840us/sample - loss: 2407.8460
Epoch 11/20
8712/8712 [==============================] - 7s 841us/sample - loss: 2365.4689
Epoch 12/20
8712/8712 [==============================] - 7s 845us/sample - loss: 2310.7510
Epoch 13/20
8712/8712 [==============================] - 7s 846us/sample - loss: 2316.9849
Epoch 14/20
8712/8712 [==============================] - 7s 844us/sample - loss: 2298.8346
Epoch 15/20
8712/8712 [==============================] - 7s 843us/sample - loss: 2296.2690
Epoch 16/20
8712/8712 [==============================] - 7s 842us/sample - loss: 2269.9935
Epoch 17/20
8712/8712 [==============================] - 7s 842us/sample - loss: 2289.1222
Epoch 18/20
8712/8712 [==============================] - 7s 841us/sample - loss: 2261.0284
Epoch 19/20
8712/8712 [==============================] - 7s 843us/sample - loss: 2282.3054
Epoch 20/20
8712/8712 [==============================] - 7s 845us/sample - loss: 2274.4464





<tensorflow.python.keras.callbacks.History at 0x7fe2fee01d68>
gru_2.fit(x=X_train_rnn, y=np.array(y_train.iloc[:,1]), epochs=20, batch_size=32)
Epoch 1/20
8712/8712 [==============================] - 8s 915us/sample - loss: 4867.0947
Epoch 2/20
8712/8712 [==============================] - 7s 832us/sample - loss: 3309.2133
Epoch 3/20
8712/8712 [==============================] - 7s 833us/sample - loss: 3212.8495
Epoch 4/20
8712/8712 [==============================] - 7s 833us/sample - loss: 3180.4730
Epoch 5/20
8712/8712 [==============================] - 7s 853us/sample - loss: 3156.5066
Epoch 6/20
8712/8712 [==============================] - 8s 891us/sample - loss: 3167.0353
Epoch 7/20
8712/8712 [==============================] - 8s 874us/sample - loss: 3118.7520
Epoch 8/20
8712/8712 [==============================] - 7s 852us/sample - loss: 3020.1702
Epoch 9/20
8712/8712 [==============================] - 8s 863us/sample - loss: 3062.5642
Epoch 10/20
8712/8712 [==============================] - 7s 857us/sample - loss: 3062.4002
Epoch 11/20
8712/8712 [==============================] - 7s 845us/sample - loss: 3042.2752
Epoch 12/20
8712/8712 [==============================] - 7s 851us/sample - loss: 2970.8858
Epoch 13/20
8712/8712 [==============================] - 7s 856us/sample - loss: 3070.2262
Epoch 14/20
8712/8712 [==============================] - 7s 855us/sample - loss: 3030.4804
Epoch 15/20
8712/8712 [==============================] - 7s 852us/sample - loss: 3084.9177
Epoch 16/20
8712/8712 [==============================] - 8s 862us/sample - loss: 3090.2180
Epoch 17/20
8712/8712 [==============================] - 7s 853us/sample - loss: 3071.2044
Epoch 18/20
8712/8712 [==============================] - 7s 852us/sample - loss: 3070.6416
Epoch 19/20
8712/8712 [==============================] - 7s 847us/sample - loss: 3054.4008
Epoch 20/20
8712/8712 [==============================] - 7s 855us/sample - loss: 3024.6538





<tensorflow.python.keras.callbacks.History at 0x7fe2fd90ceb8>
y_train_pred_1 = gru_1.predict(X_train_rnn)
y_train_pred_2 = gru_2.predict(X_train_rnn)
y_test_pred_1  = gru_1.predict(X_test_rnn)
y_test_pred_2  = gru_2.predict(X_test_rnn)


plt.figure(figsize=(16, 6))
plt.plot(y_train.iloc[:, 0], color='blue')
plt.plot(y_train_pred_1, color='green')
plt.show()

png (GRU, site 1: actual training consumption in blue vs. predictions in green)

plt.figure(figsize=(16, 6))
plt.plot(y_train.iloc[:, 1], color='red')
plt.plot(y_train_pred_2, color='green')
plt.show()

png (GRU, site 2: actual training consumption in red vs. predictions in green)

plt.figure(figsize=(16, 6))
plt.plot(y_test_pred_1, color='green')
plt.plot(y_test_pred_2, color='orange')
plt.show()

png (GRU test predictions: site 1 in green, site 2 in orange)


Results comparison and conclusions

df_score = pd.DataFrame({'Model / Site':['Linear Reg / Site1', 'Linear Reg / Site2',
                                  'Lasso Reg / Site1', 'Lasso Reg / Site2',
                                  'ElasticNet Reg / Site1', 'ElasticNet Reg / Site2',
                                  'SVM base linear / Site1', 'SVM base linear / Site2',
                                  'SVM base rbf / Site1', 'SVM base rbf / Site2'],
                        'weighted MSA':[res_lin_reg_1, res_lin_reg_2, 
                                        res_lasso_1, res_lasso_2, 
                                        res_elasticnet_1, res_elasticnet_2,  
                                        res_svr_1, res_svr_2, 
                                        res_svr_base_rbf_1, res_svr_base_rbf_2]})
ax = df_score.plot.barh(y='weighted MSA', x='Model / Site')

png (horizontal bar chart of the weighted scores per model and site)

  • Except for the SVM model with an RBF kernel, every model performs worse on the 2nd site, which can be explained by the fact that this site does not show regular oscillations.
  • Linear Reg, Lasso and ElasticNet give similar results, because regularization does not help to predict the spikes well.
  • The SVM with an RBF kernel and AdaBoost give worse results than linear regression, with or without regularization.

None of these models handles time series really well; SARIMA-type models and Recurrent Neural Networks (RNNs), such as the simple RNN and GRU tried above, are more appropriate for this kind of prediction.
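
For reference, here is a minimal SARIMA sketch with statsmodels (not used in the notebook above; the orders are illustrative, not tuned), fitted on site 1’s hourly consumption with a 24-hour seasonal period.

from statsmodels.tsa.statespace.sarimax import SARIMAX

# hourly consumption of site 1, with a daily (24-hour) seasonal period
series = y_train.iloc[:, 0]
sarima = SARIMAX(series, order=(1, 0, 1), seasonal_order=(1, 1, 1, 24))
sarima_results = sarima.fit(disp=False)
forecast_48h = sarima_results.forecast(steps=48)  # next 48 hours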

Ideas

  • fill NAs with interpolated values, or with the mean/median of the corresponding week depending on where the NAs fall (see the sketch after this list)
  • use open data to fill in missing temperatures
  • add sunlight hours
  • try with and without get_dummies for hours and weekdays
  • scaler: fit_transform on the training set only, then transform on the test set
  • more functions for automation
  • RNN
  • SARIMA or ARIMA
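
A minimal sketch of the first idea (time-based interpolation for short gaps, then a weekday/hour fallback), assuming the frame has a DatetimeIndex; the fill_missing helper is hypothetical.

def fill_missing(df, cols=('temp_1', 'temp_2', 'humidity_1', 'humidity_2')):
    for c in cols:
        # interpolate short gaps along the time axis (here at most 6 consecutive hours)
        df[c] = df[c].interpolate(method='time', limit=6)
        # fall back to the median of the same weekday / hour for longer gaps
        fallback = df.groupby([df.index.dayofweek, df.index.hour])[c].transform('median')
        df[c] = df[c].fillna(fallback)
    return df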

https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/

https://www.kaggle.com/niyamatalmass/machine-learning-for-time-series-analysis

https://mapr.com/blog/deep-learning-tensorflow/

https://towardsdatascience.com/how-not-to-use-machine-learning-for-time-series-forecasting-avoiding-the-pitfalls-19f9d7adf424

https://machinelearningmastery.com/how-to-grid-search-deep-learning-models-for-time-series-forecasting/

Really interesting posts:

https://machinelearningmastery.com/multi-step-time-series-forecasting-with-machine-learning-models-for-household-electricity-consumption/

https://medium.com/@alexrachnog/deep-learning-the-final-frontier-for-signal-processing-and-time-series-analysis-734307167ad6

https://towardsdatascience.com/time-series-machine-learning-regression-framework-9ea33929009a