Wind energy generation 2/2 - exploratory data analysis

5 minute read

banner made from an image of Narcisa Aciko on pexels.com

Introduction

This dataset contains hourly estimates of an area’s energy potential for 1986-2015 as a percentage of a power plant’s maximum output.

In the previous part, we’ve made clusters of countries with similar profiles of wind generation. In this 2nd part we’re going to analyse and explore datas for one country representative of each cluster. As a reminder, here are what those 6 clusters made of : Countries grouped by cluster
cluster nb : 0 EE FI LT LV PL SE
cluster nb : 1 ES PT
cluster nb : 2 AT CH CZ HR HU IT SI SK
cluster nb : 3 BE DE DK FR IE LU NL UK
cluster nb : 4 CY NO
cluster nb : 5 BG EL RO

import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.display.max_columns = 300

import warnings
warnings.filterwarnings("ignore")

path = "../../../datasets/_classified/kaggle/"


countries_lst = ['FI', 'PT', 'IT', 'FR', 'NO', 'RO']
df_wind_on = pd.read_csv(path + "EMHIRES_WIND_COUNTRY_June2019.csv")
df_wind_on = df_wind_on[countries_lst]
df_wind_on.head(2)

	FI	PT	IT	FR	NO	RO
0	0,31303	0,22683	0,33069	0,17573	0,26292	0,05124
1	0,33866	0,25821	0,30066	0,16771	0,26376	0,04665

Dealing with timestamps

def add_time(_df):
    "Returns a DF with two new cols : the time and hour of the day"
    t = pd.date_range(start='1/1/1986', periods=_df.shape[0], freq = 'H')
    t = pd.DataFrame(t)
    _df = pd.concat([_df, t], axis=1)
    _df.rename(columns={ _df.columns[-1]: "time" }, inplace = True)
    _df['hour'] = _df['time'].dt.hour
    _df['month'] = _df['time'].dt.month
    _df['week'] = _df['time'].dt.week
    return _df


for c in df_wind_on.columns:
    df_wind_on[c] = df_wind_on[c].str.replace(',', '.').astype('float64')

df_wind_on = add_time(df_wind_on)
df_wind_on.tail(2)

	FI	PT	IT	FR	NO	RO	time	hour	month	week
262966	0.47398	0.08945	0.00625	0.21519	0.54109	0.11247	2015-12-31 22:00:00	22	12	53
262967	0.47473	0.10206	0.00859	0.17319	0.54552	0.12690	2015-12-31 23:00:00	23	12	53

df_wind_on.dtypes

FI              float64
PT              float64
IT              float64
FR              float64
NO              float64
RO              float64
time     datetime64[ns]
hour              int64
month             int64
week              int64
dtype: object

Data Analysis

First, let’s take a look at the last 24 hours, the least we can say is that there isn’t any similarities or pattern in the graph below :

def plot_hourly(df, title):
    plt.figure(figsize=(14, 9))
    for c in df.columns:
        if c != 'hour':
            sns.lineplot(x="hour", y=c, data=df, label=c)
            #plt.legend(c)
    plt.title(title)
    plt.show()

plot_hourly(df_wind_on[df_wind_on.columns.difference(['time', 'month', 'week'])][-24:], "Efficiency of solar stations per country during the 24 hours")

png

Then let’s compare the efficiency of onshore wind stations for each profile during hours of a typical day (based on the data of the last 30 years)

plot_hourly(df_wind_on[df_wind_on.columns.difference(['time', 'month', 'week'])], "Efficiency of onshore wind stations per country during hours of a typical day")

png

When we look at the distribution of the station’s efficiencies, there isn’t many hours in a day the installation produces actually energy

temp_df = df_wind_on[df_wind_on.columns.difference(['time', 'hour', 'month', 'week'])]
plt.figure(figsize=(14, 9))
for col in temp_df.columns:
    sns.distplot(temp_df[temp_df[col] != 0][col], label=col, hist=False)
plt.title("Distribution of the station's efficiency for non null values (ie during the day)")

Text(0.5, 1.0, "Distribution of the station's efficiency for non null values (ie during the day)")

png

Anyway there are many short spikes each year

plt.figure(figsize=(14, 9))
sns.lineplot(x = df_wind_on.time, y = df_wind_on['FR'])

<matplotlib.axes._subplots.AxesSubplot at 0x7fd833bdd350>

png

A more interesting insight can be obtained when the efficiency is plotted across the months. It seems that at the opposite of the solar energy generation, the wind produces more in winter

plt.figure(figsize=(12, 6))
for c in countries_lst:
    temp_df = df_wind_on[[c, 'month']]
    sns.lineplot(x=temp_df["month"], y=temp_df[c], label=c)
plt.xlabel("Month of year")
plt.ylabel("Efficiency") 
plt.title("Efficiency across the months per country")

Text(0.5, 1.0, 'Efficiency across the months per country')

png

There are more variations at the week level, but the same conclusion can be made :

plt.figure(figsize=(12, 6))
for c in countries_lst:
    temp_df = df_wind_on[[c, 'week']]
    sns.lineplot(x=temp_df["week"], y=temp_df[c], label=c)
plt.xlabel("Week of year")
plt.ylabel("Efficiency") 
plt.title("Efficiency across the weeks per country")

Text(0.5, 1.0, 'Efficiency across the weeks per country')

png

temp_df = df_wind_on.copy()
temp_df['year'] = temp_df['time'].dt.year
temp_df.head()

	FI	PT	IT	FR	NO	RO	time	hour	month	week	year
0	0.31303	0.22683	0.33069	0.17573	0.26292	0.05124	1986-01-01 00:00:00	0	1	1	1986
1	0.33866	0.25821	0.30066	0.16771	0.26376	0.04665	1986-01-01 01:00:00	1	1	1	1986
2	0.36834	0.27921	0.27052	0.15877	0.26695	0.04543	1986-01-01 02:00:00	2	1	1	1986
3	0.39019	0.33106	0.24614	0.14818	0.27101	0.04455	1986-01-01 03:00:00	3	1	1	1986
4	0.40209	0.38668	0.21655	0.13631	0.28097	0.05438	1986-01-01 04:00:00	4	1	1	1986

When we compare the energy production yearly, the average performance is rather poor and can change a lot one year to an other :

plt.figure(figsize=(12, 6))
for c in countries_lst:
    temp_df_ = temp_df[[c, 'year']]
    sns.lineplot(x=temp_df_["year"], y=temp_df_[c], label=c)
plt.xlabel("Year")
plt.ylabel("Efficiency") 
plt.title("Efficiency across the years per country")

Text(0.5, 1.0, 'Efficiency across the years per country')

png

temp_df = temp_df.drop(columns=['time', 'hour', 'month', 'week', 'year'])
temp_df.describe()

	FI	PT	IT	FR	NO	RO
count	262968.000000	262968.000000	262968.000000	262968.000000	262968.000000	262968.000000
mean	0.195797	0.231411	0.168171	0.235092	0.234455	0.206581
std	0.158490	0.200070	0.174528	0.197681	0.138302	0.199103
min	0.000000	0.000070	0.000040	0.000070	0.000260	0.000000
25%	0.062760	0.080390	0.036400	0.082140	0.119130	0.058230
50%	0.153390	0.164075	0.101570	0.170620	0.214960	0.136680
75%	0.298873	0.323300	0.246170	0.337360	0.339190	0.291813
max	0.634540	0.959080	0.870380	0.845160	0.579430	0.992050

This is confirmed by the bar plot below :

def plot_by_country(_df, title, nb_col):
    _df = _df.describe().iloc[nb_col, :]
    plt.figure(figsize=(10, 6))
    sns.barplot(x=_df.index, y=_df.values)
    plt.title(title)

#plot_by_country("Mean efficiency by country", 1)

plot_by_country(temp_df, "75% efficiency by country", 6)

png

An other interesting way is to use violin plots :

# credits : https://stackoverflow.com/questions/49554139/boxplot-of-multiple-columns-of-a-pandas-dataframe-on-the-same-figure-seaborn
# This works because pd.melt converts a wide-form dataframe
plt.figure(figsize=(10, 6))
sns.violinplot(x="variable", y="value", data=pd.melt(temp_df))

<matplotlib.axes._subplots.AxesSubplot at 0x7fd840723890>

png

## Correlations

Further more there is no real correlations we could find :

def plot_corr(df_):
    corr = df_.corr()
    corr

    # Generate a mask for the upper triangle
    mask = np.zeros_like(corr, dtype=np.bool)
    mask[np.triu_indices_from(mask)] = True

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(10, 18))

    # Generate a custom diverging colormap
    #cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, center=0, square=True, cmap='Spectral', linewidths=.5, cbar_kws={"shrink": .5}) #annot=True

plot_corr(temp_df)

png

temp_df.corr()

	FI	PT	IT	FR	NO	RO
FI	1.000000	0.055113	0.087366	0.147905	0.530650	0.120007
PT	0.055113	1.000000	0.139590	0.222474	0.077719	0.050059
IT	0.087366	0.139590	1.000000	0.209693	0.134060	0.286878
FR	0.147905	0.222474	0.209693	1.000000	0.208549	0.085928
NO	0.530650	0.077719	0.134060	0.208549	1.000000	0.168180
RO	0.120007	0.050059	0.286878	0.085928	0.168180	1.000000

Heatmap month vs hours

Using the same heatmap used for the solar energy generation but for wind, we can guess that the efficiency differences are significant at the month level not at the hour…

# credits S Godinho @ https://www.kaggle.com/sgodinho/wind-energy-potential-prediction

df_wind_on['year'] = df_wind_on['time'].dt.year
plt.figure(figsize=(8, 6))
temp_df = df_wind_on[['FR', 'month', 'hour']]
temp_df = temp_df.groupby(['hour', 'month']).mean()
temp_df = temp_df.unstack('month').sort_index(ascending=False)
sns.heatmap(temp_df, vmin = 0.09, vmax = 0.29, cmap = 'plasma')

<matplotlib.axes._subplots.AxesSubplot at 0x7fd83ac16a90>

png

Conclusion

In this second part, we’ve explored the data set but we haven’t found real pattern in order to assess the impact of meteorological and climate variability on the generation of wind power. Only the variation during the months of the year is significant. Being able to predict the efficiency of wind energy generation is a really complex issue. It could have been interesting to study it at the region / geographic level but i lack the time to achieve it :)

Twitter Facebook LinkedIn

Olivier Brunet

Wind energy generation 2/2 - exploratory data analysis

Introduction

Data Analysis

Heatmap month vs hours

Conclusion

You May Also Enjoy

H&M Personalized Fashion 2/2 - Recommendation system

H&M Personalized Fashion 1/2 - EDA

Computation of Connected Component in Graphs with Spark

What is federated learning?