Solar energy generation 2/3 - exploratory data analysis

January 16, 2020 8 minute read

Introduction

This dataset contains hourly estimates of an area’s energy potential for 1986-2015, expressed as a fraction (between 0 and 1) of a power plant’s maximum output.

In the previous part, we grouped countries with similar solar generation profiles into clusters. In this second part, we analyse and explore the data for one representative country from each cluster. As a reminder, here is what those 6 clusters are made of:

cluster nb : 0 CY IE NO SE
cluster nb : 1 AT CH CZ HR HU IT SI SK
cluster nb : 2 BE DE DK FR LU NL UK
cluster nb : 3 EE FI LT LV PL
cluster nb : 4 BG EL RO
cluster nb : 5 ES PT

Goals

Assess the impact of meteorological and climate variability on the generation of solar power in Europe.
Understand how the data is structured.
Determine how clean the dataset is. Older solar estimates used to contain impossible values around sunset (i.e. more energy than the sun releases) or negative sunlight.
Show what a typical year looks like. One common approach is to stitch together 12 months of raw data, using the 12 most typical months per this ISO standard.
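
To make the "typical year" idea concrete, here is a minimal sketch on synthetic data. It is a deliberately simplified stand-in and not the method of the standard mentioned above: for each calendar month it picks the year whose monthly mean is closest to the long-term mean of that month, then stitches those months together. The names `pick_typical_years` and `stitch_typical_year` and the demo series are hypothetical.

```python
import numpy as np
import pandas as pd

def pick_typical_years(series):
    """For each calendar month, return the year whose monthly mean is
    closest to the long-term mean of that month (simplified criterion)."""
    # Rows = years, columns = calendar months, values = monthly means
    table = series.groupby([series.index.year, series.index.month]).mean().unstack()
    # Absolute distance of each (year, month) mean to the long-term monthly mean
    deviation = (table - table.mean()).abs()
    return deviation.idxmin()  # index = month, value = selected year

def stitch_typical_year(series, typical):
    """Concatenate the raw hourly data of the selected (month, year) pairs."""
    parts = [series[(series.index.year == year) & (series.index.month == month)]
             for month, year in typical.items()]
    return pd.concat(parts)

# Synthetic hourly series (assumption): a seasonal cycle plus a per-year offset
idx = pd.date_range("2001-01-01", "2003-12-31 23:00", freq=pd.Timedelta(hours=1))
month = idx.month.to_numpy()
offset = pd.Series(idx.year.to_numpy()).map({2001: -0.05, 2002: 0.0, 2003: 0.05}).to_numpy()
seasonal = 0.2 + 0.1 * np.cos((month - 6.5) / 12 * 2 * np.pi)
demo = pd.Series(seasonal + offset, index=idx)

typical = pick_typical_years(demo)
typical_year = stitch_typical_year(demo, typical)
print(typical)
```

With this toy data, 2002 sits exactly at the long-term mean, so it is selected for every month. On real data the selected year would usually differ from month to month.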

First look

Let’s look at the first rows of the dataset:

import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.display.max_columns = 300

import warnings
warnings.filterwarnings("ignore")

df_solar_co = pd.read_csv("solar_generation_by_country.csv")
df_solar_co.head(2)

	AT	BE	BG	CH	CY	CZ	DE	DK	EE	ES	FI	FR	EL	HR	HU	IE	IT	LT	LU	LV	NL	NO	PL	PT	RO	SI	SK	SE	UK
0	0.0	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

We keep only one country from each cluster, and take a look at the end of the dataset:

df_solar_co = df_solar_co[['NO', 'AT', 'FR', 'FI', 'RO', 'ES']]
df_solar_co.tail(2)

	NO	AT	FR	FI	RO	ES
262966	0.0	0.0	0.0	0.0	0.0	0.0
262967	0.0	0.0	0.0	0.0	0.0	0.0

Data cleaning and preparation

Until now we’ve assumed that all the data is clean and “normal”, but is that really the case? We can easily verify that the values are indeed between 0 and 1:

print("Number of negative values :")
(df_solar_co[['NO', 'AT', 'FR', 'FI', 'RO', 'ES']] < 0).sum()

Number of negative values :

NO    0
AT    0
FR    0
FI    0
RO    0
ES    0
dtype: int64

print("Number of values greater than 1 :")
(df_solar_co[['NO', 'AT', 'FR', 'FI', 'RO', 'ES']] > 1).sum()

Number of values greater than 1 :

NO    0
AT    0
FR    0
FI    0
RO    0
ES    0
dtype: int64
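
The two checks above can be folded into a single report that also counts missing values. This is only a sketch: `range_report` is a hypothetical helper, and the tiny demo frame is made up to show what a dirty dataset would look like.

```python
import pandas as pd

def range_report(df, low=0.0, high=1.0):
    """Count missing and out-of-range values for each column."""
    return pd.DataFrame({
        "missing": df.isna().sum(),
        "below_low": (df < low).sum(),
        "above_high": (df > high).sum(),
    })

# Made-up values (assumption): one missing, one above 1, one negative
demo = pd.DataFrame({"NO": [0.0, 0.2, None], "ES": [0.1, 1.3, -0.1]})
report = range_report(demo)
print(report)
```

On the real data, `range_report(df_solar_co[['NO', 'AT', 'FR', 'FI', 'RO', 'ES']])` should show only zeros if the dataset is as clean as the checks above suggest.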

Now we have to add date and time information in order to use the data:

def add_time(_df):
    "Returns a copy of the DF with four new cols : time, hour, month and ISO week"
    _df = _df.copy()
    # One timestamp per row, hourly from 1986-01-01 00:00. Use the argument's own
    # length rather than the global df_solar_co. Newer pandas versions prefer freq='h'.
    _df['time'] = pd.date_range(start='1/1/1986', periods=_df.shape[0], freq='H')
    _df['hour'] = _df['time'].dt.hour
    _df['month'] = _df['time'].dt.month
    # Series.dt.week was removed in pandas 2.0; isocalendar() requires pandas >= 1.1
    _df['week'] = _df['time'].dt.isocalendar().week.astype(int)
    return _df

df_solar_co = add_time(df_solar_co)
df_solar_co.tail(2)

	NO	AT	FR	FI	RO	ES	time	hour	month	week
262966	0.0	0.0	0.0	0.0	0.0	0.0	2015-12-31 22:00:00	22	12	53
262967	0.0	0.0	0.0	0.0	0.0	0.0	2015-12-31 23:00:00	23	12	53
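
Since the timestamps are generated rather than read from the file, it is worth checking that the number of rows matches the number of hours in 1986-2015. The sketch below assumes the row count implied by the last index label shown above (262967, so 262968 rows) and that the series has no gaps.

```python
import pandas as pd

# Every hour from the first to the last timestamp covered by the dataset
expected = pd.date_range("1986-01-01 00:00", "2015-12-31 23:00",
                         freq=pd.Timedelta(hours=1))

# Row count inferred from the last index label shown above, plus one
n_rows = 262967 + 1

print(len(expected), len(expected) == n_rows)
```

If the two numbers match, generating the range from the start date and the row count assigns each row the hour it is meant to represent, provided no rows are missing or duplicated in the file.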

Data Analysis

Considering night and day

Obviously, no energy is generated during the night :) But first we’re going to look at how the solar efficiency values are distributed, over different time spans and overall. Let’s begin with the last day of the records:

def plot_hourly(df, title):
    plt.figure(figsize=(12, 6))
    for c in df.columns:
        if c != 'hour':
            sns.lineplot(x="hour", y=c, data=df, label=c)
            #plt.legend(c)
    plt.title(title)
    plt.show()
    
plot_hourly(df_solar_co[df_solar_co.columns.difference(['time', 'month', 'week'])][-24:], "Efficiency of solar stations per country during the last 24 hours")

[Figure: efficiency of solar stations per country during the last 24 hours]

Over a day, each curve is bell-shaped and resembles a Gaussian. The maximum efficiency reached during the day varies between countries. Furthermore, the curves are offset along the horizontal axis. This can be explained by the different longitudes: the sun does not rise at the same hour in every country. The same observations hold if we plot the mean of those values for each hour of the day:

plot_hourly(df_solar_co[df_solar_co.columns.difference(['time', 'month', 'week'])], "Mean solar efficiency per country during the day")

[Figure: mean solar efficiency per country during the day]
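
The horizontal offset between countries can be quantified by looking for the hour at which each mean daily profile peaks. The sketch below uses two made-up columns, `EAST` and `WEST`, rather than the real data; `peak_hour` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def peak_hour(df, countries):
    """Hour of the day at which the mean profile of each column is highest."""
    hourly_mean = df.groupby("hour")[countries].mean()
    return hourly_mean.idxmax()

# Three identical synthetic days (assumption): bell-shaped curves peaking at 10h and 13h
hour = np.tile(np.arange(24), 3)
demo = pd.DataFrame({
    "hour": hour,
    "EAST": np.exp(-((hour - 10) ** 2) / 8),
    "WEST": np.exp(-((hour - 13) ** 2) / 8),
})

peaks = peak_hour(demo, ["EAST", "WEST"])
print(peaks)
```

Applied to the real frame, `peak_hour(df_solar_co, countries)` would give one peak hour per country, which makes the east-west shift easier to compare than reading it off the plot.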

Now let’s look at the statistical distribution of the stations’ efficiencies for nonzero values (i.e. during the day). We can see that there are still many values (see the spike) close to zero:

temp_df = df_solar_co[df_solar_co.columns.difference(['time', 'hour', 'month', 'week'])]
plt.figure(figsize=(12, 6))
for col in temp_df.columns:
    # kdeplot replaces distplot(hist=False), which is deprecated in recent seaborn releases
    sns.kdeplot(temp_df.loc[temp_df[col] != 0, col], label=col)
plt.legend()
plt.title("Distribution of the station's efficiency for non null values (ie during the day)")

Text(0.5, 1.0, "Distribution of the station's efficiency for non null values (ie during the day)")

[Figure: distribution of the station's efficiency for nonzero values]

What about the evolution over the recorded years? There is a spike every year:

plt.figure(figsize=(12, 6))
sns.lineplot(x = df_solar_co.time, y = df_solar_co['FR'])

<matplotlib.axes._subplots.AxesSubplot at 0x7fea5e095ef0>

[Figure: hourly solar efficiency for FR over the recorded years]

To understand this phenomenon more precisely, let’s plot the mean values per month. As we can see, solar efficiency is higher during the summer, as expected:

countries = ['NO', 'AT', 'FR', 'FI', 'RO', 'ES']

plt.figure(figsize=(12, 6))
for c in countries:
    temp_df = df_solar_co[[c, 'month']]
    sns.lineplot(x=temp_df["month"], y=temp_df[c], label=c)
    
plt.xlabel("Month of year")
plt.ylabel("Efficiency") 
plt.title("Efficiency across the months per country")

Text(0.5, 1.0, 'Efficiency across the months per country')

[Figure: efficiency across the months per country]

Weekly means show the same seasonal pattern, with no real difference from the monthly view:

plt.figure(figsize=(12, 6))
for c in countries:
    temp_df = df_solar_co[[c, 'week']]
    sns.lineplot(x=temp_df["week"], y=temp_df[c], label=c)
    
plt.xlabel("Week of year")
plt.ylabel("Efficiency") 
plt.title("Efficiency across the weeks per country")

Text(0.5, 1.0, 'Efficiency across the weeks per country')

[Figure: efficiency across the weeks per country]

We create a temporary dataframe with the year, in order to see how the mean solar efficiency varies across the 30 years. As you can see, solar generation stays much the same year after year:

temp_df = df_solar_co.copy()
temp_df['year'] = temp_df['time'].dt.year


plt.figure(figsize=(12, 6))
for c in countries:
    temp_df_ = temp_df[[c, 'year']]
    sns.lineplot(x=temp_df_["year"], y=temp_df_[c], label=c)
    
plt.xlabel("Year")
plt.ylabel("Efficiency") 
plt.title("Efficiency across the years per country")

Text(0.5, 1.0, 'Efficiency across the years per country')

[Figure: efficiency across the years per country]
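
One way to put a number on "much the same year after year" is the coefficient of variation of the yearly means (standard deviation divided by mean). This is a sketch on made-up values, and `yearly_variability` is a hypothetical helper.

```python
import pandas as pd

def yearly_variability(df, countries):
    """Coefficient of variation of the yearly means, per column."""
    yearly = df.groupby("year")[countries].mean()
    return yearly.std() / yearly.mean()

# Made-up values (assumption): A is identical every year, B doubles then triples
demo = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "A": [0.2, 0.2, 0.2, 0.2],
    "B": [0.1, 0.1, 0.3, 0.3],
})

cv = yearly_variability(demo, ["A", "B"])
print(cv)
```

A value close to zero means the yearly means barely move. On the real data this could be called as `yearly_variability(temp_df, countries)`, since `temp_df` holds the `year` column at this point.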

Considering ONLY values between 5 AM & 10 PM

We’re going to take another look at the distribution of the values, this time considering only the daylight hours. Note that the filter below uses strict inequalities, so it keeps hours 6 to 21 inclusive. Let’s begin with a summary of the statistics:

temp_df = df_solar_co[(5 < df_solar_co.hour) & (df_solar_co.hour < 22)]
temp_df = temp_df.drop(columns=['time', 'hour', 'month', 'week'])
temp_df.describe()

	NO	AT	FR	FI	RO	ES
count	175312.000000	175312.000000	175312.000000	175312.000000	175312.000000	175312.000000
mean	0.075220	0.187375	0.191680	0.099533	0.204717	0.257699
std	0.103796	0.191641	0.187002	0.142873	0.211509	0.228028
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.025330	0.131469	0.147011	0.020212	0.133705	0.233841
75%	0.111076	0.335664	0.339958	0.157080	0.386842	0.458925
max	0.487921	0.715303	0.701985	0.615942	0.722990	0.793842

Then we can plot the 3rd quartile (the value that separates the lowest 75% of the data from the highest 25%) for each country:

def plot_by_country(_df, title, nb_col):
    # nb_col is the position of a row in describe() :
    # 0=count, 1=mean, 2=std, 3=min, 4=25%, 5=50%, 6=75%, 7=max
    _df = _df.describe().iloc[nb_col, :]
    plt.figure(figsize=(10, 6))
    sns.barplot(x=_df.index, y=_df.values)
    plt.title(title)

#plot_by_country(temp_df, "Mean efficiency by country", 1)
plot_by_country(temp_df, "75% efficiency by country", 6)

[Figure: 75% efficiency by country]
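
The positional lookup in `describe()` can also be written directly with `DataFrame.quantile`, which avoids having to remember which row holds which statistic. A small sketch with made-up values:

```python
import pandas as pd

# Made-up values (assumption), five observations per country
demo = pd.DataFrame({
    "NO": [0.0, 0.1, 0.2, 0.3, 0.4],
    "ES": [0.0, 0.2, 0.4, 0.6, 0.8],
})

q75 = demo.quantile(0.75)                   # 3rd quartile of each column
from_describe = demo.describe().loc["75%"]  # same row, selected by label

print(q75)
```

Both expressions return the same values. Selecting the row by label (`"75%"`) instead of position (`6`) also keeps the intent readable.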

Another way to look at the distribution is a violin plot, which is similar to a box plot with the addition of a rotated kernel density plot on each side:

# credits : https://stackoverflow.com/questions/49554139/boxplot-of-multiple-columns-of-a-pandas-dataframe-on-the-same-figure-seaborn
# This works because pd.melt converts a wide-form dataframe to long form (columns "variable" and "value")
plt.figure(figsize=(10, 6))
sns.violinplot(x="variable", y="value", data=pd.melt(temp_df))

<matplotlib.axes._subplots.AxesSubplot at 0x7fea5df1e128>

[Figure: violin plot of efficiency per country]
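
To see what `pd.melt` does to the data before it reaches seaborn, here is a tiny illustration with a made-up two-country frame:

```python
import pandas as pd

# Wide form: one column per country (made-up values)
wide = pd.DataFrame({"NO": [0.1, 0.2], "ES": [0.5, 0.6]})

# Long form: one row per (country, value) pair
long = pd.melt(wide)
print(long)
```

Each original column name ends up in the `variable` column and each measurement in the `value` column, which is why the plotting calls use `x="variable"` and `y="value"`.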

Just for fun, we can also use the box plot mentioned previously (in descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles):

plt.figure(figsize=(10, 6))
sns.boxplot(x="variable", y="value", data=pd.melt(temp_df))

<matplotlib.axes._subplots.AxesSubplot at 0x7fea5de6b3c8>

[Figure: box plot of efficiency per country]

And finally the distribution. We can see that Norway and Finland have many more values close to zero:

plt.figure(figsize=(10, 6))
for col in temp_df.columns:
    # kdeplot replaces distplot(hist=False), which is deprecated in recent seaborn releases
    sns.kdeplot(temp_df.loc[temp_df[col] != 0, col], label=col)
plt.legend()
plt.title("Distribution of the station's efficiency")

Text(0.5, 1.0, "Distribution of the station's efficiency")

[Figure: distribution of the station's efficiency]

Correlations

Is there any dependence between the solar generation of different countries?

In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables. In the broadest sense correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related.

Correlations are useful because they can indicate a predictive relationship that can be exploited in practice.

def plot_corr(df_):
    corr = df_.corr()

    # Generate a mask for the upper triangle (diagonal included).
    # np.bool was removed in NumPy 1.24, so use the builtin bool instead.
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(9, 7))

    # Generate a custom diverging colormap
    #cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, center=0, square=True, cmap='Spectral', linewidths=.5, cbar_kws={"shrink": .5}) #annot=True

plot_corr(temp_df)

[Figure: correlation heatmap between countries]

Almost all coefficients are higher than 0.6 (the lowest, between FI and ES, is about 0.55), so the countries can be considered highly positively correlated. Once again, this is not surprising: the countries are close to each other, so the sun tends to rise and set at about the same time and in the same way for all of them. Another way to see those correlations is to show the following matrix:

temp_df.corr()

	NO	AT	FR	FI	RO	ES
NO	1.000000	0.668562	0.724858	0.723009	0.648708	0.641740
AT	0.668562	1.000000	0.818610	0.684129	0.819165	0.741216
FR	0.724858	0.818610	1.000000	0.646947	0.718909	0.888815
FI	0.723009	0.684129	0.646947	1.000000	0.718531	0.547065
RO	0.648708	0.819165	0.718909	0.718531	1.000000	0.653520
ES	0.641740	0.741216	0.888815	0.547065	0.653520	1.000000
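
Because much of this correlation comes from the shared day/night and seasonal cycle, one possible follow-up is to remove the mean cycle before correlating. The sketch below is an illustration on synthetic data, not something done in this study: it subtracts the mean for each (month, hour) pair and correlates the remaining anomalies. The columns `A` and `B` and the helper `anomaly_corr` are hypothetical.

```python
import numpy as np
import pandas as pd

def anomaly_corr(df, countries):
    """Correlation after removing the mean (month, hour) cycle of each column."""
    cycle = df.groupby(["month", "hour"])[countries].transform("mean")
    return (df[countries] - cycle).corr()

# Synthetic data (assumption): a shared sun cycle plus independent noise
idx = pd.date_range("2001-01-01", periods=24 * 730, freq=pd.Timedelta(hours=1))
hour = idx.hour.to_numpy()
month = idx.month.to_numpy()
daily = np.clip(np.sin((hour - 6) / 12 * np.pi), 0, None)
seasonal = 0.3 + 0.2 * np.cos((month - 6.5) / 12 * 2 * np.pi)
signal = daily * seasonal

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "A": signal + rng.normal(0, 0.02, len(idx)),
    "B": signal + rng.normal(0, 0.02, len(idx)),
    "month": month,
    "hour": hour,
})

raw = demo[["A", "B"]].corr().loc["A", "B"]
anomalies = anomaly_corr(demo, ["A", "B"]).loc["A", "B"]
print(round(raw, 3), round(anomalies, 3))
```

In this toy case the raw correlation is high while the anomaly correlation is close to zero, because the two columns share nothing but the cycle. On real data, whatever correlation remains after this step would reflect shared weather rather than shared sunrise and sunset times.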

Heatmap month vs hours

# credits S Godinho @ https://www.kaggle.com/sgodinho/wind-energy-potential-prediction

df_solar_co['year'] = df_solar_co['time'].dt.year
plt.figure(figsize=(8, 6))
temp_df = df_solar_co[['FR', 'month', 'hour']]
temp_df = temp_df.groupby(['hour', 'month']).mean()
temp_df = temp_df.unstack('month').sort_index(ascending=False)
sns.heatmap(temp_df, vmin = 0.09, vmax = 0.29, cmap = 'plasma')

<matplotlib.axes._subplots.AxesSubplot at 0x7fea5e095e48>

[Figure: heatmap of mean FR efficiency by hour and month]
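
The hour-by-month table behind this heatmap can also be built with `pivot_table`, which yields plain month labels as columns. This is a sketch on random synthetic data; the demo frame only mimics the `FR`, `month` and `hour` columns.

```python
import numpy as np
import pandas as pd

# Synthetic year of hourly values (assumption), same column names as above
idx = pd.date_range("2001-01-01", periods=24 * 365, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({
    "FR": np.random.default_rng(0).random(len(idx)),
    "month": idx.month.to_numpy(),
    "hour": idx.hour.to_numpy(),
})

# Mean value for each (hour, month) cell, with hour 23 on the first row
by_pivot = demo.pivot_table(index="hour", columns="month", values="FR", aggfunc="mean")
by_pivot = by_pivot.sort_index(ascending=False)

# Same table through groupby + unstack, selecting the column first
by_unstack = (demo.groupby(["hour", "month"])["FR"].mean()
                  .unstack("month")
                  .sort_index(ascending=False))

print(by_pivot.shape)
```

Either table can be passed to `sns.heatmap` as is. Selecting `"FR"` before `unstack` avoids the two-level column labels that appear when a whole dataframe is unstacked.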

Conclusion

In this second part, we’ve explored the dataset in order to assess the impact of meteorological and climate variability on the generation of solar power. We’ve also shown how generation varies during the day, across the months of the year and across the years. The dataset seems to be clean, and a function to add date and time information is already implemented. It will be useful in the third and final part of this study, where we’ll train machine learning models to make predictions.


Olivier Brunet
