Flu Prediction

In this tutorial, we will analyze the evolution of the flu in France over 4 years. The data is produced by OpenHealth (download csv).

First, we will explore the data, which is a time series. Then, we will study the periodicity of flu episodes. To finish, we will predict the tendency of the flu over a year.

Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

Data cleaning

We load and prepare the data. We replace the values 0 by the mean of the previous value and the following one.

df = pd.read_csv("data/OpenhealthS-Grippal.csv", sep=";")

# Clean data and change data types
df = df.drop(["IAS_lisse", "Incidence_Sentinelles"], axis=1)
df.rename(columns={'PERIODE': 'period'}, inplace=True)
df['period'] = pd.to_datetime(df['period'])
df['IAS_brut'] = pd.to_numeric(df['IAS_brut'].str.replace(",", "."), errors="coerce")

# Replace 0 values by Nan and interpolate them
df["IAS_brut"].replace(0, np.nan, inplace=True)
df["IAS_brut"] = df["IAS_brut"].interpolate(method='linear')

# Display dates
print("Start date:", df["period"].min())
print("End date:", df["period"].max())
df.head()
Start date: 2009-07-01 00:00:00
End date: 2014-04-17 00:00:00
period IAS_brut
0 2009-07-01 1.7
1 2009-07-02 3.4
2 2009-07-03 2.1
3 2009-07-04 1.3
4 2009-07-05 14.2

Data exploration

We display the time serie:

plt.figure(figsize=(20,10))
plt.plot(df["period"], df["IAS_brut"])
plt.grid()
plt.title("Number of cases of influenza")
plt.show()

png

The graph represents the number of cases of influenza detected per day, over 4 years.
There was a strong peak in the beginning of 2013. It appears in the press that this was the longest epidemic for 30 years.
It seems that there were few cases in 2014, but this is explained by the fact that the data stopped in April 2014.

plt.figure(figsize=(20,10))
plt.hist(df["IAS_brut"], 100)
plt.grid()
plt.title("Histogram of marginal law")
plt.show()

png

The histogram shows that most of the days, there are few cases of influenza.

In order to better visualize the variations, we will apply the natural logarithm to the data.

df["IAS_brut"] = np.log(df["IAS_brut"])

plt.figure(figsize=(20,10))
plt.plot(df["period"], df["IAS_brut"])
plt.grid()
plt.title("Number of cases of influenza")
plt.show()

plt.figure(figsize=(20,10))
plt.hist(df["IAS_brut"], 100)
plt.grid()
plt.title("Histogram of marginal law")
plt.show()

png

png

We now notice that the number of cases of influenza is cyclical. It follows a similar pattern from year to year.
The histogram is now more extensive. Transformation by the natural logarithm smoothed the data and made it more readable.

Analysis of the flu cycles

To analyze the periodicity of the flu episodes, we will display a periodogram, which estimates the power spectral density of a signal by taking the square of its Fourier transform.

from scipy import signal

plt.figure(figsize=(20,8))
f, Pxx_den = signal.periodogram(df["IAS_brut"])
plt.semilogy(f, Pxx_den)
plt.title("Périodogram")
plt.ylim([1e-3, 1e4])
plt.xticks(np.arange(0,0.55,0.05))
plt.grid()
plt.show()

png

The periodogram shows at which frequencies appear peaks of flu cases. We notice 4 main peaks (around 0, 0.14, 0.28, 0.43).

We compute the exact frequency (in days) corresponding to the peaks:

a = [np.argmax(Pxx_den)]
max_indexes = np.argpartition(Pxx_den, -10)[-10:]
b = np.extract(f[max_indexes] > 0.05, max_indexes)
index = sorted(np.concatenate((a,b)))
1/f[index]
array([350.4       ,   7.008     ,   3.49700599,   2.33288948])

We note that the number of flu cases is mainly cyclical every year (first peak: 350 days).
We also note a cycle every week (second peak: 7 days). The frequency of the week may be explained by the facts that the patients certainly visit the doctor more on certain days of the week than others (on Monday rather than weekends for instance).

Remove a periodic trend

In order to better study some variations, we can remove periodic trends. Here, we will remove the one-year periodic trend from the data. By doing this, we will loose the data of the first year.

df2 = df.copy()
df2["IAS_brut_diff_period"] = df2["IAS_brut"].diff(periods=365)
df2 = df2.dropna()
df2.head()

plt.figure(figsize=(20,8))
f, Pxx_den = signal.periodogram(df2["IAS_brut_diff_period"])
plt.semilogy(f, Pxx_den)
plt.title("Périodogramme")
plt.ylim([1e-3, 1e3])
plt.xticks(np.arange(0,0.55,0.05))
plt.grid()
plt.show()
period IAS_brut IAS_brut_diff_period
365 2010-07-01 0.000000 -0.530628
366 2010-07-02 -0.105361 -1.329136
367 2010-07-03 -0.693147 -1.435085
368 2010-07-04 0.182322 -0.080043
369 2010-07-05 -0.510826 -3.164068

png

By removing the annual trend the first peak is now reduced and become as important as the others. We will set back this frequency for the rest of the tutorial.

Predict the tendency of the flu over a year

We will predict the number of cases of influenza for a year, based on periodic trends.

To estimate the periodic trend, for a fixed number of harmonics (n_harm), we will compute explanatory variables for the learning period: for each k = 1, 2, … n_harm, we add two explanatory variables, $x_{i,2k−1} = cos(t_i \cdot k \cdot \frac{2\pi}{T})$ and $x_{i,2k} = sin(t_i \cdot k \cdot \frac{2\pi}{T})$, where $t_i$ is the moment of time (we use whole numbers instead of dates) and T = 365.
We first compute with n_harm = 10.

# Save the dataframe for later work
df_initial = df.copy()

n_harm = 10
for k in range(1, n_harm+1):
    df["cos_" + str(k)] = np.cos(df.index*k*2*np.pi/365)

for k in range(1, n_harm+1):
    df["sin_" + str(k)] = np.sin(df.index*k*2*np.pi/365)

df.head()
period IAS_brut cos_1 cos_2 cos_3 cos_4 cos_5 cos_6 cos_7 cos_8 ... sin_1 sin_2 sin_3 sin_4 sin_5 sin_6 sin_7 sin_8 sin_9 sin_10
0 2009-07-01 0.530628 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 2009-07-02 1.223775 0.999852 0.999407 0.998667 0.997630 0.996298 0.994671 0.992749 0.990532 ... 0.017213 0.034422 0.051620 0.068802 0.085965 0.103102 0.120208 0.137279 0.154309 0.171293
2 2009-07-03 0.741937 0.999407 0.997630 0.994671 0.990532 0.985220 0.978740 0.971100 0.962309 ... 0.034422 0.068802 0.103102 0.137279 0.171293 0.205104 0.238673 0.271958 0.304921 0.337523
3 2009-07-04 0.262364 0.998667 0.994671 0.988023 0.978740 0.966848 0.952378 0.935368 0.915864 ... 0.051620 0.103102 0.154309 0.205104 0.255353 0.304921 0.353676 0.401488 0.448229 0.493776
4 2009-07-05 2.653242 0.997630 0.990532 0.978740 0.962309 0.941317 0.915864 0.886071 0.852078 ... 0.068802 0.137279 0.205104 0.271958 0.337523 0.401488 0.463550 0.523416 0.580800 0.635432

5 rows × 22 columns

We divide the dataset into two, for the training period (2009-07 - 2013-03) and the testing period (2013-04 - 2014-04):

df_train = df[(df["period"] >= "2009-07-01") & (df["period"] <= "2013-03-31")]
df_test = df[(df["period"] >= "2013-04-01") & (df["period"] <= "2014-04-17")]
X_train = df_train.drop(["period", "IAS_brut"], axis=1)
y_train = df_train["IAS_brut"]
X_test = df_test.drop(["period", "IAS_brut"], axis=1)
y_test = df_test["IAS_brut"]

We estimate the periodic tendency by regressing the training data with a linear regression:

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr = lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
plt.figure(figsize=(20,10))
plt.plot(df_test["period"], y_test, label="True values")
plt.plot(df_test["period"], y_pred, label="Prediction")
plt.grid()
plt.legend()
plt.title("Forecast of the number of influenza cases")
plt.show()

png

The predictions follow the same trend as the true values.
However, the predictions are higher than the reality. There were likely to be more flu cases overall in the training years (in particular in 2013 as we have seen before) than in the year of predictions.

We compute the mean squared error and the residues:

r_quad = sum((y_test - y_pred)**2)/len(y_test) #mean_squared_error(y_test, y_pred)
residus = y_test - y_pred
print("Mean squared error: {:.3f}".format(r_quad))
plt.figure(figsize=(20,10))
plt.plot(df_test["period"], residus)
plt.grid()
plt.title("Residues of the flu prediction")
plt.show()
Mean squared error: 1.869

png

The residues highlight that the deviations from the predictions were the largest between October and February (greatest deviation from the x-axis).

We will compute the mean squared error for multiple n_harm parameter, in order to find the better one:

r_quads = []

for n_harm in range(1,30):
    df = df_initial.copy()
    for k in range(1, n_harm+1):
        df["cos_" + str(k)] = np.cos(df.index*k*2*np.pi/365)
    for k in range(1, n_harm+1):
        df["sin_" + str(k)] = np.sin(df.index*k*2*np.pi/365)

    df_train = df[(df["period"] >= "2009-07-01") & (df["period"] <= "2013-03-31")]
    df_test = df[(df["period"] >= "2013-04-01") & (df["period"] <= "2014-04-17")]

    X_train = df_train.drop(["period", "IAS_brut"], axis=1)
    y_train = df_train["IAS_brut"]
    X_test = df_test.drop(["period", "IAS_brut"], axis=1)
    y_test = df_test["IAS_brut"]

    lr = LinearRegression()
    lr = lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)

    r_quads.append(sum((y_test - y_pred)**2)/len(y_test)) #mse
plt.figure(figsize=(20,5))
plt.title("Mean squared error of the prediction according to the number of harmonics")
plt.plot(range(1,30), r_quads)
plt.xticks(range(1,30))
plt.grid()
plt.show()

png

The best (lowest) mean squared error is obtained with 11 harmonics.

You can see the complete code and explanations on the associated GitHub repository.

Xavier B.

Xavier B.

My name is Xavier. I love data science.
Feel free to contact me!