Handing null values in Multivariate Time Series

The air quality data contains 38% of null values. The data is recorded at an interval of 15 minutes for the 2019 year. The data is missing in a continuous row for two months i.e. 2880 data points at a stretch. How to fill these gaps in the data.

dataset

You can fill empty values using previous good value (“forward fill”) or next good value (“back fill”)

df.fillna(method='ffill', inplace=True)
df.fillna(method='bfill', inplace=True)

You can use mean() (average) values from columns

df.fillna(df.mean(), inplace=True)

Doc: pandas.DataFrame.fillna()


OR you can try to use shift() to move values from previous month

mask = df.isnull().any(axis=1) # mask to get rows without data

offset = 2880
#offset = 2660

df.loc[mask, ['SO2', 'NOX', 'NO', 'NO2', 'CO', 'PM10']] = df.shift(offset).loc[mask,['SO2', 'NOX', 'NO', 'NO2', 'CO', 'PM10']]

#print(df.shift(offset)[mask].head(10))
#print(df.shift(offset)[mask].tail(10))

Doc: pandas.DataFrame.shift()


Full code which I used for tests

import pandas as pd

def compare(df1, df2):
    print('\n--- head ---\n')
    print('\nBEFORE')
    print(df1.head(10))
    print('\nAFTER')
    print(df2.head(10))
    print('\n--- tail ---\n')
    print('\nBEFORE')
    print(df1.tail(10))
    print('\nAFTER')
    print(df2.tail(10))
    print('\n--- mask head ---\n')
    print('\nBEFORE')
    print(df1[mask].head(10))
    print('\nAFTER')
    print(df2[mask].head(10))
    print('\n--- mask tail ---\n')
    print('\nBEFORE')
    print(df1[mask].tail(10))
    print('\nAFTER')
    print(df2[mask].tail(10))
        
# --- main ---

print('reading ...') # I display it because reading take longer time

# read local file
#df = pd.read_excel('BLR_1600_2019.xlsx')

# read directly from Google Drive
df = pd.read_excel('https://drive.google.com/uc?export=download&id=1M2vQbJXc3abWEuWBRipxt3WZWzECA_e6')

mask = df.isnull().any(axis=1) # mask to get rows without data

print('working ...') # I display it because reading take longer time

print('\n=== fillna ffill ===\n')

df2 = df.fillna(method='ffill')
compare(df, df2)

print('\n=== fillna mean ===\n')

df2 = df.fillna(df.mean())
compare(df, df2)

print('\n=== shift mask ===\n')

df2 = df.copy()
offset = 2880
offset = 2660
print(df.shift(offset)[mask].head(10))
print(df.shift(offset)[mask].tail(10))

df2.loc[mask, ['SO2', 'NOX', 'NO', 'NO2', 'CO', 'PM10']] = df.shift(offset).loc[mask,['SO2', 'NOX', 'NO', 'NO2', 'CO', 'PM10']]
compare(df, df2)
© Copyright 2013-2019 Analytics Vidhya