Extracting each day's data from a range of dates in a dataset

data_extraction
pandas
python

#1

I have a dataset covering January 12th to August 3rd, 2018, with some values:

values

The shape of the my_df DataFrame is:

my_df.shape 
(9752, 2)

The rows are spaced at half-hour intervals.

The first row begins at 2018-01-12

my_df.iloc[0]
Date:       2018-01-12 00:17:28
Value                      1
Name: 0, dtype: object

And the last row ends at 2018-08-03

my_df.tail(1)
                  Date:	    Value
9751	2018-08-03 23:44:59	 1

My objective is to select the rows corresponding to each day and export them to a comma-separated values (CSV) file.

To get only the January 12th data and save it to a readable file, I run:

# Select the rows for a single day (January 12th).
# 'Fecha:' is the date column (displayed as 'Date:' in the output above).
my_df_Jan12 = my_df[
    (my_df['Fecha:'] >= '2018-01-12 00:00:00')
    & (my_df['Fecha:'] <= '2018-01-12 23:59:59')
]
my_df_Jan12.to_csv('Data_Jan_12.csv', sep=',', header=True, index=False)

From January 12 to August 3 there are 203 days (29 weeks); counting both endpoints, that is up to 204 calendar dates.
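A quick way to double-check this count is sketched below with pandas. The variable names are only illustrative. Subtracting the two dates gives the elapsed days, while a daily `pd.date_range` counts both endpoints, so the two numbers differ by one.

```python
import pandas as pd

start = pd.Timestamp("2018-01-12")
end = pd.Timestamp("2018-08-03")

# Elapsed days between the two dates (end minus start)
elapsed_days = (end - start).days

# Calendar dates in the range, counting both endpoints
calendar_days = len(pd.date_range(start, end, freq="D"))

# Whole weeks contained in the elapsed days, plus any leftover days
weeks, leftover_days = divmod(elapsed_days, 7)

print(elapsed_days, weeks, leftover_days, calendar_days)
```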

I don't want to run this query manually for each day, so I am working from the following basic analysis:

  • I need to generate 203 files (one file per day)
  • The days in January start on the 12th (January 12)
  • January is the first month (01) and August is the eighth month (08)

Then:

  • I need to iterate over all 203 days
    • and for each row's date value it is necessary to check
      the month and the day, in order to detect when
      each of them changes

According to the above, I am trying this approach:

# Selecting data value of each day (203 days)
for i in range(203):
    for j in range(1,9): # month
        for k in range(12,32): # days of the month
            values = my_df[(my_df['Fecha:']>='2018-0{}-{} 00:00:00'.format(j,k)) 
            &  
            (my_df['Fecha:']<='2018-0{}-{} 23:59:59'.format(j,k))]
            values.to_csv('Values_day_{}.csv'.format(i), sep=',', header=True, index=False)

But I have a problem: when I iterate over range(12,32) for the days of the month, I think this range only applies to the first month, January …

In the end I get 203 empty CSV files, so I must be doing something wrong …

How can I address this small challenge in a suitable way?
Any guidance is highly appreciated.


#2

Hi,
Outlined below is an example in R. I highly recommend installing the lubridate package, as it makes date manipulation in R much more bearable.

It should be easy to adapt this code to your needs. The gist of it is:

  1. Obtain the unique dates within your dataset

  2. Loop through each date

  3. Extract the subset of data that matches each date (at which point you can write it to a CSV)

library(lubridate)
library(data.table)

# Create some sample data
date_range <- seq(ymd_hms('2018-01-01 00:00:00'), ymd_hms('2018-01-05 12:00:00'), by = 60*60*4)

sample_df <- data.table(
  date_value = date_range,
  value = sample(seq(1,1000), length(date_range))
)

# Check what unique dates you have
unique(date(sample_df$date_value))

# Loop through each unique date
for (current_date in unique(date(sample_df$date_value))) {
  # print no. of rows for given date
  print(nrow(sample_df[date(date_value) == current_date]))

  # print subset of data for given date - this can be written to csv, etc.
  print(sample_df[date(date_value) == current_date])
}

And the output is:

> # Create some sample data
> date_range <- seq(ymd_hms('2018-01-01 00:00:00'), ymd_hms('2018-01-05 12:00:00'), by = 60*60*4)
> 
> sample_df <- data.table(
+   date_value = date_range,
+   value = sample(seq(1,1000), length(date_range))
+ )
> 
> # Check what unique dates you have
> unique(date(sample_df$date))
[1] "2018-01-01" "2018-01-02" "2018-01-03" "2018-01-04" "2018-01-05"
> 
> # Loop through each unique date
> for (current_date in unique(date(sample_df$date))) {
+   # print no. of rows for given date
+   print(nrow(sample_df[date(date_value) == current_date]))
+ 
+   # print subset of data for given date - this can be written to csv, etc.
+   print(sample_df[date(date_value) == current_date])
+ }
[1] 6
            date_value value
1: 2018-01-01 00:00:00   456
2: 2018-01-01 04:00:00   606
3: 2018-01-01 08:00:00   167
4: 2018-01-01 12:00:00   910
5: 2018-01-01 16:00:00   433
6: 2018-01-01 20:00:00    61
[1] 6
            date_value value
1: 2018-01-02 00:00:00   600
2: 2018-01-02 04:00:00   864
3: 2018-01-02 08:00:00   320
4: 2018-01-02 12:00:00   541
5: 2018-01-02 16:00:00   635
6: 2018-01-02 20:00:00   186
[1] 6
            date_value value
1: 2018-01-03 00:00:00   233
2: 2018-01-03 04:00:00   261
3: 2018-01-03 08:00:00   449
4: 2018-01-03 12:00:00   639
5: 2018-01-03 16:00:00   424
6: 2018-01-03 20:00:00   967
[1] 6
            date_value value
1: 2018-01-04 00:00:00   390
2: 2018-01-04 04:00:00   661
3: 2018-01-04 08:00:00   734
4: 2018-01-04 12:00:00     2
5: 2018-01-04 16:00:00   815
6: 2018-01-04 20:00:00   459
[1] 4
            date_value value
1: 2018-01-05 00:00:00   269
2: 2018-01-05 04:00:00    12
3: 2018-01-05 08:00:00    24
4: 2018-01-05 12:00:00    21
>
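
For readers working in pandas, as in the original question, the same three steps can be sketched roughly as follows. This is a minimal sketch under assumptions, not a definitive implementation: the sample frame, the column names `Date` and `Value`, the helper `export_daily_csvs`, and the file pattern `Values_<date>.csv` are all hypothetical stand-ins. It assumes the date column is, or can be parsed as, a datetime.

```python
import os
import tempfile

import pandas as pd


def export_daily_csvs(df, date_col, out_dir):
    """Write one CSV per calendar day found in df[date_col]; return the paths."""
    # Parse to datetime in case the column was read in as strings
    timestamps = pd.to_datetime(df[date_col])
    paths = []
    # Group rows by calendar date; only dates present in the data form a group
    for day, day_rows in df.groupby(timestamps.dt.date):
        path = os.path.join(out_dir, "Values_{}.csv".format(day))
        day_rows.to_csv(path, sep=",", header=True, index=False)
        paths.append(path)
    return paths


# Hypothetical sample data: one row every 4 hours, mirroring the R example
sample_df = pd.DataFrame(
    {
        "Date": pd.date_range(
            "2018-01-01 00:00:00",
            "2018-01-05 12:00:00",
            freq=pd.Timedelta(hours=4),
        )
    }
)
sample_df["Value"] = range(len(sample_df))

out_dir = tempfile.mkdtemp()  # temporary folder so the sketch leaves no clutter
written = export_daily_csvs(sample_df, "Date", out_dir)
print(len(written))  # number of files, one per distinct date in the sample
```

With the real data this would be called as something like `export_daily_csvs(my_df, 'Fecha:', '.')`, assuming `'Fecha:'` is the date column as in the question. Because the loop runs over the dates actually present in the data, there is no need to hard-code month or day ranges. It also avoids rewriting the same file on every inner iteration, which appears to be why the attempt in the question ends up with empty files: the last month/day combination tried, August 31, has no rows.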