Clarity on "Build Data Pipeline"


While building this data pipeline, what assumptions do we make? What is the schema of the data? Is it the same as the one in the “Train” dataset?

Do we read from a Kafka stream, or do we push to it?

Can anyone please shed some light on this?




  1. The schema of the dataset is the same as the one provided in the train link.

  2. Build a component that reads the data from the decompressed CSV file and pushes it to Kafka.
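For anyone unsure what such a component might look like, here is a minimal sketch, not an official solution. It assumes the kafka-python client; the names `push_csv_rows`, `make_kafka_sender`, the topic, the broker address and the `skip_header` flag are placeholders of mine, and sending each row as a JSON list is just one possible message format.

```python
import csv
import json
from typing import Callable, Iterable


def push_csv_rows(lines: Iterable[str], send: Callable[[bytes], None],
                  skip_header: bool = False) -> int:
    """Parse CSV lines one at a time and hand each row to `send` as JSON bytes."""
    reader = csv.reader(lines)
    if skip_header:
        next(reader, None)  # drop the first row if the file has a header
    sent = 0
    for row in reader:
        send(json.dumps(row).encode("utf-8"))
        sent += 1
    return sent


def make_kafka_sender(topic: str, servers: str = "localhost:9092"):
    """Assumption: kafka-python is installed; topic and servers are placeholders."""
    from kafka import KafkaProducer  # lazy import so the CSV logic runs without Kafka

    producer = KafkaProducer(bootstrap_servers=servers)

    def send(payload: bytes) -> None:
        producer.send(topic, value=payload)

    return producer, send
```

To use it, one would open the decompressed file with `open(path, newline="")`, pass the file object and the `send` function to `push_csv_rows`, and call `producer.flush()` at the end so buffered messages are delivered.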

@ankit2106 Can I assume that the decompressed file is available on the cluster?
Or is it at a location outside the cluster?

Yes, you can assume that.


Streaming data should be cleaned and stored, and then an alert should be triggered at the end of each hour, i.e. batch jobs of 1 hour. Is my understanding correct?

Building a data pipeline: Build a data pipeline to stream power consumption/load data using Kafka. You may use any processing system of your choice to ingest the data.
Generate real-time alerts: Generate real-time alerts on power consumption (coming from the Kafka stream) when:
(Alert Type 1) Hourly consumption for a household is higher than 1 standard deviation for that hour for that household’s mean consumption historically for that hour
(Alert Type 2) Hourly consumption for a household is higher than 1 standard deviation of mean consumption across all households within that particular hour on that day
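
One possible reading of the two alert rules, sketched in Python. This is an assumption on my part: I take "higher than 1 standard deviation" to mean "greater than mean + 1 standard deviation", and I use the population standard deviation. The function names and the dictionary layouts are hypothetical, not part of the problem statement.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def exceeds_one_std(value: float, reference: Sequence[float]) -> bool:
    """True when value > mean(reference) + 1 * population stdev(reference)."""
    if len(reference) < 2:
        return False  # not enough data to estimate a spread
    return value > mean(reference) + pstdev(reference)


def type1_alert(history: Dict[Tuple[str, int], List[float]],
                household: str, hour_of_day: int, consumption: float) -> bool:
    """Alert Type 1: compare against this household's past totals for the same hour."""
    return exceeds_one_std(consumption, history.get((household, hour_of_day), []))


def type2_alerts(hourly_totals: Dict[str, float]) -> List[str]:
    """Alert Type 2: compare each household against all households in one hour of one day."""
    values = list(hourly_totals.values())
    return [h for h, v in hourly_totals.items() if exceeds_one_std(v, values)]
```

With this layout, the hourly totals would be computed once per closed hour, which matches the "end of each hour" understanding above.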

Yes, you got it right.

@ankit2106 The first requirement is that data must be consumed in a streaming fashion.

But we are reading directly from a decompressed file and sending it to a Kafka topic.

Here, do we assume that the streamed data is written directly to the file and that the file is continuously appended to?

Can you please clarify?

No, do not assume this. The data is coming from each device, and you have to stream it using the Kafka module.

@ankit2106 I am confused.

Then what is the decompressed file for? Where does the data from each device land?

Can I use a cloud service provider to implement the streams, or is the code logic alone sufficient?

Is the sample output the actual output?
I ask because in the dataset the minimum timestamp is August 31, 2013, 10:00 GMT, whereas in the sample output the timestamps start from 01-Sept-2013.
Could someone please advise?

Hello Renu21,
Try a timezone of GMT+2:00 (say, Paris).
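
A quick way to check how a fixed +2:00 offset shifts calendar dates is sketched below. The sample reading is hypothetical (it is not taken from the dataset), and `GMT_PLUS_2` and `local_date` are names I made up for illustration.

```python
from datetime import date, datetime, timedelta, timezone

# Fixed offset of +2 hours, matching the "GMT+2:00" suggestion above.
GMT_PLUS_2 = timezone(timedelta(hours=2))


def local_date(ts_utc: datetime) -> date:
    """Return the calendar date of a UTC timestamp when viewed at GMT+2:00."""
    return ts_utc.astimezone(GMT_PLUS_2).date()


# Hypothetical reading taken late in the evening of Aug 31 (GMT).
sample = datetime(2013, 8, 31, 22, 30, tzinfo=timezone.utc)
print(local_date(sample))  # 2013-09-01
```

Readings from 22:00 GMT onwards on Aug 31 would therefore be dated Sept 1 at GMT+2:00, while earlier ones stay on Aug 31.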

Aug 31 is also included in the test file. Please check.

© Copyright 2013-2019 Analytics Vidhya