[Discussions] Big Break in Big Data: Sapient Talent Hunt for Data Engineering

Use this category for discussions related to the contest Big Break in Big Data: Sapient Talent Hunt for Data Engineers Hackathon, which starts on 30th June. Feel free to share your approach and ask your questions here.

For more information, visit:

I have a query. The description mentions that there can be no pre-computation on the whole data set. Can cleaning, missing-value removal, etc. be done on the whole data set before pushing the data to Kafka for streaming? Would that be valid, or is even cleaning the data before pushing it not allowed?

Hi, it is entirely up to you. Cleaning and imputation can be done before feeding the data to Kafka.

Thanks a lot

Should a certain number of days be skipped for “learning”? Or should alerts be fired right from the second day, when the standard deviation would be 0, so that if consumption for an hour on the second day is greater than that of the first day it fires Alert 1?

@ankit2106 Queries regarding the below metrics for raising an alert:
(Alert Type 1) Hourly consumption for a household is more than 1 standard deviation above that household’s historical mean consumption for that hour.
(Alert Type 2) Hourly consumption for a household is more than 1 standard deviation above the mean consumption across all households for that hour on that day.

Can the metric be expressed mathematically in addition to the sentence above? I am unable to understand what exactly is to be computed. Or could you please provide an example?


  1. For the first alert type, consider a household, say household 1 in house 1. For each passing hour, consider the history of only that household, and compare the current hour’s total consumption against the historical mean for that hour. Generate an alert if the current value is more than 1 standard deviation above that historical mean.
  2. For the second alert type, consider a household and compare its consumption against the mean consumption across all households within that hour on that day. Generate an alert if it is more than 1 standard deviation above that mean.
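The two checks above can be sketched in a few lines of Python. The numbers and household IDs below are purely illustrative, not taken from the contest data:

```python
from statistics import mean, pstdev

# Alert Type 1: one household's history for a given hour (illustrative values).
history = [2.1, 2.4, 1.9, 2.2]   # past totals for, say, hour 14:00
current = 3.1                    # this hour's total for the same household

# Fire when the current hour exceeds the historical mean by > 1 std dev.
threshold = mean(history) + pstdev(history)
alert_type_1 = current > threshold

# Alert Type 2: all households' totals for the same hour on the same day
# (again, illustrative values); compare one household against the group.
all_households = {"h1": 3.1, "h2": 2.0, "h3": 2.3, "h4": 1.8}
group = list(all_households.values())
alert_type_2 = all_households["h1"] > mean(group) + pstdev(group)

print(alert_type_1, alert_type_2)
```

Whether to use the population or sample standard deviation is not specified in the problem statement; the sketch uses the population form (`pstdev`).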

@bhargav.pendse Generate alerts right from the second day. Thanks.

Understood, Thank you.

@ankit2106 Are entries where the value column equals 0 to be considered missing/malformed data?

No, these points are not missing/malformed.

@ankit2106 Thanks for the explanation. But what is the frequency of alert generation? I understand that the window length is one hour of data, but how frequently does this selection have to be made, hourly or daily?

Note: in alert_type_1.csv, I can see only one entry per house, per household, per day, at 00 hrs.

Thanks in advance.

No need for apologies; I am happy to answer queries. The alerts have to be generated on an hourly basis, so for each day and hour for a particular household in the stream, a separate alert would be generated.

@ankit2106 I am summarizing my assumptions and what I have understood. Please let me know if I am mistaken on any of the points.

  1. I am assuming that the hour starts at 00:00, so the window will be 00:00–00:59.999, and it is a tumbling-window operation.
  2. Do I compare today’s hourly sum of all values (say, the 00:00–00:59.999 period’s data) with the mean of all data available up to yesterday for that same hour only, or with only yesterday’s data (mean and standard deviation)?
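For the tumbling-window assumption in point 1, flooring each epoch timestamp to the start of its hour gives non-overlapping 00:00–00:59.999 buckets. A minimal sketch (the function name is mine, and UTC is assumed):

```python
from datetime import datetime, timezone

def hour_bucket(epoch_seconds: int) -> datetime:
    """Floor a UNIX timestamp to the start of its hour in UTC,
    so readings fall into tumbling windows 00:00-00:59.999, 01:00-01:59.999, ..."""
    floored = epoch_seconds - (epoch_seconds % 3600)
    return datetime.fromtimestamp(floored, tz=timezone.utc)

# A reading at 21:59:00 UTC lands in the 21:00 window.
print(hour_bucket(1380578340))
```

Grouping by `(house_id, household_id, hour_bucket(ts))` then yields one aggregate per household per hour.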

Quoting the submission format:
“Participant needs to submit the Code files and a document describing the choice of hardware/cluster and the query performance metrics. Participant can upload these files after compressing it under upload code file section on solution checker.”

Question: What are “query performance metrics”?

@ankit2106 Could you please reply to this query?

Apologies for the late reply. Not with yesterday’s data, but with the entire history for that hour.

@shruti259 @scorp95 Hey guys, can you please guide me through the key points of building the pipeline?
I am unaware of the process of bringing the data into a cluster. The data generated by the sensors needs to be pushed to a Kafka topic, which is then consumed in a streaming application. I am familiar with the streaming application part.

Can you please point me to some resources on how to bring the data into the cluster?

Thanks in advance!

@vihit I think we can simply read the data from the CSV file and use a Kafka producer to push it to the Kafka topic. The data in Kafka can in turn be consumed by a Kafka consumer and processed as needed.
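A minimal sketch of that producer loop, using only the standard library so it runs anywhere. The column names and sample rows are assumptions, and the `sent` list stands in for a real Kafka producer (in practice you would call something like `producer.send(topic, message)` from a Kafka client library instead of appending):

```python
import csv
import io
import json

# Illustrative stand-in for the sensor CSV; replace with open("readings.csv").
sample_csv = """timestamp,house_id,household_id,value
1380578340,1,1,0.52
1380578400,1,1,0.47
"""

sent = []  # stand-in for the Kafka topic

for row in csv.DictReader(io.StringIO(sample_csv)):
    # Kafka payloads are bytes, so serialize each row to JSON and encode.
    message = json.dumps(row).encode("utf-8")
    sent.append(message)  # real code: producer.send("meter-readings", message)

print(len(sent))
```

On the consuming side, the streaming application subscribes to the same topic and deserializes each message back into a record before windowing and aggregation.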

Hi @ankit2106, or anyone else who can help.

I know it is very late to ask this question, but please reply as soon as possible.

Could you please help me with the correct time zone of the data? Because the key is formed by converting seconds into a time, if I use a date converter in my local time zone it produces some keys which are not valid or out of bounds.

For example, 1380578340 is GMT: Monday, September 30, 2013 9:59:00 PM, but in my time zone it is Tuesday, October 1, 2013 3:29:00 AM, hence it produces a key which is not valid.
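For reference, deriving the key in UTC avoids that shift; in Python, `datetime.fromtimestamp` uses the machine's local zone unless a `tz` is passed explicitly, which is exactly the symptom described above:

```python
from datetime import datetime, timezone

ts = 1380578340

# Passing tz=timezone.utc pins the conversion to GMT, so the same
# timestamp yields the same key on every machine regardless of locale.
key_time = datetime.fromtimestamp(ts, tz=timezone.utc)
print(key_time)  # 2013-09-30 21:59:00+00:00
```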

Please help.

© Copyright 2013-2019 Analytics Vidhya