I am trying to run a performance comparison between HiveQL and Spark SQL on around 2 TB of data, but I am having difficulty obtaining data at that scale.
Could you please suggest any freely available, downloadable data source with data sets larger than 2 TB?
Data domain: healthcare, retail, census data… I am not looking for data intended for predictive analysis or anything of that kind.



I was able to find two resources, which I am sharing below.

  1. Dataset by Criteo Labs
  2. Reddit comments dataset