Harvesting Big Data (500TB+)



How can we collect/mine textual data on the order of 1000 TB in English and other languages from the internet? All the readily available data dumps, when extracted, barely yield 10 TB of text data put together. What strategies can be used to collect 1000 TB of data by crawling the internet?



This is a bad approach that burns resources!

You should first define the problem you are trying to solve and then collect a reasonable amount of data to solve it.

The amount of data does not matter - if you can solve a problem with 1 GB of data, why would you crawl 1 TB?



Agreed, Kunal.
This is actually for an enterprise-level machine translation project. 1 GB or even 1 TB is never going to be enough to reach the accuracy of a human translator. We are aiming for 1000 TB to increase our chances of reaching that accuracy.


Hi @sindhukem49, did you come across this dataset by Google? It may be the largest n-gram corpus to date. Also, as per them:

(You could have) all of it, if you have the bandwidth and space. We’re happy to oblige.


Hi @jalFaizy,
Thanks for the response! We did come across the Google n-grams.
They were not suitable for two reasons:

  1. It is just a repository of words, while we are looking for running text (grammatically correct sentences).
  2. The files are still fairly small and don’t add up to more than 25 TB.

In fact, there are many data dumps like that (the WaCky corpus, Reddit comments, etc.). All of these, while good sources, don’t contribute much to the size we are looking for. Therefore, the only solution was to build a generic crawler.
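A generic crawler at this scale is mostly an engineering problem (politeness, deduplication, frontier management), but the per-page core is simple: fetch a page, extract the visible text for the corpus, and extract the links to feed back into the crawl frontier. Below is a minimal sketch of that extraction step using only the Python standard library; the class and function names are illustrative, and a production crawler would add robots.txt handling, language detection, and boilerplate removal on top.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class TextAndLinkExtractor(HTMLParser):
    """Extracts visible text and outgoing links from a single HTML page."""
    SKIP = {"script", "style", "noscript"}  # tags whose content is not visible text

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # absolute URLs for the crawl frontier
        self.text_parts = []   # visible text fragments for the corpus
        self._skip_depth = 0   # >0 while inside a script/style block

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())

def extract(html, base_url):
    """Return (visible_text, outgoing_links) for one page."""
    parser = TextAndLinkExtractor(base_url)
    parser.feed(html)
    return " ".join(parser.text_parts), parser.links
```

At 1000 TB scale you would of course run many such workers in parallel (or reuse an existing framework like Apache Nutch or the Common Crawl pipeline) and deduplicate aggressively, since a large fraction of fetched pages are near-duplicates.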


Hi @sindhukem49, that seems like the best approach (and the last resort!).

As far as I have heard, large companies such as Google and Microsoft are working on this problem too. Maybe you could ask them (I doubt they’ll reveal their trade secrets, but it's worth a try).

Also, you may find this paper by Stanford’s NLP group useful.

Your idea is certainly noteworthy, I wish you the best.


PS: If you succeed, could you please make the data freely available? It would be great to work on such a humongous dataset!