I want to use the Cookiecutter Data Science project structure for my project. It looks great: http://drivendata.github.io/cookiecutter-data-science/
I am analyzing the different directories in their structure, and I have some questions about the different data stages. The README.md file sets out the difference between external, interim, processed and raw data:
```
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
```
I am working on a project in which the data originate from sensors and are managed via a web application dashboard. Additionally, I have performed some JOINs on an SQL database dump in order to extract the other features and data I need before I can start working.
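For context, the extraction step looks roughly like this (a minimal, self-contained sketch using SQLite; the table and column names are made up for illustration, not my real schema):

```python
import sqlite3

import pandas as pd

# Illustrative stand-in for the database dump: a readings table joined
# against a farms table to attach location features to each measurement.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (farm_id INTEGER, weight REAL, date TEXT);
    CREATE TABLE farms (farm_id INTEGER, farmName TEXT, lat REAL, lng REAL);
    INSERT INTO readings VALUES (1, 3.09, '2012-07-27 07:08:58');
    INSERT INTO farms VALUES (1, 'Totti', 57.766231, -16.762676);
""")

query = """
    SELECT r.weight, r.date, f.farmName, f.lat, f.lng
    FROM readings r
    JOIN farms f ON r.farm_id = f.farm_id
"""
data = pd.read_sql(query, conn)  # the result of the JOIN, as a DataFrame
```

The result of this query is what I save as `fruit-RawData.csv` and then never touch again.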
What is the difference between raw data and external data?
Does the extraction process I describe above, or the way I obtain the data, mean they should be cataloged as raw data?
Why aren’t these considered external data?
Would they only be considered external data if I got them from sources other than my organization, which owns the sensors and the web application dashboard?
Regarding raw data, the guide especially emphasizes:
> - Don’t ever edit your raw data, especially not manually, and especially not in Excel.
> - Don’t overwrite your raw data.
> - Don’t save multiple versions of the raw data.
> - Treat the data (and its format) as immutable.
> - The code you write should move the raw data through a pipeline to your final analysis.
I understand this best practice.
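The way I understand the rule, applying it would look something like this (a minimal sketch; the paths follow the cookiecutter layout, and the tiny dataset is invented for illustration):

```python
from pathlib import Path

import pandas as pd

raw_path = Path('data/raw/fruit-RawData.csv')
interim_path = Path('data/interim/fruits.csv')

# Set up an illustrative raw file so the sketch is self-contained.
raw_path.parent.mkdir(parents=True, exist_ok=True)
pd.DataFrame({'weight': [3.09, 1.50],
              'date': ['2012-07-27 07:08:58', '2012-07-27 07:09:01'],
              'number': [15, 15],
              'farmName': ['Totti', 'Totti']}).to_csv(raw_path, index=False)

# Pipeline step: read the raw file, transform, and write the result
# somewhere else. The raw file itself is never modified or overwritten.
data = pd.read_csv(raw_path)
subset = data[['weight', 'date', 'number']]
interim_path.parent.mkdir(parents=True, exist_ok=True)
subset.to_csv(interim_path, index=False)
```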
To illustrate my question, I want to select some columns from a dataset sample I am working with.
I read a raw dataset which I extracted using SQL joins (the values shown here have been changed for this example). So, these are my raw data:
```python
import pandas as pd

# Read the raw dataset
data = pd.read_csv('fruit-RawData.csv')
data.head()
```

```
   weight                 date  number        lat        lng farmName
0    3.09  2012-07-27 07:08:58      15  57.766231 -16.762676    Totti
1    1.50  2012-07-27 07:09:01      15  57.766231 -16.762676    Totti
2   10.50  2012-07-27 07:09:02      15  57.766231 -16.762676    Totti
3    2.50  2012-07-27 07:09:04      15  57.766231 -16.762676    Totti
4    6.50  2012-07-27 07:09:06      15  57.766231 -16.762676    Totti
```
If I select only the weight, date and number columns…
```python
data = data[['weight', 'date', 'number']]
data.to_csv('fruits.csv', sep=',', header=True, index=False)
```
And I get:
```
   weight                 date  number
0   23.09  2012-07-27 07:08:58       5
1   30.50  2012-07-27 07:08:58       5
2   19.50  2012-07-27 07:08:58       5
3   25.50  2012-07-27 07:08:58       5
4   26.50  2012-07-27 07:08:58       5
```
Could this data subset be considered intermediate data that has been transformed, or is it still raw data?
I don’t know whether these questions are valid.