About original raw data and transformed intermediate data


I want to use the Cookiecutter Data Science project structure for my project. It looks great: http://drivendata.github.io/cookiecutter-data-science/

I am analyzing the different directories in the structure and I have some questions about the different data stages. The README.md file [sets out the difference between external, interim, processed and raw data.][1]

    ├── data
    │   ├── external       <- Data from third party sources.
    │   ├── interim        <- Intermediate data that has been transformed.
    │   ├── processed      <- The final, canonical data sets for modeling.
    │   └── raw            <- The original, immutable data dump.
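As I understand the convention, each stage maps to a fixed path under `data/`. A minimal sketch of that mapping (the project name `my_project` is hypothetical):

```python
from pathlib import Path

# Hypothetical project root following the Cookiecutter Data Science layout
PROJECT_ROOT = Path('my_project')

DATA_DIRS = {
    'external': PROJECT_ROOT / 'data' / 'external',    # third-party sources
    'interim': PROJECT_ROOT / 'data' / 'interim',      # transformed intermediates
    'processed': PROJECT_ROOT / 'data' / 'processed',  # final modeling sets
    'raw': PROJECT_ROOT / 'data' / 'raw',              # the immutable dump
}

for name, path in DATA_DIRS.items():
    print(f'{name:10} -> {path}')
```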

I am working on a project in which the data originate from sensors and are managed via a web application dashboard. Additionally, I have performed some JOINs on a SQL database dump in order to extract other features and data that I need before I can start working.
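To make that extraction step concrete, here is a minimal sketch of the kind of JOIN I mean; the schema (`measurements`, `farms`, `farm_id`) is hypothetical, not my real one:

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the SQL database dump (hypothetical schema)
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE farms (farm_id INTEGER, farmName TEXT);
    INSERT INTO farms VALUES (1, 'Totti');
    CREATE TABLE measurements (farm_id INTEGER, weight REAL, date TEXT);
    INSERT INTO measurements VALUES (1, 3.09, '2012-07-27 07:08:58');
""")

# The JOIN that produces the dataset I then treat as "raw"
query = """
    SELECT m.weight, m.date, f.farmName
    FROM measurements m
    JOIN farms f ON m.farm_id = f.farm_id
"""
data = pd.read_sql_query(query, con)
print(data)
```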

What is the difference between raw data and external data?

Does the extraction process I describe above (or the way I obtain the data) mean that they should be catalogued as raw data?

Why aren't these considered external data?

Would they only be considered external data if I got them from sources outside my organization, which owns the sensors and the web application dashboard?

About raw data
The guidelines especially stress:

Don’t ever edit your raw data, especially not manually, and especially not in Excel. Don’t overwrite your raw data. Don’t save multiple versions of the raw data. Treat the data (and its format) as immutable. The code you write should move the raw data through a pipeline to your final analysis

I understand this best practice :slight_smile:
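The way I read this rule, code should only ever read from `data/raw` and write its output elsewhere. A minimal sketch of that pattern (the file names are hypothetical, and here the raw file is created once only so the example is self-contained):

```python
from pathlib import Path

import pandas as pd

raw_dir = Path('data/raw')
interim_dir = Path('data/interim')
raw_dir.mkdir(parents=True, exist_ok=True)
interim_dir.mkdir(parents=True, exist_ok=True)

# Stand-in for the immutable raw dump: written once, never modified afterwards
raw_path = raw_dir / 'fruit-RawData.csv'
if not raw_path.exists():
    pd.DataFrame({'weight': [3.09], 'date': ['2012-07-27 07:08:58'],
                  'number': [15]}).to_csv(raw_path, index=False)

# Pipeline step: read from raw, transform, write the result to interim --
# the raw file itself is never overwritten
data = pd.read_csv(raw_path)
subset = data[['weight', 'date']]
subset.to_csv(interim_dir / 'fruits.csv', index=False)
```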

To illustrate my question, I want to select some columns from a sample dataset I am working with:

I read the raw dataset that I extracted using the SQL joins (the values have been changed for this example).

This, then, is my raw data:

import pandas as pd

# Read the raw dataset
data = pd.read_csv('fruit-RawData.csv')
data.head()


       weight  date                 number  lat        lng         farmName
    0    3.09  2012-07-27 07:08:58      15  57.766231  -16.762676  Totti
    1    1.50  2012-07-27 07:09:01      15  57.766231  -16.762676  Totti
    2   10.50  2012-07-27 07:09:02      15  57.766231  -16.762676  Totti
    3    2.50  2012-07-27 07:09:04      15  57.766231  -16.762676  Totti
    4    6.50  2012-07-27 07:09:06      15  57.766231  -16.762676  Totti

If I select only the weight, date and number columns…

data = data[['weight','date','number']]
data.to_csv('fruits.csv', sep=',', header=True, index=False)

And I get:

       weight  date                 number
    0   23.09  2012-07-27 07:08:58       5
    1   30.50  2012-07-27 07:08:58       5
    2   19.50  2012-07-27 07:08:58       5
    3   25.50  2012-07-27 07:08:58       5
    4   26.50  2012-07-27 07:08:58       5

Could this data subset be considered intermediate data that has been transformed, or is it still raw data?

I don't know whether these questions are valid.