Seeking insight from experienced practitioners out there. I have a very complete dataset from agriculture, specifically strawberry, raspberry and blackberry growing/farming. There is sales, production, expense and cultural practice (fertilizer, pesticide applications, etc.) data. In addition to demonstrating new visualizations, I want to work towards demonstrating data science techniques with the dataset: visualizations in R and ML. What would be effective questions to ask/consider to demonstrate more predictive techniques such as regression, R plotting and ML? I am new to predictive concepts and would like to start working with the data outside of a traditional “BI” context. I have ideas for such questions but am not sure if I am in the ballpark.
I’m not an expert here but I’d like to give it a try!
What I’d do when I get this kind of dataset is:
- Plot graphs of each attribute (sales, production, etc.). Also plot comparisons between the categories (strawberry, raspberry, etc.) and their attributes. What I want to know here is what information the data contains. Is there a pattern that could be inferred? This is exploratory data analysis.
- Do a statistical survey. Find the mean, median, mode, etc. of the data. What I want to know here is what the data looks like. What are the most frequent values of each attribute? Are there any outliers? This is descriptive statistics.
- Do hypothesis testing. Let me explain this with an example. Suppose your exploration suggests that high pesticide application is harmful. This would be a good find, because you could use it to improve production. So this is your hypothesis. To test it, you take a random sample from the data and check whether the hypothesis holds there. What I want to find here is which variables are dependent and which are independent. Does one attribute correlate with another? This is inferential statistics.
- Finally, build a predictive model. From your valid hypotheses, find which are actually important for your problem and build a model (which may or may not involve machine learning) that predicts your findings for new data. What I want to know here is whether my finding is replicable on new data. Can I increase my productivity by making predictions about the future? This is predictive modelling.
This would be my standard approach. Along the way, I’d discuss my findings with my peers or knowledgeable people and hear them out. You never know what you may find useful!
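As a rough sketch of the descriptive-statistics step, here is how a per-crop summary and outlier check might look. It is in Python with only the standard library; the same ideas carry over to R's summary() and boxplot(). The records, the units and the 1.5 × IQR outlier rule are my own illustrative assumptions, not anything from the actual dataset.

```python
import statistics
from collections import defaultdict

# Hypothetical weekly yield records (crop, kg per acre); replace with real data.
records = [
    ("strawberry", 395), ("strawberry", 400), ("strawberry", 405),
    ("strawberry", 410), ("strawberry", 415), ("strawberry", 420),
    ("strawberry", 430), ("strawberry", 900),  # 900 looks suspicious
    ("raspberry", 198), ("raspberry", 205), ("raspberry", 210),
    ("raspberry", 212), ("raspberry", 215), ("raspberry", 220),
    ("raspberry", 222), ("raspberry", 225),
]

def summarize(values):
    """Return mean, median and outliers (1.5 * IQR rule) for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "outliers": [v for v in values if v < low or v > high],
    }

# Group the yields by crop, then summarize each group separately.
by_crop = defaultdict(list)
for crop, kg in records:
    by_crop[crop].append(kg)

summary = {crop: summarize(vals) for crop, vals in by_crop.items()}
for crop, stats in summary.items():
    print(crop, stats)
```

Note how the single extreme strawberry value pulls the mean well above the median; that gap is often the first hint that an outlier is worth investigating before modelling.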
Thank you for this, it is the general outline I’ve been seeking for such an application.
Regarding “Do a statistical survey”, can you expand on “how is the data”? What exactly do you mean here? Do you mean what the “shape” of the data is? Also, what would examples of frequent attributes be?
Regarding hypothesis testing: in my case, would a random sample be data for just one commodity and its pest/fert applications? Data for a random (short) period of time instead of the entire growing cycle? Is there a resource you would recommend re: hypothesis testing and using R to do it?
Predictive modeling: if a predictive model excludes ML, what else would I be using? What do you mean by “is my finding replicable for a new data”? I do not understand this part.
Thank you for your time
I’m glad you found it useful. I’ll try to answer your questions again.
What is a statistical survey? It is getting to know your data and describing it. You investigate the distributions of the data and use them to draw conclusions about it.
For example, if you find that the mean score of a class is 30 out of 50 in maths but 20 out of 50 in physics, you might conclude that the maths paper was comparatively easier than the physics one.
What is an example of a frequent attribute? It is the value that occurs most often in a dataset (the mode). It could be a fertilizer (which most of the farmers use because it’s efficient) or a location (because it is near a water source). It is important because it gives the value that is most likely to occur in the dataset.
This course by Udacity covers descriptive statistics in depth.
How should a random sample be chosen? It depends very much on the problem you are trying to solve. Suppose you see that strawberry production is low. Here you hold the crop constant at strawberry and sample values from the other columns, then look at their distributions. For example, you might see that people buy more strawberries in summer, whereas sales decline in the rainy season. To narrow it down further, you could hold strawberry constant again and vary the season, and see that expenses are high in the rainy season and low in summer. So as your problem changes, your sample changes.
This is a good resource for hypothesis testing in R.
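To make the hypothesis-testing idea a bit more concrete, here is a minimal sketch of a permutation test in Python (standard library only). It asks: if pesticide level made no difference, how often would randomly shuffled groups differ in mean yield by at least as much as the observed groups? The yield numbers and group names are made up for illustration, and a permutation test is just one option I picked; a t-test (t.test in R) is the more common textbook choice.

```python
import random
import statistics

# Hypothetical yields (tonnes per acre) for plots grouped by pesticide use.
high_pesticide = [30, 32, 29, 31, 33, 30]
low_pesticide = [36, 38, 35, 37, 39, 36]

def permutation_p_value(a, b, n_shuffles=10_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the group labels are arbitrary
        diff = abs(statistics.mean(pooled[:len(a)])
                   - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)

p_value = permutation_p_value(high_pesticide, low_pesticide)
print(f"p-value: {p_value:.4f}")
```

A small p-value says the observed difference would be unusual if pesticide level did not matter. It does not say why the difference exists, so other factors such as season or location still need to be checked.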
Predictive modelling without machine learning? This article clearly describes the boundary between the two. One line from it answers the question aptly: “Predictive Analytics is a use and Machine Learning is a technique”.
Replicating a hypothesis on new data: the main objective of your survey and hypothesis testing is to predict the future with reasonable confidence. So if you say that more pesticide application increases your produce, then this hypothesis should also hold true in future (new) data. This ensures that what you predict stays in line with reality.
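Here is a small sketch of what checking a finding against new data could look like, using plain least-squares regression rather than anything fancier (lm in R does the same job). The model is fitted on past seasons and then scored on a season it has never seen. All numbers and variable names are hypothetical.

```python
# Hypothetical past seasons: pesticide applications vs. yield (tonnes per acre).
past_x = [10, 20, 30, 40, 50]
past_y = [12, 19, 31, 42, 49]

# Hypothetical new season, kept aside and never used for fitting.
new_x = [15, 35, 45]
new_y = [16, 36, 45]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(past_x, past_y)

# Predict the unseen season and measure the mean absolute error.
predictions = [intercept + slope * x for x in new_x]
mae = sum(abs(p - y) for p, y in zip(predictions, new_y)) / len(new_y)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, MAE on new data={mae:.2f}")
```

If the error on the new season is about as small as the error on the past seasons, the finding replicates. If it is much larger, the pattern was probably specific to the old data.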
Faizan, thank you for the follow up and the articles, this gives me so much to go with.
One more question for you: what would be similar questions to ask of a dataset from residential and commercial property management software? Example attributes of the dataset would be properties, tenants, vendors (repair), and related transactions.
The approach I described above is pretty general and could be applied to many diverse datasets (with some exceptions of course; you will have to adjust it to your needs).
For property management data, it would be interesting to look at correlations, such as between properties and their tenants, and at time-series analysis of the attributes.
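As a toy illustration of both ideas, the sketch below computes a Pearson correlation between property age and yearly repair spend, and a 3-month rolling mean of repair costs to smooth a time series. The attributes and numbers are invented for the example; the real dataset may call for different pairings.

```python
import math

# Hypothetical per-property data: building age (years) and yearly repair spend.
property_age = [5, 12, 20, 33, 41]
repair_spend = [800, 1500, 2100, 3900, 4200]

# Hypothetical monthly repair costs across the whole portfolio.
monthly_repairs = [1200, 900, 1500, 1100, 1300, 1700]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def rolling_mean(values, window):
    """Mean of each consecutive run of `window` values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

r = pearson(property_age, repair_spend)
smoothed = rolling_mean(monthly_repairs, window=3)

print(f"correlation(age, repair spend) = {r:.3f}")
print("3-month rolling mean:", [round(v, 1) for v in smoothed])
```

A strong correlation here would only suggest that older properties cost more to maintain; it would still need the hypothesis-testing step before you rely on it.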