Please explain statistical modeling to an SQL Developer

predictive_model
python
statistics

#1

Sup guys!!
I am a Business Intelligence and data architect with more than 10 years of experience in data analytics, ETL and reporting, and I still have the following lame questions about statistical models. I am really keen for meaningful answers. (Apologies if you feel these are very basic questions that I should already know the answers to; sadly, I have a very limited idea of the topic.)
I have a very strong command of SQL, ETL, data visualization and reporting technologies (9.5/10).

What are Predictive/statistical models?

• Is it a series of instructions, like a normal program written in any programming language such as C or SQL?
• If it’s a program, why can’t we use stored procedures?
• Why are R/Python the ideal languages for them?

What is the ideal/actual output of a statistical model?

• I have a vague idea of its output: it provides some sort of score against various criteria for each line item/sample (feel free to correct me).

What is “Training model”?

• No idea at all, or I don’t know how to express what I know. I know a little bit about it and have even worked closely with people (a team of young data scientists fresh out of grad school) who use data I have provided to train their models, but I still have no clue what the heck they do when they say they are “training their models”.


#2

Hi @kotvir

What are Predictive/statistical models?

  • Suppose you are given data with details of a person’s loan history, like the loan amount, EMI, etc., and the requirement is to find out (or predict) whether the loan will be approved or not. For this task, you need a predictive model, which uses the given data to predict the result.

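  • A stored procedure runs fixed rules that someone wrote by hand. A model instead works out its rules from the data, so if the data changes, or the details for a new person are added, the model should take that into account and work accordingly. That is why stored procedures alone do not do the job.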

  • Python and R have specially designed packages for statistical computing and predictive modeling and hence are extensively used.
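
To make the stored-procedure comparison concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the toy loan data is made up, and `fixed_rule` and `learn_cutoff` are hypothetical names, not part of any package. It only shows the idea that a model derives its rule from the data, while a hand-written rule stays the same until someone edits it.

```python
# Toy data, invented for illustration: (loan_amount, approved) pairs.
history = [(1000, 1), (2000, 1), (3000, 1), (6000, 0), (7000, 0), (9000, 0)]


def fixed_rule(loan_amount):
    """Stored-procedure style: the cutoff is typed in by a person."""
    return 1 if loan_amount <= 5000 else 0


def learn_cutoff(rows):
    """Model style: choose the cutoff that gets the most past cases right."""
    candidates = sorted(amount for amount, _ in rows)

    def correct(cutoff):
        # Count the rows where "approve if amount <= cutoff" matches the label.
        return sum((1 if amount <= cutoff else 0) == label
                   for amount, label in rows)

    return max(candidates, key=correct)


cutoff = learn_cutoff(history)

# When new past cases arrive, the learned cutoff is recomputed from the data.
new_history = history + [(4000, 1), (5000, 1)]
new_cutoff = learn_cutoff(new_history)

print(cutoff, new_cutoff)
```

The logic of `fixed_rule` never changes unless someone rewrites it, whereas `learn_cutoff` gives a different answer as soon as the data does. Real models learn far richer rules than one cutoff, which is where the Python and R packages come in.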

What is the ideal/actual output of a statistical model?

  • Considering the previous example, you need to predict whether the loan will be approved or not. So the output of your prediction could be 0 or 1.

  • The predicted values are checked against the actual values for each row to find out how good the model is. The higher the score, the better the model. There are various metrics you can use to measure the performance of your model.

  • So the output for each row is the predicted value, 0 or 1, while the score describes the overall model performance.
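
A small sketch of the difference between per-row predictions and the overall score, using accuracy (the share of rows predicted correctly) as one possible metric. The `actual` and `predicted` lists are made-up values for illustration, and `accuracy` is a hypothetical helper written here, not a library function.

```python
# Made-up values for illustration: one entry per row (loan application).
actual = [1, 0, 1, 1, 0]     # what really happened
predicted = [1, 0, 0, 1, 0]  # what the model said for each row


def accuracy(actual, predicted):
    """Fraction of rows where the prediction matches the actual value."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


score = accuracy(actual, predicted)  # 4 of the 5 rows match
print(score)
```

In SQL terms, `predicted` is a new column on each row, while `score` is a single aggregate over the whole table.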

What is “Training model”?

  • When you provide the data, it contains the details of each person along with the result (that is, whether the loan was approved or not). This is basically like knowing what happened in past cases and using that knowledge to predict values for a new case.

  • Suppose your model knows that a person with loan amount x and EMI y didn’t get a loan in the past. If a similar case occurs for the next person, your model can predict that the answer is 0.

  • In short, the data has two parts, train data and test data. Train data has the details along with the actual value (loan approved or not) while the test data only has the details (you have to predict whether the loan will be approved or not). So you will train the model on train data and use this model to predict the values for test data.
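
Here is a minimal sketch of the train/test idea in plain Python, assuming the simplest possible "model": look up the most similar past case and copy its result (a nearest-neighbour approach, chosen here only because it matches the "similar case" intuition above). The data is made up, and `train_model` and `predict` are hypothetical names. In practice you would more likely use a package such as scikit-learn.

```python
# Toy train data, invented for illustration: (loan_amount, emi, approved).
train = [
    (1000, 100, 1),
    (2000, 150, 1),
    (8000, 900, 0),
    (9000, 950, 0),
]

# Test data has the details only; the result is what we want to predict.
test = [(1500, 120), (8500, 920)]


def train_model(rows):
    """'Training' for this simple model just means remembering past cases."""
    return list(rows)


def predict(model, loan_amount, emi):
    """Return the result of the most similar past case."""
    def squared_distance(row):
        return (row[0] - loan_amount) ** 2 + (row[1] - emi) ** 2

    nearest = min(model, key=squared_distance)
    return nearest[2]


model = train_model(train)
predictions = [predict(model, amount, emi) for amount, emi in test]
print(predictions)
```

Most real models do more during training than memorise rows (they fit parameters that summarise the data), but the workflow is the same: fit on rows where the answer is known, then predict rows where it is not.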