Find an algorithm for my prevention system to predict incidents

classification
clustering

#1

Dear sir

I would like to build a system to support prevention activities in an emergency response center or community. To do that, I need to find an algorithm that my prevention system can use to predict incidents. I am a beginner in data analysis and in the clustering, classification, and association algorithms used to analyze data and find patterns that predict incidents.
Could you advise which clustering or classification algorithm I should apply for this purpose?

  1. Total data size: 5,068 records

  2. Each record has the street, day, time, age, place, and incident type. Using the 5 factors (street, day, time, age, place), I want to predict which incident will occur on a given day and time, at a given street and place, for a given age. Using these algorithms, I would like to recommend prevention activities to the 911 emergency center and the community so that they can help each other.

  3. Data structure
    Street     Day    Time   Age  Place        Incident type
    Bangcheon  Tues   10:04  81   Home         High pressure
    Bangcheon  Tues   23:14  44   Home         Falling
    Dongcheon  Tues   16:27  62   Office       High pressure
    Bulro      Tues   13:16  86   Home         Injured
    Anshim3-4  Wed    11:55  79   Home         Diabetic
    Anshim3-4  Tues   18:48  57   Home         Diabetic
    Heomok     Wed    11:04  79   Mountain     Falling
    Heomok     Fri    05:00  42   Residential  Falling
    Jijeo      Thurs  08:47  61   Home         Diabetic
    Anshim3-4  Tues   18:48  57   Home         Injured
    Anshim3-4  Tues   18:48  57   Home         ??? (which incident?)


#2

@darum2002 - Hi, there is a wide variety of techniques you can use for a classification problem. Starting with clustering, you can use k-means to find patterns in your data, which can help improve the performance of your model.

There is also a wide variety of classification models, such as logistic regression, decision trees, and random forests, that you can use to build your classifier. I would suggest starting with a logistic regression model for variable selection and then moving to a random forest to improve the performance of the model.

Hope this helps!

Regards,
Hinduja


#3

Dear sir
How are you? Thank you so much for your explanation.
When I analyzed the data, I applied K-means, KNN, and J48.
But one of my colleagues told me that I cannot apply K-means, because only supervised algorithms, such as classification algorithms, can be applied to this data.
But I do not know how supervised and unsupervised algorithms differ.
What I understood is that K-means is an unsupervised algorithm.

Also, I understand that a linear algorithm is only for numeric estimation.

What do you think?

thanks so much for your kindness.

best regards


#4

@darum2002 - Yes, that is right: if you are solving a classification problem, you have to use a supervised algorithm, for example a decision tree or logistic regression. But k-means helps find clusters in the data, which can reduce the complexity of the predictive model.

supervised algorithm - The known outputs (a target variable) are provided with the data and are used to train the model to produce the desired outputs.

unsupervised algorithm - No target variable is provided; instead, the data is grouped into clusters.
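To make the two definitions concrete, here is a minimal sketch in plain Python. The rows are a few toy records in the layout from post #1, and the function names (`train_most_common_by_place`, `kmeans_1d`) and the starting centers are assumptions made up for illustration; in practice a tool such as Weka or a library would do this work.

```python
from collections import Counter, defaultdict

# Toy rows in the thread's layout: (place, age, incident_type).
rows = [
    ("Home", 81, "High pressure"),
    ("Home", 44, "Falling"),
    ("Office", 62, "High pressure"),
    ("Home", 79, "Diabetic"),
    ("Home", 57, "Diabetic"),
    ("Mountain", 79, "Falling"),
]

def train_most_common_by_place(rows):
    """Supervised: uses the incident type (the target) during training."""
    counts = defaultdict(Counter)
    for place, _age, incident in rows:
        counts[place][incident] += 1
    # For each place, remember the incident type seen most often.
    return {place: c.most_common(1)[0][0] for place, c in counts.items()}

def kmeans_1d(values, centers, iterations=10):
    """Unsupervised: groups values by similarity only, no target is used."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

model = train_most_common_by_place(rows)
print(model["Home"])  # prediction learned from labeled rows

ages = [age for _place, age, _incident in rows]
centers, groups = kmeans_1d(ages, centers=[40.0, 80.0])
print(groups)  # two age groups found without looking at incident types
```

The first function cannot be trained without the incident column, while the second never looks at it; that is the whole difference between the two families.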

Hope this helps!

Regards,
Hinduja


#5

region_place_day_time_age_incidents_final_data_data_standard.zip (58.1 KB)

Dear sir

Thank you so much for your kindness.
Now I fully understand.
I have attached a file with the data set…

Admin_dong : district
incident_place : incident place
incident_day : incident day


pressure_p : blood pressure
falling_p : falling
diabet_p : diabetes
injury_p : injury
heart_p : heart disease…

Each incident type is stored as a count (numeric type).
But Weka needs it to be a nominal type when analyzing it.
I changed the incident data types to nominal and applied K-means.
Is that the right thing to do?
I attached the data set as a zip file.

thanks
really thanks
best regards


#6

@darum2002,

The basic difference between supervised and unsupervised learning is the presence or absence of a target variable. In supervised learning, we have a target variable to predict, whereas unsupervised learning is used for clustering the population into groups (similar records in the same group).

Your data set has a target variable (Incident_Type), so I would recommend starting with supervised learning and checking the accuracy of the model.

You can also use an unsupervised learning algorithm (K-means) to create k clusters and then fit a separate supervised model to each cluster.
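As a rough illustration of "checking the accuracy of the model", the sketch below holds out part of the data and scores a majority-class baseline on it. The helper names, the 30% test fraction, and the toy labels are assumptions for illustration only; any real classifier should be scored the same way and should beat this baseline.

```python
import random
from collections import Counter

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def majority_baseline_accuracy(train_labels, test_labels):
    """Always predict the most common training label; report hold-out accuracy."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority)
    return majority, correct / len(test_labels)

# Toy incident labels (hypothetical, not taken from the real file).
labels = ["Falling"] * 7 + ["Diabetic"] * 3
train, test = train_test_split(labels)
majority, accuracy = majority_baseline_accuracy(train, test)
print(majority, accuracy)
```

If a trained model is not clearly more accurate than this baseline on the held-out rows, the features probably do not carry enough information yet.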

Hope this helps!

Regards,
Sunil


#7

@darum2002:

I think Sunil has summed it up really well.
We can help you further only once you provide the data. Your previous upload doesn’t seem to work.

Some quick thoughts which you might want to consider before actually modeling the data:

  1. You have only 5 features but the problem you are trying to solve is dependent on many more factors. You might want to look for more open source information. Check out: http://www.analyticsvidhya.com/blog/2015/03/building-features-variables-open-data/

  2. Your dataset has about 5,000 records, which will be further split into training and test sets. For a practically deployable solution, I think you will need more data. However, if you are trying to model a very specific geographic area (like a district), then it might work out.

  3. Try feature engineering, i.e. creating columns with more directly useful information. For example, converting the time of day into morning, afternoon, evening, and night might make more sense. Check out: http://www.analyticsvidhya.com/blog/2015/03/feature-engineering-variable-transformation-creation/
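The time-of-day conversion from point 3 could look something like this minimal Python sketch. The function name `part_of_day` and the hour boundaries (6, 12, 18, 22) are assumptions chosen for illustration; pick boundaries that match how incidents actually cluster in your data.

```python
def part_of_day(time_str):
    """Map an 'HH:MM' string to a coarse time-of-day label."""
    hour = int(time_str.split(":")[0])
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"  # 22:00 to 05:59

# Times taken from the sample rows in post #1.
for t in ["10:04", "23:14", "16:27", "05:00", "18:48"]:
    print(t, part_of_day(t))
```

The new column has only four values, so a model sees many more examples per value than it would with raw minute-level times.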

Hope this helps. Please feel free to discuss further.

Thanks,
Aarshay


#8

region_place_day_time_age_incidents_final_data_data_standard.csv (106.7 KB)

Dear sir
Thanks for your kindness.
Your advice helped me understand the data analysis.

  1. I attached my file so that it can be analyzed with clustering and classification.
  2. To apply those algorithms, do I need to change the data type from numeric to nominal?

After testing on districts, I will expand to the whole city of my hometown (Daegu, Korea).
thanks so much for your advice
Happy new year.
Best regards


#9

Hi @darum2002,

I’m still unable to access the data.

Regarding algorithms:

  • I believe that one should use algorithms when it’s difficult to make a decision by human intuition. So, I would recommend that you refrain from clustering for now. In your first attempt, just focus on building a model for 1 district and check whether it makes sense.

  • While expanding to the entire city, you can consider using clustering to identify which groups of districts have similar behavior and can be modeled together. But this is a second step.

  • One question: Is the data with about 5,000 records for 1 district? If yes, it would be great. Else, I reiterate my point that you may need more data. I think you should have at least 500-1000 rows for 1 district.

Regarding changing data type:

  • This depends on the algorithm, the language being used, and how you represent the table in your code.

  • Algorithms: Logistic regression requires all variables to be numeric. So if you have a nominal variable with 2 categories - ‘A’, ‘B’, you’ll have to code them as maybe 0 and 1. Other algorithms might not require this.

  • Language: Some languages might do the required conversion for you. For instance, I think the caret package in R converts the variables for you automatically.

  • Code Representation: I use Python and if my column has 2 values: 0 and 1, I can specify this column as nominal and Python would treat them as 2 strings: ‘0’ and ‘1’. Please check this for your language. Just remember to specify the datatype in your code.
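As a small illustration of the numeric coding described above, here is a sketch in plain Python. The helper name `one_hot` and the sample values are assumptions for illustration; libraries usually provide an equivalent conversion.

```python
def one_hot(values):
    """Turn a nominal column into one 0/1 column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Two categories can be coded directly as 0 and 1.
binary = [{"A": 0, "B": 1}[v] for v in ["A", "B", "A"]]
print(binary)  # [0, 1, 0]

# More than two categories (e.g. the place column) need one column each.
places = ["Home", "Office", "Home", "Mountain"]
categories, encoded = one_hot(places)
print(categories)  # ['Home', 'Mountain', 'Office']
print(encoded)     # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Coding three or more categories as 0, 1, 2 would imply an order that a nominal variable does not have, which is why each category gets its own column here.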

I hope I got your query right and my explanations help. I think your next step should be to filter out the data for 1 district and model it. Take a bunch of similar districts if 1 district has fewer than 500 rows.

I would recommend starting with Logistic Regression and then trying more advanced algorithms like Random Forest, SVM, etc.
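The row-count check could be sketched like this in plain Python. The function name and the row counts are made up for illustration (the district names are borrowed from the sample in post #1), and the 500-row threshold is just the rule of thumb mentioned above.

```python
from collections import Counter

MIN_ROWS = 500  # rule-of-thumb threshold from the discussion above

def districts_with_enough_rows(district_column, min_rows=MIN_ROWS):
    """Split district names into those with enough rows and those without."""
    counts = Counter(district_column)
    enough = sorted(d for d, n in counts.items() if n >= min_rows)
    too_small = sorted(d for d, n in counts.items() if n < min_rows)
    return enough, too_small

# Hypothetical district column: 600 rows for one district, 120 for another.
district_column = ["Anshim3-4"] * 600 + ["Heomok"] * 120
enough, too_small = districts_with_enough_rows(district_column)
print(enough)     # ['Anshim3-4']
print(too_small)  # ['Heomok']
```

Districts in the second list are the ones to pool with similar districts before modeling.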

Have a great year ahead!

Cheers,
Aarshay


#10

Dear sir
Thank you so much; your explanation helped me a lot to understand what I need to do.
I attached the file again, this time covering only one district; hopefully it can be downloaded now.
After this, I will expand to the whole city.

https://drive.google.com/open?id=0BxS1r7rxF_B3QmpuSmJMNUtLaTg

My city has 8 districts and a population of 2.5 M. Each district has around 0.4 M.
After I get a model for predicting incidents, I will expand to the whole city.

As you suggested, I will apply logistic regression first, and after that I will try other algorithms.
I will add more data, such as weather, people with disabilities, elderly people who live alone, area characteristics, and the population composition of residential areas.

Thanks so much for your advice.
After applying logistic regression, I will also upload the results.

Best regards
Henry Huh from Korea


#11

Sounds like a plan 👍

The volume of data also appears great. All the best!

Cheers,
Aarshay