Sampling of San Francisco crime data on the basis of category

r
machine_learning
data_science

#1

I am working on the San Francisco crime data set (more than 800,000 rows), and I want to divide it into five subsets. I am going to build one model on each subset and then use these five models to predict the category. Please help me with the sampling. One category has only 6 rows while another has 174,900 rows. How should I handle this imbalance so that every subset has a representative distribution of categories?

Which sampling method should I use for better prediction? What other ways are there to sample a data set of this size (878,049 rows)? Is it necessary to have every category in all five subsets, or can the data set be divided category-wise?

These are the category frequencies in the train data set.

| Category | Freq |
|-----------------------------|--------|
| TREA | 6 |
| PORNOGRAPHY/OBSCENE MAT | 22 |
| GAMBLING | 146 |
| SEX OFFENSES NON FORCIBLE | 148 |
| BRIBERY | 289 |
| BAD CHECKS | 406 |
| FAMILY OFFENSES | 491 |
| SUICIDE | 508 |
| EMBEZZLEMENT | 1166 |
| LOITERING | 1225 |
| ARSON | 1513 |
| LIQUOR LAWS | 1903 |
| DRIVING UNDER THE INFLUENCE | 2268 |
| KIDNAPPING | 2341 |
| RECOVERED VEHICLE | 3138 |
| DRUNKENNESS | 4280 |
| DISORDERLY CONDUCT | 4320 |
| SEX OFFENSES FORCIBLE | 4388 |
| STOLEN PROPERTY | 4540 |
| TRESPASS | 7326 |
| PROSTITUTION | 7484 |
| WEAPON LAWS | 8555 |
| SECONDARY CODES | 9985 |
| FORGERY/COUNTERFEITING | 10609 |
| FRAUD | 16679 |
| ROBBERY | 23000 |
| MISSING PERSON | 25989 |
| SUSPICIOUS OCC | 31414 |
| BURGLARY | 36755 |
| WARRANTS | 42214 |
| VANDALISM | 44725 |
| VEHICLE THEFT | 53781 |
| DRUG/NARCOTIC | 53971 |
| ASSAULT | 76876 |
| NON-CRIMINAL | 92304 |
| OTHER OFFENSES | 126182 |
| LARCENY/THEFT | 174900 |

PS: I am using R
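
For illustration, the kind of split described above (five subsets, each keeping roughly the same category proportions) could be sketched in base R as below. This is only a sketch under stated assumptions, not a recommended method. The toy data frame `train`, its column `X`, and the helper `assign_folds` are hypothetical stand-ins. Only the `Category` column name is taken from the table above.

```r
set.seed(42)

# Hypothetical toy data standing in for the real train set;
# the counts are made up, only the category names come from the table.
train <- data.frame(
  Category = rep(c("LARCENY/THEFT", "ASSAULT", "GAMBLING", "TREA"),
                 times = c(1000, 400, 25, 6)),
  X = runif(1431)
)

k <- 5  # number of subsets

# Hypothetical helper: within each category, shuffle the row positions
# and deal them round-robin into k folds, so each fold receives
# (almost) the same number of rows from every category.
assign_folds <- function(category, k) {
  fold <- integer(length(category))
  for (idx in split(seq_along(category), category)) {
    # index with sample.int() so a single-row category is handled safely
    shuffled <- idx[sample.int(length(idx))]
    fold[shuffled] <- rep_len(seq_len(k), length(shuffled))
  }
  fold
}

train$fold <- assign_folds(train$Category, k)
subsets <- split(train, train$fold)   # list of five data frames

# Rows per category in each subset
print(table(train$Category, train$fold))
```

With this scheme, a category that has at least five rows (such as TREA with 6) shows up in every subset. A category with fewer than five rows cannot be present in all of them. Because the dealing always starts at subset 1, the leftover rows of each category land in the lower-numbered subsets, so those end up slightly larger.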