Last Man Standing - Reveal your approach

lastmanstanding
hackathon

#1

Dear all,

Thanks for your participation in LastManStanding. This competition cannot be complete until you share and discuss your approaches. So, let’s do that!

Some pointers to keep the discussion specific:

  • What approaches did you try during the hackathon?
  • What worked and what did not work?
  • Did you create any additional features / variables?
  • What are the approaches you would have tried, if you had some more time?
  • What were the problems you faced? How did you solve them? Do you want answers to any unanswered questions you might have?
  • What data cleaning did you do? Outlier treatment? Imputation?
  • Any suggestions for us (Team AV) to improve the experience further.
  • Finally, what was the best moment for you during the hackathon?

Regards,
Sunil


#2

Is there any way to see scripts from previous contests?

#3

For what it’s worth, here’s my code: https://github.com/rohanrao91/AnalyticsVidhya_LastManStanding
I particularly enjoyed creating the features and seeing them steadily improve the CV and LB scores. I found the time-series pattern pretty much straight away, and from there it was a steady climb.
I will probably write about my approach in detail by tomorrow, but you can go through the code and get some ideas.

This model scored 0.9604 on the public LB and was ranked 2nd.


#4

@Rohan_Rao

How do you choose the values of the parameters inside the xgboost algorithm?
E.g. xgboost(as.matrix(X_train), as.matrix(target), objective="multi:softprob", num_class=3, nrounds=130, eta=0.1, max_depth=6, subsample=0.9, colsample_bytree=0.9, min_child_weight=1, eval_metric='merror')

How did you decide the values of nrounds, eta, max_depth, etc.?
Or was it just intuitive?


#5

@Rohan_Rao - What do you mean by a time-series pattern? Where is it present in the code you have shared?


#6

Hello, I need to know what the train and test files are in all the challenges.


#7

Hi All,

I would like to present my model, but I don’t know how it will be received.

My Model

  1. Imputed missing values with -1 (only this method had a noticeably better impact than imputing with 0, mean, median or mode).

  2. XGB with the booster set to gbtree.

With this single model I achieved 0.8492 (14th place on the LB). A rough sketch of the two steps is below.
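To make those two steps concrete, here is a minimal sketch (not my exact script): train_x, target and the parameter values are placeholders, and it assumes all predictor columns are numeric.

    library(xgboost)

    # Step 1: impute every missing value with -1
    train_x[is.na(train_x)] <- -1

    # Step 2: XGB with the gbtree booster (3-class problem)
    model <- xgboost(data = as.matrix(train_x),
                     label = target,              # integer classes 0, 1, 2
                     booster = "gbtree",
                     objective = "multi:softprob",
                     num_class = 3,
                     nrounds = 100,
                     verbose = 0)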

What I tried

  1. Building a lot of new features (spent almost 2 days), but this failed big time.
  2. Various methods of imputation.
  3. Parameter optimization.

Nothing seemed to work for me. Towards the end of the datahack I almost broke from frustration; nevertheless, I found out how hard I can push myself in the real world.

Thank you AV team for all the hard work.

Code available here

The code is pretty simple since I am not from a developer background.


#8

I had the following approach:

  1. imputed missing values with the group mean based on Pesticide_Use_Category, Crop_Type and Soil_Type (a rough sketch follows this list),
  2. added new features like total weeks, average dose, total dose, etc.,
  3. predicted the final values using XGB.
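Roughly, steps 1 and 2 looked like the sketch below (not my exact code): it assumes the combined train+test data frame is called full, that Number_Weeks_Used is the column with missing values, and that a Number_Weeks_Quit column exists; the derived feature definitions are illustrative guesses.

    library(dplyr)

    full <- full %>%
      # group-mean imputation of the missing values
      group_by(Pesticide_Use_Category, Crop_Type, Soil_Type) %>%
      mutate(Number_Weeks_Used = ifelse(is.na(Number_Weeks_Used),
                                        mean(Number_Weeks_Used, na.rm = TRUE),
                                        Number_Weeks_Used)) %>%
      ungroup() %>%
      # new features: total weeks, total dose, average dose per week
      mutate(Total_Weeks = Number_Weeks_Used + Number_Weeks_Quit,
             Total_Dose  = Number_Doses_Week * Number_Weeks_Used,
             Avg_Dose    = Total_Dose / pmax(Total_Weeks, 1))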

What did not work:

  1. I tried XGB to predict the missing values, but it did not improve my final score.
  2. I tried bucketing Estimated_Insects_Count and added a few more features based on the buckets, but this too failed.
  3. I tried parameter optimization - nothing worked there.

I got stuck on the same score for 2 days. It was very frustrating, and in the end I gave up.

Score: 0.84834, rank: 45 on the LB.

There is a lot to learn yet.

Thank you, AV team, for putting up the hack and providing an opportunity to learn and test our knowledge. Awaiting solutions from the top 5/10 candidates to understand their methods :slightly_smiling:


#9

0.84-0.85 was the ceiling for what one could achieve while ignoring the time-series element of the data.

I would argue that the ID field contained an information leak and therefore should have been discarded from the analysis. Otherwise the whole exercise becomes about imputation/interpolation, not about generalized learning.

I’m confused and disappointed that I did not catch this pattern in the IDs… I guess spotting these sorts of patterns is what analysis is all about.

Good brain teaser, after all.

P.S. Is it possible to score submissions after the deadline to see the “would-be” score?

Here are tips on automatic tuning of XGBoost:


Two useful presentations here:
https://www.kaggle.com/forums/f/15/kaggle-forum/t/17120/how-to-tuning-xgboost-in-an-efficient-way

TL;DR: You can use caret; otherwise, create a custom grid search with an apply() loop. A rough sketch of the latter is below.
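Something along these lines would do for the custom grid search (a sketch, not tested on this dataset): it assumes X_train and target exist as in the call quoted in post #4, and that a recent xgboost version is installed where xgb.cv returns an evaluation_log with a test_merror_mean column (the column name differs in very old versions).

    library(xgboost)

    # candidate parameter combinations
    param_grid <- expand.grid(
      eta              = c(0.05, 0.1),
      max_depth        = c(4, 6, 8),
      subsample        = c(0.8, 0.9),
      colsample_bytree = c(0.8, 0.9)
    )

    # score each combination with 5-fold CV on multi-class error
    cv_scores <- apply(param_grid, 1, function(p) {
      cv <- xgb.cv(data        = as.matrix(X_train),
                   label       = target,           # integer classes 0, 1, 2
                   objective   = "multi:softprob",
                   num_class   = 3,
                   eval_metric = "merror",
                   nrounds     = 130,
                   nfold       = 5,
                   eta         = p["eta"],
                   max_depth   = p["max_depth"],
                   subsample   = p["subsample"],
                   colsample_bytree = p["colsample_bytree"],
                   verbose     = 0)
      min(cv$evaluation_log$test_merror_mean)
    })

    # best combination by CV error
    param_grid[which.min(cv_scores), ]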


#10

I chose the parameters by hand-tuning them, picking the combination that gave me the lowest CV score.


#12

Hi,

This is my first competition on AV and it has been quite an interesting one for me. As far as my approach is concerned:
My basic model is an xgboost with multi:softprob, like most of you, which took me to ~85% on the public LB. I saved my OOF predictions to pit them against the ground truth to improve my next level. However, I didn’t find any more potential signal in the data that could give me an accuracy lift. Very interestingly, though, I did find multiple patterns which took me to 96.4% on the LB. Here are the most significant ones:

  1. In the data (after binding the train and test sets), there are contiguous blocks of Estimated_Insects_Count with a maximum difference of 1.

  2. Inside each such contiguous block there are sub-blocks, first with Crop_Type = 0 and then Crop_Type = 1.

  3. Inside the sub-blocks there are mini-blocks with Soil_Type = 0 and then Soil_Type = 1.

  4. Inside each mini-block there is a monotonically non-decreasing sequence of response values, which holds for 100% of the data.

  5. Since the pattern is deterministic and not probabilistic, I chose to write my own algorithm to modify my model outputs with it, instead of engineering features to feed to my model (that would also have given the lift, as Rohan confirms). This alone took me to 94.2%. A rough sketch of the idea follows this list.

  6. Another critical pattern: inside each mini-block & response value combination there is a steadily non-decreasing pattern of Number_Doses_Week, which holds for 99.32% of the data, so this can also be treated as more or less deterministic. This helped me correct the transition entries from 0 to 1, as the pattern only holds for the response value 0 and not for 1 and 2.

  7. After I post-processed my outputs from step 5 with the step 6 algorithm (which I also wrote separately), it took me to 96.4%.
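To give an idea of the step-5 post-processing, here is a very rough sketch (not my actual algorithm): it assumes the combined train+test frame full can be ordered by ID, that pred_class holds the model’s predicted class as 0/1/2, and it rebuilds the mini-blocks with a simple change-point rule before forcing the predictions to be non-decreasing.

    library(dplyr)

    full <- full %>%
      arrange(ID) %>%
      # start a new mini-block whenever the insect count jumps by more than 1
      # or Crop_Type / Soil_Type changes
      mutate(new_block = abs(Estimated_Insects_Count -
                             lag(Estimated_Insects_Count,
                                 default = first(Estimated_Insects_Count))) > 1 |
                         Crop_Type != lag(Crop_Type, default = first(Crop_Type)) |
                         Soil_Type != lag(Soil_Type, default = first(Soil_Type)),
             block = cumsum(new_block)) %>%
      group_by(block) %>%
      # enforce the monotonically non-decreasing response within each block
      mutate(pred_class = cummax(pred_class)) %>%
      ungroup()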

There were some more patterns in the data that I think I could have mined, but I didn’t get much time to work on those.

I used R for my XGBoost and Python (Keras) for my NN (~84.2%).

Thanks to the organizers for such a nice competition. I will put my whole code on GitHub as soon as I get time. Congrats to all the other participants for their hard work and success.

Best regards,
Bishwarup


#13

Rohan, can you tell me about the time-series pattern? Where is it in your code?


#14

Hello guys,
Really enjoyed the hackathon.
I have a question regarding the final score taken into account. Is it not the maximum score across our submissions, or is some other criterion also applied?
My highest score, which I submitted at the end, was 0.8493, but the score the hackathon took was 0.84800. I need some clarity so that I can be more careful next time.

Cheers,
Raghudeep


#15

@rremani
The public leaderboard comprised only 30% of the dataset; the private leaderboard used the entire dataset for ranking. I think (in fact, I’m sure) you overfitted to that 30% public portion, which is why your accuracy went down on the full dataset.


#16

@gau2112 Ohhh, OK… Thanks a lot, man!


#17

Pattern recognition at its best. Nothing to do with classical data science, though - just poor ID encoding on the part of the competition organizers.


#18

Hi Rohan,

Good show. One query: in lines 34-35, why did you use rbind on the train and test sets? What’s the reason behind it? And if you have used it, does it not affect the predictive power, i.e. improve it?

Thanks,
Brijesh


#19

Hi Bishwarup,

Can you please tell us how you managed to find all the patterns in the data?
Is there a visualisation trick or tool for finding such patterns?

Regards,
Shankar Chavan


#20

Hi Rohan,

I want to know why you imputed missing values with -1. What is the logic behind it?

Thanks
Prateek


#21

The order of observations is important across train and test. Hence, I form the groups by using them together, ordered by ‘ID’, to create the features. A minimal illustration is below.

In most other contests, I would not do this.
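Something like this sketch (not the exact code from my repository): stack the train and test sets, order by ID, and build group-based features on the combined frame before splitting it back. Crop_Damage as the target column name and the two example features are placeholders here.

    library(dplyr)

    test$Crop_Damage <- NA                 # align columns before stacking
    full <- rbind(train, test) %>%
      arrange(ID) %>%
      group_by(Estimated_Insects_Count, Crop_Type, Soil_Type) %>%
      mutate(pos_in_block = row_number(),  # position of the row inside its group
             block_size   = n()) %>%       # size of the group
      ungroup()

    train_fe <- filter(full, !is.na(Crop_Damage))
    test_fe  <- filter(full,  is.na(Crop_Damage))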