Share your approach - DataHack Premier League


Hi all,

Hope you have seen the awesome contest we launched recently. Now, marry your data science skills and cricket mastery to create predictive models that predict the performance of players on the stage that matters the most in cricket.

Check out more details about the contest:

Please use this thread to share your thoughts, suggestions and approaches.


Benchmark for DataHack Premier League 2018

Can we have an example of the RMSE calculation for the player prediction table?


We have 187 players featuring in IPL 2018.
For simplicity, let us assume that there is only one match with 2 players in each team (MI and CSK), for whom you have predicted 10, 15, 18 and 20 runs respectively.

On the basis of the actual toss result, the relevant predictions for each player would be picked from your submission, and RMSE would be calculated across all players and matches.
For example, if each player actually scored 14 runs, the RMSE would be:

square root of (((10 - 14)^2 + (15 - 14)^2 + (18 - 14)^2 + (20 - 14)^2) / 4) = square root of ((16 + 1 + 16 + 36) / 4) = square root of 17.25 ≈ 4.15

The wicket RMSE and Extras RMSE would be calculated in a similar manner.
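
A minimal sketch of that calculation in Python, using the toy numbers from the example (predictions of 10, 15, 18 and 20 runs against an actual score of 14 for each player). The function name `rmse` is only illustrative and is not part of the contest's evaluation code.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of numbers."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example from the post: four players, each actually scoring 14 runs.
predicted_runs = [10, 15, 18, 20]
actual_runs = [14, 14, 14, 14]

# Squared errors are 16, 1, 16 and 36; their mean is 17.25.
print(round(rmse(predicted_runs, actual_runs), 2))  # 4.15
```

The same function would apply unchanged to the wickets and extras columns.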


Thanks Ankit for sharing the perspective. We need to predict one more binary column, playing_xi_flag, from the pool of players sold at auction to a particular team. E.g. 11 players need to be selected out of 25 for CSK; if I actually predict 8 players correctly, how would the RMSE be calculated (will the sample size be 8 or 11)? Needless to say, the values for the incorrect selections will be negated. Am I missing anything?


I think I have got the answer. The values (runs, wickets) of incorrectly predicted players will be 0 and will cost you the squared error of the actual value, unless the actual value is 0. The sample size will remain 11.


First, let’s build a forecast table for each match, filled with average values for wickets and runs. Then we adjust or modify it.
Use a derived metric: consistency.
For each team, bowler, stadium and year, work out the consistency per match, for example a 10% chance of 4 wickets or a 30% chance of 2 wickets.
Store it as teamA_bowler name, since he might have played for different teams in the past, then aggregate the consistency over all IPL matches. To do that, create six columns (column A for 1 wicket, column B for 2 wickets, and so on) and fill each with 0 or 1. When you aggregate, you will know the overall consistency.
Similarly, for batsmen we can have aggression. For Glenn Maxwell, consistency (2) × aggression (4) = 8, whereas for Gautam Gambhir, consistency (3) × aggression (3) = 9, which is better than 8.
And AB de Villiers has really not performed well enough in past IPLs.
We need to see whether a batsman is more consistent at a particular stadium, say regularly scoring 25 runs, so his consistency bracket is 25 runs, like Ajinkya Rahane. But if he scores, say, two consecutive 50s and then gets out before 15 in the third match, that is a dipping factor. Is it going to have an impact? How many players actually show a dipping factor and how many don’t? In the forecast table we then replace the value with the consistent value. Mumbai Indians are also more consistent and aggressive in the later half of the series. You need to bring this out through plots, if we can generate this information from charts.
Next, to prepare the forecast table we can use average × probability, so plot the average, average × probability, and the actual values.
If consistency is more than 60%, replace the value. Prepare the table for each match per stadium.
If unpredictability matters more, reduce the average value. For example, the last 5 batsmen, who are actually bowlers, are more unpredictable. But can we predict the unpredictable ones? Which algorithm can help us? We can try neural networks.
For example, RCB has this problem: if they fail to score at a run rate above 8 in the last 7 overs, they tend to lose the match even if they score 180+. Probably the low number of sixes in the last 7 overs hurts them.
All this will help in generating rules for both winning and predicting.
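
One possible reading of the consistency idea, sketched in Python. The post does not define consistency precisely, so this assumes it means the share of past matches in which a bowler took exactly k wickets (the aggregate of the 0/1 columns), and that the "more than 60%" rule replaces the average with the most frequent value. The function names and the wicket history are hypothetical.

```python
from collections import Counter


def wicket_consistency(wickets_per_match):
    """Share of matches in which the bowler took exactly k wickets.

    Equivalent to building one 0/1 column per wicket count and
    averaging each column over all matches.
    """
    counts = Counter(wickets_per_match)
    total = len(wickets_per_match)
    return {k: counts[k] / total for k in sorted(counts)}


def forecast_value(history, threshold=0.6):
    """Use the most frequent value when it occurs in more than
    `threshold` of matches; otherwise fall back to the plain average."""
    average = sum(history) / len(history)
    value, count = Counter(history).most_common(1)[0]
    if count / len(history) > threshold:
        return value
    return average


# Hypothetical wicket counts for one bowler over ten past matches.
history = [2, 0, 2, 1, 2, 4, 2, 2, 2, 2]

print(wicket_consistency(history))  # {0: 0.1, 1: 0.1, 2: 0.7, 4: 0.1}
print(forecast_value(history))      # 2, because 2 wickets occurs in 70% of matches
```

The same two functions could be applied per stadium or per season by filtering the history before passing it in.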


Hi, quite a good approach to quantify terms like consistency and aggression. Although I am naive as far as modelling is concerned, speaking purely from a common-sense perspective, shouldn’t we get external data on the recent performance of players in domestic, international or other T20 leagues? For example, Lasith Malinga has been outstanding in previous IPLs, but if he had played this IPL he would have been far less lethal.
Also, how do you analyze runs/wickets for players who are playing the IPL for the first time?


Let’s take the top four batsmen of RCB and their averages over the past 3 IPL seasons:
Quinton de Kock: 21 runs (21 balls)
Virat Kohli: 30 runs (25 balls)
AB de Villiers: 30 runs (25 balls)
Sarfaraz: 22 runs (20 balls)
So in total we have 103 runs (91 balls), a run rate of only about 6.8 per over. In the first 10 overs, 65/2 if we take an average target, Virat and Quinton did the right job, but after that the next 4 batsmen have to score at a very high strike rate of more than 150.
If you see this link, the only RCB player with a strike rate above 150 and avg. of .22 is Kedar Jadhav, and now they have included Brendon McCullum, with avg. 29 and strike rate of 145. But I think RCB needs one more player from the top 25, or one bowler with a somewhat better average and strike rate than Hardik Pandya, Axar Patel or Jadeja, who can make a difference in the last 3-4 overs. Once you do that, your chance of crossing 175 looks good, I say 7/10 times.
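
For anyone who wants to check the arithmetic, here is a small Python sketch that adds up the four averages quoted in the post and converts them to a run rate (runs per six-ball over) and a strike rate (runs per 100 balls). The helper names are illustrative only.

```python
def run_rate(runs, balls):
    """Runs per over, with six balls to an over."""
    return runs * 6 / balls


def strike_rate(runs, balls):
    """Runs scored per 100 balls faced."""
    return runs * 100 / balls


# (average runs, average balls faced) as quoted in the post.
top_four = {
    "Quinton de Kock": (21, 21),
    "Virat Kohli": (30, 25),
    "AB de Villiers": (30, 25),
    "Sarfaraz": (22, 20),
}

total_runs = sum(runs for runs, _ in top_four.values())
total_balls = sum(balls for _, balls in top_four.values())

print(total_runs, total_balls)                         # 103 91
print(round(run_rate(total_runs, total_balls), 2))     # 6.79
print(round(strike_rate(total_runs, total_balls), 2))  # 113.19
```

So the combined top four score at close to 7 per over, well short of the 150 strike rate asked of the batsmen who follow.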



Would the RMSE evaluation include only the predicted playing XI?


The RMSE calculation won’t be affected by the selection of playing XI. For a valid submission you must assign 0 runs and wickets to players not in playing XI. RMSE would then be calculated using the actual wickets and runs scored in IPL 2018. The actuals for players not in playing XI would have zero value for both wickets and runs.
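
A rough Python sketch of how the zero-fill rule plays out, assuming the error is averaged over every player in the squad. The thread does not confirm the exact denominator, so treat that as an assumption. Player names and numbers are made up.

```python
import math


def rmse_with_zero_fill(predicted, actual, squad):
    """RMSE over a squad where any player missing from `predicted`
    or `actual` (i.e. not in that playing XI) counts as 0."""
    squared_errors = [
        (predicted.get(player, 0) - actual.get(player, 0)) ** 2
        for player in squad
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical four-player squad.
squad = ["A", "B", "C", "D"]
predicted_runs = {"A": 30, "B": 20}  # C and D predicted to sit out
actual_runs = {"A": 25, "C": 16}     # B sat out, C played and scored 16

# Squared errors: A = 25, B = 400, C = 256, D = 0; mean = 170.25
print(round(rmse_with_zero_fill(predicted_runs, actual_runs, squad), 2))  # 13.05
```

Player B (predicted to play but left out) and player C (left out of the prediction but played) both add to the error, while D, correctly left out, adds nothing.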



Presenting a simple dashboard showing the performances of top batsmen:!/vizhome/IPL_DATA/IPLBatsmanDashboard

Please feel free to play around and give your valuable comments.


Ankit, I have a few doubts. If you are online, please respond.


@prathaps Please ask your queries on this thread.


Well, I am new to ML and at the beginning of my learning path, but I want to try using R. I don’t know how to build a model, or what needs to be considered and what doesn’t. I have visualized the data in Tableau and got some insights. Can you point me to a source where I can learn to build the model, and to clean my data of old players who are not in IPL 2018? I am interested in working on this but don’t know where to start. I found you to be knowledgeable, so it would be helpful if you could guide me.


Any perspective on how to select/predict the playing XI in every match? Do we need to predict the playing XI, or select it based on experience/assumptions and recent developments in other forms of cricket around the world? E.g. Santner has been ruled out of the IPL for CSK, and Md. Shami is unlikely to play the IPL for DD due to the recent controversy.


You are allowed to use any information openly available to aid your predictions.


Can you tell me which factors should be considered? It would help me in building a good model. Waiting for your response.


Hi,

Can anyone please share thoughts on how runs and wickets should be predicted for the new players in the test data (player_predictions)? The new players don’t have any historical data, so what strategy should be considered for this scenario?



You could use the openly available performance data for such players from other leagues and international cricket and integrate these with your submissions.
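
One simple way to wire that in, sketched in Python: use a player's IPL average where it exists, fall back to an average from other leagues or internationals, and use a pool-wide default only when nothing is available. The function name, the fallback order and all the numbers are assumptions for illustration, not part of the contest rules.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def baseline_runs(player, ipl_runs, external_runs, pool_default):
    """Pick a baseline runs prediction for one player.

    Order of preference: IPL history, then external data
    (other leagues / internationals), then a pool-wide default.
    """
    if ipl_runs.get(player):
        return mean(ipl_runs[player])
    if external_runs.get(player):
        return mean(external_runs[player])
    return pool_default


# Hypothetical per-match run histories.
ipl_runs = {"veteran": [30, 20, 40]}
external_runs = {"debutant": [18, 26]}
pool_default = 12.0

print(baseline_runs("veteran", ipl_runs, external_runs, pool_default))   # 30.0
print(baseline_runs("debutant", ipl_runs, external_runs, pool_default))  # 22.0
print(baseline_runs("unknown", ipl_runs, external_runs, pool_default))   # 12.0
```

External averages may need scaling before use, since scoring conditions differ between leagues.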