In this article, in iteration 4, for S1, UCB1 is calculated as follows:
10+2*sqrt(ln(3)/2)
Shouldn't it be the following instead?
20+2*sqrt(ln(3)/2)
The UCB1 formula is given as:
UCB1(Si) = Vi + 2*sqrt(ln(N)/ni)
where Vi is the average reward/value of all nodes beneath this node. Does that mean Vi at S1 drops from 20 in iteration 3 to 10 in iteration 4 because S1 has two more children in iteration 4? If so, I don't understand exactly why. Can someone please explain?
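
To make the two candidates concrete, here is a minimal Python sketch that evaluates both expressions. It assumes UCB1 has the form Vi + c*sqrt(ln(N)/ni) with c = 2, which is what the numbers above suggest. The function name `ucb1` and its parameter names are my own, not from the article. The last part checks one possible reading of "average": if Vi were total reward divided by visit count, a total of 20 over 2 visits would give 10. That total of 20 is an assumption on my part.

```python
import math


def ucb1(avg_value, parent_visits, node_visits, c=2.0):
    """UCB1 score, assuming the form V_i + c * sqrt(ln(N) / n_i)."""
    return avg_value + c * math.sqrt(math.log(parent_visits) / node_visits)


# Value the article reports for S1 in iteration 4 (V_i = 10, N = 3, n_i = 2).
article_value = ucb1(10, 3, 2)

# Value I expected instead (V_i = 20, same N and n_i).
expected_value = ucb1(20, 3, 2)

# Hypothetical reading: V_i = total reward / number of visits.
# The total of 20 over 2 visits is an assumption, not taken from the article.
total_reward, visits = 20, 2
average_value = total_reward / visits

print(round(article_value, 3), round(expected_value, 3), average_value)
```

The two candidates differ only in the Vi term, so the question comes down to how Vi is updated between iterations 3 and 4.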