The point seems to be that, at least as currently implemented in the Stockfish testing framework, the SPRT test computes the following (given the experimental data, including the draw ratio):
a) Probability that the Elo difference is -1.5
b) Probability that the Elo difference is +4.5
In Larry's case, both (a) and (b) are essentially infinitesimal, so a (much) higher-order effect then needs to kick in to terminate the test [a comparison of tails of size (10±epsilon) sigma, or something of that order]. What one might want instead is:
c) Probability that the Elo difference is -1.5 or worse
d) Probability that the Elo difference is +4.5 or better
But I don't think this is what is actually implemented. In the code posted by Lucas Braesch, the comments conflate this point, IMO:
# alpha = max typeI error (reached on elo = elo0)
# beta = max typeII error for elo >= elo1 (reached on elo = elo1)
Yet when one looks at what the code actually does, the comment about "elo >= elo1" does not seem proper. To wit:
# Probability laws under H0 and H1
P0 = bayeselo_to_proba(elo0, drawelo)
P1 = bayeselo_to_proba(elo1, drawelo)
# Log-Likelihood Ratio
result['llr'] = (R['wins'] * math.log(P1['win'] / P0['win'])
                 + R['losses'] * math.log(P1['loss'] / P0['loss'])
                 + R['draws'] * math.log(P1['draw'] / P0['draw']))
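That is, the LLR is evaluated at exactly two point hypotheses, elo0 and elo1, not over the ranges "elo <= elo0" and "elo >= elo1". For concreteness, here is a self-contained sketch of the computation above; the bayeselo-to-probability conversion and the helper names follow the conventions in the quoted code, but this is my own reconstruction, and the numbers (drawelo = 240, a made-up W/L/D record) are illustrative only:

```python
import math

def bayeselo_to_proba(elo, drawelo):
    # Convert a BayesElo value plus a draw_elo parameter into
    # win/draw/loss probabilities (logistic model).
    P = {}
    P['win'] = 1.0 / (1.0 + 10.0 ** ((-elo + drawelo) / 400.0))
    P['loss'] = 1.0 / (1.0 + 10.0 ** ((elo + drawelo) / 400.0))
    P['draw'] = 1.0 - P['win'] - P['loss']
    return P

def llr(R, elo0, elo1, drawelo):
    # Pointwise log-likelihood ratio of H1 (elo = elo1) against
    # H0 (elo = elo0) -- both treated as simple point hypotheses.
    P0 = bayeselo_to_proba(elo0, drawelo)
    P1 = bayeselo_to_proba(elo1, drawelo)
    return (R['wins'] * math.log(P1['win'] / P0['win'])
            + R['losses'] * math.log(P1['loss'] / P0['loss'])
            + R['draws'] * math.log(P1['draw'] / P0['draw']))

# Hypothetical record: 520 wins, 480 losses, 1000 draws.
R = {'wins': 520, 'losses': 480, 'draws': 1000}
print(llr(R, -1.5, 4.5, 240.0))
```

A positive LLR drifts the test toward accepting elo1, a negative one toward elo0; note that nowhere does the computation involve any elo value other than the two endpoints.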
One can simulate the "better/worse" conditions by taking, say, the probabilities for 4.5, 4.55, 4.6, 4.65, etc., and combining them appropriately (perhaps integrating over an expected patch distribution, similar to a previous discussion). But in the Stockfish testing environment, where most patches seem to land reasonably close to the "target" window, the difference between (a)/(b) and (c)/(d) might not be that great.
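One crude way to realize that "combining" idea is to average the multinomial likelihood over a grid of elo values at or beyond each boundary (uniform weights here), and take the ratio of the two averages. This is purely an illustrative sketch under my own assumed grid spacing, weights, and helper names; it is not what the posted code does:

```python
import math

def bayeselo_to_proba(elo, drawelo):
    # Same logistic BayesElo-to-probability conversion as in the quoted code.
    P = {}
    P['win'] = 1.0 / (1.0 + 10.0 ** ((-elo + drawelo) / 400.0))
    P['loss'] = 1.0 / (1.0 + 10.0 ** ((elo + drawelo) / 400.0))
    P['draw'] = 1.0 - P['win'] - P['loss']
    return P

def log_lik(R, elo, drawelo):
    # Multinomial log-likelihood of the observed W/L/D record at a given elo.
    P = bayeselo_to_proba(elo, drawelo)
    return (R['wins'] * math.log(P['win'])
            + R['losses'] * math.log(P['loss'])
            + R['draws'] * math.log(P['draw']))

def avg_log_lik(R, elos, drawelo):
    # Log of the average likelihood over a grid of elo values,
    # computed via log-sum-exp for numerical stability.
    logs = [log_lik(R, e, drawelo) for e in elos]
    m = max(logs)
    return m + math.log(sum(math.exp(x - m) for x in logs) / len(logs))

R = {'wins': 520, 'losses': 480, 'draws': 1000}
drawelo = 240.0
# H0: elo <= -1.5, approximated by a grid stepping down from the boundary;
# H1: elo >= 4.5, approximated by a grid stepping up from the boundary.
grid0 = [-1.5 - 0.05 * i for i in range(40)]
grid1 = [4.5 + 0.05 * i for i in range(40)]
llr_composite = avg_log_lik(R, grid1, drawelo) - avg_log_lik(R, grid0, drawelo)
print(llr_composite)
```

A truly composite test would extend the grids further and weight them by a prior over patch strength; truncating at a couple of Elo beyond each boundary is just to keep the sketch short.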
