It's been a while since I've posted anything of substance on computer chess, but the AlphaZero work has prodded me to change that.
Although I am quite impressed by the result, and even more so by the fact that they achieved it tabula rasa (as they call it), they have (after all) written a scientific paper on it, and thus I feel obliged to offer scientific criticism on the matter.
1. The strength of Stockfish 8
The paper says that SF8 ran on "64 threads and a hash size of 1GB", searching something like 70 million positions per second (on average?). No hardware is specified, but I admit to doing the same in computational papers that I've written, at least when only a rough idea of the time taken is needed (less defensible here, where exactitude seems more important).
Anyway, I have to seriously criticize this small hash for Stockfish. They said that the "tournament time control" (?!) of 1 min/move was used (seemingly via UCI's "movetime", though that's again unclear). I haven't perused the sardine-packing of Stockfish's hash entries lately, but even at 8 bytes per entry this would be only 128 million positions. At the above speed, even with a low 10% saving rate (again I assume SF is not saving in q-search), the hash would be full in 1/3 of the time of one search. Moreover, given that SF8 uses (I think) a lazy approach to parallelism via hash-sharing, this seems (quite) inadequate to my mind. I guess one could do experiments to determine how much this RAM paucity hurts SF, and I'd guess it is less than the observed Elo differential to AlphaZero, but it still looks quite dubious to someone in the know about computer chess. In any event, this is small potatoes compared to the next item.
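To put numbers on that, here is a quick back-of-the-envelope sketch of the hash-fill arithmetic (the 8-byte entry size and 10% store rate being my rough guesses above, not figures from the paper):

```python
# Rough arithmetic for how quickly a 1 GB hash fills at the paper's reported speed.
# The entry size and store rate are guesses, not figures from the paper.
hash_bytes = 2**30               # 1 GB hash, as stated in the paper
entry_bytes = 8                  # assumed size of one hash entry
nps = 70_000_000                 # ~70 million positions searched per second
store_rate = 0.10                # assume only 10% of searched nodes get stored
move_time = 60                   # 1 minute per move

entries = hash_bytes // entry_bytes            # ~134 million entries
fill_seconds = entries / (nps * store_rate)    # ~19 seconds
print(f"{entries / 1e6:.0f}M entries, full after ~{fill_seconds:.0f}s "
      f"of a {move_time}s search ({fill_seconds / move_time:.0%})")
```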
2. Hardware comparisons
I think this has already been brought up by others elsewhere, but there is no real indication of how strong 4 TPUs are compared to the Stockfish setup. Perhaps this is a good reason that "software championships" are bogus anyway...
Google's TPUs are not the best documented except by hype-ish word-of-mouth, but I've seen numbers from 50-100 trillion operations per second, compared to some plausible upper bound of around 1 trillion for the 64-thread SF setup (and in reality, particularly with hash issues, I'd guess the effective number is about 100 billion at most). Again I'm guessing as much as anyone else, and TFLOPS might not be the most relevant measure, but this does tend to indicate (at least) a 100x hardware edge for AlphaZero, possibly something more like 1000x or more. The paper, in my mind rather cheekily, never really attempts to discuss this hardware comparison issue, which I'd think is of critical importance (of course, they are more concerned with the learning aspects, but even handwaving this would be better than ignoring it completely). There are diminishing returns from speed-doubling at some point, but even so I'd guess 100x to be at least 200 Elo here, much more than the observed 100 Elo differential.
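To give a feel for what such a speed edge might be worth, here is a rough conversion using the old rule of thumb of roughly 50 Elo per doubling of speed - that figure is itself just an assumption, and diminishing returns would shave it down at these depths:

```python
import math

# Back-of-the-envelope Elo gain from a raw speed advantage, using an assumed
# ~50 Elo per doubling of speed (the true figure shrinks at longer time controls).
def elo_from_speedup(speedup, elo_per_doubling=50):
    return elo_per_doubling * math.log2(speedup)

for factor in (10, 100, 1000):
    print(f"{factor:>5}x faster -> ~{elo_from_speedup(factor):.0f} Elo")
```

Even at a more conservative 30 Elo per doubling, a 100x edge still comes to roughly 200 Elo, double the differential actually observed.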
If anything, the lack of cluster-sized parallelism in chess (beyond ZCT, Jonny, and a few others) has always annoyed me, as I've never found the TCEC-esque huger-and-huger machine model to be that relevant. Indeed, in my own Losing Chess research I had to deal with 128 cores across 8 machines (typically), and came up with a (rather simplistic) approach based upon the goals of that project. I don't foresee any of the top engines becoming (usefully) cluster-capable anytime soon, which is what would seem to be needed to balance the hardware comparison. Maybe we'll just have to admit the single CPU/RAM model is dead. (Yea!).
My guess (from "single machine" being noted) is that the "4 TPUs" are on one chip/module (cf. the CPU/core distinction), but even that is not clear.
Perhaps I could state this criticism succinctly: if you are going to talk about the learning aspects of AlphaZero development, fine - but I don't think that "validating" this by a murky comparison to Stockfish should necessarily be the overall focus (one could argue the true domain of the work is at the software/hardware interface, and that you are really showing that TPUs and/or massive parallelism are the future, as opposed to the monolithic model - rather different from a presentation of this as a "mere" software advance). Of course, the 28-0 score will draw the headlines, but to me, it is (at least as described) rather scientifically obscure as to its import. (I've seen such "benchmarketing" in my professional work in computational mathematics, and this is by no means the worst...)
3. The lack of opening book/variety
This again has been noted by others, though I can't say that it's as big a deal as point 1 or (particularly) point 2. Unfortunately, computer chess literature has IMO been historically too happy to accept 100-game tests (which here is fortunately significant) in unclear conditions. Again this seems to me to be something that someone in the know about computer chess would find anomalous, but I don't think it changes the end result too much. The fact that AlphaZero beat Stockfish from each human opening, which it itself had "found" and self-trained on, does raise some flags about bias - from a scientific point of view, they should use an independent collection of openings. They seem(?) to have done this in another 100-game sample (from what I can tell, there are 1300 games: 100 independent bookless games, plus 100 each from 12 popular positions on which AlphaZero self-trained) - at the very least, it should be clarified exactly what took place, and in fact the paper is IMO rather sloppy in describing its exact conditions. (I do like that the French Defense got murdered though!).
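As to why I say the 100-game sample is at least statistically significant, here is a crude computation of the implied Elo differential and its error bars, assuming the headline 28-0 means 28 wins with the other 72 games drawn (consistent with the quoted 100 Elo), and treating the games as independent - a simplification, and certainly not the paper's own methodology:

```python
import math

# Crude Elo difference and standard error from a match score,
# treating each game result as independent.
def elo_from_match(wins, draws, losses):
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    elo = -400 * math.log10(1 / score - 1)
    # per-game variance of the score, then propagate through the logistic curve
    var = (wins * (1 - score) ** 2 + draws * (0.5 - score) ** 2
           + losses * score ** 2) / n
    se_elo = (400 / math.log(10)) * math.sqrt(var / n) / (score * (1 - score))
    return elo, se_elo

elo, se = elo_from_match(28, 72, 0)    # the headline 100-game sample
print(f"~{elo:.0f} +/- {se:.0f} Elo")  # roughly 100 +/- 17
```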
I'm not going to say that running a separate FishTest tuning from each of the 12 chosen opening positions would make the resulting Stockfishes much better here, but that would seem to be the more trenchant comparison if it were feasible.
I've read that a "full paper" is in the works, and maybe it will address these issues to a workable degree - yet, egad, this is another one of my pet gripes: research and "advertising about research" being conflated.
Some more quibbly comments:
In the abstract, the phrase "achieved within 24 hours" means little when unanchored to the computational power used (in fact, the text says it took just 4 hours to outperform SF, which is much more than merely "superhuman" to my mind).
Although I'm no expert on the (expected) rate of convergence, I'd expect a logarithmic x-axis would be more useful in Figure 1.