Stockfish question

General discussion about computer chess...
mcostalba
Posts: 91
Joined: Thu Jun 10, 2010 11:45 pm
Real Name: Marco Costalba

Re: Stockfish question

Post by mcostalba » Mon Jul 12, 2010 6:17 pm

hyatt wrote: The problem is that if you test A vs A', you might get +30 with a search change. Then if you run A and A' against a range of engines, you may only get +5. It is easier, and more accurate, to use the same test setup each time, so there are no accidental changes to skew the results unknowingly.
I agree 100% with your diagnosis... but not with the conclusion ;-)

This is not a "problem", this is a "feature" !!

As a developer I am not interested in ELO accuracy, but in reliably distinguishing good patches from bad ones. And these are two _different_ targets.

Your example is more theoretical than practical. In the real world, when you are modifying an already mature engine, what normally happens is that when you test A vs A' you might get +10 ELO, while if you run A and A' against a range of engines you may only get +3 ELO.

The difference between the two cases is not the quantitative ELO result; the _fundamental_ difference is that the first case is detectable by a test, while the second is _not_, because you are well below the error margin. So, when you don't have a cluster, A vs A' allows you to detect as good a much broader set of patches than testing A and A' against a range of engines does.
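Just to put rough numbers on "well below the error margin" (an illustrative back-of-the-envelope sketch only, assuming a per-game score standard deviation of about 0.4 and the usual normal approximation, not our real test scripts):

// Rough 95% error margin, in ELO, for a match of n games.
// Assumes ~0.4 standard deviation of the per-game score (typical with a
// 30-40% draw rate) and linearises the score->ELO curve around 50%.
#include <cmath>
#include <cstdio>

double eloErrorMargin95(int games, double perGameSigma = 0.4) {
    double scoreMargin = 1.96 * perGameSigma / std::sqrt(static_cast<double>(games));
    double eloPerScorePoint = 1600.0 / std::log(10.0);   // slope of ELO(score) at 50%
    return scoreMargin * eloPerScorePoint;
}

int main() {
    int sizes[] = {1000, 5000, 20000, 40000};
    for (int n : sizes)
        std::printf("%6d games: +/- %.1f ELO\n", n, eloErrorMargin95(n));
    // ~17, ~8, ~4 and ~3 ELO respectively: a +10 self-play result clears the
    // margin after a few thousand games, a +3 gauntlet result needs ~40000.
    return 0;
}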

IOW, if you sort the test results into good / bad / unknown, then when you test only against a range of engines a lot of good patches end up in the 'unknown' bucket, many more than when you run A vs A'.

If you choose to discard all but the reliably good ones, then when testing only against a range of engines be prepared to discard a lot of good stuff.


P.S.: Please don't come back with something along the lines of: "Of course, with 1 million games you can be sure!"... I think you have understood my point.

Rebel
Posts: 515
Joined: Wed Jun 09, 2010 7:45 pm
Real Name: Ed Schroder

Re: Stockfish question

Post by Rebel » Mon Jul 12, 2010 6:36 pm

hyatt wrote:
Rebel wrote:
hyatt wrote: For the record, I don't like eng-eng tests when they are the same version except for one feature turned off. Tends to exaggerate the gain or loss of any change.
My experience is different. As a rule of thumb, test positional changes against a wide range of engines, and test search-related changes against your own.

Ed
The problem is that if you test A vs A', you might get +30 with a search change. Then if you run A and A' against a range of engines, you may only get +5. It is easier, and more accurate, to use the same test setup each time, so there are no accidental changes to skew the results unknowingly.
While your reasoning makes perfect sense, consider mine: testing search-related changes against a wide range of engines may affect the ELO measurement, because every program has its own "Angstgegner" (bogey opponents), especially when the gain is small, say +/- 5-10 ELO. You can avoid that by playing against your own engine, because what you test is then 100% about out-searching your previous version.

I agree you won't get an exact ELO performance, but what you do learn is whether the search change is an improvement or not. After that you can always run a verification round against a wide range of other engines.

Ed

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: Stockfish question

Post by hyatt » Mon Jul 12, 2010 6:54 pm

mcostalba wrote:
hyatt wrote: The problem is that if you test A vs A', you might get +30 with a search change. Then if you run A and A' against a range of engines, you may only get +5. It is easier, and more accurate, to use the same test setup each time, so there are no accidental changes to skew the results unknowingly.
I agree 100% with your diagnosis... but not with the conclusion ;-)

This is not a "problem", this is a "feature" !!

As a developer I am not interested in ELO accuracy, but in reliably distinguishing good patches from bad ones. And these are two _different_ targets.

Your example is more theoretical than practical. In the real world, when you are modifying an already mature engine, what normally happens is that when you test A vs A' you might get +10 ELO, while if you run A and A' against a range of engines you may only get +3 ELO.

The difference between the two cases is not the quantitative ELO result; the _fundamental_ difference is that the first case is detectable by a test, while the second is _not_, because you are well below the error margin. So, when you don't have a cluster, A vs A' allows you to detect as good a much broader set of patches than testing A and A' against a range of engines does.

IOW, if you sort the test results into good / bad / unknown, then when you test only against a range of engines a lot of good patches end up in the 'unknown' bucket, many more than when you run A vs A'.

If you choose to discard all but the reliably good ones, then when testing only against a range of engines be prepared to discard a lot of good stuff.


P.S.: Please don't come back with something along the lines of: "Of course, with 1 million games you can be sure!"... I think you have understood my point.

You miss the point. I have already given examples in the past where in A vs A' a change shows up as +5, while in A vs gauntlet and A' vs gauntlet, A' is 5 ELO _worse_. I don't like "unknown" because those can be bad _or_ good. I want only good...

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: Stockfish question

Post by hyatt » Mon Jul 12, 2010 6:59 pm

Rebel wrote:
hyatt wrote:
Rebel wrote:
hyatt wrote: For the record, I don't like eng-eng tests when they are the same version except for one feature turned off. Tends to exaggerate the gain or loss of any change.
My experience is different. As a rule of thumb, test positional changes against a wide range of engines, and test search-related changes against your own.

Ed
The problem is that if you test A vs A', you might get +30 with a search change. Then if you run A and A' against a range of engines, you may only get +5. It is easier, and more accurate, to use the same test setup each time, so there are no accidental changes to skew the results unknowingly.
While your reasoning makes perfect sense, consider mine: testing search-related changes against a wide range of engines may affect the ELO measurement, because every program has its own "Angstgegner" (bogey opponents), especially when the gain is small, say +/- 5-10 ELO. You can avoid that by playing against your own engine, because what you test is then 100% about out-searching your previous version.

I agree you won't get an exact ELO performance, but what you do learn is whether the search change is an improvement or not. After that you can always run a verification round against a wide range of other engines.

Ed
What do you do if the change you make causes your program to search more deeply along some lines that highlight a fault in your old program, so that it performs better against it, but when you try it against a gauntlet it looks worse? This is similar to developing a new drug to treat some specific bacterial infection, putting the same bacteria into two test tubes, adding your new drug to one and the old drug to the other, and then using that to decide whether the new drug should replace the old. Of course, that might miss the key fact that the new drug has some extremely bad side effects when actually introduced into human physiology, where the old one did not. "Inbred testing" simply doesn't give me any confidence in the result as it relates to real OTB play.

Sentinel
Posts: 122
Joined: Thu Jun 10, 2010 12:49 am
Real Name: Milos Stanisavljevic

Re: Stockfish question

Post by Sentinel » Mon Jul 12, 2010 7:05 pm

hyatt wrote:You miss the point. I have already given examples in the past where in A vs A' a change shows up as +5, while in A vs gauntlet and A' vs gauntlet, A' is 5 ELO _worse_. I don't like "unknown" because those can be bad _or_ good. I want only good...
If it's a search (not evaluation) change and you get this opposite-direction result, I would say your gauntlet is wrong for that particular change, and not the other way around. So the change is positive in an absolute sense, but your gauntlet is bad/unrepresentative and makes the change look negative.

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: Stockfish question

Post by hyatt » Mon Jul 12, 2010 7:16 pm

Sentinel wrote:
hyatt wrote:You miss the point. I have already given examples in the past where in A vs A' a change shows up as +5, while in A vs gauntlet and A' vs gauntlet, A' is 5 ELO _worse_. I don't like "unknown" because those can be bad _or_ good. I want only good...
If it's a search (not evaluation) change and you get this opposite-direction result, I would say your gauntlet is wrong for that particular change, and not the other way around. So the change is positive in an absolute sense, but your gauntlet is bad/unrepresentative and makes the change look negative.

Suppose A doesn't extend passed-pawn pushes at all, and you add code to A' so that it does. Now A' wins a few more games against A, suggesting the change is better. Is it really? What if, against others, this new extension is really just a time-waster and you actually end up playing worse? If this idea worked, it would be just as safe to play A and A' against _one_ common opponent and keep/reject the change based on that test. So why don't we do that? One opponent simply doesn't give an accurate representation of what has changed, IMHO. I make lots of changes to Crafty that slightly improve the results against one opponent in the gauntlet and hurt against the others, for an overall negative effect...

Sentinel
Posts: 122
Joined: Thu Jun 10, 2010 12:49 am
Real Name: Milos Stanisavljevic

Re: Stockfish question

Post by Sentinel » Mon Jul 12, 2010 7:29 pm

hyatt wrote:
Sentinel wrote:If it's a search (not evaluation) change and you get this opposite-direction result, I would say your gauntlet is wrong for that particular change, and not the other way around. So the change is positive in an absolute sense, but your gauntlet is bad/unrepresentative and makes the change look negative.
Suppose A doesn't extend passed-pawn pushes at all, and you add code to A' so that it does. Now A' wins a few more games against A, suggesting the change is better. Is it really? What if, against others, this new extension is really just a time-waster and you actually end up playing worse? If this idea worked, it would be just as safe to play A and A' against _one_ common opponent and keep/reject the change based on that test. So why don't we do that? One opponent simply doesn't give an accurate representation of what has changed, IMHO. I make lots of changes to Crafty that slightly improve the results against one opponent in the gauntlet and hurt against the others, for an overall negative effect...
What I'm saying is that it is possible that in your gauntlet the net change is negative because, for that particular change, the majority of the gauntlet engines (simply by coincidence) have a built-in anti-game that your original engine doesn't have. However, does the majority of all other existing engines really have that particular built-in anti-game for the given change, making it really bad in an absolute sense? You can never know.
The only way to prove the gauntlet is not unrepresentative is to run the test again with a completely different gauntlet.

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: Stockfish question

Post by hyatt » Mon Jul 12, 2010 10:44 pm

Sentinel wrote:
hyatt wrote:
Sentinel wrote:If it's a search (not evaluation) change and you get this opposite-direction result, I would say your gauntlet is wrong for that particular change, and not the other way around. So the change is positive in an absolute sense, but your gauntlet is bad/unrepresentative and makes the change look negative.
Suppose A doesn't extend passed-pawn pushes at all, and you add code to A' so that it does. Now A' wins a few more games against A, suggesting the change is better. Is it really? What if, against others, this new extension is really just a time-waster and you actually end up playing worse? If this idea worked, it would be just as safe to play A and A' against _one_ common opponent and keep/reject the change based on that test. So why don't we do that? One opponent simply doesn't give an accurate representation of what has changed, IMHO. I make lots of changes to Crafty that slightly improve the results against one opponent in the gauntlet and hurt against the others, for an overall negative effect...
What I'm saying is that it is possible that in your gauntlet the net change is negative because, for that particular change, the majority of the gauntlet engines (simply by coincidence) have a built-in anti-game that your original engine doesn't have. However, does the majority of all other existing engines really have that particular built-in anti-game for the given change, making it really bad in an absolute sense? You can never know.
The only way to prove the gauntlet is not unrepresentative is to run the test again with a completely different gauntlet.

That is always possible. But it is much less likely to happen with 5 opponents than with just 1. Every time I test A vs A', the change is exaggerated compared to the results when I test A and A' against a gauntlet. And occasionally the two tests are contradictory: A vs A' says keep A', while the gauntlet says keep A, or vice versa...

mcostalba
Posts: 91
Joined: Thu Jun 10, 2010 11:45 pm
Real Name: Marco Costalba

Re: Stockfish question

Post by mcostalba » Tue Jul 13, 2010 7:02 am

hyatt wrote:Every time I test A vs A', the change is exaggerated compared to the results when I test A and A' against a gauntlet. And occasionally the two tests are contradictory: A vs A' says keep A', while the gauntlet says keep A, or vice versa...
I think you will agree that the first thing is definitely a "feature" for a developer (who is not a public rating-list tester); the second is a rare negative side effect.

To mitigate the latter we discard A vs A' results that are below a certain threshold, say 5-10 ELO. You may have observed that when the latter happens, it almost always happens when the A vs A' difference is very small anyhow. So we only have to worry about the rare case among rare cases of a big A vs A' difference showing up with the opposite sign when playing a gauntlet, and these almost theoretical exceptions cannot invalidate a method that is good and efficient in 99.9% of cases.
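Something like this kind of filter (the threshold value, the names and the good/bad/unknown buckets here are purely illustrative, not our actual framework):

// Illustrative accept/reject filter for self-play (A vs A') results:
// trust a patch only when the measured gain clears both the error margin
// and a fixed threshold, otherwise leave it in the "unknown" bucket.
enum class Verdict { Good, Bad, Unknown };

Verdict judgeSelfPlay(double measuredElo, double errorMargin95,
                      double thresholdElo = 8.0) {   // illustrative 5-10 ELO cutoff
    if (measuredElo - errorMargin95 > thresholdElo)
        return Verdict::Good;      // clearly above noise and threshold: keep
    if (measuredElo + errorMargin95 < 0.0)
        return Verdict::Bad;       // clearly a regression: discard
    return Verdict::Unknown;       // too small or too noisy: discard or retest
}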

As a teacher of mine used to say: "The optimum is the enemy of the good." He meant that "good" is useful, while "optimum" can be much less useful ;-)

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: Stockfish question

Post by hyatt » Tue Jul 13, 2010 8:14 pm

mcostalba wrote:
hyatt wrote:Every time I test A vs A', the change is exaggerated compared to the results when I test A and A' against a gauntlet. And occasionally the two tests are contradictory: A vs A' says keep A', while the gauntlet says keep A, or vice versa...
I think you will agree that the first thing is definitely a "feature" for a developer (who is not a public rating-list tester); the second is a rare negative side effect.

To mitigate the latter we discard A vs A' results that are below a certain threshold, say 5-10 ELO. You may have observed that when the latter happens, it almost always happens when the A vs A' difference is very small anyhow. So we only have to worry about the rare case among rare cases of a big A vs A' difference showing up with the opposite sign when playing a gauntlet, and these almost theoretical exceptions cannot invalidate a method that is good and efficient in 99.9% of cases.

As a teacher of mine used to say: "The optimum is the enemy of the good." He meant that "good" is useful, while "optimum" can be much less useful ;-)
I have only run one A vs A' test in the past 6 months. The result was +30 ELO, +/- 5 (I did not run 30,000 games, because I do not have 15,000 different positions). When I ran against the gauntlet, the result was -3 ELO, +/- 3. A second run was -5, +/- 3. I concluded that the +30 was nonsense, and in fact the idea seemed to be a very slight loss overall.

The change was in LMR reduction depth. I currently reduce by 2 almost everywhere, except for the first "random move chosen", where I use 1, unless I had a TT move, good capture, or killer, in which case the first random move gets the 2-ply reduction as well (a rough sketch of this rule is below). I played with some 3-ply reductions, and while the cluster was down I ran the test on my laptop, where all I could do was use xboard to play Crafty vs Crafty'. The +3 looked pretty good there. But when we got the A/C back up and I ran it on the cluster, it showed up as slightly worse. And that was for a search change, not an eval change, notice.

How often does that happen? Not sure. Early in the development of our cluster-testing methodology we ran a significant number of A vs A' tests, both with search changes and eval changes, and when we compared them to the gauntlet approach (we were trying to discover the fastest way to accept/reject a change, and we tried a _lot_ of different approaches), the results were contradictory often enough to raise a red flag. And absolutely nothing will ever convince me that when the gauntlet and A vs A' disagree, A vs A' is the _right_ answer. I am always more interested in how a change, whatever it is, does against _different_ searches and _different_ evaluations, to give me some confidence that it is better against a general opponent rather than just against a doppelganger with one change between them.
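A rough sketch of the reduction rule just described (hypothetical names, not Crafty's actual code):

// Hypothetical sketch of the LMR depth rule described above: reduce most
// late moves by 2 plies, but only by 1 for the first "random" (otherwise
// unclassified) move, unless a hash move, good capture or killer was
// already searched at this node, in which case that first random move is
// also reduced by 2.
int lmrReduction(bool firstRandomMove,
                 bool hadHashMove, bool hadGoodCapture, bool hadKiller) {
    if (firstRandomMove && !hadHashMove && !hadGoodCapture && !hadKiller)
        return 1;   // nothing "good" came before it: reduce less
    return 2;       // the default reduction everywhere else
}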

We spent a _ton_ of time trying to eliminate as much odd behaviour as possible: no opening books at all, no endgame tables (since different programs are affected by them in different ways), no parallel search (except for a couple of occasions where I would turn it on in both Crafty and an opponent that supported it, to see how the parallel search efficiency measured up), no pondering to avoid time perturbations that add randomness, etc. We then tried to figure out the best way to spend the computational time we have available to get the most bang for our buck. The bottom line is that there is no substitute for large numbers of games against different opponents. This still has holes, since we can't use the commercial programs on our cluster (we have to compile from source due to library issues). We didn't want to add even more holes by using just one opponent, particularly ourselves...

Clearly anyone can test however they want. I believe our testing gives results that are more accurate than any other approach being used, unless someone uses something very similar. In drag racing we use the saying "there is no replacement for displacement" all the time (bigger motor = more torque = faster ET). In testing, there is no replacement for large numbers of games against a set of different opponents and positions.
