Large tablebases

Code, algorithms, languages, construction...
BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Large tablebases

Post by BB+ » Thu Apr 21, 2011 7:35 am

It seems that 7-piece tablebases are being discussed again. I forget whether I mentioned it specifically before, but it seems to me that these can be built in about 1-2 years on a 32/48-core machine with 256GB of RAM (and still a non-negligible portion of the time will be disk I/O). With a bit more work, perhaps 128GB of RAM would suffice. With 512GB/1TB of RAM, the method could be a good factor (2 or 3?) faster (using out-counting rather than the grandfather method). See this post for more about methods to build them. These are not insane RAM amounts from everyone's standpoint; the head of my research group was recently impressed by how many more problems he could solve with 256GB of RAM rather than some lesser amount, and was getting time on a 1TB machine. Extrapolating from 6-piece data, a typical build should take about 8-12 hours, with pawn-full ones being more time-consuming as you need to load info about promotions (special bitbases [pieces on the 8th rank, for instance] could be made for this) -- but there is no implementation of this that I know of. :)

Building them all via a distribution network is another possibility, where each user spends a few days (or weeks, depending on how optimised the code is) to build a given 7-piece TB, with streaming I/O from (each) hard drive at ~100MB/s being the limitation for most of the computation. Making them Internet-available is again a "solved" problem for some high-end users, as at work I can get nearly 8MB/s in up/download speed. Users building pawnfull endgames would need a suitable set of the pawnless ones to handle promotions, though DTC and bitbases should ameliorate the large disk space needed for this. Another worry here is in the debugging phase of any software development, as it is really annoying to track bugs that take multiple days to appear. Given the size of the project, even some OS bugs might be uncovered (such as dealing with large files) -- it is also difficult to debug remotely on someone else's computer, though if the code itself were "peer-reviewed" perhaps any flaws would more readily come to light. This method has been feasible in practise for (at least) the last 5 years, but few seem sufficiently motivated to attack such a project with the dedication needed to see it through (the best-known exception being the currently-private work of Konoval and Bourzutschky). With the increases in readily available HD space over the last few years, perhaps more will become interested, but it is still a formidable task.

The DTC format should be about 50-75TB for the whole set (thus about 3-4 months to download at 8MB/s, less than the time for the rest of the project), and the storage therein is (if nothing else) a reasonable cost compared to the rest of the project, though setting up an always-available RAID for them would be quite a hassle (as opposed to having 30-odd 2TB USB drives sitting on your bookshelf, to be attached when necessary). Bitbases should reduce this by a factor of around 10, so maybe you could have four 2TB drives containing the whole lot. Another option would first be to make "extended 6-piece" TBs; for instance, the "RobboBases" have "BlockedPawn" objects, but this could be generalised to other formations where opposing pawns were on the same file. Another useful tool might be a generic builder program which worked with specific files for pawns (the "RobboBases" already do this it seems, but only for the first pawn), though some conditions about promotions would be needed. For instance, something like KR+fgh vs KR+gh has 9 units, but [perhaps making assumptions about captures] there are only 15 possibilities with facing pawns on each of the gh-files, which reduces the problem by a good factor. Some of the functionality here exists in Freezer, I suppose.
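As a quick check of the facing-pawn count just mentioned (a back-of-envelope sketch only, nothing to do with Freezer or the RobboBases): with a white and a black pawn on the same file and the white one somewhere behind the black one, the white pawn can stand on ranks 2-6 and the black pawn on any higher rank up to 7, giving 5+4+3+2+1 = 15 placements per file.

Code:

# Placements of an opposing pawn pair on a single file: the white pawn on
# ranks 2..6, the black pawn on any higher rank up to 7 (it cannot be on 8).
count = sum(1 for w_rank in range(2, 7) for b_rank in range(w_rank + 1, 8))
print(count)  # 15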

So the "constants" are: total disk space around 50-75TB (with [one-sided?] DTC, and maybe DTZ for pawnfull), and bitbases in maybe 8TB. Internet connectivity at minimally something like 1MB/s (and preferably 10x that) is necessary at some stage in any event, and is crucial for a number of users to possess this if building via a distribution network.

A comparison between one-person RAM builds versus many-person disk builds:
*) Hardware cost: The one-person (RAM) method requires a 32/48-core machine with 256GB of RAM, which costs about $10-15K, and perhaps half that again for electricity. The many-person (distributed) method exploits already-sunk costs, and tries to split the electricity up enough that no one is over-burdened (of course, the RAM system could get "donations" or whatever, and work the same way). There would be a cost of around $3-5K for hard drive storage of all the data for everyone who wants all the data in any event. For bitbases it would be closer to $500. Simply selling a set of four 2TB external HDs with the bitbases pre-installed might be a viable option -- though formatting/writing to each 2TB disk is already nontrivial (taking 6 hours at 100MB/s for each [with parallelisation possible, I guess], though this speed is at least currently an overestimate for a USB drive), and the "secretarial costs" of this might add up over time.
*) Timeline: The one-person RAM build should take about 50 cpu-years, so 1-2 years on the specified machine. It is much harder to guess about the distributed method, as it is not even clear when it is exactly "finished" -- when one person has all the files, or when all the tablebases have been generated by someone, or what? Again extrapolating from 6-piece numbers, my guess is that it would take around 10-20 harddrive-years to generate everything, given suitable code (Nalimov would be much slower than this). With a gung-ho group of 5-10 people with good Internet connections and optimal central planning, this is again around 1-2 years.
*) Software development: This is by far the most crucial step in any of this, as having a lousy format for the tablebases (or dodgy code) could easily make the whole thing a no-go. So whatever path any project takes, it will ultimately be directed by whoever is willing to put in the effort herein. I don't want to underestimate the amount of effort necessary to ensure sound working of either system.

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Large tablebases

Post by BB+ » Sun May 08, 2011 11:43 pm

One can also note that the "out-counting" method is amenable to clusters. That is, one need not have 768GB RAM or some large amount on one machine, but can distribute it over many. For instance, 24 12-core machines (say) with 32GB each should also work. I'm not sure how fast the communication between them needs to be. There might also be some amount of extra RAM overhead needed to process some of the lists.

The idea here is that at every stage of "retrograde generation", one need merely make a list of what things to process at the next step, and these lists can be passed to the cpu that has the relevant memory locations. I don't think this works as easily in the "grandfather" method, as there you need to do a lot of reads (which need not be in the same memory block, though some capacity for such locality should be possible) so as to determine whether a position is lost. With "out-counting", when the count hits 0, the position is lost (and this "count" resides in the RAM of just one of the cpus, so it is "localised" in some sense).

Here is how the out-counting scheme could work; for simplicity, I speak of a 7-piece pawnless DTC setup with 462 1GB slices (based on king position).

First the bootstrapping:
*) For each wtm position, determine whether it is "won in 1", that is, whether there is a conversion that wins, and put all of these in a "won in 1" list.
*) For each btm position, determine first whether Black has a conversion that wins/draws. If not, then determine the number of "outs" that Black has, that is, the number of legal non-conversion moves. If this number is 0, and Black is in check, then note that Black is "lost in 0", and from this retro-generate all possible wtm positions, and via the king-indexing figure out which cpu/RAM-block should have this position appended to its "won in 1" list (there can be some dispute as to the meaning of won/lost in 0 or 1 with DTC, but that is mostly irrelevant).
Then the loop:
*) For each position in the "won in X" list, see if it was already known to be won, and if not then retro-generate all possible btm positions, and via the king-indexing figure out which cpu should have this position appended to its "subtract an out" list.
*) For each position in the "subtract an out" list, subtract an out and if the number of outs is now zero, then note the position is "lost in X", and retro-generate all possible wtm positions, with king-indexing determining which cpu should have this position appended to its "won in X+1" list.
Terminate the loop when either of the lists is empty on all cpus.
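To make the two-list loop concrete, here is a toy, single-process sketch over an abstract game graph (an illustration only: conversions, legality details, the actual king-indexing and the cluster message-passing are all elided, and the position labels are invented):

Code:

# Toy sketch of the out-counting retrograde loop on an abstract game graph.
# moves[p] lists the legal successors of position p, and `losses` holds the
# positions where the side to move is already mated ("lost in 0").  In a real
# 7-piece build each list entry would be bucketed by king-index and shipped to
# whichever cpu owns that RAM slice; here everything stays local.
from collections import defaultdict

moves = {
    # white to move          # black to move
    'w1': ['b1', 'b2'],      'b1': ['w2'],
    'w2': ['b3'],            'b2': ['w1', 'w2'],
    'w3': ['b1'],            'b3': [],            # Black is mated here
}
losses = {'b3'}

preds = defaultdict(list)                         # retrograde (un-move) map
for p, succ in moves.items():
    for s in succ:
        preds[s].append(p)

outs = {p: len(moves[p]) for p in moves if p.startswith('b')}   # Black's outs
won, lost = {}, {p: 0 for p in losses}            # position -> depth
won_list = [p for q in losses for p in preds[q]]  # the "won in 1" list

depth = 1
while won_list:
    sub_list = []                                 # the "subtract an out" list
    for p in won_list:
        if p not in won:                          # skip if already known won
            won[p] = depth
            sub_list.extend(preds[p])
    won_list = []                                 # will become "won in X+1"
    for q in sub_list:
        if q in lost:
            continue
        outs[q] -= 1                              # one fewer out for Black
        if outs[q] == 0:                          # no outs left: lost in X
            lost[q] = depth
            won_list.extend(preds[q])
    depth += 1

print('won :', won)                               # {'w2': 1, 'w1': 2, 'w3': 2}
print('lost:', lost)                              # {'b3': 0, 'b1': 1, 'b2': 2}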

As to the above analysis, this still isn't cheap (though such a cluster might find other uses prior to its depreciation date arriving), it is not really "distributable" as I think the inter-cpu communication at the first stages is quite non-negligible, and would require a lot of software coding/testing. And there's still the problem of what to do with all the data at the end, etc. But for the enterprising entrepreneur, it might be a better bet than having just one machine with a (very) large amount of RAM attached. In any event, my own opinion is that in 2 or 3 years the 7-piece TBs will be sufficiently amenable so that there will be real movement aside from the trailblazers, and in 4 or 5 years it will likely be but a summer's project to construct them.

marcelk
Posts: 52
Joined: Fri Jan 28, 2011 10:27 pm
Real Name: Marcel van Kervinck

Re: Large tablebases

Post by marcelk » Sun Jun 05, 2011 10:16 am

BB+ wrote:One can also note that the "out-counting" method is amenable to clusters. That is, one need not have 768GB RAM or some large amount on one machine, but can distribute it over many. For instance, 24 12-core machines (say) with 32GB each should also work. I'm not sure how fast the communication between them needs to be. There might also be some amount of extra RAM overhead needed to process some of the lists.

The idea here is that at every stage of "retrograde generation", one need merely make a list of what things to process at the next step, and these lists can be passed to the cpu that has the relevant memory locations.

First: It might not be necessary to keep the out-count table in memory at all (nor any closed-bit tables). The idea is as follows:

Let's say you store the out-counts in a file, the 'out file'. Then you have to make sure that you pass over this file in sequential passes. You can do that by postponing all accesses to it by writing the indexes that you want to hit into a buffer file on disk. Set a maximum size of that buffer file based on how much memory you have. If the buffer file grows too large, close it and start a new one. At the end, read each buffer file into memory, one at a time. Sort it in memory and make a sequential pass over the out-file. Decrement the outs and write out indexes of those positions reaching 0, to be used for the next pass. Then go on with the next buffer file, if any.

Now the out file is always accessed sequentially. The algorithm gracefully adapts to the memory size by adding passes. The more memory available, the fewer passes. Memory is only used for sorting batches of work.
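A minimal single-process sketch of that pass structure (again just an illustration: one-byte out-counts in the 'out file', 8-byte little-endian indices in the buffer files, and the file names and toy sizes are invented):

Code:

# Random decrements against a big "out file" are first written as indices into
# bounded buffer files; each buffer is then sorted in memory and applied in
# one ascending (hence essentially sequential) sweep of the out file.
import struct

def write_buffer(path, indices):
    with open(path, 'wb') as f:
        for i in indices:
            f.write(struct.pack('<Q', i))

def apply_buffer(buf_path, out_path):
    """Sort one buffer in memory, sweep the out file, decrement the touched
    counts, and return the indices whose count just reached zero (these seed
    the next iteration's work)."""
    with open(buf_path, 'rb') as f:
        data = f.read()
    idx = sorted(struct.unpack('<%dQ' % (len(data) // 8), data))
    newly_zero = []
    with open(out_path, 'r+b') as out:
        for i in idx:                       # ascending order => forward seeks
            out.seek(i)
            c = out.read(1)[0] - 1
            out.seek(i)
            out.write(bytes([c]))
            if c == 0:
                newly_zero.append(i)
    return newly_zero

# tiny demo: 16 positions, every out-count starting at 2
with open('outs.bin', 'wb') as f:
    f.write(bytes([2]) * 16)
write_buffer('buf0.bin', [3, 7, 3, 12])
print(apply_buffer('buf0.bin', 'outs.bin'))  # [3]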

I think it is most easily done all on one CPU. If you have a cluster, just let them work on different endgames. That keeps the I/O localized. There is only sharing of completed (and compressed) EGTs.

Second: Key in such a project is verification. It would be best if a verification program were written independently from the generator, just based on the index specification and file format. To ease verification it would be helpful to keep track of the/a best move in every position, because then you can also convert random accesses into sequential passes.

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Large tablebases

Post by BB+ » Sun Jun 05, 2011 11:05 pm

marcelk wrote:Let's say you store the out-counts in a file, the 'out file'. Then you have to make sure that you pass over this file in sequential passes.
My contention is that "sequential passes" from a hard drive are already rather slow. For instance, I think GCP's work with Nalimov made it essentially streaming I/O dependent for most of the work, and the rate of that is ~100MB/s. I claim that you aren't going to do any better than this once you turn to hard drive access.

In particular, at the "bulk" level (say at the win-in-250 iteration in the nefarious QB vs RBN) of a 7-piece computation, you really do have quite a lot of approximately "random" (non-localised) data changes to the 462 gigapieces of data at each iteration. So I fail to see how each iteration can take less than something like ~4620 seconds of read/write time, or around an hour. You can buffer all you want, but eventually you are going to have to flush it. This might even be a low estimate, as there is both reading and writing at each iteration, and you also talk about multiple passes for sorting.
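For concreteness, the arithmetic behind that figure, with the streaming rates as assumptions (the 500MB/s row anticipates the RAID number mentioned below):

Code:

# Time for one dense pass over 462 x 1GB of out-counts at assumed streaming
# rates (decimal units, matching the ~4620 s figure above).
data_mb = 462 * 1000
for rate_mb_s in (100, 250, 500):
    hours = data_mb / rate_mb_s / 3600
    print(f'{rate_mb_s} MB/s: {hours:.1f} h for a read pass, '
          f'{2 * hours:.1f} h if everything is written back as well')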

Not every iteration has such a multitude of essentially dense data modifications, but sufficiently many do that you are in the range of a day or more (or even a week in the worst cases) for a typical pawnless 7-piece with no repeated pieces. This is in line with what Bourzutschky has reported regarding the time for Konoval's code (hardware has improved a bit in the last 5 years, but streaming I/O is still slow). In contrast, the "Volodya" user on Rybkaforum reported doing QB vs RBN in 12 hours (presumably with a RAM-based build), and I find it believable. This is maybe around 5-10x faster in general (with all the usual caveats about comparative implementations), though you are probably correct that simply buying 5-10x as many hard drives and 6-core machines is likely comparable overall to investing in a large amount of RAM. The question of (say) pawnful DTM endgames is another matter, as the disk access increases somewhat.

[Bourzutschky said in 2006 regarding krnnkbbn: "This ending could be generated with 64 GB of RAM in a few months on a fast single CPU machine and about 5 terabytes of storage. Any takers?" -- I presume he is taking the bishops to be of opposite colours, as otherwise it would be more like twice as much memory by my accounting. If you "just" want the maximal DTC for this and don't want to save the result, you can likely decrease the needed disk space by a decent factor via using the grandfather method and thus bit arrays rather than byte-sized out-counts. The large amount of RAM listed is likely due to localisation tricks, so that, e.g., when looking at black-to-move positions the relevant White retrogrades are also in memory.]

Regarding verification, I think it is better to just have someone write completely independent generation code, do the calculation, and compare the results. Given that verification already takes ~30%(?) of the time, and is still prone to errors (particularly with the way draws propagate in out-counting), I would be more likely to trust two truly independent computations. But then, that's my attitude toward science in general. :)

Incidentally, I was involved (perhaps even the protagonist) with two projects that did indeed do the same computation of multiplying large (256GB) integers on disk (using an FFT), though it wasn't really "completely independent" as I think some of the underlying fast arithmetic packages that were used had some overlap. See the "News" at http://cims.nyu.edu/~harvey/code/mul_huge and the independent slides http://www.ants9.org/slides/hart.pdf [Curiously, both main programmers, David Harvey and Bill Hart, are originally Australian and now working elsewhere, whereas I am not Australian, but live here].

marcelk
Posts: 52
Joined: Fri Jan 28, 2011 10:27 pm
Real Name: Marcel van Kervinck

Re: Large tablebases

Post by marcelk » Mon Jun 06, 2011 6:27 pm

BB+ wrote:
marcelk wrote:Let's say you store the out-counts in a file, the 'out file'. Then you have to make sure that you pass over this file in sequential passes.
My contention is that "sequential passes" from a hard drive are already rather slow. For instance, I think GCP's work with Nalimov made it essentially streaming I/O dependent for most of the work, and the rate of that is ~100MB/s. I claim that you aren't going to do any better than this once you turn to hard drive access.
I'm willing to give it more thought and work out the case for example on the QB vs RBN with DTM. Today I see a typical RAID controller with mechanical hard drives doing 500MB/s, but indeed it needs both in and out, so we can expect to get half of that.

I'm not sure about SSD performance in RAID. But that might be more expensive than using DRAM in the first place.

The flushing of the buffers should be avoided for at least one of them on every pass. This is possible if the sorted buffer can shrink fast enough and give up space for the outgoing buffer.

The throughput can be improved much more by zipping all streams. Partially pre-sorting the buffer-out stream in large enough chunks will help improve its compression ratio somewhat.

I was wondering if there is documentation on the indexing schemes that are today considered a minimal expectation. One can go very far in squeezing the index space, but I'm not willing to spend excessive time on designing that part. Here is what I have in mind:

Code:

 *  Tables
 *  ------
 *
 *  The table determines the pieces present (regardless of location) and the Pawns
 *  (including their squares). Positions with en-passant are not stored. En-passant
 *  has to be taken into account during generation (when looking at transitions
 *  into another table), or during probing when using the tables.
 *
 *  A table holds the positions for both sides to move. So the side to move indicator
 *  is part of the index, not part of the table id. For balanced endgame classes it
 *  is absent and then the table will have just half the size.
 *
 *  The table id gives the dominant side first. This is the side with the most
 *  Queens, Rooks, Bishops, Knights or Pawns, in that order. (So "KQKRR", not "KRRKQ",
 *  and "KQNKRBN", not "KRBNKQN").
 *  The pieces are listed in the same order. The Pawns also list their square.
 *  For example:
 *      KBNK        (Bishop and Knight against lone King)
 *      KRPa2KBPa3  (Timman vs. Velimirovic 1979)
 *
 *  In case one side has multiple Pawns, the Pawns are ordered alphabetically by
 *  square name. For example: "Pa3Pb2", not "Pb2Pa3".
 *
 *  The board must be mirrored left-right if that yields a table id that comes
 *  earlier lexicographically. For example: "Pc4" instead of "Pf4", and
 *  "Pa2Pb2Ph2" instead of "Pa2Pg2Ph2".
 *
 *  In case of Pawns when the material is balanced, the board may also have to be
 *  flipped White-Black to obtain its normalized table id. In that case we take the
 *  alphabetically smallest table id, possibly after mirroring left-right again.
 *
 *  There is no special treatment for distinguishing between light vs. dark bishops.
 *  There is no provision for castling. These would make the normalizations and
 *  indexing a lot more complex and prone to bugs.
 *
 *  Indices
 *  -------
 *
 *  The highest level of significance is side to move: the lower half is for first
 *  player to move, the higher half is for the second player to move. If the material
 *  is balanced the second half is absent.
 *
 *  Next are the Kings. Both Kings are combined into one index. No indices for Kings next
 *  to or on top of each other. (Kings or pieces on top of Pawns are not further
 *  optimized out of the scheme).
 *
 *  In Pawnless EGTs there is mirroring to put the first King in the a1-d1-d4 triangle.
 *  If the first King is on the a1-d4 diagonal then mirror the second King into the
 *  a1-h1-h8 triangle as well. There is no additional "deep mirroring" of other pieces
 *  when both Kings are on the a1-h8 diagonal.
 *
 *  When there are Pawns in a symmetrical left-right configuration, mirror the
 *  first King to the left side of the board.
 *
 *  After Kings come the first side's pieces, from Q to N.
 *
 *  Of lowest significance are the second side's pieces, from Q to N.
 *
 *  Multiple pieces of the same type are sorted ascending by square, using again
 *  the order a1, a2, ... h7, h8. The index space is reduced accordingly ("RR"
 *  takes half the normal space of two pieces, "RRR" takes 1/6th, etc. There
 *  is no index for pieces of the same type on top of each other.)
I'm interested to hear opinions on whether this is reasonable enough by current standards, before I dive deeper into the project.
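As a sanity check on the Pawnless King indexing sketched in the comment block above (the enumeration is mine and just counts what those mirroring rules leave over), the familiar 462 comes out:

Code:

# Count canonical (wK, bK) pairs: white King in the a1-d1-d4 triangle, and if
# it sits on the a1-d4 diagonal the black King is folded into the a1-h1-h8
# triangle.  Kings may not coincide or touch.  Squares are 0..63, a1 = 0.
def adjacent(a, b):
    return max(abs(a % 8 - b % 8), abs(a // 8 - b // 8)) <= 1

count = 0
for wk in range(64):
    wf, wr = wk % 8, wk // 8
    if wf > 3 or wr > wf:                 # outside the a1-d1-d4 triangle
        continue
    for bk in range(64):
        if adjacent(wk, bk):              # also catches wk == bk
            continue
        bf, br = bk % 8, bk // 8
        if wf == wr and br > bf:          # wK on the diagonal: fold bK too
            continue
        count += 1
print(count)                              # 462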

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Large tablebases

Post by BB+ » Mon Jun 06, 2011 10:19 pm

marcelk wrote:The flushing of the buffers should be avoided for at least one of them on every pass.
On re-thinking about this last night [before reading your post this morning], I already realised I forgot to emphasise that every iteration appears to necessitate a read/write across all data elements that are changed (that is, you can't cache out-reductions from one iteration to the next), as to "do the inductive step" of the retrograde build (that is, get the positions to consider at the next iteration), you need to know which positions are now losses (that is, their out-count has reached zero). Perhaps one could get around this slightly by (e.g.) not strictly computing the DTC but only a [finite] upper bound for it, perhaps in conjunction with keeping a list of positions with one out (though 2+ outs can disappear at any given iteration), but this seems undesirable, both theoretically and likely for practical purposes.
marcelk wrote:The throughput can be improved much more by zipping all streams.
Indeed, we found this to be the case in the aforementioned disk-based FFT/multiplication, though we did not find it to be "much" better, and had to use rather simplistic encoding to keep the time from that step from dominating (and at least part of our data was rather structured, which might additionally have helped). For instance, it doesn't seem that easy to get bzip2 to be anywhere near the speed of I/O throughput (and bunzip, while notably faster by 5-10x or so, is still below 100MB/s [even in parallel] if my memory is correct), and I couldn't convince myself that other methods would be much better. The "RobboBases" actually seem to have quite good BWT/bzip style code, though as I pointed out previously, it seems they may have "borrowed" it [or at least the compression code], and as above, it is still a bottleneck compared to I/O.
marcelk wrote:I was wondering if there is documentation on the indexing schemes that are today considered a minimal expectation. One can go very far in squeezing the index space, but I'm not willing to spend excessive time on designing that part. Here is what I have in mind:
My recollection is that the EGTB forum found that (e.g.) the parsimonious indexing schemes advocated by Haworth/Heinz/Nalimov are not really that useful overall, at least after compression is applied. The "broken" positions simply compress out of the picture. Again there is an expense in computing tricky indices (both in the build process, and then even in the usage phase, unless the data is reconstructed between these phases). It seems that Ballicora has various indexing schemes with the Gaviota system [though he notes that Nalimov does much more, such as getting rid of contact checks], while the "RobboBases" are rather minimal in this regard (only 462 and 1806 for king positions, and some folding for duplicated pieces). I think your proposal looks fine, though I think you can always mirror the first King when there are pawns, and I guess you mean "from Q to P" rather than from "Q to N"? I would prefer to have everything be white-to-move (and thus have both KRRKQ and KQKRR, but at half the size in each), but this is not too important.

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Large tablebases

Post by BB+ » Mon Jun 06, 2011 10:41 pm

marcelk wrote:I'm willing to give it more thought and work out the case for example on the QB vs RBN with DTM
I might point out that, at least for pawnless positions, you can do DTM with little more effort than DTC, by always having the maximal number of pieces around and indicating captured pieces by (say) making them have the same square as the friendly/opposing king (and interpreting this in the retrograde code in the right manner, of course). Then you just solve all the subgames simultaneously with the main game. For instance, KQKR would have a position with Ke1Qd1Ke8Re8 corresponding to the Ke1Qd1Ke8 subgame. This Ke1Qd1Ke8Re8 configuration would have the Re8 ignored in retro-generation, while any White un-move would have to consider un-capturing it (e.g. wQd1-a1, with the bRe8 appearing at bRd1). This doesn't seem to add much more complication, though maybe I should actually try it before saying that. :) [It seems that the "RobboBase" code contains some pre-working bits of something like this called "DTR" (why not just DTM?), but I am confused about what its purpose actually is -- at any rate, I think Muller might have spoken about this idea awhile back, and it likely pre-dates him].

EDIT: It seems that the various bits of "RobboBase" #if 0 code don't restrict this to pawnless situations, so maybe "DTR" means distance to mate-or-promotion or something. I didn't test whether it is functional (it seems rather incomplete). Also, I realised that their indexing scheme seems to interfere with this when there are duplicated pieces: you can't have (say) two White queens on the same square with their indices, so they can't both be captured... :!:
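A tiny sketch of how such an embedding could be indexed (purely illustrative -- this is not how the RobboBases or any other generator actually does it): give every non-King piece a full 64-square coordinate and read "standing on its own King's square" as "already captured", so KQK and KRK live inside the KQKR table.

Code:

# Illustrative KQKR index where a piece on its own King's square means it has
# been captured, so the subgames share the full game's index space.
def encode(wk, wq, bk, br):               # each coordinate is 0..63 (a1 = 0)
    return ((wk * 64 + wq) * 64 + bk) * 64 + br

def decode(ix):
    ix, br = divmod(ix, 64)
    ix, bk = divmod(ix, 64)
    wk, wq = divmod(ix, 64)
    return wk, wq, bk, br, wq == wk, br == bk   # last two: Q/R captured?

e1, d1, e8 = 4, 3, 60
ix = encode(e1, d1, e8, e8)               # Ke1 Qd1 Ke8, "Re8" = Rook captured
print(decode(ix))                         # (4, 3, 60, 60, False, True)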

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Large tablebases

Post by BB+ » Fri Jun 10, 2011 1:35 am

It seems as though some of the older RobboLito versions have "disk build" code for "RobboBases". There is some attempt to localise the data access. For instance, given any king configuration, there are only 9 possible king configurations after one side unmakes a move. If you arrange the 462 or 1806 king configurations in some clever order, you only need to load/save each "king slice" (or "chunk", which is what "fetta" means I'd say) once per iteration, maybe holding at most 16(?) in memory at once (the "RobboBase" code seems to allow up to 32, and it's not always exactly clear when a "king slice" will be finished due to SMP effects). [The "RobboBase" code even seemed to allow data to be compressed before being written, at least as an option].

This is still a large memory requirement for building a 7-piece tablebase (a "king slice" is 1GB if the out-counts are a byte each, though one also needs various bit-arrays for which positions are won/lost), though one can extend this observation: the side that is not unmaking a move will have all its pieces remain in the same place (up to modifications induced by king-based rotations). Applying this latter fact appears notably more difficult than doing "just" the king-slicing aspect, though possibly the Konoval code does something like this, as Bourzutschky suggested 64GB of RAM for the 8-piece RNN vs BBN, where each "king slice" would be around 8GB or more.

One can reduce the I/O volume (if you aren't using compression already) via noticing that there are large bands of "broken" positions, e.g. when the "major" indexed piece is on the same square as one of the kings. By the time you have 6 pieces plus 2 kings around, you could reduce the memory requirements to as little as the product of (1-k/64) over k=2..7 times the original, which is about 36% less than the original estimate.
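The arithmetic behind that last figure is just the stated product:

Code:

# Each unit placed after the two Kings must avoid the k units already on the
# board, k = 2..7, so the non-broken fraction of the index space is:
frac = 1.0
for k in range(2, 8):
    frac *= 1 - k / 64
print(f'{frac:.3f} of the naive space, i.e. about {1 - frac:.0%} less')
# -> 0.644 of the naive space, i.e. about 36% less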

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Large tablebases

Post by BB+ » Tue Jun 28, 2011 8:27 am

Here is a 48-core machine I was recently researching - the specs claim that one could reach as much as 4x128GB of RAM, but that requires quad-rank 16GB DIMMs, so maybe 4x96GB [involving 48 8GB DIMMs -- which seem much more reasonably priced] is a more suitable limit.

Motherboard, AS-2002-TC ($2699)
http://www.supermicro.com/Aplus/system/ ... C-BTRF.cfm
http://www.acmemicro.com/ShowProduct.aspx?pid=8979
This is actually in a 2U form factor(!), but the total system will draw 1000W or more, so there's no way you could (say) fill a 42U rack with these and hope to cool the lot too easily. It is essentially four H8DCT-F motherboards (each called a node), each of which has two cpu slots (and 12 DIMM slots). As noted below, there is a "half-price" option (with 2 nodes), the 1022TC-TF for $1285.

Eight Opteron 4180 processors ($1645 total)
This 6-core cpu seems to be the best price bet, though if you want something slightly faster than 2.6Ghz, the 4184 is available. http://www.acmemicro.com/ShowProduct.aspx?pid=8195

RAM (~$3500 for 192GB, ~$6400 for 384GB)
There are obviously many choices here, depending on what you want to do with this system. If you are looking to build tablebases in memory (as per the subject of this thread), then likely RAM will be the single most costly component. Looking at the Memory Support Matrix, I think I understand that: each cpu on each node has 2 channels, each of which has a max of 32GB. In any event, one needs quad-rank RAM to reach this limit. Assuming one stays with dual-rank, the limit is the third line from the bottom in the matrix therein, which would be 48 total DIMMs. With a 192GB system, the 4GB DIMMs are priced at $73 apiece, so just over $3500 total. It is actually more cost-effective to get 8GB DIMMs at $133 apiece (this seems to be quite a special, but then I don't follow the market), which works out to about $6400 total for 384GB.

Storage (~$1000 for twelve 2TB drives)
The specs say it has 3x 3.5" SATA hot-swap trays per node, so presumably one could go hog-wild and set up 12 drives, getting 24TB for a bit over $1000 [or under, depending on tomorrow's prices (the lowest 3TB drives I could find were $165, compared to $85 for a 2TB drive), while the recommendation is that only Enterprise should be used, which can be interpreted as one sees fit].

The above 4-node system also has 4 power switches(!) [note that the 1022TC-TF only has one (not two) listed, but who knows], so there might be some way (beyond virtual machines) to have the system act as four separate machines for some purposes (e.g., playing ponder-on games).

For "budget" enthusiasts, there is the A+ Twin Server 1022TC-TF (available from the same distributor -- look under Server/Workstation, not under Motherboards), which essentially halves all the above. Such a system obviously won't be the best performance-to-price ratio for many applications, but (for instance) it might provide a decent test-bed for cluster-ware, without the hassle of multiple machines lying around [obviously the slower LAN speeds of a real cluster can be simulated on the integrated machine, if desired].

In short, a 24-core 32GB system with 8TB of disk space [there is a discrepancy, but I think the 1022TC-TF has 4 drive bays total and not per node] as above is likely to end up about $3500 (depending on what extras you add), and doubling everything would make it around $7500, while tripling the RAM of the latter from 64GB to 192GB adds about another $2500, and sextupling to 384GB would only be an extra $5000. So the 48-core 384GB machine with 24TB of disk space should easily cost less than $15K. [Note that better deals might be found elsewhere; the above is just a rough guide, and hopefully I have understood the specs and compatibilities correctly]. My guess is it will pull at least 1000W, possibly 1500W, and maybe even 2000W. If electricity is 15-20c/KWh, then it will cost about $5/day to run the machine.
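Pulling the quoted prices and the power guess together in one place (all figures simply as quoted above, and purely indicative):

Code:

# Rough totals for the 48-core / 384GB / 24TB configuration discussed above.
parts = {
    'barebone (4 nodes)'    : 2699,
    '8x Opteron 4180'       : 1645,
    '48x 8GB DIMMs (384GB)' : 48 * 133,
    '12x 2TB drives'        : 12 * 85,
}
print(sum(parts.values()), 'USD in parts')   # 11748, comfortably under 15K

for watts in (1000, 1500, 2000):
    kwh_per_day = watts / 1000 * 24
    print(f'{watts}W at 15-20c/kWh: '
          f'{kwh_per_day * 0.15:.2f} to {kwh_per_day * 0.20:.2f} USD/day')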

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Large tablebases

Post by BB+ » Tue Jun 28, 2011 8:59 am

In the other direction, the AS-2022TG-HTRF ($2920 from Acmemicro) again has 4 nodes, with now 2x8 DIMM slots per node (rather than just 2x6), and after a bit of searching I found the relevant Memory Matrix to help me figure out what it all means. It seems that a machine with 64 8GB DIMMs (each dual-rank) is eminently feasible, and this configuration is capable of 1TB via 64 16GB (quad-rank) DIMMs [which are not cheap -- $558 vs $138 for 8GB, which is likely even outside of a typical academia-sized budget at $35K for 1TB].

This board supports the 8-core and 12-core 61xx series Opterons (rather than the 4xxx), so a 96-core box should presumably be feasible (but be warned that the only "cheap" such Opteron 61xx is the [slowish] 2.0GHz 8-core 6128 at $283, with every other 61xx product being twice as much or more). The 2122-TG-HTRF actually has 24 total drive bays [though Acmemicro seems not to list this product yet], so when fully decked out with 3TB drives [EDIT: now I note that the 24 bays are 2.5", which makes this a bit more dicey], the whole 7-piece lot might just fit! In any event, a 64-core machine (eight 6128s) with 512GB of RAM (64 8GB DIMMs) could be preferable to the previous 48-core 384GB option if memory is the main consideration, though it must be said that the 48 cores at 2.6GHz should likely be just as fast as the 64 cores at 2.0Ghz. [Note that the 61xx has 4 memory channels, and the 41xx only 2, which is why the extra RAM is possible].
