Multi-threaded memory access

Code, algorithms, languages, construction...
ThinkingALot
Posts: 144
Joined: Sun Jun 13, 2010 7:32 am
Contact:

Multi-threaded memory access

Post by ThinkingALot » Sun Feb 10, 2013 9:09 am

I encountered the following problem when testing Gull in SMP mode. If I set the hash table size to 1Gb (64Mb by default), NPS drops depending on the number of processes.
1 process: the slowdown of the master process (which never idles) is about 2.2%.
2 processes: the slowdown of the master process is about 4.8%.
4 processes: the slowdown of the master process is about 7%.
2600K, HT off, DDR3 is in dual channel mode, the engine doesn't use large pages.
Profiling showed that the amount of time each process spends on the prefetch instruction in do_move() increases significantly with 1Gb hash table size.
So here are two questions:
1) Why does the memory access time depend on the amount of memory allocated?
2) Why does the slowdown caused by a large amount of memory depend on the number of processes?

Jeremy Bernstein
Site Admin
Posts: 1226
Joined: Wed Jun 09, 2010 7:49 am
Real Name: Jeremy Bernstein
Location: Berlin, Germany
Contact:

Re: Multi-threaded memory access

Post by Jeremy Bernstein » Sun Feb 10, 2013 1:45 pm

How are you managing your multi-threaded memory access, in terms of read/write locking? Semaphores? Mutexes? Nothing?

Bob Hyatt probably has some tips about maximizing hash table performance and avoiding processor cache misses, but you might want to do some general reading on the subject as well.

For instance: http://stackoverflow.com/questions/9265 ... erformance

jb

ThinkingALot
Posts: 144
Joined: Sun Jun 13, 2010 7:32 am
Contact:

Re: Multi-threaded memory access

Post by ThinkingALot » Sun Feb 10, 2013 2:37 pm

Jeremy Bernstein wrote:How are you managing your multi-threaded memory access, in terms of read/write locking? Semaphores? Mutexes? Nothing?
Semaphores, but with locking turned off completely the slowdown is the same.

Richard Vida
Posts: 50
Joined: Thu Jun 10, 2010 12:48 am

Re: Multi-threaded memory access

Post by Richard Vida » Sun Feb 10, 2013 2:39 pm

ThinkingALot wrote: 1) Why does the memory access time depend on the amount of memory allocated?
TLB misses. Resolving a virtual address to a physical one can take up to 4 additional memory reads. Using large pages really helps a lot with larger hash sizes.

http://en.wikipedia.org/wiki/Translatio ... ide_buffer
ThinkingALot wrote: 2) Why does the slowdown caused by a large amount of memory depend on the number of processes?
Having more processes means a faster search, hence more frequent hash table accesses and consequently more L2 TLB misses. Besides that, each process has its own private page table, which again increases pressure on the L2 TLB (this is one of the reasons why threads are more efficient than processes).
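
To illustrate the large-page point, here is a minimal sketch of one possible way to back the hash table with 2Mb huge pages on Linux (mmap with MAP_HUGETLB). This is only an example under those assumptions, not Gull's actual allocation code:

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Illustrative sketch only: try to back the hash table with 2Mb huge pages,
// falling back to ordinary 4Kb pages if huge pages are not available.
void* alloc_hash(std::size_t bytes)
{
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        // no huge pages reserved/configured -> normal 4Kb pages
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); std::exit(1); }
    }
    std::memset(p, 0, bytes); // touch the memory so the pages are actually mapped
    return p;
}

On Linux the huge pages have to be reserved first (vm.nr_hugepages), otherwise the MAP_HUGETLB call fails and the code falls back to ordinary pages; on Windows the equivalent is VirtualAlloc with MEM_LARGE_PAGES, which requires the "lock pages in memory" privilege.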

Richard Vida
Posts: 50
Joined: Thu Jun 10, 2010 12:48 am

Re: Multi-threaded memory access

Post by Richard Vida » Sun Feb 10, 2013 2:50 pm

If you are using SSE prefetch, make sure you have enough useful work to do between the prefetch instruction and the memory access. Often the right timing of the prefetch is very important. Do it too soon and the prefetched data could be evicted from the cache before it is accessed. Do it too late and you don't have enough useful work to hide the latency.
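
For example, the prefetch can be issued at the end of do_move(), as soon as the new hash key is known, so that all the bookkeeping and move-ordering work done before the actual probe hides the latency. A minimal sketch (HashTable, HashMask and the entry layout are placeholder names for illustration, not Gull's real ones):

#include <xmmintrin.h>   // _mm_prefetch
#include <cstdint>

struct HashEntry { std::uint64_t key; std::uint64_t data; };  // 16-byte entry, illustrative

extern HashEntry* HashTable;     // table size is a power of two
extern std::uint64_t HashMask;   // number of entries - 1

// Issue the prefetch as early as possible (e.g. at the end of do_move()),
// well before the hash probe actually reads the entry.
inline void prefetch_hash(std::uint64_t key)
{
    _mm_prefetch(reinterpret_cast<const char*>(&HashTable[key & HashMask]),
                 _MM_HINT_NTA);
}

Which hint to use (_MM_HINT_NTA vs _MM_HINT_T0) is worth experimenting with; NTA is meant to reduce pollution of the outer cache levels with entries that will not be reused soon.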

ThinkingALot
Posts: 144
Joined: Sun Jun 13, 2010 7:32 am
Contact:

Re: Multi-threaded memory access

Post by ThinkingALot » Sun Feb 10, 2013 3:09 pm

Richard, thank you for the explanation.

hyatt
Posts: 1242
Joined: Thu Jun 10, 2010 2:13 am
Real Name: Bob Hyatt (Robert M. Hyatt)
Location: University of Alabama at Birmingham
Contact:

Re: Multi-threaded memory access

Post by hyatt » Sun Feb 10, 2013 11:06 pm

ThinkingALot wrote:I encountered the following problem when testing Gull in SMP mode. If I set the hash table size to 1Gb (64Mb by default), NPS drops depending on the number of processes.
1 process: the slowdown of the master process (which never idles) is about 2.2%.
2 processes: the slowdown of the master process is about 4.8%.
4 processes: the slowdown of the master process is about 7%.
2600K, HT off, DDR3 is in dual channel mode, the engine doesn't use large pages.
Profiling showed that the amount of time each process spends on the prefetch instruction in do_move() increases significantly with 1Gb hash table size.
So here are two questions:
1) Why does the memory access time depend on the amount of memory allocated?
2) Why does the slowdown caused by a large amount of memory depend on the number of processes?
First, the TLB is finite in size. 1gb of RAM is, basically, 2^30 bytes, and a page has 2^12 bytes, so that's 2^18 pages. No machine has a TLB that big. This means that hash table accesses have to "walk the page tables" and can take up to 4 extra memory reads just to translate a virtual address to a real one. Big pages reduce that, as you can then use 2M/4M page sizes, which means fewer pages and less TLB thrashing.

With multiple threads you now have more memory accesses, and the CPUs can begin to get in each other's way. It can also be an issue with just regular memory accesses: probing a large TT tends to give the cache a workout, and more threads means more memory bandwidth is needed just to access the data, even ignoring the TLB issue.

ThinkingALot
Posts: 144
Joined: Sun Jun 13, 2010 7:32 am
Contact:

Re: Multi-threaded memory access

Post by ThinkingALot » Mon Feb 11, 2013 10:39 am

According to http://software.intel.com/sites/product ... _guide.pdf, page 9, Intel Nehalem has a 512 entry TLB and a 32 entry DTLB for large pages (2Mb/4Mb) per core. One can conclude that
- it doesn't matter whether threads or processes are used since each thread running on a separate core has its own TLB
- the maximum amount of memory which can be accessed without TLB misses is 2Mb (or 64Mb/128Mb if large pages are used).
With a 64Mb+ hash table the requested page address should almost always be absent from the TLB. So is the difference between a 64Mb and a 1Gb table only in the length of the page walk?
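
Just to spell out that arithmetic (a back-of-the-envelope check using the entry counts quoted above):

#include <cstddef>

constexpr std::size_t KB = 1024, MB = 1024 * KB;

// TLB reach = number of entries * page size
constexpr std::size_t small_page_reach = 512 * 4 * KB;  // 512 entries x 4Kb pages
constexpr std::size_t large_page_reach = 32 * 2 * MB;   //  32 entries x 2Mb pages

static_assert(small_page_reach == 2 * MB,  "only 2Mb covered with small pages");
static_assert(large_page_reach == 64 * MB, "64Mb covered with 2Mb large pages");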

Richard Vida
Posts: 50
Joined: Thu Jun 10, 2010 12:48 am

Re: Multi-threaded memory access

Post by Richard Vida » Mon Feb 11, 2013 3:08 pm

ThinkingALot wrote:According to http://software.intel.com/sites/product ... _guide.pdf, page 9, Intel Nehalem has a 512 entry TLB and a 32 entry DTLB for large pages (2Mb/4Mb) per core. One can conclude that
- it doesn't matter whether threads or processes are used since each thread running on a separate core has its own TLB
Nehalem has a 2 level TLB hierarchy. Each core has its own L1 Instruction-TLB (128 entry) and L1 Data-TLB (64 entry) plus an additional 32 entry Data-TLB for large pages.

The L2 TLB (512 entry) is unified (serves both data and instructions), can handle only small pages, and is shared between cores.
ThinkingALot wrote: - the maximum amount of memory which can be accessed without TLB misses is 2Mb (or 64Mb/128Mb if large pages are used).
With a 64Mb+ hash table the requested page address should almost always be absent from the TLB. So is the difference between a 64Mb and a 1Gb table only in the length of the page walk?
Page walks are normal memory reads and as such they go through the usual cache hierarchy (L1D/L2/L3 - although I'm not sure about L1D). So the latency of the page walk depends on what is already in the caches.

With large pages the situation is actually a lot better. Even in the case of permanent TLB misses (which we are trying to hide with prefetch anyway), the reloads do not displace other useful small-page entries, because the TLB has separate small-page and large-page sections.
