The goal of this page is not to demonstrate the raw speed of the simulator, but rather the speed-up that can be obtained by using multiple cores, processors, and GPUs in different ways. You can run these simulations on your own machine (they are based on the examples that ship with the software) and see how your hardware compares to the machines we've tested.
| # | Simulation | Hardware | PSS/HSS configuration | PSS/HSS license requirement | Effective Cycle Time (s/cycle)* | Comment |
|---|---|---|---|---|---|---|
| A1 | Elbow.sim, 3GB, 3D EUV with Fourier Boundary Condition, non-complex | Box #1: 2x Opteron 285, 16GB DDR400 | 1x 1-threaded-PSS | 1/0 | 187 | Single-core (i.e. no SimRunner). |
| A2 | " | " | 1x 4-threaded-PSS | 4/0 | 95 | 4 cores give a 2X speedup with multi-threading (4 cores working on one job). |
| A3 | " | " | 2x 1-threaded-PSS | 2/0 | 104 | 2 cores give almost a 2X speedup with job distribution (2 cores working on two jobs independently). This is always more efficient than multi-threading, but requires more memory. |
| A4 | " | " | 2x 2-threaded-PSS | 4/0 | 61 | The combination of multi-threading and job distribution seems optimal: 4 cores give a 3X speedup, though memory for two simulations is required. This seems reasonable on the AMD dual-core architecture, where each processor (pair of cores) has its own memory controller and "close" memory. |
| A5 | " | " | 1x SuperPSS{4x 1-threaded-PSS} | 4/0 | 64 | Almost a 3X speedup with 4 cores, but uses less memory than #A4. Much faster than #A2. |
| A6 | " | Box #1: 2x Opteron 285, 16GB DDR400; Box #2: 2x Opteron 270, 16GB DDR400 | 1x SuperPSS{8x 1-threaded-PSS} | 8/0 | 60 | Not much faster than #A4 or #A5. Uses less memory per machine than #A4. |
| A7 | " | " | 2x SuperPSS{2x 2-threaded-PSS} | 8/0 | 49 | The Opteron 270 machine is slower. If both machines were Opteron 285s, we would expect double the performance of #A4. |
| A8 | " | Box #1: 2x Tesla C870 | 1x 2-GPU-HSS | 0/2 | 18 | Simulation fits entirely within the two cards. |
| A9 | " | Box #1: 1x Tesla C870 | 1x 1-GPU-HSS | 0/1 | 29 | More than 2X faster than #A4 (1 HSS license vs. 4 PSS licenses). |
| A10 | " | Box #1: 2x Tesla C870; Box #2: 2x Tesla C870 | 2x 2-GPU-HSS | 0/4 | 9 | Double the performance of #A8 (running two cases at once). |
| A11 | " | " | 1x SuperPSS{2x 2-GPU-HSS} | 0/4 | 43 | Poor performance because of SuperPSS communication overhead. |
| B1 | AltPSM_Contacts with pitch=2.2 (9.1GB) | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | 1x 1-GPU-HSS | 0/1 | 162 | The DDR400 memory was slowed to 366MHz. |
| B2 | " | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x C870 | 1x 1-GPU-HSS | 0/1 | 135 | This machine has faster memory than B1. |
| B3 | " | " | 2x SuperPSS{4x 1-threaded-PSS} | 8/0 | 288 | Using all 8 cores is slower than 1x Tesla C870 on the same machine (see B2). |
| B4 | " | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | 1x 2-GPU-HSS | 0/2 | 101 | Using 2 C870s instead of 1 improves 162s to 101s. That is not a 2X speedup (as expected), but it is still decent: 162/101 = 1.6X. |
| B5 | " | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x C870 | 1x 8-threaded-PSS | 8/0 | 326 | See B3. |
| B6 | " | " | | 8/0 | 314 | See B5 & B3. |
| B7 | " | " | 1x SuperPSS{1x 8-threaded-PSS, 1x 1-GPU-HSS} | 8/1 | 287 | Better to use the HSS alone; the PSSs can't help it, they only slow it down. See B2. |
| C1 | AltPSM_Contacts, pitch=0.3 (169MB) | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x 8800 GT-OC | 1x 1-GPU-HSS | 0/1 | 1.57 | This is just a graphics card (8800 GT-OC) with 512MB of GDDR3 memory. The card was driving video during the simulation (it might be a bit faster without video). |
| C2 | " | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x C870 | " | 0/1 | 1.29 | Compare to C1. The Tesla C870 beats the less expensive 8800 GT-OC even for a small simulation that fits entirely within the card's memory. |
| C3 | " | Box #2: 2x Intel 5440, 32GB DDR2 667 | 1x 1-threaded-PSS | 1/0 | 9.90 | The Tesla C870 is 7.67X faster than a single core of the Intel 5440; the 8800 GT-OC is only 6.3X faster. |
| C4 | " | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | " | 1/0 | 9.90 | The older Opteron 285 is the same speed as the newer Intel 5440!? |
| C5 | " | " | 1x 1-GPU-HSS | 0/1 | 1.35 | The Tesla C870 on the Opteron 285 with 366MHz DDR is slower than on the Intel 5440 with DDR2 667MHz (expected). |
| D1 | AltPSM_Contacts, pitch=0.8 (1.2GB) | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | 1x 1-GPU-HSS | 0/1 | 9.9 | Simulation fits entirely within the Tesla C870's 1.5GB memory. |
| D2 | " | " | 1x 1-threaded-PSS | 1/0 | 143 | Compare to D1. Here the Tesla C870 is 14X faster than the Opteron 285 processor. This is the "sweet spot" for the C870 because the simulation is large but still fits inside the card. |
| D3 | " | Box #2: 2x Intel 5440, 32GB DDR2 667 | " | 1/0 | 83 | Here the newer Intel 5440 with DDR2 667MHz beats the older Opteron 285 with DDR 366MHz (expected). |
| D4 | " | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x C870 | 1x 1-GPU-HSS | 0/1 | 8.7 | Here we see a 9.5X speedup compared to the late-model Intel 5440 processor. Note that this cycle time is also faster than the C870 on the older Opteron machine (D1), so the host system does matter. |
| E1 | AltPSM_Contacts, pitch=0.3 (169MB) | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | 1x 1-GPU-HSS | 0/1 | 1.35 | Compare with E1a. |
| E1a | " | " (but with 2x C1060) | " | " | 0.67 | Compare with E1: the C1060 has 2X the processing power of the C870. |
| E2 | " | " (but with 2x C870) | 1x 2-GPU-HSS | 0/2 | 1.42 | As expected, no improvement from using more cards on a small simulation that fits within one card (compare to E1). |
| E2a | " | " (but with 2x C1060) | " | " | 0.65 | Basically the same as E1a. |
| E3 | " | " (but with 2x C870) | 2x 1-GPU-HSS | 0/2 | 0.68 | Running two simulations at the same time; compare to E1. |
| E3a | " | " (but with 2x C1060) | " | " | 0.36 | " (compare to E1a). |
| E4 | Elbow.sim, 3GB, 3D EUV with Fourier Boundary Condition, non-complex | " (but with 2x C870) | 2x 1-GPU-HSS | 0/2 | 17.5 | Compare with E4a. |
| E4a | " | " (but with 2x C1060) | " | " | 8.25 | The C1060 is more than 2X faster than the C870; compare with E4. |
| E5 | " | " (but with 2x C870) | 1x 2-GPU-HSS | 0/2 | 18.4 | Compare with E5a. |
| E5a | " | " (but with 2x C1060) | " | " | 14.8 | The modest improvement over the C870 (E5) is expected because the 2nd card is not utilized at all: the simulation fits within the first card. In E5 both C870s run at the same time; here only one card runs while the other sits idle. |
| E6 | Elbow.sim, with 6 degree incidence (complex simulation) and pitch=76nm, 10GB, 3D EUV with Fourier Boundary Condition | " (but with 2x C870) | 1x 2-GPU-HSS | 0/2 | 92 | Domain divided into 7 parts: the first 6 run in simultaneous pairs, and the 7th runs on one card while the other remains idle. Card utilization is 7/8 = 87.5%, excluding CPU memory-transfer overhead (see the utilization sketch below the table). |
| E6a | " | " (but with 2x C1060) | " | " | 67 | Domain divided into 3 parts: the first 2 run simultaneously, and the 3rd runs on one card while the other remains idle. Card utilization is 3/4 = 75%, excluding CPU memory-transfer overhead. There is no 2X speedup over E6 because GPU utilization is lower and the CPU transfer overhead may be large, especially since this box has DDR366 (not even DDR2) and only PCI Express x16 gen 1 (not gen 2). With PCI Express x16 gen 2 and DDR2-800, the improvement would probably be closer to 2X. |
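The card-utilization figures quoted in E6 and E6a follow a simple batching model: when a domain is split into more parts than there are GPUs, the parts run in batches, and a final partial batch leaves one or more cards idle. Below is a minimal sketch of that arithmetic; the function name is illustrative, not part of TEMPESTpr2.

```python
import math

def card_utilization(parts: int, cards: int) -> float:
    """Fraction of card time spent working when `parts` domain pieces run
    in batches across `cards` GPUs. Illustrative model only; it ignores
    the CPU memory-transfer overhead, as the table comments also do."""
    batches = math.ceil(parts / cards)   # e.g. 7 parts on 2 cards -> 4 batches
    return parts / (batches * cards)     # busy card-slots / total card-slots

print(card_utilization(7, 2))  # E6:  7/8 = 0.875 (87.5%)
print(card_utilization(3, 2))  # E6a: 3/4 = 0.75  (75%)
```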
*Note: "Effective" cycle time is the total cycle time divided by the number of cases running. For example, if you have 5 PSS's running 5 different simulations (of the same size) and each has a cycle time of 10s, then the effective cycle time would be 10s/5=2s. A "cycle" is amount of time TEMPESTpr2 takes to propagate the fields one wavelength.