README.html

1<!DOCTYPE html>
2<html>
3<head>
4<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
5<meta name="generator" content="hevea 2.00">
6<style type="text/css">
7.li-itemize{margin:1ex 0ex;}
8.li-enumerate{margin:1ex 0ex;}
9.dd-description{margin:0ex 0ex 1ex 4ex;}
10.dt-description{margin:0ex;}
11.toc{list-style:none;}
12.footnotetext{margin:0ex; padding:0ex;}
13div.footnotetext P{margin:0px; text-indent:1em;}
14.thefootnotes{text-align:left;margin:0ex;}
15.dt-thefootnotes{margin:0em;}
16.dd-thefootnotes{margin:0em 0em 0em 2em;}
17.footnoterule{margin:1em auto 1em 0px;width:50%;}
18.caption{padding-left:2ex; padding-right:2ex; margin-left:auto; margin-right:auto}
19.title{margin:2ex auto;text-align:center}
20.center{text-align:center;margin-left:auto;margin-right:auto;}
21.flushleft{text-align:left;margin-left:0ex;margin-right:auto;}
22.flushright{text-align:right;margin-left:auto;margin-right:0ex;}
23div table{margin-left:inherit;margin-right:inherit;margin-bottom:2px;margin-top:2px}
24td table{margin:auto;}
25table{border-collapse:collapse;}
26td{padding:0;}
27.cellpadding0 tr td{padding:0;}
28.cellpadding1 tr td{padding:1px;}
29pre{text-align:left;margin-left:0ex;margin-right:auto;}
30blockquote{margin-left:4ex;margin-right:4ex;text-align:left;}
31td p{margin:0px;}
32.boxed{border:1px solid black}
33.textboxed{border:1px solid black}
34.vbar{border:none;width:2px;background-color:black;}
35.hbar{border:none;height:2px;width:100%;background-color:black;}
36.hfill{border:none;height:1px;width:200%;background-color:black;}
37.vdisplay{border-collapse:separate;border-spacing:2px;width:auto; empty-cells:show; border:2px solid red;}
38.vdcell{white-space:nowrap;padding:0px border:2px solid green;}
39.display{border-collapse:separate;border-spacing:2px;width:auto; border:none;}
40.dcell{white-space:nowrap;padding:0px border:none;}
41.dcenter{margin:0ex auto;}
42.vdcenter{border:solid #FF8000 2px; margin:0ex auto;}
43.minipage{text-align:left; margin-left:0em; margin-right:auto;}
44.marginpar{border:solid thin black; width:20%; text-align:left;}
45.marginparleft{float:left; margin-left:0ex; margin-right:1ex;}
46.marginparright{float:right; margin-left:1ex; margin-right:0ex;}
47.theorem{text-align:left;margin:1ex auto 1ex 0ex;}
48.part{margin:2ex auto;text-align:center}
49h1, h2, h3, h4 {color: #527bbd;}
50.section {border-bottom: 2px solid silver;}
51</style>
52<title>DARPA/DOE HPC&#XA0;Challenge Benchmark version 1.4.2
53</title>
54</head>
55<body >
56<!--HEVEA command line is: hevea README.tex -->
57<!--CUT STYLE article--><!--CUT DEF section 1 --><table class="title"><tr><td><h1 class="titlemain">DARPA/DOE HPC&#XA0;Challenge Benchmark version 1.4.2</h1><h3 class="titlerest">Piotr Luszczek<sup><a id="text1" href="#note1">*</a></sup></h3><h3 class="titlerest">October 12, 2012</h3></td></tr>
58</table>
59<!--TOC section id=sec1 Introduction-->
60<h2 id="sec1" class="section">1&#XA0;&#XA0;Introduction</h2><!--SEC END --><p>
61This is a suite of benchmarks that measure the performance of the processor,
62memory subsystem, and interconnect. For details, refer to the
63HPC&#XA0;Challenge web site (<span style="font-family:monospace; color:navy;">http://icl.cs.utk.edu/hpcc/</span>).</p><p>In essence, HPC&#XA0;Challenge consists of a number of tests, each
64of which measures performance of a different aspect of the system.</p><p>If you are familiar with the High Performance Linpack&#XA0;(HPL) benchmark
65code (see the HPL web site:
66<span style="font-family:monospace; color:navy;">http://www.netlib.org/benchmark/hpl/</span>) then you can reuse the
67build script file&#XA0;(input for <span style="font-family:monospace; color:navy;">make(1)</span> command) and the input
68file that you already have for HPL. The HPC&#XA0;Challenge benchmark
69includes HPL and uses its build script and input files with only
70slight modifications. The most important change must be made to the
71line that sets the <span style="font-family:monospace; color:navy;">TOPdir</span> variable. For HPC&#XA0;Challenge, the
72variable&#X2019;s value should always be <span style="font-family:monospace; color:navy;">../../..</span> regardless of what
73it was in the HPL build script file.</p>
74<!--TOC section id=sec2 Compiling-->
75<h2 id="sec2" class="section">2&#XA0;&#XA0;Compiling</h2><!--SEC END --><p>
76The first step is to create a build script file that reflects
77characteristics of your machine. This file is reused by all the
78components of the HPC&#XA0;Challenge suite. The build script file should be
79created in the <span style="font-family:monospace; color:navy;">hpl</span> directory. This directory contains
80instructions (the files <span style="font-family:monospace; color:navy;">README</span> and <span style="font-family:monospace; color:navy;">INSTALL</span>) on how
81to create the build script file for your system. The
82<span style="font-family:monospace; color:navy;">hpl/setup</span> directory contains many examples of build script
83files. A recommended approach is to copy one of them to the
84<span style="font-family:monospace; color:navy;">hpl</span> directory and, if it doesn&#X2019;t work, modify it.</p><p>The build script file has a name that starts with the <span style="font-family:monospace; color:navy;">Make.</span>
85prefix and usually ends with a suffix that identifies the target
86system. For example, if the suffix chosen for the system is
87<span style="font-family:monospace; color:navy;">Unix</span>, the file should be named <span style="font-family:monospace; color:navy;">Make.Unix</span>.</p><p>To build the benchmark executable (for the system named <span style="font-family:monospace; color:navy;">Unix</span>)
88type: <span style="font-family:monospace; color:navy;">make arch=Unix</span>. This command should be run in the top
89directory&#XA0;(not in the <span style="font-family:monospace; color:navy;">hpl</span> directory). It will look in the
90<span style="font-family:monospace; color:navy;">hpl</span> directory for the build script file and use it to build
91the benchmark executable.</p><p>The runtime behavior of the HPC&#XA0;Challenge source code may be
92configured at compile time by defining a few C preprocessor
93symbols. They can be defined by adding appropriate options to the
94<span style="font-family:monospace; color:navy;">CCNOOPT</span> and <span style="font-family:monospace; color:navy;">CCFLAGS</span> make variables. The former
95controls options for source code files that need to be compiled
96without aggressive optimizations to ensure accurate generation of
97system-specific parameters. The latter applies to the rest of the
98files that need good compiler optimization for best performance. To
99define a symbol <span style="font-family:monospace; color:navy;">S</span>, most compilers require the option
100<span style="font-family:monospace; color:navy;">-DS</span> to be used. Currently, the following options are
101available in the HPC&#XA0;Challenge source code:
102</p><ul class="itemize"><li class="li-itemize"><span style="font-family:monospace; color:navy;">HPCC_FFT_235</span>: if this symbol is defined the FFTE
103code (an FFT implementation) will use vector sizes and processor
104counts that are not limited to powers of 2. Instead, the vector sizes
105and processor counts to be used will be a product of powers of 2, 3,
106and 5.</li><li class="li-itemize"><span style="font-family:monospace; color:navy;">HPCC_FFTW_ESTIMATE</span>: if this symbol is defined it will
107affect the way the external FFTW library is called&#XA0;(it does not have any
108effect if the FFTW library is not used). When this symbol is defined,
109the FFTW planning routine is called with the <span style="font-family:monospace; color:navy;">FFTW_ESTIMATE</span>
110flag&#XA0;(instead of <span style="font-family:monospace; color:navy;">FFTW_MEASURE</span>). This might result in worse
111performance results but shorter execution time of the
112benchmark. Defining this symbol may also reduce the memory
113fragmentation caused by FFTW&#X2019;s planning routine.</li><li class="li-itemize"><span style="font-family:monospace; color:navy;">HPCC_MEMALLCTR</span>: if this symbol is defined a custom
114memory allocator will be used to alleviate effects of memory
115fragmentation and allow for larger data sets to be used which may
116result in obtaining better performance.</li><li class="li-itemize"><span style="font-family:monospace; color:navy;">HPL_USE_GETPROCESSTIMES</span>: if this symbol is defined
117then the Windows-specific <span style="font-family:monospace; color:navy;">GetProcessTimes()</span> function will be used
118to measure the elapsed CPU time.</li><li class="li-itemize"><span style="font-family:monospace; color:navy;">USE_MULTIPLE_RECV</span>: if this symbol is defined then multiple non-blocking
119receives will be posted simultaneously. By default only one non-blocking
120receive is posted.</li><li class="li-itemize"><span style="font-family:monospace; color:navy;">RA_SANDIA_NOPT</span>: if this symbol is defined the
121HPC&#XA0;Challenge standard algorithm for Global RandomAccess will not be
122used. Instead, an alternative implementation from Sandia
123National Laboratory will be used. It routes messages in software
124across a virtual hyper-cube topology formed from MPI processes.</li><li class="li-itemize"><span style="font-family:monospace; color:navy;">RA_SANDIA_OPT2</span>: if this symbol is defined the
125HPC&#XA0;Challenge standard algorithm for Global RandomAccess will not be
126used. Instead, an alternative implementation from Sandia
127National Laboratory will be used. This implementation is optimized for
128processor counts that are powers of two. The optimizations
129are sorting of data before sending and unrolling the data update
130loop. If the number of processes is not a power of two then the code
131is the same as the one performed with the <span style="font-family:monospace; color:navy;">RA_SANDIA_NOPT</span> setting.</li><li class="li-itemize"><span style="font-family:monospace; color:navy;">RA_TIME_BOUND_DISABLE</span>: if this symbol is defined then the
132standard Global RandomAccess code will be used without time limits. This is
133discouraged for most runs because the standard algorithm tends to be slow for
134large array sizes due to a large overhead for short MPI messages.</li><li class="li-itemize"><span style="font-family:monospace; color:navy;">USING_FFTW</span>: if this symbol is defined the standard
135HPC&#XA0;Challenge FFT implementation&#XA0;(called FFTE) will not be used.
136Instead, the FFTW library will be called. Defining the
137<span style="font-family:monospace; color:navy;">USING_FFTW</span> symbol is not sufficient: appropriate flags have
138to be added in the make script so that FFTW header files can be found
139at compile time and the FFTW libraries at link time.</li></ul>
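<p>To make the <span style="font-family:monospace; color:navy;">HPCC_FFT_235</span> size rule above concrete, here is a minimal Python sketch that tests whether a given vector size or processor count is a product of powers of 2, 3, and 5. It is an illustration only and is not taken from the FFTE or HPC&#XA0;Challenge sources; the function name <span style="font-family:monospace; color:navy;">is_product_of_2_3_5</span> is hypothetical.</p>
<pre>
```python
def is_product_of_2_3_5(n):
    """Return True if n equals 2**a * 3**b * 5**c for non-negative integers a, b, c.

    Hypothetical helper for illustration; not part of the HPC Challenge code.
    """
    if max(n, 1) != n:  # reject zero and negative values
        return False
    for prime in (2, 3, 5):
        # Divide out every factor of this prime.
        while n % prime == 0:
            n //= prime
    # Anything left over is a prime factor other than 2, 3, or 5.
    return n == 1
```
</pre>
<p>Under this rule a size such as 1000 (2 cubed times 5 cubed) qualifies, while 14 does not because of its factor of 7.</p>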
140<!--TOC section id=sec3 Runtime Configuration-->
141<h2 id="sec3" class="section">3&#XA0;&#XA0;Runtime Configuration</h2><!--SEC END --><p>
142The HPC&#XA0;Challenge is driven by a short input file named
143<span style="font-family:monospace; color:navy;">hpccinf.txt</span> that is almost the same as the input file for
144HPL&#XA0;(customarily called <span style="font-family:monospace; color:navy;">HPL.dat</span>). Refer to the file
145<span style="font-family:monospace; color:navy;">hpl/www/tuning.html</span> for details about the input file for
146HPL. A sample input file is included with the HPC&#XA0;Challenge
147distribution.</p><p>The differences between HPL&#X2019;s input file and HPC&#XA0;Challenge&#X2019;s input
148file can be summarized as follows:</p><ul class="itemize"><li class="li-itemize">
149Lines 3 and 4 are ignored. The output is always appended to the
150file named <span style="font-family:monospace; color:navy;">hpccoutf.txt</span>.
151</li><li class="li-itemize">There are additional lines&#XA0;(starting with line 33) that may&#XA0;(but
152do not have to) be used to customize the HPC&#XA0;Challenge benchmark. They
153are described below.
154</li></ul><p>The additional lines in the HPC&#XA0;Challenge input file&#XA0;(compared to the
155HPL input file) are:</p><ul class="itemize"><li class="li-itemize">
156Lines 33 and 34 describe additional matrix sizes to be used for
157running the PTRANS benchmark&#XA0;(one of the components of the
158HPC&#XA0;Challenge benchmark).
159</li><li class="li-itemize">Lines 35 and 36 describe additional blocking factors to be used
160for running the PTRANS test.
161</li></ul><p>For completeness, here is the list of lines in the HPC
162Challenge input file with a brief description of each:
163</p><ul class="itemize"><li class="li-itemize">
164Line 1: ignored
165</li><li class="li-itemize">Line 2: ignored
166</li><li class="li-itemize">Line 3: ignored
167</li><li class="li-itemize">Line 4: ignored
168</li><li class="li-itemize">Line 5: number of matrix sizes for HPL (and PTRANS)
169</li><li class="li-itemize">Line 6: matrix sizes for HPL (and PTRANS)
170</li><li class="li-itemize">Line 7: number of blocking factors for HPL (and PTRANS)
171</li><li class="li-itemize">Line 8: blocking factors for HPL (and PTRANS)
172</li><li class="li-itemize">Line 9: type of process ordering for HPL
173</li><li class="li-itemize">Line 10: number of process grids for HPL (and PTRANS)
174</li><li class="li-itemize">Line 11: numbers of process rows of each process grid for HPL (and PTRANS)
175</li><li class="li-itemize">Line 12: numbers of process columns of each process grid for HPL (and PTRANS)
176</li><li class="li-itemize">Line 13: threshold value not to be exceeded by scaled residual for HPL (and PTRANS)
177</li><li class="li-itemize">Line 14: number of panel factorization methods for HPL
178</li><li class="li-itemize">Line 15: panel factorization methods for HPL
179</li><li class="li-itemize">Line 16: number of recursive stopping criteria for HPL
180</li><li class="li-itemize">Line 17: recursive stopping criteria for HPL
181</li><li class="li-itemize">Line 18: number of recursion panel counts for HPL
182</li><li class="li-itemize">Line 19: recursion panel counts for HPL
183</li><li class="li-itemize">Line 20: number of recursive panel factorization methods for HPL
184</li><li class="li-itemize">Line 21: recursive panel factorization methods for HPL
185</li><li class="li-itemize">Line 22: number of broadcast methods for HPL
186</li><li class="li-itemize">Line 23: broadcast methods for HPL
187</li><li class="li-itemize">Line 24: number of look-ahead depths for HPL
188</li><li class="li-itemize">Line 25: look-ahead depths for HPL
189</li><li class="li-itemize">Line 26: swap methods for HPL
190</li><li class="li-itemize">Line 27: swapping threshold for HPL
191</li><li class="li-itemize">Line 28: form of L1 for HPL
192</li><li class="li-itemize">Line 29: form of U for HPL
193</li><li class="li-itemize">Line 30: value that specifies whether equilibration should be used by HPL
194</li><li class="li-itemize">Line 31: memory alignment for HPL
195</li><li class="li-itemize">Line 32: ignored
196</li><li class="li-itemize">Line 33: number of additional problem sizes for PTRANS
197</li><li class="li-itemize">Line 34: additional problem sizes for PTRANS
198</li><li class="li-itemize">Line 35: number of additional blocking factors for PTRANS
199</li><li class="li-itemize">Line 36: additional blocking factors for PTRANS
200</li></ul>
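<p>The line layout listed above can be read mechanically. The following Python sketch extracts the counted lists (a count on one line, the values on another) from the text of an input file. It assumes, as in the customary <span style="font-family:monospace; color:navy;">HPL.dat</span> layout, that each line begins with its value(s) and that any remaining text on the line is a description. The function names <span style="font-family:monospace; color:navy;">counted_ints</span> and <span style="font-family:monospace; color:navy;">summarize_hpccinf</span> and the dictionary keys are hypothetical; this is not the parser used by the benchmark.</p>
<pre>
```python
def counted_ints(lines, count_line, values_line):
    """Read a count from one 1-based line and that many leading integers from another.

    Assumption: values come first on each line, followed by free-text description.
    """
    try:
        count = int(lines[count_line - 1].split()[0])
        tokens = lines[values_line - 1].split()[:count]
    except IndexError:
        # The optional lines 33-36 may be absent; treat that as "no values".
        return []
    return [int(token) for token in tokens]


def summarize_hpccinf(text):
    """Collect the sizes, blocking factors, and process grids from hpccinf.txt text."""
    lines = text.splitlines()
    return {
        "N": counted_ints(lines, 5, 6),            # matrix sizes (HPL and PTRANS)
        "NB": counted_ints(lines, 7, 8),           # blocking factors
        "P": counted_ints(lines, 10, 11),          # process rows of each grid
        "Q": counted_ints(lines, 10, 12),          # process columns of each grid
        "PTRANS_extra_N": counted_ints(lines, 33, 34),
        "PTRANS_extra_NB": counted_ints(lines, 35, 36),
    }
```
</pre>
<p>A call such as <span style="font-family:monospace; color:navy;">summarize_hpccinf(open("hpccinf.txt").read())</span> would return the lists keyed by these illustrative names. Lines 10 to 12 share one count because line 10 gives the number of process grids for both the row and the column lists.</p>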
201<!--TOC section id=sec4 Running-->
202<h2 id="sec4" class="section">4&#XA0;&#XA0;Running</h2><!--SEC END --><p>
203The exact way to run the HPC&#XA0;Challenge benchmark depends on the MPI
204implementation and system details. An example command to run the
205benchmark could look like this: <span style="font-family:monospace; color:navy;">mpirun -np 4 hpcc</span>. The
206meaning of the command&#X2019;s components is as follows:
207</p><ul class="itemize"><li class="li-itemize">
208<span style="font-family:monospace; color:navy;">mpirun</span> is the command that starts execution of an MPI
209code. Depending on the system, it might also be <span style="font-family:monospace; color:navy;">aprun</span>,
210<span style="font-family:monospace; color:navy;">mpiexec</span>, <span style="font-family:monospace; color:navy;">mprun</span>, <span style="font-family:monospace; color:navy;">poe</span>, or something
211appropriate for your computer.</li><li class="li-itemize"><span style="font-family:monospace; color:navy;">-np 4</span> is the argument that specifies that 4 MPI
212processes should be started. The number of MPI processes should be
213large enough to accommodate all the process grids specified in the
214<span style="font-family:monospace; color:navy;">hpccinf.txt</span> file.</li><li class="li-itemize"><span style="font-family:monospace; color:navy;">hpcc</span> is the name of the HPC&#XA0;Challenge executable to
215run.
216</li></ul><p>After the run, a file called <span style="font-family:monospace; color:navy;">hpccoutf.txt</span> is created. It
217contains results of the benchmark. This file should be uploaded
218through the web form at the HPC&#XA0;Challenge website.</p>
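<p>As a sketch of the process-count requirement described above: a grid with P process rows and Q process columns occupies P times Q processes, so the launch must provide at least the largest such product over all grids in the input file. The Python helpers below compute that number and assemble a launch command in the form of the example above. The names <span style="font-family:monospace; color:navy;">min_mpi_processes</span> and <span style="font-family:monospace; color:navy;">launch_command</span> are hypothetical, and the <span style="font-family:monospace; color:navy;">-np</span> option is taken from the <span style="font-family:monospace; color:navy;">mpirun</span> example; other launchers may spell it differently.</p>
<pre>
```python
def min_mpi_processes(p_rows, q_cols):
    """Return the smallest process count that fits every P x Q grid.

    p_rows and q_cols are the per-grid values from lines 11 and 12 of the
    input file; the answer is the largest P * Q product.
    """
    return max(p * q for p, q in zip(p_rows, q_cols))


def launch_command(p_rows, q_cols, launcher="mpirun"):
    """Build an argument list like the 'mpirun -np 4 hpcc' example.

    Assumption: the launcher accepts '-np'; check your MPI documentation.
    """
    count = min_mpi_processes(p_rows, q_cols)
    return [launcher, "-np", str(count), "hpcc"]
```
</pre>
<p>For example, grids of 1 by 4 and 2 by 2 both need four processes, which matches the <span style="font-family:monospace; color:navy;">-np 4</span> used in the example command.</p>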
219<!--TOC section id=sec5 Source Code Changes across Versions (ChangeLog)-->
220<h2 id="sec5" class="section">5&#XA0;&#XA0;Source Code Changes across Versions (ChangeLog)</h2><!--SEC END -->
221<!--TOC subsection id=sec6 Version 1.4.3 (2013-08-26)-->
222<h3 id="sec6" class="subsection">5.1&#XA0;&#XA0;Version 1.4.3 (2013-08-26)</h3><!--SEC END --><ol class="enumerate" type=1><li class="li-enumerate">
223Increased the size of scratch vector for local FFT tests that was
224missed in the previous version (reported by SGI).
225</li><li class="li-enumerate">Added Makefile for Blue Gene/P contributed by Vasil Tsanov.
226</li></ol>
227<!--TOC subsection id=sec7 Version 1.4.2 (2012-10-12)-->
228<h3 id="sec7" class="subsection">5.2&#XA0;&#XA0;Version 1.4.2 (2012-10-12)</h3><!--SEC END --><ol class="enumerate" type=1><li class="li-enumerate">
229Increased sizes of scratch vectors for local FFT tests to account for
230runs on systems with large main memory (reported by IBM, SGI and Intel).
231</li><li class="li-enumerate">Reduced vector size for local FFT tests due to larger scratch space needed.
232</li><li class="li-enumerate">Added a type cast to prevent overflow of a 32-bit integer vector
233size in FFT data generation routine (reported by IBM).
234</li><li class="li-enumerate">Fixed variable types to handle array sizes that overflow 32-bit
235integers in RandomAccess (reported by IBM and SGI).
236</li><li class="li-enumerate">Changed time-bound code to be used by default in Global RandomAccess and
237allowed for it to be switched off with a compile time flag if necessary.
238</li><li class="li-enumerate">Cleaned up code to allow the RandomAccess test to compile without warnings.
239</li><li class="li-enumerate">Changed communication code in PTRANS to avoid large message sizes that
240caused problems in some MPI implementations.
241</li><li class="li-enumerate">Updated documentation in README.txt and README.html files.
242</li></ol>
243<!--TOC subsection id=sec8 Version 1.4.1 (2010-06-01)-->
244<h3 id="sec8" class="subsection">5.3&#XA0;&#XA0;Version 1.4.1 (2010-06-01)</h3><!--SEC END --><ol class="enumerate" type=1><li class="li-enumerate">
245Added optimized variants of RandomAccess that use Linear Congruential Generator for random number generation.
246</li><li class="li-enumerate">Made corrections to comments that provide definition of the RandomAccess test.
247</li><li class="li-enumerate">Removed initialization of the main array from the timed section of optimized versions of RandomAccess.
248</li><li class="li-enumerate">Fixed the length of the vector used to compute error when using MPI implementation from FFTW.
249</li><li class="li-enumerate">Added global reduction to error calculation in MPI FFT to achieve more accurate error estimate.
250</li><li class="li-enumerate">Updated documentation in README.
251</li></ol>
252<!--TOC subsection id=sec9 Version 1.4.0 (2010-03-26)-->
253<h3 id="sec9" class="subsection">5.4&#XA0;&#XA0;Version 1.4.0 (2010-03-26)</h3><!--SEC END --><ol class="enumerate" type=1><li class="li-enumerate">
254Added new variant of RandomAccess that uses Linear Congruential Generator for random number generation.
255</li><li class="li-enumerate">Rearranged the order of benchmarks so that HPL component runs last and may be aborted
256if the performance of other components was not satisfactory. RandomAccess is now first to assist in tuning
257the code.
258</li><li class="li-enumerate">Added a global initialization and finalization routine that makes it possible to properly initialize
259and finalize external software and hardware components without changing the rest of the HPCC testing harness.
260</li><li class="li-enumerate">Lack of <span style="font-family:monospace; color:navy;">hpccinf.txt</span> is no longer reported as error but as a warning.
261</li></ol>
262<!--TOC subsection id=sec10 Version 1.3.2 (2009-03-24)-->
263<h3 id="sec10" class="subsection">5.5&#XA0;&#XA0;Version 1.3.2 (2009-03-24)</h3><!--SEC END --><ol class="enumerate" type=1><li class="li-enumerate">
264Fixed memory leaks in G-RandomAccess driver routine.
265</li><li class="li-enumerate">Made the check for 32-bit vector sizes in G-FFT optional. MKL allows for 64-bit vector sizes in its FFTW wrapper.
266</li><li class="li-enumerate">Fixed memory bug in single-process FFT.
267</li><li class="li-enumerate">Updated documentation (README).
268</li></ol>
269<!--TOC subsection id=sec11 Version 1.3.1 (2008-12-09)-->
270<h3 id="sec11" class="subsection">5.6&#XA0;&#XA0;Version 1.3.1 (2008-12-09)</h3><!--SEC END --><ol class="enumerate" type=1><li class="li-enumerate">
271Fixed a dead-lock problem in FFT component due to use of wrong communicator.
272</li><li class="li-enumerate">Fixed the 32-bit random number generator in PTRANS that was using 64-bit
273routines from HPL.
274</li></ol>
275<!--TOC subsection id=sec12 Version 1.3.0 (2008-11-13)-->
276<h3 id="sec12" class="subsection">5.7&#XA0;&#XA0;Version 1.3.0 (2008-11-13)</h3><!--SEC END --><ol class="enumerate" type=1><li class="li-enumerate">
277Updated HPL component to use HPL 2.0 source code
278<ol class="enumerate" type=a><li class="li-enumerate">
279Replaced 32-bit Pseudo Random Number Generator (PRNG) with a 64-bit one.
280</li><li class="li-enumerate">Replaced 3 numerical checks of the solution residual with a single one.
281</li><li class="li-enumerate">Added support for 64-bit systems with large memory sizes (previously, index
282calculations would overflow 32-bit integers).
283</li></ol>
284</li><li class="li-enumerate">Introduced a limit on FFT vector sizes so that they fit in a 32-bit integer (only
285applicable when using FFTW version 2).
286</li></ol>
287<!--TOC subsection id=sec13 Version 1.2.0 (2007-06-25)-->
288<h3 id="sec13" class="subsection">5.8&#XA0;&#XA0;Version 1.2.0 (2007-06-25)</h3><!--SEC END --><ol class="enumerate" type=1><li class="li-enumerate">
289Changes in the FFT component:
290<ol class="enumerate" type=a><li class="li-enumerate">
291Added flexibility in choosing vector sizes and processor counts:
292now the code can do powers of 2, 3, and 5 both sequentially and in parallel
293tests.
294</li><li class="li-enumerate">FFTW can now run with the ESTIMATE (not just MEASURE) flag: it might produce
295worse performance results but often reduces time to run the test and causes
296less memory fragmentation.
297</li></ol>
298</li><li class="li-enumerate">Changes in the DGEMM component:
299<ol class="enumerate" type=a><li class="li-enumerate">
300Added more comprehensive checking of the numerical properties of the
301test&#X2019;s results.
302</li></ol>
303</li><li class="li-enumerate">Changes in the RandomAccess component:
304<ol class="enumerate" type=a><li class="li-enumerate">
305Removed time-bound functionality: only runs that perform complete
306computation are now possible.
307</li><li class="li-enumerate">Made the timing more accurate: main array initialization is not counted
308towards performance timing.
309</li><li class="li-enumerate">Cleaned up the code: some non-portable C language constructs have been
310removed.
311</li><li class="li-enumerate">Added new algorithms: new algorithms from Sandia based on hypercube
312network topology can now be chosen at compile time, which results in much
313better performance on many types of parallel systems.
314</li><li class="li-enumerate">Fixed potential resource leaks by adding function calls required by the MPI
315standard.
316</li></ol>
317</li><li class="li-enumerate">Changes in the HPL component:
318<ol class="enumerate" type=a><li class="li-enumerate">
319Cleaned up reporting of numerics: more accurate printing of scaled
320residual formula.
321</li></ol>
322</li><li class="li-enumerate">Changes in the PTRANS component:
323<ol class="enumerate" type=a><li class="li-enumerate">
324Added randomization of virtual process grids to measure bandwidth of the
325network more accurately.
326</li></ol>
327</li><li class="li-enumerate">Miscellaneous changes:
328<ol class="enumerate" type=a><li class="li-enumerate">
329Added better support for Windows-based clusters by taking advantage of
330Win32 API.
331</li><li class="li-enumerate">Added custom memory allocator to deal with memory fragmentation on some
332systems.
333</li><li class="li-enumerate">Added better reporting of configuration options in the output file.
334</li></ol>
335</li></ol>
336<!--TOC subsection id=sec14 Version 1.0.0 (2005-06-11)-->
337<h3 id="sec14" class="subsection">5.9&#XA0;&#XA0;Version 1.0.0 (2005-06-11)</h3><!--SEC END -->
338<!--TOC subsection id=sec15 Version 0.8beta (2004-10-19)-->
339<h3 id="sec15" class="subsection">5.10&#XA0;&#XA0;Version 0.8beta (2004-10-19)</h3><!--SEC END -->
340<!--TOC subsection id=sec16 Version 0.8alpha (2004-10-15)-->
341<h3 id="sec16" class="subsection">5.11&#XA0;&#XA0;Version 0.8alpha (2004-10-15)</h3><!--SEC END -->
342<!--TOC subsection id=sec17 Version 0.6beta (2004-08-21)-->
343<h3 id="sec17" class="subsection">5.12&#XA0;&#XA0;Version 0.6beta (2004-08-21)</h3><!--SEC END -->
344<!--TOC subsection id=sec18 Version 0.6alpha (2004-05-31)-->
345<h3 id="sec18" class="subsection">5.13&#XA0;&#XA0;Version 0.6alpha (2004-05-31)</h3><!--SEC END -->
346<!--TOC subsection id=sec19 Version 0.5beta (2003-12-01)-->
347<h3 id="sec19" class="subsection">5.14&#XA0;&#XA0;Version 0.5beta (2003-12-01)</h3><!--SEC END -->
348<!--TOC subsection id=sec20 Version 0.4alpha (2003-11-13)-->
349<h3 id="sec20" class="subsection">5.15&#XA0;&#XA0;Version 0.4alpha (2003-11-13)</h3><!--SEC END -->
350<!--TOC subsection id=sec21 Version 0.3alpha (2004-11-05)-->
351<h3 id="sec21" class="subsection">5.16&#XA0;&#XA0;Version 0.3alpha (2004-11-05)</h3><!--SEC END --><!--BEGIN NOTES document-->
352<hr class="footnoterule"><dl class="thefootnotes"><dt class="dt-thefootnotes">
353<a id="note1" href="#text1">*</a></dt><dd class="dd-thefootnotes"><div class="footnotetext">University of Tennessee Knoxville, Innovative
354Computing Laboratory</div>
355</dd></dl>
356<!--END NOTES-->
357<!--CUT END -->
358<!--HTMLFOOT-->
359<!--ENDHTML-->
360<!--FOOTER-->
361<hr style="height:2"><blockquote class="quote"><em>This document was translated from L<sup>A</sup>T<sub>E</sub>X by
362</em><a href="http://hevea.inria.fr/index.html"><em>H</em><em><span style="font-size:small"><sup>E</sup></span></em><em>V</em><em><span style="font-size:small"><sup>E</sup></span></em><em>A</em></a><em>.</em></blockquote></body>
363</html>
364

README.txt

1    
2            DARPA/DOE HPC Challenge Benchmark version 1.4.2
3            ***********************************************
4                           Piotr Luszczek (1)
5                           ==================
6                            October 12, 2012
7                            ================
8  
9
10
111  Introduction
12*=*=*=*=*=*=*=*
13
14   This is a suite of benchmarks that measure the performance of the
15processor, memory subsystem, and interconnect. For details, refer to the
16HPC Challenge web site (http://icl.cs.utk.edu/hpcc/).
17  In essence, HPC Challenge consists of a number of tests each of which
18measures performance of a different aspect of the system.
19  If you are familiar with the High Performance Linpack (HPL) benchmark
20code (see the HPL web site: http://www.netlib.org/benchmark/hpl/) then
21you can reuse the build script file (input for make(1) command) and the
22input file that you already have for HPL. The HPC Challenge benchmark
23includes HPL and uses its build script and input files with only slight
24modifications. The most important change must be made to the line that
25sets the TOPdir variable. For HPC Challenge, the variable's value should
26always be ../../.. regardless of what it was in the HPL build script
27file.
28
29
302  Compiling
31*=*=*=*=*=*=
32
33   The first step is to create a build script file that reflects
34characteristics of your machine. This file is reused by all the
35components of the HPC Challenge suite. The build script file should be
36created in the hpl directory. This directory contains instructions (the
37files README and INSTALL) on how to create the build script file for
38your system. The hpl/setup directory contains many examples of build
39script files. A recommended approach is to copy one of them to the hpl
40directory and, if it doesn't work, modify it.
41  The build script file has a name that starts with the Make. prefix and
42usually ends with a suffix that identifies the target system. For
43example, if the suffix chosen for the system is Unix, the file should be
44named Make.Unix.
45  To build the benchmark executable (for the system named Unix) type:
46make arch=Unix. This command should be run in the top directory (not in
47the hpl directory). It will look in the hpl directory for the build
48script file and use it to build the benchmark executable.
49  The runtime behavior of the HPC Challenge source code may be
50configured at compile time by defining a few C preprocessor symbols.
51They can be defined by adding appropriate options to the CCNOOPT and CCFLAGS
52make variables. The former controls options for source code files that
53need to be compiled without aggressive optimizations to ensure accurate
54generation of system-specific parameters. The latter applies to the rest
55of the files that need good compiler optimization for best performance.
56To define a symbol S, most compilers require the option -DS to
57be used. Currently, the following options are available in the
58HPC Challenge source code: 
59 
60 
61   - HPCC_FFT_235: if this symbol is defined the FFTE code (an FFT
62   implementation) will use vector sizes and processor counts that are
63   not limited to powers of 2. Instead, the vector sizes and processor
64   counts to be used will be a product of powers of 2, 3, and 5.
65 
66   - HPCC_FFTW_ESTIMATE: if this symbol is defined it will affect the
67   way the external FFTW library is called (it does not have any effect
68   if the FFTW library is not used). When this symbol is defined, the
69   FFTW planning routine is called with the FFTW_ESTIMATE flag (instead of
70   FFTW_MEASURE). This might result in worse performance results but
71   shorter execution time of the benchmark. Defining this symbol may
72   also reduce the memory fragmentation caused by FFTW's
73   planning routine.
74 
75   - HPCC_MEMALLCTR: if this symbol is defined a custom memory allocator
76   will be used to alleviate effects of memory fragmentation and allow
77   for larger data sets to be used which may result in obtaining better
78   performance.
79 
   - HPL_USE_GETPROCESSTIMES: if this symbol is defined then the
   Windows-specific GetProcessTimes() function will be used to measure
   the elapsed CPU time.

   - USE_MULTIPLE_RECV: if this symbol is defined then multiple
   non-blocking receives will be posted simultaneously. By default only
   one non-blocking receive is posted.

   - RA_SANDIA_NOPT: if this symbol is defined the HPC Challenge
   standard algorithm for Global RandomAccess will not be used. Instead,
   an alternative implementation from Sandia National Laboratories will
   be used. It routes messages in software across a virtual hypercube
   topology formed from MPI processes.

   - RA_SANDIA_OPT2: if this symbol is defined the HPC Challenge
   standard algorithm for Global RandomAccess will not be used. Instead,
   an alternative implementation from Sandia National Laboratories will
   be used. This implementation is optimized for processor counts that
   are powers of two. The optimizations are sorting of data before
   sending and unrolling the data update loop. If the number of
   processes is not a power of two then the code is the same as the one
   used with the RA_SANDIA_NOPT setting.

   - RA_TIME_BOUND_DISABLE: if this symbol is defined then the standard
   Global RandomAccess code will be used without time limits. This is
   discouraged for most runs because the standard algorithm tends to be
   slow for large array sizes due to a large overhead for short MPI
   messages.

   - USING_FFTW: if this symbol is defined the standard HPC Challenge
   FFT implementation (called FFTE) will not be used. Instead, the FFTW
   library will be called. Defining the USING_FFTW symbol is not
   sufficient: appropriate flags have to be added in the make script so
   that FFTW header files can be found at compile time and the FFTW
   libraries at link time.
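
  As a rough illustration of how a compile-time symbol such as
HPCC_FFT_235 could switch between the two size rules described above,
here is a minimal C sketch. The function names (is_power_of_two,
is_product_of_2_3_5, fft_size_allowed, fft_size_examples) are
hypothetical and are not taken from the FFTE or HPC Challenge sources;
only the symbol name and the -D convention come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when n is a power of two (1, 2, 4, 8, ...). */
bool is_power_of_two(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* True when n = 2^a * 3^b * 5^c for non-negative integers a, b, c. */
bool is_product_of_2_3_5(uint64_t n)
{
    static const uint64_t factors[] = {2, 3, 5};
    if (n == 0)
        return false;
    for (size_t i = 0; i < sizeof factors / sizeof factors[0]; i++)
        while (n % factors[i] == 0)
            n /= factors[i];
    return n == 1;
}

/* Hypothetical helper: the size rule is chosen by the preprocessor. */
bool fft_size_allowed(uint64_t n)
{
#ifdef HPCC_FFT_235
    return is_product_of_2_3_5(n);
#else
    return is_power_of_two(n);
#endif
}

/* Example values: 1000 = 2^3 * 5^3, 1001 = 7 * 11 * 13. */
void fft_size_examples(void)
{
    assert(is_power_of_two(1024));
    assert(!is_power_of_two(1000));
    assert(is_product_of_2_3_5(1000));
    assert(!is_product_of_2_3_5(1001));
}
```

  In this sketch, compiling with -DHPCC_FFT_235 (for example through
CCFLAGS) makes fft_size_allowed(1000) return true, while without the
symbol only power-of-two sizes such as 1024 pass.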


3  Runtime Configuration
*=*=*=*=*=*=*=*=*=*=*=*=

   The HPC Challenge benchmark is driven by a short input file named
hpccinf.txt that is almost the same as the input file for HPL
(customarily called HPL.dat). Refer to the file hpl/www/tuning.html for
details about the input file for HPL. A sample input file is included
with the HPC Challenge distribution.
  The differences between HPL's input file and HPC Challenge's input
file can be summarized as follows:

   - Lines 3 and 4 are ignored. The output is always appended to the
   file named hpccoutf.txt.
   - There are additional lines (starting with line 33) that may (but do
   not have to) be used to customize the HPC Challenge benchmark. They
   are described below.

  The additional lines in the HPC Challenge input file (compared to the
HPL input file) are:

   - Lines 33 and 34 describe additional matrix sizes to be used for
   running the PTRANS benchmark (one of the components of the
   HPC Challenge benchmark).
   - Lines 35 and 36 describe additional blocking factors to be used for
   running the PTRANS test.

  For completeness, here is the list of lines of the HPC Challenge
input file with a brief description of each:

   - Line 1: ignored
   - Line 2: ignored
   - Line 3: ignored
   - Line 4: ignored
   - Line 5: number of matrix sizes for HPL (and PTRANS)
   - Line 6: matrix sizes for HPL (and PTRANS)
   - Line 7: number of blocking factors for HPL (and PTRANS)
   - Line 8: blocking factors for HPL (and PTRANS)
   - Line 9: type of process ordering for HPL
   - Line 10: number of process grids for HPL (and PTRANS)
   - Line 11: numbers of process rows of each process grid for HPL (and
   PTRANS)
   - Line 12: numbers of process columns of each process grid for HPL
   (and PTRANS)
   - Line 13: threshold value not to be exceeded by scaled residual for
   HPL (and PTRANS)
   - Line 14: number of panel factorization methods for HPL
   - Line 15: panel factorization methods for HPL
   - Line 16: number of recursive stopping criteria for HPL
   - Line 17: recursive stopping criteria for HPL
   - Line 18: number of recursion panel counts for HPL
   - Line 19: recursion panel counts for HPL
   - Line 20: number of recursive panel factorization methods for HPL
   - Line 21: recursive panel factorization methods for HPL
   - Line 22: number of broadcast methods for HPL
   - Line 23: broadcast methods for HPL
   - Line 24: number of look-ahead depths for HPL
   - Line 25: look-ahead depths for HPL
   - Line 26: swap methods for HPL
   - Line 27: swapping threshold for HPL
   - Line 28: form of L1 for HPL
   - Line 29: form of U for HPL
   - Line 30: value that specifies whether equilibration should be used
   by HPL
   - Line 31: memory alignment for HPL
   - Line 32: ignored
   - Line 33: number of additional problem sizes for PTRANS
   - Line 34: additional problem sizes for PTRANS
   - Line 35: number of additional blocking factors for PTRANS
   - Line 36: additional blocking factors for PTRANS
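
  The list above can be condensed into a small lookup, which may be
handy when writing a script or tool that edits hpccinf.txt. The sketch
below is only an illustration: the type line_user, the constant
HPCC_INPUT_LINES and the function hpcc_input_line_user are hypothetical
names that do not exist in the HPC Challenge sources.

```c
#include <assert.h>

enum { HPCC_INPUT_LINES = 36 }; /* number of lines described in the list */

typedef enum {
    LINE_IGNORED,        /* present in the file but not used */
    LINE_HPL,            /* used by HPL only */
    LINE_HPL_AND_PTRANS, /* shared by HPL and PTRANS */
    LINE_PTRANS          /* used by PTRANS only */
} line_user;

/* Hypothetical helper: which component uses a 1-based input line. */
line_user hpcc_input_line_user(int line)
{
    assert(line >= 1 && line <= HPCC_INPUT_LINES);
    if (line <= 4 || line == 32)
        return LINE_IGNORED;
    if (line >= 33)
        return LINE_PTRANS;
    if ((line >= 5 && line <= 8) || (line >= 10 && line <= 13))
        return LINE_HPL_AND_PTRANS;
    return LINE_HPL; /* line 9 and lines 14 to 31 */
}
```

  For example, hpcc_input_line_user(6) reports that the matrix sizes on
line 6 are shared by HPL and PTRANS, while hpcc_input_line_user(34)
reports a PTRANS-only line.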


4  Running
*=*=*=*=*=

   The exact way to run the HPC Challenge benchmark depends on the MPI
implementation and system details. An example command to run the
benchmark could look like this: mpirun -np 4 hpcc. The meaning of the
command's components is as follows:

   - mpirun is the command that starts execution of an MPI code.
   Depending on the system, it might also be aprun, mpiexec, mprun, poe,
   or something appropriate for your computer.

   - -np 4 is the argument that specifies that 4 MPI processes should be
   started. The number of MPI processes should be large enough to
   accommodate all the process grids specified in the hpccinf.txt file.

   - hpcc is the name of the HPC Challenge executable to run.
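
  To pick a value for -np, one way to read "large enough to accommodate
all the process grids" is to take the largest product of process rows
and process columns (lines 11 and 12 of hpccinf.txt). The C sketch below
works under that assumption, namely that a grid with P rows and Q
columns needs P * Q processes; the function name min_mpi_processes is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smallest process count that fits every grid, assuming a grid with
 * rows[i] process rows and cols[i] process columns needs
 * rows[i] * cols[i] processes. Returns 0 when no grids are given.
 */
long min_mpi_processes(const int *rows, const int *cols, size_t ngrids)
{
    long needed = 0;
    assert(ngrids == 0 || (rows != NULL && cols != NULL));
    for (size_t i = 0; i < ngrids; i++) {
        long grid = (long)rows[i] * (long)cols[i];
        if (grid > needed)
            needed = grid;
    }
    return needed;
}
```

  With grids 1 x 4, 2 x 2 and 2 x 3 the function returns 6, so under
this assumption the run would need at least -np 6.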

  After the run, a file called hpccoutf.txt is created. It contains
results of the benchmark. This file should be uploaded through the web
form at the HPC Challenge website.


5  Source Code Changes across Versions (ChangeLog)
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=



5.1  Version 1.4.3 (2013-08-26)
===============================

   1. Increased the size of scratch vector for local FFT tests that was
   missed in the previous version (reported by SGI).
   2. Added Makefile for Blue Gene/P contributed by Vasil Tsanov.



5.2  Version 1.4.2 (2012-10-12)
===============================

   1. Increased sizes of scratch vectors for local FFT tests to account
   for runs on systems with large main memory (reported by IBM, SGI and
   Intel).
   2. Reduced vector size for local FFT tests due to larger scratch
   space needed.
   3. Added a type cast to prevent overflow of a 32-bit integer vector
   size in FFT data generation routine (reported by IBM).
   4. Fixed variable types to handle array sizes that overflow 32-bit
   integers in RandomAccess (reported by IBM and SGI).
   5. Changed time-bound code to be used by default in Global
   RandomAccess and allowed for it to be switched off with a compile
   time flag if necessary.
   6. Code cleanup to allow compilation without warnings of RandomAccess
   test.
   7. Changed communication code in PTRANS to avoid large message sizes
   that caused problems in some MPI implementations.
   8. Updated documentation in README.txt and README.html files.



5.3  Version 1.4.1 (2010-06-01)
===============================

   1. Added optimized variants of RandomAccess that use Linear
   Congruential Generator for random number generation.
   2. Made corrections to comments that provide definition of the
   RandomAccess test.
   3. Removed initialization of the main array from the timed section of
   optimized versions of RandomAccess.
   4. Fixed the length of the vector used to compute error when using
   MPI implementation from FFTW.
   5. Added global reduction to error calculation in MPI FFT to achieve
   more accurate error estimate.
   6. Updated documentation in README.



5.4  Version 1.4.0 (2010-03-26)
===============================

   1. Added new variant of RandomAccess that uses Linear Congruential
   Generator for random number generation.
   2. Rearranged the order of benchmarks so that HPL component runs last
   and may be aborted if the performance of other components was not
   satisfactory. RandomAccess is now first to assist in tuning the code.
   3. Added global initialization and finalization routines that allow
   external software and hardware components to be properly initialized
   and finalized without changing the rest of the HPCC testing harness.
   4. Lack of hpccinf.txt is no longer reported as an error but as a
   warning.



5.5  Version 1.3.2 (2009-03-24)
===============================

   1. Fixed memory leaks in G-RandomAccess driver routine.
   2. Made the check for 32-bit vector sizes in G-FFT optional. MKL
   allows for 64-bit vector sizes in its FFTW wrapper.
   3. Fixed memory bug in single-process FFT.
   4. Updated documentation (README).



5.6  Version 1.3.1 (2008-12-09)
===============================

   1. Fixed a dead-lock problem in FFT component due to use of wrong
   communicator.
   2. Fixed the 32-bit random number generator in PTRANS that was using
   64-bit routines from HPL.



5.7  Version 1.3.0 (2008-11-13)
===============================

   1. Updated HPL component to use HPL 2.0 source code

      1. Replaced 32-bit Pseudo Random Number Generator (PRNG) with a
      64-bit one.
      2. Replaced 3 numerical checks of the solution residual with a
      single one.
      3. Added support for 64-bit systems with large memory sizes
      (previously, index calculations would overflow 32-bit integers).

   2. Introduced a limit on FFT vector sizes so that they fit in a
   32-bit integer (only applicable when using FFTW version 2).



5.8  Version 1.2.0 (2007-06-25)
===============================

   1. Changes in the FFT component:

      1. Added flexibility in choosing vector sizes and processor
      counts: now the code can do powers of 2, 3, and 5 both
      sequentially and in parallel tests.
      2. FFTW can now run with ESTIMATE (not just MEASURE) flag: it
      might produce worse performance results but often reduces time to
      run the test and causes less memory fragmentation.

   2. Changes in the DGEMM component:

      1. Added more comprehensive checking of the numerical properties
      of the test's results.

   3. Changes in the RandomAccess component:

      1. Removed time-bound functionality: only runs that perform
      complete computation are now possible.
      2. Made the timing more accurate: main array initialization is not
      counted towards performance timing.
      3. Cleaned up the code: some non-portable C language constructs
      have been removed.
      4. Added new algorithms: new algorithms from Sandia based on
      hypercube network topology can now be chosen at compile time,
      which results in much better performance on many types of
      parallel systems.
      5. Fixed potential resource leaks by adding function calls
      required by the MPI standard.

   4. Changes in the HPL component:

      1. Cleaned up reporting of numerics: more accurate printing of
      scaled residual formula.

   5. Changes in the PTRANS component:

      1. Added randomization of virtual process grids to measure
      bandwidth of the network more accurately.

   6. Miscellaneous changes:

      1. Added better support for Windows-based clusters by taking
      advantage of Win32 API.
      2. Added custom memory allocator to deal with memory fragmentation
      on some systems.
      3. Added better reporting of configuration options in the output
      file.




5.9  Version 1.0.0 (2005-06-11)
===============================



5.10  Version 0.8beta (2004-10-19)
==================================



5.11  Version 0.8alpha (2004-10-15)
===================================



5.12  Version 0.6beta (2004-08-21)
==================================



5.13  Version 0.6alpha (2004-05-31)
===================================



5.14  Version 0.5beta (2003-12-01)
==================================



5.15  Version 0.4alpha (2003-11-13)
===================================



5.16  Version 0.3alpha (2004-11-05)
===================================

-----------------------------------------------------------------------

   This document was translated from LaTeX by HeVeA (2).
-----------------------------------


 (1) University of Tennessee Knoxville, Innovative Computing Laboratory

 (2) http://hevea.inria.fr/index.html
