ParallelDevicePerformanceProfile¶
- class ParallelDevicePerformanceProfile(configuration, processes_and_threads, equilibrium_methods=None, non_equilibrium_methods=None)¶
Class for performing timing and memory profiles of the different methods available for calculating the Green’s function and lesser Green’s function.
- Parameters:
configuration (DeviceConfiguration) – The device configuration with an attached calculator to profile.
processes_and_threads (list of tuples of int) – The configurations of number of processes and threads to run, as a list of tuples. E.g., [(1, 1), (2, 4)] will run a calculation with 1 MPI process and 1 thread per process, and a calculation with 2 MPI processes and 4 threads per process.
equilibrium_methods (GreensFunction | SparseGreensFunction | sequence of (GreensFunction | SparseGreensFunction)) – The methods benchmarked for the equilibrium calculation. If no methods should be benchmarked for the equilibrium calculation, an empty list can be specified. Default: (GreensFunction, SparseGreensFunction)
non_equilibrium_methods (GreensFunction | SparseGreensFunction | sequence of (GreensFunction | SparseGreensFunction)) – The methods benchmarked for the non-equilibrium calculation. If no methods should be benchmarked for the non-equilibrium calculation, an empty list can be specified. Default: (GreensFunction, SparseGreensFunction)
- equilibriumMethods()¶
- Returns:
The equilibrium methods profiled.
- Return type:
tuple of (GreensFunction | SparseGreensFunction)
- generateScript(temporary_filename)¶
Generate a script to run a DevicePerformanceProfile.
- nlprint(stream=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)¶
Print out the profiling report.
- Parameters:
stream (file) – The stream to write to. The stream must support strings being written to it using write. Default: sys.stdout
- nonEquilibriumMethods()¶
- Returns:
The non-equilibrium methods profiled.
- Return type:
tuple of (GreensFunction | SparseGreensFunction)
- processesAndThreads()¶
- Returns:
The list of processes and threads configurations profiled.
- Return type:
tuple of (int, int)
- runAsSubProcess()¶
Run a series of DevicePerformanceProfile calculations, one for each combination of processes and threads provided.
- uniqueString()¶
Return a unique string representing the state of the object.
Notes¶
This object extends the functionality of DevicePerformanceProfile by running
several profiles with different parallel settings, defined by the input parameter
processes_and_threads. The aim is to help the user determine the best combination
of parallelization strategy and Green's function algorithm for the simulation of
a given device, without the need to run DevicePerformanceProfile several times.
The input parameters are the same as in DevicePerformanceProfile, except for
processes_and_threads. This parameter determines how many processes the
contour point calculation is parallelized over (mimicking the usage of
processes_per_contour_point), and how many threads per process are utilized.
For each method and each processes and threads configuration, the elapsed time and memory usage are reported.
Note
The profiler has to be launched in serial mode (i.e., without using mpiexec,
or using mpiexec -n 1). The parallel runs are spawned by the profiler itself.
To obtain reliable results, there should be enough resources available for the
spawned processes, i.e., the number of available physical cores should be at
least equal to the largest product of processes and threads. For example, if
2 processes and 2 threads are defined in processes_and_threads, then 4 physical
cores should be available. If fewer cores are available, the results may be
unreliable because the cores will be over-subscribed.
Note that some clusters do not support process spawning. On such systems
ParallelDevicePerformanceProfile will not work.
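The core requirement above can be sketched in plain Python (illustrative helper, not part of the QuantumATK API): each profile occupies processes × threads physical cores, so the machine must provide at least the largest such product.

```python
# Illustrative sketch (not part of the QuantumATK API): compute the number of
# physical cores needed to run a list of (processes, threads) configurations
# reliably. Each spawned profile occupies processes * threads cores.
def required_cores(processes_and_threads):
    """Largest number of physical cores occupied by any single profile."""
    return max(p * t for p, t in processes_and_threads)

configs = [(1, 1), (1, 2), (2, 1)]
print(required_cores(configs))  # -> 2
```

For the three configurations used in the example below, 2 physical cores suffice; adding (2, 4) to the list would raise the requirement to 8 cores.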
Usage Example¶
The following script will read a configuration from file and run a profile for 3 different combinations of processes and threads:
(1, 1): A single process per contour point and a single thread.
(1, 2): A single process per contour point and two threads per process.
(2, 1): Two processes per contour point and a single thread per process.
device_configuration = nlread('device.hdf5', DeviceConfiguration)[-1]
profile = ParallelDevicePerformanceProfile(
device_configuration,
processes_and_threads=[(1, 1), (1, 2), (2, 1)])
nlprint(profile)
Here is an example of the output you might get, divided into sections.
First comes a header listing the available profiles, corresponding to the different
entries of processes_and_threads:
+------------------------------------------------------------------------------+
| Parallel Device Performance Profile |
| |
| 3 available device performance profiles: |
| 1 process 1 thread |
| 1 process 2 threads |
| 2 processes 1 thread |
+------------------------------------------------------------------------------+
A detailed report of memory and timing follows for each profile, with the same structure as the output of DevicePerformanceProfile.
+------------------------------------------------------------------------------+
| Device Performance Profile (1) |
| 1 process 1 thread |
+------------------------------------------------------------------------------+
| Contour point timing (s): |
| EQ NEQ |
| GreensFunction 12.46 18.41 |
| SparseGreensFunction 45.76 32.15 |
| |
| Fastest EQ method (by 3.7 times): GreensFunction |
| Fastest NEQ method (by 1.7 times): GreensFunction |
+------------------------------------------------------------------------------+
| Peak memory usage/process (MB): |
| EQ NEQ |
| GreensFunction 1901.38 3966.34 |
| SparseGreensFunction 3536.50 2596.16 |
| |
| Most memory-efficient EQ method (by 1.9 times): GreensFunction |
| Most memory-efficient NEQ method (by 1.5 times): SparseGreensFunction |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| Device Performance Profile (2) |
| 1 process 2 threads |
+------------------------------------------------------------------------------+
| Contour point timing (s): |
| EQ NEQ |
| GreensFunction 8.24 13.19 |
| SparseGreensFunction 39.78 30.08 |
| |
| Fastest EQ method (by 4.8 times): GreensFunction |
| Fastest NEQ method (by 2.3 times): GreensFunction |
+------------------------------------------------------------------------------+
| Peak memory usage/process (MB): |
| EQ NEQ |
| GreensFunction 1887.05 3985.32 |
| SparseGreensFunction 3525.32 2629.77 |
| |
| Most memory-efficient EQ method (by 1.9 times): GreensFunction |
| Most memory-efficient NEQ method (by 1.5 times): SparseGreensFunction |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| Device Performance Profile (3) |
| 2 processes 1 thread |
+------------------------------------------------------------------------------+
| Contour point timing (s): |
| EQ NEQ |
| GreensFunction 9.11 21.22 |
| SparseGreensFunction 39.42 34.22 |
| |
| Fastest EQ method (by 4.3 times): GreensFunction |
| Fastest NEQ method (by 1.6 times): GreensFunction |
+------------------------------------------------------------------------------+
| Peak memory usage/process (MB): |
| EQ NEQ |
| GreensFunction 1450.88 3919.13 |
| SparseGreensFunction 2224.71 1982.50 |
| |
| Most memory-efficient EQ method (by 1.5 times): GreensFunction |
| Most memory-efficient NEQ method (by 2.0 times): SparseGreensFunction |
+------------------------------------------------------------------------------+
At the bottom we have a summary of time and memory consumption for the different processes and threads configurations.
+------------------------------------------------------------------------------+
| Summary Report |
| |
| The best and worst case scenario for resource occupation (time and peak |
| memory) is reported for each method. The resource occupation is normalized |
| to the number of physical cores utilized, here referred to as Processing |
| Units (PU). |
+------------------------------------------------------------------------------+
| Equilibrium GreensFunction: |
| Time*PU (s) Memory/PU (MB) |
| 1 process 1 thread (1 PU) 12.46 (best) 1901.38 (worst) |
| 1 process 2 threads (2 PU) 16.47 943.52 (best) |
| 2 processes 1 thread (2 PU) 18.23 (worst) 1450.88 |
+------------------------------------------------------------------------------+
| Equilibrium SparseGreensFunction: |
| Time*PU (s) Memory/PU (MB) |
| 1 process 1 thread (1 PU) 45.76 (best) 3536.50 (worst) |
| 1 process 2 threads (2 PU) 79.57 (worst) 1762.66 (best) |
| 2 processes 1 thread (2 PU) 78.83 2224.71 |
+------------------------------------------------------------------------------+
| Non-equilibrium GreensFunction: |
| Time*PU (s) Memory/PU (MB) |
| 1 process 1 thread (1 PU) 18.41 (best) 3966.34 (worst) |
| 1 process 2 threads (2 PU) 26.38 1992.66 (best) |
| 2 processes 1 thread (2 PU) 42.44 (worst) 3919.13 |
+------------------------------------------------------------------------------+
| Non-equilibrium SparseGreensFunction: |
| Time*PU (s) Memory/PU (MB) |
| 1 process 1 thread (1 PU) 32.15 (best) 2596.16 (worst) |
| 1 process 2 threads (2 PU) 60.15 1314.88 (best) |
| 2 processes 1 thread (2 PU) 68.44 (worst) 1982.50 |
+------------------------------------------------------------------------------+
In the summary, both the memory and the time are normalized to the number of processing units (PU) utilized. The number of PUs is defined as the product of the number of processes and the number of threads, and it can be interpreted simply as the number of physical cores utilized.
The right column reports the peak memory per PU. The left column reports the total CPU time, defined as the wallclock time multiplied by the number of PUs utilized.
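The normalization can be sketched in plain Python (illustrative helpers, not part of the QuantumATK API), using the per-point wallclock time and the peak memory per process reported in the individual profiles above:

```python
# Illustrative sketch of the summary-report normalization (not part of the
# QuantumATK API). Each profile measures, per contour point, the wallclock
# time and the peak memory per process for a given (processes, threads) setting.
def processing_units(processes, threads):
    # PUs = physical cores utilized = processes * threads.
    return processes * threads

def time_times_pu(wallclock_s, processes, threads):
    # Total CPU time: wallclock time multiplied by the PUs engaged.
    return wallclock_s * processing_units(processes, threads)

def memory_per_pu(peak_mb_per_process, processes, threads):
    # Total memory is the peak memory per process times the number of
    # processes; dividing by the PUs gives the per-core footprint.
    return peak_mb_per_process * processes / processing_units(processes, threads)

# Equilibrium GreensFunction, 1 process / 2 threads (values from profile (2)):
print(time_times_pu(8.24, 1, 2))     # 16.48 (report shows 16.47, from unrounded data)
print(memory_per_pu(1887.05, 1, 2))  # 943.525
```

Note that for the 2 processes / 1 thread case the memory per PU equals the peak memory per process (1450.88 MB), since both processes contribute one core each.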
Interpreting the Summary Report¶
The actual simulation wallclock time and memory consumption depend on
how many contour points can be calculated simultaneously, so it is not possible to give
a general estimate upfront. The user should therefore know how to interpret the quantities
reported in the summary, especially when comparing configurations with different numbers of PUs.
As an example, consider an equilibrium calculation with the GreensFunction method. From the summary we read:
+------------------------------------------------------------------------------+
| Equilibrium GreensFunction: |
| Time*PU (s) Memory/PU (MB) |
| 1 process 1 thread (1 PU) 12.46 (best) 1901.38 (worst) |
| 1 process 2 threads (2 PU) 16.47 943.52 (best) |
| 2 processes 1 thread (2 PU) 18.23 (worst) 1450.88 |
+------------------------------------------------------------------------------+
Assume that 48 contour points will be calculated, and that the calculation will run on 48 physical
cores. In this case a contour integration will take roughly the time indicated in the Time*PU
column (i.e., 12.46 s for 1 process and 1 thread, 16.47 s for 1 process and 2 threads, etc.).
The total memory usage is obtained by multiplying the quantities in
the memory column by 48, because all PUs will be engaged in the calculation. In this case the
best and worst labels for time and memory are faithful.
But what if, for example, 96 physical cores are available? The calculation launched with 1 process and 1 thread per contour point will leave some resources idle, because it utilizes at most one physical core per contour point. The wallclock time will still be approximately 12.46 s, with no speedup, and similarly the total memory will still be given by (1901 * 48) MB.
Both configurations with 2 PUs will be able to run all contour points
simultaneously, thus utilizing twice the number of physical cores compared to the case with 1 PU.
The wallclock time will therefore be approximately half the value in the Time*PU column (i.e.,
8.2 s for 1 process, 2 threads and 9.1 s for 2 processes, 1 thread).
The total memory consumption will be (943 * 96) MB for the 1 process, 2 threads case, and
(1451 * 96) MB for the 2 processes, 1 thread case. In this scenario, using 2 physical cores per contour
point is advantageous.
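The reasoning above can be sketched as a small estimator in plain Python (illustrative, not part of the QuantumATK API), assuming the contour points are distributed evenly over the available cores:

```python
# Illustrative estimate (not part of the QuantumATK API) of wallclock time and
# total memory when n_points contour points run on n_cores physical cores, for
# one (processes, threads) setting with per-point wallclock time t_point (s)
# and peak memory per process mem_proc (MB). Assumes n_cores >= processes * threads.
import math

def estimate(t_point, mem_proc, processes, threads, n_points, n_cores):
    pu = processes * threads
    concurrent = min(n_points, n_cores // pu)  # points running at once
    waves = math.ceil(n_points / concurrent)   # sequential batches of points
    wallclock = waves * t_point
    total_memory = concurrent * processes * mem_proc
    return wallclock, total_memory

# Equilibrium GreensFunction, 48 contour points on 96 cores.
# 1 process / 1 thread: at most 48 cores are used, the rest stay idle.
print(estimate(12.46, 1901.38, 1, 1, n_points=48, n_cores=96))
# 1 process / 2 threads: all 96 cores engaged, wallclock halves per point.
print(estimate(8.24, 1887.05, 1, 2, n_points=48, n_cores=96))
```

Under these assumptions the 2 PU configuration finishes all 48 points in one wave of 8.24 s, reproducing the conclusion above that using 2 physical cores per contour point is advantageous on 96 cores.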