DEFINITION OF COMPUTING UNITS

CPU: The "brain" of the computer, responsible for all computation, loading of data, and so on.

Core: An individual processing unit inside the CPU. A dual-core CPU can be thought of as two separate CPUs in a single package, each with its own dedicated registers and cache. [This isn't entirely accurate, but it is a useful simplification.]

Process: A process can be thought of as a running program. When you start a program, the operating system creates a process for it, and all memory is allocated at the process level.

Thread: A thread is a unit of execution within a process. One process can have many threads: for instance, a thread to handle user input, a thread for program control, a few threads for AI management, a thread for audio, and so on, all living inside a single process. In most modern operating systems the thread is the smallest unit of execution; the OS schedules threads [usually based on priority], and the CPU spends some time operating on one thread before swapping in another, giving the illusion that multiple things happen at the same time. On a multi-CPU or multi-core system, multiple threads really can run at the same time, which is why there is an increasing focus on software parallelization.

WHEN WILL THIS BECOME IMPORTANT?

In OpenMP (with the Intel runtime), the KMP_AFFINITY environment variable controls how the "software" threads are distributed across the "hardware" (physical) threads; with hyper-threading enabled, each core exposes two hardware threads, so the hardware thread count is double the core count. A small program that makes the process/thread picture concrete is sketched below.
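Here is a minimal sketch in C (assuming GCC with OpenMP support on Linux; the file name affinity_demo.c is only for illustration): one process starts, the parallel region spawns a team of threads inside it, and each thread reports which logical CPU it is currently running on.
-------------------------------------------------------------------
#define _GNU_SOURCE            /* needed for sched_getcpu() on glibc */
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
    /* One process; the parallel region below spawns a team of threads
       inside that process. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();  /* software thread id within the team   */
        int cpu = sched_getcpu();        /* logical CPU the thread is running on */
        printf("thread %d of %d is running on logical CPU %d\n",
               tid, omp_get_num_threads(), cpu);
    }
    return 0;
}
-------------------------------------------------------------------
Built with something like "gcc -fopenmp affinity_demo.c -o affinity_demo", it prints one line per software thread, together with the hardware thread each one landed on.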
To see how many hardware threads a machine exposes, run lscpu. On the machine used here it reports:
-------------------------------------------------------------------
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 60
Model name:            Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz
Stepping:              3
CPU MHz:               2492.382
CPU max MHz:           3500.0000
CPU min MHz:           800.0000
BogoMIPS:              4988.57
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
-------------------------------------------------------------------
Here we can see there is 1 socket with 4 cores, each core running 2 hardware threads, for 8 logical CPUs in total. However, there are different ways to distribute your software threads over these CPUs.

2.1. Compact Scheduling

The first option is often referred to as "compact" scheduling. It keeps all of your threads running on a single physical processor if possible, and this is what you would want if all of the threads in your application need to repeatedly access different parts of a large array. This is because all of the cores on the same physical processor can access the memory banks associated with (or "owned by") that processor at the same speed. However, cores cannot access memory stored on memory banks owned by a different processor as quickly; this phenomenon is called NUMA (non-uniform memory access). If your threads all need to access data stored in the memory owned by one processor, it is often best to put all of your threads on the processor that owns that memory.

2.2. Round-Robin Scheduling

The second option is called "scatter" or "round-robin" scheduling, and it is ideal if your threads are largely independent of each other and do not need to access a lot of memory that other threads need. The benefit of round-robin thread scheduling is that not all threads have to share the same memory channel and cache, effectively doubling the memory bandwidth and cache size available to your application. The tradeoff is that memory latency becomes higher, as threads may have to access memory owned by another processor.

Reference: http://www.glennklockwood.com/hpc-howtos/process-affinity.html
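As a rough illustration of how the two policies are requested in practice (a sketch, assuming the Intel OpenMP runtime and the affinity_demo program from above), the placement is chosen through environment variables; the "verbose" modifier makes the runtime print the resulting thread-to-CPU binding:
-------------------------------------------------------------------
export OMP_NUM_THREADS=4

# compact: pack the threads onto neighbouring hardware threads
KMP_AFFINITY=verbose,granularity=fine,compact ./affinity_demo

# scatter: distribute the threads round-robin across the cores
KMP_AFFINITY=verbose,granularity=fine,scatter ./affinity_demo

# roughly equivalent portable controls (OpenMP 4.0 and later)
OMP_PROC_BIND=close  ./affinity_demo
OMP_PROC_BIND=spread ./affinity_demo
-------------------------------------------------------------------
On the 4-core, 2-way hyper-threaded CPU shown above, "compact" will typically pack the 4 threads onto 2 physical cores (both hardware threads of each), while "scatter" will typically give each thread its own core.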
Author: Shaowu Pan