vadim suhomlinov improvement of multiline software

Published on November 28, 2007

Author: george

Source: authorstream.com

Content

Enhancing Quality of Multi-threaded Software. Intel Threading Tools Use Cases:
2007
Vadim Sukhomlinov
Denys Kotlyarov

Agenda:
  Why Multi-threading?
  Multithreading and SW Lifecycle
  Intel Threading Tools

Slide 3:
How to double performance without burning up?
[Figure labels: P0 ~ f², Core, Die/Socket, f]
* Other brands and names may be claimed as the property of others.

Slide 4: Performance Through Parallelism
[Chart: performance relative to a 1.4 GHz Intel® Pentium® 4 Processor, average of SPECint2000 and SPECfp2000 rates, 2000 to 2008+, with a 3X forecast. Source: Intel, 2004]

Slide 5: Multicore is quickly becoming pervasive. Single-threaded apps will be left behind.
  Year   Desktop Performance   Server               Mobile Performance
  2005   Multicore shipping    Multicore shipping   Multicore shipping
  2006   >70%                  >70%                 >85%
  2007   >90%                  >90%                 ~100%
[Legend: Dual Core, Quad Core]
Projected run rate exiting the year. Source: Intel. * Source: IDC.
All products and dates are preliminary and subject to change without notice.

Slide 6: Growing Availability of Multithreaded SW
Activision (Ravensoft), Adobe, Algorithmics, Alias, Autodesk, Business Objects, Cakewalk, CodecPeople, Computer Associates, Corel (WordPerfect), Cyberlink, Discreet, IBM, id Software, Landmark, Macromedia, Mainconcept, Maxon, mental images, Microsoft (Office Suite), Midway, MSC, Novell SUSE, Oracle, Pegasus, Pinnacle, Pixar (Renderman), Paradigm, PTC, SAP, SAS, Siebel CRM, Signet, Skype, SLB, SnapStream, Sonic (Roxio), Sony, Steinberg, SunGard, Sybase, Symantec, Thomson, THQ, Ubisoft, UGS, Valve, Yahoo (Musicmatch)
Multithreading as Competitive Advantage

Slide 7: What is Parallelism?
Two or more processes or threads execute at the same time.
Parallelism for threading architectures:
  Multiple processes: communication through Inter-Process Communication (IPC)
  Single process, multiple threads: communication through shared memory

Slide 8: Threads: Benefits & Risks
Benefits:
  Competitive advantage for modern software
  Increased performance and better resource utilization, even on single-processor systems (for hiding latency and increasing throughput)
  Communication through shared memory is more efficient than IPC between processes
Risks:
  Increases complexity of the application
  Difficult to debug and test (data races, deadlocks, etc.)

Slide 9: Common Questions for SW Designers
  Where to thread?
  How long would it take to thread?
  How much re-design/effort is required?
  Is it worth threading a selected region?
  What should the expected speedup be?
  Will the performance meet expectations?
  Will it scale as more threads/data are added?
  Which threading model to use?
Threading is complex.

Threading Impact to Software Lifecycle:
  Requirements analysis and system specification: planning of properties, scalability.
  System and software design: complicated architecture development.
  Implementation and unit testing: new development paradigm, uncommon bugs and issues.
  Integration, system verification and validation: quality assurance against planned properties (scalability); testing obstacles.
  Operation support and maintenance: analysis and reproduction of customer issues, workload-specific performance bottlenecks, scalability degradation.
  Disposal.
Multithreading IS a design goal.
The cost of multithreading issues increases with each later phase in which they are found.

Slide 11: Create, Debug and Optimize Threaded Applications using Intel® Software Development Products
  Introduce Threads / Design: Leverage built-in threading support and highly optimized threaded libraries that enable performance gains even if an application isn't threaded.
  Correctness / Debug: Detect even latent programming challenges unique to parallel programming.
  Optimize / Tune: Tune for performance and scalability. Visualize threading issues to help focus threading optimization.
  Analysis: Analyze your application and identify multi-core performance bottlenecks and hotspots.
Intel has a broad toolset to help develop fast, reliable threaded applications.

Sequential Development Cycle:

Slide 13: Planning Scalability - Amdahl's Law
Upper bound of performance increase. Serial code limits scalability.
[Figure labels: n = 2, n = ∞]

Sequential Development Cycle:

Slide 15: SW Design: Parallel Programming Models
Functional decomposition:
  Task parallelism
  Divide the computation, then associate the data
  Independent tasks of the same problem
Data decomposition:
  Same operation performed on different data
  Divide data into pieces, then associate computation

Slide 16: Implementation: OpenMP Standard
Fork-join parallelism: the master thread spawns a team of threads as needed.
Parallelism is added incrementally: a sequential program evolves into a parallel program.

Implementation: OpenMP Parallelization:
Loop parallelism:

    void test(int first, int last) {
        /* Iterations are distributed across the thread team. */
        #pragma omp parallel for
        for (int i = first; i <= last; ++i) {
            a[i] = b[i] * c[i];
        }
    }

Each loop iteration is independent; order of execution does not matter.

Sequential code:

    if (x < 0) a = foo(x); else a = x + 5;
    b = bat(y);
    c = baz(x + y);
    j = a*b+c;

The same code with parallel sections:

    #pragma omp parallel sections
    {
        #pragma omp section
        if (x < 0) a = foo(x); else a = x + 5;
        #pragma omp section
        b = bat(y);
        #pragma omp section
        c = baz(x + y);
    }
    /* All sections have finished here, so a, b and c are ready. */
    j = a*b+c;

Assignments to 'a', 'b', and 'c' are independent.

Sequential Development Cycle:

Implementation: OpenMP support in Visual C++:
  A specification for multithreaded programs.
  It consists of a set of simple #pragmas and runtime routines, for example #pragma omp parallel.
  Most value, where? Parallelizing large loops with no loop-carried dependencies.
  Intel C++/Fortran implements the full OpenMP 2.5 standard with task extensions.
  Visual C++ 2005 implements the OpenMP 2.0 standard.
  http://www.openmp.org

Slide 20: Intel® Threading Building Blocks: Scalable Threads Faster
Description:
  Simplify threading for performance via a C++ template-based runtime library.
Usage:
  Implementation aid: easily introduce threading for utilizing multi-core platforms.
  Performance aid: use common algorithms tuned for performance and scalability.
  Quality aid: employ pre-packaged routines for common idioms and containers.
  Design aid: focus on a higher level of abstraction via tasks and scalable patterns.
Support:
  Intel®, Microsoft* and GNU* compilers
  APIs: OpenMP*, Windows* threads, POSIX* threads
  Special tools support: Intel® Thread Checker and Intel® Thread Profiler
Platforms: [not captured in the transcript]

Slide 21: Less code to achieve parallelism. Example: 2D Ray Tracing Application

Windows Threads version

Thread Setup and Initialization:

    CRITICAL_SECTION MyMutex, MyMutex2, MyMutex3;

    int get_num_cpus (void) {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        return (int)si.dwNumberOfProcessors;
    }

    int nthreads = get_num_cpus ();
    HANDLE *threads = (HANDLE *) alloca (nthreads * sizeof (HANDLE));
    InitializeCriticalSection (&MyMutex);
    InitializeCriticalSection (&MyMutex2);
    InitializeCriticalSection (&MyMutex3);

    /* Start one worker per processor; store the handle itself, not its address. */
    for (int i = 0; i < nthreads; i++) {
        DWORD id;
        threads[i] = CreateThread (NULL, 0, parallel_thread, (LPVOID)(INT_PTR) i, 0, &id);
    }
    /* Wait on each thread handle until the worker exits. */
    for (int i = 0; i < nthreads; i++) {
        WaitForSingleObject (threads[i], INFINITE);
    }

Parallel Task Scheduling and Execution:

    const int MINPATCH = 150;
    const int DIVFACTOR = 2;

    typedef struct work_queue_entry_s {
        patch pch;
        struct work_queue_entry_s *next;
    } work_queue_entry_t;

    work_queue_entry_t *work_queue_head = NULL;
    work_queue_entry_t *work_queue_tail = NULL;

    /* Recursively split a patch until it is small enough, then queue it. */
    void generate_work (patch* pchin) {
        int startx, stopx, starty, stopy;
        int xs, ys;
        startx = pchin->startx; stopx = pchin->stopx;
        starty = pchin->starty; stopy = pchin->stopy;
        if (((stopx-startx) >= MINPATCH) || ((stopy-starty) >= MINPATCH)) {
            int xpatchsize = (stopx-startx)/DIVFACTOR + 1;
            int ypatchsize = (stopy-starty)/DIVFACTOR + 1;
            for (ys = starty; ys <= stopy; ys += ypatchsize)
                for (xs = startx; xs <= stopx; xs += xpatchsize) {
                    patch pch;
                    pch.startx = xs;
                    pch.starty = ys;
                    pch.stopx = MIN(xs+xpatchsize-1, stopx);
                    pch.stopy = MIN(ys+ypatchsize-1, stopy);
                    generate_work (&pch);
                }
        } else {
            /* just trace this patch */
            work_queue_entry_t *q = (work_queue_entry_t *) malloc (sizeof (work_queue_entry_t));
            q->pch.starty = starty; q->pch.stopy = stopy;
            q->pch.startx = startx; q->pch.stopx = stopx;
            q->next = NULL;
            if (work_queue_head == NULL) {
                work_queue_head = q;
            } else {
                work_queue_tail->next = q;
            }
            work_queue_tail = q;
        }
    }

    void generate_worklist (void) {
        patch pch;
        pch.startx = startx; pch.stopx = stopx;
        pch.starty = starty; pch.stopy = stopy;
        generate_work (&pch);
    }

    /* Pop the next patch under the queue lock; returns false when the queue is empty. */
    bool schedule_thread_work (patch &pch) {
        EnterCriticalSection (&MyMutex3);
        work_queue_entry_t *q = work_queue_head;
        if (q != NULL) {
            pch = q->pch;
            work_queue_head = work_queue_head->next;
        }
        LeaveCriticalSection (&MyMutex3);
        return (q != NULL);
    }

    generate_worklist ();  /* called once, before the worker threads start */

    /* Worker entry point; CreateThread requires this signature. */
    DWORD WINAPI parallel_thread (LPVOID arg) {
        patch pch;
        while (schedule_thread_work (pch)) {
            for (int y = pch.starty; y <= pch.stopy; y++) {
                for (int x = pch.startx; x <= pch.stopx; x++) {
                    render_one_pixel (x, y);
                }
            }
            if (scene.displaymode == RT_DISPLAY_ENABLED) {
                EnterCriticalSection (&MyMutex3);
                for (int y = pch.starty; y <= pch.stopy; y++) {
                    GraphicsDrawRow(pch.startx-1, y-1, pch.stopx-pch.startx+1,
                        (unsigned char *) &global_buffer[((y-starty)*totalx+(pch.startx-startx))*3]);
                }
                LeaveCriticalSection (&MyMutex3);
            }
        }
        return 0;
    }

Intel® Threading Building Blocks version

Thread Setup and Initialization:

    #include "tbb/task_scheduler_init.h"
    #include "tbb/spin_mutex.h"

    tbb::task_scheduler_init init;
    tbb::spin_mutex MyMutex, MyMutex2;

Parallel Task Scheduling and Execution:

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range2d.h"

    class parallel_task {
    public:
        /* Renders one 2D block of pixels handed out by the TBB scheduler. */
        void operator() (const tbb::blocked_range2d<int> &r) const {
            for (int y = r.rows().begin(); y != r.rows().end(); ++y) {
                for (int x = r.cols().begin(); x != r.cols().end(); x++) {
                    render_one_pixel (x, y);
                }
            }
            if (scene.displaymode == RT_DISPLAY_ENABLED) {
                tbb::spin_mutex::scoped_lock lock (MyMutex2);
                for (int y = r.rows().begin(); y != r.rows().end(); ++y) {
                    GraphicsDrawRow(startx-1, y-1, totalx,
                        (unsigned char *) &global_buffer[(y-starty)*totalx*3]);
                }
            }
        }
        parallel_task () {}
    };

    tbb::parallel_for (tbb::blocked_range2d<int> (starty, stopy + 1, grain_size,
                                                  startx, stopx + 1, grain_size),
                       parallel_task ());

This example includes software developed by John E. Stone.
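The hand-rolled scheduling in the Windows Threads version (build a list of patches, then let each worker pop patches under a lock until none remain) can be sketched portably as follows. This is only an illustration under stated assumptions: `Patch`, `WorkQueue` and `square_all` are hypothetical names that do not appear in the deck, squaring array elements stands in for `render_one_pixel`, and `std::thread`/`std::mutex` (C++11, which postdates the deck) stand in for `CreateThread` and `CRITICAL_SECTION`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for the deck's `patch`: a half-open index range.
struct Patch {
    std::size_t begin;
    std::size_t end;
};

// Shared work queue guarded by a single mutex, mirroring the role of
// work_queue_head and MyMutex3 in the Windows Threads version.
class WorkQueue {
public:
    void push(Patch p) {
        std::lock_guard<std::mutex> lock(mutex_);
        patches_.push(p);
    }
    // Pops the next patch; returns false once the queue is empty.
    bool try_pop(Patch& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (patches_.empty()) return false;
        out = patches_.front();
        patches_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::queue<Patch> patches_;
};

// Squares every element using `nthreads` workers that pull patches of at
// most `patch_size` elements from the shared queue.
std::vector<long> square_all(std::vector<long> data, unsigned nthreads,
                             std::size_t patch_size) {
    assert(nthreads > 0 && patch_size > 0);
    WorkQueue queue;
    // Fill the queue before any worker starts, like generate_worklist().
    for (std::size_t b = 0; b < data.size(); b += patch_size) {
        queue.push(Patch{b, std::min(b + patch_size, data.size())});
    }
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&queue, &data] {
            Patch p;
            // Patches never overlap, so element writes need no lock.
            while (queue.try_pop(p)) {
                for (std::size_t i = p.begin; i < p.end; ++i) {
                    data[i] *= data[i];
                }
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return data;
}
```

Calling `square_all` with the values 1 to 5, three workers and a patch size of 2 yields 1, 4, 9, 16, 25 regardless of which worker handles which patch. Even in this reduced form, the queue, the lock and the thread lifetime are all the caller's responsibility, which is the overhead the TBB version hands to the library.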
Focus on the work to do, not on how to manage threads (thread control).
Intel® TBB offers cleaner design, competitive performance and platform portability.

Sequential Development Cycle:

Slide 23: Intel® Thread Checker 3.0 for Windows* and Linux*: Create Threads Faster
  Detects challenging data races and deadlocks
  Pinpoints errors to the source code line
  Works on standard debug builds without recompiling
  Supports 32-bit and 64-bit applications
  Batch scripts integration for regression test runs
  Recommends modules to instrument by usage, to minimize instrumentation overhead
  Windows*: supports Microsoft Visual Studio 2005*
  Linux*: introduction of native Linux* support through command line views
Intel Confidential - NDA Required

Slide 24: Debugging for Correctness
Intel® Thread Checker pinpoints notorious threading bugs like data races, stalls and deadlocks.
[Diagram labels: Intel® Thread Checker, VTune™ Performance Analyzer; Primes.exe, Binary Instrumentation, Primes.exe (Instrumented) + DLLs (Instrumented), Runtime Data Collector, threadchecker.thr (result file)]

Slide 25:
Pinpoints source code

Sequential Development Cycle:

Slide 27: Common Performance Issues
  Parallel overhead: due to thread creation, scheduling, and so on
  Synchronization: excessive use of global data, contention for the same synchronization object
  Load imbalance: improper distribution of parallel work
  Granularity: insufficient parallel work

Slide 28: Intel® Thread Profiler 3.0 for Windows*: Optimize Threads Faster
Key Benefits:
  Shows how much of your application is not optimally parallel, and where
  Identifies where thread-specific overhead impacts performance
  Highlights thread workload imbalances and thread activity
  Shows the number of cores utilized
  Pinpoints issues to the source code line
  Maximizes application time spent in parallel regions
  Supports 32-bit and 64-bit applications
  Supports Microsoft Visual Studio 2005*

Slide 29: Tuning for Performance
Thread Profiler pinpoints performance bottlenecks in threaded applications.
[Diagram labels: Binary Instrumentation: Primes.exe, Primes.exe (Instrumented) + DLLs (Instrumented), Runtime Data Collector, Bistro.tp/guide.gvs (result file); Compiler Source Instrumentation: Primes.c, /Qopenmp_profile, Primes.exe]

Slide 30: Intel® Thread Profiler: critical path analysis
  Each duration on the critical path points to the single thread that limits program performance.
  Time spent in transition between threads is the overhead time to switch and synchronize threads.
  Shortening the execution segments that lie on the critical path improves application performance and the efficiency of system resource utilization.

Evolutionary Development:
Develop a system gradually in many repetitive stages, increasing the knowledge of the system requirements and system functionality in each stage and exposing the results to user comments.
This can be achieved by using:
  The Iterative Model
  The Incremental Model
  The Prototyping Model
Now we can get feedback from the previous stage: iterative implementation.

Iterative Model:

Incremental Model:

Prototyping Model:

Spiral Model:

Slide 36: Summary
  Multithreading IS a competitive advantage
  Multithreading IS complex
  Multithreading impacts ALL phases of the SW lifecycle
  Intel delivers several software developer products designed to make multi-threading easier and faster:
    Intel Thread Checker
    Intel Thread Profiler
    Intel Threading Building Blocks
    VTune Performance Analyzer
Try the Intel software developer tools today!
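The Amdahl's Law slide names the bound ("upper bound of performance increase", with the cases n = 2 and n = ∞) but the transcript does not carry the formula. As a hedged illustration, the usual textbook form is speedup(n) = 1 / ((1 - p) + p / n), where p is the fraction of run time that can execute in parallel and n is the number of cores. The function names and the sample fractions below are assumptions chosen for illustration and are not taken from the deck.

```cpp
#include <cassert>

// Amdahl's Law: upper bound on speedup with n cores when a fraction p
// (0 <= p <= 1) of the run time can execute in parallel.
double amdahl_speedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

// Limit of the bound as n grows without limit: only the serial
// fraction (1 - p) remains, so the speedup cannot exceed 1 / (1 - p).
double amdahl_limit(double p) {
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

For a program that is half serial (p = 0.5), two cores give at most about 1.33x and no number of cores gives more than 2x, which is the sense in which serial code limits scalability.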
