Programming Massively Parallel Processors (English Edition, 2nd Edition)

Authors: David B. Kirk, Wen-mei W. Hwu
Publisher: China Machine Press
Tags: Programming; Computers/Networking

About the Authors

David B. Kirk is a member of the U.S. National Academy of Engineering and an NVIDIA Fellow, and formerly served as NVIDIA's Chief Scientist. He led the development of NVIDIA's graphics technology, turning it into one of today's most popular mass-entertainment platforms, and is one of the founders of CUDA. In 2002 he received the ACM SIGGRAPH Computer Graphics Achievement Award for his outstanding contributions to bringing high-performance computer graphics systems to the mass market. He holds B.S. and M.S. degrees in mechanical engineering from MIT and a Ph.D. in computer science from Caltech. Dr. Kirk is the inventor of 50 patents and patent applications related to graphics chip design, has published more than 50 papers on graphics processing technology, and is an authority on visual computing.

Wen-mei W. Hwu holds a Ph.D. in computer science from the University of California, Berkeley. He is the Jerry Sanders (founder of AMD) Chair Professor of Electrical and Computer Engineering at the Coordinated Science Laboratory of the University of Illinois at Urbana-Champaign (UIUC), co-director of the Universal Parallel Computing Research Center jointly funded by Microsoft and Intel, and principal investigator of the world's first NVIDIA CUDA Center of Excellence. Professor Hwu is a world-leading expert on parallel processor architecture and compilers, and serves as principal investigator for Blue Waters, the next-generation petascale supercomputer in the United States. He is a Fellow of the IEEE and of the ACM.

About the Book

In Programming Massively Parallel Processors (English Edition, 2nd Edition), part of the Classic Original Books series, the authors draw on many years of experience teaching parallel computing to explain, in a concise, intuitive, and practical way, the techniques needed to write parallel programs, and they use rich case studies to illustrate the full development process, from computational thinking through to an efficient, working parallel program. Compared with the first edition, this edition has been thoroughly revised and updated: it presents parallel programming more systematically, covers the basic parallel algorithm patterns, adds more background material, and introduces several new practical programming techniques and tools. The main additions are:

Parallel patterns: three new chapters on parallel patterns describe in detail the algorithms used in many parallel applications.

CUDA Fortran: a chapter that briefly introduces this programming interface to the CUDA architecture and illustrates CUDA programming through numerous examples.

OpenACC: a chapter on the open standard that expresses parallelism through compiler directives, simplifying the task of parallel programming.

Thrust: an abstraction layer on top of CUDA C/C++. This edition devotes a chapter to showing how the Thrust parallel template library can be used to build high-performance applications with minimal programming effort (see the sketch after this list).

C++ AMP: a programming interface developed by Microsoft to simplify massively parallel programming in the Windows environment.

NVIDIA's Kepler architecture: a discussion of the programming features of NVIDIA's high-performance, energy-efficient GPU architecture.
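To make the "minimal programming effort" point concrete, the following fragment is a small illustrative sketch (not taken from the book) of the Thrust style described above: two vectors are created and filled directly in GPU memory, and a single thrust::transform call adds them element by element, with device memory management and kernel launches handled by the library.

    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/fill.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>

    int main() {
        const int n = 1 << 20;
        // Vectors allocated in GPU global memory.
        thrust::device_vector<float> x(n), y(n);
        thrust::sequence(x.begin(), x.end());      // x = 0, 1, 2, ...
        thrust::fill(y.begin(), y.end(), 2.0f);    // y = 2, 2, 2, ...
        // y[i] = x[i] + y[i], executed on the device by a generated CUDA kernel.
        thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                          thrust::plus<float>());
        return 0;
    }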

Table of Contents

preface

acknowledgements

chapter 1 introduction

1.1 heterogeneous parallel computing

1.2 architecture of a modern gpu

1.3 why more speed or parallelism?

1.4 speeding up real applications

1.5 parallel programming languages and models

1.6 overarching goals

1.7 organization of the book

references

chapter 2 history of gpu computing

2.1 evolution of graphics pipelines

the era of fixed-function graphics pipelines

evolution of programmable real-time graphics

unified graphics and computing processors

2.2 gpgpu: an intermediate step

2.3 gpu computing

scalable gpus

recent developments

future trends

references and further reading

chapter 3 introduction to data parallelism and cuda c

3.1 data parallelism

3.2 cuda program structure

3.3 a vector addition kernel

3.4 device global memory and data transfer

3.5 kernel functions and threading

3.6 summary

function declarations

kernel launch

predefined variables

runtime api

3.7 exercises

references

chapter 4 data-parallel execution model

4.1 cuda thread organization

4.2 mapping threads to multidimensional data

4.3 matrix-matrix multiplication--a more complex kernel

4.4 synchronization and transparent scalability

4.5 assigning resources to blocks

4.6 querying device properties

4.7 thread scheduling and latency tolerance

4.8 summary

4.9 exercises

chapter 5 cuda memories

5.1 importance of memory access efficiency

5.2 cuda device memory types

5.3 a strategy for reducing global memory traffic

5.4 a tiled matrix-matrix multiplication kernel

5.5 memory as a limiting factor to parallelism

5.6 summary

5.7 exercises

chapter 6 performance considerations

6.1 warps and thread execution

6.2 global memory bandwidth

6.3 dynamic partitioning of execution resources

6.4 instruction mix and thread granularity

6.5 summary

6.6 exercises

references

chapter 7 floating-point considerations

7.1 floating-point format

normalized representation of m

excess encoding of e

7.2 representable numbers

7.3 special bit patterns and precision in ieee format

7.4 arithmetic accuracy and rounding

7.5 algorithm considerations

7.6 numerical stability

7.7 summary

7.8 exercises

references

chapter 8 parallel patterns: convolution

8.1 background

8.2 1d parallel convolution--a basic algorithm

8.3 constant memory and caching

8.4 tiled 1d convolution with halo elements

8.5 a simpler tiled 1d convolution--general caching

8.6 summary

8.7 exercises

chapter 9 parallel patterns: prefix sum

9.1 background

9.2 a simple parallel scan

9.3 work efficiency considerations

9.4 a work-efficient parallel scan

9.5 parallel scan for arbitrary-length inputs

9.6 summary

9.7 exercises

reference

chapter 10 parallel patterns: sparse matrix-vector multiplication

10.1 background

10.2 parallel spmv using csr

10.3 padding and transposition

10.4 using hybrid to control padding

10.5 sorting and partitioning for regularization

10.6 summary

10.7 exercises

references

chapter 11 application case study: advanced mri reconstruction

11.1 application background

11.2 iterative reconstruction

11.3 computing fhd

step 1: determine the kernel parallelism structure

step 2: getting around the memory bandwidth limitation

step 3: using hardware trigonometry functions

step 4: experimental performance tuning

11.4 final evaluation

11.5 exercises

references

chapter 12 application case study: molecular visualization and analysis

12.1 application background

12.2 a simple kernel implementation

12.3 thread granularity adjustment

12.4 memory coalescing

12.5 summary

12.6 exercises

references

chapter 13 parallel programming and computational thinking

13.1 goals of parallel computing

13.2 problem decomposition

13.3 algorithm selection

13.4 computational thinking

13.5 summary

13.6 exercises

references

chapter 14 an introduction to opencl

14.1 background

14.2 data parallelism model

14.3 device architecture

14.4 kernel functions

14.5 device management and kernel launch

14.6 electrostatic potential map in opencl

14.7 summary

14.8 exercises

references

chapter 15 parallel programming with openacc

15.1 openacc versus cuda c

15.2 execution model

15.3 memory model

15.4 basic openacc programs

parallel construct

loop construct

kernels construct

data management

asynchronous computation and data transfer

15.5 future directions of openacc

15.6 exercises

chapter 16 thrust: a productivity-oriented library for cuda

16.1 background

16.2 motivation

16.3 basic thrust features

iterators and memory space

interoperability

16.4 generic programming

16.5 benefits of abstraction

16.6 programmer productivity

robustness

real world performance

16.7 best practices

fusion

structure of arrays

implicit ranges

16.8 exercises

references

chapter 17 cuda fortran

17.1 cuda fortran and cuda c differences

17.2 a first cuda fortran program

17.3 multidimensional array in cuda fortran

17.4 overloading host/device routines with generic interfaces

17.5 calling cuda c via iso_c_binding

17.6 kernel loop directives and reduction operations

17.7 dynamic shared memory

17.8 asynchronous data transfers

17.9 compilation and profiling

17.10 calling thrust from cuda fortran

17.11 exercises

chapter 18 an introduction to c++ amp

18.1 core c++ amp features

18.2 details of the c++ amp execution model

explicit and implicit data copies

asynchronous operation

section summary

18.3 managing accelerators

18.4 tiled execution

18.5 c++ amp graphics features

18.6 summary

18.7 exercises

chapter 19 programming a heterogeneous computing cluster

19.1 background

19.2 a running example

19.3 mpi basics

19.4 mpi point-to-point communication types

19.5 overlapping computation and communication

19.6 mpi collective communication

19.7 summary

19.8 exercises

reference

chapter 20 cuda dynamic parallelism

20.1 background

20.2 dynamic parallelism overview

20.3 important details

launch environment configuration

api errors and launch failures

events

streams

synchronization scope

20.4 memory visibility

global memory

zero-copy memory

constant memory

texture memory

20.5 a simple example

20.6 runtime limitations

memory footprint

nesting depth

memory allocation and lifetime

ecc errors

streams

events

launch pool

20.7 a more complex example

linear bezier curves

quadratic bezier curves

bezier curve calculation (predynamic parallelism)

bezier curve calculation (with dynamic parallelism)

20.8 summary

reference

chapter 21 conclusion and future outlook

21.1 goals revisited

21.2 memory model evolution

21.3 kernel execution control evolution

21.4 core performance

21.5 programming environment

21.6 future outlook

references

appendix A: matrix multiplication host-only version source code

appendix B: gpu compute capabilities

index