Parallel Computer Architecture: A Hardware/Software Approach (English Edition, 2nd Edition)
Authors: David E. Culler, Jaswinder Pal Singh, with Anoop Gupta
Publisher: China Machine Press
Series: none
Copyright note: This is a print publication and is not available for online reading; please support the authorized edition.
Tags: none
ISBN: unknown · Publication date: n/a · Binding: n/a · Format: unknown · Pages: n/a · Word count: n/a

About the Author

No author biography is available for Parallel Computer Architecture: A Hardware/Software Approach (English Edition, 2nd Edition).

About the Book

David E. Culler and Jaswinder Pal Singh, with Anoop Gupta: Parallel Computer Architecture: A Hardware/Software Approach, Second Edition. Copyright © 1996 by Morgan Kaufmann Publishers, Inc. Harcourt Asia Pte Ltd, under special arrangement with Morgan Kaufmann, authorizes China Machine Press to print and exclusively distribute this edition, which is the only authorized complete and unabridged reproduction of the latest American edition, published and priced for sale in China only, not including Hong Kong SAR and Taiwan. Unauthorized export of this edition is a violation of the Copyright Act; violations are subject to civil and criminal penalties.

Table of Contents

Foreword

Preface

1 Introduction

1.1 Why Parallel Architecture

1.1.1 Application Trends

1.1.2 Technology Trends

1.1.3 Architectural Trends

1.1.4 Supercomputers

1.1.5 Summary

1.2 Convergence of Parallel Architectures

1.2.1 Communication Architecture

1.2.2 Shared Address Space

1.2.3 Message Passing

1.2.4 Convergence

1.2.5 Data Parallel Processing

1.2.6 Other Parallel Architectures

1.2.7 A Generic Parallel Architecture

1.3 Fundamental Design Issues

1.3.1 Communication Abstraction

1.3.2 Programming Model Requirements

1.3.3 Communication and Replication

1.3.4 Performance

1.3.5 Summary

1.4 Concluding Remarks

1.5 Historical References

1.6 Exercises

2 Parallel Programs

2.1 Parallel Application Case Studies

2.1.1 Simulating Ocean Currents

2.1.2 Simulating the Evolution of Galaxies

2.1.3 Visualizing Complex Scenes Using Ray Tracing

2.1.4 Mining Data for Associations

2.2 The Parallelization Process

2.2.1 Steps in the Process

2.2.2 Parallelizing Computation versus Data

2.2.3 Goals of the Parallelization Process

2.3 Parallelization of an Example Program

2.3.1 The Equation Solver Kernel

2.3.2 Decomposition

2.3.3 Assignment

2.3.4 Orchestration under the Data Parallel Model

2.3.5 Orchestration under the Shared Address Space Model

2.3.6 Orchestration under the Message-Passing Model

2.4 Concluding Remarks

2.5 Exercises

3 Programming for Performance

3.1 Partitioning for Performance

3.1.1 Load Balance and Synchronization Wait Time

3.1.2 Reducing Inherent Communication

3.1.3 Reducing the Extra Work

3.1.4 Summary

3.2 Data Access and Communication in a Multimemory System

3.2.1 A Multiprocessor as an Extended Memory Hierarchy

3.2.2 Artifactual Communication in the Extended Memory Hierarchy

3.2.3 Artifactual Communication and Replication: The Working Set Perspective

3.3 Orchestration for Performance

3.3.1 Reducing Artifactual Communication

3.3.2 Structuring Communication to Reduce Cost

3.4 Performance Factors from the Processor's Perspective

3.5 The Parallel Application Case Studies: An In-Depth Look

3.5.1 Ocean

3.5.2 Barnes-Hut

3.5.3 Raytrace

3.5.4 Data Mining

3.6 Implications for Programming Models

3.6.1 Naming

3.6.2 Replication

3.6.3 Overhead and Granularity of Communication

3.6.4 Block Data Transfer

3.6.5 Synchronization

3.6.6 Hardware Cost and Design Complexity

3.6.7 Performance Model

3.6.8 Summary

3.7 Concluding Remarks

3.8 Exercises

4 Workload-Driven Evaluation

4.1 Scaling Workloads and Machines

4.1.1 Basic Measures of Multiprocessor Performance

4.1.2 Why Worry about Scaling?

4.1.3 Key Issues in Scaling

4.1.4 Scaling Models and Speedup Measures

4.1.5 Impact of Scaling Models on the Equation Solver Kernel

4.1.6 Scaling Workload Parameters

4.2 Evaluating a Real Machine

4.2.1 Performance Isolation Using Microbenchmarks

4.2.2 Choosing Workloads

4.2.3 Evaluating a Fixed-Size Machine

4.2.4 Varying Machine Size

4.2.5 Choosing Performance Metrics

4.3 Evaluating an Architectural Idea or Trade-off

4.3.1 Multiprocessor Simulation

4.3.2 Scaling Down Problem and Machine Parameters for Simulation

4.3.3 Dealing with the Parameter Space: An Example Evaluation

4.3.4 Summary

4.4 Illustrating Workload Characterization

4.4.1 Workload Case Studies

4.4.2 Workload Characteristics

4.5 Concluding Remarks

4.6 Exercises

5 Shared Memory Multiprocessors

5.1 Cache Coherence

5.1.1 The Cache Coherence Problem

5.1.2 Cache Coherence through Bus Snooping

5.2 Memory Consistency

5.2.1 Sequential Consistency

5.2.2 Sufficient Conditions for Preserving Sequential Consistency

5.3 Design Space for Snooping Protocols

5.3.1 A Three-State (MSI) Write-Back Invalidation Protocol

5.3.2 A Four-State (MESI) Write-Back Invalidation Protocol

5.3.3 A Four-State (Dragon) Write-Back Update Protocol

5.4 Assessing Protocol Design Trade-offs

5.4.1 Methodology

5.4.2 Bandwidth Requirement under the MESI Protocol

5.4.3 Impact of Protocol Optimizations

5.4.4 Trade-Offs in Cache Block Size

5.4.5 Update-Based versus Invalidation-Based Protocols

5.5 Synchronization

5.5.1 Components of a Synchronization Event

5.5.2 Role of the User and System

5.5.3 Mutual Exclusion

5.5.4 Point-to-Point Event Synchronization

5.5.5 Global (Barrier) Event Synchronization

5.5.6 Synchronization Summary

5.6 Implications for Software

5.7 Concluding Remarks

5.8 Exercises

6 Snoop-Based Multiprocessor Design

6.1 Correctness Requirements

6.2 Base Design: Single-Level Caches with an Atomic Bus

6.2.1 Cache Controller and Tag Design

6.2.2 Reporting Snoop Results

6.2.3 Dealing with Write Backs

6.2.4 Base Organization

6.2.5 Nonatomic State Transitions

6.2.6 Serialization

6.2.7 Deadlock

6.2.8 Livelock and Starvation

6.2.9 Implementing Atomic Operations

6.3 Multilevel Cache Hierarchies

6.3.1 Maintaining Inclusion

6.3.2 Propagating Transactions for Coherence in the Hierarchy

6.4 Split-Transaction Bus

6.4.1 An Example Split-Transaction Design

6.4.2 Bus Design and Request-Response Matching

6.4.3 Snoop Results and Conflicting Requests

6.4.4 Flow Control

6.4.5 Path of a Cache Miss

6.4.6 Serialization and Sequential Consistency

6.4.7 Alternative Design Choices

6.4.8 Split-Transaction Bus with Multilevel Caches

6.4.9 Supporting Multiple Outstanding Misses from a Processor

6.5 Case Studies: SGI Challenge and Sun Enterprise 6000

6.5.1 SGI Powerpath-2 System Bus

6.5.2 SGI Processor and Memory Subsystems

6.5.3 SGI I/O Subsystem

6.5.4 SGI Challenge Memory System Performance

6.5.5 Sun Gigaplane System Bus

6.5.6 Sun Processor and Memory Subsystem

6.5.7 Sun I/O Subsystem

6.5.8 Sun Enterprise Memory System Performance

6.5.9 Application Performance

6.6 Extending Cache Coherence

6.6.1 Shared Cache Designs

6.6.2 Coherence for Virtually Indexed Caches

6.6.3 Translation Lookaside Buffer Coherence

6.6.4 Snoop-Based Cache Coherence on Rings

6.6.5 Scaling Data and Snoop Bandwidth in Bus-Based Systems

6.7 Concluding Remarks

6.8 Exercises

7 Scalable Multiprocessors

7.1 Scalability

7.1.1 Bandwidth Scaling

7.1.2 Latency Scaling

7.1.3 Cost Scaling

7.1.4 Physical Scaling

7.1.5 Scaling in a Generic Parallel Architecture

7.2 Realizing Programming Models

7.2.1 Primitive Network Transactions

7.2.2 Shared Address Space

7.2.3 Message Passing

7.2.4 Active Messages

7.2.5 Common Challenges

7.2.6 Communication Architecture Design Space

7.3 Physical DMA

7.3.1 Node-to-Network Interface

7.3.2 Implementing Communication Abstractions

7.3.3 A Case Study: nCUBE/2

7.3.4 Typical LAN Interfaces

7.4 User-Level Access

7.4.1 Node-to-Network Interface

7.4.2 Case Study: Thinking Machines CM-5

7.4.3 User-Level Handlers

7.5 Dedicated Message Processing

7.5.1 Case Study: Intel Paragon

7.5.2 Case Study: Meiko CS-2

7.6 Shared Physical Address Space

7.6.1 Case Study: CRAY T3D

7.6.2 Case Study: CRAY T3E

7.6.3 Summary

7.7 Clusters and Networks of Workstations

7.7.1 Case Study: Myrinet SBUS Lanai

7.7.2 Case Study: PCI Memory Channel

7.8 Implications for Parallel Software

7.8.1 Network Transaction Performance

7.8.2 Shared Address Space Operations

7.8.3 Message-Passing Operations

7.8.4 Application-Level Performance

7.9 Synchronization

7.9.1 Algorithms for Locks

7.9.2 Algorithms for Barriers

7.10 Concluding Remarks

7.11 Exercises

8 Directory-Based Cache Coherence

8.1 Scalable Cache Coherence

8.2 Overview of Directory-Based Approaches

8.2.1 Operation of a Simple Directory Scheme

8.2.2 Scaling

8.2.3 Alternatives for Organizing Directories

8.3 Assessing Directory Protocols and Trade-Offs

8.3.1 Data Sharing Patterns for Directory Schemes

8.3.2 Local versus Remote Traffic

8.3.3 Cache Block Size Effects

8.4 Design Challenges for Directory Protocols

8.4.1 Performance

8.4.2 Correctness

8.5 Memory-Based Directory Protocols: The SGI Origin System

8.5.1 Cache Coherence Protocol

8.5.2 Dealing with Correctness Issues

8.5.3 Details of Directory Structure

8.5.4 Protocol Extensions

8.5.5 Overview of the Origin2000 Hardware

8.5.6 Hub Implementation

8.5.7 Performance Characteristics

8.6 Cache-Based Directory Protocols: The Sequent NUMA-Q

8.6.1 Cache Coherence Protocol

8.6.2 Dealing with Correctness Issues

8.6.3 Protocol Extensions

8.6.4 Overview of NUMA-Q Hardware

8.6.5 Protocol Interactions with SMP Node

8.6.6 IQ-Link Implementation

8.6.7 Performance Characteristics

8.6.8 Comparison Case Study: The HAL S1 Multiprocessor

8.7 Performance Parameters and Protocol Performance

8.8 Synchronization

8.8.1 Performance of Synchronization Algorithms

8.8.2 Implementing Atomic Primitives

8.9 Implications for Parallel Software

8.10 Advanced Topics

8.10.1 Reducing Directory Storage Overhead

8.10.2 Hierarchical Coherence

8.11 Concluding Remarks

8.12 Exercises

9 Hardware/Software Trade-Offs

9.1 Relaxed Memory Consistency Models

9.1.1 The System Specification

9.1.2 The Programmer's Interface

9.1.3 The Translation Mechanism

9.1.4 Consistency Models in Real Multiprocessor Systems

9.2 Overcoming Capacity Limitations

9.2.1 Tertiary Caches

9.2.2 Cache-Only Memory Architectures (COMA)

9.3 Reducing Hardware Cost

9.3.1 Hardware Access Control with a Decoupled Assist

9.3.2 Access Control through Code Instrumentation

9.3.3 Page-Based Access Control: Shared Virtual Memory

9.3.4 Access Control through Language and Compiler Support

9.4 Putting It All Together: A Taxonomy and Simple COMA

9.4.1 Putting It All Together: Simple COMA and Stache

9.5 Implications for Parallel Software

9.6 Advanced Topics

9.6.1 Flexibility and Address Constraints in CC-NUMA Systems

9.6.2 Implementing Relaxed Memory Consistency in Software

9.7 Concluding Remarks

9.8 Exercises

10 Interconnection Network Design

10.1 Basic Definitions

10.2 Basic Communication Performance

10.2.1 Latency

10.2.2 Bandwidth

10.3 Organizational Structure

10.3.1 Links

10.3.2 Switches

10.3.3 Network Interfaces

10.4 Interconnection Topologies

10.4.1 Fully Connected Network

10.4.2 Linear Arrays and Rings

10.4.3 Multidimensional Meshes and Tori

10.4.4 Trees

10.4.5 Butterflies

10.4.6 Hypercubes

10.5 Evaluating Design Trade-Offs in Network Topology

10.5.1 Unloaded Latency

10.5.2 Latency under Load

10.6 Routing

10.6.1 Routing Mechanisms

10.6.2 Deterministic Routing

10.6.3 Deadlock Freedom

10.6.4 Virtual Channels

10.6.5 Up*-Down* Routing

10.6.6 Turn-Model Routing

10.6.7 Adaptive Routing

10.7 Switch Design

10.7.1 Ports

10.7.2 Internal Datapath

10.7.3 Channel Buffers

10.7.4 Output Scheduling

10.7.5 Stacked Dimension Switches

10.8 Flow Control

10.8.1 Parallel Computer Networks versus LANs and WANs

10.8.2 Link-Level Flow Control

10.8.3 End-to-End Flow Control

10.9 Case Studies

10.9.1 CRAY T3D Network

10.9.2 IBM SP-1, SP-2 Network

10.9.3 Scalable Coherent Interface

10.9.4 SGI Origin Network

10.9.5 Myricom Network

10.10 Concluding Remarks

10.11 Exercises

11 Latency Tolerance

11.1 Overview of Latency Tolerance

11.1.1 Latency Tolerance and the Communication Pipeline

11.1.2 Approaches

11.1.3 Fundamental Requirements, Benefits, and Limitations

11.2 Latency Tolerance in Explicit Message Passing

11.2.1 Structure of Communication

11.2.2 Block Data Transfer

11.2.3 Precommunication

11.2.4 Proceeding Past Communication in the Same Thread

11.2.5 Multithreading

11.3 Latency Tolerance in a Shared Address Space

11.3.1 Structure of Communication

11.4 Block Data Transfer in a Shared Address Space

11.4.1 Techniques and Mechanisms

11.4.2 Policy Issues and Trade-Offs

11.4.3 Performance Benefits

11.5 Proceeding Past Long-Latency Events

11.5.1 Proceeding Past Writes

11.5.2 Proceeding Past Reads

11.5.3 Summary

11.6 Precommunication in a Shared Address Space

11.6.1 Shared Address Space without Caching of Shared Data

11.6.2 Cache-Coherent Shared Address Space

11.6.3 Performance Benefits

11.6.4 Summary

11.7 Multithreading in a Shared Address Space

11.7.1 Techniques and Mechanisms

11.7.2 Performance Benefits

11.7.3 Implementation Issues for the Blocked Scheme

11.7.4 Implementation Issues for the Interleaved Scheme

11.7.5 Integrating Multithreading with Multiple-Issue Processors

11.8 Lockup-Free Cache Design

11.9 Concluding Remarks

11.10 Exercises

12 Future Directions

12.1 Technology and Architecture

12.1.1 Evolutionary Scenario

12.1.2 Hitting a Wall

12.1.3 Potential Breakthroughs

12.2 Applications and System Software

12.2.1 Evolutionary Scenario

12.2.2 Hitting a Wall

12.2.3 Potential Breakthroughs

Appendix: Parallel Benchmark Suites

A.1 ScaLAPACK

A.2 TPC

A.3 SPLASH

A.4 NAS Parallel Benchmarks

A.5 PARKBENCH

A.6 Other Ongoing Efforts

References

Index