High Performance Spark (Reprint Edition)

Authors: Holden Karau, Rachel Warren
Publisher: Southeast University Press
Copyright note: This title is in the public domain or reproduced with the rights holder's permission; please support legitimate editions.
Tags: Computers/Networking; Software Engineering and Software Methodology
ISBN: Unknown
Publication date: N/A
Binding: N/A
Format: Unknown
Pages: 0
Word count: N/A

About the Authors

Holden Karau is a transgender Canadian who works as a software development engineer at the IBM Spark Technology Center. She is a Spark contributor and commits code frequently, especially to PySpark and the machine learning components. Holden has spoken on Spark-related topics at many international events.

Rachel Warren is a software engineer and data scientist at Alpine Data. In her day-to-day work she uses Spark to tackle real-world data and machine learning problems. She has also worked as an analyst and instructor in both industry and academia.

About the Book

This book describes techniques for reducing data infrastructure costs and development time, and is aimed at software engineers, data engineers, developers, and system administrators. You will not only gain a comprehensive understanding of Spark, but also learn how to make it run well. In this book you will discover:

* How Spark SQL's new interfaces improve performance over SQL's RDD data structure
* The choice between data joins in Core Spark and Spark SQL
* Techniques for getting the most out of standard RDD transformations
* How to work around performance issues in Spark's key/value pair paradigm
* Writing high-performance Spark code without Scala or the JVM
* How to test for functionality and performance when applying suggested improvements
* Using the Spark MLlib and Spark ML machine learning libraries
* Spark's streaming components and external community packages
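
To give a flavor of the kind of tuning the book teaches, here is a minimal Scala sketch of one Chapter 6 topic, the key/value pair paradigm: preferring reduceByKey over groupByKey for an aggregation. It is an illustration only, not an excerpt from the book; the sample data and the SparkContext named sc are hypothetical.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Sum counts per key. `sc` is an assumed, already-configured SparkContext.
    def wordTotals(sc: SparkContext): RDD[(String, Int)] = {
      val pairs: RDD[(String, Int)] =
        sc.parallelize(Seq(("spark", 1), ("scala", 2), ("spark", 3)))

      // pairs.groupByKey().mapValues(_.sum) would shuffle every value
      // across the network before summing; reduceByKey combines values
      // within each partition first, so far less data is shuffled.
      pairs.reduceByKey(_ + _)
    }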

Table of Contents

Preface

1.Introduction to High Performance Spark

What Is Spark and Why Performance Matters

What You Can Expect to Get from This Book

Spark Versions

Why Scala

To Be a Spark Expert You Have to Learn a Little Scala Anyway

The Spark Scala API Is Easier to Use Than the Java API

Scala Is More Performant Than Python

Why Not Scala

Learning Scala

Conclusion

2.How Spark Works

How Spark Fits into the Big Data Ecosystem

Spark Components

Spark Model of Parallel Computing: RDDs

Lazy Evaluation

In-Memory Persistence and Memory Management

Immutability and the RDD Interface

Types of RDDs

Functions on RDDs: Transformations Versus Actions

Wide Versus Narrow Dependencies

Spark Job Scheduling

Resource Allocation Across Applications

The Spark Application

The Anatomy of a Spark Job

The DAG

Jobs

Stages

Tasks

Conclusion

3.DataFrames, Datasets, and Spark SQL

Getting Started with the SparkSession (or HiveContext or SQLContext)

Spark SQL Dependencies

Managing Spark Dependencies

Avoiding Hive JARs

Basics of Schemas

DataFrame API

Transformations

Multi-DataFrame Transformations

Plain Old SQL Queries and Interacting with Hive Data

Data Representation in DataFrames and Datasets

Tungsten

Data Loading and Saving Functions

DataFrameWriter and DataFrameReader

Formats

Save Modes

Partitions (Discovery and Writing)

Datasets

Interoperability with RDDs, DataFrames, and Local Collections

Compile-Time Strong Typing

Easier Functional (RDD 'like') Transformations

Relational Transformations

Multi-Dataset Relational Transformations

Grouped Operations on Datasets

Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)

Query Optimizer

Logical and Physical Plans

Code Generation

Large Query Plans and Iterative Algorithms

Debugging Spark SQL Queries

JDBC/ODBC Server

Conclusion

4.Joins (SQL and Core)

Core Spark Joins

Choosing a Join Type

Choosing an Execution Plan

Spark SQL Joins

DataFrame Joins

Dataset Joins

Conclusion

5.Effective Transformations

Narrow Versus Wide Transformations

Implications for Performance

Implications for Fault Tolerance

The Special Case of coalesce

What Type of RDD Does Your Transformation Return

Minimizing Object Creation

Reusing Existing Objects

Using Smaller Data Structures

Iterator-to-Iterator Transformations with mapPartitions

What Is an Iterator-to-Iterator Transformation

Space and Time Advantages

An Example

Set Operations

Reducing Setup Overhead

Shared Variables

Broadcast Variables

Accumulators

Reusing RDDs

Cases for Reuse

Deciding if Recompute Is Inexpensive Enough

Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files

Alluxio (née Tachyon)

LRU Caching

Noisy Cluster Considerations

Interaction with Accumulators

Conclusion

6.Working with Key/Value Data

The Goldilocks Example

Goldilocks Version 0: Iterative Solution

How to Use PairRDDFunctions and OrderedRDDFunctions

Actions on Key/Value Pairs

What's So Dangerous About the groupByKey Function

Goldilocks Version 1: groupByKey Solution

Choosing an Aggregation Operation

Dictionary of Aggregation Operations with Performance Considerations

Multiple RDD Operations

Co-Grouping

Partitioners and Key/Value Data

Using the Spark Partitioner Object

Hash Partitioning

Range Partitioning

Custom Partitioning

Preserving Partitioning Information Across Transformations

Leveraging Co-Located and Co-Partitioned RDDs

Dictionary of Mapping and Partitioning Functions: PairRDDFunctions

Dictionary of OrderedRDDOperations

Sorting by Two Keys with SortByKey

Secondary Sort and repartitionAndSortWithinPartitions

Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function

How Not to Sort by Two Orderings

Goldilocks Version 2: Secondary Sort

A Different Approach to Goldilocks

Goldilocks Version 3: Sort on Cell Values

Straggler Detection and Unbalanced Data

Back to Goldilocks (Again)

Goldilocks Version 4: Reduce to Distinct on Each Partition

Conclusion

7.Going Beyond Scala

Beyond Scala within the JVM

Beyond Scala, and Beyond the JVM

How PySpark Works

How SparkR Works

Spark.jl (Julia Spark)

How Eclair JS Works

Spark on the Common Language Runtime (CLR): C# and Friends

Calling Other Languages from Spark

Using Pipe and Friends

JNI

Java Native Access (JNA)

Underneath Everything Is FORTRAN

Getting to the GPU

The Future

Conclusion

8.Testing and Validation

Unit Testing

General Spark Unit Testing

Mocking RDDs

Getting Test Data

Generating Large Datasets

Sampling

Property Checking with ScalaCheck

Computing RDD Difference

Integration Testing

Choosing Your Integration Testing Environment

Verifying Performance

Spark Counters for Verifying Performance

Projects for Verifying Performance

Job Validation

Conclusion

9.Spark MLlib and ML

Choosing Between Spark MLlib and Spark ML

Working with MLlib

Getting Started with MLlib (Organization and Imports)

MLlib Feature Encoding and Data Preparation

Feature Scaling and Selection

MLlib Model Training

Predicting

Serving and Persistence

Model Evaluation

Working with Spark ML

Spark ML Organization and Imports

Pipeline Stages

Explain Params

Data Encoding

Data Cleaning

Spark ML Models

Putting It All Together in a Pipeline

Training a Pipeline

Accessing Individual Stages

Data Persistence and Spark ML

Extending Spark ML Pipelines with Your Own Algorithms

Model and Pipeline Persistence and Serving with Spark ML

General Serving Considerations

Conclusion

10.Spark Components and Packages

Stream Processing with Spark

Sources and Sinks

Batch Intervals

Data Checkpoint Intervals

Considerations for DStreams

Considerations for Structured Streaming

High Availability Mode (or Handling Driver Failure or Checkpointing)

GraphX

Using Community Packages and Libraries

Creating a Spark Package

Conclusion

A.Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist

Index