Learning Spark (Reprint Edition, English)

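Authors: Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia
Publisher: Southeast University Press
Series:
Copyright notice: This book is in the public domain or licensed by the rights holder. Please support legitimate editions.
Tags: not available
ISBN: unknown
Publication date: not available
Binding: not available
Format: unknown
Pages: 0
Word count: not available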

About the Authors

No author biography is available for Learning Spark (Reprint Edition, English).

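Description

Data in every field keeps getting bigger. How can you work with it effectively? Learning Spark (Reprint Edition, English) introduces Apache Spark, an open source cluster computing system that makes data analytics fast to run. With Spark, you can process large datasets quickly through simple APIs in Python, Java, and Scala. Written by the developers of Spark, the book is aimed at data scientists and engineers and is meant to get them up and running quickly. You will learn how to express parallel jobs in just a few lines of code, and the book covers applications ranging from simple batch jobs to stream processing and machine learning.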

Table of Contents

Foreword

Preface

1. Introduction to Data Analysis with Spark

What Is Apache Spark?

A Unified Stack

Spark Core

Spark SQL

Spark Streaming

MLlib

GraphX

Cluster Managers

Who Uses Spark, and for What?

Data Science Tasks

Data Processing Applications

A Brief History of Spark

Spark Versions and Releases

Storage Layers for Spark

2. Downloading Spark and Getting Started

Downloading Spark

Introduction to Spark's Python and Scala Shells

Introduction to Core Spark Concepts

Standalone Applications

Initializing a SparkContext

Building Standalone Applications

Conclusion

3. Programming with RDDs

RDD Basics

Creating RDDs

RDD Operations

Transformations

Actions

Lazy Evaluation

Passing Functions to Spark

Python

Scala

Java

Common Transformations and Actions

Basic RDDs

Converting Between RDD Types

Persistence (Caching)

Conclusion

4. Working with Key/Value Pairs

Motivation

Creating Pair RDDs

Transformations on Pair RDDs

Aggregations

Grouping Data

Joins

Sorting Data

Actions Available on Pair RDDs

Data Partitioning (Advanced)

Determining an RDD's Partitioner

Operations That Benefit from Partitioning

Operations That Affect Partitioning

Example: PageRank

Custom Partitioners

Conclusion

5. Loading and Saving Your Data

Motivation

File Formats

Text Files

JSON

Comma-Separated Values and Tab-Separated Values

SequenceFiles

Object Files

Hadoop Input and Output Formats

File Compression

Filesystems

Local/“Regular” FS

Amazon S3

HDFS

Structured Data with Spark SQL

Apache Hive

JSON

Databases

Java Database Connectivity

Cassandra

HBase

Elasticsearch

Conclusion

6. Advanced Spark Programming

Introduction

Accumulators

Accumulators and Fault Tolerance

Custom Accumulators

Broadcast Variables

Optimizing Broadcasts

Working on a Per-Partition Basis

Piping to External Programs

Numeric RDD Operations

Conclusion

7. Running on a Cluster

Introduction

Spark Runtime Architecture

The Driver

Executors

Cluster Manager

Launching a Program

Summary

Deploying Applications with spark-submit

Packaging Your Code and Dependencies

A Java Spark Application Built with Maven

A Scala Spark Application Built with sbt

Dependency Conflicts

Scheduling Within and Between Spark Applications

Cluster Managers

Standalone Cluster Manager

Hadoop YARN

Apache Mesos

Amazon EC2

Which Cluster Manager to Use?

Conclusion

8. Tuning and Debugging Spark

Configuring Spark with SparkConf

Components of Execution: Jobs, Tasks, and Stages

Finding Information

Spark Web UI

Driver and Executor Logs

Key Performance Considerations

Level of Parallelism

Serialization Format

Memory Management

Hardware Provisioning

Conclusion

9. Spark SQL

Linking with Spark SQL

Using Spark SQL in Applications

Initializing Spark SQL

Basic Query Example

SchemaRDDs

Caching

Loading and Saving Data

Apache Hive

Parquet

JSON

From RDDs

JDBC/ODBC Server

Working with Beeline

Long-Lived Tables and Queries

User-Defined Functions

Spark SQL UDFs

Hive UDFs

Spark SQL Performance

Performance Tuning Options

Conclusion

10. Spark Streaming

A Simple Example

Architecture and Abstraction

Transformations

Stateless Transformations

Stateful Transformations

Output Operations

Input Sources

Core Sources

Additional Sources

Multiple Sources and Cluster Sizing

24/7 Operation

Checkpointing

Driver Fault Tolerance

Worker Fault Tolerance

Receiver Fault Tolerance

Processing Guarantees

Streaming UI

Performance Considerations

Batch and Window Sizes

Level of Parallelism

Garbage Collection and Memory Usage

Conclusion

11. Machine Learning with MLlib

Overview

System Requirements

Machine Learning Basics

Example: Spam Classification

Data Types

Working with Vectors

Algorithms

Feature Extraction

Statistics

Classification and Regression

Clustering

Collaborative Filtering and Recommendation

Dimensionality Reduction

Model Evaluation

Tips and Performance Considerations

Preparing Features

Configuring Algorithms

Caching RDDs to Reuse

Recognizing Sparsity

Level of Parallelism

Pipeline API

Conclusion

Index