Download Spark: The Definitive Guide: Big Data Processing Made Simple PDF

Title: Spark: The Definitive Guide: Big Data Processing Made Simple
File Size: 7.9 MB
Total Pages: 601
Table of Contents
Preface
	About the Authors
	Who This Book Is For
	Conventions Used in This Book
	Using Code Examples
	O’Reilly Safari
	How to Contact Us
	Acknowledgments
I. Gentle Overview of Big Data and Spark
	1. What Is Apache Spark?
		Apache Spark’s Philosophy
		Context: The Big Data Problem
		History of Spark
		The Present and Future of Spark
		Running Spark
			Downloading Spark Locally
			Launching Spark’s Interactive Consoles
			Running Spark in the Cloud
			Data Used in This Book
	2. A Gentle Introduction to Spark
		Spark’s Basic Architecture
			Spark Applications
		Spark’s Language APIs
		Spark’s APIs
		Starting Spark
		The SparkSession
		DataFrames
			Partitions
		Transformations
			Lazy Evaluation
		Actions
		Spark UI
		An End-to-End Example
			DataFrames and SQL
		Conclusion
	3. A Tour of Spark’s Toolset
		Running Production Applications
		Datasets: Type-Safe Structured APIs
		Structured Streaming
		Machine Learning and Advanced Analytics
		Lower-Level APIs
		SparkR
		Spark’s Ecosystem and Packages
		Conclusion
II. Structured APIs—DataFrames, SQL, and Datasets
	4. Structured API Overview
		DataFrames and Datasets
		Schemas
		Overview of Structured Spark Types
			DataFrames Versus Datasets
			Columns
			Rows
			Spark Types
		Overview of Structured API Execution
			Logical Planning
			Physical Planning
			Execution
		Conclusion
	5. Basic Structured Operations
		Schemas
		Columns and Expressions
			Columns
			Expressions
		Records and Rows
			Creating Rows
		DataFrame Transformations
			Creating DataFrames
			select and selectExpr
			Converting to Spark Types (Literals)
			Adding Columns
			Renaming Columns
			Reserved Characters and Keywords
			Case Sensitivity
			Removing Columns
			Changing a Column’s Type (cast)
			Filtering Rows
			Getting Unique Rows
			Random Samples
			Random Splits
			Concatenating and Appending Rows (Union)
			Sorting Rows
			Limit
			Repartition and Coalesce
			Collecting Rows to the Driver
		Conclusion
	6. Working with Different Types of Data
		Where to Look for APIs
		Converting to Spark Types
		Working with Booleans
		Working with Numbers
		Working with Strings
			Regular Expressions
		Working with Dates and Timestamps
		Working with Nulls in Data
			Coalesce
			ifnull, nullIf, nvl, and nvl2
			drop
			fill
			replace
		Ordering
		Working with Complex Types
			Structs
			Arrays
			split
			Array Length
			array_contains
			explode
			Maps
		Working with JSON
		User-Defined Functions
		Conclusion
	7. Aggregations
		Aggregation Functions
			count
			countDistinct
			approx_count_distinct
			first and last
			min and max
			sum
			sumDistinct
			avg
			Variance and Standard Deviation
			skewness and kurtosis
			Covariance and Correlation
			Aggregating to Complex Types
		Grouping
			Grouping with Expressions
			Grouping with Maps
		Window Functions
		Grouping Sets
			Rollups
			Cube
			Grouping Metadata
			Pivot
		User-Defined Aggregation Functions
		Conclusion
	8. Joins
		Join Expressions
		Join Types
		Inner Joins
		Outer Joins
		Left Outer Joins
		Right Outer Joins
		Left Semi Joins
		Left Anti Joins
		Natural Joins
		Cross (Cartesian) Joins
		Challenges When Using Joins
			Joins on Complex Types
			Handling Duplicate Column Names
		How Spark Performs Joins
			Communication Strategies
		Conclusion
	9. Data Sources
		The Structure of the Data Sources API
			Read API Structure
			Basics of Reading Data
			Write API Structure
			Basics of Writing Data
		CSV Files
			CSV Options
			Reading CSV Files
			Writing CSV Files
		JSON Files
			JSON Options
			Reading JSON Files
			Writing JSON Files
		Parquet Files
			Reading Parquet Files
			Writing Parquet Files
		ORC Files
			Reading ORC Files
			Writing ORC Files
		SQL Databases
			Reading from SQL Databases
			Query Pushdown
			Writing to SQL Databases
		Text Files
			Reading Text Files
			Writing Text Files
		Advanced I/O Concepts
			Splittable File Types and Compression
			Reading Data in Parallel
			Writing Data in Parallel
			Writing Complex Types
			Managing File Size
		Conclusion
	10. Spark SQL
		What Is SQL?
		Big Data and SQL: Apache Hive
		Big Data and SQL: Spark SQL
			Spark’s Relationship to Hive
		How to Run Spark SQL Queries
			Spark SQL CLI
			Spark’s Programmatic SQL Interface
			SparkSQL Thrift JDBC/ODBC Server
		Catalog
		Tables
			Spark-Managed Tables
			Creating Tables
			Creating External Tables
			Inserting into Tables
			Describing Table Metadata
			Refreshing Table Metadata
			Dropping Tables
			Caching Tables
		Views
			Creating Views
			Dropping Views
		Databases
			Creating Databases
			Setting the Database
			Dropping Databases
		Select Statements
			case…when…then Statements
		Advanced Topics
			Complex Types
			Functions
			Subqueries
		Miscellaneous Features
			Configurations
			Setting Configuration Values in SQL
		Conclusion
	11. Datasets
		When to Use Datasets
		Creating Datasets
			In Java: Encoders
			In Scala: Case Classes
		Actions
		Transformations
			Filtering
			Mapping
		Joins
		Grouping and Aggregations
		Conclusion
III. Low-Level APIs
	12. Resilient Distributed Datasets (RDDs)
		What Are the Low-Level APIs?
			When to Use the Low-Level APIs?
			How to Use the Low-Level APIs?
		About RDDs
			Types of RDDs
			When to Use RDDs?
			Datasets and RDDs of Case Classes
		Creating RDDs
			Interoperating Between DataFrames, Datasets, and RDDs
			From a Local Collection
			From Data Sources
		Manipulating RDDs
		Transformations
			distinct
			filter
			map
			sort
			Random Splits
		Actions
			reduce
			count
			first
			max and min
			take
		Saving Files
			saveAsTextFile
			SequenceFiles
			Hadoop Files
		Caching
		Checkpointing
		Pipe RDDs to System Commands
			mapPartitions
			foreachPartition
			glom
		Conclusion
	13. Advanced RDDs
		Key-Value Basics (Key-Value RDDs)
			keyBy
			Mapping over Values
			Extracting Keys and Values
			lookup
			sampleByKey
		Aggregations
			countByKey
			Understanding Aggregation Implementations
			Other Aggregation Methods
		CoGroups
		Joins
			Inner Join
			zips
		Controlling Partitions
			coalesce
			repartition
			repartitionAndSortWithinPartitions
			Custom Partitioning
		Custom Serialization
		Conclusion
	14. Distributed Shared Variables
		Broadcast Variables
		Accumulators
			Basic Example
			Custom Accumulators
		Conclusion
IV. Production Applications
	15. How Spark Runs on a Cluster
		The Architecture of a Spark Application
			Execution Modes
		The Life Cycle of a Spark Application (Outside Spark)
			Client Request
			Launch
			Execution
			Completion
		The Life Cycle of a Spark Application (Inside Spark)
			The SparkSession
			Logical Instructions
			A Spark Job
			Stages
			Tasks
		Execution Details
			Pipelining
			Shuffle Persistence
		Conclusion
	16. Developing Spark Applications
		Writing Spark Applications
			A Simple Scala-Based App
			Writing Python Applications
			Writing Java Applications
		Testing Spark Applications
			Strategic Principles
			Tactical Takeaways
			Connecting to Unit Testing Frameworks
			Connecting to Data Sources
		The Development Process
		Launching Applications
			Application Launch Examples
		Configuring Applications
			The SparkConf
			Application Properties
			Runtime Properties
			Execution Properties
			Configuring Memory Management
			Configuring Shuffle Behavior
			Environmental Variables
			Job Scheduling Within an Application
		Conclusion
	17. Deploying Spark
		Where to Deploy Your Cluster to Run Spark Applications
			On-Premises Cluster Deployments
			Spark in the Cloud
		Cluster Managers
			Standalone Mode
			Spark on YARN
			Configuring Spark on YARN Applications
			Spark on Mesos
			Secure Deployment Configurations
			Cluster Networking Configurations
			Application Scheduling
		Miscellaneous Considerations
		Conclusion
	18. Monitoring and Debugging
		The Monitoring Landscape
		What to Monitor
			Driver and Executor Processes
			Queries, Jobs, Stages, and Tasks
		Spark Logs
		The Spark UI
			Spark REST API
			Spark UI History Server
		Debugging and Spark First Aid
			Spark Jobs Not Starting
			Errors Before Execution
			Errors During Execution
			Slow Tasks or Stragglers
			Slow Aggregations
			Slow Joins
			Slow Reads and Writes
			Driver OutOfMemoryError or Driver Unresponsive
			Executor OutOfMemoryError or Executor Unresponsive
			Unexpected Nulls in Results
			No Space Left on Disk Errors
			Serialization Errors
		Conclusion
	19. Performance Tuning
		Indirect Performance Enhancements
			Design Choices
			Object Serialization in RDDs
			Cluster Configurations
			Scheduling
			Data at Rest
			Shuffle Configurations
			Memory Pressure and Garbage Collection
		Direct Performance Enhancements
			Parallelism
			Improved Filtering
			Repartitioning and Coalescing
			User-Defined Functions (UDFs)
			Temporary Data Storage (Caching)
			Joins
			Aggregations
			Broadcast Variables
		Conclusion
V. Streaming
	20. Stream Processing Fundamentals
		What Is Stream Processing?
			Stream Processing Use Cases
			Advantages of Stream Processing
			Challenges of Stream Processing
		Stream Processing Design Points
			Record-at-a-Time Versus Declarative APIs
			Event Time Versus Processing Time
			Continuous Versus Micro-Batch Execution
		Spark’s Streaming APIs
			The DStream API
			Structured Streaming
		Conclusion
	21. Structured Streaming Basics
		Structured Streaming Basics
		Core Concepts
			Transformations and Actions
			Input Sources
			Sinks
			Output Modes
			Triggers
			Event-Time Processing
		Structured Streaming in Action
		Transformations on Streams
			Selections and Filtering
			Aggregations
			Joins
		Input and Output
			Where Data Is Read and Written (Sources and Sinks)
			Reading from the Kafka Source
			Writing to the Kafka Sink
			How Data Is Output (Output Modes)
			When Data Is Output (Triggers)
		Streaming Dataset API
		Conclusion
	22. Event-Time and Stateful Processing
		Event Time
		Stateful Processing
		Arbitrary Stateful Processing
		Event-Time Basics
		Windows on Event Time
			Tumbling Windows
			Handling Late Data with Watermarks
		Dropping Duplicates in a Stream
		Arbitrary Stateful Processing
			Time-Outs
			Output Modes
			mapGroupsWithState
			flatMapGroupsWithState
		Conclusion
	23. Structured Streaming in Production
		Fault Tolerance and Checkpointing
		Updating Your Application
			Updating Your Streaming Application Code
			Updating Your Spark Version
			Sizing and Rescaling Your Application
		Metrics and Monitoring
			Query Status
			Recent Progress
			Spark UI
		Alerting
		Advanced Monitoring with the Streaming Listener
		Conclusion
VI. Advanced Analytics and Machine Learning
	24. Advanced Analytics and Machine Learning Overview
		A Short Primer on Advanced Analytics
			Supervised Learning
			Recommendation
			Unsupervised Learning
			Graph Analytics
			The Advanced Analytics Process
		Spark’s Advanced Analytics Toolkit
			What Is MLlib?
		High-Level MLlib Concepts
		MLlib in Action
			Feature Engineering with Transformers
			Estimators
			Pipelining Our Workflow
			Training and Evaluation
			Persisting and Applying Models
		Deployment Patterns
		Conclusion
	25. Preprocessing and Feature Engineering
		Formatting Models According to Your Use Case
		Transformers
		Estimators for Preprocessing
			Transformer Properties
		High-Level Transformers
			RFormula
			SQL Transformers
			VectorAssembler
		Working with Continuous Features
			Bucketing
			Scaling and Normalization
			StandardScaler
		Working with Categorical Features
			StringIndexer
			Converting Indexed Values Back to Text
			Indexing in Vectors
			One-Hot Encoding
		Text Data Transformers
			Tokenizing Text
			Removing Common Words
			Creating Word Combinations
			Converting Words into Numerical Representations
			Word2Vec
		Feature Manipulation
			PCA
			Interaction
			Polynomial Expansion
		Feature Selection
			ChiSqSelector
		Advanced Topics
			Persisting Transformers
		Writing a Custom Transformer
		Conclusion
	26. Classification
		Use Cases
		Types of Classification
			Binary Classification
			Multiclass Classification
			Multilabel Classification
		Classification Models in MLlib
			Model Scalability
		Logistic Regression
			Model Hyperparameters
			Training Parameters
			Prediction Parameters
			Example
			Model Summary
		Decision Trees
			Model Hyperparameters
			Training Parameters
			Prediction Parameters
		Random Forest and Gradient-Boosted Trees
			Model Hyperparameters
			Training Parameters
			Prediction Parameters
		Naive Bayes
			Model Hyperparameters
			Training Parameters
			Prediction Parameters
		Evaluators for Classification and Automating Model Tuning
		Detailed Evaluation Metrics
		One-vs-Rest Classifier
		Multilayer Perceptron
		Conclusion
	27. Regression
		Use Cases
		Regression Models in MLlib
			Model Scalability
		Linear Regression
			Model Hyperparameters
			Training Parameters
			Example
			Training Summary
		Generalized Linear Regression
			Model Hyperparameters
			Training Parameters
			Prediction Parameters
			Example
			Training Summary
		Decision Trees
			Model Hyperparameters
			Training Parameters
			Example
		Random Forests and Gradient-Boosted Trees
			Model Hyperparameters
			Training Parameters
			Example
		Advanced Methods
			Survival Regression (Accelerated Failure Time)
			Isotonic Regression
		Evaluators and Automating Model Tuning
		Metrics
		Conclusion
	28. Recommendation
		Use Cases
		Collaborative Filtering with Alternating Least Squares
			Model Hyperparameters
			Training Parameters
			Prediction Parameters
			Example
		Evaluators for Recommendation
		Metrics
			Regression Metrics
			Ranking Metrics
		Frequent Pattern Mining
		Conclusion
	29. Unsupervised Learning
		Use Cases
		Model Scalability
		k-means
			Model Hyperparameters
			Training Parameters
			Example
			k-means Metrics Summary
		Bisecting k-means
			Model Hyperparameters
			Training Parameters
			Example
			Bisecting k-means Summary
		Gaussian Mixture Models
			Model Hyperparameters
			Training Parameters
			Example
			Gaussian Mixture Model Summary
		Latent Dirichlet Allocation
			Model Hyperparameters
			Training Parameters
			Prediction Parameters
			Example
		Conclusion
	30. Graph Analytics
		Building a Graph
		Querying the Graph
			Subgraphs
		Motif Finding
		Graph Algorithms
			PageRank
			In-Degree and Out-Degree Metrics
			Breadth-First Search
			Connected Components
			Strongly Connected Components
			Advanced Tasks
		Conclusion
	31. Deep Learning
		What Is Deep Learning?
		Ways of Using Deep Learning in Spark
		Deep Learning Libraries
			MLlib Neural Network Support
			TensorFrames
			BigDL
			TensorFlowOnSpark
			DeepLearning4J
			Deep Learning Pipelines
		A Simple Example with Deep Learning Pipelines
			Setup
			Images and DataFrames
			Transfer Learning
			Applying Popular Models
		Conclusion
VII. Ecosystem
	32. Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
		PySpark
			Fundamental PySpark Differences
			Pandas Integration
		R on Spark
			SparkR
			sparklyr
		Conclusion
	33. Ecosystem and Community
		Spark Packages
			An Abridged List of Popular Packages
			Using Spark Packages
			External Packages
		Community
			Spark Summit
			Local Meetups
		Conclusion
Index
                        
