All the hype around Apache Spark over the last 2 years gives rise to a simple question: What is Spark, and why use it?
Spark is an open source, scalable, massively parallel, in-memory execution environment for running analytics applications. Think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster.
Spark consists of a number of components:
- Spark Core: The foundation of Spark that provides distributed task dispatching, scheduling and basic I/O
- Spark Streaming: Analysis of real-time streaming data
- Spark Machine Learning Library (MLlib): A library of prebuilt analytics algorithms that can run in parallel across a Spark cluster on data loaded into memory
- Spark SQL + DataFrames: Spark SQL lets Java-, Python-, R- and Scala-based Spark analytics applications query structured data using either SQL or the DataFrame API, a distributed collection of data organized into named columns
- GraphX: A graph analysis engine and set of graph analytics algorithms running on Spark
- SparkR: The R programming language on Spark for executing custom analytics
Big data processing
Spark distributes data across a cluster and processes that data in parallel. Because it works in memory, it is much faster at processing data than MapReduce, which writes intermediate results to disk between processing stages.
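To make the idea of partitioning data and processing it in parallel concrete, here is a minimal sketch in plain Python. It does not use Spark: threads stand in for cluster nodes, and the helper names (`partition`, `partial_sum`, `parallel_sum`) are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records into num_partitions slices, assigning items round-robin."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def partial_sum(chunk):
    """Work done independently on one partition."""
    return sum(chunk)


def parallel_sum(records, num_partitions=4):
    chunks = partition(records, num_partitions)
    # Each partition gets its own worker; threads stand in for cluster nodes.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Combine the per-partition results into a single answer.
    return sum(partials)


total = parallel_sum(list(range(1, 101)))
print(total)
```

The pattern to notice is that each partition is processed without reference to the others, and only the small partial results are combined at the end.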
Spark also includes prebuilt machine-learning and graph analysis algorithms that are written specifically to execute in parallel and in memory. It also supports interactive SQL queries and real-time streaming analytics. You can write analytics applications that use these capabilities in programming languages such as Java, Python, R and Scala.
These applications execute in parallel on partitioned, in-memory data in Spark. They use Spark's prebuilt analytics algorithms to make predictions; identify patterns in data, such as in market basket analysis; and analyze networks (also known as graphs) to identify previously unknown relationships. You can also connect business intelligence (BI) tools to Spark and query in-memory data using SQL, with each query executed in parallel.
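As a rough illustration of the kind of pattern that market basket analysis looks for, the sketch below counts how often pairs of items appear in the same basket. It is plain Python rather than one of Spark's prebuilt algorithms, and the basket data and the `pair_counts` helper are made up for the example.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so ("bread", "milk") and ("milk", "bread") are one pair.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical shopping baskets, invented for this example.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

On a cluster, the same counting could be done per partition and the partial counts merged, which is why this kind of analysis suits parallel execution.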
Spark can run on Apache Hadoop clusters, on its own cluster or on cloud-based platforms, and it can access diverse data sources such as data in Hadoop Distributed File System (HDFS) files, Apache Cassandra, Apache HBase or Amazon S3 cloud-based storage.
Scalable analytics applications can be built on Spark to analyze live streaming data or data stored in HDFS, relational databases, NoSQL databases and cloud-based storage. Data from these sources can be partitioned and distributed across multiple machines and held in memory on each node in a Spark cluster. The distributed, partitioned, in-memory data is referred to as a Resilient Distributed Dataset (RDD).
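The notion of a partitioned, in-memory dataset can be sketched with a small toy class. This is not Spark's RDD implementation or API: `ToyPartitionedDataset` is a hypothetical name, each inner list stands in for the data held on one node, and the resilience aspect of RDDs is not modeled at all.

```python
from functools import reduce


class ToyPartitionedDataset:
    """Hypothetical stand-in for a partitioned in-memory dataset; not Spark's RDD."""

    def __init__(self, partitions):
        self.partitions = partitions  # one list per simulated node

    @classmethod
    def from_records(cls, records, num_partitions):
        parts = [[] for _ in range(num_partitions)]
        for index, record in enumerate(records):
            parts[index % num_partitions].append(record)
        return cls(parts)

    def map(self, fn):
        # Transform each partition independently; no data moves between partitions.
        return ToyPartitionedDataset([[fn(r) for r in part] for part in self.partitions])

    def filter(self, predicate):
        return ToyPartitionedDataset(
            [[r for r in part if predicate(r)] for part in self.partitions]
        )

    def reduce(self, fn):
        # Reduce inside each non-empty partition first, then merge the partial results.
        # fn is assumed to be associative and commutative.
        partials = [reduce(fn, part) for part in self.partitions if part]
        return reduce(fn, partials)


dataset = ToyPartitionedDataset.from_records(range(10), num_partitions=3)
result = (
    dataset.map(lambda x: x * x)
    .filter(lambda x: x % 2 == 0)
    .reduce(lambda a, b: a + b)
)
print(result)
```

The transformations run partition by partition, and only the final reduce step brings per-partition results together.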
A key Spark capability is that a single in-memory analytics application can combine different kinds of analytics. For example, you can read log data into memory, apply a schema to the data to describe its structure, access it using SQL, analyze it with predictive analytics algorithms and write the predictive results back to disk. The results can be written in a columnar file format for use and visualization by interactive query tools.
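The first steps of that example (read log data, apply a schema, query with SQL) can be sketched with Python's standard library. This is only an analogy: an in-memory SQLite database stands in for Spark's in-memory tables, the log layout and sample lines are invented, and the predictive and write-back steps are left out.

```python
import sqlite3

# Hypothetical log lines in a made-up "timestamp level latency_ms" layout.
LOG_LINES = [
    "2015-06-01T10:00:00 INFO 120",
    "2015-06-01T10:00:05 ERROR 950",
    "2015-06-01T10:00:09 INFO 80",
    "2015-06-01T10:00:12 ERROR 1100",
]


def parse(line):
    """Apply a schema: split a raw line into typed (ts, level, latency_ms) fields."""
    timestamp, level, latency = line.split()
    return timestamp, level, int(latency)


conn = sqlite3.connect(":memory:")  # in-memory database as a stand-in
conn.execute("CREATE TABLE logs (ts TEXT, level TEXT, latency_ms INTEGER)")
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", [parse(line) for line in LOG_LINES])

# Query the now-structured data with SQL.
rows = conn.execute(
    "SELECT level, COUNT(*), AVG(latency_ms) FROM logs GROUP BY level ORDER BY level"
).fetchall()
conn.close()

for level, count, avg_latency in rows:
    print(level, count, avg_latency)
```

Once the raw text has a schema, the same data can be queried declaratively, which is the step that lets SQL tools and analytics algorithms share one in-memory copy.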
IBM made a strategic commitment to using Spark in 2015.
Source: http://www.ibmbigdatahub.com/blog/what-spark by Mike Ferguson