How to restart pyspark streaming query from checkpoint data? - apache-spark

I am creating a Spark streaming application using PySpark 2.2.0.
I am able to create a streaming query:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession \
.builder \
.appName("StreamingApp") \
.getOrCreate()
staticDataFrame = spark.read.format("parquet")\
.option("inferSchema","true").load("processed/Nov18/")
staticSchema = staticDataFrame.schema
streamingDataFrame = spark.readStream\
.schema(staticSchema)\
.option("maxFilesPerTrigger",1)\
.format("parquet")\
.load("processed/Nov18/")
daily_trs=streamingDataFrame.select("shift","date","time")\
.groupBy("date","shift")\
.count()
writer = daily_trs.writeStream\
.format("parquet")\
.option("path","data")\
.option("checkpointLocation","data/checkpoints")\
.queryName("streamingData")\
.outputMode("append")
query = writer.start()
query.awaitTermination()
The query is streaming, and any additional file added to "processed/Nov18" will be processed and stored to "data/".
If the streaming job fails, I want to restart the same query.
Path to solution
According to the official documentation, I can get an id that can be used to restart the query:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html?highlight=streamingquery#pyspark.sql.streaming.StreamingQuery.id
The pyspark.streaming module contains the StreamingContext class, which has the classmethod:
classmethod getActiveOrCreate(checkpointPath, setupFunc)
https://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext.getOrCreate
Can these methods be used here somehow?
Does anyone have a production-ready streaming app use case for reference?

You should simply (re)start the pyspark application with the checkpoint directory available and Spark Structured Streaming does the rest. No changes required.
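For illustration, a minimal sketch of what that restart looks like, assuming the same paths and query as in the question: rerunning the script with the existing checkpoint directory lets Spark recover offsets and progress from "data/checkpoints", with no code changes.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StreamingApp").getOrCreate()
# Reuse the schema of the already-processed data, as in the question.
staticSchema = spark.read.format("parquet").load("processed/Nov18/").schema
streamingDataFrame = spark.readStream\
.schema(staticSchema)\
.option("maxFilesPerTrigger", 1)\
.format("parquet")\
.load("processed/Nov18/")
# Same sink, same checkpointLocation, same queryName: on restart the query
# resumes from where the previous run stopped.
query = streamingDataFrame.select("shift", "date", "time").writeStream\
.format("parquet")\
.option("path", "data")\
.option("checkpointLocation", "data/checkpoints")\
.queryName("streamingData")\
.outputMode("append")\
.start()
query.awaitTermination()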
Does anyone have a production-ready streaming app use case for reference?
I'd ask on the Spark users mailing list.

Related

Why SparkSession class of pyspark is in pyspark.sql not pyspark?

The official Spark documentation, under "Starting Point: SparkSession", states: "The entry point into all functionality in Spark is the SparkSession class."
So I'm wondering why, in PySpark, SparkSession is imported from pyspark.sql rather than from pyspark itself. My reasoning is that since SparkSession is the entry point to all functionality in Spark (Spark SQL, Spark Streaming, MLlib, GraphX, etc.), wouldn't it make more sense to import SparkSession from pyspark rather than pyspark.sql?
Primarily because the top-level pyspark package covers Spark Core, the RDD-based APIs that have existed in Spark from the beginning, while SparkSession (originally SQLContext) was added later as part of Spark SQL (original announcement).
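For illustration, a minimal sketch of the two entry points side by side:
# Spark Core (the RDD API) lives in the top-level pyspark package...
from pyspark import SparkContext
# ...while SparkSession, the Spark SQL entry point, lives in pyspark.sql.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("entry-points").getOrCreate()
sc = spark.sparkContext  # the underlying SparkContext is still reachable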

Why do I see two jobs in Spark UI for a single read?

I am trying to run the script below to load a file with 24k records. Is there any reason why I am seeing two jobs for a single load in the Spark UI?
code
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("DM")\
.getOrCreate()
trades_df = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load("s3://bucket/source.csv")
trades_df.rdd.getNumPartitions() returns 1
[Spark UI screenshot showing two jobs for the single read]
That's because Spark reads the CSV file twice when inferSchema is enabled: one job scans the file to infer the schema, and the second performs the actual load.
Read the comments for the function def csv(csvDataset: Dataset[String]): DataFrame in Spark's GitHub repo here.
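If the extra job matters, a common workaround is to supply the schema explicitly instead of inferring it. A minimal sketch, assuming the same bucket path as in the question and hypothetical column names:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
spark = SparkSession.builder.appName("DM").getOrCreate()
# Hypothetical columns for illustration; replace with the real ones.
schema = StructType([
    StructField("trade_id", StringType(), True),
    StructField("price", DoubleType(), True),
])
# With an explicit schema, Spark skips the schema-inference pass over the
# data, so the extra job for inference is avoided.
trades_df = spark.read.format("csv")\
    .option("header", "true")\
    .schema(schema)\
    .load("s3://bucket/source.csv")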

How to execute SQL scripts with Spark

I want to create a database in Spark, and for this purpose, I have written a few SQL scripts which create the SQL tables.
My question is: how do I integrate the SQL tables (the database) into Spark for later processing?
Could that be done using a Scala script or through the Spark console?
Thank you.
Using Scala:
import scala.io.Source
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
.appName("execute-query-files")
.master("local[*]") //since the jar will be executed locally
.getOrCreate()
val sqlQuery = Source.fromFile("path/to/data.sql").mkString //read file
spark.sql(sqlQuery) //execute query
where spark is your already-created Spark session.
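Note that spark.sql expects a single statement at a time, so if a script file contains several statements you need to split it first. The same idea as a rough PySpark sketch, assuming the statements in the file are separated by semicolons:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("execute-query-files").getOrCreate()
with open("path/to/data.sql") as f:
    script = f.read()
# Naive split on ";": fine for simple DDL scripts, but it will break on
# statements that contain semicolons inside string literals.
for statement in script.split(";"):
    if statement.strip():
        spark.sql(statement)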

What to set Spark Master address to when deploying on Kubernetes Spark Operator?

The official Spark documentation only covers the spark-submit method for deploying code to a Spark cluster. It mentions we must prefix the Kubernetes API server address with k8s://. What should we do when deploying through the Spark Operator?
For instance, if I have a basic PySpark application that starts up like this, how do I set the master:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sc = SparkContext("local", "Big data App")
spark = SQLContext(sc)
spark_conf = SparkConf().setMaster('local').setAppName('app_name')
Here I have local; if I were running on a non-Kubernetes cluster I would set the master address with the spark:// prefix, or use yarn. Must I also use the k8s:// prefix when deploying through the Spark Operator?
If not, what should be used for the master parameter?
It's better not to call setMaster in the code; instead, specify the master when submitting the application via spark-submit, something like this (see the documentation for details):
./bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
your_script.py
I haven't used the Spark Operator, but as I understand from its documentation, it sets the master automatically.
You also need to convert this code:
sc = SparkContext("local", "Big data App")
spark = SQLContext(sc)
spark_conf = SparkConf().setMaster('local').setAppName('app_name')
to the more modern form (see the docs):
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
as SQLContext is deprecated.
P.S. I recommend going through the first chapters of Learning Spark, 2nd edition, which is freely available from the Databricks site.

Why does spark-submit fail with "AnalysisException: kafka is not a valid Spark SQL Data Source"?

I use Spark 2.1.0 with Kafka 0.10.2.1.
I am writing a Spark application that reads datasets from a Kafka topic.
The code is as follows:
package com.example;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
public class MLP {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("MLP")
.getOrCreate();
Dataset<Row> df = spark
.read()
.format("kafka")
.option("kafka.bootstrap.servers","localhost:9092,localhost:9093")
.option("subscribe", "resultsTopic")
.load();
df.show();
spark.stop();
}
}
My deployment script is as follows:
spark-submit \
--verbose \
--jars $(echo /home/hduser1/spark/jars/*.jar | tr ' ' ',') \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.10 \
--class com.**** \
--master (Spark Master URL) /path/to/jar
However I get the error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
kafka is not a valid Spark SQL Data Source.;
I've tried using the same application with a non-Kafka data source and the dataframe is correctly created. I've also tried using yarn in client mode and I get the same error.
Kafka as a data source for non-streaming DataFrames/Datasets is only available from Spark 2.2 onwards; see the corresponding issue on the Spark JIRA.
As @JacekLaskowski mentioned, change the package to (Jacek's version modified to use 2.2.0):
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
What's more, use readStream to read a stream of data.
You cannot use show with streaming data sources; use the console format instead:
StreamingQuery query = df.writeStream()
.outputMode("append")
.format("console")
.start();
query.awaitTermination();
See this link
First of all, you should replace --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.10 (which I doubt works) with the following:
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1
I don't think version 2.10 was ever available; you probably meant 2.1.0, which would have worked had you written 2.1.0 (not 2.10).
Secondly, remove --jars $(echo /home/hduser1/spark/jars/*.jar | tr ' ' ','); Spark loads its own jars anyway, and the additional ones you need, like the Kafka source, are pulled in via --packages.
That should give you access to the kafka source format.
