Is it possible to limit resources assigned to a Spark session? - apache-spark

I'm launching pySpark sessions with the following code:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import *
spark = SparkSession.builder.getOrCreate()
I've noticed that if a notebook is running a pySpark query, and a second notebook tries to start a Spark session, the second Spark session will not start until the first one has finished (i.e. the first session is taking all the resources).
Is there some way to limit the resources of a Spark session or parallelize multiple sessions somehow?
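One common approach, assuming each notebook starts its own Spark application, is to cap what each application requests when its session is built, so the first notebook cannot grab every core and executor. Below is a minimal sketch; the values are illustrative, and which properties take effect depends on your cluster manager (standalone, YARN, Kubernetes).
from pyspark.sql import SparkSession

# Cap this notebook's application so a second session can still get resources.
spark = (
    SparkSession.builder
    .appName("notebook-1")
    .config("spark.cores.max", "4")             # total cores (standalone/Mesos)
    .config("spark.executor.instances", "2")    # executor count (YARN/Kubernetes)
    .config("spark.executor.cores", "2")        # cores per executor
    .config("spark.executor.memory", "2g")      # memory per executor
    .config("spark.dynamicAllocation.enabled", "true")  # give idle executors back
    .getOrCreate()
)
Note that these settings must be applied before the session is created; once a session already exists, getOrCreate() returns it as-is.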

Related

Error while instantiating 'org.apache.spark.sql.hive.HiveACLSessionStateBuilder'

The problem is that the pyspark script runs fine in one cluster, but the error occurs when I run the same script on another YARN cluster. I guess the Spark environment configurations differ between the two clusters.
Here is the code to initialize the spark session.
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
from pyspark.sql.session import SparkSession
sc = SparkContext()
spark = SparkSession(sc)
hive_context = HiveContext(sc)
The error:
Error while instantiating 'org.apache.spark.sql.hive.HiveACLSessionStateBuilder'

PySpark session builder doesn't start

I have a problem regarding PySpark in a Jupyter notebook. I installed Java and Spark, added the path variables, and didn't get an error. However, when I call the builder it just keeps running and the session never starts. I waited more than 30 minutes, but it kept running. Code like below:
import pyspark
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('Practise').getOrCreate()

How to run parallel threads in AWS Glue PySpark?

I have a Spark job that just pulls data from multiple tables with the same transforms. Basically a for loop that iterates across a list of tables, queries the catalog table, adds a timestamp, then shoves it into Redshift (example below).
This job takes around 30 minutes to complete. Is there a way to run these in parallel under the same Spark/Glue context? I don't want to create separate Glue jobs if I can avoid it.
import datetime
import os
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql.functions import *

# query the runtime arguments
args = getResolvedOptions(
    sys.argv,
    ["JOB_NAME", "redshift_catalog_connection", "target_database", "target_schema"],
)

# build the job session and context
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# set the job execution timestamp
job_execution_timestamp = datetime.datetime.utcnow()

# list of catalog tables to load (left empty here)
tables = []

for table in tables:
    # target table in Redshift follows the catalog table name
    redshift_table_name = table
    catalog_table = glueContext.create_dynamic_frame.from_catalog(
        database="test", table_name=table, transformation_ctx=table
    )
    data_set = catalog_table.toDF().withColumn(
        "batchLoadTimestamp", lit(job_execution_timestamp)
    )
    # convert back to a glue dynamic frame
    export_frame = DynamicFrame.fromDF(data_set, glueContext, "export_frame")
    # remove null fields from the dynamic frame
    non_null_records = DropNullFields.apply(
        frame=export_frame, transformation_ctx="non_null_records"
    )
    temp_dir = os.path.join(args["TempDir"], redshift_table_name)
    stores_redshiftSink = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=non_null_records,
        catalog_connection=args["redshift_catalog_connection"],
        connection_options={
            "dbtable": f"{args['target_schema']}.{redshift_table_name}",
            "database": args["target_database"],
            "preactions": f"truncate table {args['target_schema']}.{redshift_table_name};",
        },
        redshift_tmp_dir=temp_dir,
        transformation_ctx="stores_redshiftSink",
    )
You can do the following things to make this process faster:
Enable concurrent execution of the job.
Allot a sufficient number of DPUs.
Pass the list of tables as a job parameter (see the sketch after this list).
Execute the job in parallel using Glue workflows or Step Functions.
Now suppose you have 100 tables to ingest; you can divide the list into batches of 10 tables each and run the job concurrently 10 times.
Since your data will be loaded in parallel, the Glue job run time will decrease and you will incur less cost.
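A minimal sketch of passing the table batch as a job parameter, assuming a custom --table_list argument that carries a comma-separated batch of table names (the argument name is illustrative, not part of the original job):
import sys
from awsglue.utils import getResolvedOptions

# "table_list" is a hypothetical custom job argument, e.g.
#   --table_list "orders,customers,shipments"
args = getResolvedOptions(sys.argv, ["JOB_NAME", "table_list"])
tables = [t.strip() for t in args["table_list"].split(",") if t.strip()]

# each concurrent run of the job receives its own batch, so 100 tables
# split into 10 batches can be loaded by 10 parallel runs of the same job
for table in tables:
    ...  # same per-table load as in the question's loop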
An alternate approach that will be way faster is to use the Redshift COPY command directly:
Create the table in Redshift and set the default of the batchLoadTimestamp column to current_timestamp.
Now create the COPY command and load data into the table directly from S3.
Run the COPY command from a Glue Python shell job using pg8000 (a minimal sketch follows).
Why will this approach be faster? Because the Spark Redshift JDBC connector first unloads the Spark dataframe to S3 and then issues a COPY command against the Redshift table. By running the COPY command directly you remove the overhead of the unload step and of reading the data into a Spark dataframe first.
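A minimal sketch of the COPY-based load from a Glue Python shell job, assuming the data already sits in S3 as Parquet; the host, bucket, IAM role, and credentials below are placeholders:
import pg8000

# connection details are placeholders; in practice read them from the
# Glue connection or from AWS Secrets Manager
conn = pg8000.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    database="analytics",
    user="etl_user",
    password="...",
)
cur = conn.cursor()

# truncate-and-load, mirroring the preactions used in the Spark job
cur.execute("TRUNCATE TABLE target_schema.orders;")
cur.execute("""
    COPY target_schema.orders
    FROM 's3://my-bucket/exports/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
""")
conn.commit()
conn.close()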

How to restart pyspark streaming query from checkpoint data?

I am creating a spark streaming application using pyspark 2.2.0
I am able to create a streaming query
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StreamingApp") \
    .getOrCreate()

staticDataFrame = spark.read.format("parquet") \
    .option("inferSchema", "true").load("processed/Nov18/")
staticSchema = staticDataFrame.schema

streamingDataFrame = spark.readStream \
    .schema(staticSchema) \
    .option("maxFilesPerTrigger", 1) \
    .format("parquet") \
    .load("processed/Nov18/")

# note: an aggregation in append output mode needs a watermark on an
# event-time column; without one the query fails to start
daily_trs = streamingDataFrame.select("shift", "date", "time") \
    .groupBy("date", "shift") \
    .count()

writer = daily_trs.writeStream \
    .format("parquet") \
    .option("path", "data") \
    .option("checkpointLocation", "data/checkpoints") \
    .queryName("streamingData") \
    .outputMode("append")

query = writer.start()
query.awaitTermination()
The query is streaming, and any additional file added to "processed/Nov18" will be processed and stored in "data/".
If the streaming query fails, I want to restart the same query.
Path to solution
According to official documentation I can get an id that can be used to restart the query
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html?highlight=streamingquery#pyspark.sql.streaming.StreamingQuery.id
The pyspark.streaming module contains the StreamingContext class, which has the classmethod
classmethod getActiveOrCreate(checkpointPath, setupFunc)
https://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext.getOrCreate
Can these methods be used somehow?
Does anyone have a production-ready streaming app as a reference use case?
You should simply (re)start the pyspark application with the checkpoint directory available, and Spark Structured Streaming does the rest. No changes required.
Does anyone have a production-ready streaming app as a reference use case?
I'd ask on the Spark users mailing list.
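To make the checkpoint-based recovery concrete, here is a minimal sketch of the restart, reusing the names from the question; the aggregation is simplified to a plain projection so that append mode runs without a watermark:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingApp").getOrCreate()

# the static read is only used to capture the schema, as in the question
staticSchema = spark.read.format("parquet").load("processed/Nov18/").schema

streamingDataFrame = spark.readStream \
    .schema(staticSchema) \
    .option("maxFilesPerTrigger", 1) \
    .format("parquet") \
    .load("processed/Nov18/")

daily_trs = streamingDataFrame.select("shift", "date", "time")

# Because "data/checkpoints" already holds the offsets of the previous run,
# start() resumes the query where it left off instead of starting over.
query = daily_trs.writeStream \
    .format("parquet") \
    .option("path", "data") \
    .option("checkpointLocation", "data/checkpoints") \
    .queryName("streamingData") \
    .outputMode("append") \
    .start()
query.awaitTermination()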

How to enable streaming from Cassandra to Spark?

I have the following spark job:
from __future__ import print_function
import os
import sys
import time
from random import random
from operator import add
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark_cassandra import streaming, CassandraSparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("PySpark Cassandra Test")
    sc = CassandraSparkContext(conf=conf)
    stream = StreamingContext(sc, 2)
    rdd = sc.cassandraTable("keyspace2", "users").collect()
    # print(rdd)
    stream.start()
    stream.awaitTermination()
    sc.stop()
When I run this, it gives me the following error:
ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
The shell command I run:
./bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.2.4 examples/src/main/python/test/reading-cassandra.py
Comparing this with Spark Streaming from Kafka, I am missing this kind of line in the code above:
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", {'topic':1})
With Kafka I would actually use createStream, but for Cassandra I can't see anything like this in the docs. How do I start the streaming between Spark Streaming and Cassandra?
Versions:
Cassandra v2.1.12
Spark v1.4.1
Scala 2.10
To create a DStream out of a Cassandra table, you can use a ConstantInputDStream, providing the RDD created from the Cassandra table as input. This will result in the RDD being materialized on each DStream batch interval.
Be warned that large tables, or tables that continuously grow in size, will negatively impact the performance of your streaming job.
See also: "Reading from Cassandra using Spark Streaming" for an example.
