Unable to create Scheduler Pools in EMR using PySpark - apache-spark

I am fairly new to the concept of Spark schedulers/pooling and need to implement it in one of my projects. To understand the concept better, I wrote the following streaming PySpark code and ran it on my local machine:
from pyspark.sql import SparkSession
import threading

def do_job(f1, f2):
    df1 = spark.readStream.json(f1)
    df2 = spark.readStream.json(f2)
    df = df1.join(df2, "id", "inner")
    df.writeStream.format("parquet").outputMode("append") \
        .option("checkpointLocation", "checkpoint" + str(f1) + "/") \
        .option("path", "Data/Sample_Delta_Data/date=A" + str(f1)) \
        .start()
    # outputs.append(df1.join(df2, "id", "inner").count())

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Demo") \
        .master("local[4]") \
        .config("spark.sql.autoBroadcastJoinThreshold", "50B") \
        .config("spark.scheduler.mode", "FAIR") \
        .config("spark.sql.streaming.schemaInference", "true") \
        .getOrCreate()
    file_prefix = "data_new/data/d"
    jobs = []
    outputs = []
    for i in range(0, 6):
        file1 = file_prefix + str(i + 1)
        file2 = file_prefix + str(i + 2)
        thread = threading.Thread(target=do_job, args=(file1, file2))
        jobs.append(thread)
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    spark.streams.awaitAnyTermination()
    # print(outputs)
As can be seen above, I am using the FAIR scheduler option and Python's threading library in PySpark to implement pooling.
As a matter of fact, the above code creates pools on my local system, but when I run the same code on an AWS EMR cluster, no pools are created.
Am I missing something specific to AWS EMR ?
Suggestions please!
Regards.

Why are you using threading in PySpark? Spark handles executor cores (its own threading) for you. I understand you are new to Spark but clearly not new to Python. It's possible I missed a subtlety here, as streaming isn't my wheelhouse as much as Spark in general is.
The above code will launch all the work on the driver, and I think you really should read over the Spark documentation to better understand how it handles parallelism. (You want to do the work on executors to really get the power you are looking for.)
With respect, this is how you do things in Python, not in PySpark/Spark code. You are likely seeing the difference between client and cluster mode, and that could account for the behavior. (It's a typical issue that occurs when coding locally vs. in the cluster.)
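For reference, here is a minimal sketch of how a thread is tied to a named FAIR pool. Spark's job-scheduling documentation describes `spark.scheduler.pool` as a thread-local property set through `setLocalProperty`; jobs submitted from a thread that never sets it land in the `default` pool, and the code in the question never sets it. The helper names (`pool_name`, `run_in_pool`, `start_all`) and the `pool_<n>` naming scheme are hypothetical. In PySpark the property only follows a Python thread when pinned-thread mode is active (`PYSPARK_PIN_THREAD=true`, which to my knowledge is the default from Spark 3.2), so that is an assumption to verify against the Spark version on the EMR cluster.

```python
import threading


def pool_name(index):
    # Hypothetical naming scheme: one pool per job index.
    return f"pool_{index}"


def run_in_pool(spark, index, job):
    """Run job() with this thread's Spark jobs assigned to a named pool."""
    # Thread-local property: it only affects jobs submitted from the calling thread.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name(index))
    return job()


def start_all(spark, jobs):
    """Start one thread per job, each in its own pool, and wait for all of them."""
    threads = [
        threading.Thread(target=run_in_pool, args=(spark, i, job))
        for i, job in enumerate(jobs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Each `job` would be a zero-argument callable, for example `lambda: do_job(file1, file2)` from the question.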

Related

closing pydeequ callback server

I'm using pydeequ with Spark 3.0.1 to perform some constraint checks on data.
As for testing with the VerificationSuite, after calling VerificationResult.checkResultsAsDataFrame(spark, result), it seems that the callback server which gets started by pydeequ does not get terminated automatically.
If I run code containing pydeequ on an EMR cluster for example, the port 25334 seems to stay open after the spark application closes, unless I explicitly create a JavaGateway with the spark session, and call the close() method.
import pydeequ  # needed for pydeequ.deequ_maven_coord below
from pydeequ.verification import *
from pyspark.sql import SparkSession, Row
from py4j.java_gateway import JavaGateway

spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())
df = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5),
    Row(a="bar", b=2, c=6),
    Row(a="baz", b=None, c=None)]).toDF()
check = Check(spark, CheckLevel.Warning, "Review Check")
checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasSize(lambda x: x < 3) \
        .hasMin("b", lambda x: x == 0) \
        .isComplete("c") \
        .isUnique("a") \
        .isContainedIn("a", ["foo", "bar", "baz"]) \
        .isNonNegative("b")) \
    .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)
# Workaround: without these two lines the callback server port stays open
a = JavaGateway(spark.sparkContext._gateway)
a.close()
If I don't implement the last 2 lines of code, the callback server stays open on the port.
Is there a way around this?
The PyDeequ GitHub README says to shut down the Spark session like this:
spark.sparkContext._gateway.shutdown_callback_server()
spark.stop()
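One way to make sure those two calls always run, even when a check raises, is to wrap the session in a context manager. This is a minimal sketch, not something PyDeequ provides: `deequ_session` is a hypothetical helper name, and it assumes `spark` is an already created SparkSession.

```python
from contextlib import contextmanager


@contextmanager
def deequ_session(spark):
    """Yield the session, then always close the callback server and stop Spark."""
    try:
        yield spark
    finally:
        # Same two calls as above, in the same order.
        spark.sparkContext._gateway.shutdown_callback_server()
        spark.stop()
```

It would be used as `with deequ_session(spark) as s:` around the VerificationSuite code.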

Why is Pandas-API-on-Spark's apply on groups way slower than the PySpark API?

I'm seeing strange performance results when comparing the two APIs in PySpark 3.2.1 that can run a pandas UDF on the grouped results of a Spark DataFrame:
df.groupBy().applyInPandas()
ps_df.groupby().apply() - a new way of apply introduced in Pandas-API-on-Spark AKA Koalas
First I run the following input generator code in local spark mode (Spark 3.2.1):
import pyspark.sql.types as types
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder \
    .config("spark.sql.execution.arrow.pyspark.enabled", True) \
    .getOrCreate()
ps.set_option("compute.default_index_type", "distributed")
spark.range(1000000).withColumn('group', (col('id') / 10).cast('int')) \
    .write.parquet('/tmp/sample_input', mode='overwrite')
Then I test the applyInPandas:
def getsum(pdf):
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = spark.read.parquet(f'/tmp/sample_input')
output_schema = types.StructType(
    df.schema.fields + [types.StructField('sum_in_group', types.FloatType())]
)
df.groupBy('group').applyInPandas(getsum, schema=output_schema) \
    .write.parquet('/tmp/schematest', mode='overwrite')
And the code executes in under 30 seconds (on an i7-9750H CPU).
Then I try the new API and, while I really appreciate how nice the code looks:
def getsum(pdf) -> ps.DataFrame["id": int, "group": int, "sum_in_group": int]:
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = ps.read_parquet(f'/tmp/sample_input')
df.groupby('group').apply(getsum) \
    .to_parquet('/tmp/schematest', mode='overwrite')
... every time the execution time is at least 1m 40s on the same CPU, so more than 3x slower for this simple operation.
I am aware that adding sum_in_group can be done far more efficiently with no pandas involvement, but this is just a minimal example. Any other operation is also at least 3 times slower.
Do you know what the reason for this slowdown could be? Maybe I'm missing some configuration parameter that would make these execute in a similar time?
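To compare the two variants under the same conditions, a small timing helper can wrap each write. This is only a sketch; `time_call` is a hypothetical name and it measures wall-clock time on the driver, so the first run may include session warm-up.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

Each variant would be passed as a zero-argument callable, for example a function that runs the `applyInPandas` pipeline through to the final `write.parquet` call.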

StructuredStreaming with ForEachWriter creating duplicates

Hi I'm trying to create a neo4j sink using pyspark and kafka, but for some reason this sink is creating duplicates in neo4j and I'm not sure why this is happening. I am expecting to get only one node, but it looks like it's creating 4. If someone has an idea, please let me know.
Kafka producer code:
from kafka import KafkaProducer
import json

producer = KafkaProducer(bootstrap_servers='10.0.0.38:9092')
message = {
    'test_1': 'test_1',
    'test_2': 'test_2'
}
producer.send('test_topic', json.dumps(message).encode('utf-8'))
producer.close()
Kafka consumer code:
from kafka import KafkaConsumer
import findspark
from py2neo import Graph
import json
findspark.init()
from pyspark.sql import SparkSession

class ForeachWriter:
    def open(self, partition_id, epoch_id):
        neo4j_uri = ''  # neo4j uri
        neo4j_auth = ('', '')  # neo4j user, password
        self.graph = Graph(neo4j_uri, auth=neo4j_auth)
        return True

    def process(self, msg):
        msg = json.loads(msg.value.decode('utf-8'))
        self.graph.run("CREATE (n: MESSAGE_RECEIVED) SET n.key = '" + str(msg).replace("'", '"') + "'")
        raise KeyError('received message: {}. finished creating node'.format(msg))

spark = SparkSession.builder.appName('test-consumer') \
    .config('spark.executor.instances', 1) \
    .getOrCreate()
ds1 = spark.readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', '10.0.0.38:9092') \
    .option('subscribe', 'test_topic') \
    .load()
query = ds1.writeStream.foreach(ForeachWriter()).start()
query.awaitTermination()
[Image: neo4j graph after running code]
After doing some searching, I found this passage in Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming (chapter 11, p. 151), which follows the description of open, process, and close for ForeachWriter:
This contract is part of the data delivery semantics because it allows us to remove duplicated partitions that might already have been sent to the sink but are reprocessed by Structured Streaming as part of a recovery scenario. For that mechanism to properly work, the sink must implement some persistent way to remember the partition/version combinations that it has already seen.
Separately, from the Spark website: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html (see the section on Foreach).
Note: Spark does not guarantee same output for (partitionId, epochId), so deduplication cannot be achieved with (partitionId, epochId). e.g. source provides different number of partitions for some reasons, Spark optimization changes number of partitions, etc. See SPARK-28650 for more details. If you need deduplication on output, try out foreachBatch instead.
It seems that if I keep using ForeachWriter I need to implement a uniqueness check myself, because Structured Streaming automatically reprocesses partitions after a failure; otherwise I have to switch to foreachBatch.
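A minimal sketch of the foreachBatch route, written so that a re-delivered message does not create a second node: it uses Cypher `MERGE` with a query parameter instead of `CREATE` with string concatenation. The names `MERGE_QUERY`, `message_key` and `write_batch` are hypothetical, `graph` is assumed to be a py2neo `Graph` created on the driver, and `collect()` assumes micro-batches small enough to fit in driver memory. One possible reading of the four nodes, offered as a guess: `process` raises `KeyError` after `CREATE` has already run, so the task fails and is retried, and `spark.task.maxFailures` defaults to 4.

```python
import json

# MERGE matches an existing node with this key or creates it once.
MERGE_QUERY = "MERGE (n:MESSAGE_RECEIVED {key: $key})"


def message_key(raw_value):
    """Canonical string key for a Kafka message value (UTF-8 encoded JSON)."""
    return json.dumps(json.loads(raw_value.decode("utf-8")), sort_keys=True)


def write_batch(batch_df, epoch_id, graph):
    """Write one micro-batch; safe to re-run because MERGE is idempotent."""
    # foreachBatch calls this on the driver with a regular DataFrame.
    for row in batch_df.select("value").collect():
        graph.run(MERGE_QUERY, key=message_key(row["value"]))
```

It would be wired up as `ds1.writeStream.foreachBatch(lambda df, epoch_id: write_batch(df, epoch_id, graph)).start()`.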

Use all workers PySpark YARN

How do I use all the workers in the cluster when I run PySpark in a notebook?
I'm running on Google Dataproc with YARN.
I use this configuration:
import pyspark
from pyspark.sql import SparkSession

conf = pyspark.SparkConf().setAll([
    ('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar'),
    ('spark.jars.packages', 'graphframes:graphframes:0.7.0-spark2.3-s_2.11'),
    ('spark.executor.heartbeatInterval', "1000s"),
    ("spark.network.timeoutInterval", "1000s"),
    ("spark.network.timeout", "10000s"),
    ("spark.network.timeout", "1001s")
])
spark = SparkSession.builder \
    .appName('testing bq v04') \
    .config(conf=conf) \
    .getOrCreate()
But it doesn't look like it is using all the available resources:
Here is some more context. The problem arises when I run the label propagation algorithm with GraphFrames:
g_df = GraphFrame(vertices_df, edges_df)
result_iteration_2 = g_df.labelPropagation(maxIter=5)
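The configuration above sets no executor sizing, so the number of executors is left to the cluster defaults. As a sketch of what explicit sizing could look like, the helper below builds extra `(key, value)` pairs for `setAll`. The property names are standard Spark settings; the helper name `executor_conf` and every number passed to it are hypothetical and depend on the actual machine types. It also ignores cores and memory reserved for YARN and the OS, and the job still needs enough partitions to keep all executors busy.

```python
def executor_conf(workers, cores_per_worker, cores_per_executor=4):
    """Build Spark settings that request a fixed number of executors."""
    # How many executors of the chosen size fit on one worker node.
    per_worker = max(1, cores_per_worker // cores_per_executor)
    return [
        # Fixed sizing: turn off dynamic allocation so the instance count is used as-is.
        ("spark.dynamicAllocation.enabled", "false"),
        ("spark.executor.instances", str(workers * per_worker)),
        ("spark.executor.cores", str(cores_per_executor)),
    ]
```

The returned list could be appended to the list passed to `SparkConf().setAll(...)` before the session is created.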

Call class from external file in parallel pyspark

I'm trying to distribute data by ID across the cluster and then call another class to run complex logic on these IDs in parallel in PySpark. I'm confused about how to structure this, as the code below did not work.
I have a file myprocess.py that contains:
class MyProcess:
    def __init__(self, sqlcontext, x):
        pass  # code here

    def complex_calculation(self):
        pass  # lots of SQL and statistical steps
Then I have my main wrapper, control.py:
from os.path import abspath
from pyspark.sql import SparkSession, SQLContext

warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
    .builder \
    .appName("complexlogic") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")
sc.addPyFile(r"myprocess.py")
from myprocess import MyProcess
sqlContext = SQLContext(sc)
settings_bc = sc.broadcast({
    'mysqlContext': sqlContext
})
# some code to create df_param
df = df_param.repartition("id")
print('number of partitions', df.rdd.getNumPartitions())
rdd__param = df.rdd.map(lambda x: MyProcess(settings_bc.value, x).complex_calculation()).collect()
The error I get:
_pickle.PicklingError: Could not serialize broadcast: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I understand that the error is probably about passing sqlContext, but I think my issue is larger than that error: what is the right way to do what I'm trying to achieve? (Edit: I'm trying to use the ID to filter 17 Hive tables and then use those 17 tables to do complex math. If I move this outside map, how will I achieve parallelism?) Any help is greatly appreciated.
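Since the error says SparkContext (and with it SQLContext) cannot be referenced inside `map`, one option is to keep the per-ID work on the driver and submit it from several driver threads, so each ID's Hive queries still run as distributed Spark jobs. This is a minimal sketch under that assumption: `run_for_ids` and `process_one` are hypothetical names, and `process_one` stands in for something like `lambda i: MyProcess(sqlContext, i).complex_calculation()`.

```python
from concurrent.futures import ThreadPoolExecutor


def run_for_ids(ids, process_one, max_workers=4):
    """Call process_one(id) for every id from a driver-side thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with ids.
        results = list(pool.map(process_one, ids))
    return dict(zip(ids, results))
```

The IDs themselves would first be brought to the driver, for example by collecting the distinct `id` column of `df_param`, which assumes the list of IDs is small.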
