closing pydeequ callback server - apache-spark

I'm using pydeequ with Spark 3.0.1 to perform some constraint checks on data.
When testing with the VerificationSuite, after calling VerificationResult.checkResultsAsDataFrame(spark, result), the callback server started by pydeequ does not seem to get terminated automatically.
If I run code containing pydeequ on an EMR cluster, for example, port 25334 seems to stay open after the Spark application closes, unless I explicitly create a JavaGateway with the Spark session and call its close() method.
import pydeequ
from pydeequ.checks import *
from pydeequ.verification import *
from pyspark.sql import SparkSession, Row

spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

df = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5),
    Row(a="bar", b=2, c=6),
    Row(a="baz", b=None, c=None)]).toDF()
from py4j.java_gateway import JavaGateway
check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasSize(lambda x: x < 3) \
        .hasMin("b", lambda x: x == 0) \
        .isComplete("c") \
        .isUnique("a") \
        .isContainedIn("a", ["foo", "bar", "baz"]) \
        .isNonNegative("b")) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show(truncate=False)

a = JavaGateway(spark.sparkContext._gateway)
a.close()
If I don't include the last two lines of code, the callback server stays open on that port.
Is there a way around this?

The PyDeequ GitHub README says to use these to shut down the callback server and the Spark session:
spark.sparkContext._gateway.shutdown_callback_server()
spark.stop()
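In practice it helps to wrap the job in a try/finally so the callback server is closed even if a check throws. A minimal sketch reusing the names from the question's snippet:

try:
    checkResult = VerificationSuite(spark).onData(df).addCheck(check).run()
    VerificationResult.checkResultsAsDataFrame(spark, checkResult).show(truncate=False)
finally:
    # always tear down the py4j callback server before stopping Spark
    spark.sparkContext._gateway.shutdown_callback_server()
    spark.stop()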

Related

Unable to create Scheduler Pools in EMR using PySpark

I am fairly new to the concept of Spark schedulers/pooling and need to implement it in one of my projects. To understand the concept better, I wrote the following streaming PySpark code on my local machine and executed it:
from pyspark.sql import SparkSession
import threading

def do_job(f1, f2):
    df1 = spark.readStream.json(f1)
    df2 = spark.readStream.json(f2)
    df = df1.join(df2, "id", "inner")
    df.writeStream.format("parquet").outputMode("append") \
        .option("checkpointLocation", "checkpoint" + str(f1) + "/") \
        .option("path", "Data/Sample_Delta_Data/date=A" + str(f1)) \
        .start()
    # outputs.append(df1.join(df2, "id", "inner").count())

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Demo") \
        .master("local[4]") \
        .config("spark.sql.autoBroadcastJoinThreshold", "50B") \
        .config("spark.scheduler.mode", "FAIR") \
        .config("spark.sql.streaming.schemaInference", "true") \
        .getOrCreate()

    file_prefix = "data_new/data/d"
    jobs = []
    outputs = []

    for i in range(0, 6):
        file1 = file_prefix + str(i + 1)
        file2 = file_prefix + str(i + 2)
        thread = threading.Thread(target=do_job, args=(file1, file2))
        jobs.append(thread)

    for j in jobs:
        j.start()
    for j in jobs:
        j.join()

    spark.streams.awaitAnyTermination()
    # print(outputs)
As can be seen above, I am using the FAIR scheduler option and the threading library in PySpark to implement pooling.
As a matter of fact, the above code creates pools on my local system, but when I run the same on an AWS EMR cluster, no pools get created.
Am I missing something specific to AWS EMR?
Suggestions please!
Regards.
Why are you using threading in PySpark? It handles executor cores ([spark threading]) for you. I understand you are new to Spark but clearly not new to Python. It's possible I missed a subtlety here, as streaming isn't my wheelhouse as much as Spark is.
The above code launches all the work on the driver, and I think you really should read over the Spark documentation to better understand how it handles parallelism. (You want to do the work on the executors to really get the power you are looking for.)
With respect, this is how you do things in Python, not in PySpark/Spark code. You are likely seeing the difference between client and cluster mode, and that could account for the difference. (It's a typical issue that occurs when coding locally vs. on a cluster.)
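For reference (this is not the recommendation above, just context on pools): when you do submit jobs from multiple threads, named FAIR pools are assigned per thread with sc.setLocalProperty before the job is triggered. A minimal sketch with a placeholder batch job and made-up pool/path names:

import threading
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.scheduler.mode", "FAIR").getOrCreate()
sc = spark.sparkContext

def do_job(pool_name, path):
    # the pool assignment is thread-local, so set it inside the thread's target
    sc.setLocalProperty("spark.scheduler.pool", pool_name)
    spark.read.json(path).count()  # any action submitted here runs in pool_name

threads = [threading.Thread(target=do_job, args=("pool_" + str(i), "data_new/data/d" + str(i)))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()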

Where should I put my credential data streaming with Kafka in databricks?

I have some values in Azure Key Vault (AKV).
An initial Google search gave me this:
username = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-api-key")
pwd = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-secret")

from kafka import KafkaConsumer

consumer = KafkaConsumer('TOPIC',
                         bootstrap_servers = 'SERVER:PORT',
                         enable_auto_commit = False,
                         auto_offset_reset = 'earliest',
                         consumer_timeout_ms = 2000,
                         security_protocol = 'SASL_SSL',
                         sasl_mechanism = 'PLAIN',
                         sasl_plain_username = username,
                         sasl_plain_password = pwd)
This works once when the Databricks cell runs; however, after that single run it finishes, it no longer listens for Kafka messages, and the cluster goes to the off state after the configured time (in my case 30 minutes).
So it doesn't solve my problem.
My next Google search was this Databricks blog post (Processing Data in Apache Kafka with Structured Streaming in Apache Spark 2.2):
from pyspark.sql.types import *
from pyspark.sql.functions import from_json
from pyspark.sql.functions import *

schema = StructType() \
    .add("EventHeader", StructType() \
        .add("UUID", StringType()) \
        .add("APPLICATION_ID", StringType()) \
        .add("FORMAT", StringType())) \
    .add("EmissionReportMessage", StructType() \
        .add("reportId", StringType()) \
        .add("startDate", StringType()) \
        .add("endDate", StringType()) \
        .add("unitOfMeasure", StringType()) \
        .add("reportLanguage", StringType()) \
        .add("companies", ArrayType(StructType([StructField("ccid", StringType(), True)]))))

parsed_kafka = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "SERVER:PORT") \
    .option("subscribe", "TOPIC") \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("kafka_parsed_value"))
There are some issues:
Where should I put my GenID or user/pass info?
When I run the display command, it runs, but it never stops and it never shows the result.
"however, after a single run it is finished, and it is not listening to Kafka messages anymore"
Given that you have enable_auto_commit = False, it should continue to work on subsequent runs. But this isn't using Spark...
"Where should I put my GenID or user/pass info?"
You would add the SASL/SSL properties as option() parameters.
For example, for SASL/PLAIN:
option("kafka.sasl.jaas.config",
       'org.apache.kafka.common.security.plain.PlainLoginModule required username="{}" password="{}";'.format(username, password))
See this related question.
"it will never stop"
Because you are running a streaming query, starting with readStream, rather than a batch read.
"it will never show the result"
You'll need to use parsed_kafka.writeStream.format("console").start(), for example, somewhere (assuming you want to stick with readStream rather than display() or a batch read).
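Putting that together with the question's snippet, a rough sketch (SERVER:PORT, TOPIC, and the secret scope/keys are the placeholders from the question; the console sink is just for debugging):

username = dbutils.secrets.get(scope="DATAAI-CEC", key="dai-kafka-cec-api-key")
pwd = dbutils.secrets.get(scope="DATAAI-CEC", key="dai-kafka-cec-secret")

jaas = ('org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="{}" password="{}";'.format(username, pwd))

parsed_kafka = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "SERVER:PORT")
    .option("subscribe", "TOPIC")
    .option("startingOffsets", "earliest")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", jaas)
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("kafka_parsed_value")))

# the query only produces output once a sink is started
query = parsed_kafka.writeStream.format("console").start()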

Error while using Crealytics package to read Excel file

I'm trying to read an Excel file from an HDFS location using the Crealytics package and keep getting an error (Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.connector.catalog.TableProvider). My code is below. Any tips? When running the code below, the Spark session initiates fine and the Crealytics package loads without error. The error only comes when running the "spark.read" code. The file location I'm using is accurate.
from pyspark import SparkConf
from pyspark.sql import SparkSession

def spark_session(spark_conf):
    conf = SparkConf()
    for (key, val) in spark_conf.items():
        conf.set(key, val)
    spark = SparkSession \
        .builder \
        .enableHiveSupport() \
        .config(conf=conf) \
        .getOrCreate()
    return spark

spark_conf = {"spark.executor.memory": "16g",
              "spark.yarn.executor.memoryOverhead": "3g",
              "spark.dynamicAllocation.initialExecutors": 2,
              "spark.driver.memory": "16g",
              "spark.kryoserializer.buffer.max": "1g",
              "spark.driver.cores": 32,
              "spark.executor.cores": 8,
              "spark.yarn.queue": "adhoc",
              "spark.app.name": "CDSW_basic",
              "spark.dynamicAllocation.maxExecutors": 32,
              "spark.jars.packages": "com.crealytics:spark-excel_2.12:0.14.0"
              }

# build the session from the config above
spark = spark_session(spark_conf)

df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .load("/user/data/Block_list.xlsx")
I've also tried loading it outside of the session function with the code below, which yields the same error once I try to read the file.

import os

crealytics_driver_loc = "com.crealytics:spark-excel_2.12:0.14.0"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages ' + crealytics_driver_loc + ' pyspark-shell'
Looks like I'm answering my own question. After a great deal of fiddling around, I found that an older version of Crealytics works with my setup, though I'm uncertain why. The version that worked was 0.13 ("com.crealytics:spark-excel_2.12:0.13.0"), even though the newest is version 15.
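In other words, the only change from the code in the question was pinning the package coordinate, e.g.:

spark_conf["spark.jars.packages"] = "com.crealytics:spark-excel_2.12:0.13.0"
# or, for the PYSPARK_SUBMIT_ARGS route:
crealytics_driver_loc = "com.crealytics:spark-excel_2.12:0.13.0"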

Why is Pandas-API-on-Spark's apply on groups a way slower than pyspark API?

I'm seeing strange performance results when comparing the two APIs in pyspark 3.2.1 that provide the ability to run a pandas UDF on the grouped results of a Spark DataFrame:
df.groupBy().applyInPandas()
ps_df.groupby().apply() - the new way of applying introduced in Pandas-API-on-Spark, a.k.a. Koalas
First, I run the following input-generator code in local Spark mode (Spark 3.2.1):
import pyspark.sql.types as types
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder \
    .config("spark.sql.execution.arrow.pyspark.enabled", True) \
    .getOrCreate()

ps.set_option("compute.default_index_type", "distributed")

spark.range(1000000).withColumn('group', (col('id') / 10).cast('int')) \
    .write.parquet('/tmp/sample_input', mode='overwrite')
Then I test the applyInPandas:
def getsum(pdf):
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = spark.read.parquet(f'/tmp/sample_input')
output_schema = types.StructType(
    df.schema.fields + [types.StructField('sum_in_group', types.FloatType())]
)
df.groupBy('group').applyInPandas(getsum, schema=output_schema) \
    .write.parquet('/tmp/schematest', mode='overwrite')
And the code executes in under 30 seconds (on an i7-9750H CPU).
Then I try the new API and - while I really appreciate how nice the code looks:
def getsum(pdf) -> ps.DataFrame["id": int, "group": int, "sum_in_group": int]:
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf

df = ps.read_parquet(f'/tmp/sample_input')
df.groupby('group').apply(getsum) \
    .to_parquet('/tmp/schematest', mode='overwrite')
... every time, the execution time is at least 1m 40s on the same CPU, so more than 3x slower for this simple operation.
I am aware that adding sum_in_group can be done far more efficiently with no pandas involvement (see the sketch at the end of this question), but this is just a small minimal example. Any other operation is also at least 3 times slower.
Do you know what the reason for this slowdown could be? Maybe I'm missing some context parameter that would make these execute in a similar time?
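For reference, the pure-Spark version of the same computation that I have in mind is just a window aggregation, something like:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.read.parquet('/tmp/sample_input')
df.withColumn('sum_in_group', F.sum('id').over(Window.partitionBy('group'))) \
    .write.parquet('/tmp/schematest', mode='overwrite')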

Call class from external file in parallel pyspark

I'm trying to distribute data with IDs across the cluster and then call another class to do complex logic on these IDs in parallel in PySpark. I'm confused about how to sort things out, as the code below did not work.
I have a file myprocess.py that contains:

class MyProcess:
    def __init__(self, sqlcontext, x):
        # code here
        ...

    def complex_calculation(self):
        # lots of SQL and statistical steps
        ...
Then I have my main wrapper, control.py:

from os.path import abspath
from pyspark.sql import SparkSession, SQLContext

warehouse_location = abspath('spark-warehouse')

spark = SparkSession \
    .builder \
    .appName("complexlogic") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")
sc.addPyFile(r"myprocess.py")

from myprocess import MyProcess

sqlContext = SQLContext(sc)
settings_bc = sc.broadcast({
    'mysqlContext': sqlContext
})

# some code to create df_param

df = df_param.repartition("id")
print('number of partitions', df.rdd.getNumPartitions())

rdd__param = df.rdd.map(lambda x: MyProcess(settings_bc.value, x).complex_calculation()).collect()
The error I get
_pickle.PicklingError: Could not serialize broadcast: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I understand that the error is probably about passing sqlContext, but I think my issue is larger than that error: what is the right way to do what I'm trying to achieve? (Edit: I'm trying to use the ID to filter 17 Hive tables by that ID and use those 17 tables to do complex math. If I move outside map, how will I achieve parallelism?) Any help is greatly appreciated.
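For what it's worth, my reading of the SPARK-5063 error is that a SparkContext or SQLContext can never be shipped to the executors, so only plain, picklable values can go into the broadcast. Purely to illustrate that constraint (the settings values here are made up, and this alone does not answer the wider Hive-access question):

# broadcast only plain, picklable values - never a SparkContext or SQLContext
settings_bc = sc.broadcast({'batch_date': '2021-01-01'})

# the mapped function can read settings_bc.value, but it cannot touch spark or sqlContext
rdd__param = df.rdd.map(lambda x: MyProcess(settings_bc.value, x).complex_calculation()).collect()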
