SparkSubmitOperator with Multiple .jars - apache-spark

I am trying to write a data pipeline that reads a .tsv file from Azure Blob Storage and writes the data to a MySQL database. I have a sensor that looks for a file with a given prefix in my storage container, and then a SparkSubmitOperator that actually reads the data and writes it to the database.
The sensor works fine, and writing the data from local storage to MySQL works fine as well. However, I am having quite a bit of trouble reading the data from Blob Storage.
This is the simple Spark job that I am trying to run:
spark = (SparkSession
    .builder
    .config("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    .config("fs.azure.account.key.{}.blob.core.windows.net".format(blob_account_name), blob_account_key)
    .getOrCreate()
)
sc = spark.sparkContext
sc.setLogLevel("WARN")

df_tsv = spark.read.csv(
    "wasb://{}@{}.blob.core.windows.net/{}".format(blob_container, blob_account_name, blob_name),
    sep=r'\t',
    header=True
)

mysql_url = 'jdbc:mysql://' + mysql_server
df_tsv.write.jdbc(
    url=mysql_url,
    table=mysql_table,
    mode="append",
    properties={"user": mysql_user, "password": mysql_password, "driver": "com.mysql.cj.jdbc.Driver"}
)
This is my SparkSubmitOperator:
spark_job = SparkSubmitOperator(
    task_id="my-spark-app",
    application="path/to/my/spark/job.py",  # Spark application path, present on both the Airflow and Spark clusters
    name="my-spark-app",
    conn_id="spark_default",
    verbose=1,
    conf={"spark.master": spark_master},
    application_args=[tsv_file, mysql_server, mysql_user, mysql_password, mysql_table],
    jars=azure_hadoop_jar + ", " + mysql_driver_jar,
    driver_class_path=azure_hadoop_jar + ", " + mysql_driver_jar,
    dag=dag)
I keep getting this error,
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
What exactly am I doing wrong?
I have both mysql-connector-java-8.0.27.jar and hadoop-azure-3.3.1.jar in my application. I have given the path to these in the driver_class_path and jars parameters. Is there something wrong with how I have done that here?
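For reference, spark-submit treats --jars as a comma-separated list with no spaces, while --driver-class-path is an ordinary classpath joined with the platform separator (':' on Linux). A minimal sketch of how the two strings are commonly built (the jar locations below are placeholders, not the real paths, and this is not claimed to be the full fix):
# Hypothetical jar locations -- substitute the real paths on the Airflow/Spark hosts.
azure_hadoop_jar = "/opt/jars/hadoop-azure-3.3.1.jar"
mysql_driver_jar = "/opt/jars/mysql-connector-java-8.0.27.jar"

# --jars: comma-separated, no spaces; --driver-class-path: ':'-separated classpath.
jars = ",".join([azure_hadoop_jar, mysql_driver_jar])
driver_class_path = ":".join([azure_hadoop_jar, mysql_driver_jar])

spark_job = SparkSubmitOperator(
    task_id="my-spark-app",
    application="path/to/my/spark/job.py",
    conn_id="spark_default",
    conf={"spark.master": spark_master},
    application_args=[tsv_file, mysql_server, mysql_user, mysql_password, mysql_table],
    jars=jars,
    driver_class_path=driver_class_path,
    dag=dag)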
I have tried following the suggestions given here, Saving Pyspark Dataframe to Azure Storage, but they have not been helpful.

Related

Unable to create Scheduler Pools in EMR using PySpark

I am fairly new to the concept of Spark schedulers/pooling and need to implement it in one of my projects. To understand the concept better, I put together the following streaming PySpark code on my local machine and executed it:
from pyspark.sql import SparkSession
import threading

def do_job(f1, f2):
    df1 = spark.readStream.json(f1)
    df2 = spark.readStream.json(f2)
    df = df1.join(df2, "id", "inner")
    df.writeStream.format("parquet").outputMode("append") \
        .option("checkpointLocation", "checkpoint" + str(f1) + "/") \
        .option("path", "Data/Sample_Delta_Data/date=A" + str(f1)) \
        .start()
    # outputs.append(df1.join(df2, "id", "inner").count())

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Demo") \
        .master("local[4]") \
        .config("spark.sql.autoBroadcastJoinThreshold", "50B") \
        .config("spark.scheduler.mode", "FAIR") \
        .config("spark.sql.streaming.schemaInference", "true") \
        .getOrCreate()

    file_prefix = "data_new/data/d"
    jobs = []
    outputs = []

    for i in range(0, 6):
        file1 = file_prefix + str(i + 1)
        file2 = file_prefix + str(i + 2)
        thread = threading.Thread(target=do_job, args=(file1, file2))
        jobs.append(thread)

    for j in jobs:
        j.start()

    for j in jobs:
        j.join()

    spark.streams.awaitAnyTermination()
    # print(outputs)
As can be seen above, I am using the FAIR scheduler option and the threading library in PySpark to implement pooling.
As a matter of fact, the above code creates pools on my local system, but when I run the same code on an AWS EMR cluster, no pools are created.
Am I missing something specific to AWS EMR?
Suggestions please!
Regards.
Why are you using threading in PySpark? It handles executor cores --> [spark threading] for you. I understand you are new to Spark but clearly not new to Python. It's possible I missed a subtlety here, as streaming isn't my wheelhouse as much as Spark is.
The above code will launch all the work on the driver, and I think you really should read over the Spark documentation to better understand how it handles parallelism. (You want to do the work on executors to really get the power you are looking for.)
With respect, this is how you do things in Python, not in PySpark/Spark code. You are likely seeing the difference between client vs cluster mode, and that could account for the difference. (It's a typical issue that comes up when coding locally vs on the cluster.)
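For completeness, the way Spark's job scheduling guide assigns work submitted from a thread to a named FAIR pool is sc.setLocalProperty("spark.scheduler.pool", ...) inside that thread. A minimal sketch under that assumption (pool names, paths, and the allocation file are hypothetical, and thread-local property behaviour in PySpark can vary by version):
# Sketch only: each thread assigns its jobs to a FAIR scheduler pool by
# setting a thread-local property before submitting any work.
import threading
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("fair-pools-sketch")
         .config("spark.scheduler.mode", "FAIR")
         # Optional: a fairscheduler.xml that defines the pools (hypothetical path)
         .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
         .getOrCreate())
sc = spark.sparkContext

def run_in_pool(pool_name, path):
    # Jobs launched from this thread inherit the thread-local pool property
    sc.setLocalProperty("spark.scheduler.pool", pool_name)
    spark.read.json(path).count()  # any action triggered here is scheduled in pool_name

threads = [threading.Thread(target=run_in_pool, args=("pool_%d" % i, "data_new/data/d%d" % (i + 1)))
           for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()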

In an Azure Synapse Analytics (PySpark) notebook, using the Spark Context Hadoop file system, I'm not able to move/copy or rename files

In an Azure Synapse Analytics (PySpark) notebook, using the Spark Context Hadoop file system, I'm able to delete a folder or file,
but not able to move/copy or rename files; I keep getting an error.
Below is the snippet I used:
from pyspark.sql import SparkSession

# prepare spark session
spark = SparkSession.builder.appName('filesystemoperations').getOrCreate()

# spark context
sc = spark.sparkContext

# File path declaration
containerName = "ZZZZZZZ"
fullRootPath = "abfss://{cName}@{cName}.dfs.core.windows.net".format(cName=containerName)
tablePath = "/ws/tables/"
eventName = 'abcd'
tableFilename = "Game_" + eventName + ".kql"
tableFile = fullRootPath + tablePath + tableFilename
tempTableFile = fullRootPath + tablePath + tempPath + tableFilename

# empty the paths
sc._jsc.hadoopConfiguration().set('fs.defaultFS', fullRootPath)
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.delete(sc._jvm.org.apache.hadoop.fs.Path(fullRootPath + tablePath), True)
## This one worked fine.

# dfStringFinalF contains a string
spark.sparkContext.parallelize([dfStringFinalF]).coalesce(1).saveAsTextFile(tempTableFile)
fs.copy(tempTableFile + '/part-00000', tableFile)
# Copy, rename, cp, mv -- nothing works on fs
Please help
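As an aside on the fs calls above: the Hadoop FileSystem object has rename() and delete() methods but no copy/cp/mv, so copying generally goes through org.apache.hadoop.fs.FileUtil.copy. A minimal sketch through the Py4J gateway, reusing the paths defined above (whether this route is allowed in Synapse is a separate question, see the answer below):
# Sketch: move or copy the saved part file with the Hadoop FileSystem/FileUtil APIs via Py4J.
Path = sc._jvm.org.apache.hadoop.fs.Path
FileUtil = sc._jvm.org.apache.hadoop.fs.FileUtil
hadoop_conf = sc._jsc.hadoopConfiguration()
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

src = Path(tempTableFile + '/part-00000')
dst = Path(tableFile)

# Rename (move) within the same file system
fs.rename(src, dst)

# Or copy while keeping the source (deleteSource=False)
# FileUtil.copy(fs, src, fs, dst, False, hadoop_conf)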
It seems Azure Synapse Analytics has limitations with the Spark Context Hadoop file system and the shutil library.
The mssparkutils library helps with copying/moving files in the blob containers.
Here is a code sample:
from notebookutils import mssparkutils

fullRootPath = "abfss://{cName}@{cName}.dfs.core.windows.net".format(cName="zzzz")

fsmounting = mssparkutils.fs.mount(
    fullRootPath,
    "/ws",
    {"linkedService": "abcd"}
)

fsMove = mssparkutils.fs.mv(tempTableFile + '/part-00000', tableFile)
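If a copy (rather than a move) is needed, mssparkutils.fs also exposes a cp helper; a small usage sketch along the same lines (paths as above, the recursive flag shown for completeness):
# Copy the part file instead of moving it; the third argument copies directories recursively.
mssparkutils.fs.cp(tempTableFile + '/part-00000', tableFile, False)

# List the target directory to confirm the file landed where expected
for f in mssparkutils.fs.ls(fullRootPath + tablePath):
    print(f.name, f.size)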
More reference in the link below:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-file-mount-api

Running a Spark Streaming job in Zeppelin throws connection refused 8998 error

I'm working in a virtual machine. I run a Spark Streaming job which I basically copied from a Databricks tutorial.
%pyspark
query = (
    streamingCountsDF
    .writeStream
    .format("memory")        # memory = store in-memory table
    .queryName("counts")     # counts = name of the in-memory table
    .outputMode("complete")  # complete = all the counts should be in the table
    .start()
)
Py4JJavaError: An error occurred while calling o101.start.
: java.net.ConnectException: Call From VirtualBox/127.0.1.1 to localhost:8998 failed on connection exception: java.net.ConnectException:
I checked and there is no service listening on port 8998. I learned that this port is associated with the Apache Livy server, which I am not using. Can someone point me in the right direction?
Ok, so I fixed this issue. First, I added 'file://' when specifying the input folder. Second, I added a checkpoint location. See code below:
inputFolder = 'file:///home/sallos/tmp/'

streamingInputDF = (
    spark
    .readStream
    .schema(schema)
    .option("maxFilesPerTrigger", 1)  # Treat a sequence of files as a stream by picking one file at a time
    .csv(inputFolder)
)

streamingCountsDF = (
    streamingInputDF
    .groupBy(
        streamingInputDF.SrcIPAddr,
        window(streamingInputDF.Datefirstseen, "30 seconds"))
    .sum('Bytes').withColumnRenamed("sum(Bytes)", "sum_bytes")
)

query = (
    streamingCountsDF
    .writeStream.format("memory")
    .queryName("sumbytes")
    .outputMode("complete")
    .option("checkpointLocation", "file:///home/sallos/tmp_checkpoint/")
    .start()
)
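Since the memory sink registers the query name as an in-memory table, the running aggregates can then be inspected from another notebook paragraph with plain SQL; a small usage sketch:
%pyspark
# Query the in-memory table registered by the memory sink (queryName "sumbytes")
spark.sql("select * from sumbytes order by sum_bytes desc").show(truncate=False)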

StructuredStreaming with ForEachWriter creating duplicates

Hi, I'm trying to create a Neo4j sink using PySpark and Kafka, but for some reason this sink is creating duplicates in Neo4j and I'm not sure why this is happening. I am expecting to get only one node, but it looks like it's creating 4. If someone has an idea, please let me know.
Kafka producer code:
from kafka import KafkaProducer
import json

producer = KafkaProducer(bootstrap_servers='10.0.0.38:9092')

message = {
    'test_1': 'test_1',
    'test_2': 'test_2'
}

producer.send('test_topic', json.dumps(message).encode('utf-8'))
producer.close()
Kafka consumer code:
from kafka import KafkaConsumer
import findspark
from py2neo import Graph
import json

findspark.init()
from pyspark.sql import SparkSession

class ForeachWriter:

    def open(self, partition_id, epoch_id):
        neo4j_uri = ''  # neo4j uri
        neo4j_auth = ('', '')  # neo4j user, password
        self.graph = Graph(neo4j_uri, auth=neo4j_auth)
        return True

    def process(self, msg):
        msg = json.loads(msg.value.decode('utf-8'))
        self.graph.run("CREATE (n: MESSAGE_RECEIVED) SET n.key = '" + str(msg).replace("'", '"') + "'")
        raise KeyError('received message: {}. finished creating node'.format(msg))

spark = SparkSession.builder.appName('test-consumer') \
    .config('spark.executor.instances', 1) \
    .getOrCreate()

ds1 = spark.readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', '10.0.0.38:9092') \
    .option('subscribe', 'test_topic') \
    .load()

query = ds1.writeStream.foreach(ForeachWriter()).start()
query.awaitTermination()
neo4j graph after running code
After doing some searching, I found this snippet of text from Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming (chapter 11, p. 151), right after it describes open, process, and close for ForeachWriter:
This contract is part of the data delivery semantics because it allows us to remove duplicated partitions that might already have been sent to the sink but are reprocessed by Structured Streaming as part of a recovery scenario. For that mechanism to properly work, the sink must implement some persistent way to remember the partition/version combinations that it has already seen.
On another note from the spark website: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html (see section on Foreach).
Note: Spark does not guarantee same output for (partitionId, epochId), so deduplication cannot be achieved with (partitionId, epochId). e.g. source provides different number of partitions for some reasons, Spark optimization changes number of partitions, etc. See SPARK-28650 for more details. If you need deduplication on output, try out foreachBatch instead.
It seems like I need to implement a check for uniqueness, because Structured Streaming automatically reprocesses partitions in case of a failure if I use ForeachWriter; otherwise I have to switch to foreachBatch instead.
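For illustration, a minimal foreachBatch sketch under the assumption that a Cypher MERGE keyed on the message content is enough to keep reprocessed batches from creating duplicate nodes; the connection details and checkpoint path are placeholders:
from py2neo import Graph

def write_batch_to_neo4j(batch_df, batch_id):
    # Placeholder connection details
    graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
    # collect() is fine for tiny test batches; avoid it for large ones
    for row in batch_df.selectExpr("CAST(value AS STRING) AS value").collect():
        # MERGE is idempotent, so a reprocessed batch does not create a second node
        graph.run("MERGE (n:MESSAGE_RECEIVED {key: $key})", key=row.value)

query = (ds1.writeStream
         .option("checkpointLocation", "/tmp/neo4j_sink_checkpoint")  # placeholder path
         .foreachBatch(write_batch_to_neo4j)
         .start())
query.awaitTermination()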

Reading excel files in a streaming fashion in spark 2.0.0

I have a set of Excel-format files which need to be read from Spark (2.0.0) as and when an Excel file is loaded into a local directory. The Scala version used here is 2.11.8.
I've tried using the readStream method of SparkSession, but I'm not able to read in a streaming way. I'm able to read Excel files statically as:
val df = spark.read.format("com.crealytics.spark.excel")
  .option("sheetName", "Data")
  .option("useHeader", "true")
  .load("Sample.xlsx")
Is there any other way of reading excel files in streaming way from a local directory?
Any answers would be helpful.
Thanks
Changes done:
val spark = SparkSession.builder().master("local[*]").config("spark.sql.warehouse.dir","file:///D:/pooja").appName("Spark SQL Example").getOrCreate()
spark.conf.set("spark.sql.streaming.schemaInference", true)
import spark.implicits._
val dataFrame = spark.readStream.format("csv").option("inferSchema",true).option("header", true).load("file:///D:/pooja/sample.csv")
dataFrame.writeStream.format("console").start()
dataFrame.show()
Updated code:
val spark = SparkSession.builder().master("local[*]").appName("Spark SQL Example").getOrCreate()
spark.conf.set("spark.sql.streaming.schemaInference", true)
import spark.implicits._
val df = spark.readStream.format("com.crealytics.spark.excel").option("header", true).load("file:///filepath/*.xlsx")
df.writeStream.format("memory").queryName("tab").start().awaitTermination()
val res = spark.sql("select * from tab")
res.show()
Error:
Exception in thread "main" java.lang.UnsupportedOperationException: Data source com.crealytics.spark.excel does not support streamed reading
Can anyone help me resolve this issue?
For a streaming DataFrame you have to provide a schema, and currently DataStreamReader does not support option("inferSchema", true|false). You can set the SQLConf setting spark.sql.streaming.schemaInference, which needs to be set at the session level.
You can refer here
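To illustrate the schema point (in PySpark here, although the question's code is Scala): a streaming file read either takes an explicit schema or relies on the session-level schemaInference setting; the column names and paths below are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("streaming-schema-sketch").getOrCreate()

# Option 1: provide an explicit schema up front (column names are assumptions)
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
df = spark.readStream.schema(schema).option("header", True).csv("file:///path/to/csv/dir")

# Option 2: enable schema inference for file sources at the session level
spark.conf.set("spark.sql.streaming.schemaInference", "true")
df2 = spark.readStream.option("header", True).csv("file:///path/to/csv/dir")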
