Does Dataproc support Delta Lake format? - apache-spark

Is the Databricks Delta format available with Google's GCP Dataproc?
For AWS and Azure it is clearly supported, but after researching the internet I am unsure whether that is the case for Dataproc, and the Databricks docs are not much clearer.
I am assuming Google feels its own offerings are sufficient, e.g. Google Cloud Storage, but is that mutable? This page https://docs.gcp.databricks.com/getting-started/overview.html provides too little context.

Delta Lake format is supported on Dataproc. You can use it like any other data format such as Parquet or ORC. The following is a code example from this article.
# Copyright 2022 Google LLC.
# SPDX-License-Identifier: Apache-2.0
import sys
from pyspark.sql import SparkSession
from delta import *

def main():
    input = sys.argv[1]
    print("Starting job: GCS Bucket: ", input)
    spark = SparkSession\
        .builder\
        .appName("DeltaTest")\
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")\
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")\
        .getOrCreate()
    data = spark.range(0, 500)
    data.write.format("delta").mode("append").save(input)
    df = spark.read \
        .format("delta") \
        .load(input)
    df.show()
    spark.stop()

if __name__ == "__main__":
    main()
You also need to add the dependency when submitting the job with --properties="spark.jars.packages=io.delta:delta-core_2.12:1.1.0".
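If you are experimenting with the same script locally rather than through a Dataproc job submission, the delta-spark pip package offers configure_spark_with_delta_pip (also used in a question further below), which injects the matching delta-core artifact for you. A minimal sketch, assuming delta-spark is installed and the table path is passed as the first argument:
# Hedged sketch: assumes `pip install delta-spark` and a locally created SparkSession;
# on Dataproc itself, the --properties flag shown above is the documented route.
import sys
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = SparkSession.builder \
    .appName("DeltaLocalTest") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# configure_spark_with_delta_pip adds the matching delta-core jar via spark.jars.packages
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.range(0, 10).write.format("delta").mode("append").save(sys.argv[1])
spark.read.format("delta").load(sys.argv[1]).show()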

Related

How to fix "File file /tmp/delta-table does not exist in Delta Lake?

Hello dear programmers,
I am currently setting up Delta Lake with Apache Spark. For the spark worker and master I am using the image docker.io/bitnami/spark:3.
What I am trying to do is create a new Delta table from my Python application via the Spark master/worker I set up. However, when I try to save the table I get the following error: File file:/tmp/delta-table/_delta_log/00000000000000000000.json does not exist.
This might have something to do with the worker/master containers not being able to access my local files, but I am not sure how to fix it. I also looked into using HDFS, but would I need to run a separate server for that, or does Delta Lake already have this built in?
The code of my application looks as follows:
import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .master("spark://spark:7077") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.1.0")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
data = spark.range(0, 5)
data.write.mode("overwrite").format("delta").save("/tmp/delta-table")
df = spark.read.format("delta").load("/tmp/delta-table")
df.show()
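If the suspected cause is right and /tmp/delta-table only exists inside the driver container, one hedged illustration of that idea is to save to a location every container can reach; the /shared/delta-table mount point below is hypothetical and stands in for a volume (or object-store URI) visible to the master, the workers, and the driver:
# Hedged continuation of the snippet above: write to a path that the driver AND
# every worker can see (here a hypothetical shared volume mounted at /shared).
shared_path = "/shared/delta-table"  # hypothetical mount present in all containers
data.write.mode("overwrite").format("delta").save(shared_path)
spark.read.format("delta").load(shared_path).show()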

How to ingest data from Event Hubs to ADLS using a Databricks cluster (Scala)

I want to ingest streaming data from Event Hubs to ADLS Gen2 in a specified format.
I have done batch ingestion, from a DB to ADLS and from container to container, but now I want to try streaming ingestion.
Can you please guide me on where to start? I have already created an Event Hub, a Databricks instance, and a storage account in Azure.
You just need to follow the documentation (for Scala, for Python) for the Event Hubs Spark connector. In the simplest form the code looks like the following (for Python):
from pyspark.sql import functions as F  # needed for F.col below

readConnectionString = "..."
ehConf = {}
# encrypting the connection string is required for connector versions 2.3.15+
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(readConnectionString)

df = spark.readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()

# cast the binary payload to string (how to decode it really depends on the
# data format inside the topic)
cdf = df.withColumn("body", F.col("body").cast("string"))

# write data to storage
stream = cdf.writeStream.format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoint/directory") \
    .start("ADLS location")
You may need to add additional options, such as starting positions, but everything is described well in the documentation.
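For example, a starting position is passed through ehConf as a JSON document. The field names below follow my reading of the azure-event-hubs-spark PySpark documentation and should be treated as an assumption to verify against the connector version you use:
import json

# Hedged sketch: start reading from the beginning of the stream.
# Field names (offset, seqNo, enqueuedTime, isInclusive) are assumed from the
# connector's PySpark docs; "-1" means the start and "@latest" the end of the stream.
startingEventPosition = {
    "offset": "-1",
    "seqNo": -1,            # ignored when offset is set
    "enqueuedTime": None,   # ignored when offset is set
    "isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)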

Why do I see two jobs in Spark UI for a single read?

I am trying to run the script below to load a file with 24k records. Is there any reason why I am seeing two jobs for a single load in the Spark UI?
Code:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("DM") \
    .getOrCreate()

trades_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("s3://bucket/source.csv")

trades_df.rdd.getNumPartitions() returns 1.
(Spark UI screenshot)
That's because Spark reads the CSV file twice when inferSchema is enabled: once to infer the schema and once for the actual load.
Read the comments for the function def csv(csvDataset: Dataset[String]): DataFrame in Spark's GitHub repo here.
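One common way to avoid the extra job is to declare the schema yourself instead of inferring it. A minimal sketch, with the column names and types invented here purely for illustration:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Hypothetical schema -- replace the fields with the real columns of source.csv.
trades_schema = StructType([
    StructField("trade_id", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("trade_ts", TimestampType(), True),
])

# With an explicit schema there is no inference pass, so only one job is triggered.
trades_df = spark.read.format("csv") \
    .option("header", "true") \
    .schema(trades_schema) \
    .load("s3://bucket/source.csv")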

Databricks: structured stream data assignment and display

I have the following streaming code in a Databricks notebook (Python).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("MyTest") \
    .getOrCreate()

# Create a streaming DataFrame
lines = spark.readStream \
    .format("delta") \
    .table("myschema.streamTest")
In notebook 2, I have
def foreach_batch_function(df, epoch_id):
    test = df
    print(test['simplecolumn'])
    display(test['simplecolumn'])
    test['simplecolumn'].display

lines.writeStream.outputMode("append").foreachBatch(foreach_batch_function).format('console').start()
When I execute the above, where can I see the output from the .display function? I looked in the cluster driver logs and I don't see anything, and I also don't see anything in the notebook itself except a successfully initialized and running stream. I do see the dataframe parameter's data displayed in the console, but I am trying to verify that assigning test was successful.
I am trying to carry out this manipulation as a precursor to time-series operations over mini-batches for real-time model scoring in Python, but I am struggling to get the basics right in the Structured Streaming world. A working model already runs, but only every 10-15 minutes; I would like to make it real-time via streams, hence this question.
You're mixing different things together. I recommend reading the initial parts of the Structured Streaming documentation or chapter 8 of the Learning Spark, 2nd edition book (freely available from here).
You can use the display function directly on the stream (better with the checkpointLocation and maybe trigger parameters, as described in the documentation), like:
display(lines)
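As far as I recall, the Databricks display helper accepts those streaming-related settings as keyword arguments; the exact parameter names below are an assumption to check against the Databricks documentation:
# Hedged sketch: optional streaming arguments for Databricks' display().
# The keyword names (streamName, checkpointLocation) are assumed; verify them.
display(
    lines,
    streamName="lines_stream",                            # hypothetical query name
    checkpointLocation="/tmp/_checkpoints/display_lines"  # hypothetical path
)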
Regarding the scoring: usually it's done by defining a user-defined function and applying it to the stream via the select or withColumn functions of the dataframe. The easiest way is to register the model in the MLflow registry and then load it with built-in functions, like:
import mlflow.pyfunc
pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)
preds = lines.withColumn("predictions", pyfunc_udf(params...))
Look into that notebook for examples.

How to restart pyspark streaming query from checkpoint data?

I am creating a Spark streaming application using PySpark 2.2.0.
I am able to create a streaming query:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession \
    .builder \
    .appName("StreamingApp") \
    .getOrCreate()

staticDataFrame = spark.read.format("parquet") \
    .option("inferSchema", "true").load("processed/Nov18/")
staticSchema = staticDataFrame.schema

streamingDataFrame = spark.readStream \
    .schema(staticSchema) \
    .option("maxFilesPerTrigger", 1) \
    .format("parquet") \
    .load("processed/Nov18/")

daily_trs = streamingDataFrame.select("shift", "date", "time") \
    .groupBy("date", "shift") \
    .count()

writer = daily_trs.writeStream \
    .format("parquet") \
    .option("path", "data") \
    .option("checkpointLocation", "data/checkpoints") \
    .queryName("streamingData") \
    .outputMode("append")

query = writer.start()
query.awaitTermination()
The query is streaming, and any additional file added to "processed/Nov18" will be processed and stored to "data/".
If the streaming fails, I want to restart the same query.
Path to solution
According to the official documentation, I can get an id that can be used to restart the query:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html?highlight=streamingquery#pyspark.sql.streaming.StreamingQuery.id
The pyspark.streaming module contains a StreamingContext class that has the classmethod
classmethod getActiveOrCreate(checkpointPath, setupFunc)
https://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext.getOrCreate
Can these methods be used somehow?
Does anyone have a use case of a production-ready streaming app for reference?
You should simply (re)start the PySpark application with the checkpoint directory available, and Spark Structured Streaming does the rest. No changes are required.
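A minimal sketch of what that looks like in practice, assuming the failed application is simply launched again with the identical writeStream definition and the same checkpointLocation ("data/checkpoints" as in the question); the restarted query resumes from the offsets recorded in the checkpoint instead of reprocessing everything:
# Hedged sketch: re-running the same writeStream definition after a failure.
# Because "data/checkpoints" already holds the query's progress, Structured
# Streaming resumes where the previous run stopped; no query id is needed.
query = daily_trs.writeStream \
    .format("parquet") \
    .option("path", "data") \
    .option("checkpointLocation", "data/checkpoints") \
    .queryName("streamingData") \
    .outputMode("append") \
    .start()
query.awaitTermination()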
Does anyone have a use case of a production-ready streaming app for reference?
I'd ask on the Spark users mailing list.
