XML parsing in Spark Structured Streaming - apache-spark

I'm trying to analyze data using Kinesis source in PySpark Structured Streaming on Databricks.
I created a Dataframe as shown below.
kinDF = spark.readStream.format("kinesis").option("streamName", "test-stream-1").load()
Later I converted the data from base64 encoding as below.
df = kinDF.withColumn("xml_data", expr("CAST(data as string)"))
Now, I need to extract a few fields from the df.xml_data column using XPath. Can you please suggest any possible solution?
If I create a dataframe directly for these xml files as xml_df = spark.read.format("xml").options(rowTag='Consumers').load("s3a://bkt/xmldata"), I'm able to query using xpath:
xml_df.select("Analytics.Amount1").show()
But I'm not sure how to extract elements similarly on a Spark Streaming DataFrame, where the data is in text format.
Are there any XML functions to convert text data using a schema? I saw an example for JSON data using from_json.
Is it possible to use spark.read on a dataframe column?
I need to find aggregated "Amount1" for every 5 minutes window.
Thanks for your help

You can use com.databricks.spark.xml.XmlReader to read XML data from a column, but it requires an RDD, which means you need to transform your DataFrame to an RDD using df.rdd, and that may impact performance.
Below is untested code (Scala, adapted from a Java example):
import com.databricks.spark.xml.XmlReader

val xmlRdd = kinDF.select("xml_data").rdd.map(r => r.getString(0))
val xmlDf = new XmlReader().xmlRdd(spark, xmlRdd)
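As an alternative (untested sketch, assuming the spark-xml package is attached to the cluster), spark-xml also exposes a from_xml column function that can parse the XML string directly on the streaming DataFrame, which avoids the RDD round-trip and keeps the query streamable, so a windowed aggregation of Amount1 is possible. The schema is inferred here from a static read of the same XML files, and approximateArrivalTimestamp is assumed to be the Kinesis source's timestamp column.

// Untested sketch -- requires the spark-xml package (com.databricks:spark-xml).
import com.databricks.spark.xml.functions.from_xml
import org.apache.spark.sql.functions.{col, expr, sum, window}

// Infer (or hand-write) the schema from a static sample of the same XML files.
val xmlSchema = spark.read.format("xml")
  .option("rowTag", "Consumers")
  .load("s3a://bkt/xmldata")
  .schema

// Parse the XML string column produced from the Kinesis payload.
val parsed = kinDF
  .withColumn("xml_data", expr("CAST(data AS STRING)"))
  .withColumn("parsed", from_xml(col("xml_data"), xmlSchema))

// Illustrative 5-minute tumbling-window sum of Amount1; assumes the Kinesis
// source exposes approximateArrivalTimestamp as the event-time column.
val amounts = parsed
  .select(
    col("parsed.Analytics.Amount1").cast("double").as("Amount1"),
    col("approximateArrivalTimestamp").as("eventTime"))
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"))
  .agg(sum("Amount1").as("total_amount1"))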

Related

How to select columns that contain any of the given strings as part of the column name in Pyspark [duplicate]

What is the most efficient way to read only a subset of columns in Spark from a Parquet file that has many columns? Is using spark.read.format("parquet").load(<parquet>).select(...col1, col2) the best way to do that? I would also prefer to use a typesafe Dataset with case classes to pre-define my schema, but I'm not sure.
val df = spark.read.parquet("fs://path/file.parquet").select(...)
This will only read the corresponding columns. Indeed, Parquet is a columnar storage format, and it is meant exactly for this type of use case. Try running df.explain; Spark will print the execution plan and show that only the corresponding columns are read. explain will also tell you which filters are pushed down to the physical execution plan if you also use a where condition. Finally, use the following code to convert the DataFrame (a Dataset of Rows) to a Dataset of your case class.
case class MyData...
val ds = df.as[MyData]
At least in some cases, getting a DataFrame with all columns and then selecting a subset won't work. E.g. the following will fail if the Parquet file contains at least one field with a type that is not supported by Spark:
spark.read.format("parquet").load("<path_to_file>").select("col1", "col2")
One solution is to provide a schema that contains only the requested columns to load:
spark.read.format("parquet").load("<path_to_file>",
schema="col1 bigint, col2 float")
Using this, you will be able to load a subset of Spark-supported Parquet columns even if loading the full file is not possible. I'm using PySpark here, but I would expect the Scala version to have something similar; see the sketch below.
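For reference, an untested Scala sketch of the same idea: from Spark 2.3 onward, DataFrameReader.schema also accepts a DDL string, so the read can be restricted to the listed columns in the same way (col1/col2 are just the placeholders from the example above).

// Untested sketch: only the columns named in the DDL schema are read.
val subset = spark.read
  .schema("col1 BIGINT, col2 FLOAT")
  .parquet("<path_to_file>")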
Spark supports column pruning and predicate pushdown with Parquet, so
load(<parquet>).select(...col1, col2)
is fine.
I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure.
This could be an issue, as it looks like some optimizations don't work in this context; see Spark 2.0 Dataset vs DataFrame.
Parquet is a columnar file format. It is designed exactly for this kind of use case.
val df = spark.read.parquet("<PATH_TO_FILE>").select(...)
should do the job for you.

Upsert into table using Merge in Scala using Databricks

With Databricks Delta Table you can upsert data from a source table, view, or DataFrame into a target Delta table using the merge operation. This operation is similar to the SQL MERGE INTO command but has additional support for deletes and extra conditions in updates, inserts, and deletes.
I can successfully carry out a Merge using the following Python code:
from delta.tables import *
deltaTable = DeltaTable.forPath(spark, delta_path)
(deltaTable
.alias("t")
.merge(loanUpdates.alias("s"), "t.loan_id = s.loan_id")
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute())
However, I need to use Scala. Therefore, can someone provide the code that will do the same in Scala? Basically, I need help converting the Python code to Scala.
There are examples provided here: https://docs.databricks.com/delta/delta-update.html#language-scala, however I would like to be able to keep the structure of the Python code above.
Based on your comment, loanUpdates is a string, but it needs to be a DataFrame. You can load a CSV into Spark using:
val loanUpdatesDf = spark.read.csv(loanUpdates)
You will probably need to use further options to read the CSV correctly.
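For the merge itself, a rough, untested Scala translation of the Python code above might look like this. It assumes loanUpdatesDf is the DataFrame loaded from the CSV and deltaPath is the same path as delta_path in the Python snippet.

// Untested sketch: Scala equivalent of the Python merge above.
import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, deltaPath)

deltaTable.as("t")
  .merge(loanUpdatesDf.as("s"), "t.loan_id = s.loan_id")
  .whenMatched.updateAll()
  .whenNotMatched.insertAll()
  .execute()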

How to make empty values available in a DataFrame while writing to Kafka in Spark Streaming

I am having an issue writing my Spark Streaming DataFrame to Kafka. I am writing the DataFrame as a JSON structure, in the following way:
val df = df_agg.select($"Country", $"plan", $"value")
df.selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("topic", "topicname")
  .option("kafka.bootstrap.servers", "ddd.dl.uk.ddd.com:8002")
  .option("sasl.kerberos.service.name", "kafka")
  .option("checkpointLocation", "/user/dddff/ddd/")
  .option("kafka.security.protocol", "SASL_PLAINTEXT")
  .option("Partitioner.class", "DefaultPartitioner")
  .start()
  .awaitTermination()
The issue is that whenever the value of the "country" column is empty, the field is not written at all. For example, I am getting a DataFrame df like:
US,postpaid,300
CAN,prepaid,30
,postpaid,400
My output on Kafka is:
{"country":"US","plan":postpaid,"value":300}
{"country":"CAN","plan":0.0,"value":30}
{"plan":postpaid,"value":400}
But my expected output is
{"country":"US","plan":postpaid,"value":300}
{"country":"CAN","plan":0.0,"value":30}
{"country":"","plan":postpaid,"value":400}
How can I achieve this? Please help.
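A minimal, untested sketch of one common workaround: to_json drops fields whose value is null, so if the empty country arrives as null, filling the string columns with an empty string before serialising should keep the key in the JSON. The options below only mirror the settings from the question.

// Untested sketch: replace nulls in the string columns with "" before
// building the JSON payload (assumes the empty country is actually null).
import spark.implicits._

val dfFilled = df_agg
  .select($"Country", $"plan", $"value")
  .na.fill("", Seq("Country", "plan"))   // fill only applies to string columns

dfFilled.selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("topic", "topicname")
  .option("kafka.bootstrap.servers", "ddd.dl.uk.ddd.com:8002")
  .option("checkpointLocation", "/user/dddff/ddd/")
  .start()
  .awaitTermination()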

How to create a Spark DataFrame after reading data directly from a Kafka queue

The data from the Kafka queue is a line-delimited JSON string like below:
{"header":{"platform":"atm","msgtype":"1","version":"1.0"},"details":[{"bcc":"5814","dsrc":"A","aid":"5678"},{"bcc":"5814","dsrc":"A","mid":"0003"},{"bcc":"5812","dsrc":"A","mid":"0006"}]}
{"header":{"platform":"atm","msgtype":"1","version":"1.0"},"details":[{"bcc":"5814","dsrc":"A","aid":"1234"},{"bcc":"5814","dsrc":"A","mid":"0004"},{"bcc":"5812","dsrc":"A","mid":"0009"}]}
{"header":{"platform":"atm","msgtype":"1","version":"1.0"},"details":[{"bcc":"5814","dsrc":"A","aid":"1234"},{"bcc":"5814","dsrc":"A","mid":"0004"},{"bcc":"5812","dsrc":"A","mid":"0009"}]}
How can we create a DataFrame in Python for the above input? I have many columns to access; the above is only a sample, and the data would have 23 columns in total. Any help on this would be greatly appreciated.
You're looking for pyspark.sql.SQLContext.jsonRDD. Since Spark Streaming is micro-batched, your stream object will return a series of RDDs, each of which can be made into a DataFrame via jsonRDD.

Spark Parse Text File to DataFrame

Currently, I can parse a text file to a Spark DataFrame by way of the RDD API with the following code:
def row_parse_function(raw_string_input):
    # Do parse logic...
    return pyspark.sql.Row(...)
raw_rdd = spark_context.textFile(full_source_path)
# Convert RDD of strings to RDD of pyspark.sql.Row
row_rdd = raw_rdd.map(row_parse_function).filter(bool)
# Convert RDD of pyspark.sql.Row to Spark DataFrame.
data_frame = spark_sql_context.createDataFrame(row_rdd, schema)
Is this current approach ideal?
Or is there a better way to do this without using the older RDD API?
FYI, Spark 2.0.
Clay,
This is a good approach for loading a file that does not have a specific format such as CSV, JSON, ORC, or Parquet, and does not come from a database.
If you have any kind of specific logic to apply to the data, this is the best way to do it. The RDD API is meant for exactly this kind of situation, when you need to run non-trivial custom logic on your data.
You can read here about the use cases of Spark's different APIs, and yours is a situation where the RDD approach is the best fit.
