How to parse streaming XML into a DataFrame? - apache-spark

I'm consuming XML files from a Kafka topic. Can anyone tell me how to parse the XML into a DataFrame?
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  //.option("rowTag", "ns:header")
  //.options(Map("rowTag" -> "ntfyTrns:payloadHeader", "rowTag" -> "ns:header"))
  .option("rowTag", "ntfyTrnsDt:notifyTransactionDetailsReq")
  .load("/home/ubuntu/SourceXML.xml")

df.show
df.printSchema()
df.select(col("ns:header.ns:captureSystem")).show()
I am able to extract the information from the XML file. What I don't know is how to pass, convert, or load the RDD[String] coming from the Kafka topic into the SQL read API.
Thanks!

I am facing the same situation. Doing some research, I found that some people are using the following method to convert the RDD to a DataFrame, as shown here:
import com.databricks.spark.xml.XmlReader

val wrapped = rdd.map(xml => s"""<a>$xml</a>""")
val df = new XmlReader().xmlRdd(sqlContext, wrapped)
You just have to obtain the RDD from the DStream; I am doing this using PySpark:
streamElement = ssc.textFileStream("s3n://your_path")
streamElement.foreachRDD(process)
where the process method has the following structure, so you can do whatever you need with each RDD:
def process(time, rdd):
    # work with the RDD for this batch here and return whatever you need
    return value
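If the XML strings come from Kafka rather than textFileStream, the same idea applies: inside foreachRDD each batch is a plain RDD[String] that can be handed to XmlReader. Below is a minimal Scala sketch, assuming the spark-streaming-kafka-0-10 integration and spark-xml are on the classpath and that ssc and sqlContext already exist; the broker address, group id, and topic name are placeholders.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import com.databricks.spark.xml.XmlReader

// placeholder Kafka settings
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "xml-consumer")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("xml-topic"), kafkaParams))

stream.map(_.value).foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    // each batch is an RDD[String]; parse it with spark-xml's RDD API
    val df = new XmlReader()
      .withRowTag("ntfyTrnsDt:notifyTransactionDetailsReq")
      .xmlRdd(sqlContext, rdd)
    df.show()
  }
}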

Related

How to store data from a dataframe in a variable to use as a parameter in a select in cassandra?

I have a Spark Structured Streaming application. The application receives data from Kafka and should use these values as parameters to query data from a Cassandra database. My question is: how do I use the data in the input DataFrame (from Kafka) as "where" parameters in a Cassandra "select" without hitting the error below?
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
This is my df input:
val df = spark
  .readStream
  .format("kafka")
  .options(
    Map("kafka.bootstrap.servers" -> kafka_bootstrap,
      "subscribe" -> kafka_topic,
      "startingOffsets" -> "latest",
      "fetchOffset.numRetries" -> "5",
      "kafka.group.id" -> groupId
    ))
  .load()
I get this error whenever I try to store the DataFrame values in a variable to use as a parameter.
This is the method I created to try to convert the data into variables; with it, Spark gives the error I mentioned earlier:
def processData(messageToProcess: DataFrame): DataFrame = {
  val messageDS: Dataset[Message] = messageToProcess.as[Message]
  val listData: Array[Message] = messageDS.collect()
  listData.foreach(x => println(x.country))
  val mensagem = messageToProcess
  mensagem
}
When you need to use data in Kafka to query data in Cassandra, such an operation is a typical join between two datasets: you don't need to call .collect to find the entries, you just do the join. Enriching data in Kafka with data from an external dataset is a quite common pattern, and Cassandra provides low-latency lookups.
Your code could look like the following (you will need to configure the so-called DirectJoin; see the link below):
import spark.implicits._
import org.apache.spark.sql.cassandra._

val df = spark.readStream.format("kafka")
  .options(Map(...)).load()
// ... decode the data in Kafka into columns

val cassdata = spark.read.cassandraFormat("table", "keyspace").load

val joined = df.join(cassdata, cassdata("pk") === df("some_column"))

val processed = ... // process the joined data

val query = processed.writeStream
  // ... output the data somewhere
  .start()
query.awaitTermination()
I have a detailed blog post on how to perform efficient joins with data in Cassandra.
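As a hedged sketch of the session setup (not part of the original answer): with Spark Cassandra Connector 2.5+ the DirectJoin optimization is enabled by registering the connector's Catalyst extensions on the session; the host below is a placeholder.

// assumption: Spark Cassandra Connector 2.5+ is on the classpath
val spark = SparkSession.builder()
  .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  .config("spark.cassandra.connection.host", "cassandra-host") // placeholder host
  .getOrCreate()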
As the error message suggests, you have to use writeStream.start() in order to execute a Structured Streaming query.
You can't use the same actions you use on batch DataFrames (like .collect(), .show() or .count()) on streaming DataFrames; see the Unsupported Operations section of the Spark Structured Streaming documentation.
In your case, you are trying to use messageDS.collect() on a streaming Dataset, which is not allowed. To achieve this goal you can use a foreachBatch output sink to collect the rows you need in each micro-batch:
streamingDF.writeStream.foreachBatch { (microBatchDf: DataFrame, batchId: Long) =>
  // microBatchDf is no longer a streaming DataFrame
  // (you can check with microBatchDf.isStreaming)
  val messageDS: Dataset[Message] = microBatchDf.as[Message]
  val listData: Array[Message] = messageDS.collect()
  listData.foreach(x => println(x.country))
  // ...
}.start()

RDD String to Spark csv Reader

I want to read an RDD[String] using the Spark CSV reader. The reason I am doing this is that I need to filter some records before using the CSV reader.
val fileRDD: RDD[String] = spark.sparkContext.textFile("file")
I need to read the fileRDD using the Spark CSV reader. I would rather not write the filtered data back to a file, as that increases HDFS I/O. I have looked into the options available in Spark CSV but didn't find any.
spark.read.csv(file)
Sample Data
PHM|MERC|PHARMA|BLUEDRUG|50
CLM|BSH|CLAIM|VISIT|HSA|EMPLOYER|PAID|250
PHM|GSK|PHARMA|PARAC|70
CLM|UHC|CLAIM|VISIT|HSA|PERSONAL|PAID|72
As you can see, the records starting with PHM have one number of columns and the CLM records have a different number of columns. That is the reason I am filtering first and then applying a schema: PHM and CLM records have different schemas.
val fileRDD: RDD[String] = spark.sparkContext.textFile("file").filter(_.startsWith("PHM"))
spark.read.schema(phcSchema).csv(fileRDD.toDS())
Since Spark 2.2, the .csv method can read a Dataset of strings. It can be implemented this way:
import spark.implicits._

val rdd: RDD[String] = spark.sparkContext.textFile("csv.txt")
// ... do the filtering
spark.read.csv(rdd.toDS())
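Putting the two pieces together for the sample data above, a minimal sketch could look like the following. The column names and types are assumptions, since the real PHM/CLM schemas are not given in the question.

import org.apache.spark.sql.types._
import spark.implicits._

// assumed schema for the pipe-delimited PHM records shown above
val phmSchema = StructType(Seq(
  StructField("recordType", StringType),
  StructField("manufacturer", StringType),
  StructField("category", StringType),
  StructField("drug", StringType),
  StructField("amount", IntegerType)))

val phmDS = spark.sparkContext.textFile("file")
  .filter(_.startsWith("PHM"))
  .toDS()

val phmDF = spark.read
  .schema(phmSchema)
  .option("delimiter", "|")
  .csv(phmDS)

// the CLM records would get the same treatment with their own schema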

writing corrupt data from kafka / json datasource in spark structured streaming

In Spark batch jobs I usually have a JSON data source written to a file and can use the corrupt-column features of the DataFrame reader to write the corrupt data out to a separate location, and another writer to write the valid data, both from the same job. (The data is written as Parquet.)
But in Spark Structured Streaming I'm first reading the stream in via Kafka as a string and then using from_json to get my DataFrame. from_json uses JsonToStructs, which uses FailFast mode in the parser and does not return the unparsed string as a column in the DataFrame (see the note in the ref). How can I then write corrupt data that doesn't match my schema, and possibly invalid JSON, to another location using Structured Streaming?
Finally, in the batch case the same job can write both DataFrames, but Spark Structured Streaming requires special handling for multiple sinks. So for Spark 2.3.1 (my current version), how should the corrupt and invalid streams be written out properly?
Ref: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Expression-JsonToStructs.html
val rawKafkaDataFrame = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", config.broker)
  .option("kafka.ssl.truststore.location", path.toString)
  .option("kafka.ssl.truststore.password", config.pass)
  .option("kafka.ssl.truststore.type", "JKS")
  .option("kafka.security.protocol", "SSL")
  .option("subscribe", config.topic)
  .option("startingOffsets", "earliest")
  .load()
val jsonDataFrame = rawKafkaDataFrame.select(col("value").cast("string"))
// does not provide a corrupt column or way to work with corrupt
jsonDataFrame.select(from_json(col("value"), schema)).select("jsontostructs(value).*")
When you convert from string to JSON, if the value cannot be parsed with the schema provided, from_json returns null. You can filter on the null values and select the original string, something like this:
val jsonDF = jsonDataFrame.withColumn("json", from_json(col("value"), schema))
val invalidJsonDF = jsonDF.filter(col("json").isNull).select("value")
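The valid rows can then be taken from the complement of that filter; a small sketch along the same lines, expanding the parsed struct's fields with json.*:

val validJsonDF = jsonDF.filter(col("json").isNotNull).select("json.*")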
I was just trying to figure out the _corrupt_record equivalent for structured streaming as well. Here's what I came up with; hopefully it gets you closer to what you're looking for:
// add a status column to partition our output by
// optional: only keep the unparsed json if it was corrupt
// writes up to 2 subdirs: 'out.par/status=OK' and 'out.par/status=CORRUPT'
// additional status codes for validation of nested fields could be added in similar fashion
df.withColumn("struct", from_json($"value", schema))
  .withColumn("status", when($"struct".isNull, lit("CORRUPT")).otherwise(lit("OK")))
  .withColumn("value", when($"status" <=> lit("CORRUPT"), $"value"))
  .write
  .partitionBy("status")
  .parquet("out.par")
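For a streaming DataFrame the same transformation would go through writeStream rather than write; a hedged sketch of that variant (the output path and checkpoint location are placeholders):

df.withColumn("struct", from_json($"value", schema))
  .withColumn("status", when($"struct".isNull, lit("CORRUPT")).otherwise(lit("OK")))
  .withColumn("value", when($"status" <=> lit("CORRUPT"), $"value"))
  .writeStream
  .partitionBy("status")
  .format("parquet")
  .option("path", "out.par")
  .option("checkpointLocation", "out.par.checkpoint")
  .start()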

Reading excel files in a streaming fashion in spark 2.0.0

I have a set of Excel-format files which need to be read from Spark (2.0.0) as and when an Excel file lands in a local directory. The Scala version used here is 2.11.8.
I've tried using the readStream method of SparkSession, but I'm not able to read in a streaming way. I am able to read Excel files statically as:
val df = spark.read.format("com.crealytics.spark.excel").option("sheetName", "Data").option("useHeader", "true").load("Sample.xlsx")
Is there any other way of reading Excel files in a streaming way from a local directory?
Any answers would be helpful.
Thanks
Changes done:
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///D:/pooja")
  .appName("Spark SQL Example")
  .getOrCreate()
spark.conf.set("spark.sql.streaming.schemaInference", true)
import spark.implicits._

val dataFrame = spark.readStream.format("csv")
  .option("inferSchema", true)
  .option("header", true)
  .load("file:///D:/pooja/sample.csv")

dataFrame.writeStream.format("console").start()
dataFrame.show()
Updated code:
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("Spark SQL Example")
  .getOrCreate()
spark.conf.set("spark.sql.streaming.schemaInference", true)
import spark.implicits._

val df = spark.readStream.format("com.crealytics.spark.excel")
  .option("header", true)
  .load("file:///filepath/*.xlsx")

df.writeStream.format("memory").queryName("tab").start().awaitTermination()

val res = spark.sql("select * from tab")
res.show()
Error:
Exception in thread "main" java.lang.UnsupportedOperationException: Data source com.crealytics.spark.excel does not support streamed reading
Can anyone help me resolve this issue?
For a streaming DataFrame you have to provide a schema, and currently DataStreamReader does not support option("inferSchema", true|false). You can set the SQLConf setting spark.sql.streaming.schemaInference, which needs to be set at the session level.
You can refer here
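As the error shows, the com.crealytics.spark.excel source does not support streamed reading, so for a streaming file source you would either enable spark.sql.streaming.schemaInference or pass an explicit schema. A minimal sketch with an assumed two-column schema and a placeholder directory for a CSV-based stream:

import org.apache.spark.sql.types._

// assumed schema; replace with the real columns of your files
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("value", DoubleType)))

val streamDf = spark.readStream
  .schema(schema)
  .option("header", true)
  .csv("file:///D:/pooja/incoming") // directory watched for new files

streamDf.writeStream.format("console").start().awaitTermination()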

How to check if a DataFrame was already cached/persisted before?

For Spark's RDD object this is quite trivial, as it exposes a getStorageLevel method, but a DataFrame does not seem to expose anything similar. Anyone?
You can check whether a DataFrame is cached or not using the Catalog (org.apache.spark.sql.catalog.Catalog), which was introduced in Spark 2.
Code example:
val sparkSession = SparkSession.builder
  .master("local")
  .appName("example")
  .getOrCreate()

val df = sparkSession.read.csv("src/main/resources/sales.csv")
df.createTempView("sales")

// interacting with the catalog
val catalog = sparkSession.catalog

// print the databases
catalog.listDatabases().select("name").show()

// print all the tables
catalog.listTables().select("name").show()

// is the table cached?
println(catalog.isCached("sales"))
df.cache()
println(catalog.isCached("sales"))
Using the above code you can list all the tables and check whether a table is cached or not.
You can check the working code example here
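As a side note (not from the original answer): since Spark 2.1 a Dataset/DataFrame also exposes a storageLevel method, so the RDD-style check works directly. A quick sketch, continuing the df from the example above:

import org.apache.spark.storage.StorageLevel

println(df.storageLevel == StorageLevel.NONE) // true if the DataFrame is not persisted
df.cache()
println(df.storageLevel) // e.g. StorageLevel(disk, memory, deserialized, 1 replicas)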

Resources