Spark Streaming Read Multiple Files with Dynamic Schema - apache-spark

I have a Spark Streaming application that reads multiple paths in a bucket.
Every path contains CSV files with a specific schema. How can I set the schema according to the path Spark is reading?
Example:
# bucket structure: bucket_name/table_1/year/month/day/file.csv
from pyspark.sql.functions import input_file_name, current_timestamp, row_number, lit
from pyspark.sql.window import Window

schemas = {"table_1": "id INT, name STRING, status STRING", "table_2": "col1 STRING, col2 STRING"}

df_changes = spark.readStream \
    .format("csv") \
    .option("delimiter", "|") \
    .option("header", True) \
    .option("multiLine", True) \
    .option("ignoreLeadingWhiteSpace", True) \
    .option("ignoreTrailingWhiteSpace", True) \
    .option("escape", "\"") \
    .load("s3a://bucket_name/*/*/*/*/*/*.csv") \
    .withColumn("file_path", input_file_name()) \
    .withColumn("raw_timestamp", current_timestamp())

def append_data(df, batchId):
    # derive the target path from the source file path
    file_path = df.first()['file_path']
    lista = file_path.split('/')[:-4]
    full_path = '/'.join(lista).replace('landing', 'raw') + '/cdc'
    windowSpec = Window.partitionBy().orderBy(lit(None))
    df.withColumn("row_number", row_number().over(windowSpec)) \
      .write.format("delta").mode('append').save(full_path)

df_changes.writeStream \
    .foreachBatch(append_data) \
    .option("checkpointLocation", "/checkpoint/") \
    .start()
Is it possible to set the schema dynamically during the Spark load?
Something like: .load(f"s3a://bucket_name/*/*/*/*/*/*.csv", schema=schemas[path-from-spark-split])
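For illustration, one possible workaround (a sketch only, reusing the schemas dict and the append_data function above and assuming the bucket layout from the comment; the per-table checkpoint locations are assumptions) is to start one streaming query per table, so each reader can be given its matching schema up front:
from pyspark.sql.functions import input_file_name, current_timestamp

queries = []
for table, ddl in schemas.items():
    df = (spark.readStream
          .format("csv")
          .option("delimiter", "|")
          .option("header", True)
          .schema(ddl)                                     # schema picked per table
          .load(f"s3a://bucket_name/{table}/*/*/*.csv")    # glob only this table's folders
          .withColumn("file_path", input_file_name())
          .withColumn("raw_timestamp", current_timestamp()))
    queries.append(df.writeStream
                     .foreachBatch(append_data)
                     .option("checkpointLocation", f"/checkpoint/{table}/")  # assumed layout
                     .start())
Each query keeps its own checkpoint location, so the per-table state stays separate.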

Related

PySpark dataframe not getting saved in Hive due to file format mismatch

I want to write streaming data from a Kafka topic to a Hive table.
I am able to create DataFrames by reading the Kafka topic, but the data is not getting written to the Hive table due to a file-format mismatch. I have specified dataframe.format("parquet"), and the Hive table is created with STORED AS PARQUET.
Below is the code snippet:
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

def hive_write_batch_data(data, batchId):
    data.write.format("parquet").mode("append").saveAsTable(table)

def write_to_hive(data, kafka_sink_name):
    global table
    table = kafka_sink_name
    data.select(col("key"), col("value"), col("offset")) \
        .writeStream.foreachBatch(hive_write_batch_data) \
        .start().awaitTermination()

if __name__ == '__main__':
    kafka_sink_name = sys.argv[1]
    kafka_config = {
        # ....
    }
    spark = SparkSession.builder.appName("Test Streaming").enableHiveSupport().getOrCreate()
    df = spark.readStream \
        .format("kafka") \
        .options(**kafka_config) \
        .load()
    df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "offset", "timestamp", "partition")
    write_to_hive(df1, kafka_sink_name)
Hive table is created as Parquet:
CREATE TABLE test.kafka_test(
key string,
value string,
offset bigint)
STORED AS PARQUET;
It is giving me the Error:
pyspark.sql.utils.AnalysisException: "The format of the existing table test.kafka_test is `HiveFileFormat`. It doesn\'t match the specified format `ParquetFileFormat`.;"
How do I write the DataFrame to the Hive table?
I dropped the Hive table and re-ran the Spark streaming job; the table then got created with the correct format.
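A minimal sketch of that fix, assuming the test.kafka_test table from the DDL above and a made-up checkpoint path: drop the Hive-created table once, then let saveAsTable() create it on the first micro-batch so the table format matches what Spark writes.
# Sketch only: remove the HiveFileFormat table once, then let Spark (re)create it
# with its native Parquet format the first time the batch function runs.
spark.sql("DROP TABLE IF EXISTS test.kafka_test")

def hive_write_batch_data(data, batch_id):
    # saveAsTable() creates the table with Spark's ParquetFileFormat if it does not exist
    data.write.format("parquet").mode("append").saveAsTable("test.kafka_test")

df1.select("key", "value", "offset") \
    .writeStream \
    .foreachBatch(hive_write_batch_data) \
    .option("checkpointLocation", "/tmp/checkpoints/kafka_test") \
    .start() \
    .awaitTermination()
Alternatively, creating the table up front with CREATE TABLE ... USING PARQUET (a Spark datasource table) instead of STORED AS PARQUET avoids the HiveFileFormat/ParquetFileFormat mismatch in the first place.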

Spark job reading the sorted AVRO files in dataframe, but writing to kafka without order

I have Avro files partitioned by ID; each ID has a folder such as "ID=234", and the data inside each folder is in Avro format and sorted by date.
I am running a Spark job that takes an input path and reads the Avro files into a DataFrame. This DataFrame is then written to a Kafka topic with 5 partitions.
val properties: Properties = getProperties(args)
val spark = SparkSession.builder().master(properties.getProperty("master"))
.appName(properties.getProperty("appName")).getOrCreate()
val sqlContext = spark.sqlContext
val sourcePath = properties.getProperty("sourcePath")
val dataDF = sqlContext.read.avro(sourcePath).as("data")
val count = dataDF.count();
val schemaRegAdd = properties.getProperty("schemaRegistry")
val schemaRegistryConfs = Map(
SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> schemaRegAdd,
SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY -> SchemaManager.SchemaStorageNamingStrategies.TOPIC_NAME
)
val start = Instant.now
dataDF.select(functions.struct(properties.getProperty("message.key.name")).alias("key"), functions.struct("*").alias("value"))
.toConfluentAvroWithPlainKey(properties.getProperty("topic"), properties.getProperty("schemaName"),
properties.getProperty("schemaNamespace"))(schemaRegistryConfs)
.write.format("kafka")
.option("kafka.bootstrap.servers",properties.getProperty("kafka.brokers"))
.option("topic",properties.getProperty("topic")).save()
}
My use case is to write all messages from each ID (sorted by date) sequentially, i.e. all sorted data for ID 1 should be added first, then for ID 2, and so on. The Kafka message key is the ID.
Don't forget that the data inside an RDD/Dataset is shuffled when you apply transformations, so you lose the order.
The best way to achieve this is to read the files one by one and send them to Kafka, instead of reading the full directory given by your val sourcePath = properties.getProperty("sourcePath").
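For illustration, a minimal PySpark sketch of that per-folder idea (the original job is Scala and publishes Confluent Avro through a schema registry; the folder list, the date column name, and the plain JSON value encoding here are assumptions):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, struct, to_json

spark = SparkSession.builder.appName("ordered-kafka-write").getOrCreate()

# Hypothetical, pre-sorted list of per-ID folders; reading Avro requires the spark-avro package.
id_folders = ["s3a://bucket/data/ID=1", "s3a://bucket/data/ID=2"]

for folder in id_folders:
    (spark.read.format("avro")
        .option("basePath", "s3a://bucket/data")     # keep the ID partition column
        .load(folder)                                # one ID folder at a time
        .orderBy("date")                             # keep the per-ID date order
        .select(col("ID").cast("string").alias("key"),
                to_json(struct("*")).alias("value"))
        .coalesce(1)                                 # a single writer task preserves row order
        .write.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("topic", "my_topic")
        .save())
Since every record of a given ID shares the same Kafka key, it hashes to the same partition, and writing from a single sorted task keeps the per-ID date order within that partition.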

Call a function with each element of a stream in Databricks

I have a DataFrame stream in Databricks, and I want to perform an action on each element. On the net I found specific purpose methods, like writing it to the console or dumping into memory, but I want to add some business logic, and put some results into Redis.
To be more specific, this is how it would look like in non-stream case:
val someDataFrame = Seq(
("key1", "value1"),
("key2", "value2"),
("key3", "value3"),
("key4", "value4")
).toDF()
def someFunction(keyValuePair: (String, String)) = {
println(keyValuePair)
}
someDataFrame.collect.foreach(r => someFunction((r(0).toString, r(1).toString)))
But if the someDataFrame is not a simple data frame but a stream data frame (indeed coming from Kafka), the error message is this:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Could anyone please help me solve this problem?
Some important notes:
I've read the relevant documentation, like Spark Streaming or Databricks Streaming and a few other descriptions as well.
I know that there must be something like start() and awaitTermination, but I don't know the exact syntax. The descriptions did not help.
It would take pages to list all the possibilities I have tried, so I'd rather not provide them.
I do not want to solve the specific problem of displaying the result, i.e. please do not provide a solution for that specific case. The someFunction would look like this:
val someData = readSomeExternalData()
if (condition containing keyValuePair and someData) {
doSomething(keyValuePair);
}
(Question What is the purpose of ForeachWriter in Spark Structured Streaming? does not provide a working example, therefore does not answer my question.)
Here is an example that uses foreachBatch to save every item to Redis with the streaming API.
It is related to a previous question (DataFrame to RDD[(String, String)] conversion).
// import spark and spark-redis
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.streaming._
import org.apache.spark.sql.types._
import com.redislabs.provider.redis._

// schema of the csv files
val userSchema = new StructType()
  .add("name", "string")
  .add("age", "string")

// create a data stream reader from a dir with csv files
val csvDF = spark
  .readStream
  .format("csv")
  .option("sep", ";")
  .schema(userSchema)
  .load("./data") // directory where the CSV files are

// redis
val redisConfig = new RedisConfig(new RedisEndpoint("localhost", 6379))
implicit val readWriteConfig: ReadWriteConfig = ReadWriteConfig.Default

csvDF.map(r => (r.getString(0), r.getString(1))) // convert the Dataset to a Dataset[(String, String)]
  .writeStream // create a data stream writer
  .foreachBatch((df, _) => sc.toRedisKV(df.rdd)(redisConfig)) // save each batch to redis after converting it to an RDD
  .start // start processing
Call a simple user-defined function from foreachBatch in Spark Streaming.
Please try this; it will print 'hello world' for every micro-batch received from the TCP socket.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Batch function called once per micro-batch; write the batch to storage here if needed
def process_row(df, epoch_id):
    print('hello world')

query = words.writeStream.foreachBatch(process_row).start()
# query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()

Spark structured streaming - update data frame's schema on the fly

I have a simple structured streaming job which monitors a directory for CSV files and writes parquet files - no transformation in between.
The job starts by building a data frame from reading CSV files using readStream(), with a schema which I get from calling a function called buildSchema(). Here is the code:
var df = spark
.readStream
.option("sep", "|")
.option("header","true")
.schema(buildSchema(spark, table_name).get) // buildSchema() gets schema for me
.csv(input_base_dir + table_name + "*")
logger.info(" new batch indicator")
if (df.schema != buildSchema(spark, table_name).get) {
df = spark.sqlContext.createDataFrame(df.collectAsList(), buildSchema(spark, table_name).get)
}
val query =
df.writeStream
.format("parquet")
.queryName("convertCSVtoPqrquet for table " + table_name)
.option("path", output_base_dir + table_name + "/")
.trigger(ProcessingTime(60.seconds))
.start()
The job runs fine, but my question is, I'd like to always use the latest schema to build my data frame, or in other words, to read from the CSV files. While buildSchema() can get me the latest schema, I'm not sure how to call it periodically (or once per CSV file), and then use the latest schema to somehow re-generate or modify the data frame.
When testing, my observation is that only the query object runs continuously, batch after batch; the log statement that I put in, and the if() statement for the schema comparison, only executed once at the beginning of the application.
Can data frame schema in structured streaming job be modified after query.start() is called? What would you suggest as a good workaround if we cannot change the schema of a data frame?
Thanks in advance.
You can use the foreachBatch method to load the latest schema periodically, then compare it to the schema of the concrete micro-batch DataFrame.
Example:
var streamingDF = spark
.readStream
.option("sep", "|")
.option("header", "true")
.schema(buildSchema(spark, table_name).get) // buildSchema() gets schema for me
.csv(input_base_dir + table_name + "*")
val query =
streamingDF
.writeStream
.foreachBatch((ds, i) => {
logger.info(s"New batch indicator(${i})")
val batchDf =
if (ds.schema != buildSchema(spark, table_name).get) {
spark.sqlContext.createDataFrame(ds.collectAsList(), buildSchema(spark, table_name).get)
} else {
ds
}
batchDf.write.parquet(output_base_dir + table_name + "/")
})
.trigger(ProcessingTime(60.seconds))
.start()

How to read streaming data in XML format from Kafka?

I am trying to read XML data from Kafka topic using Spark Structured streaming.
I tried using the Databricks spark-xml package, but I got an error saying that this package does not support streamed reading. Is there any way I can extract XML data from Kafka topic using structured streaming?
My current code:
df = spark \
.readStream \
.format("kafka") \
.format('com.databricks.spark.xml') \
.options(rowTag="MainElement")\
.option("kafka.bootstrap.servers", "localhost:9092") \
.option(subscribeType, "test") \
.load()
The error:
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.UnsupportedOperationException: Data source com.databricks.spark.xml does not support streamed reading
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:234)
.format("kafka") \
.format('com.databricks.spark.xml') \
The last one, with com.databricks.spark.xml, wins and becomes the streaming source (hiding Kafka as the source).
In other words, the above is equivalent to .format('com.databricks.spark.xml') alone.
As you may have experienced, the Databricks spark-xml package does not support streaming reading (i.e. cannot act as a streaming source). The package is not for streaming.
Is there any way I can extract XML data from Kafka topic using structured streaming?
You are left with accessing and processing the XML yourself with a standard function or a UDF. There's no built-in support for streaming XML processing in Structured Streaming up to Spark 2.2.0.
That should not be a big deal anyway. The Scala code could look as follows:
val input = spark.
readStream.
format("kafka").
...
load
val values = input.select('value cast "string")
val extractValuesFromXML = udf { (xml: String) => ??? }
val numbersFromXML = values.withColumn("number", extractValuesFromXML('value))
// print XMLs and numbers to the stdout
val q = numbersFromXML.
writeStream.
format("console").
start
Another possible solution could be to write your own custom streaming Source that would deal with the XML format in def getBatch(start: Option[Offset], end: Offset): DataFrame. That is supposed to work.
import xml.etree.ElementTree as ET
from pyspark.sql.functions import udf, col, split

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option(subscribeType, "test") \
    .load()
Then I wrote a Python UDF:
def parse(s):
    xml = ET.fromstring(s)
    ns = {'real_person': 'http://people.example.com',
          'role': 'http://characters.example.com'}
    actor_el = xml.find('real_person:actor', ns)
    if actor_el is not None:
        actor = actor_el.text
    role_el = xml.find('real_person:role', ns)
    if role_el is not None:
        role = role_el.text
    return actor + "|" + role
Register this UDF:
extractValuesFromXML = udf(parse)
XML_DF = df.withColumn("mergedCol", extractValuesFromXML("value"))
AllCol_DF = XML_DF.withColumn("actorName", split(col("mergedCol"), "\\|").getItem(0)) \
    .withColumn("Role", split(col("mergedCol"), "\\|").getItem(1))
You cannot mix formats this way. The Kafka source is loaded as Rows with a number of fields, like key, value and topic, with the value column storing the payload as a binary type:
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
...
value.deserializer: Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.
Parsing this content is the user's responsibility and cannot be delegated to other data sources. See for example my answer to How to read records in JSON format from Kafka using Structured Streaming?.
For XML you'll likely need a UDF (UserDefinedFunction), although you can try the Hive XPath functions first. You should also decode the binary data.
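For example, a short PySpark sketch of that advice, assuming df is the stream read with format("kafka") alone and using made-up XPath expressions:
# Decode the binary payload to a string, then try the built-in Hive XPath functions.
parsed = (df
    .selectExpr("CAST(value AS STRING) AS xml")
    .selectExpr(
        "xpath_string(xml, '/Book/Author/text()') AS author",   # hypothetical element paths
        "xpath_string(xml, '/Book/Title/text()') AS title"))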
It looks like the above approach works, but it is not using the passed schema to parse the XML document.
If you print the relation schema, it is always:
INFO XmlToAvroConverter - .convert() : XmlRelation Schema ={} root
|-- fields: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- nullable: boolean (nullable = true)
| | |-- type: string (nullable = true)
|-- type: string (nullable = true)
For example, I am streaming the following XML documents from a Kafka topic:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Book>
<Author>John Doe</Author>
<Title>Test</Title>
<PublishedDate></PublishedDate>
</Book>
And here is the code I have to parse the XML into a DataFrame:
val kafkaValueAsStringDF = kafakDF.selectExpr("CAST(key AS STRING) msgKey", "CAST(value AS STRING) xmlString")

var parameters = collection.mutable.Map.empty[String, String]
parameters.put("rowTag", "Book")

kafkaValueAsStringDF.writeStream.foreachBatch {
  (batchDF: DataFrame, batchId: Long) =>
    val xmlStringDF: DataFrame = batchDF.selectExpr("xmlString")
    xmlStringDF.printSchema()
    val rdd: RDD[String] = xmlStringDF.as[String].rdd
    val relation = XmlRelation(
      () => rdd,
      None,
      parameters.toMap,
      xmlSchema)(spark.sqlContext)
    logger.info(".convert() : XmlRelation Schema ={} " + relation.schema.treeString)
  }
  .start()
  .awaitTermination()
When I read the same XML documents from the file system or S3 and use spark-xml, it parses the schema as expected.
Thanks,
Sateesh
You can use the SQL built-in functions xpath and the like to extract data from a nested XML structure that comes as the value of a Kafka message.
Given a nested XML like
<CofiResults>
  <ExecutionTime>20201103153839</ExecutionTime>
  <FilterClass>S</FilterClass>
  <InputData>
    <Finance>
      <HeaderSegment>
        <Version>6</Version>
        <SequenceNb>1</SequenceNb>
      </HeaderSegment>
    </Finance>
  </InputData>
</CofiResults>
you can then just use those SQL functions in your selectExpr statement as below:
spark.readStream.format("kafka").options(...).load()
.selectExpr("CAST(value AS STRING) as value")
.selectExpr(
"xpath(value, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsArryString",
"xpath_long(value, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsLong",
"xpath_string(value, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsString",
"xpath_int(value, '/CofiResults/InputData/Finance/HeaderSegment/Version/text()') as VersionAsInt")
Remember that the xpath function will return an Array of Strings whereas you may find it more convenient to extract the value as String or even Long. Applying the code above in Spark 3.0.1 with a console sink stream will result in:
+-------------------------+-------------------+---------------------+------------+
|ExecutionTimeAsArryString|ExecutionTimeAsLong|ExecutionTimeAsString|VersionAsInt|
+-------------------------+-------------------+---------------------+------------+
|[20201103153839] |20201103153839 |20201103153839 |6 |
+-------------------------+-------------------+---------------------+------------+
