I want to use a custom Kafka partitioner for my PySpark application that reads from Kafka and pushes to another Kafka topic. Data is transformed from source to sink with PySpark processing. I want control over which partition data is pushed to, based on a key in the data/message. I couldn't find any reference or example for such a use case in the Spark Structured Streaming documentation. I also use plain Python processing alongside PySpark, with confluent-kafka-python as the Kafka client there, but it likewise lacks documentation/examples for a custom partitioner.
Is there a solution available to achieve this?
The Spark code below was tried with a partition column, but it does not push data according to that column.
df = spark.range(5)
df = (df
.withColumn("topic", F.lit("test_temp"))
.withColumn("partition", (F.col("id")%2).cast("int"))
.withColumn("key", F.lit("test"))
.withColumn("value", F.lit("test_data"))
).select(["topic", "key", "value", "partition"])
df.printSchema()
(df.write.format("kafka").partitionBy("partition")
.option("kafka.bootstrap.servers", kafka_endpoint)
#.option("topic", "test_temp")
.save())
Output:
+---------+----+---------+---------+
| topic| key| value|partition|
+---------+----+---------+---------+
|test_temp|test|test_data| 0|
|test_temp|test|test_data| 1|
|test_temp|test|test_data| 0|
|test_temp|test|test_data| 1|
|test_temp|test|test_data| 0|
+---------+----+---------+---------+
root
|-- topic: string (nullable = false)
|-- key: string (nullable = false)
|-- value: string (nullable = false)
|-- partition: integer (nullable = true)
Kafka console consumer output:
./kafka-console-consumer --bootstrap-server <broker>:9092 --topic test_temp --partition 1
As written in the Structured Streaming Kafka integration guide, you either add an int column named partition to the dataframe being written, and that controls which partition Spark writes each row to, or you add a JVM partitioner to the classpath. Note that .partitionBy("partition") applies to file-based sinks and has no effect on the Kafka sink; drop it and keep only the partition column.
If a “partition” column is not specified (or its value is null) then the partition is calculated by the Kafka producer. A Kafka partitioner can be specified in Spark by setting the kafka.partitioner.class option. If not present, Kafka default partitioner will be used.
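For illustration, here is a minimal PySpark sketch of both approaches (not part of the original answer); kafka_endpoint stands for your broker list and com.example.MyPartitioner is a hypothetical implementation of org.apache.kafka.clients.producer.Partitioner that is already on the executor classpath:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
kafka_endpoint = "localhost:9092"  # assumption: your broker list

df = (spark.range(5)
      .withColumn("topic", F.lit("test_temp"))
      .withColumn("key", F.lit("test"))
      .withColumn("value", F.lit("test_data"))
      # the int column literally named "partition" is what the Kafka sink honours
      .withColumn("partition", (F.col("id") % 2).cast("int"))
      .select("topic", "key", "value", "partition"))

# Approach 1: let the "partition" column decide -- note there is no partitionBy()
(df.write.format("kafka")
   .option("kafka.bootstrap.servers", kafka_endpoint)
   .save())

# Approach 2: drop the column and delegate to a JVM partitioner on the classpath
(df.drop("partition").write.format("kafka")
   .option("kafka.bootstrap.servers", kafka_endpoint)
   .option("kafka.partitioner.class", "com.example.MyPartitioner")
   .save())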
Regarding the Confluent Python client, see https://github.com/confluentinc/confluent-kafka-python/issues/1107
The kafka-python module, however, does support custom partitioners.
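As a hedged illustration, kafka-python accepts a partitioner callable on the producer; it is called with the serialized key and the partition lists and must return the target partition (the broker address and the even/odd routing rule below are assumptions):
from kafka import KafkaProducer

def even_odd_partitioner(key_bytes, all_partitions, available_partitions):
    # assumption: keys are numeric strings; even keys go to one partition, odd to another
    idx = int(key_bytes.decode("utf-8")) % 2
    return all_partitions[idx % len(all_partitions)]

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumption: local broker
    partitioner=even_odd_partitioner,
)

for i in range(5):
    producer.send("test_temp", key=str(i).encode("utf-8"), value=b"test_data")
producer.flush()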
Related
I want to use the fact that my dataframes are already sorted by the key used for the join.
df1.join(df2, df1.sorted_key == df2.sorted_key)
Both dataframes are large, so broadcast hash join (BHJ) and shuffled hash join (SHJ) are not options (SHJ crashes instead of spilling).
How can I hint to Spark that the joined column is already sorted?
I read on SO that Hive + bucketing + pre-sorting helps. However, I can't see where a dataframe stores its sort status.
df = session.createDataFrame([
('Alice', 1),
('Bob', 2)
])
df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
df = df.sort('_1')
df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
^ Even when I manually sort on column _1, the dataframe doesn't seem to remember that it is sorted by _1.
Also:
How does Spark know the sort status?
Does a Parquet dataset (without Hive metadata) remember which columns are sorted? Does Spark recognize it?
How does Hive + bucketing + pre-sorting help skip the sort?
Can I use Hive + pre-sorting without bucketing to skip the sort?
I saw in the Databricks talk that Spark bucketing has many limitations and is different from Hive bucketing. Is Hive bucketing preferred?
The optimization talk by Databricks says to never use bucketing because it is too hard to maintain in practice. Is that true?
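For reference, here is a minimal sketch of the Hive + bucketing + pre-sorting write-out asked about above, assuming a Hive-enabled SparkSession; the toy dataframes, bucket count and table names are illustrative:
from pyspark.sql import SparkSession

# assumption: Hive support so the bucketing/sorting metadata is kept in the metastore
spark = (SparkSession.builder
         .appName("bucketed-presorted-join")
         .enableHiveSupport()
         .getOrCreate())

# stand-ins for the two large dataframes from the question
df1 = spark.range(1000000).withColumnRenamed("id", "sorted_key")
df2 = spark.range(1000000).withColumnRenamed("id", "sorted_key")

# write both sides bucketed and pre-sorted on the join key, with the same bucket count
(df1.write.bucketBy(16, "sorted_key").sortBy("sorted_key")
     .mode("overwrite").saveAsTable("t1_bucketed"))
(df2.write.bucketBy(16, "sorted_key").sortBy("sorted_key")
     .mode("overwrite").saveAsTable("t2_bucketed"))

# read the tables back: the planner now knows the layout, so the sort-merge join
# can drop the exchange (and often the sort, when each bucket is a single file)
t1 = spark.table("t1_bucketed")
t2 = spark.table("t2_bucketed")
t1.join(t2, "sorted_key").explain()  # inspect the plan for missing Exchange/Sort nodes
Bucketing and sort information lives in the table metadata in the metastore rather than in the dataframe schema, which is why printSchema() never reflects the earlier sort('_1').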
Spark: 3.0.0, Scala: 2.12, Confluent
I have a Spark Structured Streaming job and am looking for an example of writing dataframes to Kafka in Protobuf format.
I read messages from PostgreSQL and, after doing all the transformations, have a dataframe with key and value:
root
|-- key: string (nullable = true)
|-- value: binary (nullable = false)
Code to push message to kafka:
val kafkaOptions = Seq(
KAFKA_BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringSerializer",
ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG -> "io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer",
"schema.registry.url" -> "http://localhost:8081",
"topic" -> "test_users"
)
tDF
.write
.format(KAFKA)
.options(kafkaOptions.toMap)
.save()
A message in binary format is posted, but I am not able to deserialize it because there is no schema in Confluent Schema Registry.
Is there a library that can simplify things for me, or some sample code I can refer to?
I am trying to read a .txt file in Spark 2.4 and load it into a dataframe.
The file data looks like this (under a single manager there are many employees):
Manager_21: Employee_575,Employee_2703,
Manager_11: Employee_454,Employee_158,
Manager_4: Employee_1545,Employee_1312
Code I have written in Scala, Spark 2.4:
val df = spark.read
.format("csv")
.option("header", "true") //first line in file has headers
.option("mode", "DROPMALFORMED")
.load("D:/path/myfile.txt")
df.printSchema()
Unfortunately, when printing the schema, all the employees end up as columns, with the first one folded into the Manager_21 header:
root
|-- Manager_21: Employee_575: string (nullable = true)
|-- Employee_454: string (nullable = true)
|-- Employee_1312: string (nullable = true)
... etc
I am not sure if this is possible in Spark with Scala.
Expected output:
All employees of a manager in the same column. For example, Manager_21 has two employees and both should be in the same column. In other words, I want to see which employees are under a particular manager:
Manager_21 |Manager_11 |Manager_4
Employee_575 |Employee_454 |Employee_1545
Employee_2703|Employee_158|Employee_1312
Is it possible to do this some other way? Please suggest.
Thanks
Try using spark.read.text, then groupBy and pivot to get the desired result.
Example:
val df=spark.read.text("<path>")
df.show(10,false)
//+--------------------------------------+
//|value |
//+--------------------------------------+
//|Manager_21: Employee_575,Employee_2703|
//|Manager_11: Employee_454,Employee_158 |
//|Manager_4: Employee_1545,Employee_1312|
//+--------------------------------------+
import org.apache.spark.sql.functions._
df.withColumn("mid",monotonically_increasing_id).
withColumn("col1",split(col("value"),":")(0)).
withColumn("col2",split(split(col("value"),":")(1),",")).
groupBy("mid").
pivot(col("col1")).
agg(min(col("col2"))).
select(max("Manager_11").alias("Manager_11"),max("Manager_21").alias("Manager_21") ,max("Manager_4").alias("Manager_4")).
selectExpr("explode(arrays_zip(Manager_11,Manager_21,Manager_4))").
select("col.*").
show()
//+-------------+-------------+--------------+
//| Manager_11| Manager_21| Manager_4|
//+-------------+-------------+--------------+
//| Employee_454| Employee_575| Employee_1545|
//| Employee_158|Employee_2703| Employee_1312|
//+-------------+-------------+--------------+
UPDATE:
val df=spark.read.text("<path>")
val df1=df.withColumn("mid",monotonically_increasing_id).
withColumn("col1",split(col("value"),":")(0)).
withColumn("col2",split(split(col("value"),":")(1),",")).
groupBy("mid").
pivot(col("col1")).
agg(min(col("col2"))).
select(max("Manager_11").alias("Manager_11"),max("Manager_21").alias("Manager_21") ,max("Manager_4").alias("Manager_4")).
selectExpr("explode(arrays_zip(Manager_11,Manager_21,Manager_4))")
//create temp table
df1.createOrReplaceTempView("tmp_table")
sql("select col.* from tmp_table").show(10,false)
//+-------------+-------------+--------------+
//|Manager_11 |Manager_21 |Manager_4 |
//+-------------+-------------+--------------+
//| Employee_454| Employee_575| Employee_1545|
//|Employee_158 |Employee_2703|Employee_1312 |
//+-------------+-------------+--------------+
I have a schema -
|-- record_id: integer (nullable = true)
|-- Data1: string (nullable = true)
|-- Data2: string (nullable = true)
|-- Data3: string (nullable = true)
|-- Time: timestamp (nullable = true)
I want to find, for each record_id, the record with the latest timestamp. I have not been able to do this in Structured Streaming. In Spark Streaming, I have achieved this on each incoming batch by using foreachRDD, converting each incoming RDD to a dataframe, and then running my SQL query on it.
However, this yields results only for each new RDD, not over the whole history. I know I can do this in Spark Streaming using key-value pairs, but I'm rather interested in running SQL queries over the whole history (group by, joins and such). How can I do it in Spark Streaming, and not in Spark Structured Streaming?
Another reason I can't do this in Structured Streaming is that I can't use streaming aggregation before joins, which is roughly what I require here.
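For context, here is a minimal sketch of the per-batch approach described above (the socket source and JSON-encoded records are assumptions for illustration); the SQL only ever sees the current micro-batch, which is exactly the limitation being described:
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("latest-per-key").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 10)  # 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source of JSON records

def process(time, rdd):
    if rdd.isEmpty():
        return
    batch_df = spark.read.json(rdd)  # record_id, Data1..Data3, Time
    batch_df.createOrReplaceTempView("batch")
    # latest record per record_id -- but only within this micro-batch
    spark.sql("""
        SELECT b.* FROM batch b
        JOIN (SELECT record_id, max(Time) AS max_time
              FROM batch GROUP BY record_id) m
          ON b.record_id = m.record_id AND b.Time = m.max_time
    """).show()

lines.foreachRDD(process)
ssc.start()
ssc.awaitTermination()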
I am trying to read XML data from a Kafka topic using Spark Structured Streaming.
I tried using the Databricks spark-xml package, but I got an error saying that this package does not support streamed reading. Is there any way I can extract XML data from a Kafka topic using Structured Streaming?
My current code:
df = spark \
.readStream \
.format("kafka") \
.format('com.databricks.spark.xml') \
.options(rowTag="MainElement")\
.option("kafka.bootstrap.servers", "localhost:9092") \
.option(subscribeType, "test") \
.load()
The error:
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.UnsupportedOperationException: Data source com.databricks.spark.xml does not support streamed reading
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:234)
.format("kafka") \
.format('com.databricks.spark.xml') \
The last one, com.databricks.spark.xml, wins and becomes the streaming source (hiding Kafka as the source).
In other words, the above is equivalent to .format('com.databricks.spark.xml') alone.
As you may have experienced, the Databricks spark-xml package does not support streaming reads (i.e. it cannot act as a streaming source). The package is not meant for streaming.
Is there any way I can extract XML data from Kafka topic using structured streaming?
You are left with accessing and processing the XML yourself with a standard function or a UDF. There's no built-in support for streaming XML processing in Structured Streaming up to Spark 2.2.0.
That should not be a big deal anyway. The Scala code could look as follows.
val input = spark.
readStream.
format("kafka").
...
load
val values = input.select('value cast "string")
val extractValuesFromXML = udf { (xml: String) => ??? }
val numbersFromXML = values.withColumn("number", extractValuesFromXML('value))
// print XMLs and numbers to the stdout
val q = numbersFromXML.
writeStream.
format("console").
start
Another possible solution could be to write your own custom streaming Source that would deal with the XML format in def getBatch(start: Option[Offset], end: Offset): DataFrame. That is supposed to work.
import xml.etree.ElementTree as ET
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option(subscribeType, "test") \
.load()
Then I wrote a python UDF
def parse(s):
    xml = ET.fromstring(s)
    ns = {'real_person': 'http://people.example.com',
          'role': 'http://characters.example.com'}
    actor, role = "", ""
    # look the elements up through the prefixes declared in the ns mapping
    actor_el = xml.find('real_person:actor', ns)
    if actor_el is not None:
        actor = actor_el.text
    role_el = xml.find('role:role', ns)
    if role_el is not None:
        role = role_el.text
    return actor + "|" + role
Register the UDF and apply it (the Kafka value column is binary, so cast it to string first):
from pyspark.sql.functions import udf, col, split

extractValuesFromXML = udf(parse)
XML_DF = df.withColumn("mergedCol", extractValuesFromXML(col("value").cast("string")))
AllCol_DF = XML_DF.withColumn("actorName", split(col("mergedCol"), "\\|").getItem(0)) \
    .withColumn("Role", split(col("mergedCol"), "\\|").getItem(1))
You cannot mix formats this way. The Kafka source is loaded as rows with a number of fields, like key, value and topic, with the value column storing the payload as a binary type:
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
...
value.deserializer: Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.
Parsing this content is the user's responsibility and cannot be delegated to other data sources. See for example my answer to How to read records in JSON format from Kafka using Structured Streaming?.
For XML you'll likely need a UDF (UserDefinedFunction), although you can try the Hive XPath functions first. You should also decode the binary data.
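A minimal PySpark sketch of that suggestion, assuming the Kafka value holds a small XML document with a top-level Author element (the topic name, broker address and XPath are illustrative):
import xml.etree.ElementTree as ET
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf

spark = SparkSession.builder.getOrCreate()

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # assumption
       .option("subscribe", "test")                          # assumption
       .load())

# 1) decode the binary value, 2) extract fields with either the built-in
#    Hive XPath functions or a plain Python UDF
decoded = raw.selectExpr("CAST(value AS STRING) AS xml")

# XPath route
via_xpath = decoded.selectExpr("xpath_string(xml, '/Book/Author/text()') AS author")

# UDF route
@udf("string")
def author_from_xml(xml_str):
    return ET.fromstring(xml_str).findtext("Author")  # assumes <Book><Author>...</Author></Book>

via_udf = decoded.withColumn("author", author_from_xml(col("xml")))
# either dataframe can then be written out with writeStream as usual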
It looks like the above approach works, but it is not using the passed schema to parse the XML documents.
If you print the relation schema, it is always:
INFO XmlToAvroConverter - .convert() : XmlRelation Schema ={} root
|-- fields: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- nullable: boolean (nullable = true)
| | |-- type: string (nullable = true)
|-- type: string (nullable = true)
For ex: I am streaming following XML Documents from Kafka Topic
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Book>
<Author>John Doe</Author>
<Title>Test</Title>
<PublishedDate></PublishedDate>
</Book>
And here is the code I have to parse the XML into a DataFrame:
kafkaValueAsStringDF = kafakDF.selectExpr("CAST(key AS STRING) msgKey","CAST(value AS STRING) xmlString")
var parameters = collection.mutable.Map.empty[String, String]
parameters.put("rowTag", "Book")
kafkaValueAsStringDF.writeStream.foreachBatch {
(batchDF: DataFrame, batchId: Long) =>
val xmlStringDF:DataFrame = batchDF.selectExpr("xmlString")
xmlStringDF.printSchema()
val rdd: RDD[String] = xmlStringDF.as[String].rdd
val relation = XmlRelation(
() => rdd,
None,
parameters.toMap,
xmlSchema)(spark.sqlContext)
logger.info(".convert() : XmlRelation Schema ={} "+relation.schema.treeString)
}
.start()
.awaitTermination()
When I read the same XML documents from the file system or S3 and use spark-xml, the schema is parsed as expected.
Thanks
Sateesh
You can use the SQL built-in functions xpath and the like to extract data from a nested XML structure that comes as the value of a Kafka message.
Given a nested XML like
<CofiResults>
<ExecutionTime>20201103153839</ExecutionTime>
<FilterClass>S</FilterClass>
<InputData>
<Finance>
<HeaderSegment>
<Version>6</Version>
<SequenceNb>1</SequenceNb>
</HeaderSegment>
</Finance>
</InputData>
</CofiResults>
you can then just use those SQL functions in your selectExpr statement as below:
spark.readStream.format("kafka").options(...).load()
.selectExpr("CAST(value AS STRING) as value")
.selectExpr(
"xpath(value, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsArryString",
"xpath_long(value, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsLong",
"xpath_string(value, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsString",
"xpath_int(value, '/CofiResults/InputData/Finance/HeaderSegment/Version/text()') as VersionAsInt")
Remember that the xpath function will return an array of strings, whereas you may find it more convenient to extract the value as a String or even a Long. Applying the code above in Spark 3.0.1 with a console sink stream results in:
+-------------------------+-------------------+---------------------+------------+
|ExecutionTimeAsArryString|ExecutionTimeAsLong|ExecutionTimeAsString|VersionAsInt|
+-------------------------+-------------------+---------------------+------------+
|[20201103153839] |20201103153839 |20201103153839 |6 |
+-------------------------+-------------------+---------------------+------------+