How can I send my Structured Streaming DataFrame to Kafka? - apache-spark

Hello everyone!
I'm trying to send my Structured Streaming DataFrame to one of my Kafka topics, detection.
This is the schema of the Structured Streaming DataFrame:
root
|-- timestamp: timestamp (nullable = true)
|-- Sigma: string (nullable = true)
|-- time: string (nullable = true)
|-- duration: string (nullable = true)
|-- SourceComputer: string (nullable = true)
|-- SourcePort: string (nullable = true)
|-- DestinationComputer: string (nullable = true)
|-- DestinationPort: string (nullable = false)
|-- protocol: string (nullable = true)
|-- packetCount: string (nullable = true)
|-- byteCount: string (nullable = true)
But when I try to send the DataFrame with this method:
dfwriter=df \
.selectExpr("CAST(value AS STRING)") \
.writeStream \
.format("kafka") \
.option("checkpointLocation", "/Documents/checkpoint/logs") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("failOnDataLoss", "false") \
.option("topic", detection) \
.start()
Then I get this error:
pyspark.sql.utils.AnalysisException: cannot resolve 'value' given input columns: [DestinationComputer, DestinationPort, Sigma, SourceComputer, SourcePort, byteCount, duration, packetCount, processName, protocol, time, timestamp]; line 1 pos 5;
If I send a DataFrame with just the column value, it works: I receive the data on my Kafka topic consumer.
Any idea how to send my DataFrame with all of its columns?
Thank you!

Your dataframe has no value column, as the error says.
You'd need to "embed" all columns under a single value column: wrap them in a struct, then serialize that struct with a function like to_json, not CAST( .. AS STRING).
In PySpark, that'd be something like to_json(struct("*")).alias("value") within a select query.
Similar question - Convert all the columns of a spark dataframe into a json format and then include the json formatted data as a column in another/parent dataframe

Related

DateType column read as StringType from CSV file even when appropriate schema provided

I am trying to read a CSV file using PySpark containing a DateType field in the format "dd/MM/yyyy". I have specified the field as DateType() in the schema definition and also provided the option "dateFormat" in the DataFrame CSV reader. However, the DataFrame produced by the read has the field as StringType() instead of DateType().
Sample input data:
"school_id","gender","class","doj"
"1","M","9","01/01/2020"
"1","M","10","01/03/2018"
"1","F","10","01/04/2018"
"2","M","9","01/01/2019"
"2","F","10","01/01/2018"
My code:
from pyspark.sql.types import StructField, StructType, StringType, DateType
school_students_schema = StructType([StructField("school_id", StringType(),True) ,\
StructField("gender", StringType(),True) ,\
StructField("class", StringType(),True) ,\
StructField("doj", DateType(),True)
])
school_students_df = spark.read.format("csv") \
.option("header", True) \
.option("schema", school_students_schema) \
.option("dateFormat", "dd/MM/yyyy") \
.load("/user/test/school_students.csv")
school_students_df.printSchema()
Actual output after running the above (column doj parsed as string instead of the specified DateType and dateFormat without any exception).
root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: string (nullable = true)
Expected output:
root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: date (nullable = true)
Runtime environment
Databricks Community Edition
7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)
Requesting your help to understand:
Why is the column being parsed as StringType even though DateType is mentioned in schema?
What needs to be done in the code so that the column doj is parsed as DateType()?
You should use
.schema(school_students_schema)
instead of
.option("schema", school_students_schema)
(There is no "schema" in the available option list.)
You also need
.option("dateFormat", "some format")
unless the data uses the default format. The column becomes StringType if the format is not correct.
Only one date format can be set this way, by the way. For anything else, parse the columns inline after reading.

Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file with the structure below, and I need to generate a CSV with this data in columnar form. I know that I can't write an array-type column directly to a CSV, so I used the explode function to pull out the fields I need and leave them in columnar form. However, I get an error when using explode before writing the DataFrame to CSV; from what I understand, it's not possible to use two explode calls in the same select. Can someone suggest an alternative?
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.master("local[1]")
.appName("sample")
.getOrCreate())
df = (spark.read.option("multiline", "true")
.json("data/origin/crops.json"))
df2 = (explode('history').alias('history'), explode('trial').alias('trial'))
.select('history.started_at', 'history.finished_at', col('id'), trial.is_trial, trial.ws10_max))
(df2.write.format('com.databricks.spark.csv')
.mode('overwrite')
.option("header","true")
.save('data/output/'))
root
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- finished_at: string (nullable = true)
| | |-- started_at: string (nullable = true)
|-- id: long (nullable = true)
|-- trial: struct (nullable = true)
| |-- is_trial: boolean (nullable = true)
| |-- ws10_max: double (nullable = true)
I'm trying to return something like this
started_at  finished_at  is_trial  ws10_max
First       row          row
Second      row          row
Thank you!
Use explode on array and select("struct.*") on struct.
(df.select("trial", "id", explode("history").alias("history"))
    .select("id", "history.*", "trial.*"))

How to convert JSON file into regular table DataFrame in Apache Spark

I have the following JSON fields
{"constructorId":1,"constructorRef":"mclaren","name":"McLaren","nationality":"British","url":"http://en.wikipedia.org/wiki/McLaren"}
{"constructorId":2,"constructorRef":"bmw_sauber","name":"BMW Sauber","nationality":"German","url":"http://en.wikipedia.org/wiki/BMW_Sauber"}
The following code produces the following DataFrame:
I'm running the code on Databricks
df = (spark.read
.format(csv) \
.schema(mySchema) \
.load(dataPath)
)
display(df)
However, I need the DataFrame to look like the following:
I believe the problem is because the JSON is nested, and I'm trying to convert to CSV. However, I do need to convert to CSV.
Is there code that I can apply to remove the nested feature of the JSON?
Just try:
someDF = spark.read.json(somepath)
The schema is inferred by default, or you can supply your own. In your case, each line is a separate JSON object, so in PySpark set multiLine to false:
someDF = spark.read.json(somepath, someschema, multiLine=False)
See https://spark.apache.org/docs/latest/sql-data-sources-json.html
With schema inference:
df = spark.read.option("multiline","false").json("/FileStore/tables/SOabc2.txt")
df.printSchema()
df.show()
df.count()
returns:
root
|-- constructorId: long (nullable = true)
|-- constructorRef: string (nullable = true)
|-- name: string (nullable = true)
|-- nationality: string (nullable = true)
|-- url: string (nullable = true)
+-------------+--------------+----------+-----------+--------------------+
|constructorId|constructorRef|      name|nationality|                 url|
+-------------+--------------+----------+-----------+--------------------+
|            1|       mclaren|   McLaren|    British|http://en.wikiped...|
|            2|    bmw_sauber|BMW Sauber|     German|http://en.wikiped...|
+-------------+--------------+----------+-----------+--------------------+
Out[11]: 2

How to write result of streaming query to multiple database tables?

I am using Spark Structured Streaming to read from a Kafka topic. The goal is to write each message to multiple tables in a PostgreSQL database.
The message schema is:
root
|-- id: string (nullable = true)
|-- name: timestamp (nullable = true)
|-- comment: string (nullable = true)
|-- map_key_value: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Writing to one table after dropping map_key_value works with the code below:
message.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.write.format("jdbc").option("url", "url")
.option("user", "username")
.option("password", "password")
.option(JDBCOptions.JDBC_TABLE_NAME, "table_1")
.mode(SaveMode.Append).save();
}.outputMode(OutputMode.Append()).start().awaitTermination()
I want to write the message to two DB tables: table 1 (id, name, comment), and table 2, which needs to hold map_key_value.
You will need N streaming queries for N sinks; t1 and t2 both count as a separate sink.
writeStream does not currently have a built-in JDBC sink, so you should use the foreachBatch operator.

Can't read CSV string using PySpark

The scenario is: EventHub -> Azure Databricks (using pyspark)
File format: CSV (Quoted, Pipe delimited and custom schema )
I am trying to read CSV strings coming from EventHub. Spark successfully creates the DataFrame with the proper schema, but the DataFrame ends up empty after every message.
I managed to do some tests outside the streaming environment: when the data comes from a file, all goes well, but it fails when the data comes from a string.
So I found some links to help me on this, but none worked:
can-i-read-a-csv-represented-as-a-string-into-apache-spark-using-spark-csv?rq=1
Pyspark - converting json string to DataFrame
Right now I have the code below:
schema = StructType([StructField("Decisao",StringType(),True), StructField("PedidoID",StringType(),True), StructField("De_LastUpdated",StringType(),True)])
body = 'DECISAO|PEDIDOID|DE_LASTUPDATED\r\n"asdasdas"|"1015905177"|"sdfgsfgd"'
csvData = sc.parallelize([body])
df = spark.read \
.option("header", "true") \
.option("mode","FAILFAST") \
.option("delimiter","|") \
.schema(schema) \
.csv(csvData)
df.show()
Is that even possible to do with CSV files?
You can construct the schema like this, via Row, splitting on the | delimiter:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import Row
body = 'DECISAO|PEDIDOID|DE_LASTUPDATED\r\n"asdasdas"|"1015905177"|"sdfgsfgd"'
csvData = sc.parallelize([body])
schemaDF = csvData\
.map(lambda x: x.split("|"))\
.map(lambda x: Row(x[0],\
x[1],\
x[2],\
x[3],\
x[4]))\
.toDF(["Decisao", "PedidoID", "De_LastUpdated", "col4", "col5"])
for i in schemaDF.take(1): print(i)
Row(Decisao='DECISAO', PedidoID='PEDIDOID', De_LastUpdated='DE_LASTUPDATED\r\n"asdasdas"', col4='"1015905177"', col5='"sdfgsfgd"')
schemaDF.printSchema()
root
|-- Decisao: string (nullable = true)
|-- PedidoID: string (nullable = true)
|-- De_LastUpdated: string (nullable = true)
|-- col4: string (nullable = true)
|-- col5: string (nullable = true)

Resources