Can't read CSV string using PySpark - python-3.x

The scenario is: EventHub -> Azure Databricks (using pyspark)
File format: CSV (Quoted, Pipe delimited and custom schema )
I am trying to read CSV strings coming from Event Hub. Spark successfully creates the dataframe with the proper schema, but the dataframe ends up empty after every message.
I managed to do some tests outside the streaming environment: when reading the data from a file everything works, but it fails when the data comes from a string.
So I found some links to help me on this, but none worked:
can-i-read-a-csv-represented-as-a-string-into-apache-spark-using-spark-csv?rq=1
Pyspark - converting json string to DataFrame
Right now I have the code below:
schema = StructType([
    StructField("Decisao", StringType(), True),
    StructField("PedidoID", StringType(), True),
    StructField("De_LastUpdated", StringType(), True)
])
body = 'DECISAO|PEDIDOID|DE_LASTUPDATED\r\n"asdasdas"|"1015905177"|"sdfgsfgd"'
csvData = sc.parallelize([body])

df = spark.read \
    .option("header", "true") \
    .option("mode", "FAILFAST") \
    .option("delimiter", "|") \
    .schema(schema) \
    .csv(csvData)
df.show()
Is that even possible to do with CSV files?

You can construct the schema like this via Row, splitting on the | delimiter:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import Row

body = 'DECISAO|PEDIDOID|DE_LASTUPDATED\r\n"asdasdas"|"1015905177"|"sdfgsfgd"'
csvData = sc.parallelize([body])

schemaDF = csvData \
    .map(lambda x: x.split("|")) \
    .map(lambda x: Row(x[0], x[1], x[2], x[3], x[4])) \
    .toDF(["Decisao", "PedidoID", "De_LastUpdated", "col4", "col5"])

for i in schemaDF.take(1): print(i)
Row(Decisao='DECISAO', PedidoID='PEDIDOID', De_LastUpdated='DE_LASTUPDATED\r\n"asdasdas"', col4='"1015905177"', col5='"sdfgsfgd"')
schemaDF.printSchema()
root
|-- Decisao: string (nullable = true)
|-- PedidoID: string (nullable = true)
|-- De_LastUpdated: string (nullable = true)
|-- col4: string (nullable = true)
|-- col5: string (nullable = true)
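Note that with this approach the header row and the first data row end up concatenated (see the \r\n kept inside De_LastUpdated above) and the quotes remain part of the values. An alternative is to split the message into lines first and hand them to the CSV reader, which then honours the header, the quoting and the pipe delimiter. A minimal sketch, assuming one RDD element per CSV row (spark.read.csv accepts an RDD of strings from Spark 2.2 onwards) and reusing the schema and payload from the question:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Decisao", StringType(), True),
    StructField("PedidoID", StringType(), True),
    StructField("De_LastUpdated", StringType(), True)
])

body = 'DECISAO|PEDIDOID|DE_LASTUPDATED\r\n"asdasdas"|"1015905177"|"sdfgsfgd"'

# One RDD element per CSV line instead of one element for the whole message.
lines = sc.parallelize(body.split("\r\n"))

df = spark.read \
    .option("header", "true") \
    .option("delimiter", "|") \
    .schema(schema) \
    .csv(lines)

df.show()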

Related

DateType column read as StringType from CSV file even when appropriate schema provided

I am trying to read a CSV file using PySpark that contains a DateType field in the format "dd/MM/yyyy". I have specified the field as DateType() in the schema definition and also provided the option "dateFormat" to the DataFrame CSV reader. However, after the read the output dataframe has the field as StringType() instead of DateType().
Sample input data:
"school_id","gender","class","doj"
"1","M","9","01/01/2020"
"1","M","10","01/03/2018"
"1","F","10","01/04/2018"
"2","M","9","01/01/2019"
"2","F","10","01/01/2018"
My code:
from pyspark.sql.types import StructField, StructType, StringType, DateType
school_students_schema = StructType([
    StructField("school_id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("class", StringType(), True),
    StructField("doj", DateType(), True)
])

school_students_df = spark.read.format("csv") \
    .option("header", True) \
    .option("schema", school_students_schema) \
    .option("dateFormat", "dd/MM/yyyy") \
    .load("/user/test/school_students.csv")
school_students_df.printSchema()
Actual output after running the above (column doj is parsed as string instead of the specified DateType, and no exception is raised):
root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: string (nullable = true)
Expected output:
root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: date (nullable = true)
Runtime environment
Databricks Community Edition
7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)
Requesting your help to understand:
Why is the column being parsed as StringType even though DateType is mentioned in schema?
What needs to be done in the code so that the column doj is parsed as DateType()?
You should use
.schema(school_students_schema)
instead of
.option("schema", school_students_schema)
(There is no "schema" entry in the available option list, so that option is silently ignored.)
You also need
.option("dateFormat", "some format")
or the appropriate default format; the column ends up as StringType if the format does not match.
Only one date format is possible this way, by the way; anything beyond that requires manipulating the columns inline after the read.
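Putting both fixes together, a sketch of how the read could look (path, schema and date format are taken from the question; not tested against the exact data):

from pyspark.sql.types import StructField, StructType, StringType, DateType

school_students_schema = StructType([
    StructField("school_id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("class", StringType(), True),
    StructField("doj", DateType(), True)
])

school_students_df = spark.read.format("csv") \
    .option("header", True) \
    .option("dateFormat", "dd/MM/yyyy") \
    .schema(school_students_schema) \
    .load("/user/test/school_students.csv")

school_students_df.printSchema()  # doj should now come out as date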

How to use writeStream.outputMode("append") in PySpark in combination with a groupBy on a timestamp per hour?

I have a large dataframe with records from machines that do something (event_name), each with a timestamp:
root
|-- device_Model: string (nullable = true)
|-- application_version: string (nullable = true)
|-- location_country: string (nullable = true)
|-- event_name: string (nullable = true)
|-- data_eventTimeStamp: timestamp (nullable = true)
|-- event_count: long (nullable = true)
What I would like to do is to count the event_name by hour.
To do so I have written the following code:
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import *

## Load the data from its source.
events = spark \
    .readStream \
    .format("delta") \
    .load(load_path)

## Timestamp per hour.
df_timestamp = events \
    .withColumn("Timestamp", date_trunc('hour', to_timestamp("data_eventTimestamp", "dd-MM-yyyy HH:mm:ss")))

## Group by and count events per hour.
event_count = df_timestamp \
    .groupBy("device_Model", "application_version", "location_country", "event_name", "Timestamp", "event_count") \
    .sum("event_count")

## Write stream to Delta table.
event_count.coalesce(1) \
    .writeStream \
    .trigger(once=True) \
    .outputMode("complete") \
    .format("delta") \
    .option("checkpointLocation", "...") \
    .start("...")
This code works as expected, but every time it is triggered it creates a new (large) .snappy.parquet file, and these files eventually stack up and consume space in the sink.
Instead of complete mode I would prefer append mode, which would first write a large .snappy.parquet file for the existing data and then smaller .snappy.parquet files containing only the new events from the loaded data.
However, if I use append mode Spark tells me it won't work because the query lacks a watermark, and I don't see how to use a watermark in this context.
Does someone know how to solve this issue? Thanks!
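For reference, append mode on a streaming aggregation requires an event-time watermark on the timestamp used in the groupBy. A hypothetical sketch of where it could go in the code above (the 1-hour delay threshold is an assumption; everything else is reused from the question):

## Add a watermark on the hourly timestamp so append mode is allowed.
df_timestamp = events \
    .withColumn("Timestamp", date_trunc('hour', to_timestamp("data_eventTimestamp", "dd-MM-yyyy HH:mm:ss"))) \
    .withWatermark("Timestamp", "1 hour")

event_count = df_timestamp \
    .groupBy("device_Model", "application_version", "location_country", "event_name", "Timestamp", "event_count") \
    .sum("event_count")

event_count.coalesce(1) \
    .writeStream \
    .trigger(once=True) \
    .outputMode("append") \
    .format("delta") \
    .option("checkpointLocation", "...") \
    .start("...")

With append mode, a given hour's counts are only emitted once the watermark has moved past that hour, so rows arriving within the threshold are still counted but the output appears with a delay.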

How to convert JSON file into regular table DataFrame in Apache Spark

I have the following JSON records:
{"constructorId":1,"constructorRef":"mclaren","name":"McLaren","nationality":"British","url":"http://en.wikipedia.org/wiki/McLaren"}
{"constructorId":2,"constructorRef":"bmw_sauber","name":"BMW Sauber","nationality":"German","url":"http://en.wikipedia.org/wiki/BMW_Sauber"}
The following code produces the following DataFrame:
I'm running the code on Databricks
df = (spark.read
      .format("csv")
      .schema(mySchema)
      .load(dataPath)
     )
display(df)
However, I need the DataFrame to look like the following:
I believe the problem is that the JSON is nested and I'm trying to convert it to CSV. However, I do need to convert to CSV.
Is there code that I can apply to remove the nested feature of the JSON?
Just try:
someDF = spark.read.json(somepath)
The schema is inferred by default, or you can supply your own; in your case (one JSON object per line) set multiLine to false in PySpark:
someDF = spark.read.json(somepath, someschema, multiLine=False)
See https://spark.apache.org/docs/latest/sql-data-sources-json.html
With schema inference:
df = spark.read.option("multiline","false").json("/FileStore/tables/SOabc2.txt")
df.printSchema()
df.show()
df.count()
returns:
root
|-- constructorId: long (nullable = true)
|-- constructorRef: string (nullable = true)
|-- name: string (nullable = true)
|-- nationality: string (nullable = true)
|-- url: string (nullable = true)
+-------------+--------------+----------+-----------+--------------------+
|constructorId|constructorRef| name|nationality| url|
+-------------+--------------+----------+-----------+--------------------+
| 1| mclaren| McLaren| British|http://en.wikiped...|
| 2| bmw_sauber|BMW Sauber| German|http://en.wikiped...|
+-------------+--------------+----------+-----------+--------------------+
Out[11]: 2
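As a hypothetical follow-up (not part of the original answer): once the JSON is loaded as a flat DataFrame like the one above, it can be written back out as CSV. The output path below is made up for illustration.

someDF = spark.read.json("/FileStore/tables/SOabc2.txt")
someDF.write.option("header", True).csv("/FileStore/tables/SOabc2_csv")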

How can I send my structured streaming dataframe to Kafka?

Hello everyone!
I'm trying to send my structured streaming dataframe to one of my Kafka topics, detection.
This is the schema of the structured streaming dataframe:
root
|-- timestamp: timestamp (nullable = true)
|-- Sigma: string (nullable = true)
|-- time: string (nullable = true)
|-- duration: string (nullable = true)
|-- SourceComputer: string (nullable = true)
|-- SourcePort: string (nullable = true)
|-- DestinationComputer: string (nullable = true)
|-- DestinationPort: string (nullable = false)
|-- protocol: string (nullable = true)
|-- packetCount: string (nullable = true)
|-- byteCount: string (nullable = true)
But when I try to send the dataframe with this method:
dfwriter = df \
    .selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("kafka") \
    .option("checkpointLocation", "/Documents/checkpoint/logs") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("failOnDataLoss", "false") \
    .option("topic", detection) \
    .start()
Then I get the error:
pyspark.sql.utils.AnalysisException: cannot resolve 'value' given input columns: [DestinationComputer, DestinationPort, Sigma, SourceComputer, SourcePort, byteCount, duration, packetCount, processName, protocol, time, timestamp]; line 1 pos 5;
If I send a dataframe with just the column value it works, and I receive the data on my Kafka topic consumer.
Any idea how to send my dataframe with all columns?
Thank you!
Your dataframe has no value column, as the error says.
You'd need to "embed" all columns into a single value column, then use a function like to_json, not CAST(.. AS STRING).
In PySpark, that'd be something like to_json(struct("*")).alias("value") within a select, as sketched below.
Similar question - Convert all the columns of a spark dataframe into a json format and then include the json formatted data as a column in another/parent dataframe
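Expanding that expression into a runnable sketch (topic name, bootstrap servers and checkpoint path are reused from the question; the exact select is an illustration, not the poster's code):

from pyspark.sql.functions import to_json, struct

# Pack all columns into a single JSON string column named "value" for the Kafka sink.
dfwriter = df \
    .select(to_json(struct("*")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("checkpointLocation", "/Documents/checkpoint/logs") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("failOnDataLoss", "false") \
    .option("topic", "detection") \
    .start()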

spark read orc with specific columns

I have an ORC file; when read with the option below it reads all the columns.
val df = spark.read.orc("/some/path/")
df.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- value: string (nullable = true)
|-- all: string (nullable = true)
|-- next: string (nullable = true)
|-- action: string (nullable = true)
but I want to read only two columns from that file. Is there any way to read only two columns (id, name) while loading the ORC file?
Is there any way to read only two columns (id, name) while loading the ORC file?
Yes, all you need is a subsequent select; Spark will take care of the rest for you:
val df = spark.read.orc("/some/path/").select("id", "name")
Spark has a lazy execution model, so you can apply any data transformations in your code without any immediate effect. Only when an action is called does Spark start doing the job, and it is smart enough not to do extra work.
So you can write it like this:
val inDF: DataFrame = spark.read.orc("/some/path/")
import spark.implicits._
val filteredDF: DataFrame = inDF.select($"id", $"name")
// any additional transformations
// real work starts after this action
val result: Array[Row] = filteredDF.collect()
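For completeness, a hypothetical PySpark equivalent of the same idea (path and column names taken from the question):

# Select only the needed columns right after the read; Spark prunes the rest lazily.
df = spark.read.orc("/some/path/").select("id", "name")
df.printSchema()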
