Convert string data to struct data - apache-spark

I have a string of the form {'text':'abc'},{'text':'def'} and I need to get an array of the form ['abc','def'].
I use the following code:
schema = StructType([StructField('text_str', StringType(), True)])
dsdf.withColumn('text', from_json(col('text'), schema)).show(truncate=False)
Which returns ['abc']. How do I get what I really need?

dsdf.select(
    expr(
        "transform(split(language, ','), x -> from_json(x, 'text String'))"
    ).alias("text")
).show()
I'm using expr to build a SQL string that runs transform; this has the widest compatibility across Spark versions, but transform can also be called natively in recent versions of PySpark.
split will produce an array, on the assumption that you can split the string on ','.
transform will operate on each item in the array.
from_json, as you know, parses JSON.
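If you want the plain strings ['abc','def'] rather than an array of structs, you can also pull the text field out of each parsed struct inside the same transform. A minimal sketch, reusing the column name (language) from the expression above:
# a minimal sketch: extract the "text" field from each parsed struct
from pyspark.sql.functions import expr

dsdf.select(
    expr("transform(split(language, ','), x -> from_json(x, 'text STRING').text)").alias("text")
).show(truncate=False)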

Related

Read csv that contains array of string in pyspark

I'm trying to read a csv that has the following data:
name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3
Using inferSchema results in the stops field spilling over into the next columns and messing up the dataframe.
If I give my own schema like:
schema = StructType([
    StructField('name', StringType()),
    StructField('date', TimestampType()),
    StructField('win', BooleanType()),
    StructField('stops', ArrayType(StringType())),
    StructField('cost', DoubleType())])
it results in this exception:
pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.
so how would I properly read the csv without this failure?
Since CSV doesn't support arrays, you need to first read the column as a string, then convert it.
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# You need to set the escape option to ", since it is not the default escape character (\).
df = spark.read.csv('file.csv', header=True, escape='"')
df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))
I guess this is what you are looking for:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")
dataframe.printSchema()
Let me know if it helps

How to extract information from a nested XML_String in Spark-Structured-Streaming

I have a Spark Structured Streaming application connected to ActiveMQ. The application receives messages from a topic. These messages are in the form of an XML string. I want to extract information from this nested XML. How can I do this?
I referred to this post, but was not able to implement something similar in Scala.
XML Format:
<CofiResults>
<ExecutionTime>20201103153839</ExecutionTime>
<FilterClass>S </FilterClass>
<InputData format="something" id="someID"><ns2:FrdReq xmlns:ns2="http://someone.com">
<HeaderSegment xmlns="https://somelink.com">
<Version>6</Version>
<SequenceNb>1</SequenceNb>
</HeaderSegment>
.
.
.
My Code:
val df = spark.readStream
  .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
  .option("brokerUrl", brokerUrl_)
  .option("topic", topicName_)
  .option("persistence", "memory")
  .option("cleanSession", "true")
  .option("username", username_)
  .option("password", password_)
  .load()

val payload_ = df.select('payload cast "string") // This payload IS the XMLString
Now I need to extract ExecutionTime, Version, and other fields from the above XML.
You can use the SQL built-in functions xpath and the like to extract data from a nested XML structure.
Given a nested XML like the following (for simplicity, I have omitted any tag attributes)
<CofiResults>
<ExecutionTime>20201103153839</ExecutionTime>
<FilterClass>S</FilterClass>
<InputData>
<ns2>
<HeaderSegment>
<Version>6</Version>
<SequenceNb>1</SequenceNb>
</HeaderSegment>
</ns2>
</InputData>
</CofiResults>
you can then just use those SQL functions (without createOrReplaceTempView) in your selectExpr statement as below:
.selectExpr("CAST(payload AS STRING) as payload")
.selectExpr(
"xpath(payload, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsArryString",
"xpath_long(payload, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsLong",
"xpath_string(payload, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsString",
"xpath_int(payload, '/CofiResults/InputData/ns2/HeaderSegment/Version/text()') as VersionAsInt")
Remember that the xpath function will return an Array of Strings, whereas you may find it more convenient to extract the value as a String or even a Long. Applying the code above in Spark 3.0.1 with a console sink stream results in:
+-------------------------+-------------------+---------------------+------------+
|ExecutionTimeAsArryString|ExecutionTimeAsLong|ExecutionTimeAsString|VersionAsInt|
+-------------------------+-------------------+---------------------+------------+
|[20201103153839]         |20201103153839     |20201103153839       |6           |
+-------------------------+-------------------+---------------------+------------+
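If you want to prototype the xpath expressions before wiring them into the stream, you can run them against a static DataFrame first. A minimal sketch (shown in PySpark for brevity; the SQL expressions are identical in Scala, and it uses the simplified, namespace-free XML from above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xpath-test").getOrCreate()

# simplified XML payload, matching the structure used in the answer above
xml = ("<CofiResults><ExecutionTime>20201103153839</ExecutionTime>"
       "<InputData><ns2><HeaderSegment><Version>6</Version></HeaderSegment></ns2></InputData>"
       "</CofiResults>")

df = spark.createDataFrame([(xml,)], ["payload"])

df.selectExpr(
    "xpath_string(payload, '/CofiResults/ExecutionTime/text()') as ExecutionTime",
    "xpath_int(payload, '/CofiResults/InputData/ns2/HeaderSegment/Version/text()') as Version"
).show(truncate=False)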

More convenient way to reproduce pyspark sample

Most questions about Spark use the output of show as the code example, without the code that generates the dataframe, like this:
df.show()
+-------+--------+----------+
|USER_ID|location| timestamp|
+-------+--------+----------+
|      1|    1001|1265397099|
|      1|    6022|1275846679|
|      1|    1041|1265368299|
+-------+--------+----------+
How can I reproduce this dataframe in my programming environment without rewriting it manually? Does pyspark have an equivalent of read_clipboard in pandas?
Edit
The lack of a function to import data into my environment is a big obstacle to helping others with pyspark on Stack Overflow.
So my question is:
What is the most convenient way to reproduce data pasted on Stack Overflow from the show command in my environment?
You can always use the following function:
from pyspark.sql.functions import *

def read_spark_output(file_path):
    step1 = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .option("delimiter", "|") \
        .option("parserLib", "UNIVOCITY") \
        .option("ignoreLeadingWhiteSpace", "true") \
        .option("ignoreTrailingWhiteSpace", "true") \
        .option("comment", "+") \
        .csv("file://{}".format(file_path))
    # drop the empty columns created by the leading/trailing '|' delimiters
    step2 = step1.select([c for c in step1.columns if not c.startswith("_")])
    # replace the literal string 'null' with real nulls
    return step2.select(*[when(~col(col_name).eqNullSafe("null"), col(col_name)).alias(col_name)
                          for col_name in step2.columns])
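For example, assuming the show() output from the question has been saved to a local file (the path below is illustrative):
# hypothetical file containing the pasted show() output from the question
df = read_spark_output("/tmp/show_output.txt")
df.show()
df.printSchema()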
It's one of the suggestions given in the following question: How to make good reproducible Apache Spark examples.
Note 1: Sometimes there might be special cases where this doesn't apply for one reason or another and which can generate errors/issues, e.g. Group by column "grp" and compress DataFrame - (take last not null value for each column ordering by column "ord").
So please use it with caution!
Note 2: (Disclaimer) I'm not the original author of the code. Thanks to @MaxU for the code; I just made some modifications to it.
Late answer, but I often face the same issue, so I wrote a small utility for this: https://github.com/ollik1/spark-clipboard
It basically allows copy-pasting data frame show strings to Spark. To install it, add the jcenter dependency com.github.ollik1:spark-clipboard_2.12:0.1 and the Spark config .config("fs.clipboard.impl", "com.github.ollik1.clipboard.ClipboardFileSystem"). After this, data frames can be read directly from the system clipboard:
val df = spark.read
  .format("com.github.ollik1.clipboard")
  .load("clipboard:///*")
or alternatively from files if you prefer. Installation details and usage are described in the README file.
You can always read the data into pandas as a pandas dataframe and then convert it back to a spark dataframe. No, there is no direct equivalent of read_clipboard in pyspark, unlike pandas.
The reason is that pandas dataframes are mostly flat structures, whereas spark dataframes can have complex structures like structs and arrays. Since spark has a wide variety of data types and those don't appear in console output, it is not possible to recreate the dataframe from the output alone.
You can combine pandas read_clipboard and convert it to a pyspark dataframe:
import pandas as pd
from pyspark.sql.types import *

pdDF = pd.read_clipboard(sep=',',
                         index_col=0,
                         names=['USER_ID',
                                'location',
                                'timestamp',
                                ])

mySchema = StructType([StructField("USER_ID", StringType(), True),
                       StructField("location", LongType(), True),
                       StructField("timestamp", LongType(), True)])
# note: True implies the column is nullable

df = spark.createDataFrame(pdDF, schema=mySchema)
Update:
What @terry really wants is to copy the ASCII table output into Python, and the following is an example. Once you parse the data into Python, you can convert it to anything.
def parse(ascii_table):
    header = []
    data = []
    for line in filter(None, ascii_table.split('\n')):
        # skip the +---+---+ separator rows
        if '-+-' in line:
            continue
        # split on '|' and drop the empty edge cells created by the leading/trailing pipes
        splitted_line = [cell.strip() for cell in line.split('|')[1:-1]]
        if not header:
            header = splitted_line
            continue
        data.append([''] * len(header))
        for i in range(len(splitted_line)):
            data[-1][i] = splitted_line[i]
    return header, data
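Once parsed, the header and rows can be turned into a Spark DataFrame. A minimal sketch (every cell comes back as a string, so cast the columns afterwards if needed):
ascii_table = """
+-------+--------+----------+
|USER_ID|location| timestamp|
+-------+--------+----------+
|      1|    1001|1265397099|
|      1|    6022|1275846679|
+-------+--------+----------+
"""

header, data = parse(ascii_table)
# all values are strings at this point; cast columns as needed afterwards
df = spark.createDataFrame(data, schema=header)
df.show()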

Deserializing Event Hub messages in Azure Databricks

I have an Azure Databricks script in Python that reads JSON messages from Event Hub using Structured Streaming, processes the messages and saves the results in Data Lake Store.
The messages are sent to the Event Hub from an Azure Logic App that reads tweets from the Twitter API.
I am trying to deserialize the body of the Event Hub message in order to process its contents. The message body is first converted from binary to a string value and then deserialized to a struct type using the from_json function, as explained in this article: https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
Here is a code example (with obfuscated parameters):
from pyspark.sql.functions import from_json, to_json
from pyspark.sql.types import DateType, StringType, StructType
EVENT_HUB_CONN_STRING = 'Endpoint=sb://myehnamespace.servicebus.windows.net/;SharedAccessKeyName=Listen;SharedAccessKey=xxx;EntityPath=myeh'
OUTPUT_DIR = '/mnt/DataLake/output'
CHECKPOINT_DIR = '/mnt/DataLake/checkpoint'
event_hub_conf = {
    'eventhubs.connectionString': EVENT_HUB_CONN_STRING
}

stream_data = spark \
    .readStream \
    .format('eventhubs') \
    .options(**event_hub_conf) \
    .option('multiLine', True) \
    .option('mode', 'PERMISSIVE') \
    .load()

schema = StructType() \
    .add('FetchTimestampUtc', DateType()) \
    .add('Username', StringType()) \
    .add('Name', StringType()) \
    .add('TweetedBy', StringType()) \
    .add('Location', StringType()) \
    .add('TweetText', StringType())

stream_data_body = stream_data \
    .select(stream_data.body) \
    .select(from_json('body', schema).alias('body')) \
    .select(to_json('body').alias('body'))

# This works (bare string value, no deserialization):
# stream_data_body = stream_data.select(stream_data.body)

stream_data_body \
    .writeStream \
    .outputMode('append') \
    .format('json') \
    .option('path', OUTPUT_DIR) \
    .option('checkpointLocation', CHECKPOINT_DIR) \
    .start() \
    .awaitTermination()
Here I am not actually doing any processing yet, just a trivial deserialization/serialization.
The above script does produce output to Data Lake, but the resulting JSON objects are empty. Here is an example of the output:
{}
{}
{}
The commented code in the script does produce output, but this is just the string value since we did not include deserialization:
{"body":"{\"FetchTimestampUtc\": 2018-10-16T09:21:40.6173187Z, \"Username\": ... }}
I was wondering if the backslashes should be doubled, as in the example given in the link above? This might be doable with the options parameter of the from_json function: "options to control parsing. accepts the same options as the json datasource." But I have not found documentation for the options format.
Any ideas why the deserialization/serialization is not working?
It appears that the input JSON must have a specific syntax. The field values must be strings; timestamps are not allowed (and perhaps the same goes for integers, floats, etc.). The type conversion must be done inside the Databricks script.
I changed the input JSON so that the timestamp value is quoted. In the schema, I also changed DateType to TimestampType (which is more appropriate), not StringType.
By using the following select expression:
stream_data_body = stream_data \
    .select(from_json(stream_data.body.cast('string'), schema).alias('body')) \
    .select(to_json('body').alias('body'))
the following output is produced in the output file:
{"body":"{\"FetchTimestampUtc\":\"2018-11-29T21:26:40.039Z\",\"Username\":\"xyz\",\"Name\":\"x\",\"TweetedBy\":\"xyz\",\"Location\":\"\",\"TweetText\":\"RT #z123: I just want to say thanks to everyone who interacts with me, whether they talk or they just silently rt or like, thats okay.…\"}"}
which is kind of the expected result, although the timestamp value is output as a string. In fact, the whole body object is output as a string.
I didn't manage to get the ingestion working if the input format is proper JSON with native field types. The output of from_json is always null in that case.
EDIT:
This seems to have been confusion on my part. Date values should always be quoted in JSON; they are not "native" types.
I have tested that integer and float values can be passed without quotes so that it is possible to do calculations with them.
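If you want the parsed fields as separate columns instead of re-serializing the struct back to a JSON string, you can expand the struct directly. A minimal sketch, assuming the same stream_data and schema as above:
# expand the parsed struct into one column per field instead of re-serializing it
stream_data_body = stream_data \
    .select(from_json(stream_data.body.cast('string'), schema).alias('body')) \
    .select('body.*')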

How to write just the `row` value of a DataFrame to a file in spark?

I have a dataframe that has just one column, whose value is a JSON string. I'm trying to write just the values to a file with one record per line.
scala> selddf.printSchema
root
|-- raw_event: string (nullable = true)
The data looks like this:
scala> selddf.show(1)
+--------------------+
|           raw_event|
+--------------------+
|{"event_header":{...|
+--------------------+
only showing top 1 row
I am running the following to save it to file:
selddf.select("raw_event").write.json("/data/test")
The output looks like:
{"raw_event":"{\"event_header\":{\"version\":\"1.0\"...}"}
I would like the output to just say:
{\"event_header\":{\"version\":\"1.0\"...}
What am I missing?
The reason this happens is that when you write as JSON, you are writing the dataframe in which the column is raw_event.
Your first option is to simply write it as text:
df.write.text(filename)
Another option (if your JSON schema is the same for all elements) is to use the from_json function to convert this into a legal dataframe. Select the elements (the content of the column, which would include all members of the JSON) and only then save it:
val df = Seq("{\"a\": \"str\", \"b\": [1,2,3], \"c\": {\"d\": 1, \"e\": 2}}").toDF("raw_event")
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("a", StringType),
  StructField("b", ArrayType(IntegerType)),
  StructField("c", StructType(Seq(StructField("d", IntegerType), StructField("e", IntegerType))))))
df.withColumn("jsonData", from_json($"raw_event", schema)).select("jsonData.*").write.json("bla.json")
The advantage of the second option is that you can test for malformed rows (which would result in null) and therefore you can add a filter to remove them.
Note that in both cases you don't get escaping for the ". If you want that, you would need to use the first option and first apply a UDF which adds the escaping.
