Processing a list of json strings in Spark Streaming - apache-spark

I'm trying to transform the input I get with spark streaming in order to create a dataframe out of it. Basically I receive a list of json strings from which I would want to extract the data.
Note: I reduced the json strings to just the coords objects which should be sufficient for the general concept.
The input I get:
["{\"coord\":{\"lon\":10.0217,\"lat\":53.5281}}", "{\"coord\":{"lon\":10.1169,\"lat\":53.6522}}", "{\"coord\":...."]
The dataframe I want to create in order to save it to a database:
+----------+----------+
|lon |lat |
+----------+----------+
| 10.0217| 53.5281|
| 10.1169| 53.6522|
| ... | ... |
+----------+----------+
So far I have managed to replace the escaped quotes, which leaves me with an array of strings.
I tried to flatten the array:
result = df \
    .selectExpr("CAST(value AS STRING) AS json") \
    .withColumn("json", f.regexp_replace('json', '\\\\"', '"')) \
    .withColumn("json", f.flatten(f.col("json"))) \
    .select("json")
Error:
pyspark.sql.utils.AnalysisException: cannot resolve 'flatten(json)'
due to data type mismatch: The argument should be an array of arrays,
but 'json' is of string type.;;
Then I tried to load the array with json.loads, but I was not able to call this function from Spark streaming.
So how do I extract the data from this input?

With the array provided:
arr = [
    "{\"coord\":{\"lon\":10.0217,\"lat\":53.5281}}",
    "{\"coord\":{\"lon\":10.1169,\"lat\":53.6522}}",
]
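(Not part of the original answer: for local testing, this array can be loaded into a single-column DataFrame named value, which is the shape the snippet below operates on. A hypothetical construction:)
from pyspark.sql import SparkSession

# Hypothetical local setup: put each JSON string into a one-column DataFrame
# called "value", mirroring what a streaming source would deliver.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(s,) for s in arr], ["value"])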
You can get the desired result with the following code:
from pyspark.sql import functions

df = (df.withColumn("lon", functions.regexp_extract("value", r'(?<=lon":)[0-9]+\.[0-9]+', 0))
        .withColumn("lat", functions.regexp_extract("value", r'(?<=lat":)[0-9]+\.[0-9]+', 0)))
df = df.select(df["lon"], df["lat"])
df.show()
+-------+-------+
| lon| lat|
+-------+-------+
|10.0217|53.5281|
|10.1169|53.6522|
+-------+-------+
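As an alternative to the regex approach (not part of the answer above, just a sketch): if the raw streaming value really is the JSON array of JSON strings shown in the question, it can be parsed structurally with from_json, first into an array of strings and then into a struct per element.
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructType, StructField, DoubleType

# Schema of each inner JSON string: {"coord": {"lon": ..., "lat": ...}}
inner_schema = StructType([
    StructField("coord", StructType([
        StructField("lon", DoubleType()),
        StructField("lat", DoubleType()),
    ]))
])

result = (
    df.selectExpr("CAST(value AS STRING) AS json")
      .select(F.explode(F.from_json("json", ArrayType(StringType()))).alias("item"))  # one row per inner JSON string
      .select(F.from_json("item", inner_schema).alias("parsed"))
      .select(F.col("parsed.coord.lon").alias("lon"),
              F.col("parsed.coord.lat").alias("lat"))
)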

Related

Parsing a Type 4 Nested Parquet and flattening/Explode JSON values in a column in pyspark

I am relatively new to PySpark, and I use Databricks for orchestration.
[Just FYI: my source Parquet holds an SCD Type 4 dataset where the current snapshot and its history are maintained in a single row; the current snapshot sits in individual Parquet columns while the history snapshot is kept in a single column as a JSON array.]
I believe my solution could be the one used in the link below, and I just want to expand that solution to work for me (I am not able to comment on that post, and I believe my problem, even if similar, is different):
https://stackoverflow.com/questions/56409454/casting-a-column-to-json-dict-and-flattening-json-values-in-a-column-in-pyspark/56409889#56409889
Reference courtesies: @Gingerbread, @Kafels
I tried to use the resolution from that one, but I am getting some errors.
Here's what my dataframe looks like:
|HISTORY|
|:------|
|[{"HASH_KEY":"LulKYlm1qJaXFRq7oS1X1A==","SOURCE_KEY":"AAAAA","ATTR1":"FSDF CC 10 ml ","DATE":"2021-06-11"}, {"HASH_KEY":"LulKYlm1qJaXFRq7oS1X1A==","SOURCE_KEY":"AAAAA","ATTR1":"BBB CC ","DATE":"2021-03-11"}, {"HASH_KEY":"LulKYlm1qJaXFRq7oS1X1A==","SOURCE_KEY":"AAAAA","ATTR1":"BBB DD ","DATE":"2021-02-27"}]|
|[{"HASH_KEY":"BK08ZMe/1UTHsenUAOMUwQ==","SOURCE_KEY":"BBBBB","ATTR1":"JAMES 50 ml ","DATE":"2021-03-02"}, {"HASH_KEY":"BK08ZMe/1UTHsenUAOMUwQ==","SOURCE_KEY":"BBBBB","ATTR1":"JAS 50 ml ","DATE":"2021-02-02"}]|
|null|
The DataFrame schema is:
root
|-- HISTORY: array (nullable = true)
| |-- element: string (containsNull = true)
The desired output is simply the JSON values in the HISTORY column flattened into individual columns:
|HASH_KEY |SOURCE_KEY|DATE |ATTR1 |
|:-----------------------|:--------:|:--------:|---------------:|
|LulKYlm1qJaXFRq7oS1X1A==|AAAAA |2021-06-11|FSDF CC 10 ml |
|LulKYlm1qJaXFRq7oS1X1A==|AAAAA |2021-03-11|BBB CC |
|LulKYlm1qJaXFRq7oS1X1A==|AAAAA |2021-02-27|BBB DD |
|BK08ZMe/1UTHsenUAOMUwQ==|BBBBB |2021-03-02|JAMES 50 ml |
|BK08ZMe/1UTHsenUAOMUwQ==|BBBBB |2021-02-02|JAS 50 ml |
|CAsaZMe/1UTHsenUasasaW==|BBBBB |2021-09-11|null |
The code snippet I tried:
import re
import json

from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, TimestampType

schema = ArrayType(
    StructType(
        [
            StructField("HASH_KEY1", StringType()),
            StructField("SOURCE_KEY1", StringType()),
            StructField("ATTR1X", StringType()),
            StructField("DATE1", TimestampType())
        ]
    )
)

@f.udf(returnType=schema)
def parse_col(column):
    updated_values = []
    for it in re.finditer(r'[.*?]', column):
        parse = json.loads(it.group())
        for key, values in parse.items():
            for value in values:
                value['HASH_KEY1'] = key
                updated_values.append(value)
    return updated_values

df = df \
    .withColumn('tmp', parse_col(f.col('HISTORY'))) \
    .withColumn('tmp', f.explode(f.col('tmp'))) \
    .select(f.col('HASH_KEY'),
            f.col('tmp').HASH_KEY1.alias('HASH_KEY1'),
            f.col('tmp').SOURCE_KEY1.alias('SOURCE_KEY1'),
            f.col('tmp').ATTR1X.alias('ATTR1X'),
            f.col('tmp').DATE1.alias('DATE1'))
df.show()
The following is the result I got:
|HASH_KEY1|SOURCE_KEY1|ATTR1X|DATE1|
|:-------:|:---------:|:----:|----:|
|         |           |      |     |
|         |           |      |     |
I am having trouble getting the expected output. Any help would be greatly appreciated. I am using Spark 2.0+.
Thank you!
I understood the usage of json_tuple and simplified my approach: I can directly explode the array into a string and then use the json_tuple function to convert it into flattened columns.
So the answer snippet now looks as follows:
from pyspark.sql import functions as f
from pyspark.sql.functions import json_tuple

DF_EXPLODE = df \
    .withColumn('Expand', f.explode(f.col('HISTORY'))) \
    .select(f.col('Expand'))

DF_FLATTEN = DF_EXPLODE \
    .select("*", json_tuple("Expand", "HASH_KEY").alias("HASH_KEY")) \
    .select("*", json_tuple("Expand", "SOURCE_KEY").alias("SOURCE_KEY")) \
    .select("*", json_tuple("Expand", "DATE").alias("DATE")) \
    .select("*", json_tuple("Expand", "ATTR1").alias("ATTR1"))
I also worked on my initial PySpark looping approach; the following is the code:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

DF_DIM2 = DF_DIM.withColumn("sizer", size(col('HISTORY'))).sort("sizer", ascending=False)
max_len = DF_DIM2.select('sizer').take(1)[0][0]
print(max_len)

expanded_df = DF_DIM.select(['*'] + [col('HISTORY')[i].alias(f'HISTORY_{i}') for i in range(max_len)])
original_cols = [i for i in expanded_df.columns if 'HISTORY_' not in i]
cols_exp = [i for i in expanded_df.columns if 'HISTORY_' in i]

schema = StructType([
    StructField("HASH_KEY", StringType(), True),
    StructField("SOURCE_KEY", StringType(), True),
    StructField("DATE", StringType(), True),
    StructField("ATTR1", StringType(), True)
])

final_df = expanded_df.select([from_json(i, schema).alias(i) for i in cols_exp])
I did some use-case testing where I joined a 3.72 billion row fact Parquet with a 390k row Type 4 nested dimension Parquet: the looping approach took 2.5 minutes while the explode option took over 4 minutes.
The explode option multiplies each Type 4 record by the number of changes recorded in its HISTORY column. So if, on average, every dimension changed 10 times, then 390k*10=3.9M records are held in memory to join with the fact, leading to longer processing times.

Not able to split the column into multiple columns in Spark Dataframe

I am not able to split the column into multiple columns in a Spark DataFrame or through an RDD.
I tried some other code, but it works only with a fixed number of columns.
Example:
The data types are name: string, city: list(string).
I have a text file and the input data is like below:
Name, city
A, (hyd,che,pune)
B, (che,bang,del)
C, (hyd)
The required output is:
A,hyd
A,che
A,pune
B,che
B,bang
B,del
C,hyd
After reading the text file and converting it to a DataFrame, the data frame will look like below:
scala> data.show
+----------------+
|           value|
+----------------+
|      Name, city|
|A,(hyd,che,pune)|
|B,(che,bang,del)|
|         C,(hyd)|
|  D,(hyd,che,tn)|
+----------------+
You can use the explode function on your DataFrame:
val explodeDF = inputDF.withColumn("city", explode($"city"))
explodeDF.show()
http://sqlandhadoop.com/spark-dataframe-explode/
Now that I understand you're loading the full line as a string, here is how to achieve your output.
I have defined two user-defined functions:
val split_to_two_strings: String => Array[String] = _.split(",", 2) // first split the input into two elements, which become the two columns (name, city)
val custom_conv_to_Array: String => Array[String] = _.stripPrefix("(").stripSuffix(")").split(",") // strip "(" and ")", then convert to a list of cities
import org.apache.spark.sql.functions.{explode, trim, udf}

val custom_conv_to_ArrayUDF = udf(custom_conv_to_Array)
val split_to_two_stringsUDF = udf(split_to_two_strings)

val outputDF = inputDF.withColumn("tmp", split_to_two_stringsUDF($"value"))
  .select($"tmp".getItem(0).as("Name"), trim($"tmp".getItem(1)).as("city_list"))
  .withColumn("city_array", custom_conv_to_ArrayUDF($"city_list"))
  .drop($"city_list")
  .withColumn("city", explode($"city_array"))
  .drop($"city_array")

outputDF.show()
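For readers working in PySpark rather than Scala, here is a rough sketch of the same idea (split the line once, strip the parentheses, then explode). The column name value and the illustrative rows are assumptions, and the limit argument of split requires Spark 3.0+.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, regexp_replace, split, trim

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for the raw lines loaded as a single string column.
inputDF = spark.createDataFrame(
    [("A,(hyd,che,pune)",), ("B,(che,bang,del)",), ("C,(hyd)",)], ["value"]
)

outputDF = (
    inputDF
    .withColumn("tmp", split(col("value"), ",", 2))                          # name | "(city,city,...)"
    .select(col("tmp").getItem(0).alias("Name"),
            trim(col("tmp").getItem(1)).alias("city_list"))
    .withColumn("city_list", regexp_replace(col("city_list"), r"[()]", ""))  # drop the parentheses
    .withColumn("city", explode(split(col("city_list"), ",")))
    .drop("city_list")
)
outputDF.show()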
Hope this helps

Splitting Kafka Message Line by line in Spark Structured Streaming

I want to read a message from a Kafka topic into a data frame in my Spark Structured Streaming job, but I am getting the entire message in one offset, so in the data frame the whole message ends up in a single row instead of multiple rows (in my case it should be 3 rows).
When I print the message, the lines "Text1", "Text2" and "Text3" all appear together in one value. I want them in 3 separate rows in the data frame so that I can process them further.
Please help me.
You can use a user-defined function (UDF) to convert the message string into a sequence of strings, and then apply the explode function on that column to create a new row for each element in the sequence, as illustrated below (in Scala; the same principle applies to PySpark):
case class KafkaMessage(offset: Long, message: String)
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.explode
val df = sc.parallelize(List(KafkaMessage(1000, "Text1\nText2\nText3"))).toDF()
val splitString = udf { s: String => s.split('\n') }
df.withColumn("splitMsg", explode(splitString($"message")))
.select("offset", "splitMsg")
.show()
This will yield the following output:
+------+--------+
|offset|splitMsg|
+------+--------+
| 1000| Text1|
| 1000| Text2|
| 1000| Text3|
+------+--------+
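For a PySpark version of the same idea, the built-in split function can stand in for the UDF. A minimal sketch; the DataFrame construction here is just an illustrative stand-in for the Kafka source:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.getOrCreate()

# One offset whose message carries three newline-separated lines.
df = spark.createDataFrame([(1000, "Text1\nText2\nText3")], ["offset", "message"])

df.withColumn("splitMsg", explode(split(col("message"), "\n"))) \
  .select("offset", "splitMsg") \
  .show()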

How to convert column of MapType(StringType, StringType) into StringType?

So I have this streaming dataframe and I'm trying to cast this 'customer_ids' column to a simple string.
schema = StructType()\
    .add("customer_ids", MapType(StringType(), StringType()))\
    .add("date", TimestampType())

original_sdf = spark.readStream.option("maxFilesPerTrigger", 800)\
    .load(path=source, format="parquet", schema=schema)\
    .select('customer_ids', 'date')
The intent of this conversion is to group by this column and aggregate by max(date), like this:
original_sdf.groupBy('customer_ids')\
    .agg(max('date')) \
    .writeStream \
    .trigger(once=True) \
    .format("memory") \
    .queryName('query') \
    .outputMode("complete") \
    .start()
but I got this exception:
AnalysisException: u'expression `customer_ids` cannot be used as a grouping expression because its data type map<string,string> is not an orderable data type.
How can I cast this kind of streaming DataFrame column, or is there any other way to group by this column?
TL;DR Use the getItem method to access the values per key in a MapType column.
The real question is what key(s) you want to group by, since a MapType column can have a variety of keys. Every key can become a column holding the values from the map column.
You can access keys using the Column.getItem method (or similar Python voodoo):
getItem(key: Any): Column — An expression that gets an item at position ordinal out of an array, or gets a value by key key in a MapType.
(I use Scala and leave converting it to PySpark as a home exercise; a sketch follows the example below.)
val ds = Seq(Map("hello" -> "world")).toDF("m")
scala> ds.show(false)
+-------------------+
|m |
+-------------------+
|Map(hello -> world)|
+-------------------+
scala> ds.select($"m".getItem("hello") as "hello").show
+-----+
|hello|
+-----+
|world|
+-----+
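A rough PySpark sketch of that home exercise, including the group-by the question asked about. The key "hello" and the single-row data are placeholders, not the questioner's schema:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder data with the same shape as the Scala example above.
ds = spark.createDataFrame([({"hello": "world"},)], ["m"])

# Pull one value out of the map by key, then group on that plain string column.
ds.select(F.col("m").getItem("hello").alias("hello")) \
  .groupBy("hello") \
  .count() \
  .show()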

Spark 1.6: filtering DataFrames generated by describe()

The problem arises when I call describe function on a DataFrame:
val statsDF = myDataFrame.describe()
Calling describe function yields the following output:
statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]
I can show statsDF normally by calling statsDF.show():
+-------+------------------+
|summary| count|
+-------+------------------+
| count| 53173|
| mean|104.76128862392568|
| stddev|3577.8184333911513|
| min| 1|
| max| 558407|
+-------+------------------+
I would now like to get the standard deviation and the mean from statsDF, but when I try to collect the values by doing something like:
val temp = statsDF.where($"summary" === "stddev").collect()
I am getting a Task not serializable exception.
I am also facing the same exception when I call:
statsDF.where($"summary" === "stddev").show()
It looks like we cannot filter DataFrames generated by the describe() function?
I considered a toy dataset I had containing some health disease data:
val stddev_tobacco = rawData.describe().rdd.map {
  case r: Row => (r.getAs[String]("summary"), r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect
You can select from the dataframe:
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
| AVG(uniform)| MIN(uniform)| MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
You can also register it as a table and query the table:
val t = x.describe()
t.registerTempTable("dt")
%sql
select * from dt
Another option would be to use selectExpr() which also runs optimized, e.g. to obtain the min:
myDataFrame.selectExpr('MIN(count)').head()[0]
myDataFrame.describe().filter($"summary" === "stddev").show()
This worked quite nicely on Spark 2.3.0
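Putting the ideas together in PySpark, here is a sketch (assuming the numeric column is named count, as in the question) for pulling the mean and standard deviation out of describe() as plain Python values:
# Collect describe() locally and index the rows by their "summary" label.
stats = {row["summary"]: row["count"] for row in myDataFrame.describe("count").collect()}

mean_value = float(stats["mean"])
stddev_value = float(stats["stddev"])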
