Pyspark transform key-value pairs into columns - apache-spark

I have a json file, which contains data that looks like this:
"Url": "https://sample.com", "Method": "POST", "Headers": [{"Key": "accesstoken", "Value": ["123"]}, {"Key": "id", "Value": ["abc"]}, {"Key": "context", "Value": ["sample"]}]
When reading the json, I explicitly define the schema as:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType(
    [
        StructField('Url', StringType(), True),
        StructField('Method', StringType(), True),
        StructField('Headers', ArrayType(StructType([
            StructField('Key', StringType(), True),
            StructField('Value', ArrayType(StringType()), True),
        ]), True), True)
    ]
)
The goal is to read the Key-Value data as columns instead of rows.
+------------------+------+-----------+---+-------+
|Url               |Method|accesstoken|id |context|
+------------------+------+-----------+---+-------+
|https://sample.com|POST  |123        |abc|sample |
+------------------+------+-----------+---+-------+
Exploding the "Headers" column only transforms it into multiple rows. Another problem with the data is that, instead of a literal key-value pair (e.g. "accesstoken": "123"), each key and its value are stored in two separate fields, "Key" and "Value".
I tried to iterate over the values to create a map first, but I am not able to iterate through the "Headers" column.
df_map = df.withColumn('map', to_json(array(*[create_map(element.Key, element.Value) for element in df.Headers])))
I also tried to read the "Headers" column as MapType(StringType, ArrayType(StringType)), but it could not read the values; the column just showed null when I did that.
Is there any way to achieve this? Do I have to read the data as plain text and pre-process it instead of using the dataframe?

You were on the right track, but to concatenate your maps you must use a reduce expression:
from pyspark.sql.types import *
import pyspark.sql.functions as f

# [...] Your dataframe initialization
df = df.select('Url', 'Method', f.explode(f.expr('REDUCE(Headers, cast(map() as map<string, array<string>>), (acc, el) -> map_concat(acc, map(el.Key, el.Value)))')))

# Transform key:value into columns
df_pivot = (df
            .groupBy('Url', 'Method')
            .pivot('key')
            .agg(f.first('value')))

array_columns = [column for column, _type in df_pivot.dtypes if _type.startswith('array')]

df_pivot = (df_pivot
            .withColumn('zip', f.explode(f.arrays_zip(*array_columns)))
            .select('Url', 'Method', 'zip.*'))

df_pivot.show(truncate=False)
Output
+------------------+------+-----------+-------+---+
|Url               |Method|accesstoken|context|id |
+------------------+------+-----------+-------+---+
|https://sample.com|POST  |123        |sample |abc|
+------------------+------+-----------+-------+---+
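If you are on Spark 2.4 or later, a simpler variant of the first step is map_from_entries, which builds the map directly from the array of Key/Value structs. This is only a sketch under that assumption; the pivot and arrays_zip steps stay the same as above.
import pyspark.sql.functions as f

# Sketch, assuming Spark 2.4+: map_from_entries turns array<struct<Key, Value>>
# into map<string, array<string>> without the REDUCE expression.
df_alt = df.select('Url', 'Method', f.explode(f.map_from_entries('Headers')))
# df_alt now has the columns Url, Method, key, value -- continue with the same
# pivot / arrays_zip steps shown above.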

Related

Create dataframe from a dictionary in python when the key is a byte?

I want to create a dataframe from specific data inside a dictionary. The value is a byte string, and I don't understand how to get the information "out of the bytes in a useful way". If I can get the data I need into a dataframe, I would know how to handle it (sort, plot, etc.).
I have this dictionary:
{'SequenceNumber': 2654504175, 'Offset': '67826126730624', 'EnqueuedTimeUtc': '7/10/2020 1:18:00 PM', 'SystemProperties': {}, 'Properties': {}, 'Body': b'{"id": "MicroSCADA OPC DA.S_M.APL.1.P.P_R_P.1", "ts": "2020-07-10T13:17:24.654000", "value": 1.1293551921844482, "status_code": 0}'}
It is a result from reading one datapoint in an avro file. The data I need is inside 'Body'.
I go:
x = my_dict.get("Body")
the result is:
b'{"id": "MicroSCADA OPC DA.S_M.APL.1.P.P_R_P.1", "ts": "2020-07-10T13:17:24.654000", "value": 1.1293551921844482, "status_code": 0}'
I would like to sort the data into a dataframe with columns "id", "ts", "value", and "status_code". How can I do this?
I have also tried pandavro, but the byte string still "locks" the data I need together. I have tried converting the bytes to a string, but then the keys and their values no longer naturally belong together.
What is the best way to solve this?
Convert the byte string to a string, evaluate it and convert it into a DataFrame:
import pandas as pd

dic = {'SequenceNumber': 2654504175, 'Offset': '67826126730624', 'EnqueuedTimeUtc': '7/10/2020 1:18:00 PM', 'SystemProperties': {}, 'Properties': {}, 'Body': b'{"id": "MicroSCADA OPC DA.S_M.APL.1.P.P_R_P.1", "ts": "2020-07-10T13:17:24.654000", "value": 1.1293551921844482, "status_code": 0}'}
d = eval(dic["Body"].decode("utf-8"))
df = pd.DataFrame([list(d.values())], columns=list(d.keys()))
df
Output:
id ts value status_code
0 MicroSCADA OPC DA.S_M.APL.1.P.P_R_P.1 2020-07-10T13:17:24.654000 1.129355 0
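A safer variant, if you prefer not to run eval on the payload: the Body content is valid JSON, so json.loads can parse it directly. This is a sketch reusing the dic from above.
import json
import pandas as pd

# Parse the JSON payload instead of evaluating it as Python code
d = json.loads(dic["Body"].decode("utf-8"))
df = pd.DataFrame([d])  # one row; the keys become the columns id, ts, value, status_code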

Spark dataframe to nested JSON

I have a dataframe joinDf created from joining the following four dataframes on userId:
val detailsDf = Seq((123,"first123","xyz"))
.toDF("userId","firstName","address")
val emailDf = Seq((123,"abc#gmail.com"),
(123,"def#gmail.com"))
.toDF("userId","email")
val foodDf = Seq((123,"food2",false,"Italian",2),
(123,"food3",true,"American",3),
(123,"food1",true,"Mediterranean",1))
.toDF("userId","foodName","isFavFood","cuisine","score")
val gameDf = Seq((123,"chess",false,2),
(123,"football",true,1))
.toDF("userId","gameName","isOutdoor","score")
val joinDf = detailsDf
.join(emailDf, Seq("userId"))
.join(foodDf, Seq("userId"))
.join(gameDf, Seq("userId"))
User's food and game favorites should be ordered by score in the ascending order.
I am trying to create a result from this joinDf where the JSON looks like the following:
[
  {
    "userId": "123",
    "firstName": "first123",
    "address": "xyz",
    "UserFoodFavourites": [
      {
        "foodName": "food1",
        "isFavFood": "true",
        "cuisine": "Mediterranean"
      },
      {
        "foodName": "food2",
        "isFavFood": "false",
        "cuisine": "Italian"
      },
      {
        "foodName": "food3",
        "isFavFood": "true",
        "cuisine": "American"
      }
    ],
    "UserEmail": [
      "abc#gmail.com",
      "def#gmail.com"
    ],
    "UserGameFavourites": [
      {
        "gameName": "football",
        "isOutdoor": "true"
      },
      {
        "gameName": "chess",
        "isOutdoor": "false"
      }
    ]
  }
]
Should I use joinDf.groupBy().agg(collect_set())?
Any help would be appreciated.
My solution is based on the answers found here and here
It uses a Window function. It shows how to create a nested list of food preferences for a given userId based on the food score. Here we are creating a struct of FoodDetails from the columns we have:
val foodModifiedDf = foodDf.withColumn("FoodDetails",
struct("foodName","isFavFood", "cuisine","score"))
.drop("foodName","isFavFood", "cuisine","score")
println("Just printing the food details here")
foodModifiedDf.show(10, truncate = false)
Here we are creating a window function which accumulates the list for a userId based on FoodDetails.score in descending order. As the window function is applied, it keeps accumulating the list as it encounters new rows with the same userId. After the accumulation is done, we have to do a groupBy over the userId to select the largest list.
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("userId").orderBy( desc("FoodDetails.score"))
val userAndFood = detailsDf.join(foodModifiedDf, "userId")
val newUF = userAndFood.select($"*", collect_list("FoodDetails").over(window) as "FDNew")
println(" UserAndFood dataframe after windowing function applied")
newUF.show(10, truncate = false)
val resultUF = newUF.groupBy("userId")
.agg(max("FDNew"))
println("Final result after select the maximum length list")
resultUF.show(10, truncate = false)
This is what the result finally looks like:
+------+-----------------------------------------------------------------------------------------+
|userId|max(FDNew) |
+------+-----------------------------------------------------------------------------------------+
|123 |[[food3, true, American, 3], [food2, false, Italian, 2], [food1, true, Mediterranean, 1]]|
+------+-----------------------------------------------------------------------------------------+
Given this dataframe, it should be easier to write out the nested json.
The main problem with joining before grouping and collecting lists is that the join produces a lot of records for the groupBy to collapse; in your example it is 12 records after the join and before the groupBy. You would also need to worry about picking "firstName" and "address" from detailsDf out of 12 duplicates. To avoid both problems, you can pre-process the food, email and game dataframes using struct and groupBy, and then join them to detailsDf with no risk of exploding your data due to multiple records with the same userId in the joined tables.
val detailsDf = Seq((123,"first123","xyz"))
.toDF("userId","firstName","address")
val emailDf = Seq((123,"abc#gmail.com"),
(123,"def#gmail.com"))
.toDF("userId","email")
val foodDf = Seq((123,"food2",false,"Italian",2),
(123,"food3",true,"American",3),
(123,"food1",true,"Mediterranean",1))
.toDF("userId","foodName","isFavFood","cuisine","score")
val gameDf = Seq((123,"chess",false,2),
(123,"football",true,1))
.toDF("userId","gameName","isOutdoor","score")
val emailGrp = emailDf.groupBy("userId").agg(collect_list("email").as("UserEmail"))
val foodGrp = foodDf
.select($"userId", struct("score", "foodName","isFavFood","cuisine").as("UserFoodFavourites"))
.groupBy("userId").agg(sort_array(collect_list("UserFoodFavourites")).as("UserFoodFavourites"))
val gameGrp = gameDf
.select($"userId", struct("gameName","isOutdoor","score").as("UserGameFavourites"))
.groupBy("userId").agg(collect_list("UserGameFavourites").as("UserGameFavourites"))
val result = detailsDf.join(emailGrp, Seq("userId"))
.join(foodGrp, Seq("userId"))
.join(gameGrp, Seq("userId"))
result.show(100, false)
Output:
+------+---------+-------+------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------+
|userId|firstName|address|UserEmail |UserFoodFavourites |UserGameFavourites |
+------+---------+-------+------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------+
|123 |first123 |xyz |[abc#gmail.com, def#gmail.com]|[[1, food1, true, Mediterranean], [2, food2, false, Italian], [3, food3, true, American]]|[[chess, false, 2], [football, true, 1]]|
+------+---------+-------+------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------+
As all the groupBy operations and joins are done on userId, Spark will optimise this quite well.
UPDATE 1: After #user238607 pointed out that I had missed the original requirement of the food preferences being sorted by score, I did a quick fix: I placed the score column as the first element of the UserFoodFavourites struct and used the sort_array function to arrange the data in the desired order without forcing an extra shuffle operation. The code and its output have been updated accordingly.

Defining DataFrame Schema for a table with 1500 columns in Spark

I have a table with around 1500 columns in SQL Server. I need to read the data from this table, convert it to the proper datatype format, and then insert the records into an Oracle DB.
What is the best way to define the schema for this type of table with more than 1500 columns? Is there any option other than hard-coding the column names along with the datatypes?
Using Case class
Using StructType.
Spark Version used is 1.4
For this type of requirement, I'd suggest the case class approach to prepare a dataframe.
Yes, there are some limitations, like product arity, but we can overcome them.
You can do it as in the example below for Scala versions < 2.11:
Prepare a case class which extends Product and overrides its methods, namely:
productArity(): Int: returns the number of attributes. In our case, it's 33.
productElement(n: Int): Any: given an index, returns the attribute. As protection, we also have a default case which throws an IndexOutOfBoundsException.
canEqual(that: Any): Boolean: the last of the three functions; it serves as a boundary condition when an equality check is done against the class.
For an example implementation, you can refer to this Student case class, which has 33 fields in it.
An example student dataset description is available here.
Another option: use StructType to define the schema and create the dataframe (if you don't want to use the spark-csv API).
The options for reading a table with 1500 columns
1) Using a case class
A case class would not work because it is limited to 22 fields (for Scala versions < 2.11).
2) Using StructType
You can use StructType to define the schema and create the dataframe.
3) Using the spark-csv package
You can use the spark-csv package with .option("inferSchema", "true"), which will automatically infer the schema from the file.
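For illustration, a minimal PySpark sketch of option 3, assuming the table has been exported to a hypothetical CSV file named table_dump.csv and the spark-csv package is on the classpath (Spark 1.4-era API):
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="schema-inference-sketch")
sqlContext = SQLContext(sc)

# spark-csv infers the column types from the data when inferSchema is enabled
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("table_dump.csv"))
df.printSchema()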
You can have your schema with hundreds of columns in JSON format, and then read this JSON file to construct your custom schema.
For example, your schema JSON could be:
[
{
"columnType": "VARCHAR",
"columnName": "NAME",
"nullable": true
},
{
"columnType": "VARCHAR",
"columnName": "AGE",
"nullable": true
},
.
.
.
]
Now you can read the JSON and parse it into a case class to form the StructType.
case class Field(columnName: String, columnType: String, nullable: Boolean)
You can create a Map from the column-type strings in the JSON schema (the Oracle types) to the corresponding Spark DataTypes.
val dataType = Map(
"VARCHAR" -> StringType,
"NUMERIC" -> LongType,
"TIMESTAMP" -> TimestampType,
.
.
.
)
// assumes json4s for JSON parsing
import scala.io.Source
import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val formats = DefaultFormats

def parseJsonForSchema(jsonFilePath: String): StructType = {
  val jsonString = Source.fromFile(jsonFilePath).mkString
  val parsedJson = parse(jsonString)
  val fields = parsedJson.extract[List[Field]]
  val schemaColumns = fields.map(field => StructField(field.columnName, dataType(field.columnType), field.nullable))
  StructType(schemaColumns)
}

Spark SQL expand array to multiple columns

I am storing JSON messages in S3 for each row update from an Oracle source.
The JSON structure is as below:
{
    "tableName": "ORDER",
    "action": "UPDATE",
    "timeStamp": "2016-09-04 20:05:08.000000",
    "uniqueIdentifier": "31200477027942016-09-05 20:05:08.000000",
    "columnList": [{
        "columnName": "ORDER_NO",
        "newValue": "31033045",
        "oldValue": ""
    }, {
        "columnName": "ORDER_TYPE",
        "newValue": "N/B",
        "oldValue": ""
    }]
}
I am using Spark SQL to find the latest record for each key based on the max value of the unique identifier.
columnList is an array with the list of columns for the table. I want to join multiple tables and fetch the records which are latest.
How can I join the columns from the JSON array of one table with columns from another table? Is there a way to explode the JSON array into multiple columns? For example, the above JSON would have ORDER_NO as one column and ORDER_TYPE as another column. How can I create a dataframe with multiple columns based on the columnName field?
For example, the new RDD should have the columns (tableName, action, timeStamp, uniqueIdentifier, ORDER_NO, ORDER_TYPE).
The values of the ORDER_NO and ORDER_TYPE fields should be mapped from the newValue field in the JSON.
I found a solution for this by programmatically creating the schema using the RDD APIs.
Dataset<Row> dataFrame = spark.read().json(inputPath);
dataFrame.printSchema();
JavaRDD<Row> rdd = dataFrame.toJavaRDD();
SchemaBuilder schemaBuilder = new SchemaBuilder();
// get the schema column names in appended format
String columnNames = schemaBuilder.populateColumnSchema(rdd.first(), dataFrame.columns());
SchemaBuilder is a custom class which takes the RDD details and returns the column names as a delimiter-separated string.
Then, using a RowFactory.create call, the JSON values are mapped to the schema.
Doc reference http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
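If the higher-level DataFrame API is an option, an alternative sketch in PySpark (not the SchemaBuilder approach above) is to explode columnList and pivot on columnName, so that each entry becomes its own column. It assumes the same inputPath used above.
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("expand-columnlist-sketch").getOrCreate()
df = spark.read.json(inputPath)  # inputPath: same JSON location as in the answer above

# one row per (record, columnList entry)
exploded = df.select('tableName', 'action', 'timeStamp', 'uniqueIdentifier',
                     f.explode('columnList').alias('col'))
flattened = exploded.select('tableName', 'action', 'timeStamp', 'uniqueIdentifier',
                            f.col('col.columnName').alias('columnName'),
                            f.col('col.newValue').alias('newValue'))
# each distinct columnName (e.g. ORDER_NO, ORDER_TYPE) becomes its own column
pivoted = (flattened
           .groupBy('tableName', 'action', 'timeStamp', 'uniqueIdentifier')
           .pivot('columnName')
           .agg(f.first('newValue')))
pivoted.show(truncate=False)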

Spark dataframes convert nested JSON to separate columns

I've a stream of JSONs with the following structure that gets converted to a dataframe:
{
    "a": 3936,
    "b": 123,
    "c": "34",
    "attributes": {
        "d": "146",
        "e": "12",
        "f": "23"
    }
}
The dataframe show function results in the following output:
sqlContext.read.json(jsonRDD).show
+----+-----------+---+---+
| a| attributes| b| c|
+----+-----------+---+---+
|3936|[146,12,23]|123| 34|
+----+-----------+---+---+
How can I split the attributes column (nested JSON structure) into attributes.d, attributes.e and attributes.f as separate columns in a new dataframe, so that I have the columns a, b, c, attributes.d, attributes.e and attributes.f in the new dataframe?
If you want columns named from a to f:
df.select("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")
If you want columns named with attributes. prefix:
df.select($"a", $"b", $"c", $"attributes.d" as "attributes.d", $"attributes.e" as "attributes.e", $"attributes.f" as "attributes.f")
If names of your columns are supplied from an external source (e.g. configuration):
val colNames = Seq("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")
df.select(colNames.head, colNames.tail: _*).toDF(colNames:_*)
Using the attributes.d notation, you can create new columns and you will have them in your DataFrame. Look at the withColumn() method in Java.
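For illustration, a minimal PySpark sketch of that idea (the answer refers to the Java withColumn() method, but the call is analogous); it assumes the dataframe read above is named df and uses underscores in the new column names to avoid dotted names:
import pyspark.sql.functions as f

# pull each nested field up into a top-level column, then drop the struct
df2 = (df
       .withColumn('attributes_d', f.col('attributes.d'))
       .withColumn('attributes_e', f.col('attributes.e'))
       .withColumn('attributes_f', f.col('attributes.f'))
       .drop('attributes'))
df2.show()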
Use Python:
Extract the DataFrame using the pandas library.
Change the data type from 'str' to 'dict'.
Get the values of each feature.
Save the results to a new file.
import pandas as pd
data = pd.read_csv("data.csv") # load the csv file from your disk
json_data = data['Desc'] # get the DataFrame of Desc
data = data.drop('Desc', 1) # delete Desc column
Total, Defective = [], [] # set up the output lists
for i in json_data:
i = eval(i) # change the data type from 'str' to 'dict'
Total.append(i['Total']) # append 'Total' feature
Defective.append(i['Defective']) # append 'Defective' feature
# finally, complete the DataFrame
data['Total'] = Total
data['Defective'] = Defective
data.to_csv("result.csv") # save to the result.csv and check it
