How to create a JSON list using PySpark? - python-3.x

I am trying to create a JSON file with the below structure using PySpark.
Target Output:
[{
    "Loaded_data": [{
        "Loaded_numeric_columns": ["id", "val"],
        "Loaded_category_columns": ["name", "branch"]
    }],
    "enriched_data": [{
        "enriched_category_columns": ["country__4"],
        "enriched_index_columns": ["id__1", "val__3"]
    }]
}]
I was able to create a list for each section; please refer to the code below. I'm kind of stuck here, could you please help?
Sample data:
input_data = spark.read.csv("/tmp/test234.csv", header=True, inferSchema=True)

def is_numeric(data_type):
    return data_type not in ('date', 'string', 'boolean')

def is_nonnumeric(data_type):
    return data_type == 'string'

sub = "__"
Loaded_numeric_columns = [name for name, data_type in input_data.dtypes if is_numeric(data_type) and (sub not in name)]
print(Loaded_numeric_columns)
Loaded_category_columns = [name for name, data_type in input_data.dtypes if is_nonnumeric(data_type) and (sub not in name)]
print(Loaded_category_columns)
enriched_category_columns = [name for name, data_type in input_data.dtypes if is_nonnumeric(data_type) and (sub in name)]
print(enriched_category_columns)
enriched_index_columns = [name for name, data_type in input_data.dtypes if is_numeric(data_type) and (sub in name)]
print(enriched_index_columns)

You can just create the new columns with struct and array:
from pyspark.sql import functions as F
df.show()
+---+-----+-------+------+----------+-----+-------+
| id| val| name|branch|country__4|id__1| val__3|
+---+-----+-------+------+----------+-----+-------+
| 1|67.87|Shankar| a| 1|67.87|Shankar|
+---+-----+-------+------+----------+-----+-------+
df.select(
    F.struct(
        F.array(F.col("id"), F.col("val")).alias("Loaded_numeric_columns"),
        F.array(F.col("name"), F.col("branch")).alias("Loaded_category_columns"),
    ).alias("Loaded_data"),
    F.struct(
        F.array(F.col("country__4")).alias("enriched_category_columns"),
        F.array(F.col("id__1"), F.col("val__3")).alias("enriched_index_columns"),
    ).alias("enriched_data"),
).printSchema()
root
|-- Loaded_data: struct (nullable = false)
| |-- Loaded_numeric_columns: array (nullable = false)
| | |-- element: double (containsNull = true)
| |-- Loaded_category_columns: array (nullable = false)
| | |-- element: string (containsNull = true)
|-- enriched_data: struct (nullable = false)
| |-- enriched_category_columns: array (nullable = false)
| | |-- element: long (containsNull = true)
| |-- enriched_index_columns: array (nullable = false)
| | |-- element: string (containsNull = true)
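To actually get the JSON file the question asks for, write that same select out instead of only printing its schema. A minimal sketch (the output path and the coalesce(1) are illustrative, not required):
from pyspark.sql import functions as F

result = df.select(
    F.struct(
        F.array(F.col("id"), F.col("val")).alias("Loaded_numeric_columns"),
        F.array(F.col("name"), F.col("branch")).alias("Loaded_category_columns"),
    ).alias("Loaded_data"),
    F.struct(
        F.array(F.col("country__4")).alias("enriched_category_columns"),
        F.array(F.col("id__1"), F.col("val__3")).alias("enriched_index_columns"),
    ).alias("enriched_data"),
)

# Each row becomes one JSON object; coalesce(1) keeps everything in a single part file.
result.coalesce(1).write.mode("overwrite").json("/tmp/test234_out")

# Or inspect the JSON string for a single row on the driver:
print(result.toJSON().first())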

Related

Pyspark - Expand column with struct of arrays into new columns

I have a DataFrame with a single column which is a struct type and contains an array.
users_tp_df.printSchema()
root
|-- x: struct (nullable = true)
| |-- ActiveDirectoryName: string (nullable = true)
| |-- AvailableFrom: string (nullable = true)
| |-- AvailableFutureAllocation: long (nullable = true)
| |-- AvailableFutureHours: double (nullable = true)
| |-- CreateDate: string (nullable = true)
| |-- CurrentAllocation: long (nullable = true)
| |-- CurrentAvailableHours: double (nullable = true)
| |-- CustomFields: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Name: string (nullable = true)
| | | |-- Type: string (nullable = true)
| | | |-- Value: string (nullable = true)
I'm trying to convert the CustomFields array column into three columns:
Country;
isExternal;
Service.
So for example, I have these values:
and the expected final dataframe output for that row will be:
Can anyone please help me in achieving this?
Thank you!
This would work:
from pyspark.sql import functions as F

initial_expansion = df.withColumn("id", F.monotonically_increasing_id()).select("id", "x.*")
final_df = initial_expansion.join(
    initial_expansion.withColumn("CustomFields", F.explode("CustomFields"))
                     .select("*", "CustomFields.*")
                     .groupBy("id").pivot("Name").agg(F.first("Value")),
    "id"
).drop("CustomFields")
Sample Input:
Json - {'x': {'CurrentAvailableHours': 2, 'CustomFields': [{'Name': 'Country', 'Value': 'Italy'}, {'Name': 'Service', 'Value':'Dev'}]}}
Input Structure:
root
|-- x: struct (nullable = true)
| |-- CurrentAvailableHours: integer (nullable = true)
| |-- CustomFields: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Name: string (nullable = true)
| | | |-- Value: string (nullable = true)
Output:
Output Structure (Id can be dropped):
root
|-- id: long (nullable = false)
|-- CurrentAvailableHours: integer (nullable = true)
|-- Country: string (nullable = true)
|-- Service: string (nullable = true)
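For reference, a minimal self-contained sketch of that recipe run against the sample input above (reproducing the input via sparkContext.parallelize is just one convenient way; load your real data however you normally do):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Reproduce the sample input as a one-row DataFrame with the nested x struct.
sample = ('{"x": {"CurrentAvailableHours": 2, "CustomFields": '
          '[{"Name": "Country", "Value": "Italy"}, {"Name": "Service", "Value": "Dev"}]}}')
df = spark.read.json(spark.sparkContext.parallelize([sample]))

# Same recipe as above: expand x, explode CustomFields, pivot Name into columns.
initial_expansion = df.withColumn("id", F.monotonically_increasing_id()).select("id", "x.*")
pivoted = (initial_expansion
           .withColumn("CustomFields", F.explode("CustomFields"))
           .select("*", "CustomFields.*")
           .groupBy("id").pivot("Name").agg(F.first("Value")))
final_df = initial_expansion.join(pivoted, "id").drop("CustomFields")

final_df.show()  # columns: id, CurrentAvailableHours, Country, Service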
Considering the mockup structure below, similar to the one from your example,
you can do it the SQL way by using the inline function:
with alpha as (
    select named_struct(
        "alpha", "abc", "beta", 2.5, "gamma", 3,
        "delta", array(
            named_struct("a", "x", "b", "y", "c", "z"),
            named_struct("a", "xx", "b", "yy", "c", "zz")
        )
    ) root
)
select root.alpha, root.beta, root.gamma, inline(root.delta) as (a, b, c)
from alpha
(The result and the mockup structure were shown as screenshots in the original answer.)
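A PySpark sketch of the same inline-based idea, assuming a SparkSession named spark (inline is invoked through expr here):
from pyspark.sql import functions as F

# Build the same mockup structure in SQL, then explode delta with inline.
mock = spark.sql("""
    select named_struct(
        'alpha', 'abc', 'beta', 2.5, 'gamma', 3,
        'delta', array(named_struct('a', 'x', 'b', 'y', 'c', 'z'),
                       named_struct('a', 'xx', 'b', 'yy', 'c', 'zz'))
    ) as root
""")

mock.select("root.alpha", "root.beta", "root.gamma",
            F.expr("inline(root.delta)")).show()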

How to convert a list with structure like (key1, list(key2, value)) into a dataframe in pyspark?

I have a list of the type shown below:
[(key1, [(key11, value11), (key12, value12)]), (key2, [(key21, value21), (key22, value22)...])...]
A sample structure is shown below:
[('1052762305',
[('1007819788', 0.9206884810054885),
('1005886801', 0.913818268123084),
('1003863766', 0.9131746152849486),
('1007811435', 0.9128666156173751),
('1005879599', 0.9126368405937075),
('1003705572', 0.9122051062936369),
('1007804896', 0.9083424459788203),
('1005890270', 0.8982097535650703),
('1007806781', 0.8708761186829758),
('1003670458', 0.8452789033694487)]),
('1064808607',
[('1007804896', 0.9984397647563017),
('1003705572', 0.9970498347406341),
('1005879599', 0.9951581013190172),
('1007811435', 0.9934813787902085),
('1005886801', 0.9930572794622374),
('1003863766', 0.9928815742735568),
('1007819788', 0.9869723713790797),
('1005890270', 0.9642640856016443),
('1007806781', 0.9211558765137313),
('1003670458', 0.8519872445941068)])]
I want to convert this into a dataframe of the form
key1 key2 score
1052762305 1007819788 0.9206884810054885
1052762305 1005886801 0.913818268123084
1052762305 1003863766 0.9131746152849486
... ... ...
1064808607 1007804896 0.9984397647563017
1064808607 1003705572 0.9970498347406341
1064808607 1005879599 0.9951581013190172
... ... ...
How can we implement this in pyspark?
You can create a schema upfront for the input. Use explode and access the elements within the value struct.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField,StringType,ArrayType, DoubleType
spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()

schema = StructType([
    StructField("key1", StringType()),
    StructField("value", ArrayType(StructType([
        StructField("key2", StringType()),
        StructField("score", DoubleType())
    ])))
])
df = spark.createDataFrame(
[('1052762305',
[('1007819788', 0.9206884810054885),
('1005886801', 0.913818268123084),
('1003863766', 0.9131746152849486),
('1007811435', 0.9128666156173751),
('1005879599', 0.9126368405937075),
('1003705572', 0.9122051062936369),
('1007804896', 0.9083424459788203),
('1005890270', 0.8982097535650703),
('1007806781', 0.8708761186829758),
('1003670458', 0.8452789033694487)]),
('1064808607',
[('1007804896', 0.9984397647563017),
('1003705572', 0.9970498347406341),
('1005879599', 0.9951581013190172),
('1007811435', 0.9934813787902085),
('1005886801', 0.9930572794622374),
('1003863766', 0.9928815742735568),
('1007819788', 0.9869723713790797),
('1005890270', 0.9642640856016443),
('1007806781', 0.9211558765137313),
('1003670458', 0.8519872445941068)])
],schema
)
df.show()
+----------+--------------------+
| key1| value |
+----------+--------------------+
|1052762305|[[1007819788, 0.9...|
|1064808607|[[1007804896, 0.9...|
+----------+--------------------+
df.printSchema()
root
|-- key1: string (nullable = true)
|-- value: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key2: string (nullable = true)
| | |-- score: double (nullable = true)
df1=df.select('key1',F.explode('value').alias('value'))
df1.show()
+----------+--------------------+
| key1| value |
+----------+--------------------+
|1052762305|[1007819788, 0.92...|
|1052762305|[1005886801, 0.91...|
|1052762305|[1003863766, 0.91...|
|1052762305|[1007811435, 0.91...|
|1052762305|[1005879599, 0.91...|
|1052762305|[1003705572, 0.91...|
|1052762305|[1007804896, 0.90...|
|1052762305|[1005890270, 0.89...|
|1052762305|[1007806781, 0.87...|
|1052762305|[1003670458, 0.84...|
|1064808607|[1007804896, 0.99...|
|1064808607|[1003705572, 0.99...|
|1064808607|[1005879599, 0.99...|
|1064808607|[1007811435, 0.99...|
|1064808607|[1005886801, 0.99...|
|1064808607|[1003863766, 0.99...|
|1064808607|[1007819788, 0.98...|
|1064808607|[1005890270, 0.96...|
|1064808607|[1007806781, 0.92...|
|1064808607|[1003670458, 0.85...|
+----------+--------------------+
df1.printSchema()
root
|-- key1: string (nullable = true)
|-- value: struct (nullable = true)
| |-- key2: string (nullable = true)
| |-- score: double (nullable = true)
df1.select("key1", "value.key2","value.score").show()
+----------+----------+------------------+
| key1| key2| score|
+----------+----------+------------------+
|1052762305|1007819788|0.9206884810054885|
|1052762305|1005886801| 0.913818268123084|
|1052762305|1003863766|0.9131746152849486|
|1052762305|1007811435|0.9128666156173751|
|1052762305|1005879599|0.9126368405937075|
|1052762305|1003705572|0.9122051062936369|
|1052762305|1007804896|0.9083424459788203|
|1052762305|1005890270|0.8982097535650703|
|1052762305|1007806781|0.8708761186829758|
|1052762305|1003670458|0.8452789033694487|
|1064808607|1007804896|0.9984397647563017|
|1064808607|1003705572|0.9970498347406341|
|1064808607|1005879599|0.9951581013190172|
|1064808607|1007811435|0.9934813787902085|
|1064808607|1005886801|0.9930572794622374|
|1064808607|1003863766|0.9928815742735568|
|1064808607|1007819788|0.9869723713790797|
|1064808607|1005890270|0.9642640856016443|
|1064808607|1007806781|0.9211558765137313|
|1064808607|1003670458|0.8519872445941068|
+----------+----------+------------------+
You basically need to do the following:
create a dataframe from your list
promote the pairs from the elements of the array into separate rows by using explode
extract the key & value from each pair via select
This could be done by something like this (source data is in the variable called a):
from pyspark.sql.functions import explode, col
df = spark.createDataFrame(a, ['key1', 'val'])
df2 = df.select(col('key1'), explode(col('val')).alias('val'))
df3 = df2.select('key1', col('val')._1.alias('key2'), col('val')._2.alias('value'))
We can check that the schema & data match:
>>> df3.printSchema()
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- value: double (nullable = true)
>>> df3.show(2)
+----------+----------+------------------+
| key1| key2| value|
+----------+----------+------------------+
|1052762305|1007819788|0.9206884810054885|
|1052762305|1005886801| 0.913818268123084|
+----------+----------+------------------+
only showing top 2 rows
We can also check the schemas of the intermediate results:
>>> df.printSchema()
root
|-- key1: string (nullable = true)
|-- val: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: double (nullable = true)
>>> df2.printSchema()
root
|-- key1: string (nullable = true)
|-- val: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: double (nullable = true)
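If the last column should be called score, as in the target layout from the question, a final rename (just a sketch) finishes the job:
df3 = df3.withColumnRenamed('value', 'score')
df3.show(2)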

Modify a struct column in Spark dataframe

I have a PySpark dataframe which contains a column "student" as follows:
"student" : {
"name" : "kaleem",
"rollno" : "12"
}
The schema for this column in the dataframe is:
structType(List(
name: String,
rollno: String))
I need to modify this column as
"student" : {
"student_details" : {
"name" : "kaleem",
"rollno" : "12"
}
}
The schema for this column in the dataframe must be:
structType(List(
student_details:
structType(List(
name: String,
rollno: String))
))
How to do this in Spark?
Use the named_struct function to achieve this.
1. Read the JSON as a column
val data =
"""
| {
| "student": {
| "name": "kaleem",
| "rollno": "12"
| }
|}
""".stripMargin
val df = spark.read.json(Seq(data).toDS())
df.show(false)
println(df.schema("student"))
Output
+------------+
|student |
+------------+
|[kaleem, 12]|
+------------+
StructField(student,StructType(StructField(name,StringType,true), StructField(rollno,StringType,true)),true)
2. Change the schema using named_struct
val processedDf = df.withColumn("student",
expr("named_struct('student_details', student)")
)
processedDf.show(false)
println(processedDf.schema("student"))
Output
+--------------+
|student |
+--------------+
|[[kaleem, 12]]|
+--------------+
StructField(student,StructType(StructField(student_details,StructType(StructField(name,StringType,true), StructField(rollno,StringType,true)),true)),false)
For Python, step #2 will work as is; just remove val.
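A minimal PySpark sketch of that step, assuming the same df with a student struct column as in step 1:
from pyspark.sql.functions import expr

# Wrap the existing student struct inside a new student_details field.
processed_df = df.withColumn("student", expr("named_struct('student_details', student)"))
processed_df.printSchema()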
With a library called spark-hats (it extends the Spark DataFrame API with helpers for transforming fields inside nested structures and arrays of arbitrary levels of nesting), you can do a lot of these transformations.
scala> import za.co.absa.spark.hats.Extensions._
scala> df.printSchema
root
|-- ID: string (nullable = true)
scala> val df2 = df.nestedMapColumn("ID", "ID", c => struct(c as "alfa"))
scala> df2.printSchema
root
|-- ID: struct (nullable = false)
| |-- alfa: string (nullable = true)
scala> val df3 = df2.nestedMapColumn("ID.alfa", "ID.alfa", c => struct(c as "beta"))
scala> df3.printSchema
root
|-- ID: struct (nullable = false)
| |-- alfa: struct (nullable = false)
| | |-- beta: string (nullable = true)
Your query would be
df.nestedMapColumn("student", "student", c => struct(c as "student_details"))
Spark 3.1+
To modify struct type columns, we can use withField and dropFields
F.col("Student").withField("student_details", F.col("student"))
F.col("Student").dropFields("name", "rollno")
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame([(("kaleem", "12"),)], "student struct<name:string,rollno:string>")
df.printSchema()
# root
# |-- student: struct (nullable = true)
# | |-- name: string (nullable = true)
# | |-- rollno: string (nullable = true)
Script:
df = df.withColumn("student", F.col("Student")
.withField("student_details", F.col("student"))
.dropFields("name", "rollno")
)
Result:
df.printSchema()
# root
# |-- student: struct (nullable = true)
# | |-- student_details: struct (nullable = true)
# | | |-- name: string (nullable = true)
# | | |-- rollno: string (nullable = true)

Slice array of structs using column values

I want to use Spark slice function with start and length defined as Column(s).
def slice(x: Column, start: Int, length: Int): Column
x looks like this:
|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: double (nullable = true)
| | |-- b : double (nullable = true)
| | |-- c: double (nullable = true)
| | |-- d: string (nullable = true)
| | |-- e: double (nullable = true)
| | |-- f: double (nullable = true)
| | |-- g: long (nullable = true)
| | |-- h: double (nullable = true)
| | |-- i: double (nullable = true)
...
Any idea on how to achieve this?
Thanks!
You cannot use the built-in DataFrame DSL function slice for this (as it needs constant slice bounds); you can use a UDF instead. If df is your dataframe and you have from and until columns, then you can do:
val mySlice = udf(
  (data: Seq[Row], from: Int, until: Int) => data.slice(from, until),
  df.schema.fields.find(_.name == "x").get.dataType
)

df
  .select(mySlice($"x", $"from", $"until"))
  .show()
Alternatively, you can use a SQL expression in Spark SQL:
df
  .select(expr("slice(x, from, until)"))
  .show()
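The same SQL-expression route is available from PySpark too. A sketch, assuming df has from and until columns (backticks keep the parser from tripping over from as a column name):
from pyspark.sql import functions as F

df.select(F.expr("slice(x, `from`, `until`)").alias("x_sliced")).show()
If I remember correctly, Spark 3.1+ also lets F.slice take Column arguments for start and length, which avoids the expr string entirely.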

Parse JSON in Spark containing reserve character

I have a JSON input.txt file with data as follows:
2018-05-30.txt:{"Message":{"eUuid":"6e7d4890-9279-491a-ae4d-70416ef9d42d","schemaVersion":"1.0-AB1","timestamp":1527539376,"id":"XYZ","location":{"dim":{"x":2,"y":-7},"towards":121.0},"source":"a","UniqueId":"test123","code":"del","signature":"xyz","":{},"vel":{"ground":15},"height":{},"next":{"dim":{}},"sub":"del1"}}
2018-05-30.txt:{"Message":{"eUuid":"5e7d4890-9279-491a-ae4d-70416ef9d42d","schemaVersion":"1.0-AB1","timestamp":1627539376,"id":"ABC","location":{"dim":{"x":1,"y":-8},"towards":132.0},"source":"b","UniqueId":"hello123","code":"fra","signature":"abc","":{},"vel":{"ground":16},"height":{},"next":{"dim":{}},"sub":"fra1"}}
.
.
I tried to load the JSON into a DataFrame as follows:
val df = spark.read.json("<full path of input.txt file>")
I am receiving a _corrupt_record dataframe.
I am aware that the JSON contains "." (2018-05-30.txt) as a reserved character, which is causing the issue. How may I resolve this?
val rdd = sc.textFile("/Users/kishore/abc.json")
val jsonRdd = rdd.map(x => x.split("txt:")(1))

import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Reading the stripped-out JSON alone gives a single Message struct column:
// val df = sqlContext.read.json(jsonRdd)
// df.show(false)
// +--------------------+
// |             Message|
// +--------------------+
// |[test123,del,6e7d...|
// |[hello123,fra,5e7...|
// +--------------------+

// Flatten the nested fields into top-level columns:
val df = sqlContext.read.json(jsonRdd).withColumn("eUuid", $"Message"("eUuid"))
  .withColumn("schemaVersion", $"Message"("schemaVersion"))
  .withColumn("timestamp", $"Message"("timestamp"))
  .withColumn("id", $"Message"("id"))
  .withColumn("source", $"Message"("source"))
  .withColumn("UniqueId", $"Message"("UniqueId"))
  .withColumn("location", $"Message"("location"))
  .withColumn("dim", $"location"("dim"))
  .withColumn("x", $"dim"("x"))
  .withColumn("y", $"dim"("y"))
  .drop("dim")
  .withColumn("vel", $"Message"("vel"))
  .withColumn("ground", $"vel"("ground"))
  .withColumn("sub", $"Message"("sub"))
  .drop("Message")

df.show()
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
| eUuid|schemaVersion| timestamp| id|source|UniqueId| location| x| y| vel|ground| sub|
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
|6e7d4890-9279-491...| 1.0-AB1|1527539376|XYZ| a| test123|[[2,-7],121]| 2| -7|[15]| 15|del1|
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
The problem is not a reserved character; it is that the file does not contain valid JSON. So you can do:
val df = spark.read.textFile(...)
// drop the leading "2018-05-30.txt:" prefix (15 characters) before parsing
val json = spark.read.json(df.map(v => v.drop(15)))
json.printSchema()
root
|-- Message: struct (nullable = true)
| |-- UniqueId: string (nullable = true)
| |-- code: string (nullable = true)
| |-- eUuid: string (nullable = true)
| |-- id: string (nullable = true)
| |-- location: struct (nullable = true)
| | |-- dim: struct (nullable = true)
| | | |-- x: long (nullable = true)
| | | |-- y: long (nullable = true)
| | |-- towards: double (nullable = true)
| |-- schemaVersion: string (nullable = true)
| |-- signature: string (nullable = true)
| |-- source: string (nullable = true)
| |-- sub: string (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- vel: struct (nullable = true)
| | |-- ground: long (nullable = true)
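A PySpark sketch of the same idea (the path is a placeholder; the regexp_replace strips everything before the first { instead of counting the 15 prefix characters):
from pyspark.sql import functions as F

raw = spark.read.text("/path/to/input.txt")  # placeholder path
stripped = raw.select(F.regexp_replace("value", r"^[^{]*", "").alias("value"))

# spark.read.json also accepts an RDD of JSON strings
json_df = spark.read.json(stripped.rdd.map(lambda r: r.value))
json_df.printSchema()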
