Using a when clause within a nested structure - python-3.x

I'm creating a DataFrame of structs.
I want to create two more structs depending on the value of the field x2.field3: if x2.field3 == 4, then my struct "struct_1" should be created; if x2.field3 == 3, then "struct_2" should be created.
when(col("x2.field3").cast(IntegerType())== lit(4), struct(col("x1.field1").alias("index2")).alias("struct_1"))\
.when(col("x2.field3").cast(IntegerType())==lit(3), struct(col("x1.field1").alias("Index1")).alias("struct_2"))
I tried different solutions without success; I always get the same error:
Py4JJavaError: An error occurred while calling o21058.withColumn. :
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN
(CAST(x2.field3 AS INT) = 4) THEN named_struct('index2',
x1.field1) WHEN (CAST(x2.field3 AS INT) = 3) THEN
named_struct('Index1', x1.field1) END' due to data type mismatch:
THEN and ELSE expressions should all be same type or coercible to a
common type;; 'Project [x1#5751, x2#5752, named_struct(gen1,
x1#5751.field1, gen2, x1#5751.field1, NamePlaceholder,
named_struct(gen3.1, x1#5751.field1, gen3.2, x1#5751.field1, gen3.3,
x1#5751.field1, gen3.4, x1#5751.field1, gen3.5, x1#5751.field1,
gen3.6, x1#5751.field1, NamePlaceholder, named_struct(gen3.7.1,
named_struct(gen3.7.1.1, 11, gen3.7.1.2, 40), col2, CASE WHEN
(cast(x2#5752.field3 as int) = 4) THEN named_struct(index2,
x1#5751.field1) WHEN (cast(x2#5752.field3 as int) = 3) THEN
named_struct(Index1, x1#5751.field1) END))) AS General#5772]
+- LogicalRDD [x1#5751, x2#5752], false
My entire code is below:
schema = StructType([
    StructField('x1', StructType([
        StructField('field1', IntegerType(), True),
        StructField('field2', IntegerType(), True),
        StructField('x12', StructType([
            StructField('field5', IntegerType(), True)
        ]))
    ])),
    StructField('x2', StructType([
        StructField('field3', IntegerType(), True),
        StructField('field4', BooleanType(), True)
    ]))
])
df1 = sqlCtx.createDataFrame([Row(Row(1, 3, Row(23)), Row(3, True))], schema)
df1.printSchema()
df = df1.withColumn("General",
    struct(
        col("x1.field1").alias("gen1"),
        col("x1.field1").alias("gen2"),
        struct(
            col("x1.field1").alias("gen3.1"),
            col("x1.field1").alias("gen3.2"),
            col("x1.field1").alias("gen3.3"),
            col("x1.field1").alias("gen3.4"),
            col("x1.field1").alias("gen3.5"),
            col("x1.field1").alias("gen3.6"),
            struct(
                struct(lit(11).alias("gen3.7.1.1"),
                       lit(40).alias("gen3.7.1.2")).alias("gen3.7.1"),
                when(col("x2.field3").cast(IntegerType()) == lit(4),
                     struct(col("x1.field1").alias("index2")).alias("struct_1"))
                .when(col("x2.field3").cast(IntegerType()) == lit(3),
                      struct(col("x1.field1").alias("Index1")).alias("struct_2"))
            ).alias("gen3.7")
        ).alias("gen3")
    )).drop('x1', 'x2')
df.printSchema()

The branches of a single CASE WHEN must all return the same type, and your two structs have different field names (index2 vs. Index1), so they cannot be coerced to a common type. Moreover, struct_1 and struct_2 are exclusive, so I recommend the following piece of code:
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.functions import *

schema = StructType([
    StructField('x1', StructType([
        StructField('field1', IntegerType(), True),
        StructField('field2', IntegerType(), True),
        StructField('x12', StructType([
            StructField('field5', IntegerType(), True)
        ]))
    ])),
    StructField('x2', StructType([
        StructField('field3', IntegerType(), True),
        StructField('field4', BooleanType(), True)
    ]))
])

df1 = sqlCtx.createDataFrame([Row(Row(1, 3, Row(23)), Row(1, True))], schema)
df = df1.withColumn("General",
    struct(
        col("x1.field1").alias("gen1"),
        col("x1.field1").alias("gen2"),
        struct(
            col("x1.field1").alias("gen3.1"),
            col("x1.field1").alias("gen3.2"),
            col("x1.field1").alias("gen3.3"),
            col("x1.field1").alias("gen3.4"),
            col("x1.field1").alias("gen3.5"),
            col("x1.field1").alias("gen3.6"),
            struct(
                struct(lit(11).alias("gen3.7.1.1"),
                       lit(40).alias("gen3.7.1.2")).alias("gen3.7.1"),
                # Each when() is now its own column, null when its condition
                # fails, so the two structs no longer need a common type.
                when(col("x2.field3").cast(IntegerType()) == lit(4),
                     struct(col("x1.field1").alias("index2"))).alias("struct_1"),
                when(col("x2.field3").cast(IntegerType()) == lit(3),
                     struct(col("x1.field1").alias("Index1"))).alias("struct_2")
            ).alias("gen3.7")
        ).alias("gen3")
    )).drop('x1', 'x2')
df.printSchema()
Output:
root
|-- General: struct (nullable = false)
| |-- gen1: integer (nullable = true)
| |-- gen2: integer (nullable = true)
| |-- gen3: struct (nullable = false)
| | |-- gen3.1: integer (nullable = true)
| | |-- gen3.2: integer (nullable = true)
| | |-- gen3.3: integer (nullable = true)
| | |-- gen3.4: integer (nullable = true)
| | |-- gen3.5: integer (nullable = true)
| | |-- gen3.6: integer (nullable = true)
| | |-- gen3.7: struct (nullable = false)
| | | |-- gen3.7.1: struct (nullable = false)
| | | | |-- gen3.7.1.1: integer (nullable = false)
| | | | |-- gen3.7.1.2: integer (nullable = false)
| | | |-- struct_1: struct (nullable = true)
| | | | |-- index2: integer (nullable = true)
| | | |-- struct_2: struct (nullable = true)
| | | | |-- Index1: integer (nullable = true)
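If you really need a single CASE WHEN column rather than two exclusive ones, both branches must produce structs of the identical type; a minimal sketch, assuming the field name can be unified across branches (the names idx and struct_1_or_2 are hypothetical):
# Both branches return a struct with the same field name and type,
# so the CASE WHEN resolves to a single common type.
unified = when(col("x2.field3") == 4, struct(col("x1.field1").alias("idx"))) \
    .when(col("x2.field3") == 3, struct(col("x1.field1").alias("idx"))) \
    .alias("struct_1_or_2")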

Related

Add a column to multilevel nested structure in pyspark

I have a pyspark dataframe with the below structure.
Current Schema:
root
|-- ID
|-- Information
| |-- Name
| |-- Age
| |-- Gender
|-- Description
I would like to add first name and last name to Information.Name.
Is there a way to add new columns to multi-level struct types in pyspark?
Expected Schema:
root
|-- ID
|-- Information
| |-- Name
| | |-- firstName
| | |-- lastName
| |-- Age
| |-- Gender
|-- Description
Use withField; this would work:
df = df.withColumn(
    'Information',
    F.col('Information').withField(
        'Name',
        F.struct(F.col('Information.Name').alias('FName'),
                 F.lit('').alias('LName'))
    )
)
Schema Before:
root
|-- Id: string (nullable = true)
|-- Information: struct (nullable = true)
| |-- Name: string (nullable = true)
| |-- Age: integer (nullable = true)
Schema After:
root
|-- Id: string (nullable = true)
|-- Information: struct (nullable = true)
| |-- Name: struct (nullable = false)
| | |-- FName: string (nullable = true)
| | |-- LName: string (nullable = false)
| |-- Age: integer (nullable = true)
I initialized the value of FName with the current value of Name; you can use substring if that is needed.
If all Names follow the below pattern, then you can split on whitespace:
FirstName LastName
Example code with data:
from pyspark.sql.types import *
import pyspark.sql.functions as sqlf

data = [{
    "ID": 1,
    "Information": {
        "Name": "Alice Wonderland",
        "Age": 20,
        "Gender": "Female"
    },
    "Description": "Test data"
}]

schema = StructType([
    StructField("Description", StringType(), True),
    StructField("ID", IntegerType(), True),
    StructField("Information", StructType([
        StructField("Name", StringType(), True),
        StructField("Age", IntegerType(), True),
        StructField("Gender", StringType(), True)
    ]), True)
])

df = spark.createDataFrame(data, schema)

splitName = sqlf.split(df.Information.Name, ' ')
df = df.withColumn('Information', sqlf.col('Information')
    .withField('Name', sqlf.struct(splitName[0].alias('firstName'),
                                   splitName[1].alias('lastName'))))
df.printSchema()
root
|-- Description: string (nullable = true)
|-- ID: integer (nullable = true)
|-- Information: struct (nullable = true)
| |-- Name: struct (nullable = false)
| | |-- firstName: string (nullable = true)
| | |-- lastName: string (nullable = true)
| |-- Age: integer (nullable = true)
| |-- Gender: string (nullable = true)
df.show(truncate=False)
+-----------+---+---------------------------------+
|Description|ID |Information |
+-----------+---+---------------------------------+
|Test data |1 |{{Alice, Wonderland}, 20, Female}|
+-----------+---+---------------------------------+
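Note that withField is only available from Spark 3.1 onwards; on older versions, a hedged workaround is to rebuild the Information struct explicitly (field names taken from the schema above):
# Hedged workaround for Spark < 3.1: rebuild the whole struct,
# replacing Name and carrying the other fields over unchanged.
df = df.withColumn(
    'Information',
    sqlf.struct(
        sqlf.struct(splitName[0].alias('firstName'),
                    splitName[1].alias('lastName')).alias('Name'),
        sqlf.col('Information.Age').alias('Age'),
        sqlf.col('Information.Gender').alias('Gender')
    )
)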

How to Convert Map of Struct type to Json in Spark2

I have a map field in a dataset with the below schema:
|-- party: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- partyName: string (nullable = true)
| | |-- cdrId: string (nullable = true)
| | |-- legalEntityId: string (nullable = true)
| | |-- customPartyId: string (nullable = true)
| | |-- partyIdScheme: string (nullable = true)
| | |-- customPartyIdScheme: string (nullable = true)
| | |-- bdrId: string (nullable = true)
I need to convert it to JSON. Please suggest how to do it. Thanks in advance.
Spark provides the to_json function for DataFrame operations:
import org.apache.spark.sql.functions._
import spark.implicits._

val df =
  List(
    ("key1", "party01", "cdrId01"),
    ("key2", "party02", "cdrId02")
  )
    .toDF("key", "partyName", "cdrId")
    .select(struct($"key", struct($"partyName", $"cdrId")).as("col1"))
    .agg(map_from_entries(collect_set($"col1")).as("map_col"))
    .select($"map_col", to_json($"map_col").as("json_col"))

Exploding a struct type with a large number of columns into two columns of keys and values in Pyspark

I have a pyspark df whose schema looks like this:
root
|-- company: struct (nullable = true)
| |-- A: long (nullable = true)
| |-- A_timestamp: long (nullable = true)
| |-- B: long (nullable = true)
| |-- B_timestamp: long (nullable = true)
| |-- C: long (nullable = true)
| |-- C_timestamp: long (nullable = true)
| |-- D: long (nullable = true)
| |-- D_timestamp: long (nullable = true)
| |-- E: long (nullable = true)
| |-- E_timestamp: long (nullable = true)
I want the final format of this dataframe, with 4 columns, to look like this.
Please help me to solve this using Pyspark.
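A common way to melt a wide struct like this into key and value columns is to build an array of (key, value) structs and explode it; a hedged sketch (field names come from the schema above, and the exact four-column target is an assumption):
import pyspark.sql.functions as F

# Collect the field names of the company struct.
fields = df.select("company.*").columns
# Build one (key, value) struct per field and explode into rows.
kv = F.explode(F.array(*[
    F.struct(F.lit(f).alias("key"), F.col("company." + f).alias("value"))
    for f in fields
])).alias("kv")
df.select(kv).select("kv.key", "kv.value").show()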

How to convert a list with structure like (key1, list(key2, value)) into a dataframe in pyspark?

I have a list of the type shown below:
[(key1, [(key11, value11), (key12, value12)]), (key2, [(key21, value21), (key22, value22)...])...]
A sample structure is shown below:
[('1052762305',
[('1007819788', 0.9206884810054885),
('1005886801', 0.913818268123084),
('1003863766', 0.9131746152849486),
('1007811435', 0.9128666156173751),
('1005879599', 0.9126368405937075),
('1003705572', 0.9122051062936369),
('1007804896', 0.9083424459788203),
('1005890270', 0.8982097535650703),
('1007806781', 0.8708761186829758),
('1003670458', 0.8452789033694487)]),
('1064808607',
[('1007804896', 0.9984397647563017),
('1003705572', 0.9970498347406341),
('1005879599', 0.9951581013190172),
('1007811435', 0.9934813787902085),
('1005886801', 0.9930572794622374),
('1003863766', 0.9928815742735568),
('1007819788', 0.9869723713790797),
('1005890270', 0.9642640856016443),
('1007806781', 0.9211558765137313),
('1003670458', 0.8519872445941068)])]
I want to convert this into a dataframe of the form
key1         key2         score
1052762305   1007819788   0.9206884810054885
1052762305   1005886801   0.913818268123084
1052762305   1003863766   0.9131746152849486
...          ...          ...
1064808607   1007804896   0.9984397647563017
1064808607   1003705572   0.9970498347406341
1064808607   1005879599   0.9951581013190172
...          ...          ...
How can we implement this in pyspark?
You can create a schema upfront for the input, then use explode and access the elements within the value struct.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()

schema = StructType([
    StructField("key1", StringType()),
    StructField("value", ArrayType(StructType([
        StructField("key2", StringType()),
        StructField("score", DoubleType())
    ])))
])

df = spark.createDataFrame(
    [('1052762305',
      [('1007819788', 0.9206884810054885),
       ('1005886801', 0.913818268123084),
       ('1003863766', 0.9131746152849486),
       ('1007811435', 0.9128666156173751),
       ('1005879599', 0.9126368405937075),
       ('1003705572', 0.9122051062936369),
       ('1007804896', 0.9083424459788203),
       ('1005890270', 0.8982097535650703),
       ('1007806781', 0.8708761186829758),
       ('1003670458', 0.8452789033694487)]),
     ('1064808607',
      [('1007804896', 0.9984397647563017),
       ('1003705572', 0.9970498347406341),
       ('1005879599', 0.9951581013190172),
       ('1007811435', 0.9934813787902085),
       ('1005886801', 0.9930572794622374),
       ('1003863766', 0.9928815742735568),
       ('1007819788', 0.9869723713790797),
       ('1005890270', 0.9642640856016443),
       ('1007806781', 0.9211558765137313),
       ('1003670458', 0.8519872445941068)])
    ], schema
)
df.show()
+----------+--------------------+
| key1| value |
+----------+--------------------+
|1052762305|[[1007819788, 0.9...|
|1064808607|[[1007804896, 0.9...|
+----------+--------------------+
df.printSchema()
root
|-- key1: string (nullable = true)
|-- value: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key2: string (nullable = true)
| | |-- score: double (nullable = true)
df1 = df.select('key1', F.explode('value').alias('value'))
df1.show()
+----------+--------------------+
| key1| value |
+----------+--------------------+
|1052762305|[1007819788, 0.92...|
|1052762305|[1005886801, 0.91...|
|1052762305|[1003863766, 0.91...|
|1052762305|[1007811435, 0.91...|
|1052762305|[1005879599, 0.91...|
|1052762305|[1003705572, 0.91...|
|1052762305|[1007804896, 0.90...|
|1052762305|[1005890270, 0.89...|
|1052762305|[1007806781, 0.87...|
|1052762305|[1003670458, 0.84...|
|1064808607|[1007804896, 0.99...|
|1064808607|[1003705572, 0.99...|
|1064808607|[1005879599, 0.99...|
|1064808607|[1007811435, 0.99...|
|1064808607|[1005886801, 0.99...|
|1064808607|[1003863766, 0.99...|
|1064808607|[1007819788, 0.98...|
|1064808607|[1005890270, 0.96...|
|1064808607|[1007806781, 0.92...|
|1064808607|[1003670458, 0.85...|
+----------+--------------------+
df1.printSchema()
root
|-- key1: string (nullable = true)
|-- value: struct (nullable = true)
| |-- key2: string (nullable = true)
| |-- score: double (nullable = true)
df1.select("key1", "value.key2","value.score").show()
+----------+----------+------------------+
| key1| key2| score|
+----------+----------+------------------+
|1052762305|1007819788|0.9206884810054885|
|1052762305|1005886801| 0.913818268123084|
|1052762305|1003863766|0.9131746152849486|
|1052762305|1007811435|0.9128666156173751|
|1052762305|1005879599|0.9126368405937075|
|1052762305|1003705572|0.9122051062936369|
|1052762305|1007804896|0.9083424459788203|
|1052762305|1005890270|0.8982097535650703|
|1052762305|1007806781|0.8708761186829758|
|1052762305|1003670458|0.8452789033694487|
|1064808607|1007804896|0.9984397647563017|
|1064808607|1003705572|0.9970498347406341|
|1064808607|1005879599|0.9951581013190172|
|1064808607|1007811435|0.9934813787902085|
|1064808607|1005886801|0.9930572794622374|
|1064808607|1003863766|0.9928815742735568|
|1064808607|1007819788|0.9869723713790797|
|1064808607|1005890270|0.9642640856016443|
|1064808607|1007806781|0.9211558765137313|
|1064808607|1003670458|0.8519872445941068|
+----------+----------+------------------+
You basically need to do the following:
create a dataframe from your list
promote the pairs from elements of the array into separate rows by using explode
extract key & value from each pair via select
This could be done with something like this (the source data is in the variable called a):
from pyspark.sql.functions import explode, col
df = spark.createDataFrame(a, ['key1', 'val'])
df2 = df.select(col('key1'), explode(col('val')).alias('val'))
df3 = df2.select('key1', col('val')._1.alias('key2'), col('val')._2.alias('value'))
We can check that the schema and data match:
>>> df3.printSchema()
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- value: double (nullable = true)
>>> df3.show(2)
+----------+----------+------------------+
| key1| key2| value|
+----------+----------+------------------+
|1052762305|1007819788|0.9206884810054885|
|1052762305|1005886801| 0.913818268123084|
+----------+----------+------------------+
only showing top 2 rows
We can also check the schemas of the intermediate results:
>>> df.printSchema()
root
|-- key1: string (nullable = true)
|-- val: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: double (nullable = true)
>>> df2.printSchema()
root
|-- key1: string (nullable = true)
|-- val: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: double (nullable = true)

Parse JSON in Spark containing a reserved character

I have a JSON input.txt file with data as follows:
2018-05-30.txt:{"Message":{"eUuid":"6e7d4890-9279-491a-ae4d-70416ef9d42d","schemaVersion":"1.0-AB1","timestamp":1527539376,"id":"XYZ","location":{"dim":{"x":2,"y":-7},"towards":121.0},"source":"a","UniqueId":"test123","code":"del","signature":"xyz","":{},"vel":{"ground":15},"height":{},"next":{"dim":{}},"sub":"del1"}}
2018-05-30.txt:{"Message":{"eUuid":"5e7d4890-9279-491a-ae4d-70416ef9d42d","schemaVersion":"1.0-AB1","timestamp":1627539376,"id":"ABC","location":{"dim":{"x":1,"y":-8},"towards":132.0},"source":"b","UniqueId":"hello123","code":"fra","signature":"abc","":{},"vel":{"ground":16},"height":{},"next":{"dim":{}},"sub":"fra1"}}
.
.
I tried to load the JSON into a DataFrame as follows:
>>val df = spark.read.json("<full path of input.txt file>")
I am receiving a _corrupt_record DataFrame.
I am aware that the JSON contains "." (2018-05-30.txt) as a reserved character, which is causing the issue. How may I resolve this?
val rdd = sc.textFile("/Users/kishore/abc.json")
// Strip everything up to and including the "txt:" prefix on each line.
val jsonRdd = rdd.map(x => x.split("txt:")(1))

import org.apache.spark.sql.functions._
import sqlContext.implicits._

val parsed = sqlContext.read.json(jsonRdd)
parsed.show
+--------------------+
|             Message|
+--------------------+
|[test123,del,6e7d...|
|[hello123,fra,5e7...|
+--------------------+

// Flatten the nested Message struct into top-level columns.
val df = parsed.withColumn("eUuid", $"Message"("eUuid"))
  .withColumn("schemaVersion", $"Message"("schemaVersion"))
  .withColumn("timestamp", $"Message"("timestamp"))
  .withColumn("id", $"Message"("id"))
  .withColumn("source", $"Message"("source"))
  .withColumn("UniqueId", $"Message"("UniqueId"))
  .withColumn("location", $"Message"("location"))
  .withColumn("dim", $"location"("dim"))
  .withColumn("x", $"dim"("x"))
  .withColumn("y", $"dim"("y"))
  .drop("dim")
  .withColumn("vel", $"Message"("vel"))
  .withColumn("ground", $"vel"("ground"))
  .withColumn("sub", $"Message"("sub"))
  .drop("Message")
df.show()
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
| eUuid|schemaVersion| timestamp| id|source|UniqueId| location| x| y| vel|ground| sub|
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
|6e7d4890-9279-491...| 1.0-AB1|1527539376|XYZ| a| test123|[[2,-7],121]| 2| -7|[15]| 15|del1|
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
The problem is not a reserved character; it is that the lines are not valid JSON, because each one starts with the "2018-05-30.txt:" prefix. So you can strip the prefix and then parse:
val df = spark.read.textFile(...)
// drop(15) removes the 15-character "2018-05-30.txt:" prefix from each line.
val json = spark.read.json(df.map(v => v.drop(15)))
json.printSchema()
root
|-- Message: struct (nullable = true)
| |-- UniqueId: string (nullable = true)
| |-- code: string (nullable = true)
| |-- eUuid: string (nullable = true)
| |-- id: string (nullable = true)
| |-- location: struct (nullable = true)
| | |-- dim: struct (nullable = true)
| | | |-- x: long (nullable = true)
| | | |-- y: long (nullable = true)
| | |-- towards: double (nullable = true)
| |-- schemaVersion: string (nullable = true)
| |-- signature: string (nullable = true)
| |-- source: string (nullable = true)
| |-- sub: string (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- vel: struct (nullable = true)
| | |-- ground: long (nullable = true)
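The same prefix stripping can be done from PySpark; a minimal sketch (the input path and variable names are assumptions):
raw = spark.read.text("input.txt")
# Each line starts with the 15-character "2018-05-30.txt:" prefix.
json_df = spark.read.json(raw.rdd.map(lambda row: row.value[15:]))
json_df.printSchema()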
