Add a column to multilevel nested structure in pyspark - apache-spark

I have a PySpark dataframe with the structure below.
Current Schema:
root
|-- ID
|-- Information
| |-- Name
| |-- Age
| |-- Gender
|-- Description
I would like to add first name and last name to Information.Name.
Is there a way to add new columns to multi-level struct types in PySpark?
Expected Schema:
root
|-- ID
|-- Information
| |-- Name
| | |-- firstName
| | |-- lastName
| |-- Age
| |-- Gender
|-- Description

Use withField; this would work:
df = df.withColumn('Information', F.col('Information').withField(
    'Name',
    F.struct(F.col('Information.Name').alias('FName'),
             F.lit('').alias('LName'))))
Schema Before:
root
|-- Id: string (nullable = true)
|-- Information: struct (nullable = true)
| |-- Name: string (nullable = true)
| |-- Age: integer (nullable = true)
Schema After:
root
|-- Id: string (nullable = true)
|-- Information: struct (nullable = true)
| |-- Name: struct (nullable = false)
| | |-- FName: string (nullable = true)
| | |-- LName: string (nullable = false)
| |-- Age: integer (nullable = true)
I initialized the value of FName with the current value of Name; you can use substring if that is needed.
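For example, here is a minimal sketch that fills both parts from the existing Name instead of leaving LName empty, using substring_index and assuming the values look like "First Last":
import pyspark.sql.functions as F

# Sketch: take everything before the first space as FName and everything
# after the last space as LName (assumes "First Last"-style values).
df = df.withColumn('Information', F.col('Information').withField(
    'Name',
    F.struct(
        F.substring_index('Information.Name', ' ', 1).alias('FName'),
        F.substring_index('Information.Name', ' ', -1).alias('LName'))))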

If all names follow the pattern below, you can split on whitespace:
FirstName LastName
Example code with data:
from pyspark.sql.types import *
import pyspark.sql.functions as sqlf

data = [{
    "ID": 1,
    "Information": {
        "Name": "Alice Wonderland",
        "Age": 20,
        "Gender": "Female"
    },
    "Description": "Test data"
}]

schema = StructType([
    StructField("Description", StringType(), True),
    StructField("ID", IntegerType(), True),
    StructField("Information", StructType([
        StructField("Name", StringType(), True),
        StructField("Age", IntegerType(), True),
        StructField("Gender", StringType(), True)
    ]), True)
])

df = spark.createDataFrame(data, schema)

splitName = sqlf.split(df.Information.Name, ' ')
df = df.withColumn('Information', sqlf.col('Information')
                   .withField('Name', sqlf.struct(splitName[0].alias('firstName'),
                                                  splitName[1].alias('lastName'))))
df.printSchema()
root
|-- Description: string (nullable = true)
|-- ID: integer (nullable = true)
|-- Information: struct (nullable = true)
| |-- Name: struct (nullable = false)
| | |-- firstName: string (nullable = true)
| | |-- lastName: string (nullable = true)
| |-- Age: integer (nullable = true)
| |-- Gender: string (nullable = true)
df.show(truncate=False)
+-----------+---+---------------------------------+
|Description|ID |Information |
+-----------+---+---------------------------------+
|Test data |1 |{{Alice, Wonderland}, 20, Female}|
+-----------+---+---------------------------------+

Related

Pyspark - Expand column with struct of arrays into new columns

I have a DataFrame with a single column which is a struct type and contains an array.
users_tp_df.printSchema()
root
|-- x: struct (nullable = true)
| |-- ActiveDirectoryName: string (nullable = true)
| |-- AvailableFrom: string (nullable = true)
| |-- AvailableFutureAllocation: long (nullable = true)
| |-- AvailableFutureHours: double (nullable = true)
| |-- CreateDate: string (nullable = true)
| |-- CurrentAllocation: long (nullable = true)
| |-- CurrentAvailableHours: double (nullable = true)
| |-- CustomFields: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Name: string (nullable = true)
| | | |-- Type: string (nullable = true)
| | | |-- Value: string (nullable = true)
I'm trying to convert the CustomFields array column into three columns:
Country;
isExternal;
Service.
For example, given the values of a sample row, the expected final dataframe output for that row would have those three fields as separate columns.
Can anyone please help me achieve this?
Thank you!
This would work:
# Expand the struct, then pivot the exploded CustomFields (Name/Value pairs)
# into columns and join the result back on a generated id.
initial_expansion = df.withColumn("id", F.monotonically_increasing_id()).select("id", "x.*")
final_df = initial_expansion.join(
    initial_expansion.withColumn("CustomFields", F.explode("CustomFields"))
                     .select("*", "CustomFields.*")
                     .groupBy("id").pivot("Name").agg(F.first("Value")),
    "id").drop("CustomFields")
Sample Input:
Json - {'x': {'CurrentAvailableHours': 2, 'CustomFields': [{'Name': 'Country', 'Value': 'Italy'}, {'Name': 'Service', 'Value':'Dev'}]}}
Input Structure:
root
|-- x: struct (nullable = true)
| |-- CurrentAvailableHours: integer (nullable = true)
| |-- CustomFields: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Name: string (nullable = true)
| | | |-- Value: string (nullable = true)
Output Structure (the id column can be dropped):
root
|-- id: long (nullable = false)
|-- CurrentAvailableHours: integer (nullable = true)
|-- Country: string (nullable = true)
|-- Service: string (nullable = true)
Considering the mockup structure below, similar to the one from your example,
you can do it the SQL way by using the inline function:
with alpha as (
  select named_struct(
           "alpha", "abc",
           "beta", 2.5,
           "gamma", 3,
           "delta", array(named_struct("a", "x",  "b", "y",  "c", "z"),
                          named_struct("a", "xx", "b", "yy", "c", "zz"))
         ) root
)
select root.alpha, root.beta, root.gamma, inline(root.delta) as (a, b, c)
from alpha
The result:
Mockup structure:
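For reference, here is a rough DataFrame-API sketch of the same idea applied to the schema from the question (untested; column names taken from the question):
# inline() turns the array of structs into one row per element, with the
# struct fields (Name, Value) as columns.
exploded = users_tp_df.selectExpr("x.CurrentAvailableHours", "inline(x.CustomFields)")
# A groupBy(...).pivot("Name").agg(first("Value")) step, as in the other
# answer, would then turn those Name/Value rows into separate columns.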

Load only struct from map's value from an avro file into a Spark Dataframe

Using PySpark, I need to load the "Properties" object (the map's value) from an Avro file into its own Spark dataframe, such that "Properties" becomes a dataframe with its elements and values as columns and rows. I'm struggling to find clear examples of how to accomplish that.
Schema of the file:
root
|-- SequenceNumber: long (nullable = true)
|-- Offset: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- member0: long (nullable = true)
| | |-- member1: double (nullable = true)
| | |-- member2: string (nullable = true)
| | |-- member3: binary (nullable = true)
|-- Properties: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- member0: long (nullable = true)
| | |-- member1: double (nullable = true)
| | |-- member2: string (nullable = true)
| | |-- member3: binary (nullable = true)
|-- Body: binary (nullable = true)
The resulting "Properties" dataframe loaded from the above Avro file needs to look like this:
+-------+-------+-------+-------+
|member0|member1|member2|member3|
+-------+-------+-------+-------+
|value  |value  |value  |value  |
+-------+-------+-------+-------+
map_values is your friend.
Collection function: Returns an unordered array containing the values of the map.
New in version 2.3.0.
df_properties = df.select((F.map_values(F.col('Properties'))[0]).alias('vals')).select('vals.*')
Full example:
df = spark.createDataFrame(
[('a', 20, 4.5, 'r', b'8')],
['key', 'member0', 'member1', 'member2', 'member3'])
df = df.select(F.create_map('key', F.struct('member0', 'member1', 'member2', 'member3')).alias('Properties'))
df.printSchema()
# root
# |-- Properties: map (nullable = false)
# | |-- key: string
# | |-- value: struct (valueContainsNull = false)
# | | |-- member0: long (nullable = true)
# | | |-- member1: double (nullable = true)
# | | |-- member2: string (nullable = true)
# | | |-- member3: binary (nullable = true)
df_properties = df.select((F.map_values(F.col('Properties'))[0]).alias('vals')).select('vals.*')
df_properties.show()
# +-------+-------+-------+-------+
# |member0|member1|member2|member3|
# +-------+-------+-------+-------+
# | 20| 4.5| r| [38]|
# +-------+-------+-------+-------+
df_properties.printSchema()
# root
# |-- member0: long (nullable = true)
# |-- member1: double (nullable = true)
# |-- member2: string (nullable = true)
# |-- member3: binary (nullable = true)
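If you need the value for a particular key rather than whichever value happens to be first, you can index the map by key instead; a small variant of the line above, where 'someKey' is just a placeholder:
# Pick the struct stored under a specific key ('someKey' is hypothetical).
df_properties = df.select(F.col('Properties')['someKey'].alias('vals')).select('vals.*')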

How to Convert Map of Struct type to Json in Spark2

I have a map field in a dataset with the schema below:
|-- party: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- partyName: string (nullable = true)
| | |-- cdrId: string (nullable = true)
| | |-- legalEntityId: string (nullable = true)
| | |-- customPartyId: string (nullable = true)
| | |-- partyIdScheme: string (nullable = true)
| | |-- customPartyIdScheme: string (nullable = true)
| | |-- bdrId: string (nullable = true)
I need to convert it to JSON. Please suggest how to do it. Thanks in advance.
Spark provides the to_json function for DataFrame operations:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = List(
    ("key1", "party01", "cdrId01"),
    ("key2", "party02", "cdrId02")
  )
  .toDF("key", "partyName", "cdrId")
  .select(struct($"key", struct($"partyName", $"cdrId")).as("col1"))
  .agg(map_from_entries(collect_set($"col1")).as("map_col"))
  .select($"map_col", to_json($"map_col").as("json_col"))

Making one dataframe out of two dataframes as separate subcolumns in pyspark

I want to put two dataframes into one, so that each one is a sub-column; it is not a join of dataframes. I have two dataframes, stat1_df and stat2_df, and they look something like this:
root
|-- max_scenes: integer (nullable = true)
|-- median_scenes: double (nullable = false)
|-- avg_scenes: double (nullable = true)
+----------+-------------+------------------+
|max_scenes|median_scenes|avg_scenes |
+----------+-------------+------------------+
|97 |7.0 |10.806451612903226|
|97 |7.0 |10.806451612903226|
|97 |7.0 |10.806451612903226|
|97 |7.0 |10.806451612903226|
+----------+-------------+------------------+
root
|-- max: double (nullable = true)
|-- type: string (nullable = true)
+-----+-----------+
|max |type |
+-----+-----------+
|10.0 |small |
|25.0 |medium |
|50.0 |large |
|250.0|extra_large|
+-----+-----------+
and I want the result_df to be:
root
|-- some_statistics1: struct (nullable = true)
| |-- max_scenes: integer (nullable = true)
| |-- median_scenes: double (nullable = false)
| |-- avg_scenes: double (nullable = true)
|-- some_statistics2: struct (nullable = true)
| |-- max: double (nullable = true)
| |-- type: string (nullable = true)
Is there any way to combine the two as shown? stat1_df and stat2_df are simple dataframes, without arrays and nested columns. The final dataframe is written to MongoDB. If there are any additional questions, I am here to answer.
Check the code below.
Add an id column to both DataFrames, move all columns into a struct, and then join the two DataFrames.
scala> val dfa = Seq(("10","8.9","7.9")).toDF("max_scenes","median_scenes","avg_scenes")
dfa: org.apache.spark.sql.DataFrame = [max_scenes: string, median_scenes: string ... 1 more field]
scala> dfa.show(false)
+----------+-------------+----------+
|max_scenes|median_scenes|avg_scenes|
+----------+-------------+----------+
|10 |8.9 |7.9 |
+----------+-------------+----------+
scala> dfa.printSchema
root
|-- max_scenes: string (nullable = true)
|-- median_scenes: string (nullable = true)
|-- avg_scenes: string (nullable = true)
scala> val mdfa = dfa.select(struct($"*").as("some_statistics1")).withColumn("id",monotonically_increasing_id)
mdfa: org.apache.spark.sql.DataFrame = [some_statistics1: struct<max_scenes: string, median_scenes: string ... 1 more field>, id: bigint]
scala> mdfa.printSchema
root
|-- some_statistics1: struct (nullable = false)
| |-- max_scenes: string (nullable = true)
| |-- median_scenes: string (nullable = true)
| |-- avg_scenes: string (nullable = true)
|-- id: long (nullable = false)
scala> mdfa.show(false)
+----------------+---+
|some_statistics1|id |
+----------------+---+
|[10,8.9,7.9] |0 |
+----------------+---+
scala> val dfb = Seq(("11.2","sample")).toDF("max","type")
dfb: org.apache.spark.sql.DataFrame = [max: string, type: string]
scala> dfb.printSchema
root
|-- max: string (nullable = true)
|-- type: string (nullable = true)
scala> dfb.show(false)
+----+------+
|max |type |
+----+------+
|11.2|sample|
+----+------+
scala> val mdfb = dfb.select(struct($"*").as("some_statistics2")).withColumn("id",monotonically_increasing_id)
mdfb: org.apache.spark.sql.DataFrame = [some_statistics2: struct<max: string, type: string>, id: bigint]
scala> mdfb.printSchema
root
|-- some_statistics2: struct (nullable = false)
| |-- max: string (nullable = true)
| |-- type: string (nullable = true)
|-- id: long (nullable = false)
scala> mdfb.show(false)
+----------------+---+
|some_statistics2|id |
+----------------+---+
|[11.2,sample] |0 |
+----------------+---+
scala> mdfa.join(mdfb,Seq("id"),"inner").drop("id").printSchema
root
|-- some_statistics1: struct (nullable = false)
| |-- max_scenes: string (nullable = true)
| |-- median_scenes: string (nullable = true)
| |-- avg_scenes: string (nullable = true)
|-- some_statistics2: struct (nullable = false)
| |-- max: string (nullable = true)
| |-- type: string (nullable = true)
scala> mdfa.join(mdfb,Seq("id"),"inner").drop("id").show(false)
+----------------+----------------+
|some_statistics1|some_statistics2|
+----------------+----------------+
|[10,8.9,7.9] |[11.2,sample] |
+----------------+----------------+
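Since the question asks for PySpark, here is a rough equivalent of the same approach (a sketch: monotonically_increasing_id only pairs the rows up as intended when the two frames' partitioning lines up, as in this small example):
import pyspark.sql.functions as F

# Wrap each frame's columns in a struct, add a join key, then join.
mdf1 = stat1_df.select(F.struct(*stat1_df.columns).alias('some_statistics1')) \
               .withColumn('id', F.monotonically_increasing_id())
mdf2 = stat2_df.select(F.struct(*stat2_df.columns).alias('some_statistics2')) \
               .withColumn('id', F.monotonically_increasing_id())

result_df = mdf1.join(mdf2, 'id', 'inner').drop('id')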

Error while counting hashtags using Pyspark

I am working on a Twitter dataset. I have the data in JSON format. The structure is:
root
|-- _id: string (nullable = true)
|-- created_at: timestamp (nullable = true)
|-- lang: string (nullable = true)
|-- place: struct (nullable = true)
| |-- bounding_box: struct (nullable = true)
| | |-- coordinates: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: array (containsNull = true)
| | | | | |-- element: double (containsNull = true)
| | |-- type: string (nullable = true)
| |-- country_code: string (nullable = true)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- place_type: string (nullable = true)
|-- retweeted_status: struct (nullable = true)
| |-- _id: string (nullable = true)
| |-- user: struct (nullable = true)
| | |-- followers_count: long (nullable = true)
| | |-- friends_count: long (nullable = true)
| | |-- id_str: string (nullable = true)
| | |-- lang: string (nullable = true)
| | |-- screen_name: string (nullable = true)
| | |-- statuses_count: long (nullable = true)
|-- text: string (nullable = true)
|-- user: struct (nullable = true)
| |-- followers_count: long (nullable = true)
| |-- friends_count: long (nullable = true)
| |-- id_str: string (nullable = true)
| |-- lang: string (nullable = true)
| |-- screen_name: string (nullable = true)
| |-- statuses_count: long (nullable = true)
The code I am using for counting hashtags is this:
non_retweets = tweets.where("retweeted_status IS NULL")
hashtag = non_retweets.select('text').flatMap(lambda x: x.split(" ").filter(lambda x: x.startWith("#"))
hashtag = hashtag.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
hashtag.collect()
The error I am getting is this:
File "<ipython-input-112-11fd8cbc056d>",line 4
hashtag = hashtag.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
^
SyntaxError: Invalid syntax
I am not able to figure out what my error is. Please help!
You forgot a closing parenthesis: the flatMap call is never closed, which is why it shows invalid syntax. Also note that the Python string method is startswith, not startWith.
Please check the code below.
hashtag = non_retweets.select('text').flatMap(lambda x: x.split(" ")).filter(lambda x: x.startswith("#"))
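As an alternative, here is a sketch of the same count done entirely with the DataFrame API, which avoids dropping to the RDD layer:
import pyspark.sql.functions as F

# Split each tweet's text into words, keep only hashtags, and count them.
hashtag_counts = (non_retweets
    .select(F.explode(F.split('text', ' ')).alias('word'))
    .filter(F.col('word').startswith('#'))
    .groupBy('word')
    .count())
hashtag_counts.collect()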
