I have this schema and I would like to split the fields inside result into separate columns, so that col1 is EventCode, col2 is Message, and so on. I'm using PySpark. I tried the explode function, but it doesn't seem to work on a StructType. Is there a way to do this in Spark?
root
|-- result: struct (nullable = true)
| |-- EventCode: string (nullable = true)
| |-- Message: string (nullable = true)
| |-- _bkt: string (nullable = true)
| |-- _cd: string (nullable = true)
| |-- _indextime: string (nullable = true)
| |-- _pre_msg: string (nullable = true)
| |-- _raw: string (nullable = true)
| |-- _serial: string (nullable = true)
| |-- _si: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- _sourcetype: string (nullable = true)
| |-- _time: string (nullable = true)
| |-- host: string (nullable = true)
| |-- index: string (nullable = true)
| |-- linecount: string (nullable = true)
| |-- source: string (nullable = true)
| |-- sourcetype: string (nullable = true)
Flattening the struct into top-level columns is easy. Select all of the struct's fields with "result.*" and assign the result to another DataFrame. Something like this:
simpleDF = df.select("result.*")
This converts the schema above into the following schema:
simpleDF.printSchema()
root
|-- EventCode: string (nullable = true)
|-- Message: string (nullable = true)
|-- _bkt: string (nullable = true)
|-- _cd: string (nullable = true)
|-- _indextime: string (nullable = true)
|-- _pre_msg: string (nullable = true)
|-- _raw: string (nullable = true)
|-- _serial: string (nullable = true)
|-- _si: array (nullable = true)
| |-- element: string (containsNull = true)
|-- _sourcetype: string (nullable = true)
|-- _time: string (nullable = true)
|-- host: string (nullable = true)
|-- index: string (nullable = true)
|-- linecount: string (nullable = true)
|-- source: string (nullable = true)
|-- sourcetype: string (nullable = true)
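Note that "result.*" expands only one level of nesting. For structs nested inside other structs, the dotted path of each leaf field is needed instead. The following is a minimal sketch in plain Python (no Spark required) that collects those paths. The nested dict used as a schema is a hypothetical stand-in for a real Spark schema, and leaf_paths is an illustrative helper, not a Spark API. The resulting strings are the kind of dotted column paths that df.select accepts.

```python
def leaf_paths(schema, prefix=""):
    """Return the dotted path of every leaf field in a nested schema.

    `schema` is a hypothetical stand-in for a Spark schema: a dict mapping
    field names to either a type name (leaf) or another dict (struct).
    """
    paths = []
    for name, field_type in schema.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(field_type, dict):
            # Nested struct: recurse and keep extending the dotted path.
            paths.extend(leaf_paths(field_type, path))
        else:
            paths.append(path)
    return paths


# A shortened version of the schema from the question.
schema = {
    "result": {
        "EventCode": "string",
        "Message": "string",
        "_si": "array<string>",
        "host": "string",
    }
}
columns = leaf_paths(schema)
print(columns)
```

If two leaves share the same name in different structs, the selected columns would need aliases to stay distinguishable.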
I have a DataFrame with a single column which is a struct type and contains an array.
users_tp_df.printSchema()
root
|-- x: struct (nullable = true)
| |-- ActiveDirectoryName: string (nullable = true)
| |-- AvailableFrom: string (nullable = true)
| |-- AvailableFutureAllocation: long (nullable = true)
| |-- AvailableFutureHours: double (nullable = true)
| |-- CreateDate: string (nullable = true)
| |-- CurrentAllocation: long (nullable = true)
| |-- CurrentAvailableHours: double (nullable = true)
| |-- CustomFields: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Name: string (nullable = true)
| | | |-- Type: string (nullable = true)
| | | |-- Value: string (nullable = true)
I'm trying to convert the CustomFields array column into three columns:
Country;
isExternal;
Service.
So for example, I have these values:
and the expected final DataFrame output for that row would be:
Can anyone please help me in achieving this?
Thank you!
This would work:
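from pyspark.sql import functions as F

# Tag each row with an id so the pivoted fields can be joined back to it.
initial_expansion = df.withColumn("id", F.monotonically_increasing_id()).select("id", "x.*")

# One row per CustomFields entry, then pivot the Name values into columns.
pivoted = (initial_expansion
    .withColumn("CustomFields", F.explode("CustomFields"))
    .select("*", "CustomFields.*")
    .groupBy("id").pivot("Name").agg(F.first("Value")))

final_df = initial_expansion.join(pivoted, "id").drop("CustomFields")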
Sample Input:
Json - {'x': {'CurrentAvailableHours': 2, 'CustomFields': [{'Name': 'Country', 'Value': 'Italy'}, {'Name': 'Service', 'Value':'Dev'}]}}
Input Structure:
root
|-- x: struct (nullable = true)
| |-- CurrentAvailableHours: integer (nullable = true)
| |-- CustomFields: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Name: string (nullable = true)
| | | |-- Value: string (nullable = true)
Output:
Output Structure (Id can be dropped):
root
|-- id: long (nullable = false)
|-- CurrentAvailableHours: integer (nullable = true)
|-- Country: string (nullable = true)
|-- Service: string (nullable = true)
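To make the explode-and-pivot step easier to follow, here is a minimal plain-Python model of what happens to a single record (no Spark required). The function name pivot_custom_fields is hypothetical and only serves the illustration. It takes the sample JSON above as input.

```python
def pivot_custom_fields(record):
    """Flatten record["x"] and turn each CustomFields entry into a column."""
    # Keep every scalar field of x, as select("x.*") followed by drop("CustomFields") would.
    flat = {k: v for k, v in record["x"].items() if k != "CustomFields"}
    for field in record["x"].get("CustomFields") or []:
        # Like F.first("Value") in the pivot: keep the first value seen per Name.
        flat.setdefault(field["Name"], field["Value"])
    return flat


sample = {
    "x": {
        "CurrentAvailableHours": 2,
        "CustomFields": [
            {"Name": "Country", "Value": "Italy"},
            {"Name": "Service", "Value": "Dev"},
        ],
    }
}
row = pivot_custom_fields(sample)
print(row)
```

One difference is worth keeping in mind. The Spark version uses explode and an inner join, so rows whose CustomFields is null or empty are dropped from final_df, whereas this sketch keeps such records.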
Considering the mockup structure below, which is similar to the one from your example,
you can do it the SQL way by using the inline function:
with alpha as (
select named_struct("alpha", "abc", "beta", 2.5, "gamma", 3, "delta"
, array( named_struct("a", "x", "b", "y", "c", "z")
, named_struct("a", "xx", "b", "yy", "c","zz"))
) root
)
select root.alpha, root.beta, root.gamma, inline(root.delta) as (a, b, c)
from alpha
The result:
Mockup structure:
Using PySpark, I need to load the "Properties" object (a map's value) from an Avro file into its own Spark DataFrame, so that the elements of "Properties" become columns and their values become rows. I am struggling to find clear examples that accomplish this.
Schema of the file:
root
|-- SequenceNumber: long (nullable = true)
|-- Offset: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- member0: long (nullable = true)
| | |-- member1: double (nullable = true)
| | |-- member2: string (nullable = true)
| | |-- member3: binary (nullable = true)
|-- Properties: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- member0: long (nullable = true)
| | |-- member1: double (nullable = true)
| | |-- member2: string (nullable = true)
| | |-- member3: binary (nullable = true)
|-- Body: binary (nullable = true)
The resulting "Properties" dataframe loaded from the above avro file needs to be like this:
| member0 | member1 | member2 | member3 |
|---------|---------|---------|---------|
| value   | value   | value   | value   |
map_values is your friend.
Collection function: Returns an unordered array containing the values of the map.
New in version 2.3.0.
df_properties = df.select((F.map_values(F.col('Properties'))[0]).alias('vals')).select('vals.*')
Full example:
from pyspark.sql import functions as F

df = spark.createDataFrame(
[('a', 20, 4.5, 'r', b'8')],
['key', 'member0', 'member1', 'member2', 'member3'])
df = df.select(F.create_map('key', F.struct('member0', 'member1', 'member2', 'member3')).alias('Properties'))
df.printSchema()
# root
# |-- Properties: map (nullable = false)
# | |-- key: string
# | |-- value: struct (valueContainsNull = false)
# | | |-- member0: long (nullable = true)
# | | |-- member1: double (nullable = true)
# | | |-- member2: string (nullable = true)
# | | |-- member3: binary (nullable = true)
df_properties = df.select((F.map_values(F.col('Properties'))[0]).alias('vals')).select('vals.*')
df_properties.show()
# +-------+-------+-------+-------+
# |member0|member1|member2|member3|
# +-------+-------+-------+-------+
# | 20| 4.5| r| [38]|
# +-------+-------+-------+-------+
df_properties.printSchema()
# root
# |-- member0: long (nullable = true)
# |-- member1: double (nullable = true)
# |-- member2: string (nullable = true)
# |-- member3: binary (nullable = true)
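One detail of the expression above is that the [0] index keeps only the first value of each map. The plain-Python model below (no Spark required) shows the difference between taking the first value and taking all of them. The helper names and the second map entry "b" are hypothetical and were added only for illustration. Python dicts preserve insertion order, while map_values is documented as returning an unordered array, so "first" is less well defined in Spark than it is here.

```python
def first_map_value(properties):
    """Model of map_values(col)[0]: the first value of the map (None here if the map is empty)."""
    values = list(properties.values())
    return values[0] if values else None


def all_map_values(properties):
    """Every value of the map, one entry per key, as one row each."""
    return list(properties.values())


properties = {
    "a": {"member0": 20, "member1": 4.5, "member2": "r", "member3": b"8"},
    "b": {"member0": 7, "member1": 1.0, "member2": "s", "member3": b"9"},  # hypothetical extra entry
}
first = first_map_value(properties)
rows = all_map_values(properties)
print(first)
print(len(rows))
```

If every entry of the map is needed as its own row, exploding the map instead of indexing it is probably the closer fit, since explode also accepts map columns and yields one row per key/value pair.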
I am working on a Twitter dataset. I have the data in JSON format. The structure is:
root
|-- _id: string (nullable = true)
|-- created_at: timestamp (nullable = true)
|-- lang: string (nullable = true)
|-- place: struct (nullable = true)
| |-- bounding_box: struct (nullable = true)
| | |-- coordinates: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: array (containsNull = true)
| | | | | |-- element: double (containsNull = true)
| | |-- type: string (nullable = true)
| |-- country_code: string (nullable = true)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- place_type: string (nullable = true)
|-- retweeted_status: struct (nullable = true)
| |-- _id: string (nullable = true)
| |-- user: struct (nullable = true)
| | |-- followers_count: long (nullable = true)
| | |-- friends_count: long (nullable = true)
| | |-- id_str: string (nullable = true)
| | |-- lang: string (nullable = true)
| | |-- screen_name: string (nullable = true)
| | |-- statuses_count: long (nullable = true)
|-- text: string (nullable = true)
|-- user: struct (nullable = true)
| |-- followers_count: long (nullable = true)
| |-- friends_count: long (nullable = true)
| |-- id_str: string (nullable = true)
| |-- lang: string (nullable = true)
| |-- screen_name: string (nullable = true)
| |-- statuses_count: long (nullable = true)
The code I am using for counting hashtags is this:
non_retweets = tweets.where("retweeted_status IS NULL")
hashtag = non_retweets.select('text').flatMap(lambda x: x.split(" ").filter(lambda x: x.startWith("#"))
hashtag = hashtag.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
hashtag.collect()
The error I am getting is this:
File "<ipython-input-112-11fd8cbc056d>",line 4
hashtag = hashtag.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
^
SyntaxError: Invalid syntax
I am not able to point out what my error is. Please help!
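The closing ) of the flatMap call is missing: it belongs right after split(" "). Because the call is never closed, Python still considers the statement open when it reaches the next line, which is why the SyntaxError is reported on line 4.
Two more things will fail once the syntax is fixed. The Python string method is startswith, not startWith, and flatMap runs on the DataFrame's underlying RDD of Row objects, so the text has to be read from each row.
Please check the code below.
hashtag = non_retweets.select('text').rdd.flatMap(lambda row: row.text.split(" ")).filter(lambda word: word.startswith("#"))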
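The flatMap, filter, and reduceByKey steps can be checked without a cluster. Below is a minimal plain-Python sketch of the same counting logic. The function count_hashtags and the sample texts are hypothetical, and the None check is an added assumption to cover tweets with a null text field.

```python
from collections import Counter


def count_hashtags(texts):
    """Count words starting with '#', mirroring flatMap -> filter -> reduceByKey."""
    counts = Counter()
    for text in texts:
        if text is None:
            # Assumption: rows with a null text are skipped rather than raising.
            continue
        for word in text.split(" "):      # flatMap: one item per word
            if word.startswith("#"):      # filter: keep hashtags only
                counts[word] += 1         # reduceByKey: sum the 1s per hashtag
    return dict(counts)


tweets_text = ["loving #spark and #python", "#spark is fast", None]
hashtag_counts = count_hashtags(tweets_text)
print(hashtag_counts)
```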
My Spark DataFrame's current schema is shown below. Is there a way I can remove the outer struct column (DTC_CAN_SIGNALS)?
**Current Schema**:
root
|-- DTC: string (nullable = true)
|-- DTCTS: long (nullable = true)
|-- VIN: string (nullable = true)
|-- DTC_CAN_SIGNALS: struct (nullable = true)
| |-- SGNL: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- SN: string (nullable = true)
| | | |-- ST: long (nullable = true)
| | | |-- SV: double (nullable = true)
**Expected Schema**:
root
|-- DTC: string (nullable = true)
|-- DTCTS: long (nullable = true)
|-- VIN: string (nullable = true)
|-- SGNL: array (nullable = true)
|-- element: struct (containsNull = true)
| |-- SN: string (nullable = true)
| |-- ST: long (nullable = true)
| |-- SV: double (nullable = true)
Just select your column from struct, like
df.withColumn("SGNL", col("DTC_CAN_SIGNALS.SGNL"))
or
df.select("DTC_CAN_SIGNALS.SGNL")
Code:
import sparkSession.implicits._
import org.apache.spark.sql.functions._
val data = Seq(
("DTC", 42L, "VIN")
).toDF("DTC", "DTCTS", "VIN")
val df = data.withColumn("DTC_CAN_SIGNALS", struct(array(struct(lit("sn1").as("SN"), lit(42L).as("ST"), lit(42.0D).as("SV"))).as("SGNL")))
df.show()
df.printSchema()
// alternatively
// val resDf = df
// .withColumn("SGNL", col("DTC_CAN_SIGNALS.SGNL"))
// .drop("DTC_CAN_SIGNALS")
val resDf = df.select("DTC", "DTCTS", "VIN", "DTC_CAN_SIGNALS.SGNL")
resDf.show()
resDf.printSchema()
Output:
+---+-----+---+-------------------+
|DTC|DTCTS|VIN| DTC_CAN_SIGNALS|
+---+-----+---+-------------------+
|DTC| 42|VIN|[[[sn1, 42, 42.0]]]|
+---+-----+---+-------------------+
root
|-- DTC: string (nullable = true)
|-- DTCTS: long (nullable = false)
|-- VIN: string (nullable = true)
|-- DTC_CAN_SIGNALS: struct (nullable = false)
| |-- SGNL: array (nullable = false)
| | |-- element: struct (containsNull = false)
| | | |-- SN: string (nullable = false)
| | | |-- ST: long (nullable = false)
| | | |-- SV: double (nullable = false)
+---+-----+---+-----------------+
|DTC|DTCTS|VIN| SGNL|
+---+-----+---+-----------------+
|DTC| 42|VIN|[[sn1, 42, 42.0]]|
+---+-----+---+-----------------+
root
|-- DTC: string (nullable = true)
|-- DTCTS: long (nullable = false)
|-- VIN: string (nullable = true)
|-- SGNL: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- SN: string (nullable = false)
| | |-- ST: long (nullable = false)
| | |-- SV: double (nullable = false)
I have a data-frame which has schema like this:
root
|-- docId: string (nullable = true)
|-- Country: struct (nullable = true)
| |-- s1: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- Gender: struct (nullable = true)
| |-- s1: string (nullable = true)
| |-- s2: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- s3: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- s4: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- s5: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- YOB: struct (nullable = true)
| |-- s1: long (nullable = true)
| |-- s2: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- s3: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- s4: array (nullable = true)
| | |-- element: long (containsNull = true)
I have a new data frame which has schema like this:
root
|-- docId: string (nullable = true)
|-- Country: struct (nullable = false)
| |-- s6: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- Gender: struct (nullable = false)
| |-- s6: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- YOB: struct (nullable = false)
| |-- s6: array (nullable = true)
| | |-- element: integer (containsNull = true)
I want to join these data-frames and have the structure like:
root
|-- docId: string (nullable = true)
|-- Country: struct (nullable = true)
| |-- s1: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- s6: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- Gender: struct (nullable = true)
| |-- s1: string (nullable = true)
| |-- s2: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- s3: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- s4: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- s5: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- s6: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- YOB: struct (nullable = true)
| |-- s1: long (nullable = true)
| |-- s2: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- s3: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- s4: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- s5: array (nullable = true)
| | |-- element: long (containsNull = true)
But in-turn I am getting data frame after join like this:
root
|-- docId: string (nullable = true)
|-- Country: struct (nullable = true)
| |-- s1: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- Country: struct (nullable = true)
| |-- s6: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- Gender: struct (nullable = true)
| |-- s1: string (nullable = true)
| |-- s2: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- s3: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- s4: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- s5: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- Gender: struct (nullable = true)
| |-- s6: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- YOB: struct (nullable = true)
| |-- s1: long (nullable = true)
| |-- s2: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- s3: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- s4: array (nullable = true)
| | |-- element: long (containsNull = true)
|-- YOB: struct (nullable = true)
| |-- s6: array (nullable = true)
| | |-- element: long (containsNull = true)
What should be done?
I have done an outer join on the field docId and the above data frame is the one that I get.
The DataFrame is not joined incorrectly: a JOIN operation is not supposed to merge structs. You get seemingly duplicate columns because the JOIN keeps the columns from both DataFrames when combining them. You have to do the combination explicitly:
Initialization
import pyspark
from pyspark.sql import types as T
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
First, the data (I added only some columns for reference, extending it to your full example is trivial):
Country_schema1 = T.StructField("Country", T.StructType([T.StructField("s1", T.StringType(), nullable=True)]), nullable=True)
Gender_schema1 = T.StructField("Gender", T.StructType([T.StructField("s1", T.StringType(), nullable=True),
T.StructField("s2", T.StringType(), nullable=True)]))
schema1 = T.StructType([T.StructField("docId", T.StringType(), nullable=True),
Country_schema1,
Gender_schema1
])
data1 = [("1",["1"], ["M", "X"])]
df1 = spark.createDataFrame(data1, schema=schema1)
df1.toJSON().collect()
Output:
['{"docId":"1","Country":{"s1":"1"},"Gender":{"s1":"M","s2":"X"}}']
Second dataframe:
Country_schema2 = T.StructField("Country", T.StructType([T.StructField("s6", T.StringType(), nullable=True)]), nullable=True)
Gender_schema2 = T.StructField("Gender", T.StructType([T.StructField("s6", T.StringType(), nullable=True),
T.StructField("s7", T.StringType(), nullable=True)]))
schema2 = T.StructType([T.StructField("docId", T.StringType(), nullable=True),
Country_schema2,
Gender_schema2
])
data2 = [("1",["2"], ["F", "Z"])]
df2 = spark.createDataFrame(data2, schema=schema2)
df2.toJSON().collect()
Output:
['{"docId":"1","Country":{"s6":"2"},"Gender":{"s6":"F","s7":"Z"}}']
Now the logic. I think this is easier if done using SQL. Create the tables first:
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
This is the query to execute. It basically indicates which fields are to be SELECTed (instead of all of them) and wraps the ones from the StructFields in a new structure which combines them:
result = spark.sql("SELECT df1.docID, "
"STRUCT(df1.Country.s1 AS s1, df2.Country.s6 AS s6) AS Country, "
"STRUCT(df1.Gender.s2 AS s2, df2.Gender.s6 AS s6, df2.Gender.s7 AS s7) AS Gender "
"FROM df1 JOIN df2 ON df1.docID=df2.docID")
result.show()
Output:
+-----+-------+---------+
|docID|Country| Gender|
+-----+-------+---------+
| 1| [1, 2]|[X, F, Z]|
+-----+-------+---------+
It is better viewed in JSON:
result.toJSON().collect()
['{"docID":"1","Country":{"s1":"1","s6":"2"},"Gender":{"s2":"X","s6":"F","s7":"Z"}}']
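Writing the STRUCT(...) expressions by hand gets tedious once the structs have many fields. As a rough sketch, the query string could be generated from lists of field names. The helpers merged_struct and build_query below are hypothetical, not part of Spark, and the field lists are passed in by hand here (in practice they might be read from each DataFrame's schema). The code only builds the string, so it runs without Spark.

```python
def merged_struct(column, left_fields, right_fields, left="df1", right="df2"):
    """Build one STRUCT(...) expression combining fields from both tables."""
    parts = [f"{left}.{column}.{f} AS {f}" for f in left_fields]
    parts += [f"{right}.{column}.{f} AS {f}" for f in right_fields]
    return f"STRUCT({', '.join(parts)}) AS {column}"


def build_query(key, struct_fields, left="df1", right="df2"):
    """Build the SELECT that joins on `key` and merges each struct column.

    `struct_fields` maps a struct column name to a pair
    (fields taken from the left table, fields taken from the right table).
    """
    selects = [f"{left}.{key}"]
    for column, (left_fields, right_fields) in struct_fields.items():
        selects.append(merged_struct(column, left_fields, right_fields, left, right))
    return (f"SELECT {', '.join(selects)} "
            f"FROM {left} JOIN {right} ON {left}.{key}={right}.{key}")


query = build_query("docID", {
    "Country": (["s1"], ["s6"]),
    "Gender": (["s2"], ["s6", "s7"]),
})
print(query)
```

The generated string matches the hand-written query above and could be passed to spark.sql(query) in the same way. Field names that appear in both lists of one struct would collide, so they would need distinct aliases.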