I have a PySpark dataframe df1. Its printSchema() output is shown below.
df1.printSchema()
root
|-- parent: struct (nullable = true)
| |-- childa: struct (nullable = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
| |-- childb: struct (nullable = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
| |-- childc: struct (nullable = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
| |-- childd: struct (nullable = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
df1.show(10, False)
+--------------------------------------------------------------+
|parent                                                        |
+--------------------------------------------------------------+
|[,[x_value, y_value, z_value], ,[x_value, y_value, z_value]]  |
+--------------------------------------------------------------+
The df1.show() shows that childb and childd are not null.
I am able to get all the child struct field names (childa, childb, childc, childd).
I also want to get only those child struct field names that are not null.
The approach below gives me all the child struct field names as a list, which answers my first requirement:
spark.sql("""select parent.* from df1""").schema.fieldNames()
Output:
[childa, childb, childc, childd]
Now I want to get only those child struct field names that are not null.
I am expecting only childb and childd in a list.
Expected Output: [childb, childd]
You can check whether the fields are null using a filter and a count:
non_null_fields = [
    field
    for field in df1.select('parent.*').schema.fieldNames()
    # keep a field only if no row has a null value for it
    if df1.filter('parent.%s is null' % field).count() == 0
]
which gives
['childb', 'childd']
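Note that this runs one filter-and-count job per field. If there are many fields, here is a minimal sketch of a single-pass alternative that aggregates all the null counts at once (the null_counts name is my own):
from pyspark.sql import functions as F

fields = df1.select('parent.*').schema.fieldNames()
# a single aggregation that counts the null rows of every child field
null_counts = df1.agg(
    *[F.sum(F.col('parent.%s' % f).isNull().cast('int')).alias(f) for f in fields]
).first().asDict()
non_null_fields = [f for f, n in null_counts.items() if n == 0]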
I have a schema like this
root
|-- CaseNumber: string (nullable = true)
|-- Interactions: struct (nullable = true)
| |-- EmailInteractions: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- PhoneInteractions: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Contact: struct (nullable = true)
| | | | |-- id: string (nullable = true)
| | | |-- CreatedOn: string (nullable = true)
| | | |-- Direction: string (nullable = true)
| |-- WebInteractions: array (nullable = true)
| | |-- element: string (containsNull = true)
and I would like to explode the three arrays (EmailInteractions, PhoneInteractions, WebInteractions), pair each with CaseNumber to create three tables, and execute this SQL query:
Select casenumber,CreatedOn
from EmailInteraction
where Direction = 'Outgoing'
union all
Select casenumber,CreatedOn
from PhoneInteraction
where Direction = 'Outgoing'
union all
Select casenumber,CreatedOn
from WebInteraction
where Direction = 'Outgoing'
The code used to retrieve the schema:
val dl = spark.read.format("com.databricks.spark.avro")
  .load("adl://power.azuredatalakestore.net/Se/eventhubspace/eventhub/0_2020_01_20_*_*_*.avro")
val dl1 = dl.select($"body".cast("string")).map(_.toString())
val dl2 = spark.read.json(dl1)
val dl3 = dl2.select($"Content.*", $"RawProperties.UserProperties.*")
I am new to Databricks; any help would be appreciated. Thanks in advance.
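One way to approach this, sketched in PySpark (the code above is Scala, but the same API exists in both; note that EmailInteractions and WebInteractions are arrays of plain strings in this schema, so they carry no CreatedOn or Direction to select — the PhoneInteractions pattern below would apply to them once their elements are structs):
from pyspark.sql import functions as F

# explode the array so each phone interaction becomes its own row
phone = (dl3
    .select("CaseNumber", F.explode("Interactions.PhoneInteractions").alias("pi"))
    .select("CaseNumber", "pi.CreatedOn", "pi.Direction"))
phone.createOrReplaceTempView("PhoneInteraction")

spark.sql("""
    SELECT CaseNumber, CreatedOn
    FROM PhoneInteraction
    WHERE Direction = 'Outgoing'
""").show()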
I have a requirement where I need to mask data stored in Cassandra tables using PySpark. I have a frozen data set in Cassandra which I get as an array in PySpark. I converted it to a string to mask it. Now I want to convert it back to the array type.
I am using Spark 2.3.2 to mask data from a Cassandra table. I copied the data to a dataframe and converted it to a string to perform the masking. I tried to convert it back to an array; however, I am not able to maintain the original structure.
from faker import Faker
from pyspark.sql.functions import array, regexp_replace
from pyspark.sql.types import StringType

faker = Faker()
table_df.createOrReplaceTempView("tmp")
networkinfos_df = sqlContext.sql('Select networkinfos, pid, eid, sid From tmp')
# cast the array to a string, then mask IPv4, MAC and IPv6 values in turn
dfn1 = networkinfos_df.withColumn('networkinfos_ntdf', regexp_replace(networkinfos_df.networkinfos.cast(StringType()), r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b', faker.ipv4_private(network=False, address_class=None))) \
    .withColumn('networkinfos_ntdf', regexp_replace('networkinfos_ntdf', r'([a-fA-F0-9]{2}[:|\-]?){6}', faker.mac_address())) \
    .withColumn('networkinfos_ntdf', regexp_replace('networkinfos_ntdf', r'(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))', faker.ipv6(network=False))) \
    .drop('networkinfos')
dfn2 = dfn1.withColumn("networkinfos_ntdf", array(dfn1["networkinfos_ntdf"]))
dfn2.show(30, False)
The original structure is as follows:
|-- networkinfos: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- vendor: string (nullable = true)
| | |-- product: string (nullable = true)
| | |-- dhcp_enabled: boolean (nullable = true)
| | |-- dhcp_server: string (nullable = true)
| | |-- dns_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv4: string (nullable = true)
| | |-- ipv6: string (nullable = true)
| | |-- subnet_mask_obsolete: string (nullable = true)
| | |-- default_ip_gateway: string (nullable = true)
| | |-- mac_address: string (nullable = true)
| | |-- logical_name: string (nullable = true)
| | |-- dhcp_lease_obtained: timestamp (nullable = true)
| | |-- dhcp_lease_expires: timestamp (nullable = true)
| | |-- ip_enabled: boolean (nullable = true)
| | |-- ipv4_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv6_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- subnet_masks_obsolete: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- default_ip_gateways: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- wins_primary_server: string (nullable = true)
| | |-- wins_secondary_server: string (nullable = true)
| | |-- subnet_mask: string (nullable = true)
| | |-- subnet_masks: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- interface_index: integer (nullable = true)
| | |-- speed: long (nullable = true)
| | |-- dhcp_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
What I am getting is:
root
|-- pid: string (nullable = true)
|-- eid: string (nullable = true)
|-- sid: string (nullable = true)
|-- networkinfos_ntdf: array (nullable = false)
| |-- element: string (containsNull = true)
How can I get it converted to original structure?
You can try using pyspark.sql.functions.to_json() and pyspark.sql.functions.from_json() to handle your task if your regexp_replace operations do not break the JSON data:
First find the schema for the field networkinfos:
from pyspark.sql.types import ArrayType
from pyspark.sql.functions import regexp_replace, from_json, to_json
# get the schema of the array field `networkinfos` in JSON
schema_data = df.select('networkinfos').schema.jsonValue()['fields'][0]['type']
# convert it into pyspark.sql.types.ArrayType:
field_schema = ArrayType.fromJson(schema_data)
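As a side note, the same ArrayType can be read straight off the schema object, skipping the JSON round trip:
# equivalent: look the field up by name on the DataFrame schema
field_schema = df.schema['networkinfos'].dataType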
After you have the field_schema, you can use from_json to set it back to its original schema from the modified JSON strings:
dfn1 = networkinfos_df \
    .withColumn('networkinfos', to_json('networkinfos')) \
    .withColumn('networkinfos', regexp_replace('networkinfos',...)) \
    ..... \
    .withColumn('networkinfos', from_json('networkinfos', field_schema))
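Here is a minimal self-contained round trip showing that from_json restores the array-of-structs schema (the toy schema, data, and replacement value are my own, assuming an active spark session):
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

df = spark.createDataFrame(
    [([('10.0.0.1', 'aa:bb:cc:dd:ee:ff')],)],
    'networkinfos: array<struct<ipv4:string,mac_address:string>>')
schema_data = df.select('networkinfos').schema.jsonValue()['fields'][0]['type']
field_schema = ArrayType.fromJson(schema_data)

masked = df \
    .withColumn('networkinfos', F.to_json('networkinfos')) \
    .withColumn('networkinfos', F.regexp_replace('networkinfos', r'10\.0\.0\.1', '192.168.0.1')) \
    .withColumn('networkinfos', F.from_json('networkinfos', field_schema))
masked.printSchema()  # networkinfos is an array of structs again, not a string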
My schema looks like this
root
|-- source: string (nullable = true)
|-- results: array (nullable = true)
| |-- content: struct (containsNull = true)
| | |-- ptype: string (nullable = true)
| | |-- domain: string (nullable = true)
| | |-- verb: string (nullable = true)
| | |-- foobar: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- fooId: integer (nullable = true)
|-- date: string (nullable = false)
|-- hour: string (nullable = false)
I have a df with the above data. I want to create a dataframe without fooId.
I cannot use drop since it's a nested column.
The tricky part is that results is an array that has content as a struct, inside of which there is fooId.
What would be the cleanest way to accomplish this?
Dataframe1 looks like this
root
|-- source: string (nullable = true)
|-- results: array (nullable = true)
| |-- content: struct (containsNull = true)
| | |-- ptype: string (nullable = true)
| | |-- domain: string (nullable = true)
| | |-- verb: string (nullable = true)
| | |-- foobar: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- fooId: integer (nullable = true)
|-- date: string (nullable = false)
|-- hour: string (nullable = false)
Dataframe 2 look like below:
root
|-- source: string (nullable = true)
|-- results: array (nullable = true)
| |-- content: struct (containsNull = true)
| | |-- ptype: string (nullable = true)
| | |-- domain: string (nullable = true)
| | |-- verb: string (nullable = true)
| | |-- foobar: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
|-- date: string (nullable = false)
|-- hour: string (nullable = false)
Notice the difference - there is no fooId in the second dataframe.
How can I union these two dataframes together?
I understand that the two schemas need to be the same to union. What is the best way to add or remove fooId? (This is non-trivial because of the structure of the schema.) What is the recommended approach for doing a union of this kind?
Thanks
As you have two DataFrames, say DF1 and DF2, you can remove the extra field from DF1 and then run a union of both DataFrames. Note that because fooId is nested inside the elements of the results array, a plain DF1.drop("fooId") is a no-op; the array elements have to be rebuilt without that field. On Spark 3.1+ this can be done with transform and dropFields:
// rebuild each results element without the nested fooId field (Spark 3.1+,
// assuming the array elements are the content structs shown in the schema)
import org.apache.spark.sql.functions.transform
val DF1Trimmed = DF1.withColumn("results", transform($"results", x => x.dropFields("fooId")))
Now both DFs have the same schema, so you can do a union
DF1Trimmed.union(DF2)
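If you are on a Spark version without dropFields, here is a sketch of the same idea in PySpark, rebuilding each element with named_struct inside a SQL transform (available since Spark 2.4; the field list is copied from the schema above, and the names DF1/DF2 mirror the answer):
from pyspark.sql import functions as F

# drop the nested fooId by rebuilding each element without it (Spark 2.4+)
df1_trimmed = DF1.withColumn('results', F.expr("""
    transform(results, x -> named_struct(
        'ptype',  x.ptype,
        'domain', x.domain,
        'verb',   x.verb,
        'foobar', x.foobar))
"""))
df1_trimmed.union(DF2)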