I am not able to access the lambda variable in an inline SQL subquery inside a Spark SQL transform function.
I do not want to explode the array and perform a join; I was hoping to perform an inline lookup to fetch the value for the email. (A sketch of the explode-and-join rewrite I am trying to avoid follows the code below.)
Error:
AnalysisException: cannot resolve '`y.email`' given input columns: [lkp.id, lkp.display_name, lkp.email_address, lkp.uuid]
Code:
transform(
  x.license_admin,
  y -> struct(
    get_uuid(lower(trim(y.email))) as email,
    (select coalesce(lkp.uuid, get_uuid(lower(trim(y.email))))
     from lkp_uuid_email_map lkp
     where lkp.email_address = lower(trim(y.email))) as email_02,
    generic_hash(y.first_name) as first_name
  )
) as license_admin,
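For context, the explode-and-join rewrite I am trying to avoid would look roughly like this (a sketch only: it is flattened to a single nesting level, source_table and the id grouping key are placeholders, and null/empty arrays are ignored):

select t.id,
       collect_list(struct(
         coalesce(lkp.uuid, get_uuid(lower(trim(t.email)))) as email,
         generic_hash(t.first_name) as first_name
       )) as license_admin
from (
  select x.id, y.*
  from source_table x
  lateral view explode(x.license_admin) lv as y
) t
left join lkp_uuid_email_map lkp
  on lkp.email_address = lower(trim(t.email))
group by t.id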
Column schema:
|-- order_line_items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- additional_notes: string (nullable = true)
| | |-- amount: double (nullable = true)
| | |-- discounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- discount_amount: double (nullable = true)
| | | | |-- discount_percentage: string (nullable = true)
| | | | |-- discount_reason: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- license_admin: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- email: string (nullable = true)
| | | | |-- first_name: string (nullable = true)
| | | | |-- last_name: string (nullable = true)
| | | | |-- license_admin_id: string (nullable = true)
Thanks in advance.
I have a requirement where I need to mask data stored in Cassandra tables using PySpark. I have a frozen data set in Cassandra which I get as an array in PySpark, and I converted it to a string in order to mask it. Now I want to convert it back to the array type.
I am using Spark 2.3.2 to mask the data from the Cassandra table. I copied the data into a DataFrame and converted it to a string to perform the masking. I tried to convert it back to an array, but I am not able to maintain the original structure.
table_df.createOrReplaceTempView("tmp")
networkinfos_df = sqlContext.sql('SELECT networkinfos, pid, eid, sid FROM tmp')
dfn1 = networkinfos_df \
    .withColumn('networkinfos_ntdf', regexp_replace(networkinfos_df.networkinfos.cast(StringType()), r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b', faker.ipv4_private(network=False, address_class=None))) \
    .withColumn('networkinfos_ntdf', regexp_replace('networkinfos_ntdf', r'([a-fA-F0-9]{2}[:|\-]?){6}', faker.mac_address())) \
    .withColumn('networkinfos_ntdf', regexp_replace('networkinfos_ntdf', r'(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))', faker.ipv6(network=False))) \
    .drop('networkinfos')
dfn2 = dfn1.withColumn("networkinfos_ntdf", array(dfn1["networkinfos_ntdf"]))
dfn2.show(30,False)
The original structure is as follows:
|-- networkinfos: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- vendor: string (nullable = true)
| | |-- product: string (nullable = true)
| | |-- dhcp_enabled: boolean (nullable = true)
| | |-- dhcp_server: string (nullable = true)
| | |-- dns_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv4: string (nullable = true)
| | |-- ipv6: string (nullable = true)
| | |-- subnet_mask_obsolete: string (nullable = true)
| | |-- default_ip_gateway: string (nullable = true)
| | |-- mac_address: string (nullable = true)
| | |-- logical_name: string (nullable = true)
| | |-- dhcp_lease_obtained: timestamp (nullable = true)
| | |-- dhcp_lease_expires: timestamp (nullable = true)
| | |-- ip_enabled: boolean (nullable = true)
| | |-- ipv4_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv6_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- subnet_masks_obsolete: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- default_ip_gateways: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- wins_primary_server: string (nullable = true)
| | |-- wins_secondary_server: string (nullable = true)
| | |-- subnet_mask: string (nullable = true)
| | |-- subnet_masks: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- interface_index: integer (nullable = true)
| | |-- speed: long (nullable = true)
| | |-- dhcp_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
What I am getting is:
root
|-- pid: string (nullable = true)
|-- eid: string (nullable = true)
|-- sid: string (nullable = true)
|-- networkinfos_ntdf: array (nullable = false)
| |-- element: string (containsNull = true)
How can I convert it back to the original structure?
You can try using pyspark.sql.functions.to_json() and pyspark.sql.functions.from_json() to handle your task if your regexp_replace operations do not break the JSON data:
First find the schema for the field networkinfos:
from pyspark.sql.types import ArrayType
from pyspark.sql.functions import regexp_replace, from_json, to_json
# get the schema of the array field `networkinfos` in JSON
schema_data = df.select('networkinfos').schema.jsonValue()['fields'][0]['type']
# convert it into pyspark.sql.types.ArrayType:
field_schema = ArrayType.fromJson(schema_data)
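As an aside, if your PySpark version supports indexing a StructType by field name, the same schema can be pulled directly, without the JSON round trip (untested shortcut):

# equivalent: grab the field's data type straight off the schema
field_schema = df.schema['networkinfos'].dataType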
After you have field_schema, you can use from_json to convert the modified JSON strings back to the original schema:
dfn1 = networkinfos_df \
.withColumn('networkinfos', to_json('networkinfos')) \
.withColumn('networkinfos', regexp_replace('networkinfos',...)) \
.....\
.withColumn('networkinfos', from_json('networkinfos', field_schema))
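Filled in with the first masking pass from your question, the whole flow looks roughly like this (a sketch: faker and the remaining MAC/IPv6 regexp_replace passes are as in your question, and the masked strings must stay valid JSON):

from pyspark.sql.types import ArrayType
from pyspark.sql.functions import regexp_replace, from_json, to_json

# recover the original array<struct<...>> schema of `networkinfos`
schema_data = networkinfos_df.select('networkinfos').schema.jsonValue()['fields'][0]['type']
field_schema = ArrayType.fromJson(schema_data)

dfn1 = (networkinfos_df
    .withColumn('networkinfos', to_json('networkinfos'))   # array<struct> -> JSON string
    .withColumn('networkinfos', regexp_replace(            # mask IPv4 addresses
        'networkinfos',
        r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b',
        faker.ipv4_private()))
    # ... chain the mac_address and ipv6 regexp_replace passes here ...
    .withColumn('networkinfos', from_json('networkinfos', field_schema)))  # back to array<struct>

dfn1.printSchema()  # networkinfos is array<struct<...>> again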
My schema looks like this
root
|-- source: string (nullable = true)
|-- results: array (nullable = true)
| |-- content: struct (containsNull = true)
| | |-- ptype: string (nullable = true)
| | |-- domain: string (nullable = true)
| | |-- verb: string (nullable = true)
| | |-- foobar: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- fooId: integer (nullable = true)
|-- date: string (nullable = false)
|-- hour: string (nullable = false)
I have a df with the above data, and I want to create a DataFrame without fooId.
I cannot use drop since it's a nested column.
The tricky part is that results is an array whose elements are the content struct, and fooId lives inside that struct.
What would be the cleanest way to accomplish this?
Dataframe 1 looks like this:
root
|-- source: string (nullable = true)
|-- results: array (nullable = true)
| |-- content: struct (containsNull = true)
| | |-- ptype: string (nullable = true)
| | |-- domain: string (nullable = true)
| | |-- verb: string (nullable = true)
| | |-- foobar: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- fooId: integer (nullable = true)
|-- date: string (nullable = false)
|-- hour: string (nullable = false)
Dataframe 2 looks like this:
root
|-- source: string (nullable = true)
|-- results: array (nullable = true)
| |-- content: struct (containsNull = true)
| | |-- ptype: string (nullable = true)
| | |-- domain: string (nullable = true)
| | |-- verb: string (nullable = true)
| | |-- foobar: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
|-- date: string (nullable = false)
|-- hour: string (nullable = false)
Notice the difference: there is no fooId in the second dataframe.
How can I union these two dataframes?
I understand that the two schemas need to be the same to union. What is the best way to add or remove fooId? (This is non-trivial because of the structure of the schema.) What is the recommended approach for a union of this kind?
Thanks
Taking your two DataFrames as DF1 and DF2, you could remove the extra column from DF1 and then union the two DataFrames:
// remove the extra column from the dataframe
val df1Trimmed = DF1.drop("fooId")
Now both DFs have the same columns, so you can do a union:
df1Trimmed.union(DF2)
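Note one catch: drop only removes top-level columns, and fooId here lives inside the struct elements of the results array, so for this schema it has to be stripped out of each array element instead. A sketch of one way to do that, assuming Spark 3.1+ (where functions.transform and Column.dropFields exist); on older versions you would have to rebuild the struct field by field:

import org.apache.spark.sql.functions.{col, transform}

// rebuild each element of `results` without the nested fooId field
val df1NoFooId = DF1.withColumn(
  "results",
  transform(col("results"), element => element.dropFields("fooId"))
)
df1NoFooId.union(DF2)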
I'm working with a DataFrame that contains two arrays, and I want to combine these two arrays into one:
df.printSchema()
root
|-- context_id: long (nullable = true)
|-- data1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- k: struct (nullable = false)
| | | |-- v: string (nullable = true)
| | | |-- t: string (nullable = false)
| | |-- resourcename: string (nullable = true)
| | |-- criticity: string (nullable = true)
| | |-- v: string (nullable = true)
| | |-- vn: double (nullable = true)
|-- data2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- k: struct (nullable = false)
| | | |-- v: string (nullable = true)
| | | |-- t: string (nullable = false)
| | |-- resourcename: string (nullable = true)
| | |-- criticity: string (nullable = true)
| | |-- v: string (nullable = true)
| | |-- vn: double (nullable = true)
I created a UDF that concatenates the two arrays, and I provided the schema of the result:
val schema = df.select("data1").schema
val concatArray = udf({ (x: Seq[Row], y: Seq[Row]) => x ++ y }, schema)
When I apply my UDF, I get this error:
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$11: (array<struct<k:struct<v:string,t:string>,resourcename:string,criticity:string,v:string,vn:double>>, array<struct<k:struct<v:string,t:string>,resourcename:string,criticity:string,v:string,vn:double>>) => struct<data1:array<struct<k:struct<v:string,t:string>,resourcename:string,criticity:string,v:string,vn:double>>>)
Any suggestions, please?
The way you provide the schema is incorrect. The schema of a single-column DataFrame
df.select("data1").schema
is not the same as the schema of the column itself. Instead, you should use the data type of the field:
val schema = df.schema("data1").dataType
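With the corrected schema, the UDF from the question works unchanged. A quick sketch of the full usage (col and udf come from org.apache.spark.sql.functions; the output column name data is arbitrary):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

val schema = df.schema("data1").dataType
val concatArray = udf({ (x: Seq[Row], y: Seq[Row]) => x ++ y }, schema)

val result = df.withColumn("data", concatArray(col("data1"), col("data2")))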