I have a schema like this
root
|-- CaseNumber: string (nullable = true)
|-- Interactions: struct (nullable = true)
| |-- EmailInteractions: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- PhoneInteractions: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Contact: struct (nullable = true)
| | | | |-- id: string (nullable = true)
| | | |-- CreatedOn: string (nullable = true)
| | | |-- Direction: string (nullable = true)
| |-- WebInteractions: array (nullable = true)
| | |-- element: string (containsNull = true)
and I would like to explode the three arrays (EmailInteractions, PhoneInteractions, WebInteractions), group them by CaseNumber, create three tables, and execute this SQL query
SELECT CaseNumber, CreatedOn
FROM EmailInteraction
WHERE Direction = 'Outgoing'
UNION ALL
SELECT CaseNumber, CreatedOn
FROM PhoneInteraction
WHERE Direction = 'Outgoing'
UNION ALL
SELECT CaseNumber, CreatedOn
FROM WebInteraction
WHERE Direction = 'Outgoing'
The code I use to retrieve the schema:
val dl = spark.read.format("com.databricks.spark.avro").load("adl://power.azuredatalakestore.net/Se/eventhubspace/eventhub/0_2020_01_20_*_*_*.avro")
val dl1=dl.select($"body".cast("string")).map(_.toString())
val dl2=spark.read.json(dl1)
val dl3=dl2.select($"Content.*",$"RawProperties.UserProperties.*")
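A rough, untested sketch of what I think I need for PhoneInteractions, which already exposes CreatedOn and Direction as struct fields (EmailInteractions and WebInteractions are arrays of strings in this schema, so I assume they would need to be parsed before the same pattern applies):
import org.apache.spark.sql.functions.explode

// Explode the PhoneInteractions array and register it as a temp view
// so the SQL above can run against it
val phone = dl3
  .select($"CaseNumber", explode($"Interactions.PhoneInteractions").as("pi"))
  .select($"CaseNumber", $"pi.CreatedOn", $"pi.Direction")
phone.createOrReplaceTempView("PhoneInteraction")

spark.sql("""
  SELECT CaseNumber, CreatedOn
  FROM PhoneInteraction
  WHERE Direction = 'Outgoing'
""").show()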
I am new to Databricks; any help would be appreciated. Thanks in advance.
I have a requirement where I need to mask data stored in Cassandra tables using PySpark. I have a frozen data set in Cassandra which I get as an array in PySpark. I converted it to a string to mask it, and now I want to convert it back to the array type.
I am using Spark 2.3.2 to mask data from a Cassandra table. I copied the data to a DataFrame and converted it to a string to perform the masking. I tried to convert it back to an array, but I am not able to maintain the original structure.
from faker import Faker
from pyspark.sql.functions import regexp_replace, array
from pyspark.sql.types import StringType

faker = Faker()

table_df.createOrReplaceTempView("tmp")
networkinfos_df = sqlContext.sql('Select networkinfos, pid, eid, sid From tmp')
dfn1 = networkinfos_df.withColumn('networkinfos_ntdf',regexp_replace(networkinfos_df.networkinfos.cast(StringType()),r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b', faker.ipv4_private(network=False, address_class=None))).drop('networkinfos') \
.withColumn('networkinfos_ntdf',regexp_replace('networkinfos_ntdf',r'([a-fA-F0-9]{2}[:|\-]?){6}', faker.mac_address())) \
.withColumn('networkinfos_ntdf',regexp_replace('networkinfos_ntdf',r'(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))', faker.ipv6(network=False))) \
.drop('networkinfos')
dfn2 = dfn1.withColumn("networkinfos_ntdf", array(dfn1["networkinfos_ntdf"]))
dfn2.show(30,False)
The original structure is as follows:
|-- networkinfos: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- vendor: string (nullable = true)
| | |-- product: string (nullable = true)
| | |-- dhcp_enabled: boolean (nullable = true)
| | |-- dhcp_server: string (nullable = true)
| | |-- dns_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv4: string (nullable = true)
| | |-- ipv6: string (nullable = true)
| | |-- subnet_mask_obsolete: string (nullable = true)
| | |-- default_ip_gateway: string (nullable = true)
| | |-- mac_address: string (nullable = true)
| | |-- logical_name: string (nullable = true)
| | |-- dhcp_lease_obtained: timestamp (nullable = true)
| | |-- dhcp_lease_expires: timestamp (nullable = true)
| | |-- ip_enabled: boolean (nullable = true)
| | |-- ipv4_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv6_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- subnet_masks_obsolete: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- default_ip_gateways: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- wins_primary_server: string (nullable = true)
| | |-- wins_secondary_server: string (nullable = true)
| | |-- subnet_mask: string (nullable = true)
| | |-- subnet_masks: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- interface_index: integer (nullable = true)
| | |-- speed: long (nullable = true)
| | |-- dhcp_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
What I am getting is:
root
|-- pid: string (nullable = true)
|-- eid: string (nullable = true)
|-- sid: string (nullable = true)
|-- networkinfos_ntdf: array (nullable = false)
| |-- element: string (containsNull = true)
How can I convert it back to the original structure?
You can try using pyspark.sql.functions.to_json() and pyspark.sql.functions.from_json() to handle your task if your regexp_replace operations do not break the JSON data:
First find the schema for the field networkinfos:
from pyspark.sql.types import ArrayType
from pyspark.sql.functions import regexp_replace, from_json, to_json
# get the schema of the array field `networkinfos` in JSON
schema_data = df.select('networkinfos').schema.jsonValue()['fields'][0]['type']
# convert it into pyspark.sql.types.ArrayType:
field_schema = ArrayType.fromJson(schema_data)
Once you have field_schema, you can use from_json to convert the modified JSON strings back to the original schema:
dfn1 = networkinfos_df \
.withColumn('networkinfos', to_json('networkinfos')) \
.withColumn('networkinfos', regexp_replace('networkinfos',...)) \
.....\
.withColumn('networkinfos', from_json('networkinfos', field_schema))
My schema looks like this
root
|-- source: string (nullable = true)
|-- results: array (nullable = true)
| |-- content: struct (containsNull = true)
| | |-- ptype: string (nullable = true)
| | |-- domain: string (nullable = true)
| | |-- verb: string (nullable = true)
| | |-- foobar: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- fooId: integer (nullable = true)
|-- date: string (nullable = false)
|-- hour: string (nullable = false)
I have a df with the above data, and I want to create a DataFrame without fooId.
I cannot use drop since it's a nested column.
The tricky part is that results is an array with content as a struct, inside of which there is fooId.
What would be the cleanest way to accomplish this?
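One possible approach, sketched here under the assumption of Spark 2.4+ (for the transform higher-order function) and that the array element is the struct shown above, is to rebuild each element of results without fooId:
import org.apache.spark.sql.functions.expr

// Rebuild every element of `results` as a struct that keeps all fields except fooId.
// Field names are taken from the schema above; adjust the paths if the element
// struct is actually nested under a `content` field.
val withoutFooId = df.withColumn(
  "results",
  expr("transform(results, x -> named_struct('ptype', x.ptype, 'domain', x.domain, 'verb', x.verb, 'foobar', x.foobar))")
)
withoutFooId.printSchema()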
I'm working with a DataFrame that contains two arrays, and I want to combine these two arrays into one:
df.printSchema()
root
|-- context_id: long (nullable = true)
|-- data1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- k: struct (nullable = false)
| | | |-- v: string (nullable = true)
| | | |-- t: string (nullable = false)
| | |-- resourcename: string (nullable = true)
| | |-- criticity: string (nullable = true)
| | |-- v: string (nullable = true)
| | |-- vn: double (nullable = true)
|-- data2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- k: struct (nullable = false)
| | | |-- v: string (nullable = true)
| | | |-- t: string (nullable = false)
| | |-- resourcename: string (nullable = true)
| | |-- criticity: string (nullable = true)
| | |-- v: string (nullable = true)
| | |-- vn: double (nullable = true)
I created a UDF that concatenates the two arrays, and I provided the schema of the result:
val schema=df.select("data1").schema
val concatArray = udf ({ (x: Seq[Row], y: Seq[Row]) => x ++ y}, schema)
When I apply my UDF, I get this error:
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$11: (array<struct<k:struct<v:string,t:string>,resourcename:string,criticity:string,v:string,vn:double>>, array<struct<k:struct<v:string,t:string>,resourcename:string,criticity:string,v:string,vn:double>>) => struct<data1:array<struct<k:struct<v:string,t:string>,resourcename:string,criticity:string,v:string,vn:double>>>)
Any suggestions, please?
The way you provide the schema is incorrect. The schema of a single-column DataFrame
df.select("data1").schema
is not the same as the schema of the column itself. Instead, you should use the schema of the field:
val schema = df.schema("data1").dataType
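Putting it together, a minimal sketch of the corrected UDF (assuming data1 and data2 share the element schema shown above; the output column name merged is just an example):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Use the column's own data type (an ArrayType of the element struct) as the return type
val elementType = df.schema("data1").dataType

val concatArray = udf({ (x: Seq[Row], y: Seq[Row]) => x ++ y }, elementType)

// Concatenate the two arrays into a single column
val result = df.withColumn("merged", concatArray($"data1", $"data2"))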
This question already has answers here:
Querying Spark SQL DataFrame with complex types (3 answers)
How to flatten a struct in a Spark dataframe? (14 answers)
Closed 4 years ago.
I have a Spark Dataset with the structure below, and I want to flatten it and extract all the columns. I tried get_json_object as shown below, but it isn't working.
root
|-- data: struct (nullable = true)
| |-- Level: long (nullable = true)
| |-- activityName: string (nullable = true)
| |-- activityRunId: string (nullable = true)
| |-- activityType: string (nullable = true)
| |-- category: string (nullable = true)
| |-- correlationId: string (nullable = true)
| |-- effectiveIntegrationRuntime: string (nullable = true)
| |-- end: string (nullable = true)
| |-- level: string (nullable = true)
| |-- location: string (nullable = true)
| |-- operationName: string (nullable = true)
| |-- pipelineName: string (nullable = true)
| |-- pipelineRunId: string (nullable = true)
| |-- properties: struct (nullable = true)
| | |-- Annotations: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- Error: struct (nullable = true)
| | | |-- errorCode: string (nullable = true)
val DS = logDS.select(get_json_object($"data", "$.Level").alias("level"))
The requirement is that I want to extract these as separate columns.
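Since data is a struct column rather than a JSON string, get_json_object will not help here; a minimal sketch of selecting the nested fields directly (the column choices below are just examples):
// Promote every field of the `data` struct to a top-level column
val flatDS = logDS.select($"data.*")

// Or pick individual fields, including nested ones, with dot paths
val selectedDS = logDS.select(
  $"data.pipelineName",
  $"data.activityName",
  $"data.properties.Error.errorCode".alias("errorCode")
)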