I have a PySpark dataframe df1. Its printSchema() output is shown below.
df1.printSchema()
root
|-- parent: struct (nullable = true)
| |-- childa: struct (nullable = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
| |-- childb: struct (nullable = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
| |-- childc: struct (nullable = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
| |-- childd: struct (nullable = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
df1.show(10, False)
+--------------------------------------------------------------+
|parent                                                        |
+--------------------------------------------------------------+
|[,[x_value, y_value, z_value], ,[x_value, y_value, z_value]]  |
+--------------------------------------------------------------+
The df1.show() shows that childb and childd are not null.
I am able to get all the child struct field names (childa, childb, childc, childd).
I also want to get only those child struct field names that are not null.
The approach below gives me all the child struct field names as a list, which answers my first requirement:
spark.sql("""select parent.* from df1""").schema.fieldNames()
Output:
[childa, childb, childc, childd]
Now I want to get only those child struct field names that are not null.
I am expecting only childb and childd in a list.
Expected Output: [childb, childd]
You can check whether the fields are null using a filter and a count:
non_null_fields = [
    field
    for field in df1.select('parent.*').schema.fieldNames()
    # keep a field only if no row has a null value for it
    if df1.filter('parent.%s is null' % field).count() == 0
]
which gives
['childb', 'childd']
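Note that this runs one filter-and-count job per field. If there are many fields, here is a minimal sketch of a single-pass alternative that aggregates all the null counts at once (the null_counts name is my own):
from pyspark.sql import functions as F

fields = df1.select('parent.*').schema.fieldNames()
# a single aggregation that counts the null rows of every child field
null_counts = df1.agg(
    *[F.sum(F.col('parent.%s' % f).isNull().cast('int')).alias(f) for f in fields]
).first().asDict()
non_null_fields = [f for f, n in null_counts.items() if n == 0]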
I have a schema like this
root
|-- CaseNumber: string (nullable = true)
|-- Interactions: struct (nullable = true)
| |-- EmailInteractions: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- PhoneInteractions: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Contact: struct (nullable = true)
| | | | |-- id: string (nullable = true)
| | | |-- CreatedOn: string (nullable = true)
| | | |-- Direction: string (nullable = true)
| |-- WebInteractions: array (nullable = true)
| | |-- element: string (containsNull = true)
and I would like to explode the three arrays (EmailInteractions, PhoneInteractions, WebInteractions), pair each with CaseNumber to create three tables, and execute this SQL query:
Select casenumber,CreatedOn
from EmailInteraction
where Direction = 'Outgoing'
union all
Select casenumber,CreatedOn
from PhoneInteraction
where Direction = 'Outgoing'
union all
Select casenumber,CreatedOn
from WebInteraction
where Direction = 'Outgoing'
The code used to retrieve the schema:
val dl = spark.read.format("com.databricks.spark.avro")
  .load("adl://power.azuredatalakestore.net/Se/eventhubspace/eventhub/0_2020_01_20_*_*_*.avro")
val dl1 = dl.select($"body".cast("string")).map(_.toString())
val dl2 = spark.read.json(dl1)
val dl3 = dl2.select($"Content.*", $"RawProperties.UserProperties.*")
I am new to Databricks; any help would be appreciated. Thanks in advance.
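One way to approach this, sketched in PySpark (the code above is Scala, but the same API exists in both; note that EmailInteractions and WebInteractions are arrays of plain strings in this schema, so they carry no CreatedOn or Direction to select — the PhoneInteractions pattern below would apply to them once their elements are structs):
from pyspark.sql import functions as F

# explode the array so each phone interaction becomes its own row
phone = (dl3
    .select("CaseNumber", F.explode("Interactions.PhoneInteractions").alias("pi"))
    .select("CaseNumber", "pi.CreatedOn", "pi.Direction"))
phone.createOrReplaceTempView("PhoneInteraction")

spark.sql("""
    SELECT CaseNumber, CreatedOn
    FROM PhoneInteraction
    WHERE Direction = 'Outgoing'
""").show()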
I have a requirement where I need to mask data stored in Cassandra tables using PySpark. I have a frozen data set in Cassandra which I get as an array in PySpark. I converted it to a string to mask it. Now I want to convert it back to the array type.
I am using Spark 2.3.2 to mask data from a Cassandra table. I copied the data to a dataframe and converted it to a string to perform the masking. I tried to convert it back to an array; however, I am not able to maintain the original structure.
from faker import Faker
from pyspark.sql.functions import array, regexp_replace
from pyspark.sql.types import StringType

faker = Faker()
table_df.createOrReplaceTempView("tmp")
networkinfos_df = sqlContext.sql('Select networkinfos, pid, eid, sid From tmp')
# cast the array to a string, then mask IPv4, MAC and IPv6 values in turn
dfn1 = networkinfos_df.withColumn('networkinfos_ntdf', regexp_replace(networkinfos_df.networkinfos.cast(StringType()), r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b', faker.ipv4_private(network=False, address_class=None))) \
    .withColumn('networkinfos_ntdf', regexp_replace('networkinfos_ntdf', r'([a-fA-F0-9]{2}[:|\-]?){6}', faker.mac_address())) \
    .withColumn('networkinfos_ntdf', regexp_replace('networkinfos_ntdf', r'(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))', faker.ipv6(network=False))) \
    .drop('networkinfos')
dfn2 = dfn1.withColumn("networkinfos_ntdf", array(dfn1["networkinfos_ntdf"]))
dfn2.show(30, False)
The original structure is as follows:
|-- networkinfos: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- vendor: string (nullable = true)
| | |-- product: string (nullable = true)
| | |-- dhcp_enabled: boolean (nullable = true)
| | |-- dhcp_server: string (nullable = true)
| | |-- dns_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv4: string (nullable = true)
| | |-- ipv6: string (nullable = true)
| | |-- subnet_mask_obsolete: string (nullable = true)
| | |-- default_ip_gateway: string (nullable = true)
| | |-- mac_address: string (nullable = true)
| | |-- logical_name: string (nullable = true)
| | |-- dhcp_lease_obtained: timestamp (nullable = true)
| | |-- dhcp_lease_expires: timestamp (nullable = true)
| | |-- ip_enabled: boolean (nullable = true)
| | |-- ipv4_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv6_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- subnet_masks_obsolete: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- default_ip_gateways: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- wins_primary_server: string (nullable = true)
| | |-- wins_secondary_server: string (nullable = true)
| | |-- subnet_mask: string (nullable = true)
| | |-- subnet_masks: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- interface_index: integer (nullable = true)
| | |-- speed: long (nullable = true)
| | |-- dhcp_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
What I am getting is:
root
|-- pid: string (nullable = true)
|-- eid: string (nullable = true)
|-- sid: string (nullable = true)
|-- networkinfos_ntdf: array (nullable = false)
| |-- element: string (containsNull = true)
How can I get it converted to original structure?
You can try using pyspark.sql.functions.to_json() and pyspark.sql.functions.from_json() to handle your task if your regexp_replace operations do not break the JSON data:
First find the schema for the field networkinfos:
from pyspark.sql.types import ArrayType
from pyspark.sql.functions import regexp_replace, from_json, to_json
# get the schema of the array field `networkinfos` in JSON
schema_data = df.select('networkinfos').schema.jsonValue()['fields'][0]['type']
# convert it into pyspark.sql.types.ArrayType:
field_schema = ArrayType.fromJson(schema_data)
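As a side note, the same ArrayType can be read straight off the schema object, skipping the JSON round trip:
# equivalent: look the field up by name on the DataFrame schema
field_schema = df.schema['networkinfos'].dataType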
After you have the field_schema, you can use from_json to set it back to its original schema from the modified JSON strings:
dfn1 = networkinfos_df \
    .withColumn('networkinfos', to_json('networkinfos')) \
    .withColumn('networkinfos', regexp_replace('networkinfos',...)) \
    ..... \
    .withColumn('networkinfos', from_json('networkinfos', field_schema))
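Here is a minimal self-contained round trip showing that from_json restores the array-of-structs schema (the toy schema, data, and replacement value are my own, assuming an active spark session):
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

df = spark.createDataFrame(
    [([('10.0.0.1', 'aa:bb:cc:dd:ee:ff')],)],
    'networkinfos: array<struct<ipv4:string,mac_address:string>>')
schema_data = df.select('networkinfos').schema.jsonValue()['fields'][0]['type']
field_schema = ArrayType.fromJson(schema_data)

masked = df \
    .withColumn('networkinfos', F.to_json('networkinfos')) \
    .withColumn('networkinfos', F.regexp_replace('networkinfos', r'10\.0\.0\.1', '192.168.0.1')) \
    .withColumn('networkinfos', F.from_json('networkinfos', field_schema))
masked.printSchema()  # networkinfos is an array of structs again, not a string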
My schema looks like this
root
|-- source: string (nullable = true)
|-- results: array (nullable = true)
| |-- content: struct (containsNull = true)
| | |-- ptype: string (nullable = true)
| | |-- domain: string (nullable = true)
| | |-- verb: string (nullable = true)
| | |-- foobar: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- fooId: integer (nullable = true)
|-- date: string (nullable = false)
|-- hour: string (nullable = false)
I have a df with the above data. I want to create a dataframe without fooId.
I cannot use drop since it's a nested column.
The tricky part is that results is an array that has content as a struct, inside of which there is fooId.
What would be the cleanest way to accomplish this?
Dataframe1 looks like this
root
|-- source: string (nullable = true)
|-- results: array (nullable = true)
| |-- content: struct (containsNull = true)
| | |-- ptype: string (nullable = true)
| | |-- domain: string (nullable = true)
| | |-- verb: string (nullable = true)
| | |-- foobar: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- fooId: integer (nullable = true)
|-- date: string (nullable = false)
|-- hour: string (nullable = false)
Dataframe 2 look like below:
root
|-- source: string (nullable = true)
|-- results: array (nullable = true)
| |-- content: struct (containsNull = true)
| | |-- ptype: string (nullable = true)
| | |-- domain: string (nullable = true)
| | |-- verb: string (nullable = true)
| | |-- foobar: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
|-- date: string (nullable = false)
|-- hour: string (nullable = false)
Notice the difference - there is no fooId in the second dataframe.
How can I union these two dataframes together?
I understand that the two schemas need to be the same to union. What is the best way to add or remove fooId? (This is non-trivial because of the structure of the schema.) What is the recommended approach for doing a union of this kind?
Thanks
As you have two DataFrames, say DF1 and DF2, you can remove the extra field from DF1 and then run a union of both DataFrames. Note that because fooId is nested inside the elements of the results array, a plain DF1.drop("fooId") is a no-op; the array elements have to be rebuilt without that field. On Spark 3.1+ this can be done with transform and dropFields:
// rebuild each results element without the nested fooId field (Spark 3.1+,
// assuming the array elements are the content structs shown in the schema)
import org.apache.spark.sql.functions.transform
val DF1Trimmed = DF1.withColumn("results", transform($"results", x => x.dropFields("fooId")))
Now both DFs have the same schema, so you can do a union
DF1Trimmed.union(DF2)
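If you are on a Spark version without dropFields, here is a sketch of the same idea in PySpark, rebuilding each element with named_struct inside a SQL transform (available since Spark 2.4; the field list is copied from the schema above, and the names DF1/DF2 mirror the answer):
from pyspark.sql import functions as F

# drop the nested fooId by rebuilding each element without it (Spark 2.4+)
df1_trimmed = DF1.withColumn('results', F.expr("""
    transform(results, x -> named_struct(
        'ptype',  x.ptype,
        'domain', x.domain,
        'verb',   x.verb,
        'foobar', x.foobar))
"""))
df1_trimmed.union(DF2)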