I have the following df:
state  interval   limits  cep_ori
PA     >=5, <=10  >10     [{111, 222}, {333, 444}]
SP     >=6, <=10  <8      [{333, 444}, {555, 666}]
It follows this schema:
root
|-- state: string (nullable = true)
|-- interval: string (nullable = true)
|-- limits: string (nullable = true)
|-- cep_ori: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- inicial: integer (nullable = true)
| | |-- final: integer (nullable = true)
I have another df (df2):
seller_id  cep
12222      114
33332      43
33344      338
I need to inner join when the field 'cep' is contained in any interval (inicial, final) of the field 'cep_ori'.
So, the df_final I need:
state  interval   limits  cep_ori                   seller_id
PA     >=5, <=10  >10     [{111, 222}, {333, 444}]  12222
PA     >=5, <=10  >10     [{111, 222}, {333, 444}]  33344
SP     >=6, <=10  <8      [{333, 444}, {555, 666}]  33344
I tried:
df.filter(sf.size(sf.array_intersect(df.cep_ori.inicial, sf.array(df2.cep))) != 0) \
  .filter(sf.size(sf.array_intersect(df.cep_ori.final, sf.array(df2.cep))) != 0)
But it didn't work.
Can someone help me?
You can convert the cep_ori column into an array that contains every integer value in the given ranges by using the transform and sequence functions.
Then join with the second dataframe and keep the rows where cep exists in that expanded array.
from pyspark.sql.functions import array_contains, col, expr

df1 = ...  # first dataframe
# expand each {inicial, final} struct into the full integer range, then flatten
df1 = df1.withColumn("cep_ori_array", expr("flatten(transform(cep_ori, x -> sequence(x.inicial, x.final)))"))
df2 = spark.createDataFrame([(12222, 114), (33332, 43), (33344, 338)], ['seller_id', 'cep'])
df2.join(df1).where(array_contains("cep_ori_array", col("cep").cast("bigint"))).drop("cep_ori_array").show(truncate=False)
+---------+---+------------------------+---------+------+-----+
|seller_id|cep|cep_ori |interval |limits|state|
+---------+---+------------------------+---------+------+-----+
|12222 |114|[{222, 111}, {444, 333}]|>=5, <=10|>10 |PA |
|33344 |338|[{222, 111}, {444, 333}]|>=5, <=10|>10 |PA |
|33344 |338|[{444, 333}, {666, 555}]|>=6, <=10|<8 |SP |
+---------+---+------------------------+---------+------+-----+
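If the ranges are wide, flattening every value of every range can get expensive. A minimal alternative sketch, assuming Spark 2.4+ and the same df1/df2 as above, keeps the bounds comparison in the join condition with the exists higher-order function instead of materializing the sequences:

from pyspark.sql import functions as sf

# hedged sketch: inner join where cep falls inside any (inicial, final) interval of cep_ori
df_final = df2.join(
    df1,
    sf.expr("exists(cep_ori, x -> cep >= x.inicial AND cep <= x.final)"),
    "inner"
)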
I have the sample data and structure below and am playing around with it to better understand SparkSQL and PySpark commands.
schemaTest="`id` BIGINT NOT NULL,`name` STRING,`address` STRUCT<`number`: INT, `road`: STRING,
`city`: STRUCT<`name`: STRING, `postcode`: BIGINT>>,`numbers` ARRAY<INT>"
data = [(1,"Smith",(1200,"North Custer RD",("Sugar Land TX",75034)),[2815,2133])]
this is what I get from printSchema:
root
|-- id: long (nullable = false)
|-- name: string (nullable = true)
|-- address: struct (nullable = true)
| |-- number: integer (nullable = true)
| |-- road: string (nullable = true)
| |-- city: struct (nullable = true)
| | |-- name: string (nullable = true)
| | |-- postcode: long (nullable = true)
|-- numbers: array (nullable = true)
| |-- element: integer (containsNull = true)
When I query the df, this is how it's represented; I am trying to reformat the "address" column for a better representation:
+---+-----+-----------------------------------------------+------------+
|id |name |address |numbers |
+---+-----+-----------------------------------------------+------------+
|1 |Smith|{1200, North Custer RD, {Sugar Land TX, 75034}}|[2815, 2133]|
+---+-----+-----------------------------------------------+------------+
I want it to be more like this:
+---+-----+------------------------------------------+------------+
|id |name |address |numbers |
+---+-----+------------------------------------------+------------+
|1 |Smith|1200 North Custer RD, Sugar Land TX, 75034|[2815, 2133]|
+---+-----+------------------------------------------+------------+
I tried explode to see if I could extract the fields, but it gives a mismatch error (I assume explode cannot be performed on a StructType).
Can someone give me an example using withColumn of how to reformat the "address" column, or suggest another approach?
You can use the concat built-in function to create a string from several columns, as follows:
from pyspark.sql import functions as F
result = input_df.withColumn(
'address',
F.concat(
F.col('address.number'),
F.lit(' '),
F.col('address.road'),
F.lit(', '),
F.col('address.city.name'),
F.lit(', '),
F.col('address.city.postcode')
)
)
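If you prefer a single printf-style pattern over interleaving literals, format_string gives an equivalent result; this is just a sketch against the same assumed input_df and columns:

from pyspark.sql import functions as F

result = input_df.withColumn(
    'address',
    # pattern: "number road, city name, postcode"
    F.format_string(
        '%s %s, %s, %s',
        F.col('address.number'),
        F.col('address.road'),
        F.col('address.city.name'),
        F.col('address.city.postcode')
    )
)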
I have a column that contains array of structs. It looks like this:
|-- Network: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Code: string (nullable = true)
| | |-- Signal: string (nullable = true)
This is just a small sample; there are many more columns inside the struct than this. Is there a way to take the arrays in the column for each row, concatenate them, and make them into one string? For example, we could have something like this:
[["example", 2], ["example2", 3]]
Is there a way to make into:
"example2example3"?
Assuming you have a dataframe df with the schema and sample data shown above:
You first need to explode the Network array so you can select the struct elements Code and Signal.
var myDf = df.select(explode($"Network").as("Network"))
Then you need to concatenate the two columns using the concat() function and pass the output to the collect_list() function, which aggregates all rows into one row of type array<string>.
myDf = myDf.select(collect_list(concat($"Network.code",$"Network.signal")).as("data"))
Finally, you need to concatenate into the required format, which can be done using the concat_ws() function. It takes two arguments: the first is the separator to be placed between two strings, and the second is a column of type array<string>, which is the output of the previous step. For your use case no separator is needed between the concatenated strings, so the separator argument is kept as an empty string.
myDf = myDf.select(concat_ws("",$"data").as("data"))
All the above steps can be combined:
myDf = df.select(explode($"Network").as("Network")).select(concat_ws("", collect_list(concat($"Network.code", $"Network.signal"))).as("data"))
myDf.show(false)
If you want the output directly in a String variable, use:
val myStr = myDf.first.get(0).toString
print(myStr)
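If instead you want one concatenated string per row (rather than one string across all rows), a PySpark sketch using the transform higher-order function (Spark 2.4+) avoids the explode/collect_list aggregation; column names are assumed from the schema above:

from pyspark.sql import functions as F

per_row = df.withColumn(
    "data",
    # build "CodeSignal" for each struct, then glue the pieces together with no separator
    F.concat_ws("", F.expr("transform(Network, x -> concat(x.Code, x.Signal))"))
)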
There is a library called spark-hats (GitHub, small article) that you might find very useful in these situations.
With it, you can map over the array easily and output the concatenation next to the elements, or even somewhere else if you provide a fully qualified name.
Setup
import org.apache.spark.sql.functions._
import za.co.absa.spark.hats.Extensions._
scala> df.printSchema
root
|-- info: struct (nullable = true)
| |-- drivers: struct (nullable = true)
| | |-- carName: string (nullable = true)
| | |-- carNumbers: string (nullable = true)
| | |-- driver: string (nullable = true)
|-- teamName: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- team1: string (nullable = true)
| | |-- team2: string (nullable = true)
scala> df.show(false)
+---------------------------+------------------------------+
|info |teamName |
+---------------------------+------------------------------+
|[[RB7, 33, Max Verstappen]]|[[Redbull, rb], [Monster, mt]]|
+---------------------------+------------------------------+
Command you are looking for
scala> val dfOut = df.nestedMapColumn(inputColumnName = "teamName", outputColumnName = "nextElementInArray", expression = a => concat(a.getField("team1"), a.getField("team2")) )
dfOut: org.apache.spark.sql.DataFrame = [info: struct<drivers: struct<carName: string, carNumbers: string ... 1 more field>>, teamName: array<struct<team1:string,team2:string,nextElementInArray:string>>]
Output
scala> dfOut.printSchema
root
|-- info: struct (nullable = true)
| |-- drivers: struct (nullable = true)
| | |-- carName: string (nullable = true)
| | |-- carNumbers: string (nullable = true)
| | |-- driver: string (nullable = true)
|-- teamName: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- team1: string (nullable = true)
| | |-- team2: string (nullable = true)
| | |-- nextElementInArray: string (nullable = true)
scala> dfOut.show(false)
+---------------------------+----------------------------------------------------+
|info |teamName |
+---------------------------+----------------------------------------------------+
|[[RB7, 33, Max Verstappen]]|[[Redbull, rb, Redbullrb], [Monster, mt, Monstermt]]|
+---------------------------+----------------------------------------------------+
I have a Spark dataframe with a struct field, as shown below.
val arrayStructData = Seq(
Row("James",Row("Java","XX",120)),
Row("Michael",Row("Java","",200)),
Row("Robert",Row("Java","XZ",null)),
Row("Washington",Row("","XX",120))
)
val arrayStructSchema = new StructType().add("name",StringType).add("my_struct", new StructType().add("name",StringType).add("author",StringType).add("pages",IntegerType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
df.printSchema()
root
|-- name: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- author: string (nullable = true)
| |-- pages: integer (nullable = true)
df.show(false)
+----------+---------------+
|name |my_struct |
+----------+---------------+
|James |[Java, XX, 120]|
|Michael |[Java, , 200] |
|Robert |[Java, XZ,] |
|Washington|[, XX, 120] |
+----------+---------------+
I want to construct an output column called final_list which shows me the absence or presence of elements in the struct. The problem is that the struct elements are limited to 3 in this example, but in the actual data there are 1,000 elements in the struct, and every record may or may not contain a value in each element.
Here is how I want to construct the column -
val cleaned_df = spark.sql(s"""select name, case when my_struct.name = "" then "" else "name" end as name_present
, case when my_struct.author = "" then "" else "author" end as author_present
, case when my_struct.pages = "" then "" else "pages" end as pages_present
from df""")
cleaned_df.createOrReplaceTempView("cleaned_df")
cleaned_df.show(false)
+----------+------------+--------------+-------------+
|name |name_present|author_present|pages_present|
+----------+------------+--------------+-------------+
|James |name |author |pages |
|Michael |name | |pages |
|Robert |name |author |pages |
|Washington| |author |pages |
+----------+------------+--------------+-------------+
So I write a case statement for every column to capture its presence or absence, and then I concatenate as below to get the final output:
val final_df = spark.sql(s"""
select name, concat_ws("," , name_present, author_present, pages_present) as final_list
from cleaned_df
""")
final_df.show(false)
+----------+-----------------+
|name |final_list |
+----------+-----------------+
|James |name,author,pages|
|Michael |name,,pages |
|Robert |name,author,pages|
|Washington|,author,pages |
+----------+-----------------+
I cannot write a giant case statement to capture this for a 1,000-element struct. Is there a smarter way to do this? Perhaps a UDF?
I am using Spark 2.4.3. I don't know if there are any higher-order functions that support this. The schema of my real dataframe looks like below:
|-- name: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- author: string (nullable = true)
| |-- element3: integer (nullable = true)
| |-- element4: string (nullable = true)
| |-- element5: double (nullable = true)
.....
.....
| |-- element1000: string (nullable = true)
You already mentioned a UDF. With a UDF you can iterate over all fields of my_struct and collect the flags:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

def availableFields = (in: Row) => {
  val ret = scala.collection.mutable.ListBuffer.empty[String]
  for (i <- Range(0, in.size)) {
    if (!in.isNullAt(i) && in.get(i) != "") {
      ret += in.schema.fields(i).name
    }
  }
  ret.mkString(",")
}
val availableFieldsUdf = udf(availableFields)
df.withColumn("final_list", availableFieldsUdf(col("my_struct"))).show(false)
prints
+----------+---------------+-----------------+
|name |my_struct |final_list |
+----------+---------------+-----------------+
|James |[Java, XX, 120]|name,author,pages|
|Michael |[Java, , 200] |name,pages |
|Robert |[Java, XZ,] |name,author |
|Washington|[, XX, 120] |author,pages |
+----------+---------------+-----------------+
Without UDF.
Schema
scala> df.printSchema
root
|-- name: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- author: string (nullable = true)
| |-- pages: integer (nullable = true)
Constructing Expression
scala>
val expr = df
.select("my_struct.*") // Extracting struct columns.
.columns
.map(c => (c, trim(col(s"my_struct.${c}")))) // Constructing tuples: ("name", trim(col("my_struct.name")))
.map(c => when(c._2.isNotNull and c._2 =!= "",lit(c._1))) // Checking Not Null & Empty values.
expr: Array[org.apache.spark.sql.Column] = Array(CASE WHEN ((trim(my_struct.name) IS NOT NULL) AND (NOT (trim(my_struct.name) = ))) THEN name END, CASE WHEN ((trim(my_struct.author) IS NOT NULL) AND (NOT (trim(my_struct.author) = ))) THEN author END, CASE WHEN ((trim(my_struct.pages) IS NOT NULL) AND (NOT (trim(my_struct.pages) = ))) THEN pages END)
Applying Expression to DataFrame
scala> df.withColumn("final_list",concat_ws(",",expr:_*)).show
+----------+---------------+-----------------+
| name| my_struct| final_list|
+----------+---------------+-----------------+
| James|[Java, XX, 120]|name,author,pages|
| Michael| [Java, , 200]| name,pages|
| Robert| [Java, XZ,]| name,author|
|Washington| [, XX, 120]| author,pages|
+----------+---------------+-----------------+
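A PySpark sketch of the same expression-building idea, assuming the same my_struct column (as in the Scala version, trim is left to implicitly cast the integer field):

from pyspark.sql import functions as F

fields = df.select("my_struct.*").columns  # struct field names
flags = [
    F.when(
        F.trim(F.col(f"my_struct.{c}")).isNotNull()
        & (F.trim(F.col(f"my_struct.{c}")) != ""),
        F.lit(c)
    )
    for c in fields
]
df.withColumn("final_list", F.concat_ws(",", *flags)).show(truncate=False)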
I have a schema and I would like to check whether an array has elements inside it before exploding it. My schema looks like this:
|-- CaseNumber: string (nullable = true)
|-- Interactions: struct (nullable = true)
| |-- EmailInteractions: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- CreatedBy: string (nullable = true)
| | | |-- CreatedOn: string (nullable = true)
| | | |-- Direction: string (nullable = true)
| |-- PhoneInteractions: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- WebInteractions: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- EntityAction: string (nullable = true)
I would like to check if "EmailInteractions" has elements under it before I run the job that will explode it.
I have edited the question for clarity:
1. Check if the EmailInteractions array exists and has elements. If both are true, explode the array and finish; if either condition is false, go to step 2.
2. Check if the PhoneInteractions array exists and has elements. If both are true, explode the array and finish; if either condition is false, go to step 3.
3. Check if the WebInteractions array exists and has elements. If both are true, explode the array and finish; if either condition is false, finish.
I am new to coding and Databricks, so please help with this.
This is one way of doing it. Unfortunately, the information under StructField is not easily searchable, so I converted it to a string and searched the field by field name or by the keyword "StructField" to confirm that it contains the field.
val jsonWithNull = """{"a": 123,"b":null,"EmailInteractions":[{"CreatedBy":"test"}]}"""
val jsonWithoutNull = """{"a": 123,"b":3,"EmailInteractions":[{"CreatedBy":"test1"}]}"""
import spark.implicits._
val df = spark.read.json(Seq(jsonWithNull,jsonWithoutNull).toDS)
df.printSchema()
val field = df.schema.filter { f =>
if (f.dataType.typeName == "array" && f.toString().contains("CreatedBy")){
true
}
else{
false
}
}
println(field)
// check for field value being not null then explode
Result: List(StructField(EmailInteractions,ArrayType(StructType(StructField(CreatedBy,StringType,true)),true),true))
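Once you know the field is present, a minimal PySpark sketch for the "not null then explode" step could look like this; the nested column names are assumed from the schema in the question:

from pyspark.sql import functions as F

emails = (df
    .filter(F.col("Interactions.EmailInteractions").isNotNull()
            & (F.size("Interactions.EmailInteractions") > 0))
    .withColumn("EmailInteraction", F.explode("Interactions.EmailInteractions")))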
My Schema looks like below:
scala> airing.printSchema()
root
|-- program: struct (nullable = true)
| |-- detail: struct (nullable = true)
| | |-- contributors: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- contributorId: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- order: long (nullable = true)
I need to count based on the unique Actors, to find the most popular actors.
My code is as:
val castCounts = airing.groupBy("program.detail.contributors.name").count().sort(desc("count")).take(10)
To my shock, I am getting duplicates, as shown below. I expected each individual actor to occur once, with a distinct count.
Printing the results below:
[WrappedArray(),4344]
[WrappedArray(Matt Smith),16]
[WrappedArray(Phil Keoghan),15]
[WrappedArray(Don Adams, Barbara Feldon, Edward Platt),10]
[WrappedArray(Edward Platt, Don Adams, Barbara Feldon),10]
There are 2 steps:
1. Use the explode function to flatten your data so that each row has only one contributor.
val df = airing.withColumn("contributor", explode(col("program.detail.contributors")))
2. Get the result from the new df, in which the contributors have been exploded.
val castCounts = df.groupBy("contributor.name").count().sort(desc("count")).take(10)
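Putting the two steps together, here is a PySpark sketch of the same chain (column names as in the question):

from pyspark.sql import functions as F

castCounts = (airing
    .withColumn("contributor", F.explode(F.col("program.detail.contributors")))
    .groupBy("contributor.name")
    .count()
    .sort(F.desc("count"))
    .take(10))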