I have the following Dataset
case class Department(deptId: String, locations: Seq[String])
// using spark 2.0.2
// I have a Dataset `ds` of type Department
+------+---------------+
|deptId|      locations|
+------+---------------+
|    d1| [delhi,kerala]|
|    d1|             []|
|   dp2|             []|
|   dp2|    [hyderabad]|
+------+---------------+
I intend to convert it to:
// Dataset `result` of type Department itself
+------+---------------+
|deptId|      locations|
+------+---------------+
|    d1| [delhi,kerala]|
|   dp2|    [hyderabad]|
+------+---------------+
I do the following:
val flatten = udf((xs: Seq[Seq[String]]) => xs.flatten)

val result = ds
  .groupBy("deptId")
  .agg(flatten(collect_list("locations")).as("locations"))
My question is: is Spark smart enough not to shuffle the empty locations, i.e. [], around?
PS: I am not sure if this is a stupid question.
Yes and no:
Yes - collect_list performs map-side aggregation, so if there are multiple values per grouping key, data will be merged before shuffle.
No - because an empty list is not the same as missing data. If that's not the desired behavior, you should filter the data first:
ds.filter(size($"locations") > 0).groupBy("deptId").agg(...)
but keep in mind that it will yield a different result if a deptId has only empty arrays.
I have the following DataFrame, extracted with the following command:
extract = data.select('properties.id', 'flags')
| id | flags |
|-------| ---------------------------|
| v_001 | "{"93":true,"83":true}" |
| v_002 | "{"45":true,"76":true}" |
The desired result I want is:
| id | flags |
|-------| ------|
| v_001 | 93 |
| v_001 | 83 |
| v_002 | 45 |
| v_002 | 76 |
I tried to apply explode as follows:
extract = data.select('properties.id', explode(col('flags')))
But I encountered the following:
cannot resolve 'explode(flags)' due to data type mismatch: input to function explode should be array or map type, not struct<93:boolean,83:boolean,45:boolean,76:boolean>
This makes sense as the schema of the column is not compatible with the explode function. How can I adjust the function to get my desired result? Is there a better way to solve this problem?
P.S.: The desired table schema is not the best design, but that is out of my scope since it would involve another discussion.
As you may have already seen, explode requires an ArrayType (or MapType), and it seems you only need the keys from the dict in flags.
So you can first convert flags to a MapType and use map_keys to extract all the keys into a list.
df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
This will result in something like this:
+-----+--------+
| id| flags|
+-----+--------+
|v_001|[93, 83]|
|v_002|[45, 76]|
+-----+--------+
Then you can use explode on the flags.
.select('id', F.explode('flags'))
+-----+---+
| id|col|
+-----+---+
|v_001| 93|
|v_001| 83|
|v_002| 45|
|v_002| 76|
+-----+---+
The whole code
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, BooleanType

df = (df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
      .select('id', F.explode('flags')))
Update
It is probably better to supply the schema when reading so that flags comes in as a MapType, but if your JSON is complex and it is hard to create the schema, you can convert the struct to a JSON string first and then convert it to a MapType.
# Add this line before `from_json`
df = df.select('id', F.to_json('flags').alias('flags'))

# Or you can do it in one shot.
df = (df.withColumn('flags', F.map_keys(F.from_json(F.to_json('flags'), MapType(StringType(), BooleanType()))))
      .select('id', F.explode('flags')))
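To make that first suggestion concrete, here is a minimal sketch of supplying the schema up front - assuming the source is JSON files and that properties only contains an id string field (the path and field types here are placeholders; adjust to the real data):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, MapType

spark = SparkSession.builder.getOrCreate()

# Declaring flags as a MapType up front avoids the struct -> JSON string round trip.
schema = StructType([
    StructField("properties", StructType([StructField("id", StringType())])),
    StructField("flags", MapType(StringType(), BooleanType())),
])

data = spark.read.schema(schema).json("path/to/input")  # placeholder path

extract = (data
           .select(F.col("properties.id").alias("id"), F.map_keys("flags").alias("flags"))
           .select("id", F.explode("flags")))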
I have two pandas DataFrames:
df1 from database A with connection parameters {"host":"hostname_a","port": "5432", "dbname":"database_a", "user": "user_a", "password": "secret_a"}. The column key is the primary key.
df1:
| | key | create_date | update_date |
|---:|------:|:-------------|:--------------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 |
| 1 | 57248 | | 2018-01-21 |
| 2 | 57249 | 1992-12-22 | 2016-01-31 |
| 3 | 57250 | | 2015-01-21 |
| 4 | 57251 | 1991-12-23 | 2015-01-21 |
| 5 | 57262 | | 2015-01-21 |
| 6 | 57263 | | 2014-01-21 |
df2 from database B with connection parameters {"host": "hostname_b","port": "5433", "dbname":"database_b", "user": "user_b", "password": "secret_b"}. The column id is the primary key (these values are originally the same as those in the key column of df1; it is only a renaming of df1's primary key column).
df2:
| | id | create_date | update_date | user |
|---:|------:|:-------------|:--------------|:------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 | |
| 1 | 57248 | | 2018-01-21 | |
| 2 | 57249 | 1992-12-24 | 2020-10-11 | klm |
| 3 | 57250 | 2001-07-14 | 2019-21-11 | ptl |
| 4 | 57251 | 1991-12-23 | 2015-01-21 | |
| 5 | 57262 | | 2015-01-21 | |
| 6 | 57263 | | 2014-01-21 | |
Notice that rows 2 and 3 in df2 have more recent update_date values (2020-10-11 and 2019-21-11 respectively) than their counterparts in df1 (where id = key), because their create_date values have been modified (by the given users).
I would like to update the rows of df1 (concretely, the create_date and update_date values) where update_date in df2 is more recent than its original value in df1 (for the same primary keys).
This is how I'm tackling this for the moment, using sqlalchemy and psycopg2 + the .to_sql() method of pandas' DataFrame:
import psycopg2
from sqlalchemy import create_engine
# `creator` must be a callable that returns a new DBAPI connection
engine = create_engine(
    'postgresql+psycopg2://',
    creator=lambda: psycopg2.connect(**database_parameters_dictionary)
)

df1.update(df2)  # 1) maybe there is something better to do here?

with engine.connect() as connection:
    df1.to_sql(
        name="database_table_name",
        con=connection,
        schema="public",
        if_exists="replace",  # 2) maybe there is also something better to do here?
        index=True
    )
The problem I have is that, according to the documentation, the if_exists argument can only do three things:
if_exists{‘fail’, ‘replace’, ‘append’}, default ‘fail’
Therefore, to update these two rows, I have to:
1) use the .update() method on df1 with df2 as an argument, together with
2) replace the whole table inside the .to_sql() call, which means "drop + recreate".
As the tables are really large (more than 500'000 entries), I have the feeling that this will require a lot of unnecessary work!
How could I efficiently update only those two newly updated rows? Do I have to generate custom SQL queries that compare the dates for each row and only take the ones that have really changed? Here again, I have the intuition that looping through all rows to compare the update dates will take "a lot" of time. What is the most efficient way to do this? (It would have been easier in pure SQL if the two tables were on the same host/database, but unfortunately that's not the case.)
Pandas can't do partial updates of a table, no. There is a longstanding open bug for supporting sub-whole-table-granularity updates in .to_sql(), but you can see from the discussion there that it's a very complex feature to support in the general case.
However, limiting it to just your situation, I think there's a reasonable approach you could take.
Instead of using df1.update(df2), put together an expression that yields only the changed records with their new values (I don't use pandas often, so I don't know the best way to do this offhand); then iterate over the resulting DataFrame and build the UPDATE statements yourself (or with the SQLAlchemy expression layer, if you're using that). Then use the connection to database A to issue all the UPDATEs as one transaction. With an indexed primary key, this should be about as fast as this operation can be expected to be.
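As a rough sketch of that approach (assuming the dates compare correctly as stored, and reusing the table name and database-A connection details from the question; the pandas expression for picking the changed rows is just one way to do it):

from sqlalchemy import create_engine, text

# Align both frames on the primary key (key in df1, id in df2).
df1_pk = df1.set_index("key")
df2_pk = df2.set_index("id")

# Keep only the rows of df2 whose update_date is strictly newer than df1's.
common = df1_pk.index.intersection(df2_pk.index)
newer = df2_pk.loc[common][df2_pk.loc[common, "update_date"] > df1_pk.loc[common, "update_date"]]

# Connection to database A (parameters taken from the question).
engine = create_engine("postgresql+psycopg2://user_a:secret_a@hostname_a:5432/database_a")

update_stmt = text(
    "UPDATE database_table_name "
    "SET create_date = :create_date, update_date = :update_date "
    "WHERE key = :key"
)

# Issue all UPDATEs in a single transaction; with an index on key this stays fast.
with engine.begin() as connection:
    for pk, row in newer.iterrows():
        connection.execute(update_stmt, {
            "key": pk,
            "create_date": row["create_date"],
            "update_date": row["update_date"],
        })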
BTW, I don't think df1.update(df2) is exactly correct - from my reading, that would overwrite any row with differing fields, not just the ones where update_date in df2 is more recent than in df1. But it's a moot point if update_date in df2 is only ever more recent than in df1.
I'm trying to do a conditional explode in Spark Structured Streaming.
For instance, my streaming DataFrame looks as follows (I'm totally making the data up here). I want to explode the employees array into separate rows of arrays when contingent = 1. When contingent = 0, I need to leave the array as is.
|----------------|---------------------|------------------|
| Dept ID | Employees | Contingent |
|----------------|---------------------|------------------|
| 1 | ["John", "Jane"] | 1 |
|----------------|---------------------|------------------|
| 4 | ["Amy", "James"] | 0 |
|----------------|---------------------|------------------|
| 2 | ["David"] | 1 |
|----------------|---------------------|------------------|
So my output should look like this (I do not need to display the contingent column):
|----------------|---------------------|
| Dept ID | Employees |
|----------------|---------------------|
| 1 | ["John"] |
|----------------|---------------------|
| 1 | ["Jane"] |
|----------------|---------------------|
| 4 | ["Amy", "James"] |
|----------------|---------------------|
| 2 | ["David"] |
|----------------|---------------------|
There are a couple of challenges I'm currently facing:
exploding arrays conditionally
exploding arrays into arrays (rather than into strings, in this case)
In Hive, there is the concept of UDTFs (user-defined table functions) that would allow me to do this. Is there anything comparable in Spark?
Use flatMap to explode and specify whatever condition you want.
case class Department(Dept_ID: String, Employees: Array[String], Contingent: Int)
case class DepartmentExp(Dept_ID: String, Employees: Array[String])

val ds = df.as[Department]

ds.flatMap(dept => {
  if (dept.Contingent == 1) {
    // one output row per employee, each wrapped in a single-element array
    dept.Employees.map(emp => DepartmentExp(dept.Dept_ID, Array(emp)))
  } else {
    // keep the Employees array as is
    Array(DepartmentExp(dept.Dept_ID, dept.Employees))
  }
}).as[DepartmentExp]
I'm working with Spark 2.2.0.
I have a DataFrame holding more than 20 columns. In the example below, PERIOD is a week number and TYPE a type of store (Hypermarket or Supermarket).
table.show(10)
+--------------------+-------------------+-----------------+
| PERIOD| TYPE| etc......
+--------------------+-------------------+-----------------+
| W1| HM|
| W2| SM|
| W3| HM|
etc...
I want to do a simple groupBy (here with PySpark, but Scala or pyspark-sql gives the same result):
total_stores = table.groupby("PERIOD", "TYPE").agg(countDistinct("STORE_DESC"))
total_stores2 = total_stores.withColumnRenamed("count(DISTINCT STORE_DESC)", "NB STORES (TOTAL)")
total_stores2.show(10)
+--------------------+-------------------+-----------------+
| PERIOD| TYPE|NB STORES (TOTAL)|
+--------------------+-------------------+-----------------+
|CMA BORGO -SANTA ...| BORGO| 1|
| C ATHIS MONS| ATHIS MONS CEDEX| 1|
| CMA BOSC LE HARD| BOSC LE HARD| 1|
The problem is not in the calculation: the columns get mixed up - PERIOD contains store names, TYPE contains cities, and so on.
I have no clue why. Everything else works fine.
I have a DataFrame with two categorical columns, similar to the following example:
+----+-------+-------+
| ID | Cat A | Cat B |
+----+-------+-------+
| 1 | A | B |
| 2 | B | C |
| 5 | A | B |
| 7 | B | C |
| 8 | A | C |
+----+-------+-------+
I have some processing to do that needs two steps. The first one requires the data to be grouped by both categorical columns; in the example, it would generate the following DataFrame:
+-------+-------+-----+
| Cat A | Cat B | Cnt |
+-------+-------+-----+
| A | B | 2 |
| B | C | 2 |
| A | C | 1 |
+-------+-------+-----+
Then the next step consists of grouping only by CatA to calculate a new aggregation, for example:
+-----+-----+
| Cat | Cnt |
+-----+-----+
| A | 3 |
| B | 2 |
+-----+-----+
Now come the questions:
In my solution, I create the intermediate dataframe by doing
val df2 = df.groupBy("catA", "catB").agg(...)
and then I aggregate this df2 to get the last one:
val df3 = df2.groupBy("catA").agg(...)
I assume this is more efficient than aggregating the original DataFrame again. Is that a good assumption? Or does it make no difference?
Are there any suggestions of a more efficient way to achieve the same results?
Generally speaking it looks like a good approach and should be more efficient than aggregating the original data twice. Since shuffle files are implicitly cached, at least part of the work should be performed only once: when you call an action on df2 and subsequently on df3, you should see that the stages corresponding to df2 have been skipped. Also, the partial structure enforced by the first shuffle may reduce the memory requirements for the aggregation buffer during the second agg.
Unfortunately DataFrame aggregations, unlike RDD aggregations, cannot use a custom partitioner. This means you cannot compute both data frames using a single shuffle based on the value of catA, so the second aggregation will require a separate exchange (hash partitioning). I doubt that alone justifies switching to RDDs.
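For illustration (using count and sum as stand-in aggregations, and PySpark only for brevity - the Scala version behaves the same), you can see the shuffle reuse in the Spark UI:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for the real data.
df = spark.createDataFrame(
    [(1, "A", "B"), (2, "B", "C"), (5, "A", "B"), (7, "B", "C"), (8, "A", "C")],
    ["id", "catA", "catB"],
)

df2 = df.groupBy("catA", "catB").agg(F.count("*").alias("cnt"))
df3 = df2.groupBy("catA").agg(F.sum("cnt").alias("cnt"))

df2.collect()  # first action: performs the (catA, catB) shuffle
df3.collect()  # second action: the stages that produced df2 show up as skipped in the UI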