I have a DataFrame with two categorical columns, similar to the following example:
+----+-------+-------+
| ID | Cat A | Cat B |
+----+-------+-------+
| 1 | A | B |
| 2 | B | C |
| 5 | A | B |
| 7 | B | C |
| 8 | A | C |
+----+-------+-------+
I have some processing to do that needs two steps: The first one needs the data to be grouped by both categorical columns. In the example, it would generate the following DataFrame:
+-------+-------+-----+
| Cat A | Cat B | Cnt |
+-------+-------+-----+
| A | B | 2 |
| B | C | 2 |
| A | C | 1 |
+-------+-------+-----+
Then, the next step consists on grouping only by CatA, to calculate a new aggregation, for example:
+-----+-----+
| Cat | Cnt |
+-----+-----+
| A | 3 |
| B | 2 |
+-----+-----+
Now come the questions:
In my solution, I create the intermediate dataframe by doing
val df2 = df.groupBy("catA", "catB").agg(...)
and then I aggregate this df2 to get the last one:
val df3 = df2.groupBy("catA").agg(...)
I assume it is more efficient than aggregating the first DF again. Is it a good assumption? Or it makes no difference?
Are there any suggestions of a more efficient way to achieve the same results?
Generally speaking it looks like a good approach and should be more efficient than aggregating data twice. Since shuffle files are implicitly cached at least part of the work should be performed only once. So when you call an action on df2 and subsequently on df3 you should see that stages corresponding to df2 have been skipped. Also partial structure enforced by the first shuffle may reduce memory requirements for the aggregation buffer during the second agg.
Unfortunately DataFrame aggregations, unlike RDD aggregations, cannot use custom partitioner. It means that you cannot compute both data frames using a single shuffle based on a value of catA. It means that second aggregation will require separate exchange hash partitioning. I doubt it justifies switching to RDDs.
Related
I have two pandas DataFrames:
df1 from database A with connection parameters {"host":"hostname_a","port": "5432", "dbname":"database_a", "user": "user_a", "password": "secret_a"}. The column key is the primary key.
df1:
| | key | create_date | update_date |
|---:|------:|:-------------|:--------------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 |
| 1 | 57248 | | 2018-01-21 |
| 2 | 57249 | 1992-12-22 | 2016-01-31 |
| 3 | 57250 | | 2015-01-21 |
| 4 | 57251 | 1991-12-23 | 2015-01-21 |
| 5 | 57262 | | 2015-01-21 |
| 6 | 57263 | | 2014-01-21 |
df2 from database B with connection parameters {"host": "hostname_b","port": "5433", "dbname":"database_b", "user": "user_b", "password": "secret_b"}. The column id is the primary key (these values are originally the same than the one in the column key in df1; it's only a renaming of the primary key column of df1).
df2:
| | id | create_date | update_date | user |
|---:|------:|:-------------|:--------------|:------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 | |
| 1 | 57248 | | 2018-01-21 | |
| 2 | 57249 | 1992-12-24 | 2020-10-11 | klm |
| 3 | 57250 | 2001-07-14 | 2019-21-11 | ptl |
| 4 | 57251 | 1991-12-23 | 2015-01-21 | |
| 5 | 57262 | | 2015-01-21 | |
| 6 | 57263 | | 2014-01-21 | |
Notice that the row[2] and row[3] in df2 have more recent update_date values (2020-10-11 and 2019-21-11 respectively) than their counterpart in df1 (where id = key) because their creation_date have been modified (by the given users).
I would like to update rows (i.e. in concrete terms; create_date and update_date values) of df1 where update_date in df2 is more recent than its original value in df1 (for the same primary keys).
This is how I'm tackling this for the moment, using sqlalchemy and psycopg2 + the .to_sql() method of pandas' DataFrame:
import psycopg2
from sqlalchemy import create_engine
connector = psycopg2.connect(**database_parameters_dictionary)
engine = create_engine('postgresql+psycopg2://', creator=connector)
df1.update(df2) # 1) maybe there is something better to do here?
with engine.connect() as connection:
df1.to_sql(
name="database_table_name",
con=connection,
schema="public",
if_exists="replace", # 2) maybe there is also something better to do here?
index=True
)
The problem I have is that, according to the documentation, the if_exists argument can only do three things:
if_exists{‘fail’, ‘replace’, ‘append’}, default ‘fail’
Therefore, to update these two rows, I have to;
1) use .update() method on df1 using df2 as an argument, together with
2) replacing the whole table inside the .to_sql() method, which means "drop+recreate".
As the tables are really large (more than 500'000 entries), I have the feeling that this will need a lot of unnecessary work!
How could I efficiently update only those two newly updated rows? Do I have to generate some custom SQL queries to compares the dates for each rows and only take the ones that have really changed? But here again, I have the intuition that, looping through all rows to compare the update dates will take "a lot" of time. How is the more efficient way to do that? (It would have been easier in pure SQL if the two tables were on the same host/database but it's unfortunately not the case).
Pandas can't do partial updates of a table, no. There is a longstanding open bug for supporting sub-whole-table-granularity updates in .to_sql(), but you can see from the discussion there that it's a very complex feature to support in the general case.
However, limiting it to just your situation, I think there's a reasonable approach you could take.
Instead of using df1.update(df2), put together an expression that yields only the changed records with their new values (I don't use pandas often so I don't know this offhand); then iterate over the resulting dataframe and build the UPDATE statements yourself (or with the SQLAlchemy expression layer, if you're using that). Then, use the connection to DB A to issue all the UPDATEs as one transaction. With an indexed PK, it should be as fast as this would ever be expected to be.
BTW, I don't think df1.update(df2) is exactly correct - from my reading, that would update all rows with any differing fields, not just when updated_date > prev updated_date. But it's a moot point if updated_date in df2 is only ever more recent than those in df1.
I have a dataframe with multiple columns as such:
| ID | Grouping | Field_1 | Field_2 | Field_3 | Field_4 |
|----|----------|---------|---------|---------|---------|
| 1 | AA | A | B | C | M |
| 2 | AA | D | E | F | N |
I want to create 2 new columns and store an list of of existing columns in new fields with the use of a group by on an existing field. Such that my new dataframe would look like this:
| ID | Grouping | Group_by_list1 | Group_by_list2 |
|----|----------|----------------|----------------|
| 1 | AA | [A,B,C,M] | [D,E,F,N] |
Does Pyspark have a way of handling this kind of wrangling with a dataframe to create this kind of an expected result?
Added inline comments, Check below code.
df \
.select(F.col("id"),F.col("Grouping"),F.array(F.col("Field_1"),F.col("Field_2"),F.col("Field_3"),F.col("Field_4")).as("grouping_list"))\ # Creating array of required columns.
.groupBy(F.col("Grouping"))\ # Grouping based on Grouping column.
.agg(F.first(F.col("id")).alias("id"),F.first(F.col("grouping_list")).alias("Group_by_list1"),F.last(F.col("grouping_list")).alias("Group_by_list2"))\ # first value from id, first value from grouping_list list, last value from grouping_list
.select("id","Grouping","Group_by_list1","Group_by_list2")\ # selecting all columns.
.show(false)
+---+--------+--------------+--------------+
|id |Grouping|Group_by_list1|Group_by_list2|
+---+--------+--------------+--------------+
|1 |AA |[A, B, C, M] |[D, E, F, N] |
+---+--------+--------------+--------------+
Note: This solution will give correct result only if DataFrame has two rows.
I am trying to replicate rows inside a dataset multiple times with different values for a column in Apache Spark. Lets say I have a dataset as follows
Dataset A
| num | group |
| 1 | 2 |
| 3 | 5 |
Another dataset have different columns
Dataset B
| id |
| 1 |
| 4 |
I would like to replicate the rows from Dataset A with column values of Dataset B. You can say a join without any conditional criteria that needs to be done. So resulting dataset should look like.
| id | num | group |
| 1 | 1 | 2 |
| 1 | 3 | 5 |
| 4 | 1 | 2 |
| 4 | 3 | 5 |
Can anyone suggest how the above can be achieved? As per my understanding, join requires a condition and columns to be matched between 2 datasets.
What you want to do is called CartesianProduct and df1.crossJoin(df2) will achieve it. But be careful with it because it is a very heavy operation.
/* My question is language-agnostic I think, but I'm using PySpark if it matters. */
Situation
I currently have two Spark DataFrames:
One with per-minute data (1440 rows per person and day) of a person's heart rate per minute:
| Person | date | time | heartrate |
|--------+------------+-------+-----------|
| 1 | 2018-01-01 | 00:00 | 70 |
| 1 | 2018-01-01 | 00:01 | 72 |
| ... | ... | ... | ... |
| 4 | 2018-10-03 | 11:32 | 123 |
| ... | ... | ... | ... |
And another DataFrame with daily data (1 row per person and day), of daily metadata, including the results of a clustering of days, i.e. which cluster day X of person Y fell into:
| Person | date | cluster | max_heartrate |
|--------+------------+---------+----------------|
| 1 | 2018-01-01 | 1 | 180 |
| 1 | 2018-01-02 | 4 | 166 |
| ... | ... | ... | ... |
| 4 | 2018-10-03 | 1 | 147 |
| ... | ... | ... | ... |
(Note that clustering is done separately per person, so cluster 1 for person 1 has nothing to do with person 2's cluster 1.)
Goal
I now want to compute, say, the mean heart rate per cluster and per person, that is, each person gets different means. If I have three clusters, I am looking for this DF:
| Person | cluster | mean_heartrate |
|--------+---------+----------------|
| 1 | 1 | 123 |
| 1 | 2 | 89 |
| 1 | 3 | 81 |
| 2 | 1 | 80 |
| ... | ... | ... |
How do I best do this? Conceptually, I want to group these two DataFrames per person and send two DF chunks into an apply function. In there (i.e. per person), I'd group and aggregate the daily DF per day, then join the daily DF's cluster IDs, then compute the per-cluster mean values.
But grouping/applying multiple DFs doesn't work, right?
Ideas
I have two ideas and am not sure which, if any, make sense:
Join the daily DF to the per-minute DF before grouping, which would result in highly redundant data (i.e. the cluster ID replicated for each minute). In my "real" application, I will probably have per-person data too (e.g. height/weight), which would be a completely constant column then, i.e. even more memory wasted. Maybe that's the only/best/accepted way to do it?
Before applying, transform the DF into a DF that can hold complex structures, e.g. like
.
| Person | dataframe | key | column | value |
|--------+------------+------------------+-----------+-------|
| 1 | heartrates | 2018-01-01 00:00 | heartrate | 70 |
| 1 | heartrates | 2018-01-01 00:01 | heartrate | 72 |
| ... | ... | ... | ... | ... |
| 1 | clusters | 2018-01-01 | cluster | 1 |
| ... | ... | ... | ... | ... |
or maybe even
| Person | JSON |
|--------+--------|
| 1 | { ...} |
| 2 | { ...} |
| ... | ... |
What's the best practice here?
But grouping/applying multiple DFs doesn't work, right?
No, AFAIK this does not work not in pyspark nor pandas.
Join the daily DF to the per-minute DF before grouping...
This is the way to go in my opinion. You don't need to merge all redundant columns but only those requrired for your groupby-operation. There is no way to avoid redundancy for your groupby-columns as they will be needed for the groupby-operation.
In pandas, it is possible to provide an extra groupby-column as a pandas Series specifically but it requires to have the exact same shape as the to be grouped dataframe. However, in order create the groupby-column, you will need a merge anyway.
Before applying, transform the DF into a DF that can hold complex structures
Performance and memory wise, I would not go with this solution unless you have multiple required groupby operations which will benefit from more complex data structures. In fact, you will need to put in some effort to actually create the data structure in the first place.
I'm trying to do a conditional explode in Spark Structured Streaming.
For instance, my streaming dataframe looks like follows (totally making the data up here). I want to explode the employees array into separate rows of arrays when contingent = 1. When contingent = 0, I need to let the array be as is.
|----------------|---------------------|------------------|
| Dept ID | Employees | Contingent |
|----------------|---------------------|------------------|
| 1 | ["John", "Jane"] | 1 |
|----------------|---------------------|------------------|
| 4 | ["Amy", "James"] | 0 |
|----------------|---------------------|------------------|
| 2 | ["David"] | 1 |
|----------------|---------------------|------------------|
So, my output should look like (I do not need to display the contingent column:
|----------------|---------------------|
| Dept ID | Employees |
|----------------|---------------------|
| 1 | ["John"] |
|----------------|---------------------|
| 1 | ["Jane"] |
|----------------|---------------------|
| 4 | ["Amy", "James"] |
|----------------|---------------------|
| 2 | ["David"] |
|----------------|---------------------|
There are a couple challenges I'm currently facing:
Exploding Arrays conditionally
exploding arrays into arrays (rather than strings in this case)
In Hive, there was a concept of UDTF (user-defined table functions) that would allow me to do this. Wondering if there is anything comparable to it?
Use flatMap to explode and specify whatever condition you want.
case class Department (Dept_ID: String, Employees: Array[String], Contingent: Int)
case class DepartmentExp (Dept_ID: String, Employees: Array[String])
val ds = df.as[Department]
ds.flatMap(dept => {
if (dept.Contingent == 1) {
dept.Employees.map(emp => DepartmentExp(dept.Dept_ID, Array(emp)))
} else {
Array(DepartmentExp(dept.Dept_ID, dept.Employees))
}
}).as[DepartmentExp]