How to find an optimized join between 2 different dataframes in spark - apache-spark

I have 2 different datasets and I would like to join them, but there is no easy way to do it because they don't have a common column, and a crossJoin is not a good solution when working with big data. I already asked the question on Stack Overflow, but I really couldn't find an optimized solution to join them. My question on Stack Overflow is: looking if String contain a sub-string in differents Dataframes
I saw the solutions below, but I didn't find a good fit for my case.
Efficient string suffix detection
Efficient string matching in Apache Spark
Today, I found a funny solution :) I'm not sure if it will work, but let's try.
I add a new column to df_1 that contains the line number.
Example df_1:
| name   | id     |
|--------|--------|
| abc    | 1232   |
| azerty | 87564  |
| google | 374856 |
new df_1:
| name     | id     | new_id |
|----------|--------|--------|
| abc      | 1232   | 1      |
| azerty   | 87564  | 2      |
| google   | 374856 | 3      |
| explorer | 84763  | 4      |
The same for df_2:
Example df_2:
| adress |
|--------|
| UK     |
| USA    |
| EUROPE |
new df_2:
| adress | new_id |
|--------|--------|
| UK     | 1      |
| USA    | 2      |
| EUROPE | 3      |
Now that I have a common column between the two dataframes, I can do a left join using new_id as the key.
My question is: is this solution efficient?
How can I add a new_id column to each dataframe containing the row number?
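For reference, here is a rough, untested sketch of one way I could add such a new_id column and join on it. It forces all rows through a single partition for the numbering, which is exactly my efficiency worry on big data:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df_1 = spark.createDataFrame([("abc", 1232), ("azerty", 87564), ("google", 374856)], ["name", "id"])
df_2 = spark.createDataFrame([("UK",), ("USA",), ("EUROPE",)], ["adress"])

# row_number() needs an ordering; monotonically_increasing_id() preserves the
# incoming order, but a window without partitionBy funnels every row through
# a single partition, so this only scales to small or medium dataframes.
w = Window.orderBy(F.monotonically_increasing_id())
df_1 = df_1.withColumn("new_id", F.row_number().over(w))
df_2 = df_2.withColumn("new_id", F.row_number().over(w))

df_1.join(df_2, on="new_id", how="left").show()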

Since Spark uses lazy evaluation, execution does not start until an action is triggered.
So what you can do is simply call the Spark session's createDataFrame function and pass it the selected columns from df1 and df2. It will create a new dataframe as you need.
e.g. df3 = spark.createDataFrame([df1.select(''), df2.select('')])
Upvote if it works.

Related

Efficiently update rows of a postgres table from another table in another database based on a condition in a common column

I have two pandas DataFrames:
df1 from database A with connection parameters {"host":"hostname_a","port": "5432", "dbname":"database_a", "user": "user_a", "password": "secret_a"}. The column key is the primary key.
df1:
| | key | create_date | update_date |
|---:|------:|:-------------|:--------------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 |
| 1 | 57248 | | 2018-01-21 |
| 2 | 57249 | 1992-12-22 | 2016-01-31 |
| 3 | 57250 | | 2015-01-21 |
| 4 | 57251 | 1991-12-23 | 2015-01-21 |
| 5 | 57262 | | 2015-01-21 |
| 6 | 57263 | | 2014-01-21 |
df2 from database B with connection parameters {"host": "hostname_b","port": "5433", "dbname":"database_b", "user": "user_b", "password": "secret_b"}. The column id is the primary key (these values are originally the same as the ones in the column key of df1; it's only a renaming of the primary key column of df1).
df2:
| | id | create_date | update_date | user |
|---:|------:|:-------------|:--------------|:------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 | |
| 1 | 57248 | | 2018-01-21 | |
| 2 | 57249 | 1992-12-24 | 2020-10-11 | klm |
| 3 | 57250 | 2001-07-14 | 2019-21-11 | ptl |
| 4 | 57251 | 1991-12-23 | 2015-01-21 | |
| 5 | 57262 | | 2015-01-21 | |
| 6 | 57263 | | 2014-01-21 | |
Notice that row[2] and row[3] in df2 have more recent update_date values (2020-10-11 and 2019-21-11 respectively) than their counterparts in df1 (where id = key), because their create_date values have been modified (by the given users).
I would like to update rows of df1 (concretely, the create_date and update_date values) where update_date in df2 is more recent than the original value in df1 (for the same primary keys).
This is how I'm tackling this for the moment, using sqlalchemy and psycopg2 + the .to_sql() method of pandas' DataFrame:
import psycopg2
from sqlalchemy import create_engine

# creator expects a callable that returns a new connection, not a connection object
engine = create_engine(
    'postgresql+psycopg2://',
    creator=lambda: psycopg2.connect(**database_parameters_dictionary),
)
df1.update(df2)  # 1) maybe there is something better to do here?
with engine.connect() as connection:
    df1.to_sql(
        name="database_table_name",
        con=connection,
        schema="public",
        if_exists="replace",  # 2) maybe there is also something better to do here?
        index=True,
    )
The problem I have is that, according to the documentation, the if_exists argument can only do three things:
if_exists{‘fail’, ‘replace’, ‘append’}, default ‘fail’
Therefore, to update these two rows, I have to:
1) use the .update() method on df1 with df2 as an argument, together with
2) replacing the whole table via the .to_sql() method, which means drop + recreate.
As the tables are really large (more than 500'000 entries), I have the feeling that this does a lot of unnecessary work!
How could I efficiently update only those two newly updated rows? Do I have to generate custom SQL queries that compare the dates for each row and only take the ones that have really changed? But here again, I have the intuition that looping through all rows to compare the update dates will take "a lot" of time. What is the most efficient way to do that? (It would have been easier in pure SQL if the two tables were on the same host/database, but unfortunately that's not the case.)
Pandas can't do partial updates of a table, no. There is a longstanding open bug for supporting sub-whole-table-granularity updates in .to_sql(), but you can see from the discussion there that it's a very complex feature to support in the general case.
However, limiting it to just your situation, I think there's a reasonable approach you could take.
Instead of using df1.update(df2), put together an expression that yields only the changed records with their new values (I don't use pandas often so I don't know this offhand); then iterate over the resulting dataframe and build the UPDATE statements yourself (or with the SQLAlchemy expression layer, if you're using that). Then, use the connection to DB A to issue all the UPDATEs as one transaction. With an indexed PK, it should be as fast as this would ever be expected to be.
BTW, I don't think df1.update(df2) is exactly correct - from my reading, that would update all rows with any differing fields, not just the ones where update_date in df2 is more recent than in df1. But it's a moot point if update_date in df2 is only ever more recent than in df1.
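A rough, untested sketch of what I mean (assuming the target table is the public.database_table_name written by your to_sql() call, and that df1, df2 and database_parameters_dictionary are as in your question):
import psycopg2

# Pair up rows by primary key and keep only the ones where df2 is newer.
merged = df1.merge(df2.rename(columns={"id": "key"}), on="key", suffixes=("_old", "_new"))
# ISO-formatted date strings compare correctly with plain >; adapt if these are real date objects.
changed = merged[merged["update_date_new"] > merged["update_date_old"]]

conn = psycopg2.connect(**database_parameters_dictionary)
with conn, conn.cursor() as cur:  # a single transaction for all UPDATEs
    for row in changed.itertuples(index=False):
        cur.execute(
            'UPDATE public.database_table_name SET create_date = %s, update_date = %s WHERE "key" = %s',
            (row.create_date_new, row.update_date_new, row.key),
        )
conn.close()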

Replacing values that have duplicates in a pandas dataframe column

Suppose I have the following table that lists the maker of a unit.
import pandas as pd
df = pd.DataFrame({'Maker': ['Company1ID', 'SusanID', 'CeramiCorpID', 'PeterID', 'SaraID', 'CeramiCorpID', 'Company1ID']})
print(df)
Now consider that I have a much larger table with multiple Person and Corp IDs, and I want to reclassify these into two categories, Person and Corporation, as shown in the Expected column. The IDs are much more complex than what is shown (e.g. f00568ab456b) and are unique for each person or company, but only companies show up in multiple rows.
| Maker | Expected |
|--------------|----------|
| Company1ID | Corp |
| SusanID | Person |
| CeramiCorpID | Corp |
| PeterID | Person |
| SaraID | Person |
| CeramiCorpID | Corp |
| Company1ID | Corp |
I am basically stuck trying to understand whether I need to use .apply(lambda x) or .replace with some kind of condition on .duplicated(keep=False). I'm unsure how to go about it either way.
Help appreciated!
I'm not really sure this is what you want, but you could create the 'Expected' column like this:
df['Expected'] = ['Corp' if 'Corp' in maker else 'Person' for maker in df['Maker']]
EDIT:
If you want them to be classified by the number of occurrences:
df['Expected'] = ['Corp' if len(df[df['Maker'] == maker]) > 1 else 'Person' for maker in df['Maker']]
That would assume that there is no Corp which only occurs once. But if that can be the case, then how do you know whether it's a Person or a Corp?
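For completeness, a one-line sketch using the .duplicated(keep=False) you mentioned, resting on the same assumption that only corporate IDs repeat:
# True for every value that appears more than once anywhere in the column.
df['Expected'] = df['Maker'].duplicated(keep=False).map({True: 'Corp', False: 'Person'})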

PySpark compare two dataframes and find the match count

I have 2 PySpark dataframes that, after some manipulation, consist of 1 column each, but they have different lengths. Dataframe 1 contains ingredient names; dataframe 2 contains rows of long strings of ingredients.
DATAFRAME 1:
ingcomb.show(10,truncate=False)
+---------------------------------+
|products |
+---------------------------------+
|rebel crunch granola |
|creamed honey |
|mild cheddar with onions & chives|
|berry medley |
|sweet relish made with sea salt |
|spanish peanuts |
|stir fry seasoning mix |
|swiss all natural cheese |
|yellow corn meal |
|shredded wheat |
+---------------------------------+
only showing top 10 rows
DATAFRAME 2:
reging.show(10, truncate=30)
+------------------------------+
| ingredients|
+------------------------------+
|apple bean cookie fruit kid...|
|bake bastille day bon appét...|
|dairy fennel gourmet new yo...|
|bon appétit dairy free dinn...|
|bake bon appétit california...|
|bacon basil bon appétit foo...|
|asparagus boil bon appétit ...|
|cocktail party egg fruit go...|
|beef ginger gourmet quick &...|
|dairy free gourmet ham lunc...|
+------------------------------+
only showing top 10 rows
I need to create a loop (any other suggestions are welcome too!) that goes through dataframe 1, compares each value to the strings in dataframe 2 via "like", and gives me the total count of matches.
Desired outcome:
+--------------------+-----+
| ingredients|count|
+--------------------+-----+
|rebel crunch granola| 183|
|creamed honey | 87|
|berry medley | 67|
|spanish peanuts | 10|
+--------------------+-----+
I know that the following code works:
reging.filter("ingredients like '%sugar%'").count()
and was trying to implement something like
for i in ingcomb:
    x = reging.select("ingredients").filter("ingredients like '%i%'").count()
But I cannot get PySpark to treat 'i' as a value from ingcomb rather than the literal character i.
I have tried the solutions from
Spark Compare two dataframe and find the match count
but unfortunately they do not work.
I am running this in GCP and I get an error when I try to run toPandas; due to permissions, I cannot install pandas.
We were actually able to do a workaround, where we get counts within the dataframe first and then match with a join later. Please feel free to give better suggestions; we are newbies to coding.
from pyspark.sql import functions as f

counts = (
    reging.select(f.explode(f.expr("array(Ingredients)")).alias("col"))
    .groupBy("col")
    .count()
    .orderBy("count", ascending=False)
)
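For comparison, an untested sketch of a loop-free version of the original ask: join on a "contains" condition and count matches per ingredient name. Spark plans this as a broadcast nested loop join, so it is still roughly a cross join under the hood and may be slow on very large data:
from pyspark.sql import functions as F

matches = (
    reging.join(ingcomb, F.col("ingredients").contains(F.col("products")))
    .groupBy("products")
    .count()
    .orderBy("count", ascending=False)
)
matches.show(truncate=False)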

Get first and last item without using two joins

Currently I have two datasets, one parent and one child. The child dataset contains a "parentId" column that links to the parent table. The child dataset holds data about a person's actions, and the parent table holds data about the person. I want to get a dataset containing the person info and their first/last action.
The datasets look like this:
Parent:
id | name | gender
111| Alex | Male
222| Alice| Female
Child:
parentId | time | Action
111 | 12:01| Walk
111 | 12:03| Run
222 | 12:04| Walk
111 | 12:05| Jump
111 | 12:06| Run
The dataset I want to produce is:
id | name | gender | firstAction | lastAction |
111| Alex | Male | Walk | Run |
222| Alice| Female | Walk | Walk |
Currently I can achieve this using two window functions, something like:
WindowSpec w1 = Window.partitionBy("parentId").orderBy(col("time").asc())
WindowSpec w2 = Window.partitionBy("parentId").orderBy(col("time").desc())
and apply the WindowSpec to the child table using row_number().over(), like:
child.withColumn("rank1", row_number().over(w1))
     .withColumn("rank2", row_number().over(w2))
The issue I have is that later, when I need to join with the parent table, I need to join twice: once on parentId=id && rank1=1, and again on parentId=id && rank2=1.
I wonder if there is a way to only join once, which will be much more efficient.
Or I used the Window function incorrectly, and there is a better way to do it?
Thanks
You could join first and then use groupBy instead of window functions. This could work (not tested, as no programmatic dataframe is provided):
parent
  .join(child, $"parentId" === $"id")
  .groupBy($"parentId", $"name", $"gender")
  .agg(
    min(struct($"time", $"action")).as("firstAction"),
    max(struct($"time", $"action")).as("lastAction")
  )
  .select(
    $"parentId",
    $"name",
    $"gender",
    $"firstAction.action".as("firstAction"),
    $"lastAction.action".as("lastAction")
  )
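For reference, the same idea expressed in PySpark, in case that is what you are using (untested sketch, assuming dataframes named parent and child as in the question; min/max of a struct orders by its first field, here the time):
from pyspark.sql import functions as F

result = (
    parent.join(child, parent["id"] == child["parentId"])
    .groupBy("id", "name", "gender")
    .agg(
        F.min(F.struct("time", "Action")).alias("firstAction"),
        F.max(F.struct("time", "Action")).alias("lastAction"),
    )
    .select(
        "id", "name", "gender",
        F.col("firstAction.Action").alias("firstAction"),
        F.col("lastAction.Action").alias("lastAction"),
    )
)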

excel address as lookup array

First of all, thank you in advance.
The problem I am facing is that I have two different values I need to combine when I look up against a different table; however, I do not know which columns those two values will be in, and they can be different per row. Hopefully the example will help.
Lookup table:
| ID  | Benefit | Option | Tier | Benefit | Option | Tier |
|-----|---------|--------|------|---------|--------|------|
| 123 | 1       | 1      | 3    | 2       | 7      | 3    |
| 456 | 2       | 3      | 1    | 1       | 3      | 2    |
Current table:
| ID  | Benefit |
|-----|---------|
| 123 | 1       |
| 123 | 2       |
| 456 | 1       |
| 456 | 2       |
In the example I am giving there are only two possible positions it could be in, but in my actual program it could be in maybe 20 different locations. The one positive I have is that it will always be under a Benefit column, so what I was thinking is to concatenate Benefit & O4 and use INDEX/MATCH. I would like to dynamically concatenate based on the row my lookup is on.
Here is what I have got so far, but it's not working:
=INDEX(T3:X4,MATCH(N4,$S$3:$S$4,0),MATCH($O$3&O4,T2:X2&ADDRESS(ROW(INDEX($S$3:$S$4,MATCH(N4,$S$3:$S$4,0))),20):ADDRESS(ROW(INDEX($S$3:$S$4,MATCH(N4,$S$3:$S$4,0))),24),0))
where
ADDRESS(ROW(INDEX($S$3:$S$4,MATCH(N4,$S$3:$S$4,0))),20) does return T3
and ADDRESS(ROW(INDEX($S$3:$S$4,MATCH(N4,$S$3:$S$4,0))),24) returns X3
So I was hoping it would combine Benefit&1 and see that it's a match on T3.
I guess you are trying to find a formula to put in P4 to P7?
=INDEX($S$2:$X$4,MATCH(N4,$S$2:$S$4,0),SUMPRODUCT(($S$2:$X$2="wtwben")*(OFFSET($S$2:$X$2,MATCH(N4,$S$3:$S$4,0),0)=O4)*(COLUMN($S$2:$X$2)-COLUMN($S$2)+1))+1)
If the values to return are always numeric and there is only one match for each ID/Benefit combination (as it appears in your sample), then you can get the Option value with this formula in P4, copied down:
=SUMPRODUCT((S$3:S$4=N4)*(T$2:W$2="Benefit")*(T$3:W$4=O4),U$3:X$4)
[assumes the headers are per the first table shown in your question, i.e. where T2 value is "Benefit"]
Notice how the ranges change
Or, to return text values, or if the ID/Benefit combination repeats, this will give you the "first" match, where "first" means by row:
=INDIRECT(TEXT(AGGREGATE(15,6,(ROW(U$3:X$4)*1000+COLUMN(U$3:X$4))/(S$3:S$4=N4)/(T$2:W$2="Benefit")/(T$3:W$4=O4),1),"R0C000"),FALSE)
