Use a dataframe as lookup for another dataframe - python-3.x

I have two dataframes, df_1 and df_2.
df_1 is my master dataframe and df_2 is a lookup dataframe.
I want to test whether the value in df_1['col_c1'] contains any of the values from df_2['col_a2'].
If it does (there can be multiple matches!):
add the value(s) from df_2['col_b2'] to df_1['col_d1']
add the value(s) from df_2['col_c2'] to df_1['col_e1']
How can I achieve this?
I really have no idea, and therefore I can't share any code for this.
Sample df_1
col_a1 | col_b1 | col_c1 | col_d1 | col_e1
----------------------------------------------------
1_001 | aaaaaa | bbbbccccdddd | |
1_002 | zzzzz | ggggjjjjjkkkkk | |
1_003 | pppp | qqqqffffgggg | |
1_004 | sss | wwwcccyyy | |
1_005 | eeeeee | eecccffffll | |
1_006 | tttt | hhggeeuuuuu | |
Sample df_2
col_a2 | col_b2 | col_c2
------------------------------
ccc | 2_001 | some_data_c
jjj | 2_002 | some_data_j
fff | 2_003 | some_data_f
Desired output df_1
col_a1 | col_b1 | col_c1 | col_d1 | col_e1
------------------------------------------------------------------------------
1_001 | aaaaaa | bbbbccccdddd | 2_001 | some_data_c
1_002 | zzzzz | ggggjjjjjkkkkk | 2_002 | some_data_j
1_003 | pppp | qqqqffffgggg | 2_003 | some_data_f
1_004 | sss | wwwcccyyy | 2_001 | some_data_c
1_005 | eeeeee | eecccffffll | 2_001;2_003 | some_data_c; some_data_f
1_006 | tttt | hhggeeuuuuu | |
df_1 has approx. 45,000 rows and df_2 approx. 16,000 rows. (I also added a non-matching row to the sample.)
I've been struggling with this for hours, but I really have no idea.
I don't think merging is an option because there's no exact match.
Your help is greatly appreciated.
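For reference, the sample frames can be built like this (a minimal sketch; the values are copied from the tables above):
import pandas as pd

df_1 = pd.DataFrame({
    'col_a1': ['1_001', '1_002', '1_003', '1_004', '1_005', '1_006'],
    'col_b1': ['aaaaaa', 'zzzzz', 'pppp', 'sss', 'eeeeee', 'tttt'],
    'col_c1': ['bbbbccccdddd', 'ggggjjjjjkkkkk', 'qqqqffffgggg',
               'wwwcccyyy', 'eecccffffll', 'hhggeeuuuuu'],
    'col_d1': [''] * 6,
    'col_e1': [''] * 6,
})
df_2 = pd.DataFrame({
    'col_a2': ['ccc', 'jjj', 'fff'],
    'col_b2': ['2_001', '2_002', '2_003'],
    'col_c2': ['some_data_c', 'some_data_j', 'some_data_f'],
})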

Use:
#extract the matching values from df_2["col_a2"] into a new column
s = (df_1['col_c1'].str.extractall(f'({"|".join(df_2["col_a2"])})')[0].rename('new')
       .reset_index(level=1, drop=True))
#repeat rows that have multiple matches
df_1 = df_1.join(s)
#add the new columns by mapping against df_2
df_1['col_d1'] = df_1['new'].map(df_2.set_index('col_a2')['col_b2'])
df_1['col_e1'] = df_1['new'].map(df_2.set_index('col_a2')['col_c2'])
#aggregate back to one row per original row, joining multiple matches
cols = df_1.columns.difference(['new', 'col_d1', 'col_e1']).tolist()
df = df_1.drop('new', axis=1).groupby(cols).agg(','.join).reset_index()
print(df)
col_a1 col_b1 col_c1 col_d1 col_e1
0 1_001 aaaaaa bbbbccccdddd 2_001 some_data_c
1 1_002 zzzzz ggggjjjjjkkkkk 2_002 some_data_j
2 1_003 pppp qqqqffffgggg 2_003 some_data_f
3 1_004 sss wwwcccyyy 2_001 some_data_c
4 1_005 eeeeee eecccffffll 2_001,2_003 some_data_c,some_data_f
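Note that the extra non-matching row (1_006) ends up with NaN in the mapped columns, which the plain string join cannot aggregate. A small variant that keeps that row with blanks, as in the desired output (a sketch; it assumes the intermediate 'new', 'col_d1' and 'col_e1' columns from the steps above):
# fill NaN from non-matching rows so ','.join still works and 1_006 is kept with blanks
df_1[['col_d1', 'col_e1']] = df_1[['col_d1', 'col_e1']].fillna('')
cols = df_1.columns.difference(['new', 'col_d1', 'col_e1']).tolist()
df = df_1.drop('new', axis=1).groupby(cols).agg(','.join).reset_index()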

This will also solve it (here df and df2 stand for df_1 and df_2):
df['col_d1'] = df.apply(lambda x: ','.join([df2['col_b2'][i] for i in range(len(df2)) if df2['col_a2'][i] in x.col_c1]), axis=1)
df['col_e1'] = df.apply(lambda x: ','.join([df2['col_c2'][i] for i in range(len(df2)) if df2['col_a2'][i] in x.col_c1]), axis=1)
Output
col_a1 col_b1 col_c1 col_d1 \
0 1_001 aaaaaa bbbbccccdddd 2_001
1 1_002 zzzzz ggggjjjjjkkkkk 2_002
2 1_003 pppp qqqqffffgggg 2_003
3 1_004 sss wwwcccyyy 2_001
4  1_005  eeeeee     eecccffffll  2_001,2_003
col_e1
0 some_data_c
1 some_data_j
2 some_data_f
3 some_data_c
4  some_data_c,some_data_f

Related

Create a dataframe based on X days backward observation

Considering that I have the following DF:
|-----------------|
|Date | Cod |
|-----------------|
|2022-08-01 | A |
|2022-08-02 | A |
|2022-08-03 | A |
|2022-08-04 | A |
|2022-08-05 | A |
|2022-08-01 | B |
|2022-08-02 | B |
|2022-08-03 | B |
|2022-08-04 | B |
|2022-08-05 | B |
|-----------------|
And considering a backward observation window of 2 days, how can I generate the following output DF?
|------------------------------|
|RefDate | Date | Cod
|------------------------------|
|2022-08-03 | 2022-08-01 | A |
|2022-08-03 | 2022-08-02 | A |
|2022-08-03 | 2022-08-03 | A |
|2022-08-04 | 2022-08-02 | A |
|2022-08-04 | 2022-08-03 | A |
|2022-08-04 | 2022-08-04 | A |
|2022-08-05 | 2022-08-03 | A |
|2022-08-05 | 2022-08-04 | A |
|2022-08-05 | 2022-08-05 | A |
|2022-08-03 | 2022-08-01 | B |
|2022-08-03 | 2022-08-02 | B |
|2022-08-03 | 2022-08-03 | B |
|2022-08-04 | 2022-08-02 | B |
|2022-08-04 | 2022-08-03 | B |
|2022-08-04 | 2022-08-04 | B |
|2022-08-05 | 2022-08-03 | B |
|2022-08-05 | 2022-08-04 | B |
|2022-08-05 | 2022-08-05 | B |
|------------------------------|
I know that I can use loops to generate this output DF, but loops don't perform well, since I can't cache the DF in memory (my original DF has approx. 6 billion rows). So, what is the best way to get this output?
MVCE:
from pyspark.sql.types import StructType, StructField, StringType

data_1 = [
    ("2022-08-01", "A"),
    ("2022-08-02", "A"),
    ("2022-08-03", "A"),
    ("2022-08-04", "A"),
    ("2022-08-05", "A"),
    ("2022-08-01", "B"),
    ("2022-08-02", "B"),
    ("2022-08-03", "B"),
    ("2022-08-04", "B"),
    ("2022-08-05", "B")
]
schema_1 = StructType([
    StructField("Date", StringType(), True),
    StructField("Cod", StringType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
You could try a self join. If your cluster and session are configured optimally, it should work even with 6 billion rows.
from pyspark.sql import functions as func

# data_sdf is the input dataframe (df_1 from the MVCE)
data_sdf.alias('a'). \
    join(data_sdf.alias('b'),
         [func.col('a.cod') == func.col('b.cod'),
          func.datediff(func.col('a.date'), func.col('b.date')).between(0, 2)],
         'inner'). \
    drop(func.col('a.cod')). \
    selectExpr('cod', 'a.date as ref_date', 'b.date as date'). \
    show()
# +---+----------+----------+
# |cod| ref_date| date|
# +---+----------+----------+
# | B|2022-08-01|2022-08-01|
# | B|2022-08-02|2022-08-01|
# | B|2022-08-02|2022-08-02|
# | B|2022-08-03|2022-08-01|
# | B|2022-08-03|2022-08-02|
# | B|2022-08-03|2022-08-03|
# | B|2022-08-04|2022-08-02|
# | B|2022-08-04|2022-08-03|
# | B|2022-08-04|2022-08-04|
# | B|2022-08-05|2022-08-03|
# | B|2022-08-05|2022-08-04|
# | B|2022-08-05|2022-08-05|
# | A|2022-08-01|2022-08-01|
# | A|2022-08-02|2022-08-01|
# | A|2022-08-02|2022-08-02|
# | A|2022-08-03|2022-08-01|
# | A|2022-08-03|2022-08-02|
# | A|2022-08-03|2022-08-03|
# | A|2022-08-04|2022-08-02|
# | A|2022-08-04|2022-08-03|
# +---+----------+----------+
# only showing top 20 rows
This will also generate records for the initial 2 dates, which can be discarded.
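For example, a sketch for dropping those initial ref_dates, assuming the joined result above has been assigned to a variable called result_sdf (a hypothetical name):
from pyspark.sql import Window
from pyspark.sql import functions as func

# drop ref_dates that fall within the first 2 days of each cod
w = Window.partitionBy('cod')
result_sdf = (result_sdf
              .withColumn('min_date', func.min('date').over(w))
              .filter(func.datediff('ref_date', 'min_date') >= 2)
              .drop('min_date'))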

Removing unwanted characters in python pandas

I have a pandas dataframe column like the one below:
| ColumnA |
+-------------+
| ABCD(!) |
| <DEFG>(23) |
| (MNPQ. ) |
| 32.JHGF |
| "QWERT" |
The aim is to remove the special characters and produce the output below:
| ColumnA |
+------------+
| ABCD |
| DEFG |
| MNPQ |
| JHGF |
| QWERT |
I tried using the replace method like below, but without success:
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z\d\_]+", "", regex=True)
print(df)
So, how can I replace the special characters using the replace method in pandas?
Your pattern also keeps digits (\d) and the underscore (_); remove them from the character class so only letters remain:
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z]+", "")
print (df)
ColumnA
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT
The regex should be r'[^a-zA-Z]+'; it keeps only the characters A-Z and a-z:
import pandas as pd
# | ColumnA |
# +-------------+
# | ABCD(!) |
# | <DEFG>(23) |
# | (MNPQ. ) |
# | 32.JHGF |
# | "QWERT" |
# create a dataframe from a list
df = pd.DataFrame(['ABCD(!)', 'DEFG(23)', '(MNPQ. )', '32.JHGF', 'QWERT'], columns=['ColumnA'])
# | ColumnA |
# +------------+
# | ABCD |
# | DEFG |
# | MNPQ |
# | JHGF |
# | QWERT |
# keep only the characters that are from A to Z, a-z
df['ColumnB'] = df['ColumnA'].str.replace(r'[^a-zA-Z]+', '')
print(df['ColumnB'])
Result:
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT
Your suggested code works on my installation, except that it leaves the extra digits, so you need to update your regex to r"[^a-zA-Z]+". If that doesn't work, maybe try updating your pandas:
import pandas as pd
d = {'Column A': [' ABCD(!)', '<DEFG>(23)', '(MNPQ. )', ' 32.JHGF', '"QWERT"']}
df = pd.DataFrame(d)
df['Column A'] = df['Column A'].str.replace(r"[^a-zA-Z]+", "", regex=True)
print(df)
Output
Column A
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT
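One general caveat for the snippets above: since pandas 2.0, Series.str.replace treats the pattern as a literal string by default, so passing regex=True explicitly ensures the character class is interpreted as a regular expression:
# explicit regex=True so the pattern is treated as a regex, not a literal string
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z]+", "", regex=True)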

Trim additional whitespace between the names in PySpark

How to trim the additional spaces present between the names in PySpark dataframe?
Below is my dataframe
+----------------------+----------+
|name                  |account_id|
+----------------------+----------+
|abc   xyz   pqr       |1         |
|pqm    rst            |2         |
+----------------------+----------+
Output I want
+-------------+----------+
|name |account_id|
+-------------+----------+
| abc xyz pqr | 1 |
| pqm rst | 2 |
+-------------+----------+
I tried using regexp_replace, but it removed the spaces completely. Is there any other way to implement this? Thanks a lot!
I used regexp_replace(col, r'\s+', ' ') to collapse runs of whitespace into a single space, and got the output below.
from pyspark.sql.functions import col, regexp_replace

df = df.withColumn("name", regexp_replace(col("name"), r'\s+', ' '))
Output
+-----------+----------+
| name |account_id|
+-----------+----------+
|abc xyz pqr| 1 |
| pqm rst| 2 |
+-----------+----------+
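If leading and trailing spaces should also be removed, one option is to wrap the result in trim. A small sketch, assuming the same dataframe and imports as above:
from pyspark.sql.functions import col, regexp_replace, trim

# collapse inner runs of whitespace, then strip leading/trailing spaces
df = df.withColumn("name", trim(regexp_replace(col("name"), r'\s+', ' ')))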

Get max length in column for each column in a dataframe [duplicate]

This question already has answers here:
In spark iterate through each column and find the max length
(3 answers)
Closed 3 years ago.
I have a Spark dataframe like that
+-----------------+---------------+----------+-----------+
| column1 | column2 | column3 | column4 |
+-----------------+---------------+----------+-----------+
| a | bbbbb | cc | >dddddddd |
| >aaaaaaaaaaaaaa | bb | c | dddd |
| aa | >bbbbbbbbbbbb | >ccccccc | ddddd |
| aaaaa | bbbb | ccc | d |
+-----------------+---------------+----------+-----------+
I would like to find a length of the longest element in each column to obtain something like that
+---------+-----------+
| column | maxLength |
+---------+-----------+
| column1 | 14 |
| column2 | 12 |
| column3 | 7 |
| column4 | 8 |
+---------+-----------+
I know how to do it column by column, but I don't know how to tell Spark to do it for all columns.
I am using Scala Spark.
You can use the agg function with max and length to achieve it:
val x = df.columns.map(colName => {
  (colName, df.agg(max(length(col(colName)))).head().getAs[Integer](0))
}).toSeq.toDF("column", "maxLength")
Output:
+-------+---------+
|column |maxLength|
+-------+---------+
|column1|14 |
|column2|13 |
|column3|8 |
|column4|9 |
+-------+---------+
Another way is:
df.select(df.columns.map(c => max(length(col(c))).as(s"max_${c}")): _*)
Output:
+-----------+-----------+-----------+-----------+
|max_column1|max_column2|max_column3|max_column4|
+-----------+-----------+-----------+-----------+
|14 |13 |8 |9 |
+-----------+-----------+-----------+-----------+
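For readers coming from the Python questions above, a rough PySpark equivalent (a sketch; it assumes a dataframe df whose columns are all strings):
from pyspark.sql import functions as F

# one max(length(...)) aggregate per column, computed in a single pass
row = df.select([F.max(F.length(F.col(c))).alias(c) for c in df.columns]).head()
for c in df.columns:
    print(c, row[c])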

Getting a column as a concatenated column from a reference table and primary ids from a Dataset

I'm trying to get concatenated data as a single column using the datasets below.
Sample DS:
val df = sc.parallelize(Seq(
  ("a", 1, 2, 3),
  ("b", 4, 6, 5)
)).toDF("value", "id1", "id2", "id3")
+-------+-----+-----+-----+
| value | id1 | id2 | id3 |
+-------+-----+-----+-----+
| a | 1 | 2 | 3 |
| b | 4 | 6 | 5 |
+-------+-----+-----+-----+
from the Reference Dataset
+----+----------+--------+
| id | descr | parent|
+----+----------+--------+
| 1 | apple | fruit |
| 2 | banana | fruit |
| 3 | cat | animal |
| 4 | dog | animal |
| 5 | elephant | animal |
| 6 | Flight | object |
+----+----------+--------+
val ref = sc.parallelize(Seq(
  (1, "apple", "fruit"),
  (2, "banana", "fruit"),
  (3, "cat", "animal"),
  (4, "dog", "animal"),
  (5, "elephant", "animal"),
  (6, "Flight", "object")
)).toDF("id", "descr", "parent")
I am trying to get the desired output below:
+-----------------------+--------------------------+
| desc | parent |
+-----------------------+--------------------------+
| apple+banana+cat/M | fruit+fruit+animal/M |
| dog+Flight+elephant/M | animal+object+animal/M |
+-----------------------+--------------------------+
I also need to concatenate only if id2 and id3 are not null; otherwise use only id1.
I've been breaking my head over a solution.
Exploding the first dataframe df, joining it to ref, and then grouping by value should work as you expect:
val dfNew = df.withColumn("id", explode(array("id1", "id2", "id3")))
  .select("id", "value")

ref.join(dfNew, Seq("id"))
  .groupBy("value")
  .agg(
    concat_ws("+", collect_list("descr")) as "desc",
    concat_ws("+", collect_list("parent")) as "parent"
  )
  .drop("value")
  .show()
Output:
+-------------------+--------------------+
|desc |parent |
+-------------------+--------------------+
|Flight+elephant+dog|object+animal+animal|
|apple+cat+banana |fruit+animal+fruit |
+-------------------+--------------------+
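For comparison with the Python questions above, a rough PySpark sketch of the same explode + join + groupBy idea, which also skips null id2/id3 as the question asks (variable names are assumptions, not the answer's code):
from pyspark.sql import functions as F

ids = (df.withColumn("id", F.explode(F.array("id1", "id2", "id3")))
         .where(F.col("id").isNotNull())   # drop null ids, so rows with missing id2/id3 use only the ids they have
         .select("id", "value"))

(ref.join(ids, "id")
    .groupBy("value")
    .agg(F.concat_ws("+", F.collect_list("descr")).alias("desc"),
         F.concat_ws("+", F.collect_list("parent")).alias("parent"))
    .drop("value")
    .show())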
