Pandas discard items in set using a different set - python-3.x

I have two columns in a pandas dataframe; parents and cte. Both columns are made up of sets. I want to use the cte column to discard overlapping items in the parents column. The dataframe is made up of over 6K rows. Some of the cte rows have empty sets.
Below is a sample:
import pandas as pd
data = {'parents': [{'loan_agreement', 'select', 'opportunity', 'sales.dim_date', 'sales.flat_opportunity_detail', 'dets', 'dets2', 'channel_partner'},
                    {'seed', 'dw_salesforce.sf_dw_partner_application'}],
        'cte': [{'dets', 'dets2'}, {'seed'}]}
df = pd.DataFrame(data)
df
I've used .discard(cte) previously but I can't figure out how to get it to work.
I would like the output to look like the following:
data = {'parents': [{'loan_agreement', 'select', 'opportunity', 'sales.dim_date', 'sales.flat_opportunity_detail', 'channel_partner'}
,{'dw_salesforce.sf_dw_partner_application'}],
'cte': [{'dets', 'dets2'}, {'seed'}]}
df = pd.DataFrame(data)
df
NOTE: dets, dets2 and seed have been removed from the corresponding parents cell.
Once the cte is compared to the parents, I don't need data from that row again. The next row will only compare data on that row and so on.

You need to use a loop here.
A list comprehension will likely be the fastest:
df['parents'] = [P.difference(C) for P,C in zip(df['parents'], df['cte'])]
output:
                                             parents            cte
0  {channel_partner, select, opportunity, loan_ag...  {dets, dets2}
1          {dw_salesforce.sf_dw_partner_application}         {seed}
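One likely reason .discard(cte) appeared to do nothing: set.discard expects a single element, and when it is handed a set it looks for that whole set as one member rather than removing each item. Set difference works per row instead. Below is a minimal, self-contained sketch of the list comprehension approach; the third row (with an empty cte set) is a made-up addition to show that empty sets leave parents unchanged.

```python
import pandas as pd

data = {
    "parents": [
        {"loan_agreement", "select", "opportunity", "sales.dim_date",
         "sales.flat_opportunity_detail", "dets", "dets2", "channel_partner"},
        {"seed", "dw_salesforce.sf_dw_partner_application"},
        {"only_parent"},  # hypothetical row, not in the original sample
    ],
    "cte": [{"dets", "dets2"}, {"seed"}, set()],  # last cte is empty
}
df = pd.DataFrame(data)

# Row by row, keep the items of parents that are not in cte.
# `p - c` is the operator form of p.difference(c) and returns a new set.
df["parents"] = [p - c for p, c in zip(df["parents"], df["cte"])]
```

Because each row is handled independently, nothing from one row's cte affects another row's parents.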

Related

Pandas combining rows as header info

This is how I am reading and creating the dataframe with pandas
def get_sheet_data(sheet_name='SomeName'):
    df = pd.read_excel(f'{full_q_name}',
                       sheet_name=sheet_name,
                       header=[0,1],
                       index_col=0)#.fillna(method='ffill')
    df = df.swapaxes(axis1="index", axis2="columns")
    return df.set_index('Product Code')
Printing this in tabular form gives me the following (it will potentially have hundreds of columns):
I can't seem to add those first two rows into the header. I've tried:
python:pandas - How to combine first two rows of pandas dataframe to dataframe header?
https://stackoverflow.com/questions/59837241/combine-first-row-and-header-with-pandas
and I'm failing at each point. I think it's because of the MultiIndex, not necessarily the axis swap? But using https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html is kind of going over my head right now. Please help me add those two rows into the header.
The output of df.columns is massive, so I've cut it down a lot:
Index(['Product Code','Product Narrative\nHigh-level service description','Product Name','Huawei Product ID','Type','Bill Cycle Alignment',nan,'Stackable',nan,
and ends with:
nan], dtype='object')
We create new column names and assign them to df.columns. Each new name is generated by joining three parts: the two MultiIndex header levels and the first row of the DataFrame.
df.columns = ['_'.join(i) for i in zip(df.columns.get_level_values(0).tolist(), df.columns.get_level_values(1).tolist(), df.iloc[0,:].replace(np.nan,'').tolist())]
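As a rough illustration of that joining step, here is a small self-contained sketch. The frame, its column labels, and the cell values are invented for the example. It also converts every part to a string and skips empty parts, which is an assumption about what is wanted when a header cell is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two header levels, plus a first data row that
# really belongs in the header.
columns = pd.MultiIndex.from_tuples([("A", "x"), ("A", "y"), ("B", "z")])
df = pd.DataFrame([["p", np.nan, "q"], [1, 2, 3]], columns=columns)

# First row as strings, with missing cells turned into empty strings.
first_row = ["" if pd.isna(v) else str(v) for v in df.iloc[0]]

# Join level 0, level 1 and the first-row cell, dropping empty parts.
df.columns = [
    "_".join(part for part in parts if part)
    for parts in zip(
        df.columns.get_level_values(0).astype(str),
        df.columns.get_level_values(1).astype(str),
        first_row,
    )
]

# The first row is now part of the header, so drop it from the data.
df = df.iloc[1:].reset_index(drop=True)
```

Skipping empty parts avoids trailing underscores in names where the first-row cell was missing.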

Get only rows of dataframe where a subset of columns exist in another dataframe

I want to get all rows of a dataframe (df2) where the city column value and postcode column value also exist in another dataframe (df1).
Importantly, I want to match on the combination of both columns, not on each column individually.
My approach was this:
#1. Get all combinations
df_combinations=np.array(df1.select("Ort","Postleitzahl").dropDuplicates().collect())
sc.broadcast(df_combinations)
#2.Define udf
def combination_in_vx(ort,plz):
    for arr_el in df_combinations:
        if str(arr_el[0]) == ort and int(arr_el[1]) == plz:
            return True
    return False
combination_in_vx = udf(combination_in_vx, BooleanType())
#3.
df_tmp=df2.withColumn("Combination_Exists", combination_in_vx('city','postcode'))
df_result=df_tmp.filter(df_tmp.Combination_Exists)
Although this should theoretically work, it takes forever!
Does anybody know about a better solution here? Thank you very much!
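You can do a left semi join on the two columns. This keeps the rows of df2 whose values in both columns also occur together in a row of df1. Note that passing a list of names requires both DataFrames to use the same column names (here Ort and Postleitzahl); if df2 calls them city and postcode, rename them first or join on an explicit condition such as (df2.city == df1.Ort) & (df2.postcode == df1.Postleitzahl):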
import pyspark.sql.functions as F
df_result = df2.join(df1, ["Ort", "Postleitzahl"], 'left_semi')

How to compare multiple columns in two tables and find out the duplicates?

I have two dataframes:
Dataframe 1
Dataframe 2
The ID column is not unique in the two tables. I want to compare all the columns in both tables except the IDs and print the unique rows.
Expected output
I tried the 'isin' function, but it is not working. Each dataframe has 150000 rows and I removed duplicates in both tables. Please advise how to do that.
You can use pd.concat to combine the dataframes (DataFrame.append was removed in pandas 2.0), then use df.duplicated, which will flag the duplicates.
df3 = pd.concat([df1, df2], ignore_index=True)
df4 = df3.duplicated(subset=['Team', 'name', 'Country', 'Token'], keep=False)
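To get from the flags to the unique rows the question asks for, the boolean result can be used as a mask. The sketch below uses small invented frames with the column names from the answer. It combines them with pd.concat, since DataFrame.append is no longer available in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the answer.
df1 = pd.DataFrame({
    "ID": [1, 2],
    "Team": ["A", "B"],
    "name": ["x", "y"],
    "Country": ["US", "UK"],
    "Token": [10, 20],
})
df2 = pd.DataFrame({
    "ID": [7, 8],
    "Team": ["A", "C"],
    "name": ["x", "z"],
    "Country": ["US", "DE"],
    "Token": [10, 30],
})

compare_cols = ["Team", "name", "Country", "Token"]  # everything except ID

combined = pd.concat([df1, df2], ignore_index=True)

# keep=False marks every member of a duplicated group, not just the repeats.
is_dup = combined.duplicated(subset=compare_cols, keep=False)

# Rows whose compared columns appear exactly once across both frames.
unique_rows = combined[~is_dup]
```

This relies on each frame having been de-duplicated beforehand, as the question states; otherwise a row repeated within a single frame would also be flagged.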

Get n rows based on column filter in a Dataframe pandas

I have a dataframe df as below.
I want the final dataframe to look like the following, i.e. for each unique Name only the last 2 rows must be present in the final output.
I tried the following snippet but it's not working.
df = df[df['Name']].tail(2)
Use GroupBy.tail:
df1 = df.groupby('Name').tail(2)
Just one more way to solve this using GroupBy.nth:
df1 = df.groupby('Name').nth([-1,-2]) ## this will pick the last 2 rows
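A small runnable sketch of the GroupBy.tail approach, using an invented frame (the Value column and its contents are assumptions, since the original sample data is not shown):

```python
import pandas as pd

# Hypothetical data: "a" has three rows, "b" two, "c" one.
df = pd.DataFrame({
    "Name": ["a", "a", "a", "b", "b", "c"],
    "Value": [1, 2, 3, 4, 5, 6],
})

# Keep at most the last 2 rows of each Name group.
# The original row order and index labels are preserved.
last_two = df.groupby("Name").tail(2)
```

Groups with fewer than two rows, like "c" here, are kept as they are.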

PySpark: Update column values for a given number of rows of a DataFrame

I have a DataFrame with 10 rows and 2 columns: an ID column with random identifier values and a VAL column filled with None.
vals = [
Row(ID=1,VAL=None),
Row(ID=2,VAL=None),
Row(ID=3,VAL=None),
Row(ID=4,VAL=None),
Row(ID=5,VAL=None),
Row(ID=6,VAL=None),
Row(ID=7,VAL=None),
Row(ID=8,VAL=None),
Row(ID=9,VAL=None),
Row(ID=10,VAL=None)
]
df = spark.createDataFrame(vals)
Now let's say I want to update the VAL column for 3 rows with the value "lets", 3 rows with the value "bucket" and 4 rows with the value "this".
Is there a straightforward way of doing this in PySpark?
Note: ID values are not necessarily consecutive, and the bucket distribution is not necessarily even.
I'll try to explain the idea with some pseudo-code, and you can map it to your solution.
Using a window function over a single partition, we can generate a sequential row_number() for each row in the dataframe and store it in, let's say, a column row_num.
Next, your "rules" can be represented as another small dataframe: [min_row_num, max_row_num, label].
All you need is to join those two datasets on row number, adding the new column:
from pyspark.sql.functions import col
result = (
    df1.alias('df1')
    .join(df2.alias('df2'),
          on=col('df1.row_num').between(col('df2.min_row_num'), col('df2.max_row_num')))
    .select('df1.*', 'df2.label')
)
