I want to check whether each name in checklst exists as a column in a df and, if not, create an empty column with that name. I'm not sure what the best approach is for this.
checklst= ["a","b","c","d","e","f"]
for i in checklst:
if not checklst' in df.columns:
df = df.withColumn("checklst", F.lit(None))
checklst= ["a","b","c","d","e","f"]
for x in checklst:
if x not in df.columns:
df = df.withColumn(x, F.lit(None))
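If later operations need these columns to have a concrete type (a bare F.lit(None) column is NullType, which some writers and operations reject), you can cast the literal; a small sketch, assuming string is the type you want:

for x in checklst:
    if x not in df.columns:
        # cast so the new column is StringType rather than NullType
        # ("string" is an assumption; use whatever type your schema expects)
        df = df.withColumn(x, F.lit(None).cast("string"))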
Related
I have a client data df with 200+ columns, say A, B, C, D, ..., X, Y, Z. One column in this df is CAMPAIGN_ID. I have another dataframe, mapping_csv, that has CAMPAIGN_ID and the set of columns I need from df. I need to split df into one CSV file per campaign, containing that campaign's rows and only the columns listed for it in mapping_csv.
I am getting a TypeError, as below.
TypeError: unhashable type: 'list'
This is what I tried.
import datetime as dt

for campaign in df['CAMPAIGN_ID'].unique():
    df2 = df[df['CAMPAIGN_ID'] == campaign]
    # remove blank columns
    df2.dropna(how='all', axis=1, inplace=True)
    for column in df2.columns:
        if df2[column].unique()[0] == "0000-00-00" and df2[column].unique().shape[0] == 1:
            df2 = df2.drop(column, axis=1)
    for column in df2.columns:
        if df2[column].unique()[0] == '0' and df2[column].unique().shape[0] == 1:
            df2 = df2.drop(column, axis=1)
    # select required columns
    df2 = df2[mapping_csv.loc[mapping_csv['CAMPAIGN_ID'] == campaign, 'Variable_List'].str.replace(" ", "").str.split(",")]
    file_shape = df2.shape[0]
    filename = "cart_" + str(dt.date.today().strftime('%Y%m%d')) + "_" + campaign + "_rowcnt_" + str(file_shape)
    df2.to_csv(filename + ".csv", index=False)
Any help will be appreciated.
(Screenshots showing the sample data and the mapping are omitted here.)
This addresses your core problem.
import pandas as pd

df = pd.DataFrame(dict(id=['foo', 'foo', 'bar', 'bar'], a=[1, 2, 3, 4], b=[5, 6, 7, 8], c=[1, 2, 3, 4]))
mapper = dict(foo=['a', 'b'], bar=['b', 'c'])

for each_id in df.id.unique():
    df_id = df.query(f'id.str.contains("{each_id}")').loc[:, mapper[each_id]]
    print(df_id)
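As for the TypeError in the original code: mapping_csv.loc[...] returns a one-row Series whose single element is a list, so df2 ends up being indexed with a Series of lists, which pandas cannot hash. Extracting the list first fixes it; a sketch, assuming mapping_csv has the CAMPAIGN_ID and Variable_List columns described above:

# inside the loop over campaigns
cols = (mapping_csv.loc[mapping_csv['CAMPAIGN_ID'] == campaign, 'Variable_List']
        .str.replace(" ", "")
        .str.split(",")
        .iloc[0])  # .iloc[0] pulls the list out of the one-row Series
df2 = df2[cols]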
I have two dfs: df1 and df2.
from functools import reduce

dfs = [df1, df2]
df_final = reduce(lambda left, right: pd.merge(left, right, on='Serial_Nbr'), dfs)
I want to select only one column apart from the merge column Serial_Nbr in df1 while doing the merge.
How do I do this?
Filter the columns in df1, keeping the merge key plus the one column you want (col_A below is a placeholder for your column name):
dfs = [df1[['Serial_Nbr', 'col_A']], df2]
Or, if there are only 2 DataFrames, remove reduce:
df_final = pd.merge(df1[['Serial_Nbr', 'col_A']], df2, on='Serial_Nbr')
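A minimal runnable sketch (col_A, col_B, and val are hypothetical column names):

import pandas as pd

df1 = pd.DataFrame({'Serial_Nbr': [1, 2], 'col_A': ['x', 'y'], 'col_B': [0, 0]})
df2 = pd.DataFrame({'Serial_Nbr': [1, 2], 'val': [10, 20]})

# col_A is kept from df1; col_B is dropped by the column selection
df_final = pd.merge(df1[['Serial_Nbr', 'col_A']], df2, on='Serial_Nbr')
print(df_final)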
I have a sample dataframe as given below.
import numpy as np
import pandas as pd

data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance', 'Music'], ['Dance', 'Sports'], ['Hiking', 'Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each column holds list values. I want to unwrap those lists while preserving the datatypes of the values inside them, for all columns.
The final output should look like what is shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (or strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Pass the variable-length list into the join() string method
df['Interest'] = df['Interest'].apply(', '.join)
Option 2. explode Age and Sex, then apply ', '.join to Interest:
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(', '.join)
Both options return:
df
ID Age Sex Interest
0 A 20 Male Dance, Music
1 B 21 Male Dance, Sports
2 C 19 Female Hiking, Surfing
EDIT
Option 3. If you have many columns containing lists, possibly with missing values (np.nan), you can get the list-column names and then loop over them as follows:
# Get columns which contain at least one python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column
for c in list_cols:
    # If all lists in column c contain a single item:
    if (df[c].str.len() == 1).all():
        df[c] = df[c].str[0]
    else:
        df[c] = df[c].apply(', '.join)
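Note that with np.nan present (as in the Sex and Interest columns above), ', '.join raises a TypeError on the missing rows, and the str.len() check also misroutes columns that mix lists and NaN. A hedged variant that decides per value rather than per column, unwrapping single-item lists and joining longer ones:

for c in list_cols:
    # unwrap 1-item lists, join longer ones, leave NaN and scalars untouched
    df[c] = df[c].apply(
        lambda x: x[0] if isinstance(x, list) and len(x) == 1
        else ', '.join(x) if isinstance(x, list)
        else x)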
col_exclusions = ['numerator', 'Numerator', 'Denominator', 'denominator']
DataFrame:
id prim_numerator sec_Numerator tern_Numerator tern_Denominator final_denominator Result
1 12 23 45 54 56 Fail
The desired final output contains only id and Result.
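For a runnable setup, the sample frame can be reconstructed from the table above (a sketch):

import pandas as pd

df = pd.DataFrame({'id': [1], 'prim_numerator': [12], 'sec_Numerator': [23],
                   'tern_Numerator': [45], 'tern_Denominator': [54],
                   'final_denominator': [56], 'Result': ['Fail']})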
Using regex:
import re
pat = re.compile('|'.join(col_exclusions),flags=re.IGNORECASE)
final_cols = [c for c in df.columns if not re.search(pat,c)]
#out:
['id', 'Result']
print(df[final_cols])
id Result
0 1 Fail
If you want to drop the matching columns instead:
df = df.drop([c for c in df.columns if re.search(pat, c)], axis=1)
Or the pure pandas approach, thanks to @Anky_91:
df.loc[:,~df.columns.str.contains('|'.join(col_exclusions),case=False)]
You can be explicit and use del for columns that contain the suffixes in your input list:
for column in df.columns:
    if any(column.endswith(suffix) for suffix in col_exclusions):
        del df[column]
You can also use the following approach, where each column name is split on underscores and the last piece is matched against col_exclusions:
df.drop(columns=[i for i in df.columns if i.split("_")[-1] in col_exclusions], inplace=True)
print(df.head())
I have 2 DataFrames, df0 and df1, with df1.shape[0] > df0.shape[0].
df0 and df1 have the exact same columns.
Most of the rows of df0 are in df1.
The indices of df0 and df1 are
df0.index = range(df0.shape[0])
df1.index = range(df1.shape[0])
I then created dft
dft = pd.concat([df0, df1], axis=0, sort=False)
and removed duplicated rows with
dft.drop_duplicates(subset='this_col_is_not_index', keep='first', inplace=True)
I now have some duplicates in the index of dft. For example:
dft.loc[3].shape
returns
(2, 38)
My aim is to change the index of the second returned row so that index 3 becomes unique.
This second row should be re-indexed as dft.index.sort_values()[-1] + 1.
I would like to apply this operation to all duplicates.
References :
Python Pandas: Get index of rows which column matches certain value
Pandas: Get duplicated indexes
Redefining the Index in a Pandas DataFrame object
Add the parameter ignore_index=True to concat to avoid duplicated index values:
dft = pd.concat([df0, df1], axis=0, sort=False, ignore_index=True)
Or use reset_index(drop=True); note it returns a new DataFrame, so assign it back:
dft = dft.reset_index(drop=True)
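A minimal sketch tying both fixes together (toy frames standing in for df0 and df1):

import pandas as pd

df0 = pd.DataFrame({'this_col_is_not_index': [1, 2, 3]})
df1 = pd.DataFrame({'this_col_is_not_index': [2, 3, 4, 5]})

# ignore_index=True hands out a fresh RangeIndex, so no duplicate labels
dft = pd.concat([df0, df1], axis=0, sort=False, ignore_index=True)
dft = dft.drop_duplicates(subset='this_col_is_not_index', keep='first')
# re-number after dropping rows so the index is contiguous again
dft = dft.reset_index(drop=True)
print(dft.index)  # RangeIndex(start=0, stop=5, step=1)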