I want to write a function that updates the column names of a df based on the name of the df.
I have a number of dfs with identical columns. Eventually I need to merge these dfs into one df. To identify where the data originally came from once merged, I first want to update the column names in each separate df by appending an identifier to each column name.
I have tried using a dictionary (dict) within the function to update the columns, but have been unable to get this to work.
I have attempted the following function:
def update_col(input):
    dict = {'df1': 'A',
            'df2': 'B'}
    input.rename(columns={'Col1': 'Col1-' + dict[input],
                          'Col2': 'Col2-' + dict[input]},
                 inplace=True)
My test dfs are:
df1:
Col1 Col2
foo bah
foo bah
df2:
Col1 Col2
foo bah
foo bah
Running the function as follows, I wish to get:
update_col(df1)
df1:
Col1-A Col2-A
foo bah
foo bah
I think a better way would be:
mydict = {'df1': 'A',
          'df2': 'B'}
d = {'df' + str(e + 1): i for e, i in enumerate([df1, df2])}  # create a dict of dfs
final_d = {k: v.add_suffix('-' + v1) for k, v in d.items()
           for k1, v1 in mydict.items() if k == k1}
print(final_d)
{'df1': Col1-A Col2-A
0 foo bah
1 foo bah, 'df2': Col1-B Col2-B
0 foo bah
1 foo bah}
you can then access the dfs as final_d['df1'] etc.
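If the end goal is just one merged df that records the source of each row, a hedged alternative sketch (assuming the same df1/df2 as above) is pd.concat with the keys argument, which tags rows instead of renaming columns:

```python
import pandas as pd

df1 = pd.DataFrame({'Col1': ['foo', 'foo'], 'Col2': ['bah', 'bah']})
df2 = pd.DataFrame({'Col1': ['foo', 'foo'], 'Col2': ['bah', 'bah']})

# keys= labels each block of rows with the name of its source df
merged = pd.concat([df1, df2], keys=['A', 'B'], names=['source', None])
merged = merged.reset_index(level='source')
```

This keeps a single set of column names, which can make the later merge simpler than working with dfs whose columns no longer match.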
Note: please don't use dict as a variable name, as it shadows the built-in Python type.
I have a sample dataframe as given below.
import numpy as np
import pandas as pd
data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance', 'Music'], ['Dance', 'Sports'], ['Hiking', 'Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each of the columns is of list dtype. I want to remove those lists and preserve the datatypes present within the lists for all columns.
The final output should look something like the one shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (or strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Pass the variable-length list into the join() string method
df['Interest'] = df['Interest'].apply(', '.join)
Option 2. explode Age and Sex, then apply ', '.join to Interest (exploding a list of columns in one call requires pandas 1.3+):
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(', '.join)
Both options return:
df
ID Age Sex Interest
0 A 20 Male Dance, Music
1 B 21 Male Dance, Sports
2 C 19 Female Hiking, Surfing
EDIT
Option 3. If you have many columns which contain lists with possible missing values as np.nan, you can get the list-column names and then loop over them as follows:
# Get columns which contain at least one python list
list_cols = [c for c in df
if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column
for c in list_cols:
# If all lists in column c contain a single item:
if (df[c].str.len() == 1).all():
df[c] = df[c].str[0]
else:
df[c] = df[c].apply(', '.join)
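One caveat: ', '.join raises a TypeError on np.nan entries, so the join branch fails on rows like D. A NaN-safe variant of the loop body (a sketch over the same sample data) is:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'D'],
    'Age': [[20], [21], [19], [24]],
    'Sex': [['Male'], ['Male'], ['Female'], np.nan],
    'Interest': [['Dance', 'Music'], ['Dance', 'Sports'],
                 ['Hiking', 'Surfing'], np.nan],
})

for c in ['Age', 'Sex', 'Interest']:
    # Unwrap single-item lists, join longer ones, and leave NaN untouched
    df[c] = df[c].apply(
        lambda x: x if not isinstance(x, list)
        else x[0] if len(x) == 1
        else ', '.join(map(str, x))
    )
```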
I need to find column names if they contain one of these words: COMPLETE, UPDATED or PARTIAL.
This is my code; it is not working.
import pandas as pd
df = pd.DataFrame({'col1': ['', 'COMPLETE', ''],
                   'col2': ['UPDATED', '', ''],
                   'col3': ['', 'PARTIAL', '']})
print(df)
items=["COMPLETE", "UPDATED", "PARTIAL"]
if x in items:
print (df.columns)
this is the desired output:
I tried to get inspired by this question Get column name where value is something in pandas dataframe but I couldn't wrap my head around it
We can use where with isin, then stack:
s = df.where(df.isin(items)).stack().reset_index(level=0, drop=True).sort_index()
s
col1 COMPLETE
col2 UPDATED
col3 PARTIAL
dtype: object
Here's one way to do it.
# check each column for any matches from the items list.
matched = df.isin(items).any(axis=0)
# produce a list of column labels with a match.
matches = list(df.columns[matched])
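If you also want to know which word matched in each column (a small extension sketch, assuming at most one match per column as in the sample), a dict comprehension works:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['', 'COMPLETE', ''],
                   'col2': ['UPDATED', '', ''],
                   'col3': ['', 'PARTIAL', '']})
items = ['COMPLETE', 'UPDATED', 'PARTIAL']

# For each column, keep the first cell that is one of the target words
hits = {c: df.loc[df[c].isin(items), c].iloc[0]
        for c in df.columns if df[c].isin(items).any()}
```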
I am sure I must be missing something basic here. As far as I know, you can create a dataframe from a dict with pd.DataFrame.from_dict(), but I am not sure how to make the key-value pairs of a dict become rows in a dataframe.
For instance, given this example
d = {'a':1,'b':2}
the desired output would be:
col1 col2
0 a 1
1 b 2
I know that the index might be a problem, but that can be handled with a simple index=[0].
Duplicate of Convert Python dict into a dataframe.
Simple answer for Python 3:
import pandas as pd
d = {'a': 1, 'b': 2, 'c': 3}
df = pd.DataFrame(list(d.items()), columns=['col1', 'col2'])
This code should help you.
d = {k: [v] for k, v in d.items()}
pd.DataFrame(d).T.reset_index().rename(columns={'index': 'col1', 0: 'col2'})
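An equivalent sketch that avoids the transpose: build a Series, name its index, and reset it.

```python
import pandas as pd

d = {'a': 1, 'b': 2}
# The dict keys become the index, which reset_index then turns into a column
df = pd.Series(d).rename_axis('col1').reset_index(name='col2')
```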
I have written the below code that accepts a pandas series (dataframe column) of strings and a dictionary of terms to replace in the strings.
def phrase_replace(repl_dict, str_series):
for k,v in repl_dict.items():
str_series = str_series.str.replace(k,v)
return str_series
It works correctly, but it seems like I should be able to use some kind of list comprehension instead of the for loop.
I don't want to use str_series = [] or {} because I don't want a list or a dictionary returned, but a pandas.core.series.Series
Likewise, if I want to use the function on every column in a dataframe:
for column in df.columns:
df[column] = phrase_replace(repl_dict, df[column])
There must be a list comprehension method to do this?
It is possible, but then you need concat to rebuild the DataFrame, because the comprehension returns a list of Series:
df = pd.concat([phrase_replace(repl_dict, df[column]) for column in df.columns], axis=1)
But maybe you just need replace with a dictionary (pass regex=True so the keys are matched as substrings, like str.replace does):
df = df.replace(repl_dict, regex=True)
df = pd.DataFrame({'words':['apple','banana','orange']})
repl_dict = {'an':'foo', 'pp':'zz'}
df.replace({'words':repl_dict}, inplace=True, regex=True)
df
Out[263]:
words
0 azzle
1 bfoofooa
2 orfooge
If you want to apply to all columns:
df2 = pd.DataFrame({'key1':['apple', 'banana', 'orange'], 'key2':['banana', 'apple', 'pineapple']})
df2
Out[13]:
key1 key2
0 apple banana
1 banana apple
2 orange pineapple
df2.replace(repl_dict,inplace=True, regex=True)
df2
Out[15]:
key1 key2
0 azzle bfoofooa
1 bfoofooa azzle
2 orfooge pineazzle
The whole point of pandas is to avoid explicit for loops; it is optimized around the built-in methods for DataFrames and Series.
Hi all!
I have a dataframe. One column contains strings like this: 'Product1, Product2, foo, bar'.
I've split them by ',' and now I have a column containing lists of product names.
How can I get a set of unique product names?
First flatten the list of lists, then apply set and finally convert back to a list:
df = pd.DataFrame(data = {'a':['Product1,Product1,foo,bar','Product1,foo,foo,bar']})
print (df)
a
0 Product1,Product1,foo,bar
1 Product1,foo,foo,bar
a = list(set([item for sublist in df['a'].str.split(',').values.tolist()
              for item in sublist]))
print (a)
['bar', 'foo', 'Product1']
If you want unique values per row:
df = df['a'].str.split(',').apply(lambda x: list(set(x)))
print (df)
0 [bar, foo, Product1]
1 [bar, foo, Product1]
Name: a, dtype: object
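A shorter sketch for the same flatten-and-deduplicate task, using explode (available since pandas 0.25):

```python
import pandas as pd

df = pd.DataFrame(data={'a': ['Product1,Product1,foo,bar', 'Product1,foo,foo,bar']})
# split -> one exploded row per product -> unique values across all rows
unique_products = df['a'].str.split(',').explode().unique().tolist()
```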