I have a dataframe with a column whose values look as below:
colA desired
[0] 0
[3, 1] 3,1
[3, 1, 2] 3,1,2
[3, 1] 3,1
The type for colA is object.
Is there a way to convert it to the desired comma-separated string?
Thanks
Without a lambda (which will be slower for larger data sets), you can simply cast the list to a string and strip the unwanted characters:
import pandas as pd
df = pd.DataFrame(data={'colA':[[0], [3,1],[3,1,2],[3,1]]})
df['desired'] = df.colA.astype(str).str.replace(r"\[|\]|'", '', regex=True)
df
Output:
colA desired
0 [0] 0
1 [3, 1] 3, 1
2 [3, 1, 2] 3, 1, 2
3 [3, 1] 3, 1
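Whether the string cast actually beats apply depends on your data and pandas version; here is a minimal timeit sketch to measure both on your own machine (the 25_000 repeat factor is an arbitrary choice to get a larger frame):
import timeit
import pandas as pd

big = pd.DataFrame({'colA': [[0], [3, 1], [3, 1, 2], [3, 1]] * 25_000})

# vectorized string cast + regex strip (this answer)
t_replace = timeit.timeit(
    lambda: big.colA.astype(str).str.replace(r"\[|\]|'", '', regex=True),
    number=5)
# per-row join via apply (the answer below)
t_apply = timeit.timeit(
    lambda: big.colA.apply(lambda x: ','.join(map(str, x))),
    number=5)
print(f'astype+replace: {t_replace:.3f}s  apply+join: {t_apply:.3f}s')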
Try:
df = pd.DataFrame(data={'colA':[[0], [3,1],[3,1,2],[3,1], [4]]})
df['desired'] = df['colA'].apply(lambda x: ','.join(map(str, x)))
OUTPUT:
colA desired
0 [0] 0
1 [3, 1] 3,1
2 [3, 1, 2] 3,1,2
3 [3, 1] 3,1
4 [4] 4
If colA is an object column holding string representations of lists:
from ast import literal_eval
df = pd.DataFrame(data={'colA':["[0]", "[3,1]","[3,1,2]","[3,1]", "[4]"]})
df['desired'] = df['colA'].apply(lambda x: ','.join(map(str, literal_eval(x))))
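If the strings are as regular as these, you can also skip parsing entirely and just strip the surrounding brackets; a sketch, only safe when every cell is a single flat bracketed list:
# remove the leading '[' and trailing ']' from each string
df['desired'] = df['colA'].str.strip('[]')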
You can use str.replace:
df['desired'] = df['colA'].str.replace(r'[][]', '', regex=True)
Prints:
colA desired
0 [0] 0
1 [3, 1] 3, 1
2 [3, 1, 2] 3, 1, 2
3 [3, 1] 3, 1
In the character class [][], the first ] (placed immediately after the opening [) is treated as a literal, so the pattern matches both square brackets.
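Note this assumes colA holds strings; if the cells are actual Python lists, the .str methods return NaN, and a plain list comprehension produces exactly the desired output (a minimal sketch):
# join each row's list items with commas (no spaces), matching 'desired'
df['desired'] = [','.join(map(str, cells)) for cells in df['colA']]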
I have a Pandas dataframe like this:
df =
A B
[1, 5, 8, 10] [str1, str_to_be_removed*, str_to_be_removed*, str2]
[4, 7, 10] [str2, str5, str10]
[5, 2, 7, 9, 15] [str6, str2, str_to_be_removed*, str_to_be_removed*, str_to_be_removed*]
... ...
Given str_to_be_removed, I would like to keep only the first instance of the strings which contain str_to_be_removed in column B and remove the other ones. In addition, I would also like to remove the corresponding ids from A. The lists in A and B have the same length in each row.
How to do this?
Desired Output:
df =
A B
[1, 5, 10] [str1, str_to_be_removed*, str2]
[4, 7, 10] [str2, str5, str10]
[5, 2, 7] [str6, str2, str_to_be_removed*]
EDIT:
So, this is the sample df:
df = pd.DataFrame({'A': [[1, 2], [2, 3], [3, 4, 5], [10, 11, 12]],
                   'B': [['str_to_be_removed_bla', 'str_to_be_removed_bla'],
                         ['str_to_be_removed_bla', 'b'],
                         ['b', 'c', 'd'],
                         ['str_to_be_removed_bla_bla', 'str_to_be_removed_bla', 'f']]})
Desired Output:
df =
A B
[1] [str_to_be_removed_bla]
[2,3] [str_to_be_removed_bla, b]
[3,4,5] [b, c, d]
[10, 12] [str_to_be_removed_bla_bla,f]
Steps:
use zip to combine related elements together.
unnest the zipped list with explode.
drop duplicates, keep the first.
groupby index, agg as list.
df = pd.DataFrame({'A':[
[1,2],[2,3],[3,4,5],[10, 11, 12]
],
'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],
['str_to_be_removed_bla','b'],
['b', 'c', 'd'],
['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']
]})
# zip and explode step
# df_obj = df.apply(lambda x: list(zip(x.A, x.B)),
# axis = 1
# ).to_frame()
# df_obj = df_obj.explode(0).reset_index()
# df_obj['A'] = df_obj[0].str[0]
# df_obj['B'] = df_obj[0].str[1]
Update: with @Joe Ferndz's solution, the steps simplify.
# explode the columns
dfTemp = df.apply(lambda x: x.explode())
df_obj = dfTemp.reset_index()
# more conditions can be added to the filter before drop_duplicates
cond = df_obj['B'].astype(str).str.contains("str_to_be_removed")
df1 = df_obj[cond].drop_duplicates(['index'], keep='first')
df2 = df_obj[~cond]
df3 = pd.concat([df1, df2]).sort_index()
result = df3.groupby('index').agg({'A':list, 'B':list})
result
A B
index
0 [1] [str_to_be_removed_bla]
1 [2, 3] [str_to_be_removed_bla, b]
2 [3, 4, 5] [b, c, d]
3 [10, 12] [str_to_be_removed_bla_bla, f]
To extend this solution to multiple patterns (str_to_be_removed1, str_to_be_removed2) in one step:
df = pd.DataFrame({'A':[
[1,2],[2,3],[3,4,5, 6],[10, 11, 12]
],
'B':[['str_to_be_removed1_bla','str_to_be_removed1_bla'],
['str_to_be_removed1_bla','b'],
['b', 'b', 'c', 'd'],
['str_to_be_removed2_bla_bla','str_to_be_removed2_bla','f']
]})
print(df)
# A B
# 0 [1, 2] [str_to_be_removed1_bla, str_to_be_removed1_bla]
# 1 [2, 3] [str_to_be_removed1_bla, b]
# 2 [3, 4, 5, 6] [b, b, c, d]
# 3 [10, 11, 12] [str_to_be_removed2_bla_bla, str_to_be_removed2_bla, f]
df_obj = df.apply(lambda x: x.explode()).reset_index()
df_obj['tag'] = df_obj['B'].str.extract(r'(str_to_be_removed1|str_to_be_removed2)')
print(df_obj)
# index A B tag
# 0 0 1 str_to_be_removed1_bla str_to_be_removed1
# 1 0 2 str_to_be_removed1_bla str_to_be_removed1
# 2 1 2 str_to_be_removed1_bla str_to_be_removed1
# 3 1 3 b NaN
# 4 2 3 b NaN
# 5 2 4 b NaN
# 6 2 5 c NaN
# 7 2 6 d NaN
# 8 3 10 str_to_be_removed2_bla_bla str_to_be_removed2
# 9 3 11 str_to_be_removed2_bla str_to_be_removed2
# 10 3 12 f NaN
cond = df_obj['tag'].notnull()
df1 = df_obj[cond].drop_duplicates(['index', 'tag'], keep='first')
df2 = df_obj[~cond]
df3 = pd.concat([df1, df2]).sort_index()
print(df3.groupby('index').agg({'A':list, 'B':list}))
# A B
# index
# 0 [1] [str_to_be_removed1_bla]
# 1 [2, 3] [str_to_be_removed1_bla, b]
# 2 [3, 4, 5, 6] [b, b, c, d]
# 3 [10, 12] [str_to_be_removed2_bla_bla, f]
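If the pattern list keeps growing, the extraction regex can be built programmatically instead of written by hand (a sketch; patterns is an illustrative name):
import re

patterns = ['str_to_be_removed1', 'str_to_be_removed2']
# re.escape guards against regex metacharacters inside the patterns;
# str.extract needs the capture group to return the matched pattern
regex = '(' + '|'.join(map(re.escape, patterns)) + ')'
df_obj['tag'] = df_obj['B'].str.extract(regex)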
Here's another approach. You give it the search string and it removes the duplicate matches from the lists.
import pandas as pd
pd.set_option('display.max_colwidth', 500)
df = pd.DataFrame({'A':[[1,2],[2,3],[3,4,5],[10,11,12]],
'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],
['str_to_be_removed_bla','b'],
['b','c','d'],
['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']]})
print (df)
search_str = 'str_to_be_removed'
#explode each column into its own rows
dfTemp = df.apply(lambda x: x.explode())
#flag columns that contains search string
dfTemp['Found'] = dfTemp.B.str.contains(search_str)
#cumcount by Index and by search strings to get duplicates
dfTemp['Bx'] = dfTemp.groupby([dfTemp.index,'Found']).cumcount()
#Exclude all records where search string was found and count is more than 1
dfTemp = dfTemp[~(dfTemp['Found'] & (dfTemp['Bx'] > 0))]
#Extract back all the records into the dataframe and store as list
df_final = dfTemp.groupby(dfTemp.index).agg({'A':list, 'B':list})
del dfTemp
print (df_final)
The output of this is:
Original dataframe:
A B
0 [1, 2] [str_to_be_removed_bla, str_to_be_removed_bla]
1 [2, 3] [str_to_be_removed_bla, b]
2 [3, 4, 5] [b, c, d]
3 [10, 11, 12] [str_to_be_removed_bla_bla, str_to_be_removed_bla, f]
search string: 'str_to_be_removed'
Final dataframe:
A B
0 [1] [str_to_be_removed_bla]
1 [2, 3] [str_to_be_removed_bla, b]
2 [3, 4, 5] [b, c, d]
3 [10, 12] [str_to_be_removed_bla_bla, f]
Let us try dict
out = pd.DataFrame(
    [[list(dict(zip(y[::-1], x[::-1])).values())[::-1],
      list(dict(zip(y[::-1], x[::-1])).keys())[::-1]]
     for x, y in zip(df.A, df.B)])
out
0 1
0 [1] [a]
1 [2, 2] [a, b]
Sample dataframe
df = pd.DataFrame({'A':[[1,2],[2,2]],'B':[['a','a'],['a','b']]})
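To see why the double reversal keeps the first occurrence rather than the last, here is a standalone sketch of the dict trick (note it dedupes on exact equality of the B values, not on a substring pattern):
x, y = [1, 2], ['a', 'a']
# zip the reversed lists: [('a', 2), ('a', 1)]
# later dict assignments overwrite earlier ones, so the pair from the
# *first* original position, ('a', 1), is the one that survives
d = dict(zip(y[::-1], x[::-1]))  # {'a': 1}
# dicts preserve insertion order, so reversing restores the original order
print(list(d.values())[::-1], list(d.keys())[::-1])  # [1] ['a']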
I don't think a native pandas solution is possible for this case, and even if there is one, you are unlikely to see a performance gain from it, since the lists are Python objects that pandas cannot vectorize over. A traditional for loop might work better for your case:
def remove_dupes(a, b, dupe_pattern):
    seen_dupe = False
    _a, _b = [], []
    for x, y in zip(a, b):
        if dupe_pattern in y:
            if not seen_dupe:
                seen_dupe = True
                _a.append(x)
                _b.append(y)
        else:
            _a.append(x)
            _b.append(y)
    return _a, _b
_A, _B = [], []
for a, b in zip(df.A, df.B):
    _a, _b = remove_dupes(a, b, 'str_to_be_removed')
    _A.append(_a)
    _B.append(_b)
df['A'], df['B'] = _A, _B
print(df)
# A B
#0 [1] [str_to_be_removed_bla]
#1 [2, 3] [str_to_be_removed_bla, b]
#2 [3, 4, 5] [b, c, d]
#3 [10, 12] [str_to_be_removed_bla_bla, f]
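If you prefer to keep the call inside pandas syntactically, the same loop can be wrapped in DataFrame.apply (a sketch; it is still a Python-level loop, so don't expect a speedup):
# remove_dupes returns an (A, B) tuple per row; result_type='expand'
# spreads the tuple across two columns
out = df.apply(lambda row: remove_dupes(row.A, row.B, 'str_to_be_removed'),
               axis=1, result_type='expand')
df['A'], df['B'] = out[0], out[1]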
I have two Excel files which I read with pandas. I am comparing the index in file 1 with the index in file 2 (they are not the same length, e.g. 10 vs. 100); if they match, row[index] in the second file is set to zeros, and otherwise it is left unchanged. I am using for and if loops for this, but the more data I want to process (1e3, 5e3 rows), the longer the run time becomes. Is there a better way to perform such a comparison? Here's an example of what I am using.
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=[4, 5, 6], columns=['A', 'B', 'C'])
df1 = pd.DataFrame([['w'], ['y'], ['z']],
                   index=[4, 5, 1])
for j in df1.index:
    for i in df.index:
        if i == j:
            df.loc[i, :] = 0
        else:
            df.loc[i, :] = df.loc[i, :]
print(df)
Loops are not necessary here; you can set whole rows to 0 with DataFrame.mask and Series.isin (converting the index to a Series is necessary to avoid ValueError: Array conditional must be same shape as self):
df = df.mask(df.index.to_series().isin(df1.index), 0)
Or with Index.isin and numpy.where if you want to improve performance:
import numpy as np
# [:, None] broadcasts the boolean row mask across all columns
arr = np.where(df.index.isin(df1.index)[:, None], 0, df)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df)
A B C
4 0 0 0
5 0 0 0
6 10 20 30
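For completeness, a minimal in-place variant (assuming mutating df is acceptable):
# boolean mask selects the matching rows; assignment zeroes them in place
df.loc[df.index.isin(df1.index)] = 0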