How to remove square brackets from an entire dataframe if not every row and column has square brackets? - python-3.x

I have a df that looks like this (with many more columns):
col1  col2  col3
[1]   4
[2]   5     [6]
[3]
How do I remove all square brackets from the df if not every row and column has square brackets and the dataframe is too big to specify column by column?
I can remove the brackets using these lines of code, but the dataframe has too many columns:
df['col1'].str.get(0)
df['col1'].apply(lambda x: x.replace('[', '').replace(']', ''))
The new df should look like this:
col1  col2  col3
1     4
2     5     6
3

You can cast your df to str, replace the brackets, and then cast back to float (raw strings avoid the invalid-escape warning):
df.astype(str).replace({r"\[": "", r"\]": ""}, regex=True).astype(float)
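A minimal runnable sketch of that approach, assuming a hypothetical reconstruction of the question's frame with the empty cells as NaN:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['[1]', '[2]', '[3]'],
    'col2': [4, 5, np.nan],
    'col3': [np.nan, '[6]', np.nan],
})

# Every cell becomes a string, the brackets are stripped in one pass,
# and the cast back to float turns the empty cells into NaN again.
out = df.astype(str).replace({r"\[": "", r"\]": ""}, regex=True).astype(float)
print(out)
#    col1  col2  col3
# 0   1.0   4.0   NaN
# 1   2.0   5.0   6.0
# 2   3.0   NaN   NaN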

You could use applymap to apply your function to each cell, although you would want to be a bit careful about types. For example:
df.applymap(lambda x: x.replace('[', '').replace(']', '') if isinstance(x, str) else x)
Produces:
  col1 col2  col3
0    1  4.0  None
1    2  5.0     6
2    3  NaN  None

In your case, check strip (note that it takes a set of characters, not a regex):
out = df.apply(lambda x: x.str.strip('[]'))

Related

Turn rows into columns, copying and keeping the same index in front of all previously-together columns

This is the inverse of Transpose columns to rows keeping first 3 columns the same.
Turn:
id  col1  col2  col3
1   A     B
2   X     Y     Z
into:
id
1   A
1   B
2   X
2   Y
2   Z
I'm trying unpivot(), but judging from the solution I cited, I need to use .unstack()?
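A sketch of one way to do it with melt (assuming the empty cells are NaN; the names df and out are illustrative, and an unstack-based solution would work too):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'col1': ['A', 'X'],
    'col2': ['B', 'Y'],
    'col3': [None, 'Z'],
})

# Melt the value columns onto rows, drop the empty cells,
# and keep the id next to each value.
out = (df.melt(id_vars='id', value_name='val')
         .dropna(subset=['val'])
         .sort_values('id')[['id', 'val']])
print(out)
#    id val
# 0   1   A
# 2   1   B
# 1   2   X
# 3   2   Y
# 5   2   Z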

Python pandas ranking dataframe columns

I am trying to use the rank function on two columns in my dataframe.
Problem:
One of the columns contains blank values, which prevents me from doing a groupby before ranking.
ERROR: ValueError: Length mismatch: Expected axis has 1122 elements, new values have 1814 elements
df_source['col1'] = df_source['col1'].apply(lambda x: x.strip()).replace('', np.nan)
df_source['Rank'] = df_source.groupby(by=['col0', 'col1'])['col1'] \
    .transform(lambda x: x.rank(na_option='bottom'))
Actual:
col0   col1
98630  a
       a
90211  a
31111  a
       b
23323  c
Expected:
col0   col1  Rank
98630  a     1
       a     2
90211  a     1
31111  a     1
       b     1
23323  c     1
This code gives the expected result. I have tried to avoid the groupby function for columns with null values.
df['col0'] = df['col0'].replace('', np.nan)
df_int = df.loc[df['col0'].notnull(), 'col1'].unique()
df = df[~(df['col0'].isin(df_int) & df['col1'].isnull())]
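For comparison, a sketch that keeps the blank rows in the groupby instead of filtering them out. It assumes, as the expected output suggests, that a blank col0 belongs to the id above it; that forward-fill assumption may not hold for your data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col0': ['98630', '', '90211', '31111', '', '23323'],
    'col1': ['a', 'a', 'a', 'a', 'b', 'c'],
})

# Forward-fill the blank ids so groupby does not drop those rows,
# then number the occurrences inside each (id, col1) group.
key = df['col0'].replace('', np.nan).ffill()  # assumption: blank rows continue the previous id
df['Rank'] = df.groupby([key, 'col1']).cumcount() + 1
print(df)
#     col0 col1  Rank
# 0  98630    a     1
# 1           a     2
# 2  90211    a     1
# 3  31111    a     1
# 4           b     1
# 5  23323    c     1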

How to remove duplicate rows with the same values in a different order in a dataframe with pandas

How do I remove the duplicates in the df? The df has only one column. In this case, "60,25" and "25,60" are a pair of duplicated rows, and the output should be the new df. For each pair of duplicates, keep the row in the format "A,B" where A < B and remove the one where A > B. In this case, "25,60" and "80,123" should be kept. Unique rows should stay as they are.
IIUC, using get_dummies with duplicated:
df[~df.A.str.get_dummies(sep=',').duplicated()]
Out[956]:
       A
0    A,C
1    A,B
4  X,Y,Z
Data input:
df
Out[957]:
       A
0    A,C
1    A,B
2    C,A
3    B,A
4  X,Y,Z
5  Z,Y,X
Update: the OP changed the question to a totally different one:
newdf = df.A.str.get_dummies(sep=',')  # one indicator column per distinct value
newdf[~newdf.duplicated()].dot(newdf.columns + ',').str[:-1]  # rebuild strings from the unique rows
Out[976]:
0     25,60
1    123,37
dtype: object
I'd do a combination of things:
Use pandas.Series.str.split to split by commas.
Use apply(frozenset) to get a hashable set, such that I can use duplicated.
Use pandas.Series.duplicated with keep='last'.
df[~df.A.str.split(',').apply(frozenset).duplicated(keep='last')]
        A
1  123,17
3  80,123
4   25,60
5   25,42
Addressing comments
df.A.apply(
    lambda x: tuple(sorted(map(int, x.split(','))))
).drop_duplicates().apply(
    lambda x: ','.join(map(str, x))
)
0     25,60
1    17,123
2    80,123
5     25,42
Name: A, dtype: object
Setup
df = pd.DataFrame(dict(
    A='60,25 123,17 123,80 80,123 25,60 25,42'.split()
))

Pandas: delete cells if last six characters contain any integer below threshold

Please forgive me for the title; I had a hard time summarizing a complex question.
I have a pandas dataframe of values that looks like this:
col1        col2          col3         col4
10_Q999999  111_Q4987666  110_Q277778  111_Q999999
Let's say the threshold is 7. I need to take that dataframe and delete each cell where any of the digits after "_Q" fall below the threshold of 7. For cells where every digit is >= 7, I only want to keep the portion of the string before "_Q".
The desired output would look like this:
col1  col2  col3  col4
10                111
I'm trying to figure out some way to split each column by "_Q", convert the last piece to a list of integers, take the minimum and then compare the minimum with the threshold, finally deleting the list of integers, but I'm stuck in the middle of a disgustingly nested list comprehension:
[[[int(z) for z in y[-3:] if (z != '') and "Q" not in z ] for y in chunk[x].astype(str).str.split("_") if y != ''] for x in chunk[cols] if x != '']
Solution (note the character class is [0-6]; the original [0:6] would only match '0', ':', and '6'):
s = ~chunk.apply(lambda x: x.str.split('_Q').str[1].str.contains('[0-6]', na=False))
chunk = chunk.apply(lambda x: x.str.split('_Q').str[0])[s].fillna('')
You can use split with contains ([0-6] covers every digit below the threshold of 7, including 0):
s = ~df.apply(lambda x: x.str.split('_Q').str[1].str.contains('[0-6]'))
df.apply(lambda x: x.str.split('_Q').str[0])[s].fillna('')
Out[549]:
  col1 col2 col3 col4
0   10            111
I dislike apply, so I outline an alternative involving stack, str.split, and np.where for (hopefully) better performance.
v = df.stack()                                    # flatten the frame into one Series of cells
sp = v.str.split('_Q')
i, j = sp.str[0], sp.str[1]                       # the prefix, and the digits after _Q
v[:] = np.where(j.str.contains('[0-6]'), '', i)   # blank out cells with any digit below 7
v.unstack()
  col1 col2 col3 col4
0   10            111
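A runnable version of that answer, with the frame reconstructed from the question (an assumption; if your real data contains NaN, add na=False to contains):
import numpy as np
import pandas as pd

df = pd.DataFrame([{
    'col1': '10_Q999999',
    'col2': '111_Q4987666',
    'col3': '110_Q277778',
    'col4': '111_Q999999',
}])

v = df.stack()
sp = v.str.split('_Q')
i, j = sp.str[0], sp.str[1]
v[:] = np.where(j.str.contains('[0-6]'), '', i)
print(v.unstack())
#   col1 col2 col3 col4
# 0   10            111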

min() function in pandas column

I have a dataframe like the following (df1):
  col1 val
0    A  AX
1    A   2
2    A  11
3    A  13
4    A  BX
5    A  20
I want to pick the row with the minimum value, hence I wrote the following:
df2 = df1.groupby(['col1'])['val'].min()
The output I get from this is:
col1
A    11
Name: val, dtype: object
It seems the values AX and BX cause the column to be read as object, so min compares strings and finds '11' as the minimum. How do I modify it so that it compares numerically and outputs:
A    2
Thanks in advance.
You need to convert the column to numeric first, because min works fine on strings too, returning the value whose characters have the lowest ASCII (lexicographic) order:
df2 = pd.to_numeric(df1['val'], errors='coerce').groupby(df1['col1']).min().astype(int)
print (df2)
col1
A 2
Name: val, dtype: int32
More information about min on strings is here.
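A quick plain-Python illustration of the lexicographic ordering that produced '11':
>>> sorted(['AX', '2', '11', '13', 'BX', '20'])
['11', '13', '2', '20', 'AX', 'BX']
>>> min(['AX', '2', '11', '13', 'BX', '20'])
'11'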
