Please forgive me for the title; I had a hard time summarizing a complex question.
I have a pandas dataframe of values that looks like this:
col1 col2 col3 col4
10_Q999999 111_Q4987666 110_Q277778 111_Q999999
Let's say the threshold is 7. I need to take that dataframe and delete each cell where any of the digits after _Q fall below the threshold of 7. For cells where each digit >= 7, I only want to keep the portion of the string before "_Q".
The desired output would look like this:
col1 col2 col3 col4
10             111
I'm trying to figure out some way to split each column by "_Q", convert the last piece to a list of integers, take the minimum and then compare the minimum with the threshold, finally deleting the list of integers, but I'm stuck in the middle of a disgustingly nested list comprehension:
[[[int(z) for z in y[-3:] if (z != '') and "Q" not in z ] for y in chunk[x].astype(str).str.split("_") if y != ''] for x in chunk[cols] if x != '']
s = ~chunk.apply(lambda x:
    x.str.split('_Q').str[1].str.contains('[0-6]', na=False))
chunk = chunk.apply(lambda x: x.str.split('_Q').str[0])[s].fillna('')
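
The plan described in the question (split on "_Q", turn the suffix into digits, take the minimum, compare it with the threshold) can also be written as a plain per-cell function, which keeps the threshold as a parameter instead of baking it into a regex. This is only a sketch: the function name is made up for illustration, and blanking cells that lack a well-formed "_Q<digits>" suffix is an assumption, not something the question specifies.

```python
import pandas as pd


def keep_prefix_if_digits_pass(cell, threshold=7):
    """Return the text before '_Q' if every digit after it is >= threshold, else ''."""
    prefix, sep, digits = str(cell).partition('_Q')
    # Assumption: cells without a well-formed '_Q<digits>' suffix are blanked.
    if not sep or not digits.isdecimal():
        return ''
    return prefix if min(int(d) for d in digits) >= threshold else ''


# Sample row from the question.
df = pd.DataFrame({'col1': ['10_Q999999'], 'col2': ['111_Q4987666'],
                   'col3': ['110_Q277778'], 'col4': ['111_Q999999']})

# Series.map applies the function cell by cell, one column at a time.
result = df.apply(lambda col: col.map(keep_prefix_if_digits_pass))
```

This will likely be slower than the vectorized string methods on a large frame, but it is easy to read and to adapt to another threshold.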

You can use split with contains:
s = ~df.apply(lambda x: x.str.split('_Q').str[1].str.contains('[0-6]'))
df.apply(lambda x: x.str.split('_Q').str[0])[s].fillna('')
  col1 col2 col3 col4
0   10            111

I dislike apply, so I outline an alternative involving stack, str.split, and np.where for (hopefully) better performance.
v = df.stack()
sp = v.str.split('_Q')
i, j = sp.str[0], sp.str[1]
v[:] = np.where(j.str.contains('[0-6]'), '', i)
v.unstack()
  col1 col2 col3 col4
0   10            111


replace values with randomly selected values in a pandas dataframe

Python 3.6, Pandas 1.1.5 on Windows 10
I'm trying to optimize the code below for better performance on a large dataset.
Purpose: randomly select a single value if the data contains several values separated by a space.
For example, from:
   col1   col2   col3
0  a      a b c  a c
1  a b    c      a
2  a b c  b      b
to:
   col1  col2  col3
0  a     b     c
1  a     c     a
2  b     b     b
So far:
import itertools
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a b', 'a b c'],
                   'col2': ['a b c', 'c', 'b'],
                   'col3': ['a c', 'a', 'b']})
# flatten the data into a single list, row by row
vals = list(itertools.chain.from_iterable(df.values))
vals_ = []
# randomly select a single value from each data point
for v in vals:
    v = v.split(' ')
    a = np.random.choice(len(v), 1)[0]
    vals_.append(v[a])
# rebuild a dataframe with the original shape, index and columns
gf = pd.DataFrame(np.array(vals_).reshape(df.shape),
                  index=df.index,
                  columns=df.columns)
This is not fast on a large dataset. Any lead will be appreciated.
Define a function and apply it to the entire Pandas dataframe via applymap.
The function could be implemented as
def rndVal(x: str):
    if len(x) > 1:
        x = x.split(' ')
        a = np.random.choice(len(x), 1)[0]
        return x[a]
    return x
and applied with
df.applymap(rndVal)
Regarding performance: on a dataframe with 300,000 rows, your attempt takes 18.6 s, while the applymap solution takes only 8.4 s.
Pandas fast approach
Stack to reshape, then split and explode the strings, then group by the MultiIndex and draw a sample of size 1 per group, and finally unstack to restore the original shape:
df.stack().str.split().explode().groupby(level=[0, 1]).sample(1).unstack()
col1 col2 col3
0 a a c
1 b c a
2 a b b
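
To try the stack/explode/sample idea end to end, here is a minimal self-contained sketch on the sample frame from the question. The `random_state` argument is added here only so the draw is repeatable (it is not part of the answer above), and the variable name `picked` is made up for illustration. `GroupBy.sample` requires pandas 1.1 or later.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a b', 'a b c'],
                   'col2': ['a b c', 'c', 'b'],
                   'col3': ['a c', 'a', 'b']})

picked = (
    df.stack()                   # Series indexed by (row label, column name)
      .str.split()               # each cell becomes a list of tokens
      .explode()                 # one row per token, index entries repeated
      .groupby(level=[0, 1])     # one group per original cell
      .sample(1, random_state=0) # draw one token per cell
      .unstack()                 # back to the original row/column layout
)
```

Every value in `picked` is one of the space-separated tokens of the corresponding cell in `df`; which one is drawn depends on the seed.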

How to remove square brackets from entire dataframe if not every row and column have square brackets?

I have a df that looks like this (with many more columns):
col1  col2  col3
[1]   4
[2]   5     [6]
How do I remove all square brackets from the df if not every row and column has square brackets and the dataframe is too big to specify column by column?
I can remove the brackets from a single column using this line of code, but the dataframe has too many columns:
df['col1'].apply(lambda x: x.replace('[', '').replace(']', ''))
New df should look like this:
col1  col2  col3
1     4
2     5     6
You can cast your df to str, replace the brackets and then cast back to float:
df.astype(str).replace({r"\[": "", r"\]": ""}, regex=True).astype(float)
You could use applymap to apply your function to each cell, although you would want to be a bit careful about types. For example:
df.applymap(lambda x: x.replace('[','').replace(']','') if isinstance(x, str) else x)
col1 col2 col3
0 1 4.0 None
1 2 5.0 6
2 3 NaN None
In your case check strip
out = df.apply(lambda x : x.str.strip('[|]'))
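
For a frame that mixes strings, floats and missing values, a type check per cell avoids errors from calling string methods on non-string data. The sketch below is one way to do that with `Series.map`; the sample data is modelled on the output shown in the applymap answer, and the helper name `strip_brackets` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: brackets appear only in some cells of some columns.
df = pd.DataFrame({
    'col1': ['[1]', '[2]', '3'],
    'col2': [4.0, 5.0, np.nan],
    'col3': [None, '[6]', None],
})


def strip_brackets(cell):
    """Remove '[' and ']' from string cells; return any other value unchanged."""
    if isinstance(cell, str):
        return cell.replace('[', '').replace(']', '')
    return cell


# Apply the helper element-wise, one column at a time.
cleaned = df.apply(lambda col: col.map(strip_brackets))
```

The cleaned cells are still strings; converting them to numbers (for example with `pd.to_numeric`) would be a separate step.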

Turn rows into columns, copying and keeping the same index in front of all previously combined columns

the inverse of
Transpose columns to rows keeping first 3 columns the same
id col1 col2 col3
1 A B
2 X Y Z
1 A
1 B
2 X
2 Y
2 Z
I'm trying unpivot(), but judging from the solution I cited, do I need to use .unstack()?
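
The question does not say which library it uses, so the following is only a sketch assuming the data is in a pandas DataFrame. It shows both directions between the two tables above: `melt` for wide to long, and `pivot` for long back to wide. The column names `column` and `value` are made up for illustration, since the long table in the question has no header.

```python
import pandas as pd

# Wide table from the question; the empty cell is represented as None.
wide = pd.DataFrame({
    'id': [1, 2],
    'col1': ['A', 'X'],
    'col2': ['B', 'Y'],
    'col3': [None, 'Z'],
})

# Wide -> long: one row per (id, value) pair, dropping empty cells.
long = (
    wide.melt(id_vars='id', var_name='column', value_name='value')
        .dropna(subset=['value'])
        .sort_values(['id', 'column'])
        .reset_index(drop=True)
)

# Long -> wide: spread the remembered column names back out per id.
back = long.pivot(index='id', columns='column', values='value').reset_index()
back.columns.name = None  # drop the leftover 'column' axis label
```

Going from long to wide needs something that says which target column each value belongs to; here that is the `column` field kept by `melt`.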

Python pandas ranking dataframe columns

I am trying to use the rank function on two columns in my dataframe.
One of the columns contains blank values, which prevents me from doing a groupby before ranking.
ERROR: ValueError: Length mismatch: Expected axis has 1122 elements, new values have 1814 elements
df_source['col1'] = df_source['col1'].apply(lambda \
df_source['Rank'] = df_source.groupby(by=['col0','col1']) \
['col1'].transform(lambda x: x.rank(na_option='bottom'))
col0 col1
98630 a
90211 a
31111 a
23323 c
col0 col1 Rank
98630 a 1
a 2
90211 a 1
31111 a 1
b 1
23323 c 1
This code gives the expected result. I have tried to avoid groupby function for columns with null values.
df['col0'] = df['col0'].replace('', np.nan)
df_int = df.loc[df['col0'].notnull(), 'col1'].unique()
df = df[~(df['col0'].isin(df_int) & df['col1'].isnull())]

Transpose Excel Row Data into columns based on Unique Identifier

I have an Excel table in the format below.
Sr. No. Column 1 (X) Column 2(Y) Column 3(Z)
1 X Y Z
2 Y Z
3 Y
4 X Y
5 X
I want to transpose it into the following format in MS Excel.
Sr. No. Value
1 X
1 Y
1 Z
2 Y
2 Z
3 Y
4 X
4 Y
5 X
The actual data contains more than 30 columns, which need to be transposed into 2 columns.
Please guide me.
Select the complete table data and name it SourceData using Formulas > Name Manager.
Now use the following formula to get the first column:
And for the second column:
Copy and paste special values and then delete blanks / zeroes.
You will get result as required.
If you were using other databases, there might be a formal unpivot operator/function available. But in MySQL, this is not a possibility. However, one approach which should work here would be to just take a union of the three columns:
SELECT 1 AS sr_no, col1 AS value WHERE col1 IS NOT NULL
ORDER BY sr_no;
