Pandas: delete cells if last six characters contain any integer below threshold - python-3.x

Please forgive me for the title; I had a hard time summarizing a complex question.
I have a pandas dataframe of values that looks like this:
col1        col2          col3         col4
10_Q999999  111_Q4987666  110_Q277778  111_Q999999
Let's say the threshold is 7. I need to take that dataframe and delete each cell where any of the digits after _Q fall below the threshold of 7. For cells where each digit >= 7, I only want to keep the portion of the string before "_Q".
The desired output would look like this:
col1  col2  col3  col4
10                111
I'm trying to figure out some way to split each column by "_Q", convert the last piece to a list of integers, take the minimum and then compare the minimum with the threshold, finally deleting the list of integers, but I'm stuck in the middle of a disgustingly nested list comprehension:
[[[int(z) for z in y[-3:] if (z != '') and "Q" not in z ] for y in chunk[x].astype(str).str.split("_") if y != ''] for x in chunk[cols] if x != '']
Solution:
s = ~chunk.apply(lambda x:
    x.str.split('_Q').str[1].str.contains('[0-6]', na=False))
chunk = chunk.apply(lambda x: x.str.split('_Q').str[0])[s].fillna('')

You can use split with contains. Note that the pattern must cover every digit below the threshold, including 0, so use [0-6] rather than '1|2|3|4|5|6':
s = ~df.apply(lambda x: x.str.split('_Q').str[1].str.contains('[0-6]'))
df.apply(lambda x : x.str.split('_Q').str[0])[s].fillna('')
Out[549]:
  col1 col2 col3 col4
0   10            111

I dislike apply, so I outline an alternative involving stack, str.split, and np.where for (hopefully) better performance.
v = df.stack()
sp = v.str.split('_Q')
i, j = sp.str[0], sp.str[1]
v[:] = np.where(j.str.contains('[0-6]'), '', i)
v.unstack()
  col1 col2 col3 col4
0   10            111
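Put together, the stack-based approach runs end to end like this (a minimal sketch using the sample frame from the question; the [0-6] pattern encodes the threshold of 7, i.e. it matches any digit below 7):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['10_Q999999'], 'col2': ['111_Q4987666'],
                   'col3': ['110_Q277778'], 'col4': ['111_Q999999']})

v = df.stack()                       # long Series indexed by (row, column)
sp = v.str.split('_Q')
prefix, digits = sp.str[0], sp.str[1]
# blank out any cell whose suffix contains a digit below the threshold of 7
v[:] = np.where(digits.str.contains('[0-6]'), '', prefix)
out = v.unstack()
```

Only col1 and col4 survive here, since their suffixes are all 9s; the other two cells become empty strings.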

Related

replace values with randomly selected values in a pandas dataframe

Python 3.6, pandas 1.1.5 on Windows 10.
Trying to optimize the code below for better performance on a large dataset.
Purpose: randomly select a single value if the data contains several values separated by a space.
For example, from:
  col1   col2   col3
0 a      a b c  a c
1 a b    c      a
2 a b c  b      b
to:
  col1 col2 col3
0 a    b    c
1 a    c    a
2 b    b    b
So far:
import itertools
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a b', 'a b c'],
                   'col2': ['a b c', 'c', 'b'],
                   'col3': ['a c', 'a', 'b']})
# make the data into a flat list
vals = list(itertools.chain.from_iterable(df.values))
vals_ = []
# randomly select a single value from each data point
for v in vals:
    v = v.split(' ')
    a = np.random.choice(len(v), 1)[0]
    vals_.append(v[a])
gf = pd.DataFrame(np.array(vals_).reshape(df.shape),
                  index=df.index,
                  columns=df.columns)
This is not fast on a large dataset. Any lead will be appreciated.
Defining a function and applying it to the entire dataframe via applymap is faster. The function could be implemented as
def rndVal(x: str):
    if len(x) > 1:
        x = x.split(' ')
        a = np.random.choice(len(x), 1)[0]
        return x[a]
    else:
        return x
and is applicable with
df.applymap(rndVal)
returning a dataframe of the same shape with a single value per cell.
Regarding performance: running your attempt on a dataframe with 300,000 rows takes 18.6 s, while this solution takes only 8.4 s.
Pandas fast approach
Stack to reshape, then split and explode the strings, then group by the MultiIndex and draw a sample of size 1 per group, then unstack back to the original shape:
(
df.stack().str.split().explode()
.groupby(level=[0, 1]).sample(1).unstack()
)
  col1 col2 col3
0 a    a    c
1 b    c    a
2 a    b    b
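Whichever variant you use, it is easy to sanity-check: every drawn value must be one of the tokens in the corresponding original cell. A quick check of the stack/explode/sample pipeline (seeded so the draw is repeatable; sample on a groupby requires pandas >= 1.1):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a b', 'a b c'],
                   'col2': ['a b c', 'c', 'b'],
                   'col3': ['a c', 'a', 'b']})

out = (df.stack().str.split().explode()
         .groupby(level=[0, 1]).sample(1, random_state=0)
         .unstack())

# each result cell must come from the matching input cell
for i in df.index:
    for c in df.columns:
        assert out.loc[i, c] in df.loc[i, c].split()
```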

How to remove square brackets from entire dataframe if not every row and column have square brackets?

I have df that looks like this (with many more columns):
col1  col2  col3
[1]   4
[2]   5     [6]
[3]
How do I remove all square brackets from the df when not every row and column has brackets and the dataframe has too many columns to specify them one by one?
I can remove the brackets using these lines of code, but the dataframe has too many columns:
df['col1'].str.get(0)
df['col1'].apply(lambda x: x.replace('[', '').replace(']', ''))
New df should look like this:
col1  col2  col3
1     4
2     5     6
3
You can cast your df to str, replace the brackets with a regex, and then cast back to float (the final cast only succeeds if every cell is numeric after the replacement):
df.astype(str).replace({r"\[": "", r"\]": ""}, regex=True).astype(float)
You could use applymap to apply your function to each cell, although you would want to be a bit careful about types. For example:
df.applymap(lambda x: x.replace('[','').replace(']','') if isinstance(x, str) else x)
Produces:
col1 col2 col3
0 1 4.0 None
1 2 5.0 6
2 3 NaN None
In your case, check strip:
out = df.apply(lambda x : x.str.strip('[|]'))
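A minimal sketch of the strip approach, assuming every column holds strings (for the mixed-type frame shown above you would need the isinstance guard from the applymap answer). str.strip('[]') removes the bracket characters from both ends of each value:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['[1]', '[2]', '[3]'],
                   'col2': ['4', '5', ''],
                   'col3': ['', '[6]', '']})

out = df.apply(lambda x: x.str.strip('[]'))
print(out['col1'].tolist())  # ['1', '2', '3']
```

Values without brackets, such as '4' and '5', pass through unchanged.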

turn columns into rows, keeping the same id in front of each value

the inverse of
Transpose columns to rows keeping first 3 columns the same
turn:
id  col1  col2  col3
1   A     B
2   X     Y     Z
into:
id  value
1   A
1   B
2   X
2   Y
2   Z
I'm trying unpivot(), but from the solution I cited it seems I need to use .unstack()?
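One way to get there is melt rather than unstack: melt the value columns against the id, drop the empty cells, and sort by id. A minimal sketch on the sample data (column names assumed from the tables above):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'col1': ['A', 'X'],
                   'col2': ['B', 'Y'],
                   'col3': [None, 'Z']})

out = (df.melt(id_vars='id', value_name='value')
         .dropna(subset=['value'])
         .sort_values('id', kind='stable')  # stable sort keeps column order per id
         .loc[:, ['id', 'value']]
         .reset_index(drop=True))
print(out['value'].tolist())  # ['A', 'B', 'X', 'Y', 'Z']
```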

Python pandas ranking dataframe columns

I am trying to use the rank function on two columns in my dataframe.
Problem:
One of the columns contains blank values, which prevents the groupby before ranking from working.
ERROR: ValueError: Length mismatch: Expected axis has 1122 elements, new values have 1814 elements
df_source['col1'] = df_source['col1'].apply(lambda x: x.strip()).replace('', np.nan)
df_source['Rank'] = df_source.groupby(by=['col0', 'col1'])['col1'] \
    .transform(lambda x: x.rank(na_option='bottom'))
Actual:
col0   col1
98630  a
       a
90211  a
31111  a
       b
23323  c
Expected:
col0   col1  Rank
98630  a     1
       a     2
90211  a     1
31111  a     1
       b     1
23323  c     1
This code gives the expected result. I tried to avoid the groupby function on columns with null values:
df['col0'] = df['col0'].replace('', np.nan)
df_int = df.loc[df['col0'].notnull(), 'col1'].unique()
df = df[~(df['col0'].isin(df_int) & df['col1'].isnull())]
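The length-mismatch error likely comes from the groupby dropping rows whose key is NaN, so the transform returns fewer values than the frame has rows. On pandas >= 1.1 you can instead keep those rows with dropna=False. A sketch with made-up data mirroring the Actual table (how blank groups should ultimately be ranked still depends on the intended semantics):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col0': ['98630', '', '90211', '31111', '', '23323'],
                   'col1': ['a', 'a', 'a', 'a', 'b', 'c']})
df['col0'] = df['col0'].replace('', np.nan)

# dropna=False keeps NaN keys as their own groups instead of dropping the rows
df['Rank'] = (df.groupby(['col0', 'col1'], dropna=False)['col1']
                .transform(lambda x: x.rank(na_option='bottom')))
assert df['Rank'].notna().all()  # every row got a rank, no length mismatch
```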

Transpose Excel Row Data into columns based on Unique Identifier

I have an Excel table in the below format.
Sr. No.  Column 1 (X)  Column 2(Y)  Column 3(Z)
1        X             Y            Z
2                      Y            Z
3                      Y
4        X             Y
5        X
I want to transpose it into the following format in MS Excel.
Sr. No.  Value
1        X
1        Y
1        Z
2        Y
2        Z
3        Y
4        X
4        Y
5        X
The actual data contains more than 30 columns, which need to be transposed into 2 columns.
Please guide me.
Select the complete table data and name it SourceData using
Formula > Name Manager.
Now implement following formula for getting first column:
=INDEX(SourceData,CEILING(ROWS($A$1:A1)/(COLUMNS(SourceData)-1),1),1)
And for second column:
=INDEX(SourceData,CEILING(ROWS($A$1:A1)/(COLUMNS(SourceData)-1),1),MOD(ROWS($A$1:A1)-1,COLUMNS(SourceData)-1)+2)
Copy and paste special values and then delete blanks / zeroes.
You will get result as required.
If you were using another database, there might be a formal unpivot operator/function available, but in MySQL this is not a possibility. However, one approach which should work here is to take a union of the three columns (assuming the sheet has been loaded into a table, called t here, with columns sr_no, col1, col2, col3):
SELECT sr_no, col1 AS value FROM t WHERE col1 IS NOT NULL
UNION ALL
SELECT sr_no, col2 FROM t WHERE col2 IS NOT NULL
UNION ALL
SELECT sr_no, col3 FROM t WHERE col3 IS NOT NULL
ORDER BY sr_no;
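If the sheet can be loaded into pandas instead, set_index plus stack produces the same two-column layout, since stack drops empty cells by default (the column names below are assumed from the table above):

```python
import pandas as pd

df = pd.DataFrame({'Sr. No.': [1, 2, 3, 4, 5],
                   'Column 1 (X)': ['X', None, None, 'X', 'X'],
                   'Column 2(Y)': ['Y', 'Y', 'Y', 'Y', None],
                   'Column 3(Z)': ['Z', 'Z', None, None, None]})

out = (df.set_index('Sr. No.')
         .stack()                          # drops the NaN cells
         .reset_index(level=1, drop=True)  # discard the column-name level
         .rename('Value')
         .reset_index())
print(out['Value'].tolist())  # ['X', 'Y', 'Z', 'Y', 'Z', 'Y', 'X', 'Y', 'X']
```

This scales to 30+ columns unchanged, since every non-index column is stacked automatically.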
