Python pandas ranking dataframe columns - python-3.x

I am trying to use the rank function on two columns in my dataframe.
Problem:
One of the columns contains blank values, which prevents me from doing a groupby before ranking.
ERROR: ValueError: Length mismatch: Expected axis has 1122 elements, new values have 1814 elements
df_source['col1'] = df_source['col1'].apply(lambda x: x.strip()).replace('', np.nan)
df_source['Rank'] = (df_source.groupby(by=['col0', 'col1'])['col1']
                              .transform(lambda x: x.rank(na_option='bottom')))
**Actual:**
col0    col1
98630   a
        a
90211   a
31111   a
        b
23323   c
**Expected:**
col0    col1    Rank
98630   a       1
        a       2
90211   a       1
31111   a       1
        b       1
23323   c       1

The following code gives the expected result; I have tried to avoid the groupby on columns with null values.
df['col0'] = df['col0'].replace('', np.nan)
df_int = df.loc[df['col0'].notnull(), 'col1'].unique()
df = df[~(df['col0'].isin(df_int) & df['col1'].isnull())]
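A hedged sketch of an alternative: pandas 1.1 added dropna=False to groupby, which keeps rows whose key is blank/NaN instead of dropping them, so the transform result keeps the same length as the frame and the length-mismatch error does not occur. The sample values below are made up to resemble the question's data, and cumcount is used as a simple rank-by-appearance stand-in.
import numpy as np
import pandas as pd

df_source = pd.DataFrame({'col0': ['98630', '', '90211', '31111', '', '23323'],
                          'col1': ['a', 'a', 'a', 'a', 'b', 'c']})
df_source = df_source.replace('', np.nan)   # blanks become NaN

# dropna=False keeps NaN keys as their own group (pandas >= 1.1)
df_source['Rank'] = df_source.groupby(['col0', 'col1'], dropna=False).cumcount() + 1
print(df_source)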

Related

Replace values with randomly selected values in a pandas dataframe

Python 3.6, Pandas 1.1.5 on Windows 10.
Trying to optimize the below for better performance on a large dataset.
Purpose: randomly select a single value when a cell contains several values separated by a space.
For example, from:
    col1   col2 col3
0      a  a b c  a c
1    a b      c    a
2  a b c      b    b
to:
  col1 col2 col3
0    a    b    c
1    a    c    a
2    b    b    b
So far:
import itertools

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a b', 'a b c'],
                   'col2': ['a b c', 'c', 'b'],
                   'col3': ['a c', 'a', 'b']})

# make data into a flat np.array
vals = list(itertools.chain.from_iterable(df.values))

vals_ = []
# randomly select a single value from each data point
for v in vals:
    v = v.split(' ')
    a = np.random.choice(len(v), 1)[0]
    v = v[a]
    vals_.append(v)

gf = pd.DataFrame(np.array(vals_).reshape(df.shape),
                  index=df.index,
                  columns=df.columns)
This is not fast on a large dataset. Any leads would be appreciated.
Define a function and apply it to the entire Pandas dataframe via applymap. The function could be implemented as
def rndVal(x: str):
    if len(x) > 1:
        x = x.split(' ')
        a = np.random.choice(len(x), 1)[0]
        return x[a]
    else:
        return x
and is applied with
df.applymap(rndVal)
returning a dataframe of the same shape with one randomly chosen value per cell.
Regarding performance: running your attempt on a dataframe with 300,000 rows takes 18.6 s, while this applymap solution takes only 8.4 s.
Pandas fast approach
Stack to reshape, then split and explode the strings, then groupby on the multiindex and draw a sample of size 1 per group, then unstack back to reshape:
(
    df.stack().str.split().explode()
      .groupby(level=[0, 1]).sample(1)
      .unstack()
)
  col1 col2 col3
0    a    a    c
1    b    c    a
2    a    b    b
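A small hedged usage note: the GroupBy.sample used above was added in pandas 1.1, and it accepts random_state if reproducible draws are needed.
out = (
    df.stack().str.split().explode()
      .groupby(level=[0, 1]).sample(1, random_state=0)   # seed for reproducibility
      .unstack()
)
print(out)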

Increase the values in a column based on values in another column in pandas

I have my source data in the form of csv file as below:
id,col1,col2
123,11|22|33||||||,val1|val3|val2
456,99||77|||88|||||||||6|,val4|val5|val6|val7
I need to add a new column (fnlsrc) whose values are based on col1 and col2: if col1 has 9 pipe-separated values and col2 has 3 pipe-separated values, then fnlsrc should contain 9 pipe-separated values, i.e. 3 repetitions of col2 (val1|val3|val2|val1|val3|val2|val1|val3|val2). The output below should make the requirement clear:
id,col1,col2,fnlsrc
123,11|22|33||||||,val1|val3|val2,val1|val3|val2|val1|val3|val2|val1|val3|val2
456,99||77|||88|||||||||6|,val4|val5|val6|val7,val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7
I have tried the following code, but it adds only one set:
zipped = zip(df['col1'], df['col2'])
for s, t in zipped:
    count = int((s.count('|') + 1) / (t.count('|') + 1))
    for val in range(count):
        df['fnlsrc'] = t
As the new column is based on the other two, I would use pandas' apply() function. I defined a function that calculates the new column value based on the other two columns, which is then applied to each row:
def new_value(x):
    # Find out the number of values in both columns
    col1_numbers = x['col1'].count('|') + 1
    col2_numbers = x['col2'].count('|') + 1
    # Calculate how many times col2 should appear in the new column
    repetition = int(col1_numbers / col2_numbers)
    # Create a list of strings containing the values of the new column
    values = [x['col2']] * repetition
    # Join the list of strings with pipes
    return '|'.join(values)
# Apply the function on every row
df['fnlsrc'] = df.apply(lambda x:new_value(x), axis=1)
df
Output:
    id                    col1                 col2                                        fnlsrc
0  123          11|22|33||||||       val1|val3|val2  val1|val3|val2|val1|val3|val2|val1|val3|val2
1  456  99||77|||88|||||||||6|  val4|val5|val6|val7  val4|val5|val6|val7|val4|val5|val6|val7|val4|v...
Full output in your input format:
id,col1,col2,fnlsrc
123,11|22|33||||||,val1|val3|val2,val1|val3|val2|val1|val3|val2|val1|val3|val2
456,99||77|||88|||||||||6|,val4|val5|val6|val7,val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7
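For larger frames, a hedged vectorized variant of the same idea avoids apply by computing the repetition counts with str.count and building the strings in a list comprehension (column names as in the question):
# number of pipe-separated values per cell, then repeat col2 accordingly
reps = (df['col1'].str.count(r'\|') + 1) // (df['col2'].str.count(r'\|') + 1)
df['fnlsrc'] = ['|'.join([c] * r) for c, r in zip(df['col2'], reps)]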

How to remove square brackets from an entire dataframe if not every row and column has square brackets?

I have a df that looks like this (with many more columns):
col1  col2  col3
[1]      4
[2]      5   [6]
[3]
How do I remove all square brackets from the df if not every row and column has square brackets, and the dataframe has too many columns to specify them one by one?
I can remove the brackets using these lines of code, but the dataframe has too many columns:
df['col1'].str.get(0)
df['col1'].apply(lambda x: x.replace('[', '').replace(']', ''))
New df should look like this:
col1  col2  col3
1        4
2        5     6
3
You can cast your df to str, replace the brackets, and then cast back to float:
df.astype(str).replace({r"\[": "", r"\]": ""}, regex=True).astype(float)
You could use applymap to apply your function to each cell, although you would want to be a bit careful about types. For example:
df.applymap(lambda x: x.replace('[','').replace(']','') if isinstance(x, str) else x)
Produces:
  col1  col2  col3
0    1   4.0  None
1    2   5.0     6
2    3   NaN  None
In your case, check strip:
out = df.apply(lambda x : x.str.strip('[|]'))
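If numeric values are wanted afterwards, a hedged follow-up sketch (assuming pandas >= 1.0 and that the cells are strings or missing, as in the example) converts the stripped strings with pd.to_numeric, leaving blanks as NaN:
out = df.astype('string').apply(lambda s: s.str.strip('[]'))
out = out.apply(pd.to_numeric, errors='coerce')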

Pandas dataframe deduplicate rows with column logic

I have a pandas dataframe with about 100 million rows. I am interested in deduplicating it but have some criteria that I haven't been able to find documentation for.
I would like to deduplicate the dataframe, ignoring one column that will differ. If that row is a duplicate, except for that column, I would like to only keep the row that has a specific string, say X.
Sample dataframe:
import pandas as pd

df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])
Desired output:
>>> df_dedup
   A  B    C
0  1  2  00X
1  1  3  010
So, alternatively stated, row index 2 would be removed because row index 0 has the same information in columns A and B, and an X in column C.
As this data is fairly large, I hope to avoid iterating over rows if possible. The ignore_index option is the closest thing I've found in the built-in drop_duplicates().
If there is no X in column C, then the rows should require identical C values to be deduplicated.
In the case where rows have matching A and B but multiple versions containing an X in C, the following would be expected.
df = pd.DataFrame(columns=["A","B","C"],
data = [[1,2,"0X0"],
[1,2,"X00"],
[1,2,"0X0"]])
Output should be:
>>> df_dedup
   A  B    C
0  1  2  0X0
1  1  2  X00
Use DataFrame.duplicated on columns A and B to create a boolean mask m1 for the rows whose (A, B) pair is not duplicated. Then use Series.str.contains + Series.duplicated on column C to create a mask m2 for the rows where C contains the string X and is not duplicated. Finally, use these masks to filter the rows of df.
m1 = ~df[['A', 'B']].duplicated()
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()
df = df[m1 | m2]
Result:
#1
   A  B    C
0  1  2  00X
1  1  3  010
#2
   A  B    C
0  1  2  0X0
1  1  2  X00
Does the column "C" always have X as the last character of each value? You could try creating a column D with 1 if column C has an X or 0 if it does not. Then just sort the values using sort_values and finally use drop_duplicates with keep='last'
import pandas as pd

df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])

df['D'] = 0
df.loc[df['C'].str[-1] == 'X', 'D'] = 1
df.sort_values(by=['D'], inplace=True)
df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)
This assumes you also want to drop duplicates when there is no X in column 'C' among the rows that duplicate columns A and B.
Here is another approach. I left 'count' (a helper column) in for transparency.
# use df as defined above
# count the A,B pairs
df['count'] = df.groupby(['A', 'B']).transform('count').squeeze()
m1 = (df['count'] == 1)
m2 = (df['count'] > 1) & df['C'].str.contains('X') # could be .endswith('X')
print(df.loc[m1 | m2]) # apply masks m1, m2
   A  B    C  count
0  1  2  00X      2
1  1  3  010      1
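A hedged variant of the same idea that skips the helper column, using transform('size') on a single column (assumes the original three-row df):
size = df.groupby(['A', 'B'])['C'].transform('size')
keep = (size == 1) | ((size > 1) & df['C'].str.contains('X'))
print(df[keep])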

Filter columns based on a value (Pandas): TypeError: Could not compare ['a'] with block values

I'm trying to filter DataFrame columns based on a value.
In[41]: df = pd.DataFrame({'A':['a',2,3,4,5], 'B':[6,7,8,9,10]})
In[42]: df
Out[42]:
   A   B
0  a   6
1  2   7
2  3   8
3  4   9
4  5  10
Filtering columns:
In[43]: df.loc[:, (df != 6).iloc[0]]
Out[43]:
   A
0  a
1  2
2  3
3  4
4  5
It works! But when I use strings,
In[44]: df.loc[:, (df != 'a').iloc[0]]
I'm getting this error: TypeError: Could not compare ['a'] with block values
You are trying to compare the string 'a' with the numeric values in column B.
If you want your code to work, first promote the dtype of column B to object (np.object is deprecated in recent NumPy, so plain object is used here); then it will work.
df.B = df.B.astype(object)
Always check data types of the columns before performing the operations using
df.info()
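A minimal hedged sketch of the suggestion above on the question's small example:
import pandas as pd

df = pd.DataFrame({'A': ['a', 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})
df['B'] = df['B'].astype(object)       # both columns are now object dtype
print(df.loc[:, (df != 'a').iloc[0]])  # comparison succeeds; keeps only column B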
You could do this with masks instead, for example:
df[df.A!='a'].A
and to filter from any column:
df[df.apply(lambda x: sum([x_=='a' for x_ in x])==0, axis=1)]
The problem is due to the fact that there are numeric and string objects in the dataframe.
You can loop through each column and check it as a Series for a specific value using
(Series == 'a').any()
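A hedged sketch of that loop, keeping only the columns that never contain the value (names taken from the question; the exact behaviour of mixed-type comparisons can vary across pandas versions):
cols_to_keep = [col for col in df.columns if not (df[col] == 'a').any()]
filtered = df[cols_to_keep]   # drops column A, keeps column B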
