Remove a character from a pandas dataframe column - python-3.x

I have a dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['AA_L8_ZZ', 'AA_L08_YY', 'AA_L800_XX', 'AA_L0008_CC']})
df
col1
0 AA_L8_ZZ
1 AA_L08_YY
2 AA_L800_XX
3 AA_L0008_CC
I want to remove all 0's that immediately follow the character 'L' (leading zeros only, so 'L800' stays as it is).
My expected output:
col1
0 AA_L8_ZZ
1 AA_L8_YY
2 AA_L800_XX
3 AA_L8_CC

In [114]: import pandas as pd
...: import numpy as np
...: df = pd.DataFrame({'col1':['AA_L8_ZZ', 'AA_L08_YY', 'AA_L800_XX', 'AA_L0008_CC']})
...: df
Out[114]:
col1
0 AA_L8_ZZ
1 AA_L08_YY
2 AA_L800_XX
3 AA_L0008_CC
In [115]: df.col1.str.replace(r"L0*", "L", regex=True)
Out[115]:
0 AA_L8_ZZ
1 AA_L8_YY
2 AA_L800_XX
3 AA_L8_CC
Name: col1, dtype: object

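Pandas string replace suffices for this. The pattern below matches one or more 0s immediately preceded by L (a lookbehind) and replaces them with an empty string. Pass regex=True explicitly, because since pandas 2.0 the pattern is treated as a literal string by default:
df.col1.str.replace(r"(?<=L)0+", "", regex=True)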
0 AA_L8_ZZ
1 AA_L8_YY
2 AA_L800_XX
3 AA_L8_CC
If you need more speed, you could drop down into plain Python with a list comprehension:
import re
df["cleaned"] = [re.sub(r"(?<=L)0+", "", entry) for entry in df.col1]
df
col1 cleaned
0 AA_L8_ZZ AA_L8_ZZ
1 AA_L08_YY AA_L8_YY
2 AA_L800_XX AA_L800_XX
3 AA_L0008_CC AA_L8_CC
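One edge case the answers above do not cover is a value such as 'AA_L0_ZZ', where stripping every zero after L would leave no number at all. The following is only a sketch under the assumption that such a lone zero should be kept; the extra lookahead `(?=\d)` and the added sample value are illustrative and not part of the original question.

```python
import pandas as pd

# Sample data from the question plus one hypothetical all-zero value.
df = pd.DataFrame(
    {"col1": ["AA_L8_ZZ", "AA_L08_YY", "AA_L800_XX", "AA_L0008_CC", "AA_L0_ZZ"]}
)

# Remove zeros that directly follow "L", but only while another digit
# still follows them, so a lone "0" is left untouched.
df["cleaned"] = df["col1"].str.replace(r"(?<=L)0+(?=\d)", "", regex=True)

print(df)
```

If a lone zero should be removed as well, the simpler pattern `(?<=L)0+` from the answer above is enough.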

Related

Pandas Adding Column Maximum to the Original Dataframe [duplicate]

I have a dataframe with columns A,B. I need to create a column C such that for every record / row:
C = max(A, B).
How should I go about doing this?
You can get the maximum like this:
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
>>> df
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]]
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]].max(axis=1)
0 1
1 8
2 3
and so:
>>> df["C"] = df[["A", "B"]].max(axis=1)
>>> df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If you know that "A" and "B" are the only columns, you could even get away with
>>> df["C"] = df.max(axis=1)
And you could use .apply(max, axis=1) too, I guess.
@DSM's answer is perfectly fine in almost any normal scenario. But if you're the type of programmer who wants to go a little deeper than the surface level, you might be interested to know that it is a little faster to call numpy functions on the underlying array, obtained with .to_numpy() (or .values for pandas < 0.24), instead of directly calling the (cythonized) methods defined on the DataFrame/Series objects.
For example, you can use ndarray.max() along axis 1 (that is, across the columns of each row).
# Data borrowed from @DSM's post.
df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
df
A B
0 1 -2
1 2 8
2 3 1
df['C'] = df[['A', 'B']].values.max(1)
# Or, assuming "A" and "B" are the only columns,
# df['C'] = df.values.max(1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If your data has NaNs, you will need numpy.nanmax:
df['C'] = np.nanmax(df.values, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
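The example above contains no missing values, so it does not show why numpy.nanmax matters. Here is a small sketch with made-up data that does contain NaNs, comparing ndarray.max, numpy.nanmax and DataFrame.max (which skips NaN by default):

```python
import numpy as np
import pandas as pd

# Illustrative data with one NaN in each of the last two rows.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [-2.0, 8.0, np.nan]})
arr = df[["A", "B"]].to_numpy()

plain_max = arr.max(axis=1)               # NaN propagates into the result
nan_max = np.nanmax(arr, axis=1)          # NaN is ignored
pandas_max = df[["A", "B"]].max(axis=1)   # skipna=True by default

print(plain_max)
print(nan_max)
print(pandas_max.to_numpy())
```

Rows that contain a NaN come out as NaN with ndarray.max, while the other two approaches return the remaining number.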
You can also use numpy.maximum.reduce. numpy.maximum is a ufunc (Universal Function), and every ufunc has a reduce:
df['C'] = np.maximum.reduce(df[['A', 'B']].values, axis=1)
# df['C'] = np.maximum.reduce(df[['A', 'B']], axis=1)
# df['C'] = np.maximum.reduce(df, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
np.maximum.reduce and np.max appear to be more or less the same (for most normal sized DataFrames), and happen to be a shade faster than DataFrame.max. I imagine this difference remains roughly constant, and is due to internal overhead (index alignment, handling NaNs, etc).
The graph was generated using perfplot. Benchmarking code, for reference:
import numpy as np
import pandas as pd
import perfplot
np.random.seed(0)
df_ = pd.DataFrame(np.random.randn(5, 1000))
perfplot.show(
    setup=lambda n: pd.concat([df_] * n, ignore_index=True),
    kernels=[
        lambda df: df.assign(new=df.max(axis=1)),
        lambda df: df.assign(new=df.values.max(1)),
        lambda df: df.assign(new=np.nanmax(df.values, axis=1)),
        lambda df: df.assign(new=np.maximum.reduce(df.values, axis=1)),
    ],
    # Labels must be listed in the same order as the kernels above.
    labels=['df.max', 'np.max', 'np.nanmax', 'np.maximum.reduce'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N (* len(df))',
    logx=True,
    logy=True)
To find the single largest value across multiple columns, take the row-wise maximum and then the maximum of that result:
df[['A','B']].max(axis=1).max(axis=0)
Example:
df =
A B
timestamp
2019-11-20 07:00:16 14.037880 15.217879
2019-11-20 07:01:03 14.515359 15.878632
2019-11-20 07:01:33 15.056502 16.309152
2019-11-20 07:02:03 15.533981 16.740607
2019-11-20 07:02:34 17.221073 17.195145
print(df[['A','B']].max(axis=1).max(axis=0))
17.221073
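The two-step reduction above can be checked against a direct maximum over the underlying array. This is a minimal sketch that reuses three rows of the sample values shown above and leaves out the timestamp index for brevity:

```python
import pandas as pd

# A few of the sample values from the example above (timestamps omitted).
df = pd.DataFrame(
    {
        "A": [14.037880, 15.533981, 17.221073],
        "B": [15.217879, 16.740607, 17.195145],
    }
)

row_max = df[["A", "B"]].max(axis=1)      # largest value in each row
overall = row_max.max()                   # largest of the row maxima
direct = df[["A", "B"]].to_numpy().max()  # same value in a single reduction

print(overall, direct)
```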

How to change the format for values in a dataframe?

I need to change the format of the values in a dataframe column. I have a dataframe in this format:
df =
sector funding_total_usd
1 NaN
2 10,00,000
3 3,90,000
4 34,06,159
5 2,17,50,000
6 20,00,000
How can I change it to this format:
df =
sector funding_total_usd
1 NaN
2 10000.00
3 3900.00
4 34061.59
5 217500.00
6 20000.00
This is my code:
for row in df['funding_total_usd']:
    dt1 = row.replace(',', '')
    print(dt1)
This is the error I get: "AttributeError: 'float' object has no attribute 'replace'"
How can I do this?
Here's the way to get the decimal places:
import pandas as pd
import numpy as np
df= pd.DataFrame({'funding_total_usd': [np.nan, 1000000, 390000, 3406159,21750000,2000000]})
print(df)
df['funding_total_usd'] /= 100
print(df)
funding_total_usd
0 NaN
1 1000000.0
2 390000.0
3 3406159.0
4 21750000.0
5 2000000.0
funding_total_usd
0 NaN
1 10000.00
2 3900.00
3 34061.59
4 217500.00
5 20000.00
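To control how the floats are displayed, set the option below before you print. It formats every float with two decimal places and no thousands separators. Note that it only affects how values are displayed. If the column holds strings such as '10,00,000', the commas have to be removed from the data itself first.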
pd.options.display.float_format = '{:.2f}'.format
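The AttributeError in the question arises because the NaN entry is a float, which has no replace method. A minimal sketch of one way to handle this, assuming the column holds comma-separated strings plus NaN as shown in the question: the .str accessor leaves missing values as NaN, and pd.to_numeric then converts the cleaned strings to floats.

```python
import numpy as np
import pandas as pd

# Data as in the question: strings with commas, plus one missing value.
df = pd.DataFrame(
    {
        "funding_total_usd": [
            np.nan, "10,00,000", "3,90,000", "34,06,159", "2,17,50,000", "20,00,000",
        ]
    }
)

# Strip the commas; missing values stay missing instead of raising.
cleaned = df["funding_total_usd"].str.replace(",", "", regex=False)

# Convert to float and shift two decimal places, as in the expected output.
df["funding_total_usd"] = pd.to_numeric(cleaned) / 100

print(df)
```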

Sort pandas dataframe by a column

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
import datetime
# initialise data as lists.
data = {'A': [1,1,1,1,2,2,2,2],
        'B': [2,3,1,5,7,7,1,6]}
# Create DataFrame
df = pd.DataFrame(data)
df
I want to sort 'B' within each group of 'A'.
Expected Output:
A B
0 1 1
1 1 2
2 1 3
3 1 5
4 2 1
5 2 6
6 2 7
7 2 7
You can sort a dataframe with the sort_values method. Passing both columns sorts by A first and then by B within each value of A, as requested. It returns a new dataframe rather than modifying df in place.
df.sort_values(by=['A', 'B'])
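The expected output in the question also has a fresh 0 to 7 index, whereas sort_values keeps the original row labels by default. A short sketch, assuming pandas 1.0 or later where the ignore_index parameter is available:

```python
import pandas as pd

data = {"A": [1, 1, 1, 1, 2, 2, 2, 2],
        "B": [2, 3, 1, 5, 7, 7, 1, 6]}
df = pd.DataFrame(data)

# Sort by A, then by B within each A, and renumber the rows from 0.
result = df.sort_values(by=["A", "B"], ignore_index=True)

print(result)
```

On older pandas versions, chaining .reset_index(drop=True) after sort_values gives the same result.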

Generate df with all possible combinations of values

I have 3 columns (A, B, C), each of which can take values from 0 to 100 in increments of 0.1. How can I generate a df with all possible combinations of values of these columns?
A B C
0 0 0
0 0 0.01
0 0 0.02
… … … and so on
Edit: What you describe is not combinations but a Cartesian product.
You could use itertools.product:
import itertools
import pandas as pd
import numpy as np
# Python range does not work with floats
my_range = np.arange(0, 1.01, 0.01)
combinations = itertools.product(my_range, repeat=3)
df = pd.DataFrame(combinations)
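A variant of the same idea is sketched below, under a few assumptions of my own: the column names A, B, C are taken from the question, np.linspace is swapped in for np.arange so that the number of points is explicit, and the range is deliberately kept small. The full range from the question (0 to 100 in steps of 0.1) has 1001 values per column, and therefore 1001**3 (about one billion) rows, which is unlikely to fit in memory.

```python
import itertools

import numpy as np
import pandas as pd

# 11 evenly spaced points: 0.0, 0.1, ..., 1.0 (rounded to tidy up float noise).
values = np.linspace(0, 1, 11).round(1)

# Cartesian product of the values with themselves, three times over.
rows = list(itertools.product(values, repeat=3))
df = pd.DataFrame(rows, columns=["A", "B", "C"])

print(len(df))
print(df.head())
```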

Element-wise Maximum of Two DataFrames Ignoring NaNs

I have two dataframes (df1 and df2) that each have the same rows and columns. I would like to take the maximum of these two dataframes, element-by-element. In addition, the result of any element-wise maximum with a number and NaN should be the number. The approach I have implemented so far seems inefficient:
def element_max(df1,df2):
    import pandas as pd
    cond = df1 >= df2
    res = pd.DataFrame(index=df1.index, columns=df1.columns)
    res[(df1==df1)&(df2==df2)&(cond)] = df1[(df1==df1)&(df2==df2)&(cond)]
    res[(df1==df1)&(df2==df2)&(~cond)] = df2[(df1==df1)&(df2==df2)&(~cond)]
    res[(df1==df1)&(df2!=df2)&(~cond)] = df1[(df1==df1)&(df2!=df2)]
    res[(df1!=df1)&(df2==df2)&(~cond)] = df2[(df1!=df1)&(df2==df2)]
    return res
Any other ideas? Thank you for your time.
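A more readable way to do this is concat-and-max: stack the two frames and take the maximum per index label. (The max(level=0) shortcut was removed in pandas 2.0, so group by the index level instead. Likewise, use np.nan, since sp.nan is no longer available in current SciPy.)
import numpy as np
import pandas as pd
A = pd.DataFrame([[1., 2., 3.]])
B = pd.DataFrame([[3., np.nan, 1.]])
pd.concat([A, B]).groupby(level=0).max()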
#
# 0 1 2
# 0 3.0 2.0 3.0
#
You can use where to test your df against another df: where the condition is True, the values from df are kept, and where it is False, the values from df1 are used instead. A NaN in either frame makes the comparison False, so a NaN in df is replaced by the value from df1, while a NaN in df1 ends up in the result. An additional call to fillna(df) then fills those remaining NaNs with the values from df and returns the desired df:
In [178]:
df = pd.DataFrame(np.random.randn(5,3))
df.iloc[1,2] = np.nan
print(df)
df1 = pd.DataFrame(np.random.randn(5,3))
df1.iloc[0,0] = np.nan
print(df1)
0 1 2
0 2.671118 1.412880 1.666041
1 -0.281660 1.187589 NaN
2 -0.067425 0.850808 1.461418
3 -0.447670 0.307405 1.038676
4 -0.130232 -0.171420 1.192321
0 1 2
0 NaN -0.244273 -1.963712
1 -0.043011 -1.588891 0.784695
2 1.094911 0.894044 -0.320710
3 -1.537153 0.558547 -0.317115
4 -1.713988 -0.736463 -1.030797
In [179]:
df.where(df > df1, df1).fillna(df)
Out[179]:
0 1 2
0 2.671118 1.412880 1.666041
1 -0.043011 1.187589 0.784695
2 1.094911 0.894044 1.461418
3 -0.447670 0.558547 1.038676
4 -0.130232 -0.171420 1.192321
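Another option, not mentioned in the answers above, is numpy.fmax, which computes the element-wise maximum and returns the non-NaN value when only one of the two inputs is NaN. The sketch below uses made-up data and assumes both frames share the same index and columns, as stated in the question. It works on the underlying arrays and then rebuilds the frame.

```python
import numpy as np
import pandas as pd

# Illustrative frames with NaNs in different positions.
df1 = pd.DataFrame([[1.0, np.nan, 3.0], [np.nan, np.nan, 0.5]])
df2 = pd.DataFrame([[3.0, 2.0, np.nan], [4.0, np.nan, 0.1]])

# np.fmax ignores a NaN when the other operand is a number;
# the result is NaN only where both inputs are NaN.
result = pd.DataFrame(
    np.fmax(df1.to_numpy(), df2.to_numpy()),
    index=df1.index,
    columns=df1.columns,
)

print(result)
```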
