How to extract an arithmetic operation from a string with Pandas - python-3.x

In a Pandas DataFrame
>> df.head()
A B C
0 1 â#0.00 + "s=?0.07 + 'due0.93 rt#-[ 3.01
1 2 â#0.02 + "s=?0.16 + 'due0.82 rt#-[ 2.97
...
I would like to extract only the numeric values. For column C I can do this with, e.g.,
>> extr = df['C'].str.extract(r'(\d+\.\d+)', expand=False)
>> df['C'] = pd.to_numeric(extr)
>> df.head()
A B C
0 1 â#0.00 + "s=?0.07 + 'due0.93 3.01
1 2 â#0.02 + "s=?0.16 + 'due0.82 2.97
...
but I have problems with the B column. How can I extract the + operations, as well as the floats? I tried
>> extr = df['B'].str.extract(r'(\d+\.\d+)\+(\d+\.\d+)\+(\d+\.\d+)', expand=False)
which I was hoping would give me something like
0
0 '0.00+0.07+0.93'
1 '0.02+0.16+0.82'
...
but instead it gives me three columns with NaN values in them:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
...
So how could I extract the whole arithmetic operations?
(Only the + operations are needed, and any other characters, such as -, can be ignored.)
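For context, the three-group attempt above returns NaN because the pattern must match one contiguous span of the string: it puts a literal \+ immediately after the first float, while in the data the floats are separated by ' + ' plus other junk characters. A minimal sketch (using simplified stand-in strings, not the original data) that lets a non-greedy wildcard consume the filler between the groups:
import pandas as pd

df = pd.DataFrame({'B': ['a#0.00 + "s=?0.07 + x0.93',
                         'a#0.02 + "s=?0.16 + x0.82']})
# .*? consumes the junk between the three floats
parts = df['B'].str.extract(r'(\d+\.\d+).*?(\d+\.\d+).*?(\d+\.\d+)')
df['B'] = parts[0] + '+' + parts[1] + '+' + parts[2]  # e.g. '0.00+0.07+0.93'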

An alternate approach using Series.str.findall:
df['B'] = df['B'].str.findall(r'(\d+(?:\.\d+)?)').agg('+'.join)
# print(df)
A B C
0 1 0.00+0.07+0.93 3.01
1 2 0.02+0.16+0.82 2.97
timeit comparison of all the solutions:
df.shape
(20000, 4)
%%timeit -n100 #Shubham solution
df['B'].str.findall(r'(\d+(?:\.\d+)?)').agg('+'.join)
31.9 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100 #Rakesh solution
df["B"].str.findall(r"(\d+\.\d+)").str.join("+")
32.7 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100 #Sammy solution
["+".join(re.findall("(\d+\.?\d+)",entry)) for entry in df.B]
36.8 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100 #JudV solution
df['B'].str.replace(r'[^\d.+]', '', regex=True)
59.7 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

One way is to run a str join on the extracted data, using + as the delimiter:
import re
import pandas as pd
df = pd.read_clipboard(sep=r'\s{2,}')
df['extract'] = ["+".join(re.findall(r"(\d+\.?\d+)", entry)) for entry in df.B]
A B C extract
0 1 â#0.00 + "s=?0.07 + 'due0.93 3.01 0.00+0.07+0.93
1 2 â#0.02 + "s=?0.16 + 'due0.82 2.97 0.02+0.16+0.82

This is one approach using str.findall & .str.join("+")
Ex:
df = pd.DataFrame({"B": ["""â#0.00 + "s=?0.07 + 'due0.93""", """â#0.02 + "s=?0.16 + 'due0.82"""]})
df["Z"] = df["B"].str.findall(r"(\d+\.\d+)").str.join("+")
print(df)
Output:
B Z
0 â#0.00 + "s=?0.07 + 'due0.93 0.00+0.07+0.93
1 â#0.02 + "s=?0.16 + 'due0.82 0.02+0.16+0.82

Python is not my forte but I'd use replace instead and do the operation for both columns, maybe look into:
df[['B', 'C']] = df[['B','C']].replace(r'[^\d.+]', '', regex=True)
print(df)
Result:
A B C
0 1 0.00+0.07+0.93 3.01
1 2 0.02+0.16+0.82 2.97
If it's just column B you are after, then maybe simply use:
extr = df['B'].str.replace(r'[^\d.+]', '', regex=True)
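If the end goal is to actually evaluate the extracted expressions, a possible follow-up (not part of the original answers) is pd.eval, assuming extr holds strings like '0.00+0.07+0.93':
values = extr.map(pd.eval)  # pd.eval('0.00+0.07+0.93') -> 1.0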

Related

How to make a calculation in a pandas dataframe depending on a value of a certain column

I have this dataframe and I want to make a calculation depending on a condition, like below:
count prep result
0 10 100
10 100 100
I want to create a new column evaluated that is:
if df['count'] == 0:
    df['evaluated'] = df['result'] / df['prep']
else:
    df['evaluated'] = df['result'] / df['count']
expected result is:
count prep result evaluated
0 10 100 10
100 10 100 1
What's the best way to do it? My real dataframe has 30k rows.
You can use where or mask:
df['evaluated'] = df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))
Or:
df['evaluated'] = df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))
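To make the where/mask semantics concrete, a small worked sketch (sample data adapted from the question, with the input error fixed):
import pandas as pd

df = pd.DataFrame({'count': [0, 100], 'prep': [10, 10], 'result': [100, 100]})
# where keeps values where the condition holds and substitutes the other
# argument elsewhere: the denominator is 'prep' when count == 0, else 'count'
denom = df['prep'].where(df['count'].eq(0), df['count'])
df['evaluated'] = df['result'].div(denom)  # [10.0, 1.0]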
Output (assuming there was an error in the provided input):
count prep result evaluated
0 0 10 100 10.0
1 100 10 100 1.0
You can also use np.where from numpy to do that:
df['evaluated'] = np.where(df['count'] == 0,
                           df['result'] / df['prep'],    # count == 0
                           df['result'] / df['count'])   # count != 0
Performance (the differences are not really significant) over 30k rows:
>>> %timeit df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))
652 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))
638 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit np.where(df['count'] == 0, df['result'] / df['prep'], df['result'] / df['count'])
462 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Pandas difference of successive elements [duplicate]

This question already has answers here:
How to calculate differences between consecutive rows in pandas data frame?
(2 answers)
Closed 4 months ago.
Assume I have a data frame like so
df = pd.DataFrame(data=np.random.random((10, 10)))
I need to create a dataframe (call it diff) such that every element of diff meets the following criterion:
diff[i] = df[i] - df[i-1]
I can do this iteratively, but that doesn't scale well. How would you do this in pandas at high speed?
IIUC use DataFrame.diff:
np.random.seed(2022)
df = pd.DataFrame(data=np.random.random((3,3)))
print(df)
0 1 2
0 0.009359 0.499058 0.113384
1 0.049974 0.685408 0.486988
2 0.897657 0.647452 0.896963
df1 = df.diff(-1)
print(df1)
0 1 2
0 -0.040615 -0.186350 -0.373604
1 -0.847683 0.037956 -0.409975
2 NaN NaN NaN
df2 = df.diff()
print(df2)
0 1 2
0 NaN NaN NaN
1 0.040615 0.186350 0.373604
2 0.847683 -0.037956 0.409975
NumPy alternatives to improve performance, with numpy.diff and the DataFrame constructor:
df1 = pd.DataFrame(np.diff(-df, axis=0, append=np.nan),
                   index=df.index, columns=df.columns)
print(df1)
0 1 2
0 -0.040615 -0.186350 -0.373604
1 -0.847683 0.037956 -0.409975
2 NaN NaN NaN
df2 = pd.DataFrame(np.diff(df, axis=0, prepend=np.nan),
                   index=df.index, columns=df.columns)
print(df2)
0 1 2
0 NaN NaN NaN
1 0.040615 0.186350 0.373604
2 0.847683 -0.037956 0.409975
Performance:
np.random.seed(2022)
df = pd.DataFrame(data=np.random.random((3000,3000)))
In [75]: %timeit df.diff()
142 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [76]: %timeit pd.DataFrame(np.diff(df, axis=0, prepend=np.nan), index=df.index, columns=df.columns)
77.1 ms ± 469 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

parsing a pandas dataframe column of dictionary data into new columns for each dictionary key

In Python 3, pandas. Imagine there is a dataframe df with a column x:
df = pd.DataFrame(
    [
        {'x': '{"a":"1","b":"2","c":"3"}'},
        {'x': '{"a":"2","b":"3","c":"4"}'}
    ]
)
The column x has data which looks like a dictionary. I wonder how I can parse it into a new dataframe, so that each key becomes a new column?
The desired output dataframe is like
x,a,b,c
'{"a":"1","b":"2","c":"3"}',1,2,3
'{"a":"2","b":"3","c":"4"}',2,3,4
None of the solutions in this post seem to work in this case:
parsing a dictionary in a pandas dataframe cell into new row cells (new columns)
df1=pd.DataFrame(df.loc[:,'x'].values.tolist())
print(df1)
results in the same dataframe; it didn't separate each key into its own column.
Any 2 cents?
Thanks!
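A quick diagnostic (hypothetical, not from the original post) shows why that approach falls flat here:
type(df.loc[0, 'x'])
# <class 'str'> -- each cell is a JSON string, not a dict,
# so tolist() just hands the DataFrame constructor a list of strings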
You can also map json.loads and convert to a dataframe like:
import json
df1 = pd.DataFrame(df['x'].map(json.loads).tolist(),index=df.index)
print(df1)
a b c
0 1 2 3
1 2 3 4
This tests to be faster than evaluating via ast; below is the benchmark for 40K rows:
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
import json
df1 = pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
#256 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
#1.32 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
#1.34 s ± 71.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Because the column holds string representations of dictionaries, it is necessary to convert the values to dictionaries first:
import ast, json
#performance for repeated sample data, in real data should be different
m = pd.concat([df]*20000,ignore_index=True)
In [98]: %timeit pd.DataFrame([json.loads(x) for x in m['x']], index=m.index)
206 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#anky_91 solution
In [99]: %timeit pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
210 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [100]: %timeit pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
903 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [101]: %timeit pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
893 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(df1)
a b c
0 1 2 3
1 2 3 4
Last, to append to the original:
df = df.join(df1)
print(df)
x a b c
0 {"a":"1","b":"2","c":"3"} 1 2 3
1 {"a":"2","b":"3","c":"4"} 2 3 4

Error : 'Series' object has no attribute 'sort' [duplicate]

I'm facing a problem here: in my Python environment I have installed numpy, but I still get this error:
'DataFrame' object has no attribute 'sort'
Can anyone give me some ideas?
This is my code :
final.loc[-1] =['', 'P','Actual']
final.index = final.index + 1 # shifting index
final = final.sort()
final.columns=[final.columns,final.iloc[0]]
final = final.iloc[1:].reset_index(drop=True)
final.columns.names = (None, None)
sort() was deprecated for DataFrames in favor of either:
sort_values() to sort by column(s), or
sort_index() to sort by the index.
sort() was deprecated (but still available) in Pandas with release 0.17 (2015-10-09) with the introduction of sort_values() and sort_index(). It was removed from Pandas with release 0.20 (2017-05-05).
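Applied to the question's code, a minimal fix (assuming the original final.sort() relied on the default sort-by-index behaviour):
final.loc[-1] = ['', 'P', 'Actual']
final.index = final.index + 1   # shifting index
final = final.sort_index()      # replaces the removed final.sort()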
Pandas Sorting 101
sort has been replaced in v0.20 by DataFrame.sort_values and DataFrame.sort_index. Aside from this, we also have argsort.
Here are some common use cases in sorting, and how to solve them using the sorting functions in the current API. First, the setup.
# Setup
np.random.seed(0)
df = pd.DataFrame({'A': list('accab'), 'B': np.random.choice(10, 5)})
df
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
Sort by Single Column
For example, to sort df by column "A", use sort_values with a single column name:
df.sort_values(by='A')
A B
0 a 7
3 a 5
4 b 2
1 c 9
2 c 3
If you need a fresh RangeIndex, use DataFrame.reset_index.
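For example, one possible chain:
df.sort_values(by='A').reset_index(drop=True)  # sorted, with a fresh 0..n-1 index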
Sort by Multiple Columns
For example, to sort by both columns "A" and "B" in df, you can pass a list to sort_values:
df.sort_values(by=['A', 'B'])
A B
3 a 5
0 a 7
4 b 2
2 c 3
1 c 9
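sort_values also accepts a list of per-column directions through its ascending parameter, e.g.:
df.sort_values(by=['A', 'B'], ascending=[True, False])  # A ascending, B descending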
Sort By DataFrame Index
df2 = df.sample(frac=1)
df2
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2
You can do this using sort_index:
df2.sort_index()
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
df.equals(df2)
# False
df.equals(df2.sort_index())
# True
Here are some comparable methods with their performance:
%timeit df2.sort_index()
%timeit df2.iloc[df2.index.argsort()]
%timeit df2.reindex(np.sort(df2.index))
605 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
610 µs ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
581 µs ± 7.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sort by List of Indices
For example,
idx = df2.index.argsort()
idx
# array([1, 0, 2, 3, 4])
This "sorting" problem is actually a simple indexing problem. Just passing integer labels to iloc will do.
df.iloc[idx]
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2

Applying RMS formula over three columns pandas

I am trying to apply an RMS function to accelerometer data, which is in three dimensions. Also, I have a time stamp column at the beginning, which I have kept as a days count. So the dataframe is as follows:
0 1 2 3
0 1.963 -12.0 -71.0 -2.0
1 1.963 -11.0 -71.0 -3.0
2 1.963 -14.0 -67.0 -6.0
3 1.963 -16.0 -63.0 -7.0
4 1.963 -18.0 -60.0 -8.0
Column '0' is days, and all the other columns are the 3-axis accelerometer data. Right now I am using this approach to compute the RMS value into a new column and drop the existing 3-axis data:
def rms_detrend(x):
    return np.sqrt(np.mean(x[1]**2 + x[2]**2 + x[3]**2))
accdf =pd.read_csv(ACC_files[1],header=None)
accdf['ACC_RMS'] = accdf.apply(rms_detrend,axis=1)
accdf = accdf.drop([1,2,3],axis=1)
accdf.columns = ['Days', 'ACC_RMS']
However, I have 70 such files of accelerometer data, each with about 4000+ rows. So is there a better and quicker (pythonic) way to do this? Thanks.
The code above is for just one file, and it's very slow.
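As an aside, np.mean in rms_detrend receives a single scalar (the row's sum of squares), so it is a no-op and the function computes the Euclidean norm rather than a true RMS; the answers below reproduce that same value. A sketch of a true per-row RMS, if that is what is actually intended:
import numpy as np

def rms_row(row):
    # mean of the squared axis values, then the square root
    return np.sqrt(np.mean([row[1]**2, row[2]**2, row[3]**2]))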
A method using pandas:
(df.iloc[:,1:]**2).sum(1).pow(1/2)
Out[26]:
0 72.034714
1 71.909666
2 68.709534
3 65.375837
4 63.150614
dtype: float64
Use:
accdf['ACC_RMS'] = np.sqrt(accdf.pop(1)**2 + accdf.pop(2)**2 + accdf.pop(3)**2)
print (accdf)
0 ACC_RMS
0 1.963 72.034714
1 1.963 71.909666
2 1.963 68.709534
3 1.963 65.375837
4 1.963 63.150614
NumPy solution to improve performance:
#[50000 rows x 4 columns]
accdf = pd.concat([accdf] * 10000, ignore_index=True)
In [27]: %timeit (accdf.iloc[:,1:]**2).sum(1).pow(1/2)
1.97 ms ± 89.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [28]: %timeit np.sqrt(np.sum(accdf.to_numpy()[:,1:]**2, axis=1))
202 µs ± 1.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Unfortunately my solution raises an error in the timing test (pop mutates the DataFrame, so it cannot be re-run), but I guess it is slower than the numpy-only solution.
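Since the question mentions 70 files, a possible sketch (assuming ACC_files is the list of CSV paths from the question) combining the faster numpy computation with a simple loop:
import numpy as np
import pandas as pd

frames = []
for path in ACC_files:  # one accelerometer CSV per file
    accdf = pd.read_csv(path, header=None)
    arr = accdf.to_numpy()
    frames.append(pd.DataFrame({'Days': arr[:, 0],
                                'ACC_RMS': np.sqrt((arr[:, 1:] ** 2).sum(axis=1))}))
result = pd.concat(frames, ignore_index=True)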
