Applying RMS formula over three columns pandas - python-3.x

I am trying to apply an RMS function to accelerometer data that is in three dimensions. I also have a timestamp column at the beginning, which I keep as a count of days. So the dataframe is as follows:
0 1 2 3
0 1.963 -12.0 -71.0 -2.0
1 1.963 -11.0 -71.0 -3.0
2 1.963 -14.0 -67.0 -6.0
3 1.963 -16.0 -63.0 -7.0
4 1.963 -18.0 -60.0 -8.0
Column '0' is Days, and the other columns are the 3-axis accelerometer data. Right now I am using this approach to compute the RMS value into a new column and drop the existing 3-axis data:
def rms_detrend(x):
    return np.sqrt(np.mean(x[1]**2 + x[2]**2 + x[3]**2))

accdf = pd.read_csv(ACC_files[1], header=None)
accdf['ACC_RMS'] = accdf.apply(rms_detrend, axis=1)
accdf = accdf.drop([1, 2, 3], axis=1)
accdf.columns = ['Days', 'ACC_RMS']
However, I have 70 such files of accelerometer data, each with about 4000+ rows. So is there a better and quicker (Pythonic) way to do this? Thanks.
The code above handles just one file, and it is very slow.

A vectorized method using pandas:
(df.iloc[:,1:]**2).sum(1).pow(1/2)
Out[26]:
0 72.034714
1 71.909666
2 68.709534
3 65.375837
4 63.150614
dtype: float64

Use:
accdf['ACC_RMS'] = np.sqrt(accdf.pop(1)**2 + accdf.pop(2)**2 + accdf.pop(3)**2)
print (accdf)
0 ACC_RMS
0 1.963 72.034714
1 1.963 71.909666
2 1.963 68.709534
3 1.963 65.375837
4 1.963 63.150614
NumPy solution for improved performance:
#[50000 rows x 4 columns]
accdf = pd.concat([accdf] * 10000, ignore_index=True)
In [27]: %timeit (accdf.iloc[:,1:]**2).sum(1).pow(1/2)
1.97 ms ± 89.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [28]: %timeit np.sqrt(np.sum(accdf.to_numpy()[:,1:]**2, axis=1))
202 µs ± 1.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Unfortunately my pop-based solution raises an error in the timing test (pop mutates the DataFrame, so it cannot be re-run), but I would guess it is slower than the NumPy-only solution.
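Since the question mentions 70 such files, here is a minimal sketch (my addition, not from the answers above) that applies the fast NumPy formulation per file. The glob pattern and the final concat are assumptions; adjust them to the real file layout:
import glob

import numpy as np
import pandas as pd

# Assumed file pattern -- adjust to wherever the 70 accelerometer CSVs live.
ACC_files = sorted(glob.glob('acc_*.csv'))

frames = []
for path in ACC_files:
    accdf = pd.read_csv(path, header=None)
    # Vectorized magnitude over the three axis columns (1, 2, 3).
    rms = np.sqrt(np.sum(accdf.to_numpy()[:, 1:] ** 2, axis=1))
    part = accdf[[0]].copy()
    part.columns = ['Days']
    part['ACC_RMS'] = rms
    frames.append(part)

all_acc = pd.concat(frames, ignore_index=True)  # optional: one combined frame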


Pandas difference of successive elements [duplicate]

This question already has answers here:
How to calculate differences between consecutive rows in pandas data frame?
(2 answers)
Closed 4 months ago.
Assume I have a data frame like so
df = pd.DataFrame(data=np.random.random((10, 10)))
I need to create a dataframe (call it diff) such that every row i of diff meets the following criterion:
diff[i] = df[i] - df[i-1]
I can do this iteratively, but that doesn't scale well. How would you do this in pandas with very fast speed?
IIUC use DataFrame.diff:
np.random.seed(2022)
df = pd.DataFrame(data=np.random.random((3,3)))
print(df)
0 1 2
0 0.009359 0.499058 0.113384
1 0.049974 0.685408 0.486988
2 0.897657 0.647452 0.896963
df1 = df.diff(-1)
print(df1)
0 1 2
0 -0.040615 -0.186350 -0.373604
1 -0.847683 0.037956 -0.409975
2 NaN NaN NaN
df2 = df.diff()
print(df2)
0 1 2
0 NaN NaN NaN
1 0.040615 0.186350 0.373604
2 0.847683 -0.037956 0.409975
NumPy alternatives for improved performance, using numpy.diff and the DataFrame constructor:
df1 = pd.DataFrame(np.diff(-df, axis=0, append=np.nan),
index=df.index, columns=df.columns)
print(df1)
0 1 2
0 -0.040615 -0.186350 -0.373604
1 -0.847683 0.037956 -0.409975
2 NaN NaN NaN
df2 = pd.DataFrame(np.diff(df, axis=0, prepend=np.nan),
index=df.index, columns=df.columns)
print(df2)
0 1 2
0 NaN NaN NaN
1 0.040615 0.186350 0.373604
2 0.847683 -0.037956 0.409975
Performance:
np.random.seed(2022)
df = pd.DataFrame(data=np.random.random((3000,3000)))
In [75]: %timeit df.diff()
142 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [76]: %timeit pd.DataFrame(np.diff(df, axis=0, prepend=np.nan), index=df.index, columns=df.columns)
77.1 ms ± 469 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
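As a quick sanity check (my addition), the NumPy construction can be verified against DataFrame.diff; equal_nan=True is needed because of the NaN border row:
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame(np.random.random((3, 3)))

fast = pd.DataFrame(np.diff(df, axis=0, prepend=np.nan),
                    index=df.index, columns=df.columns)

# equal_nan=True makes the leading all-NaN rows compare equal.
assert np.allclose(fast, df.diff(), equal_nan=True)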

How to extract an arithmetic operation from a string with Pandas

In a Pandas DataFrame
>> df.head()
A B C
0 1 â#0.00 + "s=?0.07 + 'due0.93 rt#-[ 3.01
1 2 â#0.02 + "s=?0.16 + 'due0.82 rt#-[ 2.97
...
I would like to extract only the numeric values. Column C I can do with, e.g.,
>> extr = df['C'].str.extract(r'(\d+\.\d+)', expand=False)
>> df['C'] = pd.to_numeric(extr)
>> df.head()
A B C
0 1 â#0.00 + "s=?0.07 + 'due0.93 3.01
1 2 â#0.02 + "s=?0.16 + 'due0.82 2.97
...
but I have problems with the B column. How can I extract the + operations, as well as the floats? I tried
>> extr = df['B'].str.extract(r'(\d+\.\d+)\+(\d+\.\d+)\+(\d+\.\d+)', expand=False)
which I was hoping would give me something like
0
0 '0.00+0.07+0.93'
1 '0.02+0.16+0.82'
...
but instead it gives me three columns with NaN values in them:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
...
So how could I extract the whole arithmetic operations?
(Only the + operations are needed, and any other characters, such as -, can be ignored.)
An alternate approach using Series.str.findall:
df['B'] = df['B'].str.findall(r'(\d+(?:\.\d+)?)').agg('+'.join)
# print(df)
A B C
0 1 0.00+0.07+0.93 3.01
1 2 0.02+0.16+0.82 2.97
timeit comparison of all the solutions:
df.shape
(20000, 4)
%%timeit -n100 #Shubham solution
df['B'].str.findall(r'(\d+(?:\.\d+)?)').agg('+'.join)
31.9 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100 #Rakesh solution
df["B"].str.findall(r"(\d+\.\d+)").str.join("+")
32.7 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100 #Sammy solution
["+".join(re.findall("(\d+\.?\d+)",entry)) for entry in df.B]
36.8 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100 #JudV solution
df['B'].str.replace(r'[^\d.+]', '', regex=True)
59.7 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
One way is to run a str join on the extracted data, using + as the delimiter
import re
df = pd.read_clipboard(sep='\s{2,}')
df['extract'] = ["+".join(re.findall(r"(\d+\.?\d+)", entry)) for entry in df.B]
A B C extract
0 1 â#0.00 + "s=?0.07 + 'due0.93 3.01 0.00+0.07+0.93
1 2 â#0.02 + "s=?0.16 + 'due0.82 2.97 0.02+0.16+0.82
This is one approach using str.findall & .str.join("+")
Ex:
df = pd.DataFrame({"B": ["""â#0.00 + "s=?0.07 + 'due0.93""", """â#0.02 + "s=?0.16 + 'due0.82"""]})
df["Z"] = df["B"].str.findall(r"(\d+\.\d+)").str.join("+")
print(df)
Output:
B Z
0 â#0.00 + "s=?0.07 + 'due0.93 0.00+0.07+0.93
1 â#0.02 + "s=?0.16 + 'due0.82 0.02+0.16+0.82
Python is not my forte, but I'd use replace instead and do the operation for both columns; maybe look into:
df[['B', 'C']] = df[['B','C']].replace(r'[^\d.+]', '', regex=True)
print(df)
Result:
A B C
0 1 0.00+0.07+0.93 3.01
1 2 0.02+0.16+0.82 2.97
If it's just column B you are after, then maybe simply use:
extr = df['B'].str.replace(r'[^\d.+]', '', regex=True)
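If the end goal is the numeric result of those + operations rather than the string itself, a small follow-up sketch (my addition, assuming column B already holds the cleaned "0.00+0.07+0.93" strings produced above):
# Evaluate strings like "0.00+0.07+0.93" without using eval():
# split on '+' and sum the floats.
df['B_value'] = df['B'].str.split('+').apply(
    lambda parts: sum(float(p) for p in parts))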

Pythonic way for calculating complex terms in Pandas (values bigger or equal to a number divided by the length of a list)

I have the following dataframe:
simple_list = [[3.0, [1.1, 2.2, 3.3, 4.4, 5.5]]]
simple_list.append([0.25, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]])
df4 = pd.DataFrame(simple_list, columns=['col1', 'col2'])
I want to create a new column called new_col, containing the following calculation:
the number of elements in col2 that are greater than or equal to the number in col1, divided by the length of the list in col2.
i.e.,
the first value in new_col will be 0.6 (there are 3 numbers bigger than 3.0, and the list has length 5);
the second value in new_col will be 0.6667 (there are 4 numbers bigger than 0.25, and the list has length 6).
Use DataFrame.explode with DataFrame.eval to compare the columns, then take the mean per index level:
df4['new'] = df4.explode('col2').eval('col1 < col2').groupby(level=0).mean()
Or convert the lists to a DataFrame and, before taking the mean, mask out the positions that are missing in df1:
df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
df4['new'] = df1.gt(df4['col1'], axis=0).mask(df1.isna()).mean(axis=1)
Slower solutions:
It is also possible to use a list comprehension, converting each list to a NumPy array:
df4['new'] = [(np.array(b) > a).mean() for a, b in df4[['col1','col2']].to_numpy()]
Another idea with DataFrame.apply:
df4['new'] = df4.apply(lambda x: (np.array(x['col2']) > x['col1']).mean(), axis=1)
print (df4)
col1 col2 new
0 3.00 [1.1, 2.2, 3.3, 4.4, 5.5] 0.600000
1 0.25 [0.1, 0.2, 0.3, 0.4, 0.5, 0.6] 0.666667
Performance:
df4=pd.DataFrame(simple_list,columns=['col1','col2'])
df4 = pd.concat([df4] * 10000, ignore_index=True)
In [262]: %%timeit
...: df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
...: df4['new'] = df1.gt(df4['col1'], axis=0).mask(df1.isna()).mean(axis=1)
...:
40.9 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [263]: %timeit df4.explode('col2').eval('col1 < col2').groupby(level=0).mean()
97.2 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %timeit [(np.array(b) > a).mean() for a, b in df4[['col1','col2']].to_numpy()]
305 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [265]: %timeit df4.apply(lambda x: (np.array(x['col2']) > x['col1']).mean(), axis=1)
1.23 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
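One caveat: the question asks for "greater than or equal", while all the solutions above (matching the worked examples) use a strict comparison. If >= is really intended, it is a one-method change in the wide solution, swapping gt for ge; a sketch:
# Same as the wide solution above, but with >= (ge) instead of > (gt).
df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
df4['new_ge'] = df1.ge(df4['col1'], axis=0).mask(df1.isna()).mean(axis=1)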

How can I transform the following data into a pandas dataframe [duplicate]

This question already has answers here:
Unnest (explode) a Pandas Series
(8 answers)
Closed 4 years ago.
I have this type of data, and I want the list for each id split out into separate rows:
id data
2 [1.81744912347, 1.96313966807, 1.79290908923]
3 [0.87738744314, 0.154642653196, 0.319845728764]
4 [1.12289279512, 1.16105905267, 1.14889626137]
5 [1.65093687407, 1.65010263863, 1.65614839538]
6 [0.103623262651, 0.46093367049, 0.549343505693]
7 [0.122299243819, 0.355964399805, 0.40010681636]
8 [3.08321032223, 2.92526466342, 2.6504125359, 2]
9 [0.287041436848, 0.264107869667, 0.29319302508]
10 [0.673829091668, 0.632715325748, 0.47099544284]
11 [3.04589375431, 2.19130582148, 1.68173686657]
How can I transform the data into a pandas DataFrame like the following?
id data
1 1.61567967235
1 1.55256213176
1 1.16904355984
...
10 0.673829091668
10 0.632715325748
and so on
It's a large amount of data; if I use a loop to transform it, it kills the notebook. Is there any other way to process this data?
IIUC, from
col
0 [1, 2, 3]
1 [4, 5, 6]
can do
df.col.apply(pd.Series).stack().reset_index(drop=True)
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
or
pd.Series([z for x in df.col.values for z in x])
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
Times:
%timeit df.col.apply(pd.Series).stack().reset_index(drop=True)
1.15 ms ± 26.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Series([z for x in df.col.values for z in x])
89.2 µs ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
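For completeness: since pandas 0.25 there is a built-in DataFrame.explode, which also keeps the id column aligned with each value, exactly as in the desired output. A minimal sketch on two of the question's rows:
import pandas as pd

df = pd.DataFrame({'id': [2, 3],
                   'data': [[1.81744912347, 1.96313966807, 1.79290908923],
                            [0.87738744314, 0.154642653196, 0.319845728764]]})

# One row per list element; the id is repeated alongside each value.
out = df.explode('data').reset_index(drop=True)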

Error : 'Series' object has no attribute 'sort' [duplicate]

I'm facing a problem here. In my Python environment I have installed numpy, but I still get this error:
'DataFrame' object has no attribute 'sort'
Can anyone give me some ideas?
This is my code:
final.loc[-1] =['', 'P','Actual']
final.index = final.index + 1 # shifting index
final = final.sort()
final.columns=[final.columns,final.iloc[0]]
final = final.iloc[1:].reset_index(drop=True)
final.columns.names = (None, None)
sort() was deprecated for DataFrames in favor of either:
sort_values() to sort by column(s)
sort_index() to sort by the index
sort() was deprecated (but still available) in Pandas with release 0.17 (2015-10-09) with the introduction of sort_values() and sort_index(). It was removed from Pandas with release 0.20 (2017-05-05).
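Applied to the snippet in the question, the call sorts by the index (to move the row inserted at index -1 to the top), so sort_index() is presumably the drop-in replacement; a sketch of the fixed lines:
final.loc[-1] = ['', 'P', 'Actual']
final.index = final.index + 1   # shift so the new row gets index 0
final = final.sort_index()      # replaces the removed DataFrame.sort()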
Pandas Sorting 101
sort was removed in v0.20, replaced by DataFrame.sort_values and DataFrame.sort_index. Aside from these, we also have argsort.
Here are some common use cases in sorting, and how to solve them using the sorting functions in the current API. First, the setup.
# Setup
np.random.seed(0)
df = pd.DataFrame({'A': list('accab'), 'B': np.random.choice(10, 5)})
df
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
Sort by Single Column
For example, to sort df by column "A", use sort_values with a single column name:
df.sort_values(by='A')
A B
0 a 7
3 a 5
4 b 2
1 c 9
2 c 3
If you need a fresh RangeIndex, use DataFrame.reset_index.
Sort by Multiple Columns
For example, to sort by both col "A" and "B" in df, you can pass a list to sort_values:
df.sort_values(by=['A', 'B'])
A B
3 a 5
0 a 7
4 b 2
2 c 3
1 c 9
Sort by DataFrame Index
df2 = df.sample(frac=1)
df2
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2
You can do this using sort_index:
df2.sort_index()
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
df.equals(df2)
# False
df.equals(df2.sort_index())
# True
Here are some comparable methods with their performance:
%timeit df2.sort_index()
%timeit df2.iloc[df2.index.argsort()]
%timeit df2.reindex(np.sort(df2.index))
605 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
610 µs ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
581 µs ± 7.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sort by List of Indices
For example,
idx = df2.index.argsort()
idx
# array([1, 0, 2, 3, 4])
This "sorting" problem is actually a simple indexing problem. Just passing integer labels to iloc will do.
df.iloc[idx]
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2
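One more common case worth noting (my addition): both sort_values and sort_index accept ascending=False for a descending sort, e.g.:
# Descending sort on column "B", using the same df as above.
df.sort_values(by='B', ascending=False)

#    A  B
# 1  c  9
# 0  a  7
# 3  a  5
# 2  c  3
# 4  b  2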
