Estimating the discretization step from a noisy time series

We have a sensor that measures some physical process and sends the value via PWM. Since the sensor discretizes the data, when reading the PWM signal we should only see the value in discrete steps. For example, when the signal varies from 200 to 400, we will not see every possible value between 200 and 400, but rather values like 200, 225, 250, ..., 375, 400, indicating a discretization step of 25.
However, our PWM reader is noisy, resulting in values like 198, 224, 261, 275, etc. What is the correct way to estimate the discretization step and the parameters of the noise, assuming the noise is Gaussian?
I suppose we can just enumerate all the possible steps and, for each candidate step, calculate the remainders of the values divided by that step. The discretization step should then be the candidate whose remainders produce the best fit to a Gaussian; for the best fit, I suppose we can take the least sum of squared deviations of the remainders from the Gaussian.
The obvious problem with this naive approach is that both step and step / 2 may produce the same fit. That could, I suppose, be handled by restricting the candidate step range based on a prior guess. But perhaps this naive approach based on remainders is not the proper way to deal with the problem?
In our case a typical histogram of the data points is given below. As we suspect that the step and the noise parameters depend on environmental conditions, we prefer not to mix data from different runs, which limits the amount of data.
value count
682 2
775 8
776 14
807 7
838 3
868 1
869 2
900 1
931 3
993 2
1024 2
1055 2
1117 3
1179 2
1210 1
1241 1
1272 2
1303 1
1334 1
1365 2
1397 1
1427 1
1428 2
1458 1
1521 2
1551 1
1552 1
1583 1
1614 1
1645 1
1676 1
1707 3
1738 1
1769 1
1800 2
1831 1
1862 3
1893 1
1924 2
1955 1
1986 2
2047 1
2048 2
2079 1
2111 1
2141 1
2142 1
2173 1
2204 2
2234 1
2235 2
2266 1
2297 1
2325 1
2359 2
2390 3
2483 3
2514 2
2545 1
2575 1
2607 2
2638 3
2669 1
2731 4
2762 2
2794 2
2823 1
2825 1
2854 1
2858 1
2889 1
2918 2
2980 3
3011 2
3042 2
3071 1
3073 1
3104 2
3134 2
3135 1
3197 3
3228 1
3259 1
3287 1
3289 1
3290 2
3321 2
3351 1
3352 2
3383 1
3414 1
3445 4
3506 1
3508 3
3539 1
3570 3
3601 2
3630 1
3632 1
3661 1
3663 2
3694 2
3723 1
3725 2
3754 2
3756 2
3785 16
3787 7
3818 45
3820 1
3822 1
3825 2
3849 3
3880 2
3911 1
3942 3
3971 1
3973 2
4002 1
4004 2
4035 5
4095 1
4097 3
4128 2
4152 1
4157 2
4191 2
4220 1
4222 1
4251 1
4253 3
4276 1
4313 1
4315 4
4344 1
4346 3
4375 3
4377 4

Related

How to replace a specific column data to its z-scores in a data set in python?

I have a numeric dataset and I want to calculate the z-score for the 'KM' column and replace the original values with the z-score values. I'm new to Python, please help.
KM CC Doors Gears Quarterly_Tax Weight Guarantee_Period
46986 2000 3 5 210 1165 3
72937 2000 3 5 210 1165 3
38500 2000 3 5 210 1170 3
31461 1800 3 6 100 1185 12
32189 1800 3 6 100 1185 3
23000 1800 3 6 100 1185 3
18739 1800 3 6 100 1185 3
34000 1800 3 5 100 1185 3
21716 1600 3 5 85 1105 18
64359 1600 3 5 85 1105 3
67660 1600 3 5 85 1105 3
43905 1600 3 5 100 1170 3
Something like this should do it for you:
from scipy import stats
df["KM"] = stats.zscore(df["KM"])
Note that df["KM"].apply(stats.zscore) would call zscore on each scalar element (giving NaNs), so pass the whole column to stats.zscore instead.
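A minimal self-contained check of that line, using a few of the KM values from the sample above:

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({"KM": [46986, 72937, 38500, 31461]})
df["KM"] = stats.zscore(df["KM"])
# The standardized column now has mean 0 and (population) std 1.
```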

Column level parsing in pandas data frame

Currently I am working with 20M records with 5 columns. My data frame looks like -
tran_id id code
123 1 1759#1#83#0#1362#0.2600#25.7400#2.8600#1094#1#129.6#14.4
254 1 1356#0.4950#26.7300#2.9700
831 2 1354#1.78#35.244#3.916#1101#2#40#0#1108#2#30#0
732 5 1430#1#19.35#2.15#1431#3#245.62#60.29#1074#12#385.2#58.8#1109
141 2 1809#8#75.34#292.66#1816#4#24.56#95.44#1076#47#510.89#1110.61
Desired output -
id new_code
1 1759
1 1362
1 1094
1 1356
2 1354
2 1101
2 1108
5 1430
5 1431
5 1074
5 1109
2 1809
2 1816
2 1076
What I have done so far -
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dd = pd.DataFrame({'col': df["code"].apply(lambda x: re.split('#', x))})
dd.head()
s = dd['col'].str[:]
dd= pd.DataFrame(s.values.tolist())
dd.head()
cols = range(len(list(dd)))
num_cols = len(list(dd))
new_cols = ['col' + str(i) for i in cols]
dd.columns = new_cols[:num_cols]
Just remember the size of the data is huge - 20 million rows - so I can't do any looping.
Thanks in advance
You can use Series.str.findall to extract 4-digit integers between separators:
#https://stackoverflow.com/a/55096994/2901002
s = df['code'].str.findall(r'(?<![^#])\d{4}(?![^#])')
#alternative
#s = df['code'].str.replace('#', ' ', regex=False).str.findall(r'(?<!\S)\d{4}(?!\S)')
Then create a new DataFrame using numpy.repeat with str.len, and flatten with chain.from_iterable:
from itertools import chain
df = pd.DataFrame({
'id' : df['id'].values.repeat(s.str.len()),
'new_code' : list(chain.from_iterable(s.tolist()))
})
print (df)
id new_code
0 1 1759
1 1 1362
2 1 1094
3 1 1356
4 2 1354
5 2 1101
6 2 1108
7 5 1430
8 5 1431
9 5 1074
10 5 1109
11 2 1809
12 2 1816
13 2 1076
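A self-contained version of this approach on a toy frame built from a few of the sample rows:

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2],
    'code': ['1759#1#83#0#1362#0.2600',
             '1356#0.4950#26.7300#2.9700',
             '1354#1.78#35.244#3.916#1101#2#40#0'],
})
# 4-digit runs bounded by '#' (or a string edge) on both sides,
# so the digits inside decimals like 0.2600 are not matched.
s = df['code'].str.findall(r'(?<![^#])\d{4}(?![^#])')
out = pd.DataFrame({
    'id': df['id'].values.repeat(s.str.len()),
    'new_code': list(chain.from_iterable(s.tolist())),
})
```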
An alternative approach uses Series.str.extractall with a different regex pattern. Note that this pattern also captures the 1110 inside 1110.61 (only the preceding character is checked), so the result below has one extra row compared to the desired output:
(df.set_index('id').code.str.extractall(r'(?:[^\.]|^)(?P<new_code>\d{4})')
.reset_index(0)
.reset_index(drop=True)
)
[out]
id new_code
0 1 1759
1 1 1362
2 1 1094
3 1 1356
4 2 1354
5 2 1101
6 2 1108
7 5 1430
8 5 1431
9 5 1074
10 5 1109
11 2 1809
12 2 1816
13 2 1076
14 2 1110

GroupBy dataframe and find out max number of occurrences of another column

I have to use groupby() on a dataframe in Python 3.x. The column name is origin; based on the origin, I have to find the destination with the maximum number of occurrences.
Sample df is like:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay origin dest
0 2013 1 1 517 515 2 830 819 11 EWR IAH
1 2013 1 1 533 529 4 850 830 20 LGA IAH
2 2013 1 1 542 540 2 923 850 33 JFK MIA
3 2013 1 1 544 545 -1 1004 1022 -18 JFK BQN
4 2013 1 1 554 600 -6 812 837 -25 LGA ATL
5 2013 1 1 554 558 -4 740 728 12 EWR ORD
6 2013 1 1 555 600 -5 913 854 19 EWR FLL
7 2013 1 1 557 600 -3 709 723 -14 LGA IAD
8 2013 1 1 557 600 -3 838 846 -8 JFK MCO
9 2013 1 1 558 600 -2 753 745 8 LGA ORD
You can use the following to count the values of the other column per group:
df.groupby(['origin'])['dest'].size().reset_index()
origin dest
0 EWR 3
1 JFK 3
2 LGA 4
You can use aggregate functions to make your life simpler and plot graphs from the result as well. The nested-dict renaming form of agg was removed in recent pandas, so count each origin/dest pair directly:
df = df.groupby(['origin', 'dest']).size().reset_index(name='Count')
df
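Neither snippet above returns the destination with the most occurrences per origin directly, which is what the question asks for. One way to get it (a sketch on a toy frame, using value_counts().idxmax()):

```python
import pandas as pd

df = pd.DataFrame({
    'origin': ['EWR', 'EWR', 'EWR', 'LGA', 'LGA', 'LGA'],
    'dest':   ['IAH', 'IAH', 'ORD', 'ATL', 'ORD', 'ORD'],
})
# For each origin, pick the destination that occurs most often.
top = df.groupby('origin')['dest'].agg(lambda s: s.value_counts().idxmax())
```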

Transposing multi index dataframe in pandas

HID gen views
1 1 20
1 2 2532
1 3 276
1 4 1684
1 5 779
1 6 200
1 7 545
2 1 20
2 2 7478
2 3 750
2 4 7742
2 5 2643
2 6 208
2 7 585
3 1 21
3 2 4012
3 3 2019
3 4 1073
3 5 3372
3 6 8
3 7 1823
3 8 22
This is a sample section of a data frame, where HID and gen are indexes.
How can it be transformed like this?
HID 1 2 3 4 5 6 7 8
1 20 2532 276 1684 779 200 545 nan
2 20 7478 750 7742 2643 208 585 nan
3 21 4012 2019 1073 3372 8 1823 22
It's called pivoting, i.e.:
df.reset_index().pivot(index='HID', columns='gen', values='views')
gen 1 2 3 4 5 6 7 8
HID
1 20.0 2532.0 276.0 1684.0 779.0 200.0 545.0 NaN
2 20.0 7478.0 750.0 7742.0 2643.0 208.0 585.0 NaN
3 21.0 4012.0 2019.0 1073.0 3372.0 8.0 1823.0 22.0
Use unstack:
df = df['views'].unstack()
If you also need HID as a column, add reset_index + rename_axis:
df = df['views'].unstack().reset_index().rename_axis(None, axis=1)
print (df)
HID 1 2 3 4 5 6 7 8
0 1 20.0 2532.0 276.0 1684.0 779.0 200.0 545.0 NaN
1 2 20.0 7478.0 750.0 7742.0 2643.0 208.0 585.0 NaN
2 3 21.0 4012.0 2019.0 1073.0 3372.0 8.0 1823.0 22.0
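A self-contained sketch of the unstack route, on a small subset of the data above:

```python
import pandas as pd

df = pd.DataFrame({
    'HID': [1, 1, 1, 2, 2, 2],
    'gen': [1, 2, 3, 1, 2, 3],
    'views': [20, 2532, 276, 20, 7478, 750],
}).set_index(['HID', 'gen'])

wide = df['views'].unstack()  # rows: HID, columns: gen
```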

Return a median value in excel where there are multiple rows of data

In Excel how do I return a median value for a set of data where there are multiple rows and columns? I have a set of data where the first column contains a reference number and the second column contains a list of readings over a number of days. How do I calculate the median value for each reference number using a formula?
number volume
1 3072
1 2304
1 2016
1 2496
1 2144
1 2528
1 3312
1 3360
1 2976
1 2768
1 2688
1 3040
1 3008
1 2560
2 574
2 574
2 574
2 574
2 576
2 574
2 575
2 574
2 576
2 574
2 574
2 574
2 574
2 574
3 2880
3 2880
3 2912
3 2976
3 1536
3 288
3 2976
3 2944
3 2880
3 1536
3 2976
3 1536
3 2880
3 2880
4 2267
4 2267
4 2267
4 2267
4 2267
4 2267
4 2268
4 2267
4 2267
4 2267
4 2267
4 2267
5 800
5 800
5 1984
5 416
5 416
5 416
5 416
5 416
5 416
5 416
5 416
5 416
5 416
5 1984
6 800
6 832
6 832
6 832
6 800
6 832
6 832
6 832
6 832
6 832
6 832
6 832
6 832
6 832
The reference number is Column A and the reading is Column B. In this example I have used just six reference numbers but my real data has several hundred.
Consider the array formula:
=MEDIAN(IF($A$2:$A$83=ROWS($1:1),$B$2:$B$83))
Pick a cell, enter the formula, and copy down. ROWS($1:1) evaluates to 1, 2, 3, ... as the formula is copied down, matching the reference numbers.
Array formulas must be entered with Ctrl + Shift + Enter rather than just the Enter key.
Try this array formula:
=MEDIAN(IF(A:A=1,B:B))
This is an array formula and must be confirmed with Ctrl-Shift-Enter.
For a non-CSE array formula (one entered normally), if you have Excel 2010 or later, use this:
=AGGREGATE(17,6,(B:B/(A:A=1)),2)
Where 1 is the reference number. You can make it dynamic by replacing it with a cell reference, so that as the cell changes, so does the answer.
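If the same data ever lands outside Excel, the per-reference median is a one-liner in pandas (a sketch using the first few rows of references 1 and 2 above):

```python
import pandas as pd

df = pd.DataFrame({
    'number': [1, 1, 1, 2, 2, 2, 2],
    'volume': [3072, 2304, 2016, 574, 576, 574, 575],
})
# Median reading per reference number.
medians = df.groupby('number')['volume'].median()
```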
