How to remove values below the 1st and above the 99th percentiles after multiple imputation in R?

I want to remove data below the 1st percentile and above the 99th percentile.
I am using the following code:
library(mice)
data(nhanes)
imp <- mice(nhanes)
I tried the following code but it does not work.
imp1 <- with(imp, imp[imp$SDMVSTRA > quantile(imp$SDMVSTRA, p = 0.99) |
                      imp$SDMVSTRA < quantile(imp$SDMVSTRA, p = 0.01)])
I want to remove data below the 1st and above the 99th percentiles of the SDMVSTRA variable in this dataset.
Thank you in advance
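A minimal sketch of one way to do this: trim each completed dataset separately after imputation, rather than filtering the mids object directly. Note that mice's nhanes data has no SDMVSTRA column (its columns are age, bmi, hyp, and chl), so chl stands in here for the target variable:

library(mice)
data(nhanes)
imp <- mice(nhanes, seed = 1)

# Trim each completed dataset to the 1st-99th percentile range of chl.
trimmed <- lapply(seq_len(imp$m), function(i) {
    d <- complete(imp, i)
    q <- quantile(d$chl, probs = c(0.01, 0.99))
    d[d$chl >= q[1] & d$chl <= q[2], ]
})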

Related

How to measure and store specific distances of Price to a Moving Average

I'm a beginner with Python and coding.
Say I have a Moving Average and I want to set up conditions to store information over the span where Price crosses above the MA and then back below it.
The info I want to store is:
the Peak distance from Price to MA
the distance from the highest Price point to the Price point where Price crosses below MA.
https://i.imgur.com/st9cc7D.png
The data I use is a pandas DataFrame with 4 columns, 2 of which are shift(1) copies of the others: data['Close', 'Close Yesterday', 'MA', 'MA Yesterday'].
Since I need to find the Peak distance and the Peak-to-Trough distance for each event (Price crossing above and then below the MA), I have to clear the arrays each time an event has ended. That's where I'm struggling.
Using a for loop with nested if statements (or a while loop with nested ifs) to store the extrema once the condition is no longer valid didn't seem to work, because the arrays were already empty.
Here's an example:
import numpy as np

# Up Prices
price = []
# Up Price-MA diffs
diff = []
# Peaks
peak = []
# Peaks to Trough (Price crosses below MA)
peak_trough = []

for i in range(len(data)):
    row = data.iloc[i:i+1, :]
    if row["Close Yesterday"].values[0] > row["MA Yesterday"].values[0]:
        if row["Close"].values[0] > row["MA"].values[0]:
            # store Up Prices
            price.append(row["Close Yesterday"].values[0])
            # take & store the Price-MA diffs
            diff.append(row["Close Yesterday"].values[0] - row["MA Yesterday"].values[0])
    if row["Close Yesterday"].values[0] > row["MA Yesterday"].values[0]:
        if row["Close"].values[0] < row["MA"].values[0]:
            # store the Peak
            peak.append(np.max(diff))
            # store the Peak to Trough
            peak_trough.append(np.max(price) - row["Close"].values[0])
            # reset Prices
            price = []
            # reset Price-MA diffs
            diff = []
The price and diff lists are already empty when the second nested if statement runs.
How do you go about coding this? Much appreciated!
I think the code above works with two opposite conditions that belong in an if/elif within the same iteration. If we call the first condition cond-1 and the second cond-2, then for any given row only one of them can be true. So updating price and diff under cond-1 and consuming them under cond-2 in the same row does not look correct.
One approach to achieve this is:
Run the iteration twice, independently. The first pass handles cond-1 and updates the price and diff lists, keyed by a unique identifier (for example the row index, if the row matters).
The second pass handles cond-2 and uses the price and diff lists built up in the first pass.
Please note: there is then no need to reset price and diff.
Both conditions can never hold in the same iteration.
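Alternatively, here is a minimal sketch of a single-pass version using if/elif, which resets the lists only after the downward cross has been recorded (column names follow the question; data is assumed to already exist):

import numpy as np

price, diff, peak, peak_trough = [], [], [], []
for _, row in data.iterrows():
    if row["Close Yesterday"] > row["MA Yesterday"] and row["Close"] > row["MA"]:
        # still above the MA: keep accumulating
        price.append(row["Close Yesterday"])
        diff.append(row["Close Yesterday"] - row["MA Yesterday"])
    elif row["Close Yesterday"] > row["MA Yesterday"] and row["Close"] < row["MA"]:
        # crossed below the MA: close out the event, then reset
        if diff:
            peak.append(np.max(diff))
            peak_trough.append(np.max(price) - row["Close"])
        price, diff = [], []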

Matplotlib: Changing the limits of an axis based on the range

I would like to limit the y-axis boundaries based on the general range of my data, avoiding spikes but not removing them.
I am producing many sets of graphs comparing two sets of data. Both sets contain data over a year and have been read into dataframes with pandas, and the graphs are produced in a loop for each month. One of the sets has intermittent spikes which cause the range on the y-axis to be plotted much too large, resulting in an unreadable chart. Setting a fixed boundary with pyplot.ylim() doesn't help, as the general range of the data (for example within one month) changes from chart to chart, and applying a hard limit reduces the readability of many of the charts.
For example: one month may have data which generally does not go higher than a value of 300,000 but has several spikes which go way over 500,000 (and below -500,000), but another month may also have large spikes but data which does otherwise not go higher than a value of 150,000.
I've also tried setting values which are too large to NaN using df2 = df[df.y < 500000] = np.nan based on this answer, but the breaks in the line graph are too small to see and the fact that the spikes occur gets lost.
Is there some way to figure out what the general maximum and minimum range of the data is so that the y-axis limits can be set in a sensible way?
As I was writing this question something occurred to me and I solved it by making a copy of the dataframe, removing the very large values, then checking what the max and min values of the remaining data were.
import numpy as np

def check_min_max(selected, selected2):
    max_test = selected2.copy(deep=True)
    # remove very large values
    max_test[(max_test[measurements_col] > 500000) |
             (max_test[measurements_col] < -500000)] = np.nan
    # get new max and min y-values
    measurements_y_max = max_test[measurements_col].max()
    measurements_y_min = max_test[measurements_col].min()
    results_y_max = selected[results_col].max()
    results_y_min = selected[results_col].min()
    if measurements_y_max > results_y_max:
        y_max = measurements_y_max
    else:
        y_max = results_y_max
    if measurements_y_min > 0 or results_y_min > 0:
        y_min = 0 - (y_max * 0.01)
    elif measurements_y_min < results_y_min:
        y_min = measurements_y_min
    else:
        y_min = results_y_min
    # add 5% to the range for readability
    return y_min + (y_min * 0.05), y_max + (y_max * 0.05)
I'm also aware that there was no need to copy the dataframe after it was passed to the function. I'd originally written it as part of the main code before I moved it to a function, and I haven't changed it yet.
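A quantile-based variant could avoid the hard-coded 500,000 cutoff by deriving the limits from the bulk of the data itself. This is only a sketch; the function and parameter names are made up:

import numpy as np

def robust_ylim(series, lower_q=0.01, upper_q=0.99, pad=0.05):
    # take the 1st and 99th percentiles as the 'general range' of the data
    lo, hi = series.quantile([lower_q, upper_q])
    span = hi - lo
    return lo - pad * span, hi + pad * span

# usage: plt.ylim(*robust_ylim(selected2[measurements_col]))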

Question(s) regarding computational intensity, prediction of time required to produce a result

Introduction
I have written code to give me a set of numbers in '36 by q' format (1 <= q <= 36), subject to the following conditions:
Each row must use numbers from 1 to 36.
No number must repeat itself in a column.
Method
The first row is generated randomly. Each number in each subsequent row is checked against the above conditions. If a number fails to satisfy one of the given conditions, it doesn't get picked again for that specific place in that specific row. If it runs out of acceptable values, it starts over again.
Problem
Low q values (say q = 15) take less than a second to compute, but the main objective is q = 36, and it has now been running for more than 24 hours on my PC.
Questions
Can I predict the time required by it using the data I have from lower q values? How?
Is there any better algorithm to perform this in less time?
How can I calculate the average number of cycles it requires? (using combinatorics or otherwise).
Can I predict the time required by it using the data I have from lower q values? How?
Usually, you should be able to determine the running time of your algorithm in terms of its input; refer to big-O notation.
If I understood your question correctly, you shouldn't spend hours computing a 36x36 matrix satisfying your conditions. Most probably you are stuck in an infinite loop or something similar. It would be clearer if you could share a code snippet.
Is there any better algorithm to perform this in less time?
Well, I tried to do what you described, and it works in O(q) (assuming that the number of rows is constant).
import random

def rotate(arr):
    return arr[-1:] + arr[:-1]

y = set(range(1, 37))
n = 36
q = 36
res = []
i = 0
while i < n:
    x = []
    for j in range(q):
        if y:
            el = random.choice(list(y))
            y.remove(el)
            x.append(el)
    res.append(x)
    for j in range(q - 1):
        x = rotate(x)
        res.append(x)
        i += 1
    i += 1
Basically, I choose random numbers from the set {1..36} for one row, then rotate that row and assign the rotated copies to the next q-1 rows.
This guarantees both conditions you have mentioned.
How can I calculate the average number of cycles it requires? (Using combinatorics or otherwise.)
If you cannot calculate the computation time in terms of the input (the code is too complex), then fitting a curve seems right.
Or you could create an ML model with the iterations as data and the time for each iteration as the label, and perform linear regression. But that seems to be overkill for your example.
Graph q vs time.
Fit a curve.
Extrapolate to q = 36.
You might also want to graph q vs log(time), as that may give a curve that is easier to fit.
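As an illustration of that idea (the timing numbers below are made up, not measured):

import numpy as np

# timings from smaller runs; these values are purely illustrative
q_vals = np.array([5, 10, 15, 20, 25])
times = np.array([0.01, 0.05, 0.4, 3.0, 25.0])  # seconds

# fit log(time) against q, then extrapolate to q = 36
coeffs = np.polyfit(q_vals, np.log(times), deg=2)
print(np.exp(np.polyval(coeffs, 36)))  # predicted seconds for q = 36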

Setting a value in a column within a group according to some condition

I'm new to groups in pandas, and relatively new to pandas itself, so I hope one of you can help me with my problem.
Aim: flag outliers within a group by setting the relevant cell in the relevant column to 1. The condition is that the data point is outside a calculated group-specific limit.
Data: This is a geopandas dataframe with multiple time series with some numeric variables. Each timeseries has its own id.
Some background:
I want to determine outliers for each timeseries by
first grouping the timeseries according to timeseries id
then calculate the lower and upper limit of the variables PER group
then 'flag' the values which are outside the limits by adding a 1 in a specific 'outlier' column
Here is the code which calculates the limits, however, when it comes to setting the flag I have a hard time to figure that out:
import numpy as np

df_timeseries['outlier'] = 0  # note: np.zeros without parentheses would assign the function itself
for timeseries, group in df_timeseries.groupby('timeseries.id'):
    Q1 = group['Variable.value'].quantile(0.25)
    Q3 = group['Variable.value'].quantile(0.75)
    IQR = Q3 - Q1
    low_lim = Q1 - 1.5 * IQR
    up_lim = Q3 + 1.5 * IQR
    for value in group['Variable.value']:
        if (value < low_lim) or (value > up_lim):
            pass  # here --> set '1' in the column 'outlier'
I tried it multiple ways, for example:
df_timeseries.loc[df_timeseries['Variable.value'] > up_lim, 'outlier']=1
I also tried apply(): instead of iterating over the tracks, I first defined a function and then applied it to the group. However, nothing really worked, and I could not find out what I am actually doing wrong. If someone can help I would be really glad, as I have already spent a couple of hours trying to figure this out.
I would need something like:
group.loc[group['outlier']] = 1
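A minimal sketch of the per-group flagging with groupby().transform, which broadcasts the group limits back to the original rows (column names taken from the question):

import pandas as pd

g = df_timeseries.groupby('timeseries.id')['Variable.value']
q1 = g.transform(lambda s: s.quantile(0.25))
q3 = g.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
outside = ((df_timeseries['Variable.value'] < q1 - 1.5 * iqr) |
           (df_timeseries['Variable.value'] > q3 + 1.5 * iqr))
df_timeseries['outlier'] = outside.astype(int)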

How to calculate with the Poisson-Distribution in Matlab?

I've used Excel in the past, but the calculations involving the Poisson distribution took a while, which is why I switched to SQL. I soon recognized that SQL might not be a proper tool for statistical work. Finally I decided to switch to MATLAB, but I'm not used to it at all. My problem is the following:
I've imported a .csv table and have two columns with values, let's say A and B (each 110 x 1 double).
These values are both input values for my Poisson calculations. Since I want to calculate at least the first 20 events, I've created a variable z = 1:20.
When I now calculate, let's say,
New = poisspdf(z, A)
it fails with something like "non-scalar arguments must match in size".
z only has 20 records but A and B both have 110 records, so I expanded z = 1:110 and transposed it:
Znew = z.'
When I now execute the actual calculation
results = poisspdf(Znew, A) .* poisspdf(Znew, B)
I always get only a 110x1 vector, but what I want is a 20x20 matrix for each record of A and B (based on my original choice of z = 1:20; I only changed to z = 1:110 because MATLAB said the sizes must match).
So in this 20x20 matrix, each cell should contain the result of a slightly different calculation of poisspdf(z, A) .* poisspdf(z, B).
For example in the first cell (1,1) I want to have the result of
poisspdf(0, value of A) .* poisspdf(0, value of B),
in cell (1,2): poisspdf(0, value of A) .* poisspdf(1, value of B),
in cell (2,1): poisspdf(1, value of A) .* poisspdf(0, value of B),
and so on, assuming the format is cell(row, column).
Finally I want to sum up certain parts of each 20x20 matrix and show the result of the summed up parts in new columns.
Is there anybody able to help? Many thanks!
EDIT:
[screenshot: Poisson matrix in Excel]
In Excel there is the POISSON function: POISSON(x, μ, FALSE) = probability density function value f(x) at the value x for the Poisson distribution with mean μ.
In e.g. cell AD313 in the table above there is the following calculation:
=POISSON(0;first value of A;FALSE)*POISSON(0;first value of B;FALSE)
, in cell AD314
=POISSON(1;first value of A;FALSE)*POISSON(0;first value of B;FALSE)
, in cell AE313
=POISSON(0;first value of A;FALSE)*POISSON(1;first value of B;FALSE)
, and so on.
I am not sure if I completely understand your question. I wrote this code that might help you:
clear; clc
% These are the lambdas parameters for the Poisson distribution
lambdaA = 100;
lambdaB = 200;
% Generating Poisson data here
A = poissrnd(lambdaA,110,1);
B = poissrnd(lambdaB,110,1);
% Get the first 20 samples
zA = A(1:20);
zB = B(1:20);
% Perform the calculation
results = repmat(poisspdf(zA,lambdaA),1,20) .* repmat(poisspdf(zB,lambdaB)',20,1);
% Sum
sumFinal = sum(results,2);
Let me know if this is what you were trying to do.
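If the goal is really one 20x20 matrix per record of A and B, as described in the edit, an outer product per record may be closer to it. A sketch for the first record (using the question's A and B):

% events 0..19 as a column vector
z = (0:19)';
% 20x20 matrix: M(r,c) = poisspdf(r-1, A(1)) * poisspdf(c-1, B(1))
M = poisspdf(z, A(1)) * poisspdf(z, B(1))';
% sum up whichever part of the matrix is needed
partSum = sum(M(:));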
