Get Poisson expectation of preceding values of a time series in Python - python-3.x

I have some time series data (in a Pandas dataframe), d(t):
time 1 2 3 4 ... 99 100
d(t) 5 3 17 6 ... 23 78
I would like to get a time-shifted version of the data, e.g. d(t-1):
time 1 2 3 4 ... 99 100
d(t) 5 3 17 6 ... 23 78
d(t-1) NaN 5 3 17 6 ... 23
But with a complication. Instead of simply time-shifting the data, I need to take the expected value based on a Poisson-distributed shift. So instead of d(t-i), I need E(d(t-j)), where j ~ Poisson(i).
Is there an efficient way to do this in Python?
Ideally, I would be able to dynamically generate the result with i as a parameter (that I can use in an optimization).
numpy's Poisson functions seem to be about generating draws from a Poisson rather than giving a PMF that could be used to calculate expected value. If I could generate a PMF, I could do something like:
for idx in range(len(d)):
    Ed(t-i) = np.multiply(d[:idx:-1], PMF(Poisson, i)).sum()
But I have no idea what actual functions to use for this, or if there is an easier way than iterating over indices. This approach also won't easily let me optimize over i.

You can use scipy.stats.poisson to get the PMF.
Here's a sample:
from scipy.stats import poisson
mu = 10
# Declare 'rv' to be a poisson random variable with λ=mu
rv = poisson(mu)
# poisson.pmf(k) = (e⁻ᵐᵘ * muᵏ) / k!
print(rv.pmf(4))
For more information about scipy.stats.poisson, see the scipy documentation.
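Putting the two together, here is a minimal sketch of the expectation the question asks for: weight shifted copies of the series by a truncated Poisson PMF and sum. The sample series values and the truncation cutoff are illustrative assumptions, not part of the original question.
from scipy.stats import poisson
import numpy as np
import pandas as pd

# Illustrative series standing in for d(t); replace with the real dataframe column.
d = pd.Series([5, 3, 17, 6, 23, 78])

def poisson_expected_shift(series, i, max_shift=None):
    # E[d(t-j)] with j ~ Poisson(i), approximated by truncating the PMF.
    if max_shift is None:
        max_shift = int(poisson.ppf(0.9999, i)) + 1  # cutoff is an assumption
    shifts = np.arange(max_shift + 1)
    weights = poisson.pmf(shifts, i)
    weights = weights / weights.sum()  # renormalise after truncation
    # Weighted sum of lagged copies; early rows stay NaN where lags are missing.
    return sum(w * series.shift(int(j)) for j, w in zip(shifts, weights))

print(poisson_expected_shift(d, i=1))
Because i only enters through poisson.pmf, the same function can be called repeatedly inside an optimization loop over i.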

Related

Avoid number truncation in pandas rows [duplicate]

I have files of the below format in a text file which I am trying to read into a pandas dataframe.
895|2015-4-23|19|10000|LA|0.4677978806|0.4773469340|0.4089938425|0.8224291972|0.8652525793|0.6829942860|0.5139162227|
As you can see, there are 10 digits after the decimal point in the input file.
df = pd.read_csv('mockup.txt',header=None,delimiter='|')
When I try to read it into a dataframe, I am not getting the last 4 digits
df[5].head()
0 0.467798
1 0.258165
2 0.860384
3 0.803388
4 0.249820
Name: 5, dtype: float64
How can I get the complete precision as present in the input file? I have some matrix operations that need to be performed, so I cannot cast it as a string.
I figured out that I have to do something about dtype but I am not sure where I should use it.
It is only a display problem, see the docs:
# temporarily set display precision
with pd.option_context('display.precision', 10):
    print(df)
0 1 2 3 4 5 6 7 \
0 895 2015-4-23 19 10000 LA 0.4677978806 0.477346934 0.4089938425
8 9 10 11 12
0 0.8224291972 0.8652525793 0.682994286 0.5139162227 NaN
EDIT: (Thank you Mark Dickinson):
Pandas uses a dedicated decimal-to-binary converter that sacrifices perfect accuracy for the sake of speed. Passing float_precision='round_trip' to read_csv fixes this. See the documentation for more.
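As a quick illustration of that fix applied to the read in the question (the file name is the one the question uses):
import pandas as pd

df = pd.read_csv('mockup.txt', header=None, delimiter='|',
                 float_precision='round_trip')
print(df[5].head())  # values now keep the full precision from the file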

Average data points in a range while condition is met in a Pandas DataFrame

I have a very large dataset with over 400,000 rows and growing. I understand that you are not supposed to use iterrows to modify a pandas data frame. However, I'm a little lost on what I should do in this case, since I'm not sure I could use .loc() or some rolling filter to modify a data frame in the way I need to. I'm trying to figure out if I can take a data frame and average each range of rows while a condition is met. For example:
Condition  Temp.  Pressure
1          8      20
1          7      23
1          8      22
1          9      21
0          4      33
0          3      35
1          9      21
1          11     20
1          10     22
While the condition is == 1, the output dataframe would look like this:
Condition  Avg. Temp.  Avg. Pressure
1          8           21.5
1          10          21
Has anyone attempted something similar that can put me on the right path? I was thinking of using something like this:
df = pd.read_csv(csv_file)
for index, row in df.iterrows():
    if row['condition'] == 1:
        pass  # start index = first value that equals 1
    else:  # end index & calculate rolling average of range
        length = end - start
        new_df = df.rolling(length).mean()
I know that my code isn't great, and I know I could brute-force it by doing something similar to what I have shown above, but as I said the data has a lot of rows and continues to grow, so I need to be efficient.
TRY:
result = df.groupby((df.Condition != df.Condition.shift()).cumsum()).apply(
lambda x: x.rolling(len(x)).mean().dropna()).reset_index(drop=True)
print(result.loc[result.Condition.eq(1)]) # filter by required condition
OUTPUT:
Condition Temp. Pressure
0 1.0 8.0 21.5
2 1.0 10.0 21.0
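If the rolling trick feels indirect, the same result can be obtained by averaging each consecutive run directly; a sketch using the sample values from the question:
import pandas as pd

df = pd.DataFrame({
    'Condition': [1, 1, 1, 1, 0, 0, 1, 1, 1],
    'Temp.':     [8, 7, 8, 9, 4, 3, 9, 11, 10],
    'Pressure':  [20, 23, 22, 21, 33, 35, 21, 20, 22],
})

# Label each consecutive run of equal Condition values, then average each run.
runs = (df['Condition'] != df['Condition'].shift()).cumsum()
result = df.groupby(runs).mean()
print(result[result['Condition'].eq(1)])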

Divide floating point with integer

So, I'm trying to divide values across two columns of a .csv file, one of which comprises integers ('counts'), and the other is made up of floats ('Surface').
df = pd.read_csv(r'G:\file_path\file1.csv')
df['f'] = df['counts']/df['Surface']
Doing so returns the 'TypeError: string indices must be integers' error message.
An example of the file is:
I have tried to find information online on how to divide floats, but can only find endless resources on how to use the one-slash (/) or two-slash (//) operators to output floats or integers, as opposed to anything about actually dividing floats themselves.
Any ideas on how I can resolve this? Surely it can't be all that complicated.
Cheers,
R
I suspect one of the columns is dtype object.
Please try
Data
df=pd.DataFrame({'counts':[49, 47,44,43],'Surface':[1.878914,1.854631,1.854631,1.660323]})
print(df)
counts Surface
0 49 1.878914
1 47 1.854631
2 44 1.854631
3 43 1.660323
df['f'] = df['counts'].astype(int)/df['Surface'].astype(float)
counts Surface f
0 49 1.878914 26.078895
1 47 1.854631 25.341968
2 44 1.854631 23.724396
3 43 1.660323 25.898575
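If either column was read as dtype object (for example because of stray text in the file), the astype cast may still fail; a sketch using pd.to_numeric as a more forgiving coercion (column names are the ones from the question, and the path is a shortened stand-in):
import pandas as pd

df = pd.read_csv('file1.csv')  # stand-in for the original Windows path

# Coerce both columns to numeric; unparseable entries become NaN instead of raising.
df['counts'] = pd.to_numeric(df['counts'], errors='coerce')
df['Surface'] = pd.to_numeric(df['Surface'], errors='coerce')
df['f'] = df['counts'] / df['Surface']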

Histogram with ggplot2 requires a continuous x variable

I have a dataset in a table format that looks like this:
test frequency
1 test40 3
2 test33 5
3 test19 2
4 test4521 1
5 test34 1
6 test27 3
7 test42 3
8 test35 1
....
If I use this command:
library(ggplot2)
ggplot(t, aes("frequency")) +
geom_histogram()
("t" is the name of my table)
Then RStudio says: "StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?"
I just want to see how many times a 3 or a 5 etc. occurs.
Thanks for your help.
It looks like your data is already aggregated? Maybe the ggplot2::geom_histogram() function is not appropriate for you to use. Have you tried the geom_col() function? It simply takes the numbers declared in the input data frame and displays a column plot with that data.
Using the below code
# Declare data frame
t <- data.frame(test = c("test40", "test33", "test19", "test4521",
"test34", "test27", "test42", "test35"),
frequency = c(3, 5, 2, 1,
1, 3, 3, 1))
returns the data frame like this
# View data
print(t)
test frequency
1 test40 3
2 test33 5
3 test19 2
4 test4521 1
5 test34 1
6 test27 3
7 test42 3
8 test35 1
and therefore you can plot it like this
# Load package
library(ggplot2)
# Generate column plot
ggplot(t, aes(test, frequency)) +
geom_col()
If you simply wanted a count of the times that the number 2 or the number 3 occurred in your data frame, then yes, geom_histogram() is the correct function to use. The geom_histogram() function counts the frequency with which a value occurs in the data frame and returns the result. It has an internal validation that looks at the type of data you are trying to plot across the x-axis; if that data is discrete, you need to pass the parameter stat="count" to the function. If you don't include this parameter, ggplot will try to bin your data to create the histogram, which is illogical because all you want is a count.
Check out this link for a description of the difference between continuous and discrete data: What is the difference between discrete data and continuous data?
With this in mind, you can plot the histogram like this
# Generate histogram plot
ggplot(t, aes(frequency)) +
geom_histogram(stat="count")
I hope that helps mate.

Difference between two different pandas columns

I built a function to compute the difference between two pandas columns from different data sets. The first data set contains the predicted value and the second contains the observed value. The problem is that the rows of the two data sets are ordered differently, so to compute the difference I must use the ID of the row.
The function is:
def difference(data1, data2):
    for i in range(data1.shape[0]):
        e_id = data2.iloc[i, 0]
        p_oss = data1.iloc[int(e_id), 9]
        diff = p_oss - data2.iloc[i, 1]
    return diff
difference(df,evaluation)
Where data1 is the observed data and data2 is the predicted data.
The error raised by the function is:
IndexError: single positional indexer is out-of-bounds
The observed data set is structured like this:
ID Attribute1 Attribute2 ... Prime
1 N 10 123
2 S 10 128
3 N 8 26
4 S 12 567
..
n N 15 5
The predict data set is structured like this:
ID Prime
4 566.89
1 123.03
2 127.95
3 26.01
...
The IDs in the predicted data set are in a different order because I use a function (train_test_split) to split the original df into train and test sets.
I want an output like this:
ID difference
1 0.03
2 0.05
3 0.1
4 0.11
..
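One possible approach, sketched under the assumption that both frames keep the ID in a column named 'ID' and the value in a column named 'Prime' (as in the examples above): merge on the ID so the rows line up, then subtract.
import pandas as pd

def difference(observed, predicted):
    # Align the two frames on ID, then subtract the Prime columns.
    merged = predicted.merge(observed[['ID', 'Prime']], on='ID',
                             suffixes=('_pred', '_obs'))
    merged['difference'] = merged['Prime_obs'] - merged['Prime_pred']
    return merged[['ID', 'difference']].sort_values('ID')

# result = difference(df, evaluation)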
