Difference between two different pandas columns - python-3.x

I built a function to compute the difference between two pandas columns from different data sets. The first data set contains the predicted values and the second contains the observed values. The problem is that the rows of the two data sets are in a different order, so to compute the difference I must match rows by their ID.
The function is:
def difference(data1,data2):
    for i in range(data1.shape[0]):
        e_id=data2.iloc[i,0]
        p_oss =data1.iloc[int(e_id),9]
        diff= p_oss - data2.iloc[i,1]
    return diff
difference(df,evaluation)
Where data1 holds the observed values and data2 holds the predicted values.
The function raises this error:
IndexError: single positional indexer is out-of-bounds
The observed data set is structured like this:
ID Attribute1 Attribute2 ... Prime
1 N 10 123
2 S 10 128
3 N 8 26
4 S 12 567
..
n N 15 5
The predicted data set is structured like this:
ID Prime
4 566.89
1 123.03
2 127.95
3 26.01
...
The IDs in the predicted data set are in a different order because I used a function (train_test_split) to split the original df into train and test sets.
I want an output like this:
ID difference
1 0.03
2 0.05
3 0.1
4 0.11
..
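One way to line the rows up by ID instead of by position is to merge the two frames on the ID column and subtract afterwards. The sketch below is a minimal illustration, not a definitive fix: the frame names `observed` and `predicted`, the suffixes, and the use of the absolute difference (chosen to match the positive values in the desired output) are assumptions, and the data is only the small sample from the question.

```python
import pandas as pd

# Sample data copied from the question (only the rows that were shown).
observed = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Attribute1": ["N", "S", "N", "S"],
    "Attribute2": [10, 10, 8, 12],
    "Prime": [123, 128, 26, 567],
})
predicted = pd.DataFrame({
    "ID": [4, 1, 2, 3],
    "Prime": [566.89, 123.03, 127.95, 26.01],
})

# Match rows on ID; the suffixes keep the two "Prime" columns apart.
merged = observed.merge(predicted, on="ID", suffixes=("_obs", "_pred"))

# Assumption: the absolute difference is wanted.
merged["difference"] = (merged["Prime_obs"] - merged["Prime_pred"]).abs().round(2)

result = merged[["ID", "difference"]].sort_values("ID").reset_index(drop=True)
print(result)
```

Because merge matches rows on ID, the order produced by train_test_split no longer matters, and positional lookups such as data1.iloc[int(e_id), 9], which can go out of bounds when an ID is not a valid row position, are not needed.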

Related

CSV file: split comma-separated values into separate rows and divide the corresponding dollar amount by the number of comma-separated values in pandas

Beginner here!
I have a csv file with comma-separated values. I want to split each comma-separated value into its own row in pandas. The corresponding dollar amount should be divided by the number of comma-separated values in each cell, and the result exported to a different csv file.
The csv table and the desired output table
I have used df.explode(IDs) but couldn't figure out how to divide the Dollar_Amount by the number of IDs in the corresponding cells.
import pandas as pd
in_csv = pd.read_csv('inputCSV.csv')
new_csv = in_csv.explode('IDs')
new_csv.to_csv('outputCSV.csv')
You can divide the dollar amount by the number of ids in each row before using explode. This can be done as follows:
# Preprocessing
df['Dollar_Amount'] = df['Dollar_Amount'].str[1:].str.replace(',', '').astype(float)
df['IDs'] = df['IDs'].str.split(",")
# Compute the new dollar amount and explode
df['Dollar_Amount'] = df['Dollar_Amount'] / df['IDs'].str.len()
df = df.explode('IDs')
# Postprocessing
df['Dollar_Amount'] = df['Dollar_Amount'].round(2).apply(lambda x: '${0:,.2f}'.format(x))
With an example input:
IDs Dollar_Amount A
0 1,2,3,4 $100,000.00 4
1 5,6,7 $50,000.00 3
2 9 $20,000.00 1
3 10,11 $20,000.00 2
The result is as follows:
IDs Dollar_Amount A
0 1 $25,000.00 4
0 2 $25,000.00 4
0 3 $25,000.00 4
0 4 $25,000.00 4
1 5 $16,666.67 3
1 6 $16,666.67 3
1 7 $16,666.67 3
2 9 $20,000.00 1
3 10 $10,000.00 2
3 11 $10,000.00 2
There is probably a one-line way to do this with a lambda function (if you are new, read up on lambda functions!), but as a slightly less new beginner, I think it's easier to think about this as two separate operations.
Operation 1: get the count of IDs. Operation 2: do the division.
If you take a look here https://towardsdatascience.com/count-occurrences-of-a-value-pandas-e5dad02303e9 you'll get a good lesson on how to do the group by you need to get the count of IDs and join it back to your data frame. I'd read that because it's a much more detailed explainer, but if you want a simple line of code, consider this: Pandas, how to count the occurance within grouped dataframe and create new column?
Once you have it, the division is as simple as df['new_col'] = df['col1']/df['col2']
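The two operations can be sketched on a tiny frame as follows. This is only an illustration under assumptions: the helper column names `n_ids` and `per_id` are made up here, the dollar amounts are already numeric, and the count is taken by splitting the string rather than through the group-by route described in the linked article.

```python
import pandas as pd

# Small stand-in for the csv, with numeric dollar amounts (assumption).
df = pd.DataFrame({
    "IDs": ["1,2,3,4", "5,6,7", "9"],
    "Dollar_Amount": [100000.0, 50000.0, 20000.0],
})

# Operation 1: count the IDs in each cell.
df["n_ids"] = df["IDs"].str.split(",").str.len()

# Operation 2: divide the amount by that count.
df["per_id"] = df["Dollar_Amount"] / df["n_ids"]

print(df)
```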

Get Poisson expectation of preceding values of a time series in Python

I have some time series data (in a Pandas dataframe), d(t):
time 1 2 3 4 ... 99 100
d(t) 5 3 17 6 ... 23 78
I would like to get a time-shifted version of the data, e.g. d(t-1):
time 1 2 3 4 ... 99 100
d(t) 5 3 17 6 ... 23 78
d(t-1) NaN 5 3 17 6 ... 23
But with a complication. Instead of simply time-shifting the data, I need to take the expected value based on a Poisson-distributed shift. So instead of d(t-i), I need E(d(t-j)), where j ~ Poisson(i).
Is there an efficient way to do this in Python?
Ideally, I would be able to dynamically generate the result with i as a parameter (that I can use in an optimization).
numpy's Poisson functions seem to be about generating draws from a Poisson rather than giving a PMF that could be used to calculate expected value. If I could generate a PMF, I could do something like:
for idx in len(d(t)):
Ed(t-i) = np.multiply(d(t)[:idx:-1], PMF(Poisson, i)).sum()
But I have no idea what actual functions to use for this, or if there is an easier way than iterating over indices. This approach also won't easily let me optimize over i.
You can use scipy.stats.poisson to get the PMF.
Here's a sample:
from scipy.stats import poisson
mu = 10
# Declare 'rv' to be a poisson random variable with λ=mu
rv = poisson(mu)
# poisson.pmf(k) = (e⁻ᵐᵘ * muᵏ) / k!
print(rv.pmf(4))
For more information about scipy.stats.poisson check this doc.
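Building on that PMF, the expectation E(d(t-j)) with j ~ Poisson(i) can be computed for every t at once as a convolution of the series with the Poisson weights. The following is a sketch under stated assumptions, not a definitive implementation: the function name `poisson_shift_expectation` is hypothetical, and lags that reach before the first observation simply contribute nothing (the weights are not renormalised near the start of the series).

```python
import numpy as np
from scipy.stats import poisson

def poisson_shift_expectation(d, mu):
    """Return sum_j pmf(j; mu) * d[t - j] for every index t.

    Assumption: terms with t - j before the start of the series are dropped.
    """
    d = np.asarray(d, dtype=float)
    # Weight for lag j = 0, 1, ..., len(d) - 1.
    weights = poisson.pmf(np.arange(len(d)), mu)
    # Full convolution, truncated to the original length.
    return np.convolve(d, weights)[: len(d)]

d = np.array([5, 3, 17, 6, 23, 78])
expected = poisson_shift_expectation(d, mu=1.0)
print(expected)
```

Since mu is an ordinary float argument, the function can be called repeatedly inside an optimisation loop with different values of i.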

Histogram with ggplot2 requires a continuous x variable

I have a dataset in a table format that looks like this:
test frequency
1 test40 3
2 test33 5
3 test19 2
4 test4521 1
5 test34 1
6 test27 3
7 test42 3
8 test35 1
....
If I use this command:
library(ggplot2)
ggplot(t, aes("frequency")) +
geom_histogram()
("t" is the name of my table)
Then RStudio says: "StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?"
I just want to see how many times a 3 or a 5 etc. occurs.
Thanks for your help.
It looks like your data is already aggregated. In that case, ggplot2::geom_histogram() might not be the appropriate function to use. Have you tried the geom_col() function? It simply takes the numbers in the input data frame and displays a column plot of that data.
Using the below code
# Declare data frame
t <- data.frame(test = c("test40", "test33", "test19", "test4521",
                         "test34", "test27", "test42", "test35"),
                frequency = c(3, 5, 2, 1,
                              1, 3, 3, 1))
returns the data frame like this
# View data
print(t)
test frequency
1 test40 3
2 test33 5
3 test19 2
4 test4521 1
5 test34 1
6 test27 3
7 test42 3
8 test35 1
and therefore you can plot it like this
# Load package
library(ggplot2)
# Generate column plot
ggplot(t, aes(test, frequency)) +
geom_col()
If you simply wanted a count of how many times the number 2 or the number 3 occurs in your data frame, then yes, geom_histogram() is the correct function to use. geom_histogram() counts how often each value occurs in the data frame and plots the result. It checks the type of data you are plotting on the x-axis, and if that data is discrete, you need to pass the parameter stat="count" to the function. If you don't include this parameter, ggplot will try to bin your data to create the histogram, which is illogical because all you want is a count.
Check out this link for a description of the difference between continuous and discrete data: What is the difference between discrete data and continuous data?
With this in mind, you can plot the histogram like this
# Generate histogram plot
ggplot(t, aes(frequency)) +
geom_histogram(stat="count")
I hope that helps mate.

Merge regression results back to original dataframe

I am working on a simple time series linear regression using statsmodels.api.OLS, and am running these regressions on groups of data based on an identifier variable. I have been able to get the grouped regressions working, but am now looking to merge the results of the regressions back into the original dataframe and am getting index errors.
A simplified version of my original dataframe, which we'll call "df" looks like this:
id value time
a 1 1
a 1.5 2
a 2 3
a 2.5 4
b 1 1
b 1.5 2
b 2 3
b 2.5 4
My function to conduct the regressions is as follows:
import pandas as pd
import statsmodels.api as sm

def ols_reg(df, xcol, ycol):
    x = df[xcol]
    y = df[ycol]
    x = sm.add_constant(x)
    model = sm.OLS(y, x, missing='drop').fit()
    predictions = model.predict()
    return pd.Series(predictions)
I then define a variable that stores the results of conducting this function on my dataset, grouping by the id column. This code is as follows:
var = df.groupby('id').apply(ols_reg,
xcol='time',ycol='value')
This returns a Series of the predicted linear values that has the same length as the original dataset, and looks like the following:
id
a 0 0.5
1 1
2 2.5
3 3
b 0 0.5
1 1
2 2.5
3 3
The column starting with 0.5 (ignore the values; not the actual output) is the column with predicted values from the regression. As the return on the function shows, this is a pandas Series.
I now want to merge these results back into the original dataframe, to look like the following:
id value time results
a 1 1 0.5
a 1.5 2 1
a 2 3 2.5
a 2.5 4 3
b 1 1 0.5
b 1.5 2 1
b 2 3 2.5
b 2.5 4 3
I've tried a number of methods, such as setting a new column in the original dataset equal to the series, but get the following error:
TypeError: incompatible index of inserted column with frame index
Any help on getting these results back into the original dataframe would be greatly appreciated. There are a number of other posts that correspond to this topic, but none of the solutions worked for me in this instance.
UPDATE:
I've solved this with a relatively simple method, in which I converted the series to a list, and just set a new column in the dataframe equal to the list. However, I would be really curious to hear if others have better/different/unique solutions to this problem. Thanks!
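The index error most likely arises because pd.Series(predictions) gets a fresh 0..n-1 index inside each group, so the combined result no longer shares an index with df. One alternative to the list conversion is to give each group's predictions the group's own index and let pandas align on assignment. The sketch below illustrates this under assumptions: np.polyfit stands in for the statsmodels OLS fit so the example stays self-contained, and the helper name `fit_line` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": list("aaaabbbb"),
    "value": [1, 1.5, 2, 2.5] * 2,
    "time": [1, 2, 3, 4] * 2,
})

def fit_line(group, xcol, ycol):
    # Stand-in for the OLS fit: a degree-1 least-squares line.
    slope, intercept = np.polyfit(group[xcol], group[ycol], deg=1)
    # Keep the group's original row labels on the fitted values.
    return pd.Series(slope * group[xcol] + intercept, index=group.index)

# Fit each group separately, then stitch the pieces back together.
parts = [fit_line(group, "time", "value") for _, group in df.groupby("id")]

# Assignment aligns on the index, so each row receives its own prediction.
df["results"] = pd.concat(parts)
print(df)
```

Because the fitted values keep the original row labels, the assignment still works if the frame is not sorted by id.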
To avoid losing row positions when inserting predictions into the missing values, you can use this approach. For example:
X_train: the training data, a pandas DataFrame corresponding to the known real results (in y_train).
X_test: the test data, a pandas DataFrame without known real results; these need to be predicted.
y_train: the training targets, a pandas Series with the known real results.
prediction: the predictions, a pandas Series.
To get the complete data merged into one pandas DataFrame, first put the known part together:
# merge the train part of the data into one dataframe
# (assumes y_train is a Series named 'specie')
X_train = X_train.sort_index()
y_train = y_train.sort_index()
train = pd.concat([X_train, y_train], axis=1)
result = pd.concat([train, X_test])
# if you need to convert a numpy array to a pandas Series:
# prediction = pd.Series(prediction)
# here is the magic: fill the rows that have no known result
result.loc[result['specie'].isnull(), 'specie'] = prediction.values
This does the job as long as the known results contain no missing values of their own.

Efficiently concatenate a large number of columns

I am trying to concatenate a large number of columns containing integers into one string.
Basically, starting from:
df = pd.DataFrame({'id':[1,2,3,4],'a':[0,1,2,3], 'b':[4,5,6,7], 'c':[8,9,0,1]})
To obtain:
id join
0 1 481
1 2 592
2 3 603
3 4 714
I found several methods to do this (here and here):
Method 1:
conc['glued']=''
i=1
while i < len(df.columns):
    conc['glued'] = conc['glued'] + df[df.columns[i]].values.astype(str)
    i=i+1
This method works, but is slow (45 min on my "test" case of 18,000 rows x 40,000 columns). I am concerned about the loop over the columns, as this program will ultimately be applied to tables of 600,000 columns and I am afraid it will take too long.
Method 2a
conc['join']=[''.join(row) for row in df[df.columns[1:]].values.astype(str)]
Method 2b
conc['apply'] = df[df.columns[1:]].apply(lambda x: ''.join(x.astype(str)), axis=1)
Both of these methods are 10 times faster than the previous one; they iterate over rows, which is good, and work perfectly on my "debug" table df. But when I apply them to my "test" table of 18k x 40k, they lead to a MemoryError (60% of my 32 GB of RAM is occupied after reading the corresponding csv file).
I can copy my DataFrame without exceeding the memory, but curiously, applying this method makes the code crash.
Do you see how I can fix and improve this code to use an efficient row-based iteration? Thank you!
Appendix:
Here is the code I use on my test case:
geno_reader = pd.read_csv(genotype_file,header=0,compression='gzip', usecols=geno_columns_names)
fimpute_geno = pd.DataFrame({'SampID': geno_reader['SampID']})
I should use the chunksize option to read this file, but I haven't yet really understood how to use it after reading.
Method 1:
fimpute_geno['Calls'] = ''
for i in range(1,len(geno_reader.columns)):
    fimpute_geno['Calls'] = fimpute_geno['Calls']\
        + geno_reader[geno_reader.columns[i]].values.astype(int).astype(str)
This works in 45 min.
There is some quite disgusting code in there, like the .astype(int).astype(str). I don't know why Python doesn't recognize my integers and treats them as floats.
Method 2:
fimpute_geno['Calls'] = geno_reader[geno_reader.columns[1:]]\
.apply(lambda x: ''.join(x.astype(int).astype(str)), axis=1)
This leads to a MemoryError.
Here's something to try. It would require that you convert your columns to strings, though. Your sample frame:
b c id
0 4 8 1
1 5 9 2
2 6 0 3
3 7 1 4
then
# .ix was removed in pandas 1.0, so .loc is used here instead
# you could also select the columns with conc[['b','c','id']]
as_str = conc.loc[:,'b':'id'].astype('str')
conc['join'] = np.sum(as_str,axis=1)
Would give
a b c id join
0 0 4 8 1 481
1 1 5 9 2 592
2 2 6 0 3 603
3 3 7 1 4 714
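A possible reason for the MemoryError in the row-based methods is that .values.astype(str) materialises the whole table as fixed-width strings in one go. Converting and joining a limited number of rows at a time keeps that temporary array small. The sketch below is one way to do this, under assumptions: the helper name `join_digits` and the `chunk_rows` parameter are hypothetical, and the columns are listed explicitly so the result matches the output requested in the question.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4], 'a': [0, 1, 2, 3],
                   'b': [4, 5, 6, 7], 'c': [8, 9, 0, 1]})

def join_digits(frame, cols, chunk_rows=2):
    """Concatenate the values of `cols` row by row, a block of rows at a time."""
    joined = []
    for start in range(0, len(frame), chunk_rows):
        # Only this block of rows is converted to strings at once.
        block = frame.iloc[start:start + chunk_rows][cols].to_numpy().astype(str)
        joined.extend(''.join(row) for row in block)
    return joined

df['join'] = join_digits(df, ['b', 'c', 'id'])
print(df[['id', 'join']])
```

On a real table, chunk_rows would be chosen so that one block of strings fits comfortably in memory; the value 2 is only there to exercise the loop on the sample frame.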
