I have a pandas dataframe with 1 million rows and a hierarchical index (country, state, city, in this order), with a price observation of a product in each row. How can I calculate the mean and standard deviation (std) for each country, state, and city (keeping in mind I am avoiding loops, as my df is big)?
For each level, I want to save the mean and std values in new columns in this dataframe for future access.
Use groupby with the level argument to group your data, then call mean and std. If you want the mean as a new column in your existing dataframe, use transform, which returns a Series with the same index as your df:
# Group by all three index levels; transform broadcasts each group's
# statistic back to every row of that group.
grouped = df.groupby(level=['Country', 'State', 'City'])
df['Mean'] = grouped['price_observation'].transform('mean')
df['Std'] = grouped['price_observation'].transform('std')
If you want to read more on grouping, see the pandas groupby documentation.
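If you also want the statistics at the coarser levels (one mean/std pair per country and per country/state, in addition to per city), a sketch along the same lines repeats the transform per level. The column names here are illustrative, and the small loop runs over the three levels, not over rows:

# Sketch: one mean/std column pair per hierarchy level, still vectorized per group.
for name, levels in {'country': ['Country'],
                     'state': ['Country', 'State'],
                     'city': ['Country', 'State', 'City']}.items():
    g = df.groupby(level=levels)['price_observation']
    df[f'mean_{name}'] = g.transform('mean')
    df[f'std_{name}'] = g.transform('std')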
Hi all, I have this dataframe. I am trying to create two additional columns, max_temperature and min_temperature, to record the maximum and minimum temperature values for each stayid. How can I do that?
Try groupby, agg, and DataFrame.join:
newdf = (df.set_index('stayid')  # set stayid as the index so the aggregated frame can be joined back
           .join(df.groupby('stayid')['temp'].agg(['min', 'max']))  # per-stay min and max temperature
           .rename(columns={'min': 'min_temp', 'max': 'max_temp'}))
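A shorter alternative (a sketch, not part of the answer above) avoids the join entirely: groupby().transform returns values already aligned to the original index.

# transform broadcasts each group's aggregate back to every row,
# so no set_index or join is needed.
df['min_temp'] = df.groupby('stayid')['temp'].transform('min')
df['max_temp'] = df.groupby('stayid')['temp'].transform('max')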
I have a dask dataframe with only 'Name' and 'Value' columns, similar to the table below.
How do I compute the 'Average' column? I tried groupby in dask, but that just gives me a dataframe of 2 records containing the averages of A and B.
You can just left join your original table with the new one on Name. From https://docs.dask.org/en/latest/dataframe-joins.html:
small = small.repartition(npartitions=1)  # make the small table a single partition
result = big.merge(small)                 # joined against every partition of the big table
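Putting it together for this question, a minimal sketch (the Name/Value column names and the tiny example data are assumptions):

import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'Name': ['A', 'B', 'A', 'B'], 'Value': [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Per-name means as a small frame, then broadcast back with a left merge.
means = ddf.groupby('Name')['Value'].mean().to_frame('Average').reset_index()
result = ddf.merge(means, on='Name', how='left')
print(result.compute())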
I have a train_df and a test_df, which come from the same original dataframe, but were split up in some proportion to form the training and test datasets, respectively.
Both train and test dataframes have identical structure:
A PeriodIndex with daily buckets
n number of columns that represent observed values in those time buckets e.g. Sales, Price, etc.
I now want to construct a yhat_df, which stores predicted values for each of the columns. In the "naive" case, each yhat_df column's values are simply the last observed value of that column in the training dataset.
So I go about constructing yhat_df as below:
import pandas as pd
yhat_df = pd.DataFrame().reindex_like(test_df)
yhat_df[train_df.columns[0]].fillna(train_df.tail(1).values[0][0], inplace=True)
yhat_df[train_df.columns[1]].fillna(train_df.tail(1).values[0][1], inplace=True)
This appears to work, and since I have only two columns, the extra typing is bearable.
I was wondering if there is simpler way, especially one that does not need me to go column by column.
I tried the following but that just populates the column values correctly where the PeriodIndex values match. It seems fillna() attempts to do a join() of sorts internally on the Index:
yhat_df.fillna(train_df.tail(1), inplace=True)
If I could figure out a way for fillna() to ignore index, maybe this would work?
You can use fillna with a dictionary to fill each column with a different value, so I think:
yhat_df = yhat_df.fillna(train_df.tail(1).to_dict('records')[0])
should work. But if I understand correctly what you are doing, you could even create the dataframe directly with:
yhat_df = pd.DataFrame(train_df.tail(1).to_dict('records')[0],
                       index=test_df.index, columns=test_df.columns)
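As a quick check, here is a made-up two-column example (the column names, dates, and values are illustrative only):

import pandas as pd

train_df = pd.DataFrame({'Sales': [10, 12], 'Price': [1.0, 1.1]},
                        index=pd.period_range('2020-01-01', periods=2, freq='D'))
test_df = pd.DataFrame(columns=train_df.columns,
                       index=pd.period_range('2020-01-03', periods=2, freq='D'))

last = train_df.tail(1).to_dict('records')[0]  # {'Sales': 12, 'Price': 1.1}
yhat_df = pd.DataFrame(last, index=test_df.index, columns=test_df.columns)
# Every row of yhat_df now repeats the last observed training values.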
We have measurements for n points (say 22 points) over a period of time, stored in a real-time store. We are now looking for some understanding of trends for the points mentioned above. To that end we read the measurements into a pandas DataFrame (Python). Within this DataFrame the points are now columns and the rows are the respective measurement times.
We would like to extend the data frame with new columns for mean and std by inserting a 'mean' and a 'std' column for each existing column (each being a particular measurement). That means two new columns per each of the 22 measurement points.
The question now is whether the above is best achieved by adding the new mean and std columns while iterating over the existing columns, or whether there is a more effective built-in DataFrame operation or trick.
Our understanding is that updating a DataFrame in a for loop would be by far the worst practice.
Thanks for any comment or proposal.
From the comments, I guess this is what you are looking for:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.normal(size=(1000, 22)))  # example dataframe
points = df.columns  # remember the measurement columns before adding new ones
df.loc[:, 'means'] = df[points].mean(axis=1)  # row-wise mean across the 22 points
df.loc[:, 'std'] = df[points].std(axis=1)     # row-wise standard deviation
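The question can also be read as asking for one mean and one std column per measurement point (two new columns per point, 44 in total), with each per-column scalar repeated down the rows. A loop-free sketch under that reading, with an assumed naming scheme like '0_mean':

# Per-column statistics: a 2 x 22 frame with one mean and one std per point.
stats = df[points].agg(['mean', 'std'])
flat = stats.T.stack()  # Series indexed by (point, statistic)
flat.index = [f'{p}_{s}' for p, s in flat.index]  # e.g. '0_mean', '0_std'
df = df.join(pd.DataFrame(flat.to_dict(), index=df.index))  # broadcast scalars to every row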
df = pd.DataFrame({"ID":['A','B','C','D','E','F'],
"IPaddress":['12.345.678.01','12.345.678.02','12.345.678.01','12.345.678.18','12.345.678.02','12.345.678.01'],
"score":[8,9,5,10,3,7]})
I'm using Python, and Pandas library. For those rows with duplicate IP addresses, I want to select only one row with highest score (score being from 0-10), and drop all duplicates.
I'm having a difficult time turning this logic into a Python function.
Step 1: Using the groupby function of Pandas, split the df into groups of IPaddress.
df.groupby('IPaddress')
The result of this will be a groupby object. If you check the type of this object, it will be the following: pandas.core.groupby.groupby.DataFrameGroupBy
Step 2: With the Pandas groupby object created in step 1, calling .idxmax() on the score will return a Pandas Series holding, for each IPaddress, the index label of the row with the maximum score:
df.groupby('IPaddress').score.idxmax()
(Optional) Step 3: If you want to turn the above Series into a dataframe, you can do the following:
df.loc[df.groupby('IPaddress').score.idxmax(),['IPaddress','score']]
Here you are selecting all the rows with the max scores and showing the IPaddress and score columns.
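An equivalent shortcut (a different technique than the steps above, not part of the original answer): sort by score descending and keep the first row per IPaddress.

# Keep the highest-scoring row per IP address in one chain.
dedup = (df.sort_values('score', ascending=False)
           .drop_duplicates('IPaddress'))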
Useful references:
1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
2. https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html
3. https://www.geeksforgeeks.org/python-pandas-dataframe-idxmax/