Pandas groupby sum - group on two variables, sum all other numeric variables, and change the original dataframe

As mentioned above, I'd like to group by two variables ('town_code', 'ballot'), sum all the other numeric variables (numbers of votes for political parties) within those groups, and actually change the original dataframe (called "results"). Note that the dataframe also contains a non-numeric column with town names. The names are identical within each group, so I just need to make sure the column survives the process.
Example: the input table and the desired output were shown as images in the original post.
In the meantime I've managed to keep only the numeric variables, losing the non-numeric column (and moving the groupby variables into the index), using this line of code:
results = results.groupby(['town_code','ballot']).sum()

I think the code below fulfills your requirement:
import pandas as pd

# Sample election data: one row per (town, ballot); names repeat within each town_code
town_code = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
ballot = [1, 2, 3, 3, 1, 2, 2, 1, 2, 1, 2]
town_name = ['townsville', 'townsville', 'townsville', 'townsville',
             'citysville', 'citysville', 'citysville',
             'villagesville', 'villagesville', 'policeville', 'policeville']
party_a = [14, 11, 14, 10, 8, 7, 16, 9, 13, 12, 9]
party_b = [13, 17, 9, 11, 9, 15, 19, 21, 15, 8, 11]
df = pd.DataFrame({'town_code': town_code, 'ballot': ballot, 'town_name': town_name,
                   'party_a': party_a, 'party_b': party_b})

# Adding 'town_name' to the index keeps it in the output; since names are
# identical within each (town_code, ballot) group, it does not change the grouping.
pd.pivot_table(df, values=['party_a', 'party_b'],
               index=['town_code', 'ballot', 'town_name'], aggfunc='sum')
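To actually change the original dataframe, as asked, you can assign the aggregated result back after flattening the index with reset_index() (a minimal sketch building on the df above; in the question the original dataframe is called "results"):

# Aggregate, then turn the group keys back into ordinary columns
results = pd.pivot_table(df, values=['party_a', 'party_b'],
                         index=['town_code', 'ballot', 'town_name'],
                         aggfunc='sum').reset_index()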

Related

Making PANDAS identify a PATTERN inside a value on a DF

I'm using Python 3.9 with Pandas and Numpy.
Every day I receive a df with orders from the company I work for. Each day this df comes from a different country whose language I don't know, and these dataframes don't follow a single pattern. In this case I don't know the column name or the index.
I just know that the orders follow a pattern: 3 numbers + 2 letters, like 000AA, 149KL, 555EE, etc.
I saw that this is possible with plain strings, but with pandas I only found commands that need the name of the column.
df.column_name.str.contains(pat=r'\d\d\d\w\w', regex=True)
If I can find the column whose values all match this pattern, I'll know which column holds the orders.
I started with a synthetic data set:
import pandas
df = pandas.DataFrame([{'a': 3, 'b': 4, 'c': '222BB', 'd': '2asf'},
                       {'a': 2, 'b': 1, 'c': '111AA', 'd': '942'}])
I then cycle through each column. If the datatype is object, I test whether all the elements in the Series match the regex:
for column_id in df.columns:
    if df[column_id].dtype == 'object':
        # True only if every value in the column contains the pattern
        if all(df[column_id].str.contains(pat=r'\d\d\d\w\w', regex=True)):
            print("matching column:", column_id)

merging pandas dataframes on multiple columns - error about levels

I'm merging my two dataframes below on two fields.
successes = pd.merge(failures, successes, left_on=['name', 'project_name'], right_on=['name', 'project_name'], how='left')
But I get this error - can anyone help me out please?
/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py:643: UserWarning: merging between different levels can give an unintended result (1 levels on the left,2 on the right)
warnings.warn(msg, UserWarning)
I think it must be written this way:
successes.merge(failures, on=['name', 'project_name'])
This happens when you merge DataFrames with different levels of column indices.
The artificial example below reproduces your warning:
import pandas as pd

# a has a 2-level column index
a = pd.DataFrame({("name_0", "name_01"): [1, 2, 3, 4],
                  ("name_0", "name_02"): [4, 3, 2, 1]})
# b has a 1-level column index
b = pd.DataFrame({"name_0": [10, 2, 30, 40],
                  "name_1": [40, 30, 20, 10]})
# Notice how left_on accepts a list of tuples; tuples can be used to address multilevel columns
pd.merge(a, b, how="left", left_on=[("name_0", "name_01")], right_on=["name_0"])
If you instead use only the second level (level 1) of the multilevel column index in DataFrame "a", this warning disappears:
import pandas as pd

a = pd.DataFrame({("name_0", "name_01"): [1, 2, 3, 4],
                  ("name_0", "name_02"): [4, 3, 2, 1]})
# Keep only the second level of the column index (i.e. "name_01" and "name_02")
a.columns = a.columns.get_level_values(1)
b = pd.DataFrame({"name_0": [10, 2, 30, 40],
                  "name_1": [40, 30, 20, 10]})
# Notice how left_on is now a plain string, since only one level remains
pd.merge(a, b, how="left", left_on=["name_01"], right_on=["name_0"])
I suggest you check whether both of your DataFrames have column indices with the same number of levels. If not, consider dropping one level or flattening them to one level, as sketched below.
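For example, to flatten a two-level column index while keeping the information from both levels, you can join the level values (a sketch, assuming "a" still has the two-level columns from the first example):

# Join each column's level values into a single flat name,
# e.g. ("name_0", "name_01") becomes "name_0_name_01"
a.columns = ['_'.join(col) for col in a.columns]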

Assigning np.nans to rows of a Pandas column using a query

I want to assign NaNs to the rows of a column in a Pandas dataframe when some conditions are met.
For a reproducible example, here are some data (a JSON string; it can be loaded into a dataframe with pd.read_json):
import pandas as pd
toy_data = pd.read_json('{"Price":{"1581292800000":21.6800003052,"1581379200000":21.6000003815,"1581465600000":21.6000003815,"1581552000000":21.6000003815,"1581638400000":22.1599998474,"1581984000000":21.9300003052,"1582070400000":22.0,"1582156800000":21.9300003052,"1582243200000":22.0200004578,"1582502400000":21.8899993896,"1582588800000":21.9699993134,"1582675200000":21.9599990845,"1582761600000":21.8500003815,"1582848000000":22.0300006866,"1583107200000":21.8600006104,"1583193600000":21.8199996948,"1583280000000":21.9699993134,"1583366400000":22.0100002289,"1583452800000":21.7399997711,"1583712000000":21.5100002289},"Target10":{"1581292800000":22.9500007629,"1581379200000":23.1000003815,"1581465600000":23.0300006866,"1581552000000":22.7999992371,"1581638400000":22.9599990845,"1581984000000":22.5799999237,"1582070400000":22.3799991608,"1582156800000":22.25,"1582243200000":22.4699993134,"1582502400000":22.2900009155,"1582588800000":22.3248996735,"1582675200000":null,"1582761600000":null,"1582848000000":null,"1583107200000":null,"1583193600000":null,"1583280000000":null,"1583366400000":null,"1583452800000":null,"1583712000000":null}}')
In this particular toy example, I want to assign NaNs to the column 'Price' when the column 'Target10' has NaNs. (in the general case the condition may be more complex)
This line of code achieves that specific objective:
toy_data.Price.where(toy_data.Target10.notnull(), toy_data.Target10)
However when I attempt to use a query and assign NaNs to the targeted column I fail:
toy_data.query('Target10.isnull()', engine = 'python').Price = np.nan
The above line leaves toy_data intact.
Why is that, and how should I use query to replace values in particular rows?
One way to do it is:
import numpy as np
toy_data['Price'] = np.where(toy_data['Target10'].isna(), np.nan, toy_data['Price'])
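As for the why: query returns a new, filtered DataFrame, so the attempted assignment sets Price on a temporary copy and toy_data is left untouched. If a boolean mask is acceptable in place of query, .loc assigns into the original rows directly (a sketch using the same toy_data):

import numpy as np

# Select the rows where Target10 is missing and set their Price to NaN in place
toy_data.loc[toy_data['Target10'].isna(), 'Price'] = np.nan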

How to fillna() all columns of a dataframe from a single row of another dataframe with identical structure

I have a train_df and a test_df, which come from the same original dataframe, but were split up in some proportion to form the training and test datasets, respectively.
Both train and test dataframes have identical structure:
A PeriodIndex with daily buckets
n columns that represent observed values in those time buckets, e.g. Sales, Price, etc.
I now want to construct a yhat_df, which stores predicted values for each of the columns. In the "naive" case, each column of yhat_df simply holds the last observed value of that column in the training dataset.
So I go about constructing yhat_df as below:
import pandas as pd

# Empty frame with the same shape, index and columns as test_df, filled with NaN
yhat_df = pd.DataFrame().reindex_like(test_df)
yhat_df[train_df.columns[0]].fillna(train_df.tail(1).values[0][0], inplace=True)
yhat_df[train_df.columns[1]].fillna(train_df.tail(1).values[0][1], inplace=True)
This appears to work, and since I have only two columns, the extra typing is bearable.
I was wondering if there is simpler way, especially one that does not need me to go column by column.
I tried the following, but it only populates values in rows where the PeriodIndex values match. It seems fillna() internally does a join() of sorts on the index:
yhat_df.fillna(train_df.tail(1), inplace=True)
If I could figure out a way for fillna() to ignore index, maybe this would work?
You can use fillna with a dictionary to fill each column with a different value, so I think:
yhat_df = yhat_df.fillna(train_df.tail(1).to_dict('records')[0])
should work. But if I understand correctly what you are doing, you could even create the dataframe directly with:
yhat_df = pd.DataFrame(train_df.tail(1).to_dict('records')[0],
                       index=test_df.index, columns=test_df.columns)
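A closely related one-liner, since fillna also accepts a Series whose index is matched against the column names (a sketch under the same assumptions about train_df and yhat_df):

# train_df.iloc[-1] is a Series indexed by the column names; fillna matches
# it against yhat_df's columns and fills each column with that last value
yhat_df = yhat_df.fillna(train_df.iloc[-1])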

Error while selecting rows from pandas data frame

I have a pandas dataframe df with the column Name. I did:
for name in df['Name'].unique():
    X = df[df['Name'] == name]
    print(X.head())
but then X contains all kinds of different Names, not the unique name I want.
What did I do wrong?
Thanks a lot
You probably don't want to overwrite X with every iteration of your loop and only keep the dataframe containing the last value of df['Name'].unique().
Depending on your data and goal, you might want to use groupby as jezrael suggests, or maybe do something like df[~df['Name'].duplicated()].
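For completeness, the groupby version of the loop (a sketch assuming the df and Name column from the question) yields one sub-dataframe per unique name:

# groupby yields one (name, sub-dataframe) pair per unique Name,
# so each X holds only the rows for a single name
for name, X in df.groupby('Name'):
    print(name)
    print(X.head())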
