merging pandas dataframes on multiple columns - error about levels - python-3.x

I'm merging my two dataframes below on two fields.
successes = pd.merge(failures, successes, left_on=['name', 'project_name'], right_on=['name', 'project_name'], how='left')
But I get this error - can anyone help me out please?
/usr/local/lib/python3.8/site-packages/pandas/core/reshape/merge.py:643: UserWarning: merging between different levels can give an unintended result (1 levels on the left,2 on the right)
warnings.warn(msg, UserWarning)

I think it should be written this way; since the key columns share names on both sides, on= replaces left_on/right_on:
successes.merge(failures, on=['name', 'project_name'], how='left')
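Either way, the warning itself points at mismatched column levels rather than the merge keys; a quick diagnostic (successes and failures are the question's frames) is to compare the number of column levels on each side:
print(successes.columns.nlevels, failures.columns.nlevels)  # if these differ, that is the source of the warning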

This happens when you merge DataFrames whose column indices have different numbers of levels.
The artificial example below reproduces your warning:
import pandas as pd
# a has 2 level column index
a = pd.DataFrame({("name_0", "name_01"): [1, 2, 3, 4],
                  ("name_0", "name_02"): [4, 3, 2, 1]})
# b has 1 level column index
b = pd.DataFrame({"name_0": [10, 2, 30, 40],
                  "name_1": [40, 30, 20, 10]})
# Notice how left_on accepts a list of tuples. Tuples can be used to address multilevel columns.
pd.merge(a, b, how="left", left_on=[("name_0", "name_01")], right_on=["name_0"])
If you instead use only level 1 of the multilevel column index in DataFrame "a", the warning disappears:
import pandas as pd
a = pd.DataFrame({("name_0", "name_01"): [1, 2, 3, 4],
                  ("name_0", "name_02"): [4, 3, 2, 1]})
# Keep only level 1 of the column index (i.e. "name_01" and "name_02")
a.columns = a.columns.get_level_values(1)
b = pd.DataFrame({"name_0": [10, 2, 30, 40],
                  "name_1": [40, 30, 20, 10]})
# Notice how left_on is now a plain string, since only one level is used
pd.merge(a, b, how="left", left_on=["name_01"], right_on=["name_0"])
I suggest you check whether both of your DataFrames have the same number of column-index levels. If not, consider dropping a level or flattening the multilevel columns down to one level.
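For example, a minimal sketch of the flattening option, reusing the frame "a" from above; joining the level names with an underscore is just one common convention:
import pandas as pd
a = pd.DataFrame({("name_0", "name_01"): [1, 2, 3, 4],
                  ("name_0", "name_02"): [4, 3, 2, 1]})
# join the two levels of each column name into a single string
a.columns = ["_".join(col) for col in a.columns]
print(a.columns.tolist())  # ['name_0_name_01', 'name_0_name_02']
After this, "a" has an ordinary one-level column index and merges cleanly against a one-level frame like "b".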

Related

Split dataframe into several ones (not the same size) based on consecutive zero values

I have a dataframe with numeric values (I show here only the column used for the "condition").
I would like to split it into several others (the sizes of the resulting dataframes may differ). The splitting should be based on runs of consecutive non-zero values.
In the following case, from an initial dataframe (shown as an image in the original post), I would like to get three dataframes into new variables (also shown as images).
Is there any function to achieve that without looping over the whole initial dataframe?
Thank you
I think I found the solution...
df['index'] = df.index.values  # create a column holding the index
s = df.iloc[:, 3].eq(0)  # boolean mask: True where the value is zero
new_df = df.groupby([s, s.cumsum()]).apply(lambda x: list(x.index))  # found on Stack Overflow: group rows into runs based on the mask
out = new_df.loc[False]  # select only the False groups, i.e. the runs of non-zero values
Finally, I have a Series with the groups of indices of consecutive non-zero values.
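For reference, here is a minimal self-contained sketch of the same idea on made-up data (the column name "value" is an assumption for illustration):
import pandas as pd
# hypothetical data: runs of non-zero values separated by zeros
df = pd.DataFrame({"value": [1, 2, 0, 0, 3, 4, 5, 0, 6]})
s = df["value"].eq(0)  # True where the value is zero
# rows in the same run share the key (is_zero, zeros seen so far)
groups = df.groupby([s, s.cumsum()]).apply(lambda x: list(x.index))
runs = groups.loc[False]  # keep only the non-zero runs
print(list(runs))  # [[0, 1], [4, 5, 6], [8]]
sub_dfs = [df.loc[idx] for idx in runs]  # one dataframe per run
The final list comprehension turns each run of indices into its own dataframe, which is what the question asks for.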

How to fillna() all columns of a dataframe from a single row of another dataframe with identical structure

I have a train_df and a test_df, which come from the same original dataframe, but were split up in some proportion to form the training and test datasets, respectively.
Both train and test dataframes have identical structure:
A PeriodIndex with daily buckets
n columns that represent observed values in those time buckets, e.g. Sales, Price, etc.
I now want to construct a yhat_df, which stores predicted values for each of the columns. In the "naive" case, each yhat_df column is simply filled with the last observed value of that column in the training dataset.
So I go about constructing yhat_df as below:
import pandas as pd
yhat_df = pd.DataFrame().reindex_like(test_df)
yhat_df[train_df.columns[0]].fillna(train_df.tail(1).values[0][0], inplace=True)
yhat_df[train_df.columns[1]].fillna(train_df.tail(1).values[0][1], inplace=True)
This appears to work, and since I have only two columns, the extra typing is bearable.
I was wondering if there is simpler way, especially one that does not need me to go column by column.
I tried the following but that just populates the column values correctly where the PeriodIndex values match. It seems fillna() attempts to do a join() of sorts internally on the Index:
yhat_df.fillna(train_df.tail(1), inplace=True)
If I could figure out a way for fillna() to ignore index, maybe this would work?
You can use fillna() with a dictionary to fill each column with a different value, so I think:
yhat_df = yhat_df.fillna(train_df.tail(1).to_dict('records')[0])
should work. But if I understand correctly what you are doing, you could even create the dataframe directly with:
yhat_df = pd.DataFrame(train_df.tail(1).to_dict('records')[0],
                       index=test_df.index, columns=test_df.columns)
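To see why the dictionary form sidesteps the index alignment, here is a minimal sketch with made-up column names (Sales and Price are assumptions, not taken from the question):
import pandas as pd
train_df = pd.DataFrame({"Sales": [10, 20, 30], "Price": [1.0, 2.0, 3.0]})
fill_values = train_df.tail(1).to_dict('records')[0]
print(fill_values)  # {'Sales': 30, 'Price': 3.0}
# a dict maps column name -> scalar, so fillna() applies it to every row
# regardless of the index, unlike fillna() with a DataFrame argument
Because the dict carries no index, fillna() cannot attempt the internal join the question ran into.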

Pandas DataFrame - adding columns in for loop vs another approach

We have measurements for n points (say 22 points) over a period of time, stored in a real-time store. We are now looking for some understanding of trends for the points mentioned above. To that end we read the measurements into a pandas DataFrame (Python). Within this DataFrame the points are now columns and the rows are the respective measurement times.
We would like to extend the data frame with new mean and std columns, inserting a 'mean' and a 'std' column for each existing column, each existing column being a particular measurement. That means two new columns for each of the 22 measurement points.
Now the question is whether the above is best achieved by adding the new mean and std columns while iterating over the existing columns, or whether there is a more effective built-in DataFrame operation or trick.
Our understanding is that updating a DataFrame in a for loop would be far from best practice.
Thanks for any comment or proposal.
From the comments, I guess this is what you are looking for -
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.normal(size=(1000, 22)))  # create an example dataframe
df.loc[:, 'means'] = df.mean(axis=1)  # row-wise mean across the 22 points
df.loc[:, 'std'] = df.std(axis=1)  # row-wise standard deviation
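If instead the goal is one mean and one std column per existing measurement column, as the question seems to describe, here is a minimal sketch; the whole-column scalar statistic (rather than, say, a rolling one) is an assumption. df.assign() builds all the new columns in one call instead of growing the frame inside a for loop:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.normal(size=(1000, 22)),
                  columns=[f"p{i}" for i in range(22)])
# one assign() call adds all 44 new columns at once; each scalar
# statistic is broadcast down its new column
df = df.assign(**{f"{c}_mean": df[c].mean() for c in df.columns},
               **{f"{c}_std": df[c].std() for c in df.columns})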

How to t-test by group in a pandas dataframe?

I have quite a huge pandas dataframe with many columns. The dataframe contains two groups. It is basically setup as follows:
import pandas as pd
csv = [{"air" : 0.47,"co2" : 0.43 , "Group" : 1}, {"air" : 0.77,"co2" : 0.13 , "Group" : 1}, {"air" : 0.17,"co2" : 0.93 , "Group" : 2} ]
df = pd.DataFrame(csv)
I want to perform a paired t-test on air and co2, thereby comparing the two groups Group = 1 and Group = 2.
I have many more columns than just air and co2; hence, I would like to find a procedure that works for all columns in the dataframe. I believe I could use scipy.stats.ttest_rel together with pd.groupby or apply. How would that work? Thanks in advance /R
I would use the pandas dataframe.where method.
group1_air = df.where(df.Group == 1).dropna()['air']
group2_air = df.where(df.Group == 2).dropna()['air']
This bit of code puts into group1_air all the values of the air column where the Group column is 1, and into group2_air all the values of air where Group is 2.
The dropna() is required because the .where method returns NaN for every row in which the specified condition is not met, so all rows where Group is 2 come back as NaN when you use df.where(df.Group == 1).
Whether you need scipy.stats.ttest_rel or scipy.stats.ttest_ind depends on your groups: if your samples come from independent groups, use ttest_ind; if they come from related groups, use ttest_rel.
So if your samples are independent from one another, your final piece of required code is:
scipy.stats.ttest_ind(group1_air,group2_air)
else you need to use
scipy.stats.ttest_rel(group1_air,group2_air)
When you want to also test co2 you simply need to change air for co2 in the given example.
Edit:
This is a rough sketch of the code you should run to execute t-tests over every column in your dataframe except the Group column. You may need to tweak column_list a bit to make it fully fit your needs (you may not want to loop over every column, for example).
import scipy.stats
# get a list of all columns in the dataframe without the Group column
column_list = [x for x in df.columns if x != 'Group']
# create an empty dictionary
t_test_results = {}
# loop over column_list and execute the code explained above
for column in column_list:
    group1 = df.where(df.Group == 1).dropna()[column]
    group2 = df.where(df.Group == 2).dropna()[column]
    # add the output to the dictionary
    t_test_results[column] = scipy.stats.ttest_ind(group1, group2)
results_df = pd.DataFrame.from_dict(t_test_results, orient='index')
results_df.columns = ['statistic', 'pvalue']
At the end of this code you have a dataframe with the output of the t-test for every column you looped over.
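Since the question also mentions groupby/apply, note that the same per-group selection can be done with plain boolean indexing, which avoids the NaN round-trip of .where(); a minimal equivalent sketch:
import scipy.stats
group1 = df[df.Group == 1]
group2 = df[df.Group == 2]
# run an independent t-test per column, skipping the Group column
t_test_results = {c: scipy.stats.ttest_ind(group1[c], group2[c])
                  for c in df.columns if c != 'Group'}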

Python - Pandas Dataframe with Multiple Names per Column

Is there a way in pandas to give the same column of a pandas dataframe two names, so that I can index the column by only one of the two names? Here is a quick example illustrating my problem:
import pandas as pd
index=['a','b','c','d']
# The list of tuples here is really just to
# somehow visualize my problem below:
columns = [('A','B'), ('C','D'),('E','F')]
df = pd.DataFrame(index=index, columns=columns)
# I can index like that:
df[('A','B')]
# But I would like to be able to index like this:
df[('A',*)] #error
df[(*,'B')] #error
You can create a multi-index column:
df.columns = pd.MultiIndex.from_tuples(df.columns)
Then you can do:
df.loc[:, ("A", slice(None))]
Or: df.loc[:, (slice(None), "B")]
Here slice(None) means "select all labels at this level", so (slice(None), "B") selects every column whose second level is "B", regardless of the first-level name; it is semantically the same as :. You can also write it with a pandas index slicer: df.loc[:, pd.IndexSlice[:, "B"]] for the second case.
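Putting the pieces together, a minimal end-to-end sketch on the question's own data:
import pandas as pd
index = ['a', 'b', 'c', 'd']
columns = [('A', 'B'), ('C', 'D'), ('E', 'F')]
df = pd.DataFrame(index=index, columns=columns)
# promote the tuples to a two-level column index
df.columns = pd.MultiIndex.from_tuples(df.columns)
print(df.loc[:, ("A", slice(None))])  # all columns whose first level is 'A'
print(df.loc[:, pd.IndexSlice[:, "B"]])  # all columns whose second level is 'B'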
