pandas dataframe manipulation without using loop - python-3.x

Please find the input and expected output below. For each combination of store id and period id, 11 items should be present. If any item is missing, I want to add it and fill that row with 0, without using a loop.
Any help is highly appreciated.
input
Expected Output

You can do this:
Sample df:
import pandas as pd
df = pd.DataFrame({'store_id':[1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962, 1160962],
                   'period_id':[1025,1025,1025,1025,1025,1025,1026,1026,1026,1026,1026],
                   'item_x':[1,4,5,6,7,8,1,2,5,6,7],
                   'z':[1,4,5,6,7,8,1,2,5,6,7]})
Solution:
num = range(1,12)
def f(x):
    # add the missing items for this group, filling the new rows with 0,
    # then restore the group's store_id and period_id on every row
    return x.reindex(num, fill_value=0)\
            .assign(store_id=x['store_id'].mode()[0], period_id = x['period_id'].mode()[0])
df.set_index('item_x').groupby(['store_id','period_id'], group_keys=False).apply(f).reset_index()

You can do:
from itertools import product
pdindex=product(df.groupby(["store_id", "period_id"]).groups, range(1,12))
pdindex=pd.MultiIndex.from_tuples(map(lambda x: (*x[0], x[1]), pdindex), names=["store_id", "period_id", "Item"])
df=df.set_index(["store_id", "period_id", "Item"])
res=pd.DataFrame(index=pdindex, columns=df.columns)
res.loc[df.index, df.columns]=df
res=res.fillna(0).reset_index()
Note that this works only if no Item falls outside the range [1, 11].

This is a simplification of @GrzegorzSkibinski's correct answer.
This answer does not modify the original DataFrame. It uses fewer variables to store intermediate data structures and employs a list comprehension in place of map.
I'm also using reindex() rather than creating a new DataFrame from the generated index and populating it with the original data.
import pandas as pd
import itertools
df.set_index(
    ["store_id", "period_id", "Item_x"]
).reindex(
    pd.MultiIndex.from_tuples([
        group + (item,)
        for group, item in itertools.product(
            df.groupby(["store_id", "period_id"]).groups,
            range(1, 12),
        )],
        names=["store_id", "period_id", "Item_x"]
    ),
    fill_value=0,
).reset_index()
In testing, output matched what you listed as expected.
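For comparison, here is a minimal sketch of a different technique that avoids building tuples at all: a cross merge. It assumes pandas 1.2 or later (for how="cross") and reuses the sample data from the first answer, so the item column is called item_x; the names pairs, grid and filled are made up for this illustration.

```python
import pandas as pd

# Sample data taken from the first answer above
df = pd.DataFrame({
    "store_id": [1160962] * 11,
    "period_id": [1025] * 6 + [1026] * 5,
    "item_x": [1, 4, 5, 6, 7, 8, 1, 2, 5, 6, 7],
    "z": [1, 4, 5, 6, 7, 8, 1, 2, 5, 6, 7],
})

keys = ["store_id", "period_id", "item_x"]

# Every (store_id, period_id) pair that actually occurs in the data
pairs = df[["store_id", "period_id"]].drop_duplicates()

# Cross merge: each existing pair combined with items 1..11
grid = pairs.merge(pd.DataFrame({"item_x": range(1, 12)}), how="cross")

# Left merge keeps every grid row; rows missing from df get NaN, then 0
filled = grid.merge(df, how="left", on=keys).fillna(0)
filled["z"] = filled["z"].astype(int)  # NaN forced floats; restore integers

print(filled)
```

Unlike the reindex approach, this never moves the key columns into the index, which can be convenient when there are many value columns. Note that fillna(0) fills every column, so it only suits data where 0 is the right default everywhere.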

Related

How can I manipulate a named tuple produced via itertuples specifically to remove an element and produce a dictionary from the remaining elements?

The question is best expanded upon with an example:
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randn(1000,4),columns=list('ABCD'))
def func(A,B,C):
    return A + B + C
for index,kwargs in df.iterrows():
    kwargs.pop('D')
    result = func(**kwargs)
The specific goal here is to replicate the above example, but deploy itertuples instead of iterrows, for efficiency. However, when switching to itertuples, I am unsure as to how to manipulate the pandas.core.frame.Pandas objects i.e. pandas named tuples that are produced for each row in a similar fashion to achieve the same goal as the manipulation of the pandas.core.series.Series objects that the iterrows function produces.
Here is the idea:
for kwargs in df.itertuples():
    kwargs.pop('D')
    result = func(**kwargs)
Of course, neither line in the for loop works, because the new iteration method yields a different kind of object. How can this be rewritten to achieve the same result, either directly (I have not found an equivalent approach yet) or indirectly, without giving up the intended efficiency gains?
Thanks.
Why not convert the namedtuple to a dict?
wanted = [c for c in df.columns if c != 'D']
# index=False keeps the row index out of the namedtuple, so func() receives only A, B and C
for row in df.loc[:, wanted].itertuples(index=False):
    result = func(**row._asdict())
You could also convert the DataFrame to a list of dicts:
wanted = [c for c in df.columns if c != 'D']
for kwargs in df.loc[:, wanted].to_dict('records'):
    result = func(**kwargs)
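A self-contained sketch of the namedtuple approach, using the question's own data and func. The detail to watch is index=False: by default itertuples() puts the row index into each tuple as a field called Index, which func would reject as an unexpected keyword. The comparison against a vectorized sum at the end is only a sanity check added for this illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randn(1000, 4), columns=list("ABCD"))

def func(A, B, C):
    return A + B + C

# Drop the unwanted column once, before iterating
wanted = [c for c in df.columns if c != "D"]

# Each row is a namedtuple with fields A, B, C; _asdict() turns it into kwargs
results = [func(**row._asdict()) for row in df[wanted].itertuples(index=False)]

# Sanity check (illustration only): the same sum computed column-wise
vectorized = df["A"] + df["B"] + df["C"]
print(np.allclose(results, vectorized))
```

Since the namedtuple fields are in column order, func(*row) would also work here, but the keyword form stays correct if the column order ever changes.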

create MultiIndex columns based on "lookup"

I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets' with the corresponding original columns underneath. I am not concerned about order but definitely proper alignment.
It can be assumed that 'coltuples' will not be missing any of the columns from 'df', but may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track, but the underlying data in the resulting DataFrame is all NaN.
Thanks in advance!
Two things to notice: you want set_axis rather than reindex, and sorting the tuples by the original column order ensures that each label is assigned to the correct column (this is what the sorted(..., key=...) line does).
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df.set_axis(multi_index, axis=1)
output:
level1    cities            countries             planets
level2       nyc    london    canada     chile     earth
0       0.028033  0.540977 -0.056096  1.675698 -0.328630
1       1.170465 -1.003825  0.882126  0.453294 -1.127752
2      -0.187466 -0.192546  0.269802 -1.225172 -0.548491
3       2.272900 -0.085427  0.029242 -2.258696  1.034485
4      -1.243871 -1.660432 -0.051674  2.098602 -2.098941
5      -0.820820 -0.289754  0.019348  0.176778  0.395959
6       1.346459 -0.260583  0.212008 -1.071501  0.945545
7       0.673351  1.133616  1.117379 -0.531403  1.467604
8       0.332187 -3.541103 -0.222365  1.035739 -0.485742
9      -0.605965 -1.442371 -1.628210 -0.711887 -2.104755
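As a variation on the answer above, the sorting step can be avoided by turning coltuples into a lookup dictionary and walking the existing columns in their current order. This is only a sketch: it assumes each second-level name appears at most once in coltuples, and top_level is a name invented for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 5),
                  columns=["nyc", "london", "canada", "chile", "earth"])
coltuples = [("cities", "nyc"), ("countries", "canada"), ("countries", "usa"),
             ("countries", "chile"), ("planets", "earth"), ("planets", "mars"),
             ("cities", "sf"), ("cities", "london")]

# Map each second-level label to its first-level label
top_level = {name: group for group, name in coltuples}

# Build the tuples in the DataFrame's own column order, so labels and data
# stay aligned; extraneous pairs such as ('countries', 'usa') are never used
multi_index = pd.MultiIndex.from_tuples(
    [(top_level[name], name) for name in df.columns],
    names=["level1", "level2"],
)

# set_axis relabels the columns without touching the data
result = df.set_axis(multi_index, axis=1)
print(result.head())
```

A column missing from coltuples would raise a KeyError in the comprehension, which makes the stated assumption (no missing columns) fail loudly rather than silently.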

how to use lambda function to select larger values from two python dataframes whilst comparing them by date?

I want to go through the rows of df1 and compare them with the values of df2, by month and day, across every year in df2, keeping only the values in df1 that are larger than those in df2 and adding them to a new column, 'New'. df1 and df2 are the same size and are indexed by 'Month' and 'Day'. What would be the best way to do this?
df1=pd.DataFrame({'Date':['2015-01-01','2015-01-02','2015-01-03','2015-01-04','2005-01-05'],'Values':[-5.6,-5.6,0,3.9,9.4]})
df1.Date=pd.to_datetime(df1.Date)
df1['Day']=pd.DatetimeIndex(df1['Date']).day
df1['Month']=pd.DatetimeIndex(df1['Date']).month
df1.set_index(['Month','Day'],inplace=True)
df1
df2 = pd.DataFrame({'Date':['2005-01-01','2005-01-02','2005-01-03','2005-01-04','2005-01-05'],'Values':[-13.3,-12.2,6.7,8.8,15.5]})
df2.Date=pd.to_datetime(df2.Date)
df2['Day']=pd.DatetimeIndex(df2['Date']).day
df2['Month']=pd.DatetimeIndex(df2['Date']).month
df2.set_index(['Month','Day'],inplace=True)
df2
df1 and df2
df2['New']=df2[df2['Values']<df1['Values']]
gives
ValueError: Can only compare identically-labeled Series objects
I have also tried
df2['New']=df2[df2['Values'].apply(lambda x: x < df1['Values'].values)]
The best way to handle your problem is to use NumPy, which has a function called where that helps a lot in cases like this.
The general form is:
df1['new column that will contain the comparison results'] = np.where(condition, 'value if true', 'value if false')
First import the library:
import numpy as np
Using the condition provided by you:
df2['New'] = np.where(df2['Values'] > df1['Values'], df2['Values'],'')
So, I think that solves your problem. You can change the value used when the condition is False to anything you want; the empty string is only an example.
Tell us if it worked!
Let's try two possible solutions:
The first solution is to sort the index first.
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
Perform a simple test to see if it works!
df1 == df2
This may still raise an error. If that happens, sort along the column axis instead:
df1.sort_index(inplace=True, axis=1)
df2.sort_index(inplace=True, axis=1)
The second solution is to drop the indexes and reset them:
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
Perform a simple test to see if it works!
df1 == df2
See if it works and tell us the result.
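Putting the two suggestions together, a minimal sketch might look like the following. It uses the question's sample values, aligns df2 to df1's (Month, Day) index before comparing, and writes NaN rather than an empty string where df1 is not larger, so the new column stays numeric. The helper by_month_day and the variable other are names made up for this example, and the result is stored on df1 because the question asks to keep the larger df1 values.

```python
import numpy as np
import pandas as pd

def by_month_day(dates, values):
    """Hypothetical helper: build a frame indexed by (Month, Day)."""
    frame = pd.DataFrame({"Date": pd.to_datetime(dates), "Values": values})
    frame["Month"] = frame["Date"].dt.month
    frame["Day"] = frame["Date"].dt.day
    return frame.set_index(["Month", "Day"]).sort_index()

df1 = by_month_day(
    ["2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04", "2005-01-05"],
    [-5.6, -5.6, 0, 3.9, 9.4],
)
df2 = by_month_day(
    ["2005-01-01", "2005-01-02", "2005-01-03", "2005-01-04", "2005-01-05"],
    [-13.3, -12.2, 6.7, 8.8, 15.5],
)

# Align df2's values to df1's (Month, Day) labels so the comparison is
# between identically-labeled Series
other = df2["Values"].reindex(df1.index)

# Keep df1's value where it is larger than df2's, otherwise NaN
df1["New"] = np.where(df1["Values"] > other, df1["Values"], np.nan)
print(df1)
```

This assumes each (Month, Day) pair occurs once per frame; with several years in df2, the values would first need to be reduced per (Month, Day), for example with a groupby, before the comparison.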

Using loops to call multiple pandas dataframe columns

I am new to Python 3 and am trying to run chi-squared tests on columns in a pandas DataFrame. My columns come in pairs: observed_count_column_1, expected_count_column_1, observed_count_column_2, expected_count_column_2, and so on. I would like to write a loop that processes all column pairs at once.
I succeed doing this if I specify column index integers or column names manually.
This works
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv (r'count.csv')
chisquare(df.iloc[:,[0]], df.iloc[:,[1]])
This, trying with a loop, does not:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv (r'count.csv')
for n in [0,2,4,6,8,10]:
    chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
The loop code does not seem to run at all and I get no error but no output either.
I was wondering why this is happening and how can I actually approach this?
Thank you,
Dan
Consider building a data frame of chi-square results from list of tuples, then assign column names as indicators for observed and expected frequencies (subsetting even/odd columns by indexed notation):
# CREATE DATA FRAME FROM LIST OF TUPLES
# THEN ASSIGN COLUMN NAMES
chi_square_df = (pd.DataFrame([chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
                               for n in range(0,11,2)],
                              columns = ['chi_sq_stat', 'p_value'])
                   .assign(obs_freq = df.columns[::2],
                           exp_freq = df.columns[1::2])
                )
The chisquare() function returns two values, so you can try this:
for n in range(0, 11, 2):
    chisq, p = chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
    print('Chisq: {}, p-value: {}'.format(chisq, p))
You can find what it returns in the docs here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
Thank you for the suggestions. Using the information from Parfait's comment that results inside a loop are not printed automatically, I managed to find a solution, although it is not as elegant as their solution above.
for n in range(0, 11, 2):
    print(chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]]))
This gives the expected results.
Dan
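For readers without count.csv, here is a small runnable sketch of the same loop that pairs columns by position and collects the results instead of printing them. The counts frame and its column names are made up for illustration, and the totals of each observed/expected pair are kept equal, which recent SciPy versions check.

```python
import pandas as pd
from scipy.stats import chisquare

# Made-up counts: columns alternate observed, expected, observed, expected
counts = pd.DataFrame({
    "observed_1": [16, 18, 16, 14, 12, 12],
    "expected_1": [16, 16, 16, 16, 16, 8],
    "observed_2": [12, 8, 10, 10, 10, 10],
    "expected_2": [10, 10, 10, 10, 10, 10],
})

rows = []
# Even-positioned columns are observed counts, odd-positioned are expected
for obs_col, exp_col in zip(counts.columns[::2], counts.columns[1::2]):
    stat, p = chisquare(counts[obs_col], f_exp=counts[exp_col])
    rows.append({"observed": obs_col, "expected": exp_col,
                 "chi_sq_stat": stat, "p_value": p})

# One row per column pair
results = pd.DataFrame(rows)
print(results)
```

Passing each column as a one-dimensional Series (counts[obs_col]) rather than a one-column DataFrame (df.iloc[:, [n]]) makes chisquare return plain scalars, which are easier to store in a table.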

Pandas iterate over group with SeriesGroupBy objects to Burst Data in MatPlotLib

I am attempting to iterate through an index of a grouped-by dataframe (see comments in code).
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1':['A','A','B','B'], 'COL2': [1,1,2,2,], 'COL3': [2,3,4,6]})
#Here, I'm creating a copy of 'COL2' so that it's still a column after I assign it to
#the index.
#I'm guessing there's a better way to do this in the groupby line.
df['COL2_copy'] = df['COL2']
df = df.groupby(['COL2_copy'], as_index=True)
#This will actually be a more complex function (creating a chart in MatPlotLib based on the
#data frame slice (group) per iteration through the index ('COL2')).
#I'll just keep it simple for now.
def pandfun(df):
    #Here's the real issue:
    df['COL4'] = np.trunc(np.round(df['COL3']*100,0))
    return df['COL4']
pandfun(df)
TypeError: unsupported operand type(s) for *: 'SeriesGroupBy' and 'int'
The desired results are 200 and 300 for the first group and 400 and 600 for the second group. To summarize, I believe the main problem is that I want to select individual groups of rows by index (i.e. 'COL2 == 1') and, within each group, refer to individual rows for a calculation.
I am taking this approach because I'll actually be using this with a MatPlotLib function that I created and I want to "burst" the data into one chart for each group in the dataframe, where each chart refers to individual row data for a given group.
I did this instead:
1. Get a unique list of values from COL2 and create a copy of df:
ulist = pd.unique(df['COL2'].ravel())
df2 = df.copy()
2. Iterate over that list, selecting the rows where COL2 matches:
for i in ulist:
    df = df2.loc[df2['COL2']==i]
3. Within each iteration, apply the function.
4. Within the MatPlotLib function, enter the following code at the top:
df.reset_index()
This served to reset the selection of rows after each iteration.
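An alternative sketch, not the approach described above: iterating over the groupby object itself yields each key together with an ordinary DataFrame slice, so the function can do row-level arithmetic without hitting the SeriesGroupBy error. The results dictionary is a name invented for this example, and the chart-drawing code is left out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"COL1": ["A", "A", "B", "B"],
                   "COL2": [1, 1, 2, 2],
                   "COL3": [2, 3, 4, 6]})

def pandfun(group):
    # Work on a copy so the original frame is left untouched
    group = group.copy()
    group["COL4"] = np.trunc(np.round(group["COL3"] * 100, 0))
    return group["COL4"]

# Each iteration yields (key, sub-DataFrame); COL2 remains a regular column,
# so no COL2_copy is needed
results = {key: pandfun(group) for key, group in df.groupby("COL2")}

for key, col4 in results.items():
    print(key, col4.tolist())
```

In the real use case, the body of the loop would call the MatPlotLib function once per group instead of storing a Series.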

Resources