Selecting specific rows with multi-level indexes - python-3.x

I am learning Python on DataCamp, where I am taking a course titled Data Manipulation with pandas.
In an exercise I want to select some rows from a data frame, but I get an error and I really don't understand the error message.
Here is the code
# Load packages
import pandas as pd
# Import data
sales = pd.read_pickle("https://assets.datacamp.com/production/repositories/5386/datasets/b0e855c644024f850a7df3fe8b7bf0e23f7bd2fc/walmart_sales.pkl.bz2", 'bz2')
print(sales)
# Create multi-level indexes
sales_ind = sales.set_index(["store", "type"])
print(sales_ind)
# Define the rows to select
rows_to_keep = [(1, "A"), (2, "B")]
print(rows_to_keep)
# Use .loc to select the rows
print(sales_ind.loc[rows_to_keep])
The problem is when I run print(sales_ind.loc[rows_to_keep]), where I get this error message: KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike' and I don't understand how to fix the code.
I am using Python 3.7.6 and pandas 1.0.0

I figured out your problem, and it is not a problem with df.loc but with the data you provided: there is no value in your MultiIndex that matches the tuple (2, 'B').
You can see this if you try:
sales_ind.loc[(2,'B'),:]
which returns:
KeyError: (2, 'B')
while (1,'A') works just fine.
By the way, when selecting on a multiindex, I always recommend using pd.IndexSlice
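For example, a minimal sketch of pd.IndexSlice on a toy frame standing in for the Walmart data (column names and values are made up here):

```python
import pandas as pd

# Toy stand-in for the sales data (values are invented)
sales = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "type": ["A", "A", "B", "B"],
    "weekly_sales": [100.0, 150.0, 200.0, 250.0],
})
sales_ind = sales.set_index(["store", "type"])

# pd.IndexSlice lets you spell out a slice per index level
idx = pd.IndexSlice
subset = sales_ind.loc[idx[[1, 2], ["A", "B"]], :]
print(subset)
```

Unlike passing a list of tuples, this selects by labels per level, so it picks up every existing combination of the listed stores and types.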

Related

pandas dataframe manipulation without using loop

Please find the input and output below. For each store_id and period_id, 11 items should be present; if any item is missing, add a row for it filled with 0, without using a loop.
Any help is highly appreciated.
input
Expected Output
You can do this:
Sample df:
df = pd.DataFrame({'store_id': [1160962, 1160962, 1160962, 1160962, 1160962, 1160962, 1160962, 1160962, 1160962, 1160962, 1160962],
                   'period_id': [1025, 1025, 1025, 1025, 1025, 1025, 1026, 1026, 1026, 1026, 1026],
                   'item_x': [1, 4, 5, 6, 7, 8, 1, 2, 5, 6, 7],
                   'z': [1, 4, 5, 6, 7, 8, 1, 2, 5, 6, 7]})
Solution:
num = range(1, 12)

def f(x):
    return x.reindex(num, fill_value=0)\
            .assign(store_id=x['store_id'].mode()[0], period_id=x['period_id'].mode()[0])

df.set_index('item_x').groupby(['store_id', 'period_id'], group_keys=False).apply(f).reset_index()
You can do:
from itertools import product
pdindex=product(df.groupby(["store_id", "period_id"]).groups, range(1,12))
pdindex=pd.MultiIndex.from_tuples(map(lambda x: (*x[0], x[1]), pdindex), names=["store_id", "period_id", "Item"])
df=df.set_index(["store_id", "period_id", "Item"])
res=pd.DataFrame(index=pdindex, columns=df.columns)
res.loc[df.index, df.columns]=df
res=res.fillna(0).reset_index()
Note that this will only work assuming you don't have any Item outside of the range [1, 11].
This is a simplification of @GrzegorzSkibinski's correct answer.
This version does not modify the original DataFrame, uses fewer variables for intermediate data structures, and employs a list comprehension to simplify the use of map.
I'm also using reindex() rather than creating a new DataFrame from the generated index and populating it with the original data.
import pandas as pd
import itertools
df.set_index(
    ["store_id", "period_id", "Item_x"]
).reindex(
    pd.MultiIndex.from_tuples(
        [
            group + (item,)
            for group, item in itertools.product(
                df.groupby(["store_id", "period_id"]).groups,
                range(1, 12),
            )
        ],
        names=["store_id", "period_id", "Item_x"],
    ),
    fill_value=0,
).reset_index()
In testing, output matched what you listed as expected.

Using loops to call multiple pandas dataframe columns

I am new to Python 3 and trying to do chi-squared tests on columns in a pandas dataframe. My columns come in pairs: observed_count_column_1, expected_count_column_1, observed_count_column_2, expected_count_column_2, and so on. I would like to write a loop to process all column pairs at once.
I succeed doing this if I specify column index integers or column names manually.
This works
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv(r'count.csv')
chisquare(df.iloc[:, [0]], df.iloc[:, [1]])
This, trying with a loop, does not:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv(r'count.csv')
for n in [0, 2, 4, 6, 8, 10]:
    chisquare(df.iloc[:, [n]], df.iloc[:, [n+1]])
The loop code does not seem to run at all: I get no error, but no output either.
I was wondering why this is happening and how I can actually approach this.
Thank you,
Dan
Consider building a data frame of chi-square results from a list of tuples, then assigning column names as indicators for observed and expected frequencies (subsetting even/odd columns with indexed notation):
# CREATE DATA FRAME FROM LIST OF TUPLES
# THEN ASSIGN COLUMN NAMES
chi_square_df = (pd.DataFrame([chisquare(df.iloc[:, [n]], df.iloc[:, [n+1]])
                               for n in range(0, 11, 2)],
                              columns=['chi_sq_stat', 'p_value'])
                   .assign(obs_freq=df.columns[::2],
                           exp_freq=df.columns[1::2])
                 )
The chisquare() function returns two values, so you can try this:
for n in range(0, 11, 2):
    chisq, p = chisquare(df.iloc[:, [n]], df.iloc[:, [n+1]])
    print('Chisq: {}, p-value: {}'.format(chisq, p))
You can find what it returns in the docs here https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
Thank you for the suggestions. Using the information from Parfait's comment that loops don't print, I managed to find a solution, although not as elegant as their own solution above.
for n in range(0, 11, 2):
    print(chisquare(df.iloc[:, [n]], df.iloc[:, [n+1]]))
This gives the expected results.
Dan

Indexing Error from loc command with pandas in python. No index in csv file

I am trying to print particular columns from a data set that has nine columns. There is no index for the rows in the csv file. I am trying to use pandas as pd and a loc command, but it comes up with an indexing error; at the bottom it says too many indexers.
I am new to pandas and am playing with it for the first time. So I am not even sure what to try to fix this.
import pandas as pd
dfa=pd.read_csv('cars.csv', sep=';')
print(dfa)
print(dfa.columns)
print(dfa.loc['Car', 'Horsepower', 'MPG', 'Cylinders', 'Weight'])
I am running this in Jupyter Notebook. As expected, it first prints out a table with just over 400 cars. Then it prints out the column names (without any indexes as far as I can tell). Then it has an error running the loc command. First it says indexing error. Then it says
we by definition only have the 0th axis...
no multi-index, so validate all of the indexers
ugly hack for GH #836
IndexingError: Too many indexers
Use:
print(dfa.loc[:,['Car', 'Horsepower', 'MPG', 'Cylinders', 'Weight']])
Reason:
To df.loc[] you have to pass a list with the column names if you want to select more than one:
['Car', 'Horsepower', 'MPG', 'Cylinders', 'Weight']
If you want to get all rows, you can use :
But you always have to specify which rows to select, one way or another.
For more info, consult:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
Also you can use:
dfa[['Car', 'Horsepower', 'MPG', 'Cylinders', 'Weight']]
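A quick runnable check that the two forms select the same columns, on a toy frame standing in for cars.csv (values made up, only a few of the columns):

```python
import pandas as pd

# Toy stand-in for cars.csv with a subset of its columns (values invented)
dfa = pd.DataFrame({
    "Car": ["Chevy", "Ford"],
    "MPG": [18.0, 15.0],
    "Cylinders": [8, 8],
    "Origin": ["US", "US"],
})

# .loc with an explicit row slice, and plain [] column selection, are equivalent here
a = dfa.loc[:, ["Car", "MPG", "Cylinders"]]
b = dfa[["Car", "MPG", "Cylinders"]]
print(a.equals(b))  # True
```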

Pandas - Behaviour/Issue of reindex vs loc for setting values

I've been trying to use reindex instead of loc in pandas, as from 0.24 there is a warning about reindexing with lists.
The issue I have is that I use loc to change the values in my dataframes.
If I use reindex I lose this, and if I try to be smart I even hit a bug.
Contemplate the following case:
df = pd.DataFrame(data=pd.np.zeros((4, 2)), columns=['a', 'b'])
ds = pd.Series(data=[1]*3)
I want to change a subset of values (while retaining the others), so df keeps the same shape.
So this is the original behaviour which works (and changes the values in a subset of df['a'] to 1)
df.loc[range(3), 'a'] = ds
But when I'm using the reindex I fail to change anything:
df.reindex(range(3)).loc['a'] = ds
Now when I try something like this:
df.loc[:, 'a'].reindex(range(3)) = ds
I get a SyntaxError: can't assign to function call error message.
For reference I am using pandas 0.24 and python 3.6.8
The quick answer from @coldspeed was the easiest, though the behaviour suggested by the warning is misleading:
reindex returns a copy, while loc does not.
From the pandas docs:
A new object is produced unless the new index is equivalent to the current one and copy=False.
So saying that reindex is an alternative to loc, as the warning does, is actually misleading.
Hope this helps people who face the same situation.
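A small sketch of the difference (using numpy directly instead of the deprecated pd.np): .loc writes into the original frame, while assigning into a reindex result only changes the copy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((4, 2)), columns=["a", "b"])
ds = pd.Series(data=[1] * 3)

# .loc assigns in place: rows 0-2 of column 'a' become 1.0
df.loc[range(3), "a"] = ds
print(df["a"].tolist())  # [1.0, 1.0, 1.0, 0.0]

# .reindex returns a new object; writing into it leaves df untouched
copy = df.reindex(range(3))
copy["b"] = 99
print(df["b"].tolist())  # still all zeros
```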

Error when using pandas read_excel(header=[0,1])

I'm trying to use pandas read_excel to work with a file. The file has two rows of headers, so I'm trying to use the MultiIndex feature of the header keyword argument.
import pandas as pd, os
"""data in 2015 MOR Folder"""
filename = 'MOR-JANUARY 2015.xlsx'
print(os.path.isfile(filename))
df1 = pd.read_excel(filename, header=[0,1], sheetname='MOR')
print(df1)
the error I get is ValueError: Length of new names must be 1, got 2. The file is in this google drive folder https://drive.google.com/drive/folders/0B0ynKIVAlSgidFFySWJoeFByMDQ?usp=sharing
I'm trying to follow the solution posted here
Read excel sheet with multiple header using Pandas
I could be mistaken, but I don't think pandas handles parsing Excel rows that contain merged cells. So in that first row, the merged cells get parsed as mostly empty cells, whereas you'd need them nicely repeated for this to act correctly. This is what motivates the ffill below. If you could control the Excel workbook ahead of time, you might be able to use the code you have.
my solution
It's not pretty, but it'll get it done.
filename = 'MOR-JANUARY 2015.xlsx'
df1 = pd.read_excel(filename, sheetname='MOR', header=None)
vals = df1.values
mux = pd.MultiIndex.from_arrays(df1.ffill(1).values[:2, 1:], names=[None, 'DATE'])
df1 = pd.DataFrame(df1.values[2:, 1:], df1.values[2:, 0], mux)
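The merged-cell repair above can be sketched on a toy frame standing in for the raw sheet (the group names, sub-headers, and values here are all invented):

```python
import pandas as pd

# Toy stand-in for the raw sheet: row 0 has merged-cell gaps (None),
# row 1 has the sub-headers, remaining rows are data; column 0 holds row labels
raw = pd.DataFrame([
    ["",   "Group1", None, "Group2"],
    ["",   "x",      "y",  "x"],
    ["r1", 1,        2,    3],
    ["r2", 4,        5,    6],
])

# Forward-fill across columns so merged header cells repeat,
# then build a MultiIndex from the first two rows
mux = pd.MultiIndex.from_arrays(raw.ffill(axis=1).values[:2, 1:])
df1 = pd.DataFrame(raw.values[2:, 1:], index=raw.values[2:, 0], columns=mux)
print(df1)
```

The ffill turns the gap after "Group1" into a repeated "Group1", which is exactly what the MultiIndex needs.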
