Using loops to call multiple pandas dataframe columns - python-3.x

I am new to Python 3 and am trying to run chi-squared tests on columns in a pandas DataFrame. My columns come in pairs: observed_count_column_1, expected_count_column_1, observed_count_column_2, expected_count_column_2, and so on. I would like to write a loop that processes all column pairs at once.
This works when I specify column index integers or column names manually.
This works:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv (r'count.csv')
chisquare(df.iloc[:,[0]], df.iloc[:,[1]])
This, trying with a loop, does not:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv (r'count.csv')
for n in [0,2,4,6,8,10]:
    chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
The loop code does not seem to run at all and I get no error but no output either.
I was wondering why this is happening and how can I actually approach this?
Thank you,
Dan

Consider building a data frame of chi-square results from a list of tuples, then assigning columns that indicate the observed and expected frequencies (subsetting even/odd columns by indexed notation):
# CREATE DATA FRAME FROM LIST OF TUPLES
# THEN ASSIGN COLUMN NAMES
chi_square_df = (pd.DataFrame([chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
                               for n in range(0,11,2)],
                              columns = ['chi_sq_stat', 'p_value'])
                   .assign(obs_freq = df.columns[::2],
                           exp_freq = df.columns[1::2])
                )

The chisquare() function returns two values, so you can try this:
for n in range(0, 11, 2):
    chisq, p = chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
    print('Chisq: {}, p-value: {}'.format(chisq, p))
You can find what it returns in the docs here https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
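As a minimal sketch of the same idea, the loop below collects the two returned values for every observed/expected column pair into a summary table instead of printing them. The column names and counts are made up for illustration, and it assumes the observed and expected columns alternate and have equal totals within each pair.

```python
import pandas as pd
from scipy.stats import chisquare

# Hypothetical counts: observed and expected columns alternate.
df = pd.DataFrame({
    "obs_1": [18, 22, 20], "exp_1": [20, 20, 20],
    "obs_2": [30, 10, 20], "exp_2": [20, 20, 20],
})

results = []
for n in range(0, df.shape[1], 2):
    # Pass each column as a 1-D array so chisquare returns scalars.
    stat, p = chisquare(df.iloc[:, n].to_numpy(),
                        f_exp=df.iloc[:, n + 1].to_numpy())
    results.append({
        "observed": df.columns[n],
        "expected": df.columns[n + 1],
        "chi_sq_stat": stat,
        "p_value": p,
    })

summary = pd.DataFrame(results)
print(summary)
```

Selecting with `df.iloc[:, n]` rather than `df.iloc[:, [n]]` gives a one-dimensional column, which keeps the statistic and p-value as plain numbers rather than one-element arrays.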

Thank you for the suggestions. Using the information from Parfait's comment, that results inside a loop are not printed automatically, I managed to find a solution, although it is not as elegant as their solution above.
for n in range(0, 11, 2):
    print(chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]]))
This gives the expected results.
Dan

Related

Trying to use pandas to group all data in Column B [duplicate]

I have a large (about 12M rows) DataFrame df:
df.columns = ['word','documents','frequency']
The following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large DataFrame?
df.word.describe()
ran pretty well, so I really did not expect this Occurrences_of_Words DataFrame to take very long to build.
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)
In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
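A small sketch of the comparison, using a made-up `word` column rather than the 12M-row frame from the question: `value_counts()` and `groupby(...).size()` should give the same counts, with `value_counts()` also sorting them in descending order.

```python
import pandas as pd

# Hypothetical stand-in for the large DataFrame in the question.
df = pd.DataFrame({
    "word": ["a", "b", "a", "c", "a", "b"],
    "frequency": [1, 2, 3, 4, 5, 6],
})

# Count occurrences directly on the column.
counts = df["word"].value_counts()

# Same counts through the groupby machinery, for comparison.
grouped = df.groupby("word").size()

print(counts)
print(grouped)
```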
When you want to count the frequency of categorical data in a column in pandas dataFrame use: df['Column_Name'].value_counts()
-Source.
Just an addition to the previous answers. Let's not forget that when dealing with real data there might be null values, so it's useful to also include those in the counting by using the option dropna=False (default is True)
An example:
>>> df['Embarked'].value_counts(dropna=False)
S 644
C 168
Q 77
NaN 2
Other possible approaches to count occurrences could be to use (i) Counter from collections module, (ii) unique from numpy library and (iii) groupby + size in pandas.
To use collections.Counter:
from collections import Counter
out = pd.Series(Counter(df['word']))
To use numpy.unique:
import numpy as np
i, c = np.unique(df['word'], return_counts = True)
out = pd.Series(c, index = i)
To use groupby + size:
out = pd.Series(df.index, index=df['word']).groupby(level=0).size()
One very nice feature of value_counts that's missing in the above methods is that it sorts the counts. If having the counts sorted is absolutely necessary, then value_counts is the best method given its simplicity and performance (even though it still gets marginally outperformed by other methods especially for very large Series).
Benchmarks
(if having the counts sorted is not important):
If we look at runtimes, it depends on the data stored in the DataFrame columns/Series.
If the Series is dtype object, then the fastest method for very large Series is collections.Counter, but in general value_counts is very competitive.
However, if it is dtype int, then the fastest method is numpy.unique:
Code used to produce the plots:
import perfplot
import numpy as np
import pandas as pd
from collections import Counter
def creator(n, dt='obj'):
    s = pd.Series(np.random.randint(2*n, size=n))
    return s.astype(str) if dt=='obj' else s
def plot_perfplot(datatype):
    perfplot.show(
        setup = lambda n: creator(n, datatype),
        kernels = [lambda s: s.value_counts(),
                   lambda s: pd.Series(Counter(s)),
                   lambda s: pd.Series((ic := np.unique(s, return_counts=True))[1], index = ic[0]),
                   lambda s: pd.Series(s.index, index=s).groupby(level=0).size()
        ],
        labels = ['value_counts', 'Counter', 'np_unique', 'groupby_size'],
        n_range = [2 ** k for k in range(5, 25)],
        equality_check = lambda *x: (d:= pd.concat(x, axis=1)).eq(d[0], axis=0).all().all(),
        xlabel = '~len(s)',
        title = f'dtype {datatype}'
    )
plot_perfplot('obj')
plot_perfplot('int')

How can I manipulate a named tuple produced via itertuples specifically to remove an element and produce a dictionary from the remaining elements?

The question is best expanded upon with an example:
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randn(1000,4),columns=list('ABCD'))
def func(A,B,C):
    return A + B + C
for index,kwargs in df.iterrows():
    kwargs.pop('D')
    result = func(**kwargs)
The specific goal here is to replicate the above example, but deploy itertuples instead of iterrows, for efficiency. However, when switching to itertuples, I am unsure as to how to manipulate the pandas.core.frame.Pandas objects i.e. pandas named tuples that are produced for each row in a similar fashion to achieve the same goal as the manipulation of the pandas.core.series.Series objects that the iterrows function produces.
Here is the idea:
for kwargs in df.itertuples():
    kwargs.pop('D')
    result = func(**kwargs)
Of course, neither line in the for loop works, because of the different objects that the new iterative method yields. How can this be rewritten to achieve the same result, either directly (which I have not been able to find equivalent approaches for yet) or indirectly, without giving up the intended efficiency gains.
Thanks.
Why not convert the namedtuple to a dict? Pass index=False so that the Index field is not included in the resulting keyword arguments:
wanted = [c for c in df.columns if c != 'D']
for row in df.loc[:, wanted].itertuples(index=False):
    result = func(**row._asdict())
You could also convert the dataframe to a list of dicts:
wanted = [c for c in df.columns if c != 'D']
for kwargs in df.loc[:, wanted].to_dict('records'):
    result = func(**kwargs)
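A short self-contained sketch of the namedtuple-to-dict approach, checked against a vectorized sum. It uses a small random frame in place of the 1000-row one, and `drop(columns="D")` as one possible way of leaving out the unwanted column.

```python
import numpy as np
import pandas as pd

def func(A, B, C):
    return A + B + C

# Small random frame standing in for the one in the question.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.standard_normal((5, 4)), columns=list("ABCD"))

# index=False keeps the row index out of the namedtuple, so the
# dict holds exactly the keyword arguments A, B and C.
results = [
    func(**row._asdict())
    for row in df.drop(columns="D").itertuples(index=False)
]

# Reference result computed column-wise for comparison.
expected = (df["A"] + df["B"] + df["C"]).tolist()
print(np.allclose(results, expected))
```

Note that `_asdict()` relies on the column names being valid Python identifiers; otherwise pandas renames the fields positionally and the keyword names no longer match.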

How to use a lambda function to select larger values from two Python dataframes while comparing them by date?

I want to go through the rows of df1 and compare them with the values of df2 by month and day, across every year in df2, keeping only the values in df1 that are larger than those in df2, and add them to a new column, 'New'. df1 and df2 are the same size and are indexed by 'Month' and 'Day'. What would be the best way to do this?
df1=pd.DataFrame({'Date':['2015-01-01','2015-01-02','2015-01-03','2015-01-04','2005-01-05'],'Values':[-5.6,-5.6,0,3.9,9.4]})
df1.Date=pd.to_datetime(df1.Date)
df1['Day']=pd.DatetimeIndex(df1['Date']).day
df1['Month']=pd.DatetimeIndex(df1['Date']).month
df1.set_index(['Month','Day'],inplace=True)
df1
df2 = pd.DataFrame({'Date':['2005-01-01','2005-01-02','2005-01-03','2005-01-04','2005-01-05'],'Values':[-13.3,-12.2,6.7,8.8,15.5]})
df2.Date=pd.to_datetime(df2.Date)
df2['Day']=pd.DatetimeIndex(df2['Date']).day
df2['Month']=pd.DatetimeIndex(df2['Date']).month
df2.set_index(['Month','Day'],inplace=True)
df2
df2['New']=df2[df2['Values']<df1['Values']]
gives
ValueError: Can only compare identically-labeled Series objects
I have also tried
df2['New']=df2[df2['Values'].apply(lambda x: x < df1['Values'].values)]
The best way to handle your problem is to use NumPy. NumPy has a function called where that helps a lot in cases like this.
This is how the statement works:
df1['new column that will contain the comparison results'] = np.where(condition,'value if true','value if false').
First import the library:
import numpy as np
Using the condition provided by you:
df2['New'] = np.where(df2['Values'] > df1['Values'], df2['Values'],'')
So, I think that solves your problem. You can change the value passed for the False case to anything you want; this is only an example.
Tell us if it worked!
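Here is a minimal sketch of the np.where pattern on two small frames indexed by month and day. The helper `index_by_month_day` and the sample values are made up for illustration, and it assumes each (Month, Day) label is unique within a frame. The second frame is reindexed to the first frame's labels before the comparison, so the two Series are identically labeled, and np.nan is used for the False case to keep the new column numeric.

```python
import numpy as np
import pandas as pd

def index_by_month_day(dates, values):
    """Build a frame indexed by (Month, Day); hypothetical helper."""
    frame = pd.DataFrame({"Date": pd.to_datetime(dates), "Values": values})
    frame["Month"] = frame["Date"].dt.month
    frame["Day"] = frame["Date"].dt.day
    return frame.set_index(["Month", "Day"])

df1 = index_by_month_day(
    ["2015-01-01", "2015-01-02", "2015-01-03"], [-5.6, 0.0, 3.9])
# Same days in another year, deliberately in a different row order.
df2 = index_by_month_day(
    ["2005-01-03", "2005-01-01", "2005-01-02"], [6.7, -13.3, -12.2])

# Reorder df2's values to follow df1's (Month, Day) labels.
df2_values = df2["Values"].reindex(df1.index)

# Keep df1's value where it is larger, otherwise leave a missing value.
df1["New"] = np.where(df1["Values"] > df2_values, df1["Values"], np.nan)
print(df1[["Values", "New"]])
```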
Let's try two possible solutions:
The first solution is to sort the index first.
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
Perform a simple test to see if it works!
df1 == df2
This may still raise an error; if that happens, try this correction instead:
df1.sort_index(inplace=True, axis=1)
df2.sort_index(inplace=True, axis=1)
The second solution is to drop the indexes and reset them:
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
Perform a simple test to see if it works!
df1 == df2
See if it works and tell us the result.

pandas dataframe manipulation without using loop

Please find the input and expected output below. For each store id and period id, 11 items should be present; if any item is missing, add it and fill that row with 0,
without using a loop.
Any help is highly appreciated.
input
Expected Output
You can do this:
Sample df:
df = pd.DataFrame({'store_id':[1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962, 1160962],
'period_id':[1025,1025,1025,1025,1025,1025,1026,1026,1026,1026,1026],
'item_x':[1,4,5,6,7,8,1,2,5,6,7],
'z':[1,4,5,6,7,8,1,2,5,6,7]})
Solution:
num = range(1,12)
def f(x):
    return x.reindex(num, fill_value=0)\
            .assign(store_id=x['store_id'].mode()[0], period_id = x['period_id'].mode()[0])
df.set_index('item_x').groupby(['store_id','period_id'], group_keys=False).apply(f).reset_index()
You can do:
from itertools import product
pdindex=product(df.groupby(["store_id", "period_id"]).groups, range(1,12))
pdindex=pd.MultiIndex.from_tuples(map(lambda x: (*x[0], x[1]), pdindex), names=["store_id", "period_id", "Item"])
df=df.set_index(["store_id", "period_id", "Item"])
res=pd.DataFrame(index=pdindex, columns=df.columns)
res.loc[df.index, df.columns]=df
res=res.fillna(0).reset_index()
Now this will work only assuming you don't have any Item outside of range [1,11].
This is a simplification of @GrzegorzSkibinski's correct answer.
This answer does not modify the original DataFrame. It uses fewer variables to store intermediate data structures and employs a list comprehension in place of map.
I'm also using reindex() rather than creating a new DataFrame from the generated index and populating it with the original data.
import pandas as pd
import itertools
df.set_index(
    ["store_id", "period_id", "Item_x"]
).reindex(
    pd.MultiIndex.from_tuples([
        group + (item,)
        for group, item in itertools.product(
            df.groupby(["store_id", "period_id"]).groups,
            range(1, 12),
        )],
        names=["store_id", "period_id", "Item_x"]
    ),
    fill_value=0,
).reset_index()
In testing, output matched what you listed as expected.
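For a quick way to see the reindex idea end to end, here is a reduced sketch with three items instead of eleven. The column name `item` and the sample values are made up, since the original input is not shown.

```python
import pandas as pd

# Hypothetical sample: each (store_id, period_id) pair is missing one item.
df = pd.DataFrame({
    "store_id": [1, 1, 1, 1],
    "period_id": [10, 10, 11, 11],
    "item": [1, 3, 2, 3],
    "z": [5, 7, 9, 11],
})
items = range(1, 4)  # the full set of items expected per pair

# Every existing (store_id, period_id) pair combined with every item.
pairs = df[["store_id", "period_id"]].drop_duplicates()
full_index = pd.MultiIndex.from_tuples(
    [(s, p, i) for s, p in pairs.itertuples(index=False) for i in items],
    names=["store_id", "period_id", "item"],
)

# Rows absent from the original data are created and filled with 0.
filled = (
    df.set_index(["store_id", "period_id", "item"])
      .reindex(full_index, fill_value=0)
      .reset_index()
)
print(filled)
```

This only builds rows for store and period pairs that already appear in the data, which matches the behaviour of the answers above.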

Selecting specific rows with multi-level indexes

I am learning Python on DataCamp, in a course titled Data Manipulation with pandas.
In an exercise I want to select some rows from a data frame, but I get an error and I really don't understand the error message.
Here is the code
# Load packages
import pandas as pd
# Import data
sales = pd.read_pickle("https://assets.datacamp.com/production/repositories/5386/datasets/b0e855c644024f850a7df3fe8b7bf0e23f7bd2fc/walmart_sales.pkl.bz2", 'bz2')
print(sales)
# Create multi-level indexes
sales_ind = sales.set_index(["store", "type"])
print(sales_ind)
# Define the rows to select
rows_to_keep = [(1, "A"), (2, "B")]
print(rows_to_keep)
# Use .loc to select the rows
print(sales_ind.loc[rows_to_keep])
The problem arises when I run print(sales_ind.loc[rows_to_keep]): I get the error message KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike', but I don't understand how to fix the code.
I am using Python 3.7.6 and pandas 1.0.0
I figured out your problem and it is not a problem with df.loc, but instead with the data you provided. There is no value in your multiindex that matches the tuple (2,'B').
You can see this if you try:
sales_ind.loc[(2,'B'),:]
which returns:
KeyError: (2, 'B')
while (1,'A') works just fine.
By the way, when selecting on a multiindex, I always recommend using pd.IndexSlice
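A brief sketch of both points, using a tiny made-up frame in place of the Walmart data. It shows one way to keep only the requested keys that actually exist in the index, followed by a pd.IndexSlice selection (which needs a sorted index).

```python
import pandas as pd

# Hypothetical stand-in for sales_ind; there is no (2, "B") row.
sales_ind = pd.DataFrame({
    "store": [1, 1, 2, 3],
    "type": ["A", "B", "A", "B"],
    "weekly_sales": [100.0, 110.0, 90.0, 95.0],
}).set_index(["store", "type"]).sort_index()

rows_to_keep = [(1, "A"), (2, "B")]

# A boolean mask skips missing keys instead of raising KeyError.
subset = sales_ind[sales_ind.index.isin(rows_to_keep)]

# IndexSlice: stores 1 through 2 (inclusive), type "A" only.
idx = pd.IndexSlice
sliced = sales_ind.loc[idx[1:2, "A"], :]

print(subset)
print(sliced)
```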

Resources