I am attempting to iterate through an index of a grouped-by dataframe (see comments in code).
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1':['A','A','B','B'], 'COL2': [1,1,2,2,], 'COL3': [2,3,4,6]})
#Here, I'm creating a copy of 'COL2' so that it's still a column after I assign it to
#the index.
#I'm guessing there's a better way to do this in the groupby line.
df['COL2_copy'] = df['COL2']
df = df.groupby(['COL2_copy'], as_index=True)
#This will actually be a more complex function (creating a chart in MatPlotLib based on the
#data frame slice (group) for each iteration through the index ('COL2')).
#I'll just keep it simple for now.
def pandfun(df):
    #Here's the real issue:
    df['COL4'] = np.trunc(np.round(df['COL3']*100,0))
    return df['COL4']
pandfun(df)
TypeError: unsupported operand type(s) for *: 'SeriesGroupBy' and 'int'
The desired results are 200 and 300 for the first group and 400 and 600 for the second group. To summarize, I believe the main problem is that I want to select individual groups of rows by index (i.e. 'COL2 == 1') and, within each group, refer to individual rows for a calculation.
I am taking this approach because I'll actually be using this with a MatPlotLib function that I created and I want to "burst" the data into one chart for each group in the dataframe, where each chart refers to individual row data for a given group.
I did this instead:
1. Get a unique list of values from COL2 and create a copy of df:
ulist = pd.unique(df['COL2'].ravel())
df2 = df.copy()
2. Iterate over that list, selecting the rows where COL2 matches:
for i in ulist:
    df = df2.loc[df2['COL2']==i]
3. Within each iteration, apply the function.
4. Within the MatPlotLib function, enter the following code at the top:
df = df.reset_index()
This served to reset the selection of rows after each iteration.
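For comparison, here is a minimal sketch of the same per-group processing done by iterating over a groupby object instead of a manual list of unique values. It assumes the sample frame from the question. The names results, key and group are only illustrative, and the simple calculation stands in for the real charting function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'A', 'B', 'B'],
                   'COL2': [1, 1, 2, 2],
                   'COL3': [2, 3, 4, 6]})

results = {}
# Iterating over a groupby object yields (key, sub-DataFrame) pairs,
# so COL2 stays available as an ordinary column inside each group.
for key, group in df.groupby('COL2'):
    group = group.reset_index(drop=True)  # rows numbered from 0 within the group
    group['COL4'] = np.trunc(np.round(group['COL3'] * 100, 0))
    results[key] = group  # a plotting function could be called here instead

print(results[1]['COL4'].tolist())
print(results[2]['COL4'].tolist())
```

With this approach there is no need to copy COL2 or to move it into the index, since the group key is handed to the loop directly.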
I have started using PowerBI and am using Python as a data source with the code below. The source data can be downloaded from here (it's about 700 megabytes). The data is originally from here (contained in IOT_2019_pxp.zip).
import pandas as pd
import numpy as np
import os
path = '/path/to/file'
to_chunk = pd.read_csv(os.path.join(path,'A.txt'), delimiter = '\t', header = [0,1], index_col = [0,1],
                       iterator=True, chunksize=1000)
def chunker(to_chunk):
    to_concat = []
    for chunk in to_chunk:
        try:
            to_concat.append(chunk['BG'].loc['BG'])
        except KeyError:
            #This chunk has no rows for region BG
            pass
    return to_concat
A = pd.concat(chunker(to_chunk))
I = np.identity(A.shape[0])
L = pd.DataFrame(np.linalg.inv(I-A), index=A.index, columns=A.columns)
The code simply:
Loads the file A.txt, which is a symmetrical matrix. This matrix has every sector in every region for both rows and columns. In pandas, these form a MultiIndex.
Filters just the region that I need which is BG. Since it's a symmetrical matrix, both row and column are filtered.
The inverse of I - A is calculated, giving us L, which I want to load into PowerBI. This matrix now has just a single regular Index for sector.
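The last step can be tried on a tiny example. This is only a sketch: the two-sector matrix and the sector names are made up, and it simply checks that the computed matrix is the inverse of I - A while the labels are kept on both axes.

```python
import numpy as np
import pandas as pd

# Hypothetical 2-sector coefficient matrix for one region (values are made up).
sectors = pd.Index(['agriculture', 'industry'], name='sector')
A = pd.DataFrame([[0.1, 0.2],
                  [0.3, 0.4]], index=sectors, columns=sectors)

I = np.identity(A.shape[0])
# L = (I - A)^-1, keeping the sector labels on both axes
L = pd.DataFrame(np.linalg.inv(I - A.to_numpy()), index=A.index, columns=A.columns)

# Sanity check: (I - A) @ L should be the identity matrix
check = (I - A.to_numpy()) @ L.to_numpy()
print(np.allclose(check, I))
```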
This is all well and good; however, when I load it into PowerBI, the first column (sector names for each row, i.e. the DataFrame Index) disappears. When the query gets processed, it is as if it were never there. This is true for both dataframes A and L, so it's not an issue of data processing. The column of row names (the DataFrame index) is still there in Python; PowerBI just drops it for some reason.
I need this column so that I can link these tables to other tables in my data model. Any ideas on how to keep it from disappearing at load time?
For what it's worth, calling reset_index() removed the index from the dataframes and they got loaded like regular columns. For whatever reason, PBI does not properly load pandas indices.
For a regular 1D index, I had to do S.reset_index().
For a MultiIndex, I had to do L.reset_index(inplace=True).
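A small sketch of what reset_index() does to the frame, using a made-up stand-in for L (the sector labels and values are hypothetical). It only shows the pandas side; how PowerBI treats the result is as described above.

```python
import pandas as pd

# Hypothetical frame standing in for L: sector names live in the index.
L = pd.DataFrame({'s1': [1.2, 0.3], 's2': [0.4, 1.5]},
                 index=pd.Index(['s1', 's2'], name='sector'))

# reset_index() returns a new frame in which the index is an ordinary column
L_flat = L.reset_index()
print(L_flat.columns.tolist())
```

Note that without inplace=True the original frame is left unchanged, so the result has to be assigned to a name.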
Please find the input and expected output below. For each store id and period id, 11 items should be present; if any item is missing, add it and fill that row with 0,
without using a loop.
Any help is highly appreciated.
input
Expected Output
You can do this:
Sample df:
df = pd.DataFrame({'store_id':[1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962, 1160962],
'period_id':[1025,1025,1025,1025,1025,1025,1026,1026,1026,1026,1026],
'item_x':[1,4,5,6,7,8,1,2,5,6,7],
'z':[1,4,5,6,7,8,1,2,5,6,7]})
Solution:
num = range(1,12)
def f(x):
    return x.reindex(num, fill_value=0)\
            .assign(store_id=x['store_id'].mode()[0], period_id = x['period_id'].mode()[0])
df.set_index('item_x').groupby(['store_id','period_id'], group_keys=False).apply(f).reset_index()
You can do:
from itertools import product
pdindex=product(df.groupby(["store_id", "period_id"]).groups, range(1,12))
pdindex=pd.MultiIndex.from_tuples(map(lambda x: (*x[0], x[1]), pdindex), names=["store_id", "period_id", "Item"])
df=df.set_index(["store_id", "period_id", "Item"])
res=pd.DataFrame(index=pdindex, columns=df.columns)
res.loc[df.index, df.columns]=df
res=res.fillna(0).reset_index()
Now this will work only assuming you don't have any Item outside of range [1,11].
This is a simplification of @GrzegorzSkibinski's correct answer.
This answer does not modify the original DataFrame. It uses fewer variables to store intermediate data structures and employs a list comprehension in place of map.
I'm also using reindex() rather than creating a new DataFrame from the generated index and populating it with the original data.
import pandas as pd
import itertools
df.set_index(
    ["store_id", "period_id", "Item_x"]
).reindex(
    pd.MultiIndex.from_tuples([
        group + (item,)
        for group, item in itertools.product(
            df.groupby(["store_id", "period_id"]).groups,
            range(1, 12),
        )],
        names=["store_id", "period_id", "Item_x"]
    ),
    fill_value=0,
).reset_index()
In testing, output matched what you listed as expected.
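A related sketch, under a different assumption: if every store_id and period_id combination should appear (not only the pairs that occur in the data), the target index can be built with pd.MultiIndex.from_product. The frame below reuses the sample values from the first answer, including its lowercase item_x column name. The names full_index and filled are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'store_id': [1160962] * 11,
                   'period_id': [1025] * 6 + [1026] * 5,
                   'item_x': [1, 4, 5, 6, 7, 8, 1, 2, 5, 6, 7],
                   'z': [1, 4, 5, 6, 7, 8, 1, 2, 5, 6, 7]})

# Assumption: every store_id x period_id combination should appear,
# including combinations that never occur in the data.
full_index = pd.MultiIndex.from_product(
    [df['store_id'].unique(), df['period_id'].unique(), range(1, 12)],
    names=['store_id', 'period_id', 'item_x'],
)

filled = (df.set_index(['store_id', 'period_id', 'item_x'])
            .reindex(full_index, fill_value=0)
            .reset_index())

# Number of rows in each (store_id, period_id) group
print(filled.groupby(['store_id', 'period_id']).size().tolist())
```

With several stores that do not share the same periods, this variant produces more rows than the groups-based version above, so which one fits depends on the data.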
I have the following pandas dataframe:
import pandas as pd
data = {'Sentences':['Sentence1', 'Sentence2', 'Sentence3', 'Sentences4', 'Sentences5', 'Sentences6','Sentences7', 'Sentences8'],'Time':[1,0,0,1,0,0,1,0]}
df = pd.DataFrame(data)
print(df)
I was wondering how to extract all the "Sentences" according to the "Time" column. I want to gather all the "sentences" from the first "1" to the last "0".
Maybe the expected output explains it better:
[[Sentences1,Sentences2,Sentences3],[Sentences4,Sentences5,Sentences6],[Sentences7,Sentences8]]
Is this somehow possible? Sorry, I am very new to pandas.
Try this:
s = df['Time'].cumsum()
df.set_index([s, df.groupby(s).cumcount()])['Sentences'].unstack().to_numpy().tolist()
Output:
[['Sentence1', 'Sentence2', 'Sentence3'],
['Sentences4', 'Sentences5', 'Sentences6'],
['Sentences7', 'Sentences8', nan]]
Details:
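Use cumsum on Time so that each Time = 1 row and the Time = 0 rows that follow it share the same group label.
Next, use groupby with cumcount to number the rows within each group.
Lastly, use set_index and unstack to reshape the dataframe; shorter groups are padded with nan.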
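If the nan padding in the last row is not wanted, a variation on the same cumsum idea is to collect each group into a plain list. This is a sketch using the data from the question; the name result is illustrative.

```python
import pandas as pd

data = {'Sentences': ['Sentence1', 'Sentence2', 'Sentence3', 'Sentences4',
                      'Sentences5', 'Sentences6', 'Sentences7', 'Sentences8'],
        'Time': [1, 0, 0, 1, 0, 0, 1, 0]}
df = pd.DataFrame(data)

# Each 1 in Time starts a new group; cumsum turns that into a group label
s = df['Time'].cumsum()
# Collect the sentences of each group into a plain list (no padding needed)
result = df.groupby(s)['Sentences'].apply(list).tolist()
print(result)
```

The groups keep their own lengths, so the final list has two elements instead of three.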
I am new to Python 3 and trying to do chi-squared tests on columns in a pandas dataframe. My columns are in pairs: observed_count_column_1, expected_count_column_1, observed_count_column_2, expected_count_column_2 and so on. I would like to make a loop to get all column pairs done at once.
I can do this if I specify column index integers or column names manually.
This works:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv (r'count.csv')
chisquare(df.iloc[:,[0]], df.iloc[:,[1]])
This, trying with a loop, does not:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv (r'count.csv')
for n in [0,2,4,6,8,10]:
    chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
The loop code does not seem to run at all and I get no error but no output either.
I was wondering why this is happening and how can I actually approach this?
Thank you,
Dan
Consider building a data frame of chi-square results from list of tuples, then assign column names as indicators for observed and expected frequencies (subsetting even/odd columns by indexed notation):
# CREATE DATA FRAME FROM LIST OF TUPLES
# THEN ASSIGN COLUMN NAMES
chi_square_df = (pd.DataFrame([chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]]) \
                               for n in range(0,11,2)],
                              columns = ['chi_sq_stat', 'p_value'])
                   .assign(obs_freq = df.columns[::2],
                           exp_freq = df.columns[1::2])
                )
The chisquare() function returns two values, so you can try this:
for n in range(0, 11, 2):
    chisq, p = chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
    print('Chisq: {}, p-value: {}'.format(chisq, p))
You can find what it returns in the docs here https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
Thank you for the suggestions. Using the information from Parfait's comment that loops don't print, I managed to find a solution, although not as elegant as their own solution above.
for n in range(0, 11, 2):
    print(chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]]))
This gives the expected results.
Dan
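For anyone who wants to keep the results rather than only print them, here is a small self-contained sketch. The counts and column names are made up, since count.csv is not available here, and the columns are selected as 1-D Series so each test returns scalar values.

```python
import pandas as pd
from scipy.stats import chisquare

# Made-up counts; each observed/expected pair has the same total,
# which chisquare expects.
df = pd.DataFrame({'obs_1': [10, 20, 30], 'exp_1': [20, 20, 20],
                   'obs_2': [18, 22, 20], 'exp_2': [20, 20, 20]})

results = []
for n in range(0, df.shape[1], 2):
    # Select 1-D columns so the statistic and p-value come back as scalars
    stat, p = chisquare(df.iloc[:, n], f_exp=df.iloc[:, n + 1])
    results.append({'obs': df.columns[n], 'exp': df.columns[n + 1],
                    'chi_sq_stat': stat, 'p_value': p})

summary = pd.DataFrame(results)
print(summary)
```

Using range(0, df.shape[1], 2) instead of a hard-coded upper bound lets the loop follow the number of columns in the file.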
New here but I've been searching for hours now and can't seem to find the solution for this. What I'm trying to do is display an aggregate of a dataframe in a Bokeh chart. I tried using a groupby object but I get an error when passing the groupby object to the ColumnDataSource (as mentioned in the post below).
how use bokeh vbar chart parameter with groupby object?
Here's some sample code I'm using:
import numpy as np
import pandas
from bokeh.models import ColumnDataSource
df = pandas.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
group = df.groupby("A")
source = ColumnDataSource(group)
Getting this error:
ValueError: expected a dict or pandas.DataFrame, got <pandas.core.groupby.DataFrameGroupBy object at 0x103f7bfd0>
Any ideas as to how to plot the groupby object in a chart with Bokeh?
Thanks in advance!
I haven't used Bokeh; however, from what I see, you are passing a pandas.core.groupby.DataFrameGroupBy, and ColumnDataSource is expecting a pd.DataFrame. The problem is that groupby creates a data structure that resembles key-value storage: each group in the groups object has a key and a value, and that value is the DataFrame that you are looking for. Running your code as shown below will help you understand the data structure that results from applying groupby() to a DataFrame:
groups = df.groupby('A')
for group in groups:
    # get group key
    key = group[0]
    # Get group df
    group_df = group[1]
Notice that I replaced group = df.groupby('A') with groups = df.groupby('A')
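Building on that, a sketch of one way to get from the groupby object back to a plain DataFrame, which is the type the error message says is expected. The data is made up and uses a categorical column for grouping. Bokeh itself is not imported here, so passing the result to ColumnDataSource is left as an untested assumption.

```python
import pandas as pd

# Made-up data with a categorical grouping column
df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                   'B': [1.0, 3.0, 10.0, 20.0]})

groups = df.groupby('A')
# Each element of the iteration is a (key, DataFrame) pair
keys = [key for key, group_df in groups]

# Aggregating and resetting the index gives an ordinary DataFrame,
# the type named in the error message as acceptable input
agg_df = groups.mean().reset_index()
print(type(agg_df).__name__)
print(agg_df)
```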