Pandas: Use DataFrameGroupBy.filter() method to select DataFrame's rows with a value greater than the mean of the respective group - python-3.x

I am learning Python and Pandas and I am doing some exercises to understand how things work.
My question is the following: can I use the GroupBy.filter() method to select the DataFrame's rows that have a value (in a specific column) greater than the mean of the respective group?
For this exercise, I am using the "planets" dataset included in Seaborn: 1035 rows x 6 columns (column names: "method", "number", "orbital_period", "mass", "distance", "year").
In python:
import pandas as pd
import seaborn as sns
#Load the "planets" dataset included in Seaborn
data = sns.load_dataset("planets")
#Remove rows with NaN in "orbital_period"
data = data.dropna(how = "all", subset = ["orbital_period"])
#Set display of DataFrames for seeing all the columns:
pd.set_option("display.max_columns", 15)
#Group the DataFrame "data" by "method"
group1 = data.groupby("method")
#I obtain a DataFrameGroupBy object (group1) composed of 10 groups.
print(group1)
#Print the composition of the DataFrameGroupBy object "group1".
for lab, datafrm in group1:
    print(lab, "\n", datafrm, sep="", end="\n\n")
print()
print()
print()
#Define the filter_function that will be used by the filter method.
#I want a function that returns True whenever the "orbital_period" value of
#a row is greater than the mean of the corresponding group.
#This could have been done also directly with "lambda syntax" as argument
#of filter().
def filter_funct(x):
    #print(type(x))
    #print(x)
    return x["orbital_period"] > x["orbital_period"].mean()
dataFiltered = group1.filter(filter_funct)
print("RESULT OF THE FILTER METHOD:")
print()
print(dataFiltered)
print()
print()
Unfortunately, I get the following error when I run the script:
TypeError: filter function returned a Series, but expected a scalar bool
It looks like x["orbital_period"] does not behave as a vector, meaning that it does not return the single values of the Series...
Weirdly enough, the transform() method does not suffer from this problem. Indeed, on the same dataset (prepared as above), if I run the following:
#Define the transform_function that will be used by the transform() method.
#I want this function to subtract from each value in "orbital_period" the mean
#of the corresponding group.
def transf_funct(x):
    #print(type(x))
    #print(x)
    return x - x.mean()
print("Transform method runs:")
print()
#I directly assign the transformed values to the "orbital_period" column of the DataFrame.
data["orbital_period"] = group1["orbital_period"].transform(transf_funct)
print("RESULT OF THE TRANSFORM METHOD:")
print()
print(data)
print()
print()
print()
I obtain the expected result...
Do DataFrameGroupBy.filter() and DataFrameGroupBy.transform() have different behavior?
I know I can achieve what I want in many other ways but my question is:
Is there a way to achieve what I want making use of the DataFrameGroupBy.filter() method?

Can I use DataFrameGroupBy.filter to exclude specific rows within a group?
The answer is No. DataFrameGroupBy.filter uses a single Boolean value to characterize an entire group. The result of the filtering is to remove the entirety of a group if it is characterized as False.
DataFrameGroupBy.filter is very slow, so it's often advised to use transform to broadcast the single truth value to all rows within a group and then to subset the DataFrame¹. Here is an example of removing entire groups where the mean is <= 50. The filter method is 100x slower.
import pandas as pd
import numpy as np
N = 10000
df = pd.DataFrame({'grp': np.arange(0, N, 1) // 10,
                   'value': np.arange(0, N, 1) % 100})
# With Filter
%timeit df.groupby('grp').filter(lambda x: x['value'].mean() > 50)
#327 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# With Transform
%timeit df[df.groupby('grp')['value'].transform('mean') > 50]
#2.7 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Verify they are equivalent
(df.groupby('grp').filter(lambda x: x['value'].mean() > 50)
 == df[df.groupby('grp')['value'].transform('mean') > 50]).all().all()
#True
¹ The gain in performance comes from the fact that transform may allow you to use a GroupBy operation that is implemented in Cython, which is the case for mean. If this is not the case, filter may be just as performant, if not slightly better.
Finally, because DataFrameGroupBy.transform broadcasts a result to the entire group, it is the correct tool to use when needing to exclude specific rows within a group based on an overall group characteristic.
In the above example, if you want to keep rows within a group that are above the group mean it is
df[df['value'] > df.groupby('grp')['value'].transform('mean')]
# Compare each row to the mean of the group
# that the row belongs to
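Applied to the planets example from the question, a minimal sketch of the same idea (assuming data and group1 are prepared as in the question) would be:
#Broadcast each group's mean back to its rows, then keep the rows above it.
group_mean = group1["orbital_period"].transform("mean")
dataFiltered = data[data["orbital_period"] > group_mean]
print(dataFiltered)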

Related

Trying to use pandas to group all data in Column B [duplicate]

I have a large (about 12M rows) DataFrame df:
df.columns = ['word','documents','frequency']
The following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large DataFrame?
df.word.describe()
ran pretty well, so I really did not expect this Occurrences_of_Words DataFrame to take very long to build.
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)
In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
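For reference, a minimal sketch (assuming a df with a 'word' column as in the question) that rebuilds the Occurrences_of_Words table with value_counts:
# Count how many times each word occurs; value_counts returns a Series
# indexed by word, already sorted by count in descending order.
word_counts = df['word'].value_counts()
# If a two-column DataFrame like Occurrences_of_Words is needed:
Occurrences_of_Words = word_counts.rename_axis('word').reset_index(name='count')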
When you want to count the frequency of categorical data in a column of a pandas DataFrame, use df['Column_Name'].value_counts() (source).
Just an addition to the previous answers. Let's not forget that when dealing with real data there might be null values, so it's useful to also include those in the counting by using the option dropna=False (default is True)
An example:
>>> df['Embarked'].value_counts(dropna=False)
S      644
C      168
Q       77
NaN      2
Other possible approaches to counting occurrences are to use (i) Counter from the collections module, (ii) unique from the numpy library, and (iii) groupby + size in pandas.
To use collections.Counter:
from collections import Counter
out = pd.Series(Counter(df['word']))
To use numpy.unique:
import numpy as np
i, c = np.unique(df['word'], return_counts = True)
out = pd.Series(c, index = i)
To use groupby + size:
out = pd.Series(df.index, index=df['word']).groupby(level=0).size()
One very nice feature of value_counts that's missing in the above methods is that it sorts the counts. If having the counts sorted is absolutely necessary, then value_counts is the best method given its simplicity and performance (even though it still gets marginally outperformed by other methods especially for very large Series).
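If sorted counts are needed from one of the alternatives, a sort_values call gets there; for example, for the groupby + size variant:
# Sort the groupby/size counts in descending order to match value_counts output.
out = pd.Series(df.index, index=df['word']).groupby(level=0).size().sort_values(ascending=False)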
Benchmarks
(if having the counts sorted is not important):
If we look at runtimes, it depends on the data stored in the DataFrame columns/Series.
If the Series is dtype object, then the fastest method for very large Series is collections.Counter, but in general value_counts is very competitive.
However, if the Series is of dtype int, then the fastest method is numpy.unique.
Code used to produce the plots:
import perfplot
import numpy as np
import pandas as pd
from collections import Counter
def creator(n, dt='obj'):
    s = pd.Series(np.random.randint(2*n, size=n))
    return s.astype(str) if dt == 'obj' else s

def plot_perfplot(datatype):
    perfplot.show(
        setup = lambda n: creator(n, datatype),
        kernels = [lambda s: s.value_counts(),
                   lambda s: pd.Series(Counter(s)),
                   lambda s: pd.Series((ic := np.unique(s, return_counts=True))[1], index=ic[0]),
                   lambda s: pd.Series(s.index, index=s).groupby(level=0).size()
                   ],
        labels = ['value_counts', 'Counter', 'np_unique', 'groupby_size'],
        n_range = [2 ** k for k in range(5, 25)],
        equality_check = lambda *x: (d := pd.concat(x, axis=1)).eq(d[0], axis=0).all().all(),
        xlabel = '~len(s)',
        title = f'dtype {datatype}'
    )
plot_perfplot('obj')
plot_perfplot('int')

Python data source - first two columns disappear

I have started using PowerBI and am using Python as a data source with the code below. The source data can be downloaded from here (it's about 700 megabytes). The data is originally from here (contained in IOT_2019_pxp.zip).
import pandas as pd
import numpy as np
import os
path = "/path/to/file"
to_chunk = pd.read_csv(os.path.join(path, 'A.txt'), delimiter='\t', header=[0, 1], index_col=[0, 1],
                       iterator=True, chunksize=1000)
def chunker(to_chunk):
    to_concat = []
    for chunk in to_chunk:
        try:
            to_concat.append(chunk['BG'].loc['BG'])
        except:
            pass
    return to_concat

A = pd.concat(chunker(to_chunk))
I = np.identity(A.shape[0])
L = pd.DataFrame(np.linalg.inv(I-A), index=A.index, columns=A.columns)
The code simply:
1. Loads the file A.txt, which is a symmetric matrix. This matrix has every sector in every region for both rows and columns. In pandas, these form a MultiIndex.
2. Filters just the region that I need, which is BG. Since it's a symmetric matrix, both rows and columns are filtered.
3. Calculates the inverse of the matrix (I - A), giving us L, which I want to load into PowerBI. This matrix now has just a single regular Index for sector.
This is all well and good; however, when I load it into PowerBI, the first column (the sector names for each row, i.e. the DataFrame index) disappears. When the query gets processed, it is as if it were never there. This is true for both DataFrames A and L, so it's not an issue of data processing. The column of row names (the DataFrame index) is still there in Python; PowerBI just drops it for some reason.
I need this column so that I can link these tables to other tables in my data model. Any ideas on how to keep it from disappearing at load time?
For what it's worth, calling reset_index() moved the index out of the DataFrames, and the former index values got loaded like regular columns. For whatever reason, PBI does not properly load pandas indices.
For a regular 1D index, I had to do S.reset_index().
For a MultiIndex, I had to do L.reset_index(inplace=True).
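In the script from the question, that amounts to something like this before PowerBI picks up the result (a sketch, assuming A and L are built as above):
# Promote the (Multi)Index to ordinary columns so PowerBI keeps them as fields.
A = A.reset_index()
L.reset_index(inplace=True)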

Python3 - Return CSV with row-level errors for missing data

New to Python. I'm importing a CSV; then, if any data is missing, I need to return a CSV with an additional column indicating which rows are missing data. A colleague suggested that I import the CSV into a dataframe, then create a new dataframe with a "Comments" column, fill it with a comment on the intended rows, and append it to the original dataframe. I'm stuck at the step of filling my new dataframe, "dferr", with the correct number of rows to match up to "dfinput".
I have Googled "pandas csv return error column where data is missing", but haven't found anything related to creating a new CSV that marks bad rows. I don't even know if the proposed way is the best way to go about this.
import pandas as pd
dfinput = None
try:
    dfinput = pd.read_csv(r"C:\file.csv")
except:
    print("Uh oh!")
if dfinput is None:
    print("Ack!")
    quit(10)
dfinput.reset_index(level=None, drop=False, inplace=True, col_level=0,
                    col_fill='')
dferr = pd.DataFrame(columns=['comment'])
print("Empty DataFrame", dferr, sep='\n')
Expected results: "dferr" would have an index column with number of rows equal to "dfinput", and comments on the correct rows where "dfinput" has missing values.
Actual results: "dferr" is empty.
My understanding of 'missing data' here would be null values. It seems that for every row, you want the names of null fields.
df = pd.DataFrame([[1, 2, 3],
                   [4, None, 6],
                   [None, 8, None]],
                  columns=['foo', 'bar', 'baz'])
# Create a dataframe of True/False, True where a criterion is met
# (in this case, a null value)
nulls = df.isnull()
# Iterate through every row of *nulls*,
# and extract the column names where the value is True by boolean indexing
colnames = nulls.columns
null_labels = nulls.apply(lambda s:colnames[s], axis=1)
# Now you have a pd.Series where every entry is an array
# (technically, a pd.Index object)
# Pandas arrays have a vectorized .str.join method:
df['nullcols'] = null_labels.str.join(', ')
The .apply() method in pandas can sometimes be a bottleneck in your code; there are ways to avoid using this, but here it seemed to be the simplest solution I could think of.
EDIT: Here's an alternate one-liner (instead of using .apply) that might cut down computation time slightly:
import numpy as np
df['nullcols'] = [colnames[x] for x in nulls.values]
This might be even faster (a bit more work is required):
np.where(df.isnull(),df.columns,'')
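For reference, a sketch of that extra work, reusing the nulls and colnames objects defined above:
# np.where puts the column name where a value is null and '' elsewhere;
# joining the non-empty entries of each row gives one label string per row.
labels = np.where(nulls, colnames, '')
df['nullcols'] = [', '.join(name for name in row if name) for row in labels]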

What is the key to improving speed when we need action operations in big data processing with Spark?

I have recently been using Spark 1.5.1 to process Hadoop data. However, my experience with Spark so far has not been great because of the slowness of action operations (e.g., .count(), .collect()). My task can be described as follows:
I have a dataframe like this:
----------------------------
trans item_code item_qty
----------------------------
001 A 2
001 B 3
002 A 4
002 B 6
002 C 10
003 D 1
----------------------------
I need to find association rules between pairs of items, e.g. one unit of A implies one and a half units of B with a confidence of 0.8. The desired result DataFrame looks like this:
----------------------------
item1 item2 conf coef
----------------------------
A B 0.8 1.5
B A 1.0 0.67
A C 0.7 2.5
----------------------------
My method is to use FP-growth to generate frequent item sets first, and then keep the item sets containing one item and those containing two items. After that I can calculate the confidence of one item resulting in another. For example, having (itemset=[A], support=0.4), (itemset=[B], support=0.2), (itemset=[A,B], support=0.2), I can generate the association rules (rule=(A->B), confidence=0.5) and (rule=(B->A), confidence=1.0).
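In plain Python, that confidence calculation is just a ratio of supports; here is a toy sketch with the numbers above:
# conf(A -> B) = supp(A, B) / supp(A)
support = {('A',): 0.4, ('B',): 0.2, ('A', 'B'): 0.2}
conf_A_to_B = support[('A', 'B')] / support[('A',)]  # 0.5
conf_B_to_A = support[('A', 'B')] / support[('B',)]  # 1.0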
However, when I broadcast the one-item frequent item sets as a dictionary, the .collectAsMap action is really slow. I tried using .join and it was even slower. I even need to wait hours to see the result of rdd.count(). I know we should avoid action operations in Spark where possible, but sometimes they are unavoidable. So I am curious: what is the key to improving speed when we face action operations?
My code is here:
#!/usr/bin/python
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.mllib.fpm import FPGrowth
import time

#read raw data from database
def read_data():
    sql = """select t.orderno_nosplit,
                    t.prod_code,
                    t.item_code,
                    sum(t.item_qty) as item_qty
             from ioc_fdm.fdm_dwr_ioc_fcs_pk_spu_item_f_chain t
             group by t.prod_code, t.orderno_nosplit, t.item_code"""
    data = sql_context.sql(sql)
    return data.cache()
#calculate quantity coefficient of two items
def qty_coef(item1, item2):
    sql = """select t1.item, t1.qty from table t1
             where t1.trans in
             (select t2.trans from spu_table t2 where t2.item = '%s')
             and t1.trans in
             (select t3.trans from spu_table t3 where t3.item = '%s')""" % (item1, item2)
    df = sql_context.sql(sql)
    qty_item1 = df.filter(df.item_code == item1).agg({"item_qty": "sum"}).first()[0]
    qty_item2 = df.filter(df.item_code == item2).agg({"item_qty": "sum"}).first()[0]
    coef = float(qty_item2) / qty_item1
    return coef
def train(prod):
    spu = total_spu.filter(total_spu.prod_code == prod)
    print 'data length', spu.count(), time.strftime("%H:%M:%S")
    supp = 0.1
    conf = 0.7
    sql_context.registerDataFrameAsTable(spu, 'spu_table')
    sql_context.cacheTable('spu_table')
    print 'table register over', time.strftime("%H:%M:%S")
    trans_sets = spu.rdd.repartition(32).map(lambda x: (x[0], x[2])).groupByKey().mapValues(list).values().cache()
    print 'trans group over', time.strftime("%H:%M:%S")
    model = FPGrowth.train(trans_sets, supp, 10)
    print 'model train over', time.strftime("%H:%M:%S")
    model_f1 = model.freqItemsets().filter(lambda x: len(x[0]) == 1)
    model_f2 = model.freqItemsets().filter(lambda x: len(x[0]) == 2)
    #register model_f1 as dictionary
    model_f1_tuple = model_f1.map(lambda (U, V): (tuple(U)[0], V))
    model_f1Map = model_f1_tuple.collectAsMap()
    #convert model_f1Map to broadcast
    bc_model = sc.broadcast(model_f1Map)
    #generate association rules
    model_f2_conf = model_f2.map(lambda x: (x[0][0], x[0][1],
                                            float(x[1]) / bc_model.value[x[0][0]],
                                            float(x[1]) / bc_model.value[x[0][1]]))
    print 'conf calculation over', time.strftime("%H:%M:%S")
    model_f2_conf_flt = model_f2_conf.flatMap(lambda x: (x[0], x[1]))
    #filter the association rules by confidence threshold
    model_f2_conf_flt_ftr = model_f2_conf_flt.filter(lambda x: x[2] >= conf)
    #calculate the quantity coefficient for the filtered association rules
    #since we cannot use nested sql operations in rdd, I have to collect the rules to a list first
    asso_list = model_f2_conf_flt_ftr.map(lambda x: list(x)).collect()
    print 'coef calculation over', time.strftime("%H:%M:%S")
    for row in asso_list:
        row.append(qty_coef(row[0], row[1]))
    #rewrite the list to a dataframe
    asso_df = sql_context.createDataFrame(asso_list, ['item1', 'item2', 'conf', 'coef'])
    sql_context.clearCache()
    path = "hdfs:/user/hive/wilber/%s" % prod
    asso_df.write.mode('overwrite').parquet(path)
if __name__ == '__main__':
    sc = SparkContext()
    sql_context = HiveContext(sc)
    prod_list = sc.textFile('hdfs:/user/hive/wilber/prod_list').collect()
    total_spu = read_data()
    print 'spu read over', time.strftime("%H:%M:%S")
    for prod in list(prod_list):
        print 'prod', prod
        train(prod)

Pandas iterate over group with SeriesGroupBy objects to Burst Data in MatPlotLib

I am attempting to iterate through an index of a grouped-by dataframe (see comments in code).
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1': ['A','A','B','B'], 'COL2': [1,1,2,2], 'COL3': [2,3,4,6]})
#Here, I'm creating a copy of 'COL2' so that it's still a column after I assign it to
#the index.
#I'm guessing there's a better way to do this in the groupby line.
df['COL2_copy'] = df['COL2']
df = df.groupby(['COL2_copy'], as_index=True)
#This will actually be a more complex function (creating a chart in MatPlotLib based on the
#data frame slice (group) per iteration through the index ('COL2')).
#I'll just keep it simple for now.
def pandfun(df):
    #Here's the real issue:
    df['COL4'] = np.trunc(np.round(df['COL3']*100, 0))
    return df['COL4']
pandfun(df)
TypeError: unsupported operand type(s) for *: 'SeriesGroupBy' and 'int'
The desired results are 200 and 300 for the first group, and 400 and 600 for the second group. To summarize, I believe the main problem is that I want to select individual groups of rows by index (i.e. 'COL2' == 1) and, within each group, refer to individual rows for a calculation.
I am taking this approach because I'll actually be using this with a MatPlotLib function that I created and I want to "burst" the data into one chart for each group in the dataframe, where each chart refers to individual row data for a given group.
I did this instead:
1. Get a unique list of values from COL2 and create a copy of df:
ulist = pd.unique(df['COL2'].ravel())
df2 = df.copy()
2. Iterate over that list, selecting the rows where COL2 matches:
for i in ulist:
    df = df2.loc[df2['COL2'] == i]
3. Within each iteration, apply the function.
4. Within the MatPlotLib function, enter the following code at the top:
df.reset_index()
This served to reset the selection of rows after each iteration.
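For what it's worth, here is a minimal sketch of the same burst-per-group idea using groupby iteration directly (assuming the df from the question and keeping pandfun simple):
import numpy as np
import pandas as pd
df = pd.DataFrame({'COL1': ['A', 'A', 'B', 'B'], 'COL2': [1, 1, 2, 2], 'COL3': [2, 3, 4, 6]})
def pandfun(group):
    #Operate on a plain DataFrame slice for one group, so COL3 is an ordinary Series.
    group = group.copy()
    group['COL4'] = np.trunc(np.round(group['COL3'] * 100, 0))
    return group['COL4']
#Iterating over the GroupBy object yields (key, sub-DataFrame) pairs,
#so each call receives ordinary Series rather than SeriesGroupBy objects.
for key, group in df.groupby('COL2'):
    print(key, pandfun(group).tolist())
#1 [200.0, 300.0]
#2 [400.0, 600.0]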

Resources