Unexpected output format in pandas.groupby.apply - python-3.x

Does anyone know why pandas behaves differently when the column we group by contains only 1 unique value? Specifically, if there is just 1 unique value and we return a pandas.Series, the returned output is basically transposed compared to the case with multiple unique values:
import pandas as pd

dt = pd.date_range('2021-01-01', '2021-01-02 23:00', closed=None, freq='1H')
df = pd.DataFrame({'date': dt.date, 'vals': range(dt.shape[0])}, index=dt)
dt1 = pd.date_range('2021-01-01', '2021-01-01 23:00', closed=None, freq='1H')
df2 = pd.DataFrame({'date': dt1.date, 'vals': range(dt1.shape[0])}, index=dt1)

def f(row):
    return row['vals']

print(df.groupby('date').apply(f).shape)
print(df2.groupby('date').apply(f).shape)
[out 1] (48,)
[out 2] (1, 24)
Is there some simple parameter I can use to make sure the behavior is consistent? Would it make sense to submit a bug report about the inconsistency, or is it "expected" (I understood from a previous question that sometimes poor design or a small quirk is not a bug)? (I still love pandas, it's just that these small things can make using it very painful.)

squeeze()
DataFrame.squeeze() and Series.squeeze() can make the shapes consistent:
>>> df.groupby('date').apply(f).squeeze().shape
(48,)
>>> df2.groupby('date').apply(f).squeeze().shape
(24,)
squeeze=True (deprecated)
groupby() has a squeeze param:
squeeze: Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
>>> df.groupby('date', squeeze=True).apply(f).shape
(48,)
>>> df2.groupby('date', squeeze=True).apply(f).shape
(24,)
This has been deprecated since pandas 1.1.0 and will be removed in the future.
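If you want consistent output without the deprecated parameter, a minimal sketch (behaviour depends on your pandas version) is to stack the wide single-group result yourself, which restores the same (date, timestamp) MultiIndex that the multi-group case produces:

result = df2.groupby('date').apply(f)
if isinstance(result, pd.DataFrame):
    # the single-group case comes back wide: 1 row per group, 1 column per
    # original index entry; stacking turns it back into a MultiIndex Series
    result = result.stack()
print(result.shape)  # (24,)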

Related

Nested loops altering rows in pandas - Avoiding "A value is trying to be set on a copy of a slice from a DataFrame"

Summary
I am trying to loop through a pandas dataframe, and to run a secondary loop at each iteration. The secondary loop calculates something that I want to append into the original dataframe, so that when the primary loop advances, some of the rows are recalculated based on the changed values. (For those interested, this is a simple advective model of carbon accumulation in soils. When a new layer of soil is deposited, mixing processes penetrate into older layers and transform their properties to a set depth. Thus, each layer deposited changes those below it incrementally, until a former layer lies below the mixing depth.)
I have produced an example of how I want this to work, however it is generating the common error message:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
I have looked into the linked information in the error message as well as myriad posts on this forum, but none of them addresses continually looping through a changing dataframe.
What I've tried, and some possible solutions
Below is some example code. This code works more or less as well as I want it to. But it produces the warning. Should I:
Suppress the warning and continue working with this architecture? In this case, am I asking for trouble with un-reproducible results?
Try a different architecture altogether, like a numpy array from the original dataframe?
Try df.append() or df.copy() to avoid the warning?
I have tried df.copy() to no avail - the warning was still thrown.
Example code:
import pandas as pd

a = pd.DataFrame(
    {
        'a': [x/2 for x in range(1, 11)],
        'b': ['hot dog', 'slider', 'watermelon', 'funnel cake', 'cotton candy', 'lemonade', 'fried oreo', 'ice cream', 'corn', 'sausage'],
        'c': ['meat', 'meat', 'vegan', 'vegan', 'vegan', 'vegan', 'dairy', 'dairy', 'vegan', 'meat']
    }
)
print(a)

z = [x/(x+2) for x in range(1, 5)]
print(z)

# Primary loop through rows of the main dataframe
for ind, row in a.iterrows():
    # Pull out a chunk of the dataframe. This is the portion of the dataframe that will be modified. What is below
    # this is already modified and locked into the geological record. What is above has not yet been deposited.
    b = a.iloc[ind:(ind + len(z)), :]
    # Define the size of the secondary loop. Taking the minimum avoids the model mixing below the boundary layer (key error)
    loop = min([len(z), len(b)])
    # Now loop through the sub-dataframe and change accordingly.
    for fraction in range(loop):
        b['a'].iloc[fraction] = b['a'].iloc[fraction] * z[fraction]
    # Append the original dataframe with new data:
    a.iloc[ind:(ind + loop), :] = b
    # Try df.copy(), but still throws warning!
    # a.iloc[ind:(ind+loop), :] = b.copy()

print(a)
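For what it's worth, a minimal sketch of a warning-free variant of the loop above (same arithmetic, just without the intermediate slice b): write each value straight back into a with a single positional .iloc assignment.

# a and z defined as in the example above
col_a = a.columns.get_loc('a')

for ind in range(len(a)):
    # mix into at most len(z) layers at and below the current one,
    # clipping at the bottom of the frame (same role as `loop` above)
    loop = min(len(z), len(a) - ind)
    for fraction in range(loop):
        # one positional assignment on the original frame: no intermediate
        # slice, so there is nothing for SettingWithCopyWarning to flag
        a.iloc[ind + fraction, col_a] *= z[fraction]

print(a)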

Sum all counts when their fuzz.WRatio > 90 otherwise leave intact

What I want to do is group all similar strings in one column and sum their corresponding counts if they are similar; otherwise, leave them as they are.
A little similar to this post. Unfortunately I have not been able to apply this to my case:
How to group Pandas data frame by column with regex match
Unfortunately, I ended up with the following steps:
I wrote a function that prints out the fuzz.WRatio for each row of strings, where each row does a linear search from the top to check whether there are other similar strings in the rest of the rows. If the WRatio > 90, I would like to sum those rows' corresponding counts. Otherwise, leave them as they are.
I created a test data looking like this:
test_data = pd.DataFrame({
    'name': ['Apple.Inc.', 'apple.inc', 'APPLE.INC', 'OMEGA'],
    'count': [4, 3, 2, 6]
})
So what I want is a result dataframe like:
result = pd.DataFrame({
    'Nname': ['Apple.Inc.', 'OMEGA'],
    'Ncount': [9, 6]
})
My function so far only gives me the fuzz ratio for each row, and as I understand it, each row is compared against the others three times (here we have four rows). So my function's output would look like:
pd.DataFrame({
    'Nname': ['Apple.Inc.', 'Apple.Inc.', 'Apple.Inc.', 'apple.inc',
              'apple.inc', 'apple.inc'],
    'Ncount': [4, 4, 4, 3, 3, 3],
    'FRatio': [100, 100, 100, 100, 100, 100]
})
This is just one portion of the whole output from the function I wrote with this test data. The last row, "OMEGA", would give me a fuzz ratio of about 18.
My function is like this:
from fuzzywuzzy import fuzz
import pandas as pd

def checkDupTitle2(data):
    Nname = []
    Ncount = []
    f_ratio = []
    for i in range(0, len(data)):
        current = 0
        count = 0
        space = 0
        for space in range(0, len(data) - 1 - current):
            ratio = fuzz.WRatio(str(data.loc[i]['name']).strip(),
                                str(data.loc[current + space]['name']).strip())
            Nname.append(str(data.loc[i]['name']).strip())
            Ncount.append(str(data.loc[i]['count']).strip())
            f_ratio.append(ratio)
    df = pd.DataFrame({
        'Nname': Nname,
        'Ncount': Ncount,
        'FRatio': f_ratio
    })
    return df
So after running this function and getting the output, I tried to get what I eventually want. Here I tried a groupby on the df created above:
output.groupby(output.FRatio>90).sum()
But this way I still need a "name" in my dataframe; how do I decide which name to use for this total count, say, 9 here:
"Apple.Inc." or "apple.inc" or "APPLE.INC"?
Or did I make it too complex?
If there is a way to group by "name" at the very start and treat "Apple.Inc.", "apple.inc" and "APPLE.INC" all the same, then my problem is solved. I have been stumped for quite a while. Any help would be highly
appreciated! Thanks!
The following code uses my library RapidFuzz instead of FuzzyWuzzy, since it is faster and has a process method, extractIndices, which helps here. This solution is quite a bit faster, but since I do not work with pandas regularly I am sure there are still some things that could be improved :)
import pandas as pd
from rapidfuzz import process, utils

def checkDupTitle(data):
    values = data.values.tolist()
    companies = [company for company, _ in values]
    pcompanies = [utils.default_process(company) for company in companies]
    counts = [count for _, count in values]
    results = []

    while companies:
        company = companies.pop(0)
        pcompany = pcompanies.pop(0)
        count = counts.pop(0)

        duplicates = process.extractIndices(
            pcompany, pcompanies,
            processor=None, score_cutoff=90, limit=None)

        for (i, _) in sorted(duplicates, reverse=True):
            count += counts.pop(i)
            del pcompanies[i]
            del companies[i]

        results.append([company, count])

    return pd.DataFrame(results, columns=['Nname', 'Ncount'])


test_data = pd.DataFrame({
    'name': ['Apple.Inc.', 'apple.inc', 'APPLE.INC', 'OMEGA'],
    'count': [4, 3, 2, 6]
})

checkDupTitle(test_data)
The result is:
pd.DataFrame({
    'Nname': ['Apple.Inc.', 'OMEGA'],
    'Ncount': [9, 6]
})
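As an aside, if the only variation in the names really is letter case and a trailing dot (as in this test data), the "group by name at the very start" idea from the question works without any fuzzy matching. A minimal sketch, assuming pandas 0.25+ for the named aggregation:

key = test_data['name'].str.lower().str.rstrip('.')
result = (test_data.groupby(key)
          .agg(Nname=('name', 'first'), Ncount=('count', 'sum'))
          .reset_index(drop=True))
#         Nname  Ncount
# 0  Apple.Inc.       9
# 1       OMEGA       6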

Yet another Pandas SettingWithCopyWarning question

Yes this question has been asked many times! No, I have still not been able to figure out how to run this boolean filter without generating the Pandas SettingWithCopyWarning warning.
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    df_D['count'].iloc[x] = len(df_C)  # triggers warning
I've tried:
Copying df_A and df_B in every possible place
Using a mask
Using a query
I know I can suppress the warning, but I don't want to do that.
What am I missing? I know it's probably something obvious.
Many thanks!
For more details on why you get the SettingWithCopyWarning, I suggest you read this answer. It is mostly because selecting the column df_D['count'] and then using iloc[x] performs a "chained assignment", which is what gets flagged.
To prevent it, you can get the position of the column you want in df_D and then use iloc for both the row and the column inside the for loop:
pos_col_D = df_D.columns.get_loc('count')

for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    df_D.iloc[x, pos_col_D] = len(df_C)  # no more warning
Also, because you compare all the values of df_A.age with the bounds of df_B.age_limits, I think you could improve the speed of your code using numpy.ufunc.outer, with the ufuncs being greater_equal and less_equal, and then summing over axis=0.
# Setup
import numpy as np
import pandas as pd

df_A = pd.DataFrame({'age': [12, 25, 32]})
df_B = pd.DataFrame({'age_limits': [[3, 99], [20, 45], [15, 30]]})

# your result
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    print(len(df_C))
3
2
1

# with numpy
print((np.greater_equal.outer(df_A.age, df_B.age_limits.str[0])
       & np.less_equal.outer(df_A.age, df_B.age_limits.str[1]))
      .sum(0))
array([3, 2, 1])
So you can assign the result of the previous line directly to df_D['count'] without a for loop. Hope this works for you.
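Concretely, the loop-free assignment could look like this (a sketch, assuming df_D shares df_B's index):

df_D = pd.DataFrame(index=df_B.index)
df_D['count'] = (np.greater_equal.outer(df_A.age, df_B.age_limits.str[0])
                 & np.less_equal.outer(df_A.age, df_B.age_limits.str[1])).sum(0)
print(df_D)
#    count
# 0      3
# 1      2
# 2      1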

pandas.Dataframe() mixed data types and strange .fillna() behaviour

I have a dataframe which has two dtypes: Object (was expecting string) and Datetime (expected datetime). I don't understand this behavior and why it affects my fillna().
Calling .fillna() with inplace=True wipes the data denoted as int64 despite being changed with .astype(str)
Calling .fillna() without it does nothing.
I know pandas / numpy dtypes are different from the native Python types, but is this correct behavior, or am I getting something terribly wrong?
sample:
import random
import numpy as np
import pandas as pd

sample = pd.DataFrame({'A': [random.choice(['aabb', np.nan, 'bbcc', 'ccdd']) for x in range(15)],
                       'B': [random.choice(['2019-11-30', '2020-06-30', '2018-12-31', '2019-03-31']) for x in range(15)]})
sample.loc[:, 'B'] = pd.to_datetime(sample['B'])

for col in sample.select_dtypes(include='object').columns.tolist():
    sample.loc[:, col].astype(str).apply(lambda x: str(x).strip().lower()).fillna('NULL')

for col in sample.columns:
    print(sample[col].value_counts().head(15))
    print('\n')
Here neither 'NULL' nor 'nan' appears. I added .replace('nan', 'NULL'), but still nothing. Can you give me a clue what to look for, please? Many thanks.
The problem here is that converting the missing values to strings means fillna can no longer do anything. The solution is to use the pandas functions Series.str.strip and Series.str.lower, which handle missing values nicely, and to assign the result back to the column:
for col in sample.select_dtypes(include='object').columns:
    sample[col] = sample[col].str.strip().str.lower().fillna('NULL')
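As a quick check on the sample above (exact counts vary because the data is random), the placeholder should now appear in the value counts:

print(sample['A'].value_counts())
# 'NULL' now shows up alongside 'aabb', 'bbcc' and 'ccdd'
# (fillna runs after .str.lower(), so the placeholder keeps its upper case)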

how to make pandas read_csv handle numpy str (or unicode) scalar datatypes

Whenever I read a CSV file that has a column of strings, I've found that by default pandas gives its dtype as object. I've tried to use mydf['mycol'].astype(str) to change the dtype of a column mycol from object to str, but it didn't work: it didn't give me an error, but at the same time the dtype remained the same.
I read that pandas is built on top of numpy, and that numpy allows for both str_ and unicode_ (see here: numpy scalar types). I'm NOT very familiar with the internal workings of pandas and NOT familiar with numpy in general.
Is there anything I can do when using pandas.io.parsers.read_csv to make sure that a column of strings in the CSV file is read as a dtype of str rather than object?
More specifically, what parameters (from those given below) do I need to use to achieve this?
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None,
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0,
skipinitialspace=False, lineterminator=None, header='infer', index_col=None,
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0,
na_values=None, na_fvalues=None, true_values=None, false_values=None,
delimiter=None, converters=None, dtype=None, usecols=None, engine=None,
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False,
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True,
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None,
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False,
date_parser=None, memory_map=False, float_precision=None, nrows=None,
iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False,
mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False,
skip_blank_lines=True)
Somewhat related: is there any variable / flag in the parameters of pandas.io.parsers.read_csv that can automatically read a missing string in a string column as '' (empty string) rather than as nan?
Also, many of the parameters that can be passed to pandas.io.parsers.read_csv are NOT described in the documentation (pandas.io.parsers.read_csv.html), for example na_fvalues, use_unsigned, compact_ints, etc. Aside from reading the code (which is a bit long), is there any other place where more detailed documentation for all the parameters is available?
Not using numpy's string datatype was a technical decision taken by Wes: numpy allocates every string in an array with the same fixed size.
In most real-world use cases strings are not fixed size, and often a few are very long. It's wasteful to allocate a very large contiguous block of memory (and IIRC, counterintuitively, it can be slower!) to store them as if they were fixed size:
In [11]: np.array(["ab", "a"])  # The 2 is the length
Out[11]:
array(['ab', 'a'],
dtype='|S2')
In [12]: np.array(['this is a very long string', 'a', 'b', 'c'])
Out[12]:
array(['this is a very long string', 'a', 'b', 'c'],
dtype='|S26')
To give a silly example, here is a case where the object dtype takes up less memory:
In [21]: a = np.array(['a'] * 99 + ['this is a very long string, really really really really really long, oh yes'])
In [22]: a.nbytes
Out[22]: 7500
In [23]: b = a.astype(object)
In [24]: b.nbytes + sum(sys.getsizeof(item) for item in b)
Out[24]: 4674
There's also some "surprising" behaviour of numpy strings (also due to their layout):
In [31]: a = np.array(['ab', 'c'])
In [32]: a[1] = 'def'
In [33]: a # what the f?
Out[33]:
array(['ab', 'de'],
dtype='|S2')
If you wanted to fix this behaviour - and keep the numpy string dtype - you would have to make a copy for every assignment. (With object arrays you get this for free: you simply change the pointer.)
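To illustrate (a small sketch; on Python 3 numpy infers a unicode dtype like '<U2' instead of '|S2', but the truncation behaves the same way), a copy with a wider dtype is exactly what avoids the surprise:

import numpy as np

a = np.array(['ab', 'c'])   # dtype '<U2': every element is capped at 2 characters
a = a.astype('<U10')        # an explicit copy with room for longer strings
a[1] = 'def'
print(a)        # ['ab' 'def']  -- no truncation after the copy
print(a.dtype)  # <U10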
Hence in pandas strings are stored using the object dtype.
Note: I thought there was a section of the docs which discussed this decision but I can't seem to locate it...
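A present-day aside, since it postdates this answer: pandas 1.0+ ships a dedicated nullable string dtype that read_csv can produce directly, and keep_default_na=False is the usual way to read missing strings as '' rather than NaN. A minimal sketch, with 'data.csv' as a hypothetical filename:

import pandas as pd

# opt in to the dedicated string dtype (pandas >= 1.0) instead of object
df = pd.read_csv('data.csv', dtype={'mycol': 'string'})
print(df['mycol'].dtype)  # string

# read missing entries in string columns as '' instead of NaN
df2 = pd.read_csv('data.csv', keep_default_na=False)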
