Pandas .apply() function not always being called in python 3 - python-3.x

Hello, I want to increment a global variable 'count' through a function that is called via .apply() on a pandas DataFrame column of length 1458.
I have read other answers that mention .apply() not being in-place.
I followed their advice, but the count variable is still 4 at the end.
count = 0
def cc(x):
    global count
    count += 1
    print(count)

# Expected final value of count is 1458, but instead it is 4.
# I think it's 4 because 'PoolQC' is a categorical column with 4 possible values;
# I want count to be 1458 by the end, but it shows 4.
all_data['tempo'] = all_data['PoolQC'].apply(cc)

# prints 4 instead of 1458
print("Count final value is", count)

Yes, the observed effect occurs because the column has a categorical dtype. Pandas is being clever here: for a categorical Series it evaluates the applied function once per category rather than once per row. Is counting the only thing you're doing there? I guess not, but why do you need such a calculation? Can't you use df.shape?
Couple of options I see here:
You can change type of column
e.g.
all_data['tempo'] = all_data['PoolQC'].astype(str).apply(cc)
You can use different non-categorical column
You can use df.shape to see how many rows you have in the df.
You can use apply on the whole DataFrame, like all_data['tempo'] = all_data.apply(cc, axis=1).
In such a case you still can use whatever is in all_data['PoolQC'] within cc function, like:
def cc(x):
    global count
    count += 1
    print(count)
    return x['PoolQC']
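A minimal runnable sketch of the astype(str) workaround, using a made-up 5-row Series in place of all_data['PoolQC']: casting away the categorical dtype guarantees the function runs once per row, not once per category.

```python
import pandas as pd

count = 0
def cc(x):
    global count
    count += 1
    return x

# Hypothetical stand-in for all_data['PoolQC']: 5 rows, 3 categories
pool_qc = pd.Series(['Ex', 'Gd', 'Ex', 'Fa', 'Gd'], dtype='category')

# Casting to str drops the categorical dtype, so cc is called once per element
tempo = pool_qc.astype(str).apply(cc)
print(count)  # 5 (one call per row, not per category)
```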

Related

Using value_counts() and filter elements based on number of instances

I use the following code to create two arrays in a histogram, one for the counts (percentages) and the other for values.
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df # contains percentages
values = df.keys().tolist()
So, an output looks like
counts = 66.7, 8.3, 8.3, 8.3, 8.3
values = 1024, 356352, 73728, 16384, 4096
The problem is that some values appear only once, and I would like to ignore them. In the example above, only 1024 is repeated multiple times; the others appear only once. I could manually check the number of occurrences of each value in the row and ignore the ones that are not repeated.
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df # contains percentages
values = df.keys().tolist()
for v in values:
    # N = get_number_of_instances of v in row
    # if N == 1:
    #     remove v from row
I would like to know if there are other ways for that using the built-in functions in Pandas.
Some clarity requested on your question in comments above
If keys is a column and you want to retain only non-duplicated values, please try:
values = df.loc[~df['keys'].duplicated(keep=False), 'keys'].to_list()
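As a built-in pandas alternative (a sketch with made-up data mirroring the example output above): count occurrences once with value_counts, keep only the values that appear more than once, and recompute the percentages on the filtered row.

```python
import pandas as pd

# Hypothetical row matching the example: 1024 repeats, the rest are singletons
row = pd.Series([1024] * 8 + [356352, 73728, 16384, 4096])

vc = row.value_counts()
repeated = vc[vc > 1].index          # values occurring more than once

filtered = row[row.isin(repeated)]
counts = filtered.value_counts(normalize=True).mul(100).round(1)
values = counts.index.tolist()
print(values)  # [1024]
```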

Can I use pandas.DataFrame.apply() to make more than one column at once?

I have been given a function that takes the values in a row in a dataframe and returns a bunch of additional values.
It looks a bit like this:
my_func(row) -> (value1, value2, value3... valueN)
I'd like each of these values to be assigned to new columns in my dataframe. Can I use DataFrame.apply() to add multiple columns in one go, or do I have to add the columns one at a time?
It's obvious how I can use apply to generate one column at a time:
from eunrg_utils.testing_functions.random_dataframe import random_dataframe
df = random_dataframe(columns=2,rows=4)
df["X"] = df.apply(axis=1, func=lambda row:(row.A + row.B))
df["Y"] = df.apply(axis=1, func=lambda row:(row.A - row.B))
But what if the two columns I am adding are something that are more easily calculated together? In this case, I already have a function that gives me everything I need in one shot. I'd rather not have to call it multiple times or add a load of caching.
Is there a syntax I can use that would allow me to use apply to generate 2 columns at the same time? Something like this, but less broken:
# Broken Pseudo-code
from eunrg_utils.testing_functions.random_dataframe import random_dataframe
df = random_dataframe(columns=2,rows=4)
df["X"], df["Y"] = df.apply(axis=1, func=lambda row:(row.A + row.B, row.B-row.A))
What is the correct way to do something like this?
You can assign a list of column names. Pass result_type="expand" so the returned tuples are expanded into columns (without it, apply returns a Series of tuples that cannot be assigned to two columns):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(2,2)), columns=list('AB'))
df[["X", "Y"]] = df.apply(axis=1, func=lambda row: (row.A + row.B, row.B - row.A), result_type="expand")
print (df)
A B X Y
0 2 8 10 7
1 4 3 6 -1
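An alternative sketch (the toy frame and the column names X/Y are illustrative): have the applied function return a pd.Series, which apply expands into named columns without needing result_type.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 4], "B": [8, 3]})

def my_func(row):
    # Returning a Series makes apply build one output column per entry
    return pd.Series({"X": row.A + row.B, "Y": row.B - row.A})

df[["X", "Y"]] = df.apply(my_func, axis=1)
print(df)
```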

How can I replace a particular column in a data frame based on a condition (categorical variables)?

I need to replace the salary status to 1 or 0 respectively if the salary is greater than 50,000 or less than or equal to 50,000 in a df.
The DataFrame shape:30162*13
I have tried this:
data2['SalStat']=data2['SalStat'].map({"less than or equal to 50,000":0,"greater than 50,000":1})
I also tried data2['SalStat']
and loc without any success.
How can I do the same?
I think your solution is fine.
If you want to match only by a substring, e.g. 'greater', use Series.str.contains to build a boolean mask and convert it to 0/1:
data2['SalStat'] = data2['SalStat'].str.contains('greater').astype(int)
Or:
data2['SalStat'] = data2['SalStat'].str.contains('greater').view('i1')
Try this:
def status(d):
    return 0 if d == 'less than or equal to 50,000' else 1

data2['SalStat'] = list(map(status, data2['SalStat']))
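For comparison, the original map approach works as-is once the dictionary keys match the column's strings exactly (a sketch with made-up data; note that map turns any unmatched string into NaN):

```python
import pandas as pd

# Hypothetical sample with the two categories from the question
data2 = pd.DataFrame({"SalStat": [
    "greater than 50,000",
    "less than or equal to 50,000",
    "greater than 50,000",
]})

# map replaces each string with its numeric code; unmatched strings become NaN
data2["SalStat"] = data2["SalStat"].map({
    "less than or equal to 50,000": 0,
    "greater than 50,000": 1,
})
print(data2["SalStat"].tolist())  # [1, 0, 1]
```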

Counting in how many rows a value exists

I would like to count the frequency of words in a data frame. Here is an example of what i'm trying to achieve.
words = ['Dungeon',
'Crawling',
'Puzzle',
'RPG',]
desc =
0 [Dungeon, count, game, kid, draw, toddler, Unique]
1 [Beautiful, simple, music, application, toddle]
2 [Fun, intuitive, number, game, baby, toddler]
Note that desc is a 1690 rows pandas data frame.
Now I would like to check, for each words[i], how many rows of desc contain it.
I do not want a nested for loop, so I made a function that just checks whether the word is in a desc row, intending to apply() it to each row and then sum the results.
The function I got is:
def tmp(word, desc):
    return (word in desc)
However, when I use the following code: desc.apply(tmp, args=words[0]) I get an error stating: tmp() takes 2 positional arguments but 8 were given. Yet when I call it manually with values, tmp(words[0], desc[0]), it works just fine...
If you want to avoid loops, use the DataFrame constructor with DataFrame.isin, and count the True values with sum. (As an aside, the error above happens because args expects a tuple: args=words[0] unpacks the string 'Dungeon' into seven separate character arguments; use args=(words[0],) instead.)
s = pd.DataFrame(desc.tolist()).isin(words).sum(axis=1)
print(s)
0 1
1 0
2 0
dtype: int64
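Putting the answer together as a runnable sketch (the three example rows stand in for the 1690-row desc):

```python
import pandas as pd

words = ['Dungeon', 'Crawling', 'Puzzle', 'RPG']

desc = pd.Series([
    ['Dungeon', 'count', 'game', 'kid', 'draw', 'toddler', 'Unique'],
    ['Beautiful', 'simple', 'music', 'application', 'toddle'],
    ['Fun', 'intuitive', 'number', 'game', 'baby', 'toddler'],
])

# Expand each list into its own row of a DataFrame (one word per cell),
# test membership against words, then count the True values per row
s = pd.DataFrame(desc.tolist()).isin(words).sum(axis=1)
print(s.tolist())  # [1, 0, 0]
```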

Replace values in dataframe given count of group by

I have a dataframe of categorical variables. I want to replace all the fields in one column with an arbitrary unique string if the count of that category within the column is less than 100.
So, for example, in column color, if any color appears less than 100 times i want it to be replaced by the string "base"
I tried the code below, along with various things I found on Stack Overflow, but I get this error:
df['color'] = numpy.where(df.groupby("color").filter(lambda x: len(x) < 100), 'dummy', df['color'])
Operands could not be broadcast together with shapes (45638872,878) () (8765878782788,)
IIUC, you need this,
df.loc[df.groupby('color')['color'].transform('count')<100, 'color']= 'dummy'
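A small runnable sketch of the transform approach (made-up colors, and a threshold of 3 instead of 100 to keep the example tiny):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red"] * 5 + ["mauve"]})

# transform('count') broadcasts each group's size back onto every row,
# so the boolean mask aligns with the original index
mask = df.groupby("color")["color"].transform("count") < 3
df.loc[mask, "color"] = "dummy"
print(df["color"].tolist())  # ['red', 'red', 'red', 'red', 'red', 'dummy']
```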
