Can I use pandas.DataFrame.apply() to make more than one column at once? - python-3.x

I have been given a function that takes the values in a row in a dataframe and returns a bunch of additional values.
It looks a bit like this:
my_func(row) -> (value1, value2, value3... valueN)
I'd like each of these values to be assigned to new columns in my dataframe. Can I use DataFrame.apply() to add multiple columns in one go, or do I have to add columns one at a time?
It's obvious how I can use apply to generate one column at a time:
from eunrg_utils.testing_functions.random_dataframe import random_dataframe
df = random_dataframe(columns=2,rows=4)
df["X"] = df.apply(axis=1, func=lambda row:(row.A + row.B))
df["Y"] = df.apply(axis=1, func=lambda row:(row.A - row.B))
But what if the two columns I am adding are something that are more easily calculated together? In this case, I already have a function that gives me everything I need in one shot. I'd rather not have to call it multiple times or add a load of caching.
Is there a syntax I can use that would allow me to use apply to generate 2 columns at the same time? Something like this, but less broken:
# Broken Pseudo-code
from eunrg_utils.testing_functions.random_dataframe import random_dataframe
df = random_dataframe(columns=2,rows=4)
df["X"], df["Y"] = df.apply(axis=1, func=lambda row:(row.A + row.B, row.B-row.A))
What is the correct way to do something like this?

You can assign a list of column names like:
df = pd.DataFrame(np.random.randint(10, size=(2,2)),columns=list('AB'))
df[["X", "Y"]] = df.apply(axis=1, func=lambda row:(row.A + row.B, row.B-row.A))
print (df)
   A  B   X  Y
0  2  8  10  6
1  4  3   7 -1
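If the tuple assignment above does not work on your pandas version (some versions refuse to assign a Series of tuples to a list of columns), a commonly used alternative is to have apply expand the result into columns. A minimal sketch, assuming the same A/B columns as above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(2, 2)), columns=list('AB'))

# result_type='expand' turns the returned tuple into separate columns
df[['X', 'Y']] = df.apply(
    lambda row: (row.A + row.B, row.B - row.A),
    axis=1,
    result_type='expand',
)

# Equivalent: return a pd.Series so apply produces a DataFrame directly
df[['X2', 'Y2']] = df.apply(
    lambda row: pd.Series({'X2': row.A + row.B, 'Y2': row.B - row.A}),
    axis=1,
)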

Related

Using value_counts() and filter elements based on number of instances

I use the following code to create two arrays in a histogram, one for the counts (percentages) and the other for values.
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df # contains percentages
values = df.keys().tolist()
So, an output looks like
counts = 66.7, 8.3, 8.3, 8.3, 8.3
values = 1024, 356352, 73728, 16384, 4096
The problem is that some values occur only once and I would like to ignore them. In the example above, only 1024 is repeated multiple times; the others appear only once. I could manually check the number of occurrences in the row and ignore the values that are not repeated.
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df  # contains percentages
values = df.keys().tolist()
occurrences = row.value_counts()  # absolute number of occurrences of each value
for v in values:
    if occurrences[v] == 1:
        row = row[row != v]  # remove v from row
I would like to know if there are other ways for that using the built-in functions in Pandas.
Some clarity was requested on your question in the comments above.
If keys is a column and you want to retain the non-duplicates, please try:
values=df.loc[~df['keys'].duplicated(keep=False), 'keys'].to_list()
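If the goal is instead to drop the single-occurrence values before computing the percentages (as the question describes), a minimal sketch using value_counts, with made-up data matching the example above:
import pandas as pd

# Made-up data: 1024 appears 8 times, every other value appears once
row = pd.Series([1024] * 8 + [356352, 73728, 16384, 4096])

occurrences = row.value_counts()               # absolute count per value
repeated = occurrences[occurrences > 1].index  # values appearing more than once
filtered = row[row.isin(repeated)]             # drop single-occurrence values

df = filtered.value_counts(normalize=True).mul(100).round(1)
counts = df.tolist()       # percentages, now only for repeated values
values = df.index.tolist()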

split multiple values into two columns based on a single separator

I am new to pandas. I have a situation where I want to split the length column into two columns, a and b. The values in the length column come in pairs. For each pair, the smaller value should go into a and the larger into b; then the next pair on the same row is compared the same way.
I have a hundred rows. I don't think I can use str.split because there are multiple values with the same delimiter. I have no idea how to do it.
The output should be the same as this.
Any help will be appreciated.
length                                            a                     b
{22.562,"35.012","25.456",37.342,24.541,38.241}   22.562,25.456,24.541  35.012,37.342,38.241
{21.562,"37.012",25.256,36.342}                   21.562,25.256         37.012,36.342
{22.256,36.456,26.245,35.342,25.56,"36.25"}       22.256,26.245,25.56   36.456,35.342,36.25
I have tried
df['a'] = df['length'].str.split(',').str[0::2]
df['b'] = df['length'].str.split(',').str[1::3]
With this code, column b's output is perfect, but column a prints the first full pair, then the second. It is not giving only the 0th, 2nd, and 4th values.
The problem comes from the fact that your length column is made of sets, not lists.
Here is a way to do what you want by casting your length column as lists:
df['length'] = [list(x) for x in df.length] # We cast the sets as lists
df['a'] = [x[0::2] for x in df.length]
df['b'] = [x[1::2] for x in df.length]
Output:
                                              length                         a                         b
0  [35.012, 37.342, 38.241, 22.562, 24.541, 25.456]  [35.012, 38.241, 24.541]  [37.342, 22.562, 25.456]
1                   [25.256, 36.342, 21.562, 37.012]          [25.256, 21.562]          [36.342, 37.012]
2     [35.342, 36.456, 36.25, 22.256, 25.56, 26.245]    [35.342, 36.25, 25.56]  [36.456, 22.256, 26.245]
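Note that casting a set to a list does not preserve the original pair order, which is why the values above land in a and b in a different order than in the desired output. If the column is already a list in the original pair order, one way to guarantee that the smaller of each pair goes to a and the larger to b is a small helper like this sketch (split_pairs is a hypothetical name, not an existing pandas function):
import pandas as pd

def split_pairs(values):
    # Put the smaller of each consecutive pair into a, the larger into b
    a, b = [], []
    for first, second in zip(values[0::2], values[1::2]):
        a.append(min(first, second))
        b.append(max(first, second))
    return pd.Series({'a': a, 'b': b})

df = pd.DataFrame({'length': [[22.562, 35.012, 25.456, 37.342, 24.541, 38.241],
                              [21.562, 37.012, 25.256, 36.342]]})
df[['a', 'b']] = df['length'].apply(split_pairs)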

Transforming multiple data frame columns into one series

I have a dataset df(250,3): 250 rows and three columns. I want to write a loop that merges the content of each column in my dataframe into one single series (250,1) of 250 rows and 1 column, 'df_single'. The manual operation is the following:
df_single = df['colour']+" "+df['model']+" "+df['size']
How can I create df_single with a for loop, or non-manually?
I tried to write this code, but it raises a TypeError:
df_conc = []
for var in cols:
    cat_list = df_code_part[var]
    df_conc = df_conc + " " + cat_list
TypeError: can only concatenate list (not "str") to list
I think if you need to join 3 columns, then your solution is really good:
df_single = df['colour']+" "+df['model']+" "+df['size']
If you need a general solution for many columns, use DataFrame.astype to convert to strings if necessary, DataFrame.add to append a whitespace, sum to concatenate, and finally Series.str.rstrip to remove the trailing whitespace:
cols = ['colour','model','size']
df_single = df[cols].astype(str).add(' ').sum(axis=1).str.rstrip()
Or:
df_single = df[cols].astype(str).apply(' '.join, axis=1)
If you want to have spaces between columns, run:
df.apply(' '.join, axis=1)
"Ordinary" df.sum(axis=1) concatenates all columns, but without
spaces between them.
If you want the sum, you need to use:
df_single = df.astype(str).add(' ').sum(axis=1).str.rstrip()
If you don't want to include all the columns, you need to select them first:
columns=['colour','model','size']
df_single=df[columns].astype(str).add(' ').sum(axis=1).str.rstrip()
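For completeness, a minimal runnable sketch of both approaches on a tiny made-up DataFrame (the column names follow the question; the data is invented):
import pandas as pd

df = pd.DataFrame({'colour': ['red', 'blue'],
                   'model': ['A3', 'X5'],
                   'size': [42, 38]})

cols = ['colour', 'model', 'size']

# Option 1: add a trailing space to every cell, sum row-wise, strip the last space
df_single = df[cols].astype(str).add(' ').sum(axis=1).str.rstrip()

# Option 2: join each row's values with a space
df_single_alt = df[cols].astype(str).apply(' '.join, axis=1)

print(df_single.tolist())      # ['red A3 42', 'blue X5 38']
print(df_single_alt.tolist())  # ['red A3 42', 'blue X5 38']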

Pandas .apply() function not always being called in python 3

Hello, I wanted to increment a global variable 'count' through a function which is called via apply on a pandas dataframe of length 1458.
I have read other answers where they talk about .apply() not being in-place.
I therefore followed their advice, but the count variable is still 4.
count = 0
def cc(x):
    global count
    count += 1
    print(count)
# Expected final value of count is 1458, but instead it is 4
# I think it's 4 because 'PoolQC' is a categorical column with 4 possible values
# I want the count variable to be 1458 by the end; instead it shows 4
all_data['tempo'] = all_data['PoolQC'].apply(cc)
# prints 4 instead of 1458
print("Count final value is ", count)
Yes, the observed effect is because the column has a categorical type. Pandas is being smart here and only evaluates the applied function once per category. Is counting the only thing you're doing there? I guess not, but why do you need such a calculation? Can't you use df.shape?
A couple of options I see here:
You can change the type of the column, e.g.
all_data['tempo'] = all_data['PoolQC'].astype(str).apply(cc)
You can use a different, non-categorical column.
You can use df.shape to see how many rows you have in the df.
You can use apply on the whole DataFrame, like all_data['tempo'] = all_data.apply(cc, axis=1).
In that case, you can still use whatever is in all_data['PoolQC'] within the cc function, like:
def cc(x):
    global count
    count += 1
    print(count)
    return x['PoolQC']
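A minimal sketch to check this behaviour on your own pandas version (the data here is made up; 'PoolQC' just mimics the categorical column from the question):
import pandas as pd

count = 0

def cc(x):
    global count
    count += 1
    return x

# A made-up categorical column with 4 categories and 1458 rows
pool_qc = pd.Series(['Ex', 'Gd', 'Fa', 'NA'] * 364 + ['Ex', 'Gd']).astype('category')
all_data = pd.DataFrame({'PoolQC': pool_qc})

count = 0
all_data['tempo'] = all_data['PoolQC'].apply(cc)
print("categorical column:", count)  # may be far less than 1458 if apply is evaluated per category

count = 0
all_data['tempo'] = all_data['PoolQC'].astype(str).apply(cc)
print("after astype(str):", count)   # called once per row: 1458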

How to change stringified numbers in data frame into pure numeric values in R

I have the following data.frame:
employee <- c('John Doe','Peter Gynn','Jolie Hope')
# Note that the salary below is in stringified format.
# In reality there are more such stringified numerical columns.
salary <- as.character(c(21000, 23400, 26800))
df <- data.frame(employee,salary)
The output is:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ employee: Factor w/ 3 levels "John Doe","Jolie Hope",..: 1 3 2
$ salary : Factor w/ 3 levels "21000","23400",..: 1 2 3
What I want to do is to convert the values from strings into pure numbers, straight from the df variable, while at the same time preserving the string names in employee.
I tried this, but it won't work:
as.numeric(df)
At the end of the day I'd like to perform arithmetic on these numeric values from df, such as df2 <- log2(df), etc.
OK, there are a couple of things going on here:
R has two different datatypes that look like strings: factor and character
You can't modify most R objects in place, you have to change them by assignment
The actual fix for your example is:
df$salary = as.numeric(as.character(df$salary))
If you try to call as.numeric on df$salary without converting it to character first, you'd get a somewhat strange result:
> as.numeric(df$salary)
[1] 1 2 3
When R creates a factor, it turns the unique elements of the vector into levels, and then represents those levels using integers, which is what you see when you try to convert to numeric.