For each row, add column name to list in new column if row value matches a condition - python-3.x

I have a series of columns, each containing either Y or N.
I would like to create a new column that contains a list of columns (for that particular row) that contain Y.
Old DataFrame
>>> df
col1 col2 col3 col4 col5
a Y N N N Y
b Y N Y Y Y
c N N Y N N
New DataFrame
>>> df_new
col1 col2 col3 col4 col5 col6
a Y N N N Y [col1, col5]
b Y N Y Y Y [col1, col3, col4, col5]
c N N Y N N [col3]
So far I can get it working for a single column with:
df["col6"] = ["col1" if val == "Y" else "" for val in df["col1"]]
But ideally I want to do the same for all columns, so I somehow end up with the result above. I could imagine doing some kind of loop, but I'm unsure how I would append each result to the list value in col6. Can someone steer me in the right direction, please?

First compare the values against 'Y', then use DataFrame.dot with Series.str.split:
df["col6"] = df.eq('Y').dot(df.columns + ',').str[:-1].str.split(',')
print (df)
col1 col2 col3 col4 col5 col6
a Y N N N Y [col1, col5]
b Y N Y Y Y [col1, col3, col4, col5]
c N N Y N N [col3]
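To see why this works, here is the expression broken into steps (a sketch, starting from the original five-column df; the intermediate names are mine):
mask = df.eq('Y')                    # boolean DataFrame, True where a cell is 'Y'
joined = mask.dot(df.columns + ',')  # per row, concatenates the names of the True columns,
                                     # e.g. 'col1,col5,' for row a
df["col6"] = joined.str[:-1].str.split(',')  # drop the trailing comma, split into lists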
Or, if you need better performance, use a list comprehension over numpy arrays:
cols = df.columns.to_numpy()
df["col6"] = [cols[x].tolist() for x in df.eq('Y').to_numpy()]

Related

How to find the intersection of a pair of columns in pandas dataframe with pairs in any order?

I have the below dataframe:
col1 col2
a b
b a
c d
d c
e d
The desired output should be the unique pairs from the two columns, regardless of order:
col1 col2
a b
c d
e d
Convert each pair to a frozenset, then filter with DataFrame.duplicated in boolean indexing:
df = df[~df[['col1','col2']].apply(frozenset, axis=1).duplicated()]
print (df)
col1 col2
0 a b
2 c d
4 e d
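frozenset works as the deduplication key here because it is hashable and ignores order, so (a, b) and (b, a) collapse to the same value:
print(frozenset(['a', 'b']) == frozenset(['b', 'a']))  # True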
Or you can sort the values with np.sort and remove duplicates with DataFrame.drop_duplicates:
import numpy as np
df = pd.DataFrame(np.sort(df[['col1','col2']]), columns=['col1','col2']).drop_duplicates()
print (df)
col1 col2
0 a b
2 c d
4 d e
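Note that np.sort sorts within each row, which is why the last pair comes back as d e rather than the original e d. If the original orientation should be preserved, a sketch that builds a sorted key but indexes back into the original df:
import numpy as np
import pandas as pd

key = pd.DataFrame(np.sort(df[['col1','col2']].to_numpy(), axis=1), index=df.index)
df = df[~key.duplicated()]  # keeps rows 0, 2 and 4 with their original values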

How to append a column to a dataframe with values from a specified column index [duplicate]

This question already has an answer here:
Vectorized look-up of values in Pandas dataframe
I have a dataframe in which one of the columns is used to designate which of the other columns has the specific value I'm looking for.
df = pd.DataFrame({'COL1':['X','Y','Z'],'COL2':['A','B','C'],'X_SORT':['COL1','COL2','COL1']})
I'm trying to add a new column called 'X_SORT_VALUE' and assign it the value of the column identified in the X_SORT column.
df = df.assign(X_SORT_VALUE=lambda x: (x['X_SORT']))
But I'm getting the value of the X_SORT column:
COL1 COL2 X_SORT X_SORT_VALUE
0 X A COL1 COL1
1 Y B COL2 COL2
2 Z C COL1 COL1
Rather than the value of the column it names, which is what I want:
COL1 COL2 X_SORT X_SORT_VALUE
0 X A COL1 X
1 Y B COL2 B
2 Z C COL1 Z
You need df.lookup here:
df['X_SORT_VALUE'] = df.lookup(df.index,df['X_SORT'])
COL1 COL2 X_SORT X_SORT_VALUE
0 X A COL1 X
1 Y B COL2 B
2 Z C COL1 Z
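Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current pandas, the replacement suggested by the docs combines factorize with numpy indexing; a sketch:
import numpy as np

idx, cols = pd.factorize(df['X_SORT'])
df['X_SORT_VALUE'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]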

select the first n largest groups from grouped data frames

Data frame (df) structure:
col1 col2
x 3131
y 9647
y 9648
z 9217
y 9652
x 23
Grouping:
grouped = df.groupby('col1')
I want to select the 2 largest groups, i.e.,
y 9647
y 9648
y 9652
and
x 3131
x 23
How can I do that using pandas? I've achieved it using lists, but that gets clumsy: the result is a list of tuples that I then have to convert back into DataFrames.
Use value_counts, slice its index for the top 2 groups, and filter rows with isin in boolean indexing:
df1 = df[df['col1'].isin(df['col1'].value_counts().index[:2])]
print (df1)
col1 col2
0 x 3131
1 y 9647
2 y 9648
4 y 9652
5 x 23
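This relies on value_counts returning counts in descending order, which it does by default; if you prefer to make the intent explicit, Series.nlargest gives an equivalent sketch:
top2 = df['col1'].value_counts().nlargest(2).index
df1 = df[df['col1'].isin(top2)]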
If you need separate DataFrames for the top groups, use a dictionary comprehension with enumerate:
dfs = {i: df[df['col1'].eq(x)] for i, x in enumerate(df['col1'].value_counts().index[:2], 1)}
print (dfs)
{1: col1 col2
1 y 9647
2 y 9648
4 y 9652, 2: col1 col2
0 x 3131
5 x 23}
print (dfs[1])
col1 col2
1 y 9647
2 y 9648
4 y 9652
print (dfs[2])
col1 col2
0 x 3131
5 x 23
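The same selection can also be driven from GroupBy.size, which reads naturally if you are already grouping; a sketch producing the same dictionary:
sizes = df.groupby('col1').size().sort_values(ascending=False)
dfs = {i: df[df['col1'].eq(name)] for i, name in enumerate(sizes.index[:2], 1)}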

Pandas add empty rows as filler

Given the following data frame:
import pandas as pd
DF = pd.DataFrame({'COL1': ['A', 'A', 'B'],
                   'COL2': [1, 2, 1],
                   'COL3': ['X', 'Y', 'X']})
DF
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
I would like to have an additional row for COL1 = 'B' so that both values (COL1 A and B) are represented by the COL3 values X and Y, with a 0 for COL2 in the generated row.
The desired result is as follows:
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
3 B 0 Y
This is just a simplified example; I need an approach that can handle many such instances, not just inserting the row of interest manually.
Thanks in advance!
UPDATE:
For a generalized scenario where there are many different combinations of values under 'COL1' and 'COL3', this works but is probably not nearly as efficient as it can be:
#Get unique set of COL3
COL3SET = set(DF['COL3'])
#Get unique set of COL1
COL1SET = set(DF['COL1'])
#Get all possible combinations of unique sets
import itertools
COMB = []
for combination in itertools.product(COL1SET, COL3SET):
    COMB.append(combination)
#Create dataframe from new set:
UNQ = pd.DataFrame({'COMB':COMB})
#Split tuples into columns
new_col_list = ['COL1unq','COL3unq']
for n,col in enumerate(new_col_list):
    UNQ[col] = UNQ['COMB'].apply(lambda COMB: COMB[n])
UNQ = UNQ.drop('COMB',axis=1)
#Merge original data frame with unique set data frame
DF = pd.merge(DF,UNQ,left_on=['COL1','COL3'],right_on=['COL1unq','COL3unq'],how='outer')
#Fill in empty values of COL1 and COL3 where they did not have records
DF['COL1'] = DF['COL1unq']
DF['COL3'] = DF['COL3unq']
#Replace 'NaN's in column 2 with zeros
DF['COL2'].fillna(0, inplace=True)
#Get rid of COL1unq and COL3unq
DF.drop(['COL1unq','COL3unq'],axis=1, inplace=True)
DF
Something like this?
col1_b_vals = set(DF.loc[DF.COL1 == 'B', 'COL3'])
col1_not_b_col3_vals = set(DF.loc[DF.COL1 != 'B', 'COL3'])
missing_vals = col1_not_b_col3_vals.difference(col1_b_vals)
missing_rows = DF.loc[(DF.COL1 != 'B') & (DF.COL3.isin(missing_vals)), :].copy()  # copy so the assignments below modify a new frame, not a view
missing_rows['COL1'] = 'B'
missing_rows['COL2'] = 0
>>> pd.concat([DF, missing_rows], ignore_index=True)
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
3 B 0 Y
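For the generalized scenario in the update, the combination-building and merging can be collapsed with MultiIndex.from_product and reindex; a sketch, assuming COL2 is the only value column and 0 is the right filler:
full = pd.MultiIndex.from_product(
    [DF['COL1'].unique(), DF['COL3'].unique()], names=['COL1', 'COL3'])
DF = (DF.set_index(['COL1', 'COL3'])
        .reindex(full, fill_value=0)      # adds the missing (B, Y) row with COL2 = 0
        .reset_index()[['COL1', 'COL2', 'COL3']])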

Adding a space between two characters in a data frame in R

I have a data frame where each cell is a 2-character string (e.g. "AA", "BC", "CD"). I am trying to put a space between the two characters of each string, while NA values remain as-is. I can't seem to figure this out. Any help?
Here is an example data frame:
df <- data.frame(col1=c("AB", "CD", "EF"), col2=c("AA", "BB", "CC"), col3=c("XX", "YY", NA))
And this is what the example data frame looks like:
col1 col2 col3
1 AB AA XX
2 CD BB YY
3 EF CC <NA>
This is what i want my data frame to look like:
col1 col2 col3
1 A B A A X X
2 C D B B Y Y
3 E F C C <NA>
Thanks in advance!
If it's as simple as you show, this is one approach:
data.frame(lapply(df, function(x){
  ifelse(is.na(x), NA,
         paste(substring(x, 1, 1), substring(x, 2)))
}))
If your data.frame columns are all factors, then you can work on the levels:
as.data.frame(lapply(df, function(x){
  .l <- unlist(lapply(strsplit(levels(x), ''), paste, collapse = ' '))
  levels(x) <- .l
  x
}))
If your data.frame columns are character (stringsAsFactors = FALSE):
as.data.frame(lapply(df, function(x){
  ifelse(is.na(x), NA,
         sapply(strsplit(x, ''), paste, collapse = ' '))
}))
Here's one way
df2 <- data.frame(lapply(df, function(x) {
  levels(x) <- gsub("(.)(.)", "\\1 \\2", levels(x))
  return(x)
}))
df2
# col1 col2 col3
# 1 A B A A X X
# 2 C D B B Y Y
# 3 E F C C <NA>
This of course relies on the assumption that, when creating the data.frame df, the argument stringsAsFactors is TRUE (the default before R 4.0).
Did you try this?
df <- data.frame(col1=c("A B", "C D", "E F"), col2=c("A A", "B B", "C C"), col3=c("X X", "Y Y", NA))
I tried it and it gives what you require. It may seem too simple, but it works if you can create the data with the spaces already in place.
If you are getting the column values dynamically, you can use paste together with strsplit.
Example
x <- "AB"
strsplit(x, '')
[[1]]
[1] "A" "B"
Then you can collapse the pieces back together with paste, e.g. sapply(strsplit(x, ''), paste, collapse = ' ') returns "A B".
