How to use FeatureTools to generate new features by crossing features in a table? - featuretools

Feature crossing is a very common technique to find the nonlinear relationships in a dataset.
How to use FeatureTools to generate new features by crossing features in a table?

It is possible to cross every pair of numeric features automatically in Featuretools using the primitive Multiply. As a code example, suppose we have the fictional dataframe
index price shares_bought date
index
1 1 1.00 3 2017-12-29
2 2 0.75 4 2017-12-30
3 3 0.60 5 2017-12-31
4 4 0.50 18 2018-01-01
5 5 1.00 1 2018-01-02
and we want to multiply price by shares_bought. We would run
es = ft.EntitySet('Transactions')
es.entity_from_dataframe(dataframe=df, entity_id='log', index='index', time_index='date')
from featuretools.primitives import Multiply
fm, features = ft.dfs(entityset=es,
target_entity='log',
trans_primitives=[Multiply])
to make the dataframe into an entityset, and then run DFS to apply the Multiply in all places possible. In this case, since there are only two numeric features, we'll get a feature matrix fm which looks like
price shares_bought price * shares_bought
index
1 1.00 3 3.0
2 0.75 4 3.0
3 0.60 5 3.0
4 0.50 18 9.0
5 1.00 1 1.0
If we want to apply a primitive to a particular pair of features by hand, it is possible to do so using seed features. Our code would then be
n12_cross = Multiply(es['log']['price'], es['log']['shares_bought'])
fm, features = ft.dfs(entityset=es,
target_entity='log',
seed_features=[n12_cross])
to get the same feature matrix as above.
EDIT:
To make the dataframe above, I used
import pandas as pd
import featuretools as ft
df = pd.DataFrame({'index': [1, 2, 3, 4, 5],
'shares_bought': [3, 4, 5, 18, 1],
'price': [1.00, 0.75, 0.60, 0.50, 1.00]})
df['date'] = pd.date_range('12/29/2017', periods=5, freq='D')

Related

Transpose dataframe for time-series data analysis? [duplicate]

On the pandas tag, I often see users asking questions about melting dataframes in pandas. I am gonna attempt a cannonical Q&A (self-answer) with this topic.
I am gonna clarify:
What is melt?
How do I use melt?
When do I use melt?
I see some hotter questions about melt, like:
Convert columns into rows with Pandas : This one actually could be good, but some more explanation would be better.
Pandas Melt Function : Nice question answer is good, but it's a bit too vague, not much expanation.
Melting a pandas dataframe : Also a nice answer! But it's only for that particular situation, which is pretty simple, only pd.melt(df)
Pandas dataframe use columns as rows (melt) : Very neat! But the problem is that it's only for the specific question the OP asked, which is also required to use pivot_table as well.
So I am gonna attempt a canonical Q&A for this topic.
Dataset:
I will have all my answers on this dataset of random grades for random people with random ages (easier to explain for the answers :D):
import pandas as pd
df = pd.DataFrame({'Name': ['Bob', 'John', 'Foo', 'Bar', 'Alex', 'Tom'],
'Math': ['A+', 'B', 'A', 'F', 'D', 'C'],
'English': ['C', 'B', 'B', 'A+', 'F', 'A'],
'Age': [13, 16, 16, 15, 15, 13]})
>>> df
Name Math English Age
0 Bob A+ C 13
1 John B B 16
2 Foo A B 16
3 Bar F A+ 15
4 Alex D F 15
5 Tom C A 13
>>>
Problems:
I am gonna have some problems and they will be solved in my self-answer below.
Problem 1:
How do I melt a dataframe so that the original dataframe becomes:
Name Age Subject Grade
0 Bob 13 English C
1 John 16 English B
2 Foo 16 English B
3 Bar 15 English A+
4 Alex 17 English F
5 Tom 12 English A
6 Bob 13 Math A+
7 John 16 Math B
8 Foo 16 Math A
9 Bar 15 Math F
10 Alex 17 Math D
11 Tom 12 Math C
I want to transpose this so that one column would be each subject and the other columns would be the repeated names of the students and there age and score.
Problem 2:
This is similar to Problem 1, but this time I want to make the Problem 1 output Subject column only have Math, I want to filter out the English column:
Name Age Subject Grades
0 Bob 13 Math A+
1 John 16 Math B
2 Foo 16 Math A
3 Bar 15 Math F
4 Alex 15 Math D
5 Tom 13 Math C
I want the output to be like the above.
Problem 3:
If I was to group the melt and order the students by there scores, how would I be able to do that, to get the desired output like the below:
value Name Subjects
0 A Foo, Tom Math, English
1 A+ Bob, Bar Math, English
2 B John, John, Foo Math, English, English
3 C Tom, Bob Math, English
4 D Alex Math
5 F Bar, Alex Math, English
I need it to be ordered and the names separated by comma and also the Subjects separated by comma in the same order respectively
Problem 4:
How would I unmelt a melted dataframe? Let's say I already melted this dataframe:
print(df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades'))
To become:
Name Age Subject Grades
0 Bob 13 Math A+
1 John 16 Math B
2 Foo 16 Math A
3 Bar 15 Math F
4 Alex 15 Math D
5 Tom 13 Math C
6 Bob 13 English C
7 John 16 English B
8 Foo 16 English B
9 Bar 15 English A+
10 Alex 15 English F
11 Tom 13 English A
Then how would I translate this back to the original dataframe, the below:
Name Math English Age
0 Bob A+ C 13
1 John B B 16
2 Foo A B 16
3 Bar F A+ 15
4 Alex D F 15
5 Tom C A 13
How would I go about doing this?
Problem 5:
If I was to group by the names of the students and separate the subjects and grades by comma, how would I do it?
Name Subject Grades
0 Alex Math, English D, F
1 Bar Math, English F, A+
2 Bob Math, English A+, C
3 Foo Math, English A, B
4 John Math, English B, B
5 Tom Math, English C, A
I want to have a dataframe like above.
Problem 6:
If I was gonna completely melt my dataframe, all columns as values, how would I do it?
Column Value
0 Name Bob
1 Name John
2 Name Foo
3 Name Bar
4 Name Alex
5 Name Tom
6 Math A+
7 Math B
8 Math A
9 Math F
10 Math D
11 Math C
12 English C
13 English B
14 English B
15 English A+
16 English F
17 English A
18 Age 13
19 Age 16
20 Age 16
21 Age 15
22 Age 15
23 Age 13
I want to have a dataframe like above. All columns as values.
Please check my self-answer below :)
Note for pandas versions < 0.20.0: I will be using df.melt(...) for my examples, but you will need to use pd.melt(df, ...) instead.
Documentation references:
Most of the solutions here would be used with melt, so to know the method melt, see the documentaion explanation
Unpivot a DataFrame from wide to long format, optionally leaving
identifiers set.
This function is useful to massage a DataFrame into a format where one
or more columns are identifier variables (id_vars), while all other
columns, considered measured variables (value_vars), are “unpivoted”
to the row axis, leaving just two non-identifier columns, ‘variable’
and ‘value’.
Parameters
id_vars : tuple, list, or ndarray, optional
Column(s) to use as identifier variables.
value_vars : tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name : scalar
Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
value_name : scalar, default ‘value’
Name to use for the ‘value’ column.
col_level : int or str, optional
If columns are a MultiIndex then use this level to melt.
ignore_index : bool, default True
If True, original index is ignored. If False, the original index is retained. Index labels will be repeated
as necessary.
New in version 1.1.0.
Logic to melting:
Melting merges multiple columns and converts the dataframe from wide to long, for the solution to Problem 1 (see below), the steps are:
First we got the original dataframe.
Then the melt firstly merges the Math and English columns and makes the dataframe replicated (longer).
Then finally adds the column Subject which is the subject of the Grades columns value respectively.
This is the simple logic to what the melt function does.
Solutions:
I will solve my own questions.
Problem 1:
Problem 1 could be solve using pd.DataFrame.melt with the following code:
print(df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades'))
This code passes the id_vars argument to ['Name', 'Age'], then automatically the value_vars would be set to the other columns (['Math', 'English']), which is transposed into that format.
You could also solve Problem 1 using stack like the below:
print(
df.set_index(["Name", "Age"])
.stack()
.reset_index(name="Grade")
.rename(columns={"level_2": "Subject"})
.sort_values("Subject")
.reset_index(drop=True)
)
This code sets the Name and Age columns as the index and stacks the rest of the columns Math and English, and resets the index and assigns Grade as the column name, then renames the other column level_2 to Subject and then sorts by the Subject column, then finally resets the index again.
Both of these solutions output:
Name Age Subject Grade
0 Bob 13 English C
1 John 16 English B
2 Foo 16 English B
3 Bar 15 English A+
4 Alex 17 English F
5 Tom 12 English A
6 Bob 13 Math A+
7 John 16 Math B
8 Foo 16 Math A
9 Bar 15 Math F
10 Alex 17 Math D
11 Tom 12 Math C
Problem 2:
This is similar to my first question, but this one I only one to filter in the Math columns, this time the value_vars argument can come into use, like the below:
print(
df.melt(
id_vars=["Name", "Age"],
value_vars="Math",
var_name="Subject",
value_name="Grades",
)
)
Or we can also use stack with column specification:
print(
df.set_index(["Name", "Age"])[["Math"]]
.stack()
.reset_index(name="Grade")
.rename(columns={"level_2": "Subject"})
.sort_values("Subject")
.reset_index(drop=True)
)
Both of these solutions give:
Name Age Subject Grade
0 Bob 13 Math A+
1 John 16 Math B
2 Foo 16 Math A
3 Bar 15 Math F
4 Alex 15 Math D
5 Tom 13 Math C
Problem 3:
Problem 3 could be solved with melt and groupby, using the agg function with ', '.join, like the below:
print(
df.melt(id_vars=["Name", "Age"])
.groupby("value", as_index=False)
.agg(", ".join)
)
It melts the dataframe then groups by the grades and aggregates them and joins them by a comma.
stack could be also used to solve this problem, with stack and groupby like the below:
print(
df.set_index(["Name", "Age"])
.stack()
.reset_index()
.rename(columns={"level_2": "Subjects", 0: "Grade"})
.groupby("Grade", as_index=False)
.agg(", ".join)
)
This stack function just transposes the dataframe in a way that is equivalent to melt, then resets the index, renames the columns and groups and aggregates.
Both solutions output:
Grade Name Subjects
0 A Foo, Tom Math, English
1 A+ Bob, Bar Math, English
2 B John, John, Foo Math, English, English
3 C Bob, Tom English, Math
4 D Alex Math
5 F Bar, Alex Math, English
Problem 4:
We first melt the dataframe for the input data:
df = df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades')
Then now we can start solving this Problem 4.
Problem 4 could be solved with pivot_table, we would have to specify to the pivot_table arguments, values, index, columns and also aggfunc.
We could solve it with the below code:
print(
df.pivot_table("Grades", ["Name", "Age"], "Subject", aggfunc="first")
.reset_index()
.rename_axis(columns=None)
)
Output:
Name Age English Math
0 Alex 15 F D
1 Bar 15 A+ F
2 Bob 13 C A+
3 Foo 16 B A
4 John 16 B B
5 Tom 13 A C
The melted dataframe is converted back to the exact same format as the original dataframe.
We first pivot the melted dataframe and then reset the index and remove the column axis name.
Problem 5:
Problem 5 could be solved with melt and groupby like the following:
print(
df.melt(id_vars=["Name", "Age"], var_name="Subject", value_name="Grades")
.groupby("Name", as_index=False)
.agg(", ".join)
)
That melts and groups by Name.
Or you could stack:
print(
df.set_index(["Name", "Age"])
.stack()
.reset_index()
.groupby("Name", as_index=False)
.agg(", ".join)
.rename({"level_2": "Subjects", 0: "Grades"}, axis=1)
)
Both codes output:
Name Subjects Grades
0 Alex Math, English D, F
1 Bar Math, English F, A+
2 Bob Math, English A+, C
3 Foo Math, English A, B
4 John Math, English B, B
5 Tom Math, English C, A
Problem 6:
Problem 6 could be solved with melt and no column needed to be specified, just specify the expected column names:
print(df.melt(var_name='Column', value_name='Value'))
That melts the whole dataframe
Or you could stack:
print(
df.stack()
.reset_index(level=1)
.sort_values("level_1")
.reset_index(drop=True)
.set_axis(["Column", "Value"], axis=1)
)
Both codes output:
Column Value
0 Age 16
1 Age 15
2 Age 15
3 Age 16
4 Age 13
5 Age 13
6 English A+
7 English B
8 English B
9 English A
10 English F
11 English C
12 Math C
13 Math A+
14 Math D
15 Math B
16 Math F
17 Math A
18 Name Alex
19 Name Bar
20 Name Tom
21 Name Foo
22 Name John
23 Name Bob
Conclusion:
melt is a really handy function, often it's required, once you meet these types of problems, don't forget to try melt, it may well solve your problem.
There is another kind of melt not mentioned in the question, which is that with a dataframe whose column header contains common prefix and you want to melt the suffix to column value.
It is kind of the opposite of question 11 in How can I pivot a dataframe?
Say you have a following DataFrame, and you want to melt 1970, 1980 to column values
A1970 A1980 B1970 B1980 X id
0 a d 2.5 3.2 -1.085631 0
1 b e 1.2 1.3 0.997345 1
2 c f 0.7 0.1 0.282978 2
In this case you can try pandas.wide_to_long
pd.wide_to_long(df, stubnames=["A", "B"], i="id", j="year")
X A B
id year
0 1970 -1.085631 a 2.5
1 1970 0.997345 b 1.2
2 1970 0.282978 c 0.7
0 1980 -1.085631 d 3.2
1 1980 0.997345 e 1.3
2 1980 0.282978 f 0.1
As described here by U12-Forward, melting a dataframe primarily means reshaping the data from wide form to long form. More often than not, the new dataframe will have more rows and less columns compared to the original dataframe.
There are different scenarios when it comes to melting - all column labels could be melted into a single column, or multiple columns; some parts of the column labels could be retained as headers, while the rest are collated into a column, and so on. This answer shows how to melt a pandas dataframe, using pd.stack, pd.melt, pd.wide_to_long and pivot_longer from pyjanitor (I am a contributor to the pyjanitor library). The examples won't be exhaustive, but hopefully should point you in the right direction when it comes to reshaping dataframes from wide to long form.
Sample Data
df = pd.DataFrame(
{'Sepal.Length': [5.1, 5.9],
'Sepal.Width': [3.5, 3.0],
'Petal.Length': [1.4, 5.1],
'Petal.Width': [0.2, 1.8],
'Species': ['setosa', 'virginica']}
)
df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 5.9 3.0 5.1 1.8 virginica
Scenario 1 - Melt all columns:
In this case, we wish to convert all the specified column headers into rows - this can be done with pd.melt or pd.stack, and the solutions to problem 1 already cover this. The reshaping can also be done with pivot_longer
# pip install pyjanitor
import janitor
df.pivot_longer(index = 'Species')
Species variable value
0 setosa Sepal.Length 5.1
1 virginica Sepal.Length 5.9
2 setosa Sepal.Width 3.5
3 virginica Sepal.Width 3.0
4 setosa Petal.Length 1.4
5 virginica Petal.Length 5.1
6 setosa Petal.Width 0.2
7 virginica Petal.Width 1.8
Just like in pd.melt, you can rename the variable and value column, by passing arguments to names_to and values_to parameters:
df.pivot_longer(index = 'Species',
names_to = 'dimension',
values_to = 'measurement_in_cm')
Species dimension measurement_in_cm
0 setosa Sepal.Length 5.1
1 virginica Sepal.Length 5.9
2 setosa Sepal.Width 3.5
3 virginica Sepal.Width 3.0
4 setosa Petal.Length 1.4
5 virginica Petal.Length 5.1
6 setosa Petal.Width 0.2
7 virginica Petal.Width 1.8
You can also retain the original index, and keep the dataframe based on order of appearance:
df.pivot_longer(index = 'Species',
names_to = 'dimension',
values_to = 'measurement_in_cm',
ignore_index = False,
sort_by_appearance=True)
Species dimension measurement_in_cm
0 setosa Sepal.Length 5.1
0 setosa Sepal.Width 3.5
0 setosa Petal.Length 1.4
0 setosa Petal.Width 0.2
1 virginica Sepal.Length 5.9
1 virginica Sepal.Width 3.0
1 virginica Petal.Length 5.1
1 virginica Petal.Width 1.8
By default, the values in names_to are strings; they can be converted to other data types via the names_transform parameter - this can be helpful/performant for large dataframes, as it is generally more efficient compared to converting the data types after the reshaping. Note that this feature is currently only available in the development version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
out = df.pivot_longer(index = 'Species',
names_to = 'dimension',
values_to = 'measurement_in_cm',
ignore_index = False,
sort_by_appearance=True,
names_transform = 'category')
out.dtypes
Species object
dimension category
measurement_in_cm float64
dtype: object
Scenario 2 - Melt column labels into multiple columns:
So far, we've melted our data into single columns, one for the column names and one for the values. However, there might be scenarios where we wish to split the column labels into different columns, or even the values into different columns. Continuing with our sample data, we could prefer to have sepal and petal under a part column, while length and width are into a dimension column:
Via pd.melt - The separation is done after the melt:
out = df.melt(id_vars = 'Species')
arr = out.variable.str.split('.')
(out
.assign(part = arr.str[0],
dimension = arr.str[1])
.drop(columns = 'variable')
)
Species value part dimension
0 setosa 5.1 Sepal Length
1 virginica 5.9 Sepal Length
2 setosa 3.5 Sepal Width
3 virginica 3.0 Sepal Width
4 setosa 1.4 Petal Length
5 virginica 5.1 Petal Length
6 setosa 0.2 Petal Width
7 virginica 1.8 Petal Width
Via pd.stack - offers a more efficient way of splitting the columns; the split is done on the columns, meaning less number of rows to deal with, meaning potentially faster outcome, as the data size increases:
out = df.set_index('Species')
# this returns a MultiIndex
out.columns = out.columns.str.split('.', expand = True)
new_names = ['part', 'dimension']
out.columns.names = new_names
out.stack(new_names).rename('value').reset_index()
Species part dimension value
0 setosa Petal Length 1.4
1 setosa Petal Width 0.2
2 setosa Sepal Length 5.1
3 setosa Sepal Width 3.5
4 virginica Petal Length 5.1
5 virginica Petal Width 1.8
6 virginica Sepal Length 5.9
7 virginica Sepal Width 3.0
Via pivot_longer - The key thing to note about pivot_longer is that it looks for patterns. The column labels are separated by a dot .. Simply pass a list/tuple of new names to names_to, and pass a separator to names_sep (under the hood it just uses pd.str.split):
df.pivot_longer(index = 'Species',
names_to = ('part', 'dimension'),
names_sep='.')
Species part dimension value
0 setosa Sepal Length 5.1
1 virginica Sepal Length 5.9
2 setosa Sepal Width 3.5
3 virginica Sepal Width 3.0
4 setosa Petal Length 1.4
5 virginica Petal Length 5.1
6 setosa Petal Width 0.2
7 virginica Petal Width 1.8
So far, we've seen how melt, stack and pivot_longer can split the column labels into multiple new columns, as long as there is a defined separator. What if there isn't a clearly defined separator, like in the dataframe below :
# https://github.com/tidyverse/tidyr/blob/main/data-raw/who.csv
who = pd.DataFrame({'id': [1], 'new_sp_m5564': [2], 'newrel_f65': [3]})
who
id new_sp_m5564 newrel_f65
0 1 2 3
In the second column we have multiple _, compared to the 3rd column which has just one _. The goal here is to split the column labels into individual columns (sp & rel to diagnosis column, m & f to gender column, the numbers to age column). One option is to extract the column sub labels via regex
Via pd.melt - again with pd.melt, the reshaping occurs after the melt:
out = who.melt('id')
regex = r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\d+)"
new_df = out.variable.str.extract(regex)
# pd.concat can be used here instead
out.drop(columns='variable').assign(**new_df)
id value diagnosis gender age
0 1 2 sp m 5564
1 1 3 rel f 65
Note how the extracts occurred for the regex in groups (the one in parentheses).
Via pd.stack - just like in the previous example, the split is done on the columns, offering more in terms of efficiency:
out = who.set_index('id')
regex = r"new_?(.+)_(.)(\d+)"
new_names = ['diagnosis', 'age', 'gender']
# returns a dataframe
new_cols = out.columns.str.extract(regex)
new_cols.columns = new_names
new_cols = pd.MultiIndex.from_frame(new_cols)
out.columns = new_cols
out.stack(new_names).rename('value').reset_index()
id diagnosis age gender value
0 1 rel f 65 3.0
1 1 sp m 5564 2.0
Again, the extracts occur for the regex in groups.
Via pivot_longer - again we know the pattern, and the new column names, we simply pass those to the function, this time we use names_pattern, since we are dealing with a regex. The extracts will match the regular expression in the groups (the ones in parentheses) :
regex = r"new_?(.+)_(.)(\d+)"
new_names = ['diagnosis', 'age', 'gender']
who.pivot_longer(index = 'id',
names_to = new_names,
names_pattern = regex)
id diagnosis age gender value
0 1 sp m 5564 2
1 1 rel f 65 3
Scenario 3 - Melt column labels and values into multiple columns:
What if we wish to split the values into multiple columns as well? Let's use a fairly popular question on SO:
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover'],
'State': ['Texas', 'Texas', 'Alabama'],
'Name':['Aria', 'Penelope', 'Niko'],
'Mango':[4, 10, 90],
'Orange': [10, 8, 14],
'Watermelon':[40, 99, 43],
'Gin':[16, 200, 34],
'Vodka':[20, 33, 18]},
columns=['City', 'State', 'Name', 'Mango', 'Orange', 'Watermelon', 'Gin', 'Vodka'])
df
City State Name Mango Orange Watermelon Gin Vodka
0 Houston Texas Aria 4 10 40 16 20
1 Austin Texas Penelope 10 8 99 200 33
2 Hoover Alabama Niko 90 14 43 34 18
The goal is to collate Mango, Orange, and Watermelon into a fruits column, Gin and Vodka into a Drinks column, and collate the respective values into Pounds and Ounces respectively.
Via pd.melt - I am copying the excellent solution verbatim :
df1 = df.melt(id_vars=['City', 'State'],
value_vars=['Mango', 'Orange', 'Watermelon'],
var_name='Fruit', value_name='Pounds')
df2 = df.melt(id_vars=['City', 'State'],
value_vars=['Gin', 'Vodka'],
var_name='Drink', value_name='Ounces')
df1 = df1.set_index(['City', 'State', df1.groupby(['City', 'State']).cumcount()])
df2 = df2.set_index(['City', 'State', df2.groupby(['City', 'State']).cumcount()])
df3 = (pd.concat([df1, df2],axis=1)
.sort_index(level=2)
.reset_index(level=2, drop=True)
.reset_index())
print (df3)
City State Fruit Pounds Drink Ounces
0 Austin Texas Mango 10 Gin 200.0
1 Hoover Alabama Mango 90 Gin 34.0
2 Houston Texas Mango 4 Gin 16.0
3 Austin Texas Orange 8 Vodka 33.0
4 Hoover Alabama Orange 14 Vodka 18.0
5 Houston Texas Orange 10 Vodka 20.0
6 Austin Texas Watermelon 99 NaN NaN
7 Hoover Alabama Watermelon 43 NaN NaN
8 Houston Texas Watermelon 40 NaN NaN
Via pd.stack - I can't think of a solution via stack, so I'll skip
Via pivot_longer - The reshape can be efficiently done by passing the list of names to names_to and values_to, and pass a list of regular expressions to names_pattern- when splitting values into multiple columns, a list of regex to names_pattern is required:
df.pivot_longer(
index=["City", "State"],
column_names=slice("Mango", "Vodka"),
names_to=("Fruit", "Drink"),
values_to=("Pounds", "Ounces"),
names_pattern=[r"M|O|W", r"G|V"],
)
City State Fruit Pounds Drink Ounces
0 Houston Texas Mango 4 Gin 16.0
1 Austin Texas Mango 10 Gin 200.0
2 Hoover Alabama Mango 90 Gin 34.0
3 Houston Texas Orange 10 Vodka 20.0
4 Austin Texas Orange 8 Vodka 33.0
5 Hoover Alabama Orange 14 Vodka 18.0
6 Houston Texas Watermelon 40 None NaN
7 Austin Texas Watermelon 99 None NaN
8 Hoover Alabama Watermelon 43 None NaN
The efficiency is even more as the dataframe size increases.
Scenario 4 - Group similar columns together:
Extending the concept of melting into multiple columns, let's say we wish to group similar columns together. We do not care about retaining the column labels, just combining the values of similar columns into new columns.
df = pd.DataFrame({'x_1_mean': [10],
'x_2_mean': [20],
'y_1_mean': [30],
'y_2_mean': [40],
'unit': [50]})
df
x_1_mean x_2_mean y_1_mean y_2_mean unit
0 10 20 30 40 50
For the code above, we wish to combine similar columns (columns that start with the same letter) into new unique columns - all x* columns will be lumped under x_mean, while all y* columns will be collated under y_mean. We are not saving the column labels, we are only interested in the values of these columns:
Via pd.melt - one possible way via melt is to run it via groupby on the columns:
out = df.set_index('unit')
grouped = out.columns.str.split('_\d_').str.join('')
# group on the split
grouped = out.groupby(grouped, axis = 1)
# iterate, melt individually, and recombine to get a new dataframe
out = {key : frame.melt(ignore_index = False).value
for key, frame in grouped}
pd.DataFrame(out).reset_index()
unit xmean ymean
0 50 10 30
1 50 20 40
Via pd.stack - Here we split the columns and build a MultiIndex:
out = df.set_index('unit')
split = out.columns.str.split('_(\d)_')
split = [(f"{first}{last}", middle)
for first, middle, last
in split]
out.columns = pd.MultiIndex.from_tuples(split)
out.stack(-1).droplevel(-1).reset_index()
unit xmean ymean
0 50 10 30
1 50 20 40
Via pd.wide_to_long - Here we reorder the sub labels - move the numbers to the end of the columns:
out = df.set_index('unit')
out.columns = [f"{first}{last}_{middle}"
for first, middle, last
in out.columns.str.split('_(\d)_')]
(pd
.wide_to_long(
out.reset_index(),
stubnames = ['xmean', 'ymean'],
i = 'unit',
j = 'num',
sep = '_')
.droplevel(-1)
.reset_index()
)
unit xmean ymean
0 50 10 30
1 50 20 40
Via pivot_longer - Again, with pivot_longer, it is all about the patterns. Simply pass a list of new column names to names_to, and the corresponding regular expressions to names_pattern:
df.pivot_longer(index = 'unit',
names_to = ['xmean', 'ymean'],
names_pattern = ['x', 'y']
)
unit xmean ymean
0 50 10 30
1 50 20 40
Note that with this pattern it is on a first come first serve basis - if the column order was flipped, pivot_longer would give a different output. Lets see this in action:
# reorder the columns in a different form:
df = df.loc[:, ['x_1_mean', 'x_2_mean', 'y_2_mean', 'y_1_mean', 'unit']]
df
x_1_mean x_2_mean y_2_mean y_1_mean unit
0 10 20 40 30 50
Because the order has changed, x_1_mean will be paired with y_2_mean, because that is the first y column it sees, while x_2_mean gets paired with y_1_mean:
df.pivot_longer(index = 'unit',
names_to = ['xmean', 'ymean'],
names_pattern = ['x', 'y']
)
unit xmean ymean
0 50 10 40
1 50 20 30
Note the difference in the output compared to the previous run. This is something to note when using names_pattern with a sequence. Order matters.
Scenario 5 - Retain part of the column names as headers:
This might probably be one of the biggest use cases when reshaping to long form. Some parts of the column label we may wish to keep as header, and move the remaining columns to new columns (or even ignore them).
Let's revisit our iris dataframe:
df = pd.DataFrame(
{'Sepal.Length': [5.1, 5.9],
'Sepal.Width': [3.5, 3.0],
'Petal.Length': [1.4, 5.1],
'Petal.Width': [0.2, 1.8],
'Species': ['setosa', 'virginica']}
)
df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 5.9 3.0 5.1 1.8 virginica
Our goal here is to keep Sepal, Petal as column names, and the rest (Length, Width) are collated into a dimension column:
Via pd.melt - A pivot is used after melting into long form :
out = df.melt(id_vars = 'Species')
arr = out.variable.str.split('.')
(out
.assign(part = arr.str[0],
dimension = arr.str[1])
.pivot(['Species', 'dimension'], 'part', 'value')
.rename_axis(columns = None)
.reset_index()
)
Species dimension Petal Sepal
0 setosa Length 1.4 5.1
1 setosa Width 0.2 3.5
2 virginica Length 5.1 5.9
3 virginica Width 1.8 3.0
This is not as efficient as other options below, as this involves wide to long, then long to wide, this might have poor performance on large enough dataframe.
Via pd.stack - This offers more efficiency as most of the reshaping is on the columns - less is more.
out = df.set_index('Species')
out.columns = out.columns.str.split('.', expand = True)
out.columns.names = [None, 'dimension']
out.stack('dimension').reset_index()
Species dimension Petal Sepal
0 setosa Length 1.4 5.1
1 setosa Width 0.2 3.5
2 virginica Length 5.1 5.9
3 virginica Width 1.8 3.0
Via pd.wide_to_long - Straightforward - simply pass in the relevant arguments:
(pd
.wide_to_long(
df,
stubnames=['Sepal', 'Petal'],
i = 'Species',
j = 'dimension',
sep='.',
suffix='.+')
.reset_index()
)
Species dimension Sepal Petal
0 setosa Length 5.1 1.4
1 virginica Length 5.9 5.1
2 setosa Width 3.5 0.2
3 virginica Width 3.0 1.8
As the data size increases, pd.wide_to_long might not be as efficient.
Via pivot_longer : Again, back to patterns. Since we are keeping a part of the column as header, we use .value as a placeholder. The function sees the .value and knows that that sub label has to remain as a header. The split in the columns can either be by names_sep or names_pattern. In this case, it is simpler to use names_sep :
df.pivot_longer(index = 'Species',
names_to = ('.value', 'dimension'),
names_sep = '.')
Species dimension Sepal Petal
0 setosa Length 5.1 1.4
1 virginica Length 5.9 5.1
2 setosa Width 3.5 0.2
3 virginica Width 3.0 1.8
When the column is split with ., we have Petal, Length. When compared with ('.value', 'dimension'), Petal is associated with .value, while Length is associated with dimension. Petal stays as a column header, while Length is lumped into the dimension column. We didn't need to be explicit about the column name, we just use .value and let the function do the heavy work. This way, if you have lots of columns, you don't need to work out what the columns to stay as headers should be, as long as you have the right pattern via names_sep or names_pattern.
What if we want the Length/Width as column names instead, and Petal/Sepal get lumped into a part column:
Via pd.melt
out = df.melt(id_vars = 'Species')
arr = out.variable.str.split('.')
(out
.assign(part = arr.str[0],
dimension = arr.str[1])
.pivot(['Species', 'part'], 'dimension', 'value')
.rename_axis(columns = None)
.reset_index()
)
Species part Length Width
0 setosa Petal 1.4 0.2
1 setosa Sepal 5.1 3.5
2 virginica Petal 5.1 1.8
3 virginica Sepal 5.9 3.0
Via pd.stack:
out = df.set_index('Species')
out.columns = out.columns.str.split('.', expand = True)
out.columns.names = ['part', None]
out.stack('part').reset_index()
Species part Length Width
0 setosa Petal 1.4 0.2
1 setosa Sepal 5.1 3.5
2 virginica Petal 5.1 1.8
3 virginica Sepal 5.9 3.0
Via pd.wide_to_long - First, we need to reorder the columns, such that Length/Width are at the front:
out = df.set_index('Species')
out.columns = out.columns.str.split('.').str[::-1].str.join('.')
(pd
.wide_to_long(
out.reset_index(),
stubnames=['Length', 'Width'],
i = 'Species',
j = 'part',
sep='.',
suffix='.+')
.reset_index()
)
Species part Length Width
0 setosa Sepal 5.1 3.5
1 virginica Sepal 5.9 3.0
2 setosa Petal 1.4 0.2
3 virginica Petal 5.1 1.8
Via pivot_longer:
df.pivot_longer(index = 'Species',
names_to = ('part', '.value'),
names_sep = '.')
Species part Length Width
0 setosa Sepal 5.1 3.5
1 virginica Sepal 5.9 3.0
2 setosa Petal 1.4 0.2
3 virginica Petal 5.1 1.8
Notice that we did not have to do any column reordering (there are scenarios where column reordering is unavoidable), the function simply paired .value with whatever the split from names_sep gave and outputted the reshaped dataframe. You can even use multiple .value where applicable. Let's revisit an earlier dataframe:
df = pd.DataFrame({'x_1_mean': [10],
'x_2_mean': [20],
'y_1_mean': [30],
'y_2_mean': [40],
'unit': [50]})
df
x_1_mean x_2_mean y_1_mean y_2_mean unit
0 10 20 30 40 50
df.pivot_longer(index = 'unit',
names_to = ('.value', '.value'),
names_pattern = r"(.).+(mean)")
unit xmean ymean
0 50 10 30
1 50 20 40
It is all about seeing the patterns and taking advantage of them. pivot_longer just offers efficient and performant abstractions over common reshaping scenarios - under the hood it is just Pandas/numpy/python.
Hopefully the various answers point you in the right direction when you need to reshape from wide to long.

Transpose specific columns to rows in Pandas [duplicate]

On the pandas tag, I often see users asking questions about melting dataframes in pandas. I am gonna attempt a cannonical Q&A (self-answer) with this topic.
I am gonna clarify:
What is melt?
How do I use melt?
When do I use melt?
I see some hotter questions about melt, like:
Convert columns into rows with Pandas : This one actually could be good, but some more explanation would be better.
Pandas Melt Function : Nice question answer is good, but it's a bit too vague, not much expanation.
Melting a pandas dataframe : Also a nice answer! But it's only for that particular situation, which is pretty simple, only pd.melt(df)
Pandas dataframe use columns as rows (melt) : Very neat! But the problem is that it's only for the specific question the OP asked, which is also required to use pivot_table as well.
So I am gonna attempt a canonical Q&A for this topic.
Dataset:
I will have all my answers on this dataset of random grades for random people with random ages (easier to explain for the answers :D):
import pandas as pd
df = pd.DataFrame({'Name': ['Bob', 'John', 'Foo', 'Bar', 'Alex', 'Tom'],
'Math': ['A+', 'B', 'A', 'F', 'D', 'C'],
'English': ['C', 'B', 'B', 'A+', 'F', 'A'],
'Age': [13, 16, 16, 15, 15, 13]})
>>> df
Name Math English Age
0 Bob A+ C 13
1 John B B 16
2 Foo A B 16
3 Bar F A+ 15
4 Alex D F 15
5 Tom C A 13
>>>
Problems:
I am gonna have some problems and they will be solved in my self-answer below.
Problem 1:
How do I melt a dataframe so that the original dataframe becomes:
Name Age Subject Grade
0 Bob 13 English C
1 John 16 English B
2 Foo 16 English B
3 Bar 15 English A+
4 Alex 17 English F
5 Tom 12 English A
6 Bob 13 Math A+
7 John 16 Math B
8 Foo 16 Math A
9 Bar 15 Math F
10 Alex 17 Math D
11 Tom 12 Math C
I want to transpose this so that one column would be each subject and the other columns would be the repeated names of the students and there age and score.
Problem 2:
This is similar to Problem 1, but this time I want to make the Problem 1 output Subject column only have Math, I want to filter out the English column:
Name Age Subject Grades
0 Bob 13 Math A+
1 John 16 Math B
2 Foo 16 Math A
3 Bar 15 Math F
4 Alex 15 Math D
5 Tom 13 Math C
I want the output to be like the above.
Problem 3:
If I was to group the melt and order the students by there scores, how would I be able to do that, to get the desired output like the below:
value Name Subjects
0 A Foo, Tom Math, English
1 A+ Bob, Bar Math, English
2 B John, John, Foo Math, English, English
3 C Tom, Bob Math, English
4 D Alex Math
5 F Bar, Alex Math, English
I need it to be ordered and the names separated by comma and also the Subjects separated by comma in the same order respectively
Problem 4:
How would I unmelt a melted dataframe? Let's say I already melted this dataframe:
print(df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades'))
To become:
Name Age Subject Grades
0 Bob 13 Math A+
1 John 16 Math B
2 Foo 16 Math A
3 Bar 15 Math F
4 Alex 15 Math D
5 Tom 13 Math C
6 Bob 13 English C
7 John 16 English B
8 Foo 16 English B
9 Bar 15 English A+
10 Alex 15 English F
11 Tom 13 English A
Then how would I translate this back to the original dataframe, the below:
Name Math English Age
0 Bob A+ C 13
1 John B B 16
2 Foo A B 16
3 Bar F A+ 15
4 Alex D F 15
5 Tom C A 13
How would I go about doing this?
Problem 5:
If I was to group by the names of the students and separate the subjects and grades by comma, how would I do it?
Name Subject Grades
0 Alex Math, English D, F
1 Bar Math, English F, A+
2 Bob Math, English A+, C
3 Foo Math, English A, B
4 John Math, English B, B
5 Tom Math, English C, A
I want to have a dataframe like above.
Problem 6:
If I was gonna completely melt my dataframe, all columns as values, how would I do it?
Column Value
0 Name Bob
1 Name John
2 Name Foo
3 Name Bar
4 Name Alex
5 Name Tom
6 Math A+
7 Math B
8 Math A
9 Math F
10 Math D
11 Math C
12 English C
13 English B
14 English B
15 English A+
16 English F
17 English A
18 Age 13
19 Age 16
20 Age 16
21 Age 15
22 Age 15
23 Age 13
I want to have a dataframe like above. All columns as values.
Please check my self-answer below :)
Note for pandas versions < 0.20.0: I will be using df.melt(...) for my examples, but you will need to use pd.melt(df, ...) instead.
Documentation references:
Most of the solutions here would be used with melt, so to know the method melt, see the documentaion explanation
Unpivot a DataFrame from wide to long format, optionally leaving
identifiers set.
This function is useful to massage a DataFrame into a format where one
or more columns are identifier variables (id_vars), while all other
columns, considered measured variables (value_vars), are “unpivoted”
to the row axis, leaving just two non-identifier columns, ‘variable’
and ‘value’.
Parameters
id_vars : tuple, list, or ndarray, optional
Column(s) to use as identifier variables.
value_vars : tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name : scalar
Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
value_name : scalar, default ‘value’
Name to use for the ‘value’ column.
col_level : int or str, optional
If columns are a MultiIndex then use this level to melt.
ignore_index : bool, default True
If True, original index is ignored. If False, the original index is retained. Index labels will be repeated
as necessary.
New in version 1.1.0.
Logic to melting:
Melting merges multiple columns and converts the dataframe from wide to long, for the solution to Problem 1 (see below), the steps are:
First we got the original dataframe.
Then the melt firstly merges the Math and English columns and makes the dataframe replicated (longer).
Then finally adds the column Subject which is the subject of the Grades columns value respectively.
This is the simple logic to what the melt function does.
Solutions:
I will solve my own questions.
Problem 1:
Problem 1 could be solve using pd.DataFrame.melt with the following code:
print(df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades'))
This code passes the id_vars argument to ['Name', 'Age'], then automatically the value_vars would be set to the other columns (['Math', 'English']), which is transposed into that format.
You could also solve Problem 1 using stack like the below:
print(
df.set_index(["Name", "Age"])
.stack()
.reset_index(name="Grade")
.rename(columns={"level_2": "Subject"})
.sort_values("Subject")
.reset_index(drop=True)
)
This code sets the Name and Age columns as the index and stacks the rest of the columns Math and English, and resets the index and assigns Grade as the column name, then renames the other column level_2 to Subject and then sorts by the Subject column, then finally resets the index again.
Both of these solutions output:
Name Age Subject Grade
0 Bob 13 English C
1 John 16 English B
2 Foo 16 English B
3 Bar 15 English A+
4 Alex 17 English F
5 Tom 12 English A
6 Bob 13 Math A+
7 John 16 Math B
8 Foo 16 Math A
9 Bar 15 Math F
10 Alex 17 Math D
11 Tom 12 Math C
Problem 2:
This is similar to my first question, but this one I only one to filter in the Math columns, this time the value_vars argument can come into use, like the below:
print(
df.melt(
id_vars=["Name", "Age"],
value_vars="Math",
var_name="Subject",
value_name="Grades",
)
)
Or we can also use stack with column specification:
print(
df.set_index(["Name", "Age"])[["Math"]]
.stack()
.reset_index(name="Grade")
.rename(columns={"level_2": "Subject"})
.sort_values("Subject")
.reset_index(drop=True)
)
Both of these solutions give:
Name Age Subject Grade
0 Bob 13 Math A+
1 John 16 Math B
2 Foo 16 Math A
3 Bar 15 Math F
4 Alex 15 Math D
5 Tom 13 Math C
Problem 3:
Problem 3 could be solved with melt and groupby, using the agg function with ', '.join, like the below:
print(
df.melt(id_vars=["Name", "Age"])
.groupby("value", as_index=False)
.agg(", ".join)
)
It melts the dataframe then groups by the grades and aggregates them and joins them by a comma.
stack could be also used to solve this problem, with stack and groupby like the below:
print(
df.set_index(["Name", "Age"])
.stack()
.reset_index()
.rename(columns={"level_2": "Subjects", 0: "Grade"})
.groupby("Grade", as_index=False)
.agg(", ".join)
)
This stack function just transposes the dataframe in a way that is equivalent to melt, then resets the index, renames the columns and groups and aggregates.
Both solutions output:
Grade Name Subjects
0 A Foo, Tom Math, English
1 A+ Bob, Bar Math, English
2 B John, John, Foo Math, English, English
3 C Bob, Tom English, Math
4 D Alex Math
5 F Bar, Alex Math, English
Problem 4:
We first melt the dataframe for the input data:
df = df.melt(id_vars=['Name', 'Age'], var_name='Subject', value_name='Grades')
Then now we can start solving this Problem 4.
Problem 4 could be solved with pivot_table, we would have to specify to the pivot_table arguments, values, index, columns and also aggfunc.
We could solve it with the below code:
print(
df.pivot_table("Grades", ["Name", "Age"], "Subject", aggfunc="first")
.reset_index()
.rename_axis(columns=None)
)
Output:
Name Age English Math
0 Alex 15 F D
1 Bar 15 A+ F
2 Bob 13 C A+
3 Foo 16 B A
4 John 16 B B
5 Tom 13 A C
The melted dataframe is converted back to the exact same format as the original dataframe.
We first pivot the melted dataframe and then reset the index and remove the column axis name.
Problem 5:
Problem 5 could be solved with melt and groupby like the following:
print(
df.melt(id_vars=["Name", "Age"], var_name="Subject", value_name="Grades")
.groupby("Name", as_index=False)
.agg(", ".join)
)
That melts and groups by Name.
Or you could stack:
print(
df.set_index(["Name", "Age"])
.stack()
.reset_index()
.groupby("Name", as_index=False)
.agg(", ".join)
.rename({"level_2": "Subjects", 0: "Grades"}, axis=1)
)
Both codes output:
Name Subjects Grades
0 Alex Math, English D, F
1 Bar Math, English F, A+
2 Bob Math, English A+, C
3 Foo Math, English A, B
4 John Math, English B, B
5 Tom Math, English C, A
Problem 6:
Problem 6 could be solved with melt and no column needed to be specified, just specify the expected column names:
print(df.melt(var_name='Column', value_name='Value'))
That melts the whole dataframe
Or you could stack:
print(
df.stack()
.reset_index(level=1)
.sort_values("level_1")
.reset_index(drop=True)
.set_axis(["Column", "Value"], axis=1)
)
Both codes output:
Column Value
0 Age 16
1 Age 15
2 Age 15
3 Age 16
4 Age 13
5 Age 13
6 English A+
7 English B
8 English B
9 English A
10 English F
11 English C
12 Math C
13 Math A+
14 Math D
15 Math B
16 Math F
17 Math A
18 Name Alex
19 Name Bar
20 Name Tom
21 Name Foo
22 Name John
23 Name Bob
Conclusion:
melt is a really handy function, often it's required, once you meet these types of problems, don't forget to try melt, it may well solve your problem.
There is another kind of melt not mentioned in the question, which is that with a dataframe whose column header contains common prefix and you want to melt the suffix to column value.
It is kind of the opposite of question 11 in How can I pivot a dataframe?
Say you have a following DataFrame, and you want to melt 1970, 1980 to column values
A1970 A1980 B1970 B1980 X id
0 a d 2.5 3.2 -1.085631 0
1 b e 1.2 1.3 0.997345 1
2 c f 0.7 0.1 0.282978 2
In this case you can try pandas.wide_to_long
pd.wide_to_long(df, stubnames=["A", "B"], i="id", j="year")
X A B
id year
0 1970 -1.085631 a 2.5
1 1970 0.997345 b 1.2
2 1970 0.282978 c 0.7
0 1980 -1.085631 d 3.2
1 1980 0.997345 e 1.3
2 1980 0.282978 f 0.1
As described here by U12-Forward, melting a dataframe primarily means reshaping the data from wide form to long form. More often than not, the new dataframe will have more rows and less columns compared to the original dataframe.
There are different scenarios when it comes to melting - all column labels could be melted into a single column, or multiple columns; some parts of the column labels could be retained as headers, while the rest are collated into a column, and so on. This answer shows how to melt a pandas dataframe, using pd.stack, pd.melt, pd.wide_to_long and pivot_longer from pyjanitor (I am a contributor to the pyjanitor library). The examples won't be exhaustive, but hopefully should point you in the right direction when it comes to reshaping dataframes from wide to long form.
Sample Data
df = pd.DataFrame(
{'Sepal.Length': [5.1, 5.9],
'Sepal.Width': [3.5, 3.0],
'Petal.Length': [1.4, 5.1],
'Petal.Width': [0.2, 1.8],
'Species': ['setosa', 'virginica']}
)
df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 5.9 3.0 5.1 1.8 virginica
Scenario 1 - Melt all columns:
In this case, we wish to convert all the specified column headers into rows - this can be done with pd.melt or pd.stack, and the solutions to problem 1 already cover this. The reshaping can also be done with pivot_longer
# pip install pyjanitor
import janitor
df.pivot_longer(index = 'Species')
Species variable value
0 setosa Sepal.Length 5.1
1 virginica Sepal.Length 5.9
2 setosa Sepal.Width 3.5
3 virginica Sepal.Width 3.0
4 setosa Petal.Length 1.4
5 virginica Petal.Length 5.1
6 setosa Petal.Width 0.2
7 virginica Petal.Width 1.8
Just like in pd.melt, you can rename the variable and value column, by passing arguments to names_to and values_to parameters:
df.pivot_longer(index = 'Species',
names_to = 'dimension',
values_to = 'measurement_in_cm')
Species dimension measurement_in_cm
0 setosa Sepal.Length 5.1
1 virginica Sepal.Length 5.9
2 setosa Sepal.Width 3.5
3 virginica Sepal.Width 3.0
4 setosa Petal.Length 1.4
5 virginica Petal.Length 5.1
6 setosa Petal.Width 0.2
7 virginica Petal.Width 1.8
You can also retain the original index, and keep the dataframe based on order of appearance:
df.pivot_longer(index = 'Species',
names_to = 'dimension',
values_to = 'measurement_in_cm',
ignore_index = False,
sort_by_appearance=True)
Species dimension measurement_in_cm
0 setosa Sepal.Length 5.1
0 setosa Sepal.Width 3.5
0 setosa Petal.Length 1.4
0 setosa Petal.Width 0.2
1 virginica Sepal.Length 5.9
1 virginica Sepal.Width 3.0
1 virginica Petal.Length 5.1
1 virginica Petal.Width 1.8
By default, the values in names_to are strings; they can be converted to other data types via the names_transform parameter - this can be helpful/performant for large dataframes, as it is generally more efficient compared to converting the data types after the reshaping. Note that this feature is currently only available in the development version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
out = df.pivot_longer(index = 'Species',
names_to = 'dimension',
values_to = 'measurement_in_cm',
ignore_index = False,
sort_by_appearance=True,
names_transform = 'category')
out.dtypes
Species object
dimension category
measurement_in_cm float64
dtype: object
Scenario 2 - Melt column labels into multiple columns:
So far, we've melted our data into single columns, one for the column names and one for the values. However, there might be scenarios where we wish to split the column labels into different columns, or even the values into different columns. Continuing with our sample data, we could prefer to have sepal and petal under a part column, while length and width are into a dimension column:
Via pd.melt - The separation is done after the melt:
out = df.melt(id_vars = 'Species')
arr = out.variable.str.split('.')
(out
.assign(part = arr.str[0],
dimension = arr.str[1])
.drop(columns = 'variable')
)
Species value part dimension
0 setosa 5.1 Sepal Length
1 virginica 5.9 Sepal Length
2 setosa 3.5 Sepal Width
3 virginica 3.0 Sepal Width
4 setosa 1.4 Petal Length
5 virginica 5.1 Petal Length
6 setosa 0.2 Petal Width
7 virginica 1.8 Petal Width
Via pd.stack - offers a more efficient way of splitting the columns; the split is done on the columns, meaning less number of rows to deal with, meaning potentially faster outcome, as the data size increases:
out = df.set_index('Species')
# this returns a MultiIndex
out.columns = out.columns.str.split('.', expand = True)
new_names = ['part', 'dimension']
out.columns.names = new_names
out.stack(new_names).rename('value').reset_index()
Species part dimension value
0 setosa Petal Length 1.4
1 setosa Petal Width 0.2
2 setosa Sepal Length 5.1
3 setosa Sepal Width 3.5
4 virginica Petal Length 5.1
5 virginica Petal Width 1.8
6 virginica Sepal Length 5.9
7 virginica Sepal Width 3.0
Via pivot_longer - The key thing to note about pivot_longer is that it looks for patterns. The column labels are separated by a dot .. Simply pass a list/tuple of new names to names_to, and pass a separator to names_sep (under the hood it just uses pd.str.split):
df.pivot_longer(index = 'Species',
names_to = ('part', 'dimension'),
names_sep='.')
Species part dimension value
0 setosa Sepal Length 5.1
1 virginica Sepal Length 5.9
2 setosa Sepal Width 3.5
3 virginica Sepal Width 3.0
4 setosa Petal Length 1.4
5 virginica Petal Length 5.1
6 setosa Petal Width 0.2
7 virginica Petal Width 1.8
So far, we've seen how melt, stack and pivot_longer can split the column labels into multiple new columns, as long as there is a defined separator. What if there isn't a clearly defined separator, like in the dataframe below :
# https://github.com/tidyverse/tidyr/blob/main/data-raw/who.csv
who = pd.DataFrame({'id': [1], 'new_sp_m5564': [2], 'newrel_f65': [3]})
who
id new_sp_m5564 newrel_f65
0 1 2 3
In the second column we have multiple _, compared to the 3rd column which has just one _. The goal here is to split the column labels into individual columns (sp & rel to diagnosis column, m & f to gender column, the numbers to age column). One option is to extract the column sub labels via regex
Via pd.melt - again with pd.melt, the reshaping occurs after the melt:
out = who.melt('id')
regex = r"new_?(?P<diagnosis>.+)_(?P<gender>.)(?P<age>\d+)"
new_df = out.variable.str.extract(regex)
# pd.concat can be used here instead
out.drop(columns='variable').assign(**new_df)
id value diagnosis gender age
0 1 2 sp m 5564
1 1 3 rel f 65
Note how the extracts occurred for the regex in groups (the one in parentheses).
Via pd.stack - just like in the previous example, the split is done on the columns, offering more in terms of efficiency:
out = who.set_index('id')
regex = r"new_?(.+)_(.)(\d+)"
new_names = ['diagnosis', 'age', 'gender']
# returns a dataframe
new_cols = out.columns.str.extract(regex)
new_cols.columns = new_names
new_cols = pd.MultiIndex.from_frame(new_cols)
out.columns = new_cols
out.stack(new_names).rename('value').reset_index()
id diagnosis age gender value
0 1 rel f 65 3.0
1 1 sp m 5564 2.0
Again, the extracts occur for the regex in groups.
Via pivot_longer - again we know the pattern, and the new column names, we simply pass those to the function, this time we use names_pattern, since we are dealing with a regex. The extracts will match the regular expression in the groups (the ones in parentheses) :
regex = r"new_?(.+)_(.)(\d+)"
new_names = ['diagnosis', 'age', 'gender']
who.pivot_longer(index = 'id',
names_to = new_names,
names_pattern = regex)
id diagnosis age gender value
0 1 sp m 5564 2
1 1 rel f 65 3
Scenario 3 - Melt column labels and values into multiple columns:
What if we wish to split the values into multiple columns as well? Let's use a fairly popular question on SO:
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover'],
'State': ['Texas', 'Texas', 'Alabama'],
'Name':['Aria', 'Penelope', 'Niko'],
'Mango':[4, 10, 90],
'Orange': [10, 8, 14],
'Watermelon':[40, 99, 43],
'Gin':[16, 200, 34],
'Vodka':[20, 33, 18]},
columns=['City', 'State', 'Name', 'Mango', 'Orange', 'Watermelon', 'Gin', 'Vodka'])
df
City State Name Mango Orange Watermelon Gin Vodka
0 Houston Texas Aria 4 10 40 16 20
1 Austin Texas Penelope 10 8 99 200 33
2 Hoover Alabama Niko 90 14 43 34 18
The goal is to collate Mango, Orange, and Watermelon into a fruits column, Gin and Vodka into a Drinks column, and collate the respective values into Pounds and Ounces respectively.
Via pd.melt - I am copying the excellent solution verbatim :
df1 = df.melt(id_vars=['City', 'State'],
value_vars=['Mango', 'Orange', 'Watermelon'],
var_name='Fruit', value_name='Pounds')
df2 = df.melt(id_vars=['City', 'State'],
value_vars=['Gin', 'Vodka'],
var_name='Drink', value_name='Ounces')
df1 = df1.set_index(['City', 'State', df1.groupby(['City', 'State']).cumcount()])
df2 = df2.set_index(['City', 'State', df2.groupby(['City', 'State']).cumcount()])
df3 = (pd.concat([df1, df2],axis=1)
.sort_index(level=2)
.reset_index(level=2, drop=True)
.reset_index())
print (df3)
City State Fruit Pounds Drink Ounces
0 Austin Texas Mango 10 Gin 200.0
1 Hoover Alabama Mango 90 Gin 34.0
2 Houston Texas Mango 4 Gin 16.0
3 Austin Texas Orange 8 Vodka 33.0
4 Hoover Alabama Orange 14 Vodka 18.0
5 Houston Texas Orange 10 Vodka 20.0
6 Austin Texas Watermelon 99 NaN NaN
7 Hoover Alabama Watermelon 43 NaN NaN
8 Houston Texas Watermelon 40 NaN NaN
Via pd.stack - I can't think of a solution via stack, so I'll skip
Via pivot_longer - The reshape can be efficiently done by passing the list of names to names_to and values_to, and pass a list of regular expressions to names_pattern- when splitting values into multiple columns, a list of regex to names_pattern is required:
df.pivot_longer(
index=["City", "State"],
column_names=slice("Mango", "Vodka"),
names_to=("Fruit", "Drink"),
values_to=("Pounds", "Ounces"),
names_pattern=[r"M|O|W", r"G|V"],
)
City State Fruit Pounds Drink Ounces
0 Houston Texas Mango 4 Gin 16.0
1 Austin Texas Mango 10 Gin 200.0
2 Hoover Alabama Mango 90 Gin 34.0
3 Houston Texas Orange 10 Vodka 20.0
4 Austin Texas Orange 8 Vodka 33.0
5 Hoover Alabama Orange 14 Vodka 18.0
6 Houston Texas Watermelon 40 None NaN
7 Austin Texas Watermelon 99 None NaN
8 Hoover Alabama Watermelon 43 None NaN
The efficiency is even more as the dataframe size increases.
Scenario 4 - Group similar columns together:
Extending the concept of melting into multiple columns, let's say we wish to group similar columns together. We do not care about retaining the column labels, just combining the values of similar columns into new columns.
df = pd.DataFrame({'x_1_mean': [10],
'x_2_mean': [20],
'y_1_mean': [30],
'y_2_mean': [40],
'unit': [50]})
df
x_1_mean x_2_mean y_1_mean y_2_mean unit
0 10 20 30 40 50
For the code above, we wish to combine similar columns (columns that start with the same letter) into new unique columns - all x* columns will be lumped under x_mean, while all y* columns will be collated under y_mean. We are not saving the column labels, we are only interested in the values of these columns:
Via pd.melt - one possible way via melt is to run it via groupby on the columns:
out = df.set_index('unit')
grouped = out.columns.str.split('_\d_').str.join('')
# group on the split
grouped = out.groupby(grouped, axis = 1)
# iterate, melt individually, and recombine to get a new dataframe
out = {key : frame.melt(ignore_index = False).value
for key, frame in grouped}
pd.DataFrame(out).reset_index()
unit xmean ymean
0 50 10 30
1 50 20 40
Via pd.stack - Here we split the columns and build a MultiIndex:
out = df.set_index('unit')
split = out.columns.str.split('_(\d)_')
split = [(f"{first}{last}", middle)
for first, middle, last
in split]
out.columns = pd.MultiIndex.from_tuples(split)
out.stack(-1).droplevel(-1).reset_index()
unit xmean ymean
0 50 10 30
1 50 20 40
Via pd.wide_to_long - Here we reorder the sub labels - move the numbers to the end of the columns:
out = df.set_index('unit')
out.columns = [f"{first}{last}_{middle}"
for first, middle, last
in out.columns.str.split('_(\d)_')]
(pd
.wide_to_long(
out.reset_index(),
stubnames = ['xmean', 'ymean'],
i = 'unit',
j = 'num',
sep = '_')
.droplevel(-1)
.reset_index()
)
unit xmean ymean
0 50 10 30
1 50 20 40
Via pivot_longer - Again, with pivot_longer, it is all about the patterns. Simply pass a list of new column names to names_to, and the corresponding regular expressions to names_pattern:
df.pivot_longer(index = 'unit',
names_to = ['xmean', 'ymean'],
names_pattern = ['x', 'y']
)
unit xmean ymean
0 50 10 30
1 50 20 40
Note that with this pattern it is on a first come first serve basis - if the column order was flipped, pivot_longer would give a different output. Lets see this in action:
# reorder the columns in a different form:
df = df.loc[:, ['x_1_mean', 'x_2_mean', 'y_2_mean', 'y_1_mean', 'unit']]
df
x_1_mean x_2_mean y_2_mean y_1_mean unit
0 10 20 40 30 50
Because the order has changed, x_1_mean will be paired with y_2_mean, because that is the first y column it sees, while x_2_mean gets paired with y_1_mean:
df.pivot_longer(index = 'unit',
names_to = ['xmean', 'ymean'],
names_pattern = ['x', 'y']
)
unit xmean ymean
0 50 10 40
1 50 20 30
Note the difference in the output compared to the previous run. This is something to note when using names_pattern with a sequence. Order matters.
Scenario 5 - Retain part of the column names as headers:
This might probably be one of the biggest use cases when reshaping to long form. Some parts of the column label we may wish to keep as header, and move the remaining columns to new columns (or even ignore them).
Let's revisit our iris dataframe:
df = pd.DataFrame(
{'Sepal.Length': [5.1, 5.9],
'Sepal.Width': [3.5, 3.0],
'Petal.Length': [1.4, 5.1],
'Petal.Width': [0.2, 1.8],
'Species': ['setosa', 'virginica']}
)
df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 5.9 3.0 5.1 1.8 virginica
Our goal here is to keep Sepal, Petal as column names, and the rest (Length, Width) are collated into a dimension column:
Via pd.melt - A pivot is used after melting into long form :
out = df.melt(id_vars = 'Species')
arr = out.variable.str.split('.')
(out
.assign(part = arr.str[0],
dimension = arr.str[1])
.pivot(['Species', 'dimension'], 'part', 'value')
.rename_axis(columns = None)
.reset_index()
)
Species dimension Petal Sepal
0 setosa Length 1.4 5.1
1 setosa Width 0.2 3.5
2 virginica Length 5.1 5.9
3 virginica Width 1.8 3.0
This is not as efficient as other options below, as this involves wide to long, then long to wide, this might have poor performance on large enough dataframe.
Via pd.stack - This offers more efficiency as most of the reshaping is on the columns - less is more.
out = df.set_index('Species')
out.columns = out.columns.str.split('.', expand = True)
out.columns.names = [None, 'dimension']
out.stack('dimension').reset_index()
Species dimension Petal Sepal
0 setosa Length 1.4 5.1
1 setosa Width 0.2 3.5
2 virginica Length 5.1 5.9
3 virginica Width 1.8 3.0
Via pd.wide_to_long - Straightforward - simply pass in the relevant arguments:
(pd
.wide_to_long(
df,
stubnames=['Sepal', 'Petal'],
i = 'Species',
j = 'dimension',
sep='.',
suffix='.+')
.reset_index()
)
Species dimension Sepal Petal
0 setosa Length 5.1 1.4
1 virginica Length 5.9 5.1
2 setosa Width 3.5 0.2
3 virginica Width 3.0 1.8
As the data size increases, pd.wide_to_long might not be as efficient.
Via pivot_longer : Again, back to patterns. Since we are keeping a part of the column as header, we use .value as a placeholder. The function sees the .value and knows that that sub label has to remain as a header. The split in the columns can either be by names_sep or names_pattern. In this case, it is simpler to use names_sep :
df.pivot_longer(index = 'Species',
names_to = ('.value', 'dimension'),
names_sep = '.')
Species dimension Sepal Petal
0 setosa Length 5.1 1.4
1 virginica Length 5.9 5.1
2 setosa Width 3.5 0.2
3 virginica Width 3.0 1.8
When the column is split with ., we have Petal, Length. When compared with ('.value', 'dimension'), Petal is associated with .value, while Length is associated with dimension. Petal stays as a column header, while Length is lumped into the dimension column. We didn't need to be explicit about the column name, we just use .value and let the function do the heavy work. This way, if you have lots of columns, you don't need to work out what the columns to stay as headers should be, as long as you have the right pattern via names_sep or names_pattern.
What if we want the Length/Width as column names instead, and Petal/Sepal get lumped into a part column:
Via pd.melt
out = df.melt(id_vars = 'Species')
arr = out.variable.str.split('.')
(out
.assign(part = arr.str[0],
dimension = arr.str[1])
.pivot(['Species', 'part'], 'dimension', 'value')
.rename_axis(columns = None)
.reset_index()
)
Species part Length Width
0 setosa Petal 1.4 0.2
1 setosa Sepal 5.1 3.5
2 virginica Petal 5.1 1.8
3 virginica Sepal 5.9 3.0
Via pd.stack:
out = df.set_index('Species')
out.columns = out.columns.str.split('.', expand = True)
out.columns.names = ['part', None]
out.stack('part').reset_index()
Species part Length Width
0 setosa Petal 1.4 0.2
1 setosa Sepal 5.1 3.5
2 virginica Petal 5.1 1.8
3 virginica Sepal 5.9 3.0
Via pd.wide_to_long - First, we need to reorder the columns, such that Length/Width are at the front:
out = df.set_index('Species')
out.columns = out.columns.str.split('.').str[::-1].str.join('.')
(pd
.wide_to_long(
out.reset_index(),
stubnames=['Length', 'Width'],
i = 'Species',
j = 'part',
sep='.',
suffix='.+')
.reset_index()
)
Species part Length Width
0 setosa Sepal 5.1 3.5
1 virginica Sepal 5.9 3.0
2 setosa Petal 1.4 0.2
3 virginica Petal 5.1 1.8
Via pivot_longer:
df.pivot_longer(index = 'Species',
names_to = ('part', '.value'),
names_sep = '.')
Species part Length Width
0 setosa Sepal 5.1 3.5
1 virginica Sepal 5.9 3.0
2 setosa Petal 1.4 0.2
3 virginica Petal 5.1 1.8
Notice that we did not have to do any column reordering (there are scenarios where column reordering is unavoidable), the function simply paired .value with whatever the split from names_sep gave and outputted the reshaped dataframe. You can even use multiple .value where applicable. Let's revisit an earlier dataframe:
df = pd.DataFrame({'x_1_mean': [10],
'x_2_mean': [20],
'y_1_mean': [30],
'y_2_mean': [40],
'unit': [50]})
df
x_1_mean x_2_mean y_1_mean y_2_mean unit
0 10 20 30 40 50
df.pivot_longer(index = 'unit',
names_to = ('.value', '.value'),
names_pattern = r"(.).+(mean)")
unit xmean ymean
0 50 10 30
1 50 20 40
It is all about seeing the patterns and taking advantage of them. pivot_longer just offers efficient and performant abstractions over common reshaping scenarios - under the hood it is just Pandas/numpy/python.
Hopefully the various answers point you in the right direction when you need to reshape from wide to long.

Rank the highest number of downloaded app in every genre and filter only top 2 apps from each genre

I am trying to get the highest top 2 downloaded app from the below data set, here is the dataset im using
import pandas as pd
df = pd.DataFrame(data={'genre_id': ['tools', 'tools', 'VIDEO_PLAYERS',
'VIDEO_PLAYERS', 'PHOTOGRAPHY'],
'app_id':['MP3Cutter','PhotoCutter','VLC','MXPlayer','Picasa'],
'min_installs': [10, 100, 10, 20,1000]})
df
This is what i have tried
df['default_rank'] = df.groupby(['genre_id']).agg(['rank'])
df.sort_values(by='default_rank')
I am getting output such like this :
genre_id app_id min_installs default_rank
0 tools MP3Cutter 10 1.0
2 VIDEO_PLAYERS VLC 10 1.0
4 PHOTOGRAPHY Picasa 1000 1.0
1 tools PhotoCutter 100 2.0
3 VIDEO_PLAYERS MXPlayer 20 2.0
But i want to get something like this :
genre_id app_id min_installs default_rank
4 PHOTOGRAPHY Picasa 1000 1.0
1 tools PhotoCutter 100 1.0
0 tools MP3Cutter 10 2.0
3 VIDEO_PLAYERS MXPlayer 20 1.0
2 VIDEO_PLAYERS VLC 10 2.0
Im new to pandas, Is the Advance data manipulations are possible using python pandas? like we do in SQL ?
I believe you need sorting by DataFrameGroupBy.rank and also by maximal values per groups, then use:
s = df.groupby('genre_id')['min_installs'].transform('max')
df['default_rank'] = (df.groupby('genre_id')['min_installs']
.rank(method='max', ascending=False))
df = (df.assign(m=s)
.sort_values(by=['m', 'default_rank', 'genre_id'],
ascending=[False, True, True])
.drop('m', axis=1))
If need filter top2 values per groups:
df = df.groupby('genre_id').head(2)
print (df)
genre_id app_id min_installs default_rank
4 PHOTOGRAPHY Picasa 1000 1.0
1 tools PhotoCutter 100 1.0
0 tools MP3Cutter 10 2.0
3 VIDEO_PLAYERS MXPlayer 20 1.0
2 VIDEO_PLAYERS VLC 10 2.0

Convert dataframe from long to wide with custom column names [duplicate]

I have data in long format and am trying to reshape to wide, but there doesn't seem to be a straightforward way to do this using melt/stack/unstack:
Salesman Height product price
Knut 6 bat 5
Knut 6 ball 1
Knut 6 wand 3
Steve 5 pen 2
Becomes:
Salesman Height product_1 price_1 product_2 price_2 product_3 price_3
Knut 6 bat 5 ball 1 wand 3
Steve 5 pen 2 NA NA NA NA
I think Stata can do something like this with the reshape command.
Here's another solution more fleshed out, taken from Chris Albon's site.
Create "long" dataframe
raw_data = {'patient': [1, 1, 1, 2, 2],
'obs': [1, 2, 3, 1, 2],
'treatment': [0, 1, 0, 1, 0],
'score': [6252, 24243, 2345, 2342, 23525]}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])
Make a "wide" data
df.pivot(index='patient', columns='obs', values='score')
A simple pivot might be sufficient for your needs but this is what I did to reproduce your desired output:
df['idx'] = df.groupby('Salesman').cumcount()
Just adding a within group counter/index will get you most of the way there but the column labels will not be as you desired:
print df.pivot(index='Salesman',columns='idx')[['product','price']]
product price
idx 0 1 2 0 1 2
Salesman
Knut bat ball wand 5 1 3
Steve pen NaN NaN 2 NaN NaN
To get closer to your desired output I added the following:
df['prod_idx'] = 'product_' + df.idx.astype(str)
df['prc_idx'] = 'price_' + df.idx.astype(str)
product = df.pivot(index='Salesman',columns='prod_idx',values='product')
prc = df.pivot(index='Salesman',columns='prc_idx',values='price')
reshape = pd.concat([product,prc],axis=1)
reshape['Height'] = df.set_index('Salesman')['Height'].drop_duplicates()
print reshape
product_0 product_1 product_2 price_0 price_1 price_2 Height
Salesman
Knut bat ball wand 5 1 3 6
Steve pen NaN NaN 2 NaN NaN 5
Edit: if you want to generalize the procedure to more variables I think you could do something like the following (although it might not be efficient enough):
df['idx'] = df.groupby('Salesman').cumcount()
tmp = []
for var in ['product','price']:
df['tmp_idx'] = var + '_' + df.idx.astype(str)
tmp.append(df.pivot(index='Salesman',columns='tmp_idx',values=var))
reshape = pd.concat(tmp,axis=1)
#Luke said:
I think Stata can do something like this with the reshape command.
You can but I think you also need a within group counter to get the reshape in stata to get your desired output:
+-------------------------------------------+
| salesman idx height product price |
|-------------------------------------------|
1. | Knut 0 6 bat 5 |
2. | Knut 1 6 ball 1 |
3. | Knut 2 6 wand 3 |
4. | Steve 0 5 pen 2 |
+-------------------------------------------+
If you add idx then you could do reshape in stata:
reshape wide product price, i(salesman) j(idx)
Karl D's solution gets at the heart of the problem. But I find it's far easier to pivot everything (with .pivot_table because of the two index columns) and then sort and assign the columns to collapse the MultiIndex:
df['idx'] = df.groupby('Salesman').cumcount()+1
df = df.pivot_table(index=['Salesman', 'Height'], columns='idx',
values=['product', 'price'], aggfunc='first')
df = df.sort_index(axis=1, level=1)
df.columns = [f'{x}_{y}' for x,y in df.columns]
df = df.reset_index()
Output:
Salesman Height price_1 product_1 price_2 product_2 price_3 product_3
0 Knut 6 5.0 bat 1.0 ball 3.0 wand
1 Steve 5 2.0 pen NaN NaN NaN NaN
A bit old but I will post this for other people.
What you want can be achieved, but you probably shouldn't want it ;)
Pandas supports hierarchical indexes for both rows and columns.
In Python 2.7.x ...
from StringIO import StringIO
raw = '''Salesman Height product price
Knut 6 bat 5
Knut 6 ball 1
Knut 6 wand 3
Steve 5 pen 2'''
dff = pd.read_csv(StringIO(raw), sep='\s+')
print dff.set_index(['Salesman', 'Height', 'product']).unstack('product')
Produces a probably more convenient representation than what you were looking for
price
product ball bat pen wand
Salesman Height
Knut 6 1 5 NaN 3
Steve 5 NaN NaN 2 NaN
The advantage of using set_index and unstacking vs a single function as pivot is that you can break the operations down into clear small steps, which simplifies debugging.
pivoted = df.pivot('salesman', 'product', 'price')
pg. 192 Python for Data Analysis
An old question; this is an addition to the already excellent answers. pivot_wider from pyjanitor may be helpful as an abstraction for reshaping from long to wide (it is a wrapper around pd.pivot):
# pip install pyjanitor
import pandas as pd
import janitor
idx = df.groupby(['Salesman', 'Height']).cumcount().add(1)
(df.assign(idx = idx)
.pivot_wider(index = ['Salesman', 'Height'], names_from = 'idx')
)
Salesman Height product_1 product_2 product_3 price_1 price_2 price_3
0 Knut 6 bat ball wand 5.0 1.0 3.0
1 Steve 5 pen NaN NaN 2.0 NaN NaN

subselect columns pandas with count

I have a table below
I was trying to create an additional column to count if Std_1,Std_2 and Std_3 greater than its mean value.
for example, for ACCMGR Row, only Std_2 is greater than the average, so the new column should be 1.
Not sure how to do it.
You need to be a bit careful with how you specify the axes, but you can just use .gt + .mean + .sum
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'APPL': ['ACCMGR', 'ACCOUNTS', 'ADVISOR', 'AUTH', 'TEST'],
'Std_1': [106.875, 121.703, np.NaN, 116.8585, 1],
'Std_2': [130.1899, 113.4927, np.NaN, 112.4486, 4],
'Std_3': [107.186, 114.5418, np.NaN, 115.2699, np.NaN]})
Code
df = df.set_index('APPL')
df['cts'] = df.gt(df.mean(axis=1), axis=0).sum(axis=1)
df = df.reset_index()
Output:
APPL Std_1 Std_2 Std_3 cts
0 ACCMGR 106.8750 130.1899 107.1860 1
1 ACCOUNTS 121.7030 113.4927 114.5418 1
2 ADVISOR NaN NaN NaN 0
3 AUTH 116.8585 112.4486 115.2699 2
4 TEST 1.0000 4.0000 NaN 1
Considered dataframe
quantity price
0 6 1.45
1 3 1.85
2 2 2.25
apply lambda function on axis =1 , for each series of row check the column of value greater than mean and get the index of column
df.apply(lambda x:df.columns.get_loc(x[x>np.mean(x)].index[0]),axis=1)
Out:
quantity price > than mean
0 6 1.45 0
1 3 1.85 0
2 2 2.25 1

Resources