How to melt a dataframe into a long form?

I have the following dataframe
recycling 1 metric tonne (1000 kilogram) per waste type Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 1 barrel oil is approximately 159 litres of oil NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 material Plastic Glass Ferrous Metal Non-Ferrous Metal Paper
3 energy_saved 5774 Kwh 42 Kwh 642 Kwh 14000 Kwh 4000 kWh
4 crude_oil saved 16 barrels NaN 1.8 barrels 40 barrels 1.7 barrels
I want to turn rows 2, 3, and 4 into columns of a new dataframe. It should look something like this:
material energy_saved crude_oil saved
plastic 5774Kwh 16 barrels
Glass 42 Kwh NaN
... ... ...
I tried using .melt, but it did not work.
Note that each column name and its values sit in a single row. I just want them in a new dataframe as a column name with its values.

IIUC, is it just:
out = df.loc[[2,3,4],:].T.reset_index(drop=True)
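After the transpose, the labels (material, energy_saved, crude_oil saved) land in the first row rather than in the header. A minimal sketch of promoting that row to column names, assuming the frame is rebuilt by hand from the printout in the question:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt by hand from the printout in the question.
cols = ["recycling 1 metric tonne (1000 kilogram) per waste type"] + [
    f"Unnamed: {i}" for i in range(1, 6)
]
df = pd.DataFrame(
    [
        ["1 barrel oil is approximately 159 litres of oil"] + [np.nan] * 5,
        [np.nan] * 6,
        ["material", "Plastic", "Glass", "Ferrous Metal", "Non-Ferrous Metal", "Paper"],
        ["energy_saved", "5774 Kwh", "42 Kwh", "642 Kwh", "14000 Kwh", "4000 kWh"],
        ["crude_oil saved", "16 barrels", np.nan, "1.8 barrels", "40 barrels", "1.7 barrels"],
    ],
    columns=cols,
)

# Transpose rows 2-4 so that each of them becomes a column.
out = df.loc[[2, 3, 4]].T.reset_index(drop=True)

# The first transposed row holds the labels: use it as the header, then drop it.
out.columns = out.iloc[0]
out = out.iloc[1:].reset_index(drop=True)
out.columns.name = None

print(out)
```

No melt is involved here; the reshape is a plain transpose followed by a header fix.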

Related

Extract rows when a column value is not na in pandas dataframe

I am trying to understand why I am getting NaN for all rows when I extract non na values in a specific column. This happens only when I read in the excel file. It works fine with the csv
df=pd.read_excel('q.xlsx',sheet_name=None)
cols=['Name','Age','City']
for k,v in df.items():
    if k=="Sheet1":
        mod_cols=v.columns.to_list()
        #The below is to filter on the column that is extra apart from the ones defined in cols.
        #The reason I am doing this because I have multiple sheets in
        #the excel file and when I iterate over the entire excel file, I want to filter on that additional column in each
        #of those sheets. For this example, will focus on the first sheet
        diff=set(mod_cols)-set(cols)
        #diff is State in this case
        d=v[~v[diff].isna()]
d
Name Age City State
0 NaN NaN NaN NaN
1 NaN NaN NaN NJ
2 NaN NaN NaN NaN
3 NaN NaN NaN NY
4 NaN NaN NaN NaN
5 NaN NaN NaN NC
6 NaN NaN NaN NaN
However with csv, it returns perfectly
df=pd.read_csv('q.csv')
d=df[~df['State'].isna()]
d
Name Age City State
1 Joe 31 Newark NJ
3 Mike 32 NYC NY
5 Moe 33 Durham NC
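A likely explanation, offered as a sketch rather than a confirmed diagnosis: in the Excel version `diff` is a collection of labels, so `v[diff]` selects a one-column DataFrame, and indexing a DataFrame with a boolean DataFrame behaves like `where` (cells outside the mask become NaN) instead of filtering rows. The CSV version indexes with the single label 'State', which gives a boolean Series and therefore a row filter. The frame below is an assumption: the three complete rows come from the CSV output above, while the rows with a missing State are made-up placeholders. Newer pandas versions may also reject a set as an indexer, so the sketch uses a list.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Sheet1; rows 0, 2, 4, 6 are placeholders.
v = pd.DataFrame({
    "Name": ["Ann", "Joe", "Bob", "Mike", "Cid", "Moe", "Dee"],
    "Age": [30, 31, 30, 32, 30, 33, 30],
    "City": ["X", "Newark", "X", "NYC", "X", "Durham", "X"],
    "State": [np.nan, "NJ", np.nan, "NY", np.nan, "NC", np.nan],
})
cols = ["Name", "Age", "City"]

# Extra columns, kept as a list so the original column order is preserved.
extra = [c for c in v.columns if c not in cols]

# DataFrame mask: acts like v.where(mask), so unmasked cells turn into NaN.
masked = v[v[extra].notna()]

# Row filter: keep only the rows where every extra column is present.
filtered = v.dropna(subset=extra)

print(masked)
print(filtered)
```

With a single extra column, `v[v[extra[0]].notna()]` would give the same rows as `dropna(subset=extra)`.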

Is it possible to only fill in 50% of the missing values in pandas?

This is the DF:
amount cost
5 NaN
7 NaN
9 78.0
6 80.0
12 NaN
14 NaN
And I only want to fill 50% of the NANs so that I would get something like this:
amount cost
5 'hello'
7 NaN
9 78.0
6 80.0
12 NaN
14 'hello'
Is it also possible to fill, let's say, 28% of the missing data in bigger datasets?
Thanks for help.

We can sample half of the NaN positions at random:
import numpy as np
idx = df.index[df.cost.isna()]
df.loc[np.random.choice(idx, size=int(len(idx) / 2), replace=False), 'cost'] = 'somevalue'
df
Out[16]:
amount cost
0 5 NaN
1 7 somevalue
2 9 78
3 6 80
4 12 somevalue
5 14 NaN
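For an arbitrary share, such as the 28% mentioned in the question, the same idea can be wrapped in a small helper. This is only a sketch: `fill_fraction` and its parameters are hypothetical names, and the column is cast to object first so that a string can be stored next to the floats.

```python
import numpy as np
import pandas as pd


def fill_fraction(df, column, value, frac, seed=None):
    """Return a copy of df with about `frac` of the NaNs in `column` set to `value`."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Cast to object so a string value can sit in a formerly float column.
    out[column] = out[column].astype(object)
    idx = out.index[out[column].isna()]
    n_fill = int(len(idx) * frac)  # rounds down, like int(len(idx)/2) above
    chosen = rng.choice(idx, size=n_fill, replace=False)
    out.loc[chosen, column] = value
    return out


df = pd.DataFrame({
    "amount": [5, 7, 9, 6, 12, 14],
    "cost": [np.nan, np.nan, 78.0, 80.0, np.nan, np.nan],
})

half = fill_fraction(df, "cost", "hello", 0.5, seed=0)
some = fill_fraction(df, "cost", "hello", 0.28, seed=0)
print(half)
```

Because the count is rounded down, 28% of four missing values fills only one of them; on larger data the share gets closer to the requested fraction.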

Try with df.update()
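# select the 'cost' column by label first; iloc is positional and does not accept 'cost'
nans = df.loc[df.cost.isna(), 'cost'].astype(object)
nans.iloc[:int(len(nans) * 0.5)] = 'hello'
# cast to object so the column can hold strings, then copy the non-NaN updates back
df['cost'] = df['cost'].astype(object)
df.update(nans)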

How to slice only portions of dataframes and rotate them into new dataframes regardless of row size?

I have a df that looks like this:
answerRequired answerTime choiceId \
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 False 1.564541e+12 1542213646976
14 False 1.564541e+12 1542213646984
15 True 1.564541e+12 1542213646994
16 True 1.564541e+12 1542213647040
17 True 1.564541e+12 1542213647041
18 True 1.564541e+12 1542213647042
19 True 1.564541e+12 1542213647043
20 False 1.564541e+12 NaN
choiceLabel \
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 Give it a shot! Hit the arrow below! Don't be ...
14
15 T-Shirts
16 Band / Music
17 Fun
18 TV
19 Movies
20 NaN
exportLabel logicalType \
0 Participant ID NaN
1 Viewed NaN
2 Started NaN
3 Completed NaN
4 Time spent (HH:MM:SS.SSS) NaN
5 Country NaN
6 City NaN
7 IP NaN
8 Operating System NaN
9 Browser NaN
10 Device NaN
11 External ID NaN
12 Warnings NaN
13 It's all about the green arrow! (not that Gree... singleSelection
14 Make your choice. 2. Hit the green arrow at th... singleSelection
15 What are you most interested in? (Pick one) (T... singleSelection
16 We have the threads that you want! What kind o... multipleSelection
17 We have the threads that you want! What kind o... multipleSelection
18 We have the threads that you want! What kind o... multipleSelection
19 We have the threads that you want! What kind o... multipleSelection
20 NaN text
question questionId \
0 NaN participantId
1 NaN viewTime
2 NaN startedTime
3 NaN completedTime
4 NaN timeSpent
5 NaN country_name
6 NaN city
7 NaN ip
8 NaN os
9 NaN browser
10 NaN device
11 NaN externalId
12 NaN warnings
13 It's all about the green arrow! (not that Gree... 1542213646975
14 Make your choice. 2. Hit the green arrow at th... 1542213646983
15 What are you most interested in? (Pick one) 1542213646991
16 We have the threads that you want! What kind o... 1542213647039
17 We have the threads that you want! What kind o... 1542213647039
18 We have the threads that you want! What kind o... 1542213647039
19 We have the threads that you want! What kind o... 1542213647039
20 Almost Done! Enter Your Email Address! 1542213647050
questionOrder subType type value \
0 NaN NaN id -Ll4truw3KbSjVRtXmJy
1 NaN NaN time 2019-07-31T02:41:34.063Z
2 NaN NaN time 2019-07-31T02:44:37.732Z
3 NaN NaN time 2019-07-31T02:44:57.936Z
4 NaN NaN time 00:00:00.000
5 NaN NaN location Unknown
6 NaN NaN location Roslindale
7 NaN NaN location
8 NaN NaN device macOS 10.14
9 NaN NaN device Firefox 68.0
10 NaN NaN device
11 NaN NaN id
12 NaN NaN info []
13 0.0 singleSelection mediaGallery True
14 2.0 singleSelection mediaGallery True
15 4.0 singleSelection mediaGallery True
16 12.0 multipleSelection mediaGallery True
17 12.0 multipleSelection mediaGallery True
18 12.0 multipleSelection mediaGallery True
19 12.0 multipleSelection mediaGallery True
20 14.0 NaN emailBox 123456789#yellow.com
visualType
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 mediaGallery
14 mediaGallery
15 mediaGallery
16 mediaGallery
17 mediaGallery
18 mediaGallery
19 mediaGallery
20 emailBox
How do I cut the dataframe and turn it into the layout described below?
I tried this:
df.T.stack()
df_stack_test.T.groupby('level_1')[0].apply(lambda x: pd.Series(list(x))).unstack().T
but these only turn the data around without aggregating it.
At a high level I want to:
Flip the exportLabel values into columns, with the matching entries from the value column underneath, only where the question column is null.
Then flip the question values into columns where question is not null, with the choiceLabel values underneath. Rows that share the same question are collapsed into one column. The exception is the last question, whose answer sits in the value column instead of choiceLabel.
We can drop the rest of the columns for now. I can also post the original JSON string from the API that I am trying to flatten.
EDIT:
Here is the json string:
{"id":"4","survey_id":"-L","response_id":"-L","response_url":"data":[{"type":"id","questionId":"participantId","exportLabel":"Participant ID","value":"-Ll4truw3KbSjVRtXmJy"},{"type":"time","questionId":"viewTime","exportLabel":"Viewed","value":"2019-07-31T02:41:34.063Z"},{"type":"time","questionId":"startedTime","exportLabel":"Started","value":"2019-07-31T02:44:37.732Z"},{"type":"time","questionId":"completedTime","exportLabel":"Completed","value":"2019-07-31T02:44:57.936Z"},{"type":"time","questionId":"timeSpent","exportLabel":"Time spent (HH:MM:SS.SSS)","value":"00:00:00.000"},{"type":"location","questionId":"country_name","exportLabel":"Country","value":"Unknown"},{"type":"location","questionId":"city","exportLabel":"City","value":"Roslindale"},{"type":"location","questionId":"ip","exportLabel":"IP","value":""},{"type":"device","questionId":"os","exportLabel":"Operating System","value":"macOS 10.14"},{"type":"device","questionId":"browser","exportLabel":"Browser","value":"Firefox 68.0"},{"type":"device","questionId":"device","exportLabel":"Device","value":""},{"type":"id","questionId":"externalId","exportLabel":"External ID","value":""},{"type":"info","questionId":"warnings","exportLabel":"Warnings","value":[]},{"logicalType":"singleSelection","choiceId":"1542213646976","choiceLabel":"Give it a shot! Hit the arrow below! Don't be shy!","exportLabel":"It's all about the green arrow! (not that Green Arrow!) 1. Make your choice. 2. Hit the green arrow at the bottom! (Give it a shot! Hit the arrow below! Don't be shy!)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":false,"questionId":1542213646975,"questionOrder":0,"question":"It's all about the green arrow! (not that Green Arrow!) 1. Make your choice. 2. Hit the green arrow at the bottom!","subType":"singleSelection","answerTime":1564541080009},{"logicalType":"singleSelection","choiceId":"1542213646984","choiceLabel":"","exportLabel":"Make your choice. 2. 
Hit the green arrow at the bottom! ()","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":false,"questionId":1542213646983,"questionOrder":2,"question":"Make your choice. 2. Hit the green arrow at the bottom!","subType":"singleSelection","answerTime":1564541081044},{"logicalType":"singleSelection","choiceId":"1542213646994","choiceLabel":"T-Shirts","exportLabel":"What are you most interested in? (Pick one) (T-Shirts)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213646991,"questionOrder":4,"question":"What are you most interested in? (Pick one)","subType":"singleSelection","answerTime":1564541083354},{"logicalType":"multipleSelection","choiceId":"1542213647040","choiceLabel":"Band / Music","exportLabel":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) (Band / Music)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213647039,"questionOrder":12,"question":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!)","subType":"multipleSelection","answerTime":1564541086280},{"logicalType":"multipleSelection","choiceId":"1542213647041","choiceLabel":"Fun","exportLabel":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) (Fun)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213647039,"questionOrder":12,"question":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!)","subType":"multipleSelection","answerTime":1564541086280},{"logicalType":"multipleSelection","choiceId":"1542213647042","choiceLabel":"TV","exportLabel":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) 
(TV)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213647039,"questionOrder":12,"question":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!)","subType":"multipleSelection","answerTime":1564541086280},{"logicalType":"multipleSelection","choiceId":"1542213647043","choiceLabel":"Movies","exportLabel":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) (Movies)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213647039,"questionOrder":12,"question":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!)","subType":"multipleSelection","answerTime":1564541086280},{"type":"emailBox","visualType":"emailBox","answerRequired":false,"questionId":1542213647050,"questionOrder":14,"question":"Almost Done! Enter Your Email Address!","answerTime":1564541097466,"logicalType":"text","value":"123456789#yellow.com"}]}
I transform the string into the first df like so:
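import json
import pandas as pd

with open('jsonfile') as json_file:
    # parse the file first; json_normalize expects parsed data, not a file handle
    data = json.load(json_file)
# pd.json_normalize replaces the deprecated pandas.io.json.json_normalize (pandas >= 1.0)
df = pd.json_normalize(data['data'])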

The idea is to filter rows by condition with boolean indexing, then reshape using GroupBy.cumcount as a counter together with DataFrame.unstack; if you need the same column order as in the original, add DataFrame.reindex. Here data is the DataFrame that the question calls df.
The first part is:
df1 = data.loc[data['question'].isna(), ['exportLabel','value']]
print (df1)
exportLabel value
0 Participant ID -Ll4truw3KbSjVRtXmJy
1 Viewed 2019-07-31T02:41:34.063Z
2 Started 2019-07-31T02:44:37.732Z
3 Completed 2019-07-31T02:44:57.936Z
4 Time spent (HH:MM:SS.SSS) 00:00:00.000
5 Country Unknown
6 City Roslindale
7 IP
8 Operating System macOS 10.14
9 Browser Firefox 68.0
10 Device
11 External ID
12 Warnings []
df11 = (df1.set_index([df1.groupby('exportLabel').cumcount(),
'exportLabel'])['value']
.unstack()
.rename_axis(None, axis=1)
.reindex(df1['exportLabel'].unique(), axis=1)
)
print (df11)
Participant ID Viewed Started \
0 -Ll4truw3KbSjVRtXmJy 2019-07-31T02:41:34.063Z 2019-07-31T02:44:37.732Z
Completed Time spent (HH:MM:SS.SSS) Country City IP \
0 2019-07-31T02:44:57.936Z 00:00:00.000 Unknown Roslindale
Operating System Browser Device External ID Warnings
0 macOS 10.14 Firefox 68.0 []
And second:
df2 = data.loc[data['question'].notna(), ['question','value','choiceLabel']]
# if you need to replace all missing values with the value column:
#df2['choiceLabel'] = df2['choiceLabel'].fillna(df2['value'])
# if you need to replace only the last value when it is missing:
idx = df2.index[[-1]]
df2.loc[idx,'choiceLabel'] = df2.loc[idx,'choiceLabel'].fillna(df2.loc[idx,'value'])
print (df2)
question value \
13 It's all about the green arrow! (not that Gree... True
14 Make your choice. 2. Hit the green arrow at th... True
15 What are you most interested in? (Pick one) True
16 We have the threads that you want! What kind o... True
17 We have the threads that you want! What kind o... True
18 We have the threads that you want! What kind o... True
19 We have the threads that you want! What kind o... True
20 Almost Done! Enter Your Email Address! 123456789#yellow.com
choiceLabel
13 Give it a shot! Hit the arrow below! Don't be ...
14
15 T-Shirts
16 Band / Music
17 Fun
18 TV
19 Movies
20 123456789#yellow.com
df21 = (df2.set_index([df2.groupby('question').cumcount(),
'question'])['choiceLabel']
.unstack()
.rename_axis(None, axis=1)
.reindex(df2['question'].unique(), axis=1)
)
print (df21)
It's all about the green arrow! (not that Green Arrow!) 1. Make your choice. 2. Hit the green arrow at the bottom! \
0 Give it a shot! Hit the arrow below! Don't be ...
1 NaN
2 NaN
3 NaN
Make your choice. 2. Hit the green arrow at the bottom! \
0
1 NaN
2 NaN
3 NaN
What are you most interested in? (Pick one) \
0 T-Shirts
1 NaN
2 NaN
3 NaN
We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) \
0 Band / Music
1 Fun
2 TV
3 Movies
Almost Done! Enter Your Email Address!
0 123456789#yellow.com
1 NaN
2 NaN
3 NaN
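The cumcount plus unstack pattern used in both parts can be tried on a much smaller frame. The sketch below is an illustration under assumptions: the five-row `data` frame is a made-up stand-in for the survey export, and `widen` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the survey frame: two metadata rows, three answer rows.
data = pd.DataFrame({
    "exportLabel": ["Participant ID", "City", "Q1 (A)", "Q2 (B)", "Q2 (C)"],
    "question": [np.nan, np.nan, "Q1", "Q2", "Q2"],
    "choiceLabel": [np.nan, np.nan, "A", "B", "C"],
    "value": ["p-1", "Roslindale", True, True, True],
})


def widen(frame, key, values):
    """Turn each distinct `key` into a column holding its `values` entries."""
    # cumcount numbers repeated keys 0, 1, 2, ... so duplicates get separate rows
    counter = frame.groupby(key).cumcount()
    return (
        frame.set_index([counter, key])[values]
        .unstack()
        .rename_axis(None, axis=1)
        # unstack sorts the columns; restore the order of first appearance
        .reindex(frame[key].unique(), axis=1)
    )


meta = widen(data[data["question"].isna()], "exportLabel", "value")
answers = widen(data[data["question"].notna()], "question", "choiceLabel")

print(meta)
print(answers)
```

The counter is what lets a multiple-selection question keep several answers in one column: each repeated question gets its own row number instead of colliding in the index.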

Extract Rows Until a Certain Row with a Certain Word in a Column in Pandas

I have a data frame like this,
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14
4 Total 6223 16
5 RRA 7222 18
6 MLQ 5648 45
Now I need to extract a new dataframe containing the rows before the one where the Name column contains Total.
Output needed:
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14
I tried this,
df[df.Name.str.contains("Total", na=False)]
This does not give what I need. Any suggestion would be great.

Find the index of the first row where the condition is True and slice up to it using df.iloc:
df_new=df.iloc[:df.loc[df.Name.str.contains('Total',na=False)].index[0]]
or use Series.idxmax(), which returns the index of the maximum value (the max of True/False is True, so this is the first match):
df_new=df.iloc[:df.Name.str.contains('Total',na=False).idxmax()]
print(df_new)
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14
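Both one-liners pass an index label to df.iloc, which lines up with positions only for a default 0..n-1 index, and idxmax returns the first label when nothing matches, which would give an empty result. A position-based sketch that covers both cases is shown below; `rows_before` is a hypothetical helper name, and returning the whole frame when the word is absent is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": [np.nan, np.nan, np.nan, np.nan, "Total", "RRA", "MLQ"],
    "Product": [1010, 2010, 4145, 5225, 6223, 7222, 5648],
    "Quantity": [10, 12, 18, 14, 16, 18, 45],
})


def rows_before(frame, column, word):
    """Return the rows that precede the first row whose `column` contains `word`."""
    hit = frame[column].str.contains(word, na=False).to_numpy()
    if not hit.any():
        # Assumption: with no match, keep every row.
        return frame.copy()
    # argmax on a boolean array gives the position of the first True.
    return frame.iloc[: hit.argmax()]


df_new = rows_before(df, "Name", "Total")
print(df_new)
```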

Number of NaN values before first non NaN value Python dataframe

I have a dataframe with several columns, some of them contain NaN values. I would like for each row to create another column containing the total number of columns minus the number of NaN values before the first non NaN value.
Original dataframe:
ID Value0 Value1 Value2 Value3
1 10 10 8 15
2 NaN 45 52 NaN
3 NaN NaN NaN NaN
4 NaN NaN 100 150
The extra column would look like:
ID NewColumn
1 4
2 3
3 0
4 2
Thanks in advance!

Set the index to ID
Attach a non-null column to stop/catch the argmax
Use argmax to find the first non-null value
Subtract those values from the length of the relevant columns
df.assign(
NewColumn=
df.shape[1] - 1 -
df.set_index('ID').assign(notnull=1).notnull().values.argmax(1)
)
ID Value0 Value1 Value2 Value3 NewColumn
0 1 10.0 10.0 8.0 15.0 4
1 2 NaN 45.0 52.0 NaN 3
2 3 NaN NaN NaN NaN 0
3 4 NaN NaN 100.0 150.0 2
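The steps above can be spelled out on the sample data as follows. This is a sketch of the same argmax idea written with explicit intermediate variables; the names `present`, `sentinel`, and `first_valid` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Value0": [10, np.nan, np.nan, np.nan],
    "Value1": [10, 45, np.nan, np.nan],
    "Value2": [8, 52, np.nan, 100],
    "Value3": [15, np.nan, np.nan, 150],
})

# Step 1: move ID out of the way so only the value columns remain.
values = df.set_index("ID")

# Step 2: boolean matrix of non-null cells, plus an always-True sentinel column
# so that an all-NaN row resolves to position n_cols instead of 0.
present = values.notna().to_numpy()
sentinel = np.ones((len(values), 1), dtype=bool)

# Step 3: argmax returns the position of the first True in each row,
# which equals the number of leading NaNs.
first_valid = np.hstack([present, sentinel]).argmax(axis=1)

# Step 4: subtract the leading-NaN count from the number of value columns.
result = df.assign(NewColumn=values.shape[1] - first_valid)
print(result)
```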
