This is the DF:
amount cost
5 NaN
7 NaN
9 78.0
6 80.0
12 NaN
14 NaN
And I only want to fill 50% of the NANs so that I would get something like this:
amount cost
5 'hello'
7 NaN
9 78.0
6 80.0
12 NaN
14 'hello'
And is it possible to fill lets say 28% of the missing data with bigger dataSets.
Thanks for help.
We can do
idx=df.index[df.cost.isna()]
df.loc[np.random.choice(idx, size=int(len(idx)/2) ,replace=False),'cost']='somevalue'
df
Out[16]:
amount cost
0 5 NaN
1 7 somevalue
2 9 78
3 6 80
4 12 somevalue
5 14 NaN
Try with df.update()
nans = df.loc[df.cost.isna(), ]
nans.iloc[:int(len(nans) * 0.5), 'cost'] = 'hello'
df.update(nans.cost)
I have a df that looks like this:
answerRequired answerTime choiceId \
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 False 1.564541e+12 1542213646976
14 False 1.564541e+12 1542213646984
15 True 1.564541e+12 1542213646994
16 True 1.564541e+12 1542213647040
17 True 1.564541e+12 1542213647041
18 True 1.564541e+12 1542213647042
19 True 1.564541e+12 1542213647043
20 False 1.564541e+12 NaN
choiceLabel \
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 Give it a shot! Hit the arrow below! Don't be ...
14
15 T-Shirts
16 Band / Music
17 Fun
18 TV
19 Movies
20 NaN
exportLabel logicalType \
0 Participant ID NaN
1 Viewed NaN
2 Started NaN
3 Completed NaN
4 Time spent (HH:MM:SS.SSS) NaN
5 Country NaN
6 City NaN
7 IP NaN
8 Operating System NaN
9 Browser NaN
10 Device NaN
11 External ID NaN
12 Warnings NaN
13 It's all about the green arrow! (not that Gree... singleSelection
14 Make your choice. 2. Hit the green arrow at th... singleSelection
15 What are you most interested in? (Pick one) (T... singleSelection
16 We have the threads that you want! What kind o... multipleSelection
17 We have the threads that you want! What kind o... multipleSelection
18 We have the threads that you want! What kind o... multipleSelection
19 We have the threads that you want! What kind o... multipleSelection
20 NaN text
question questionId \
0 NaN participantId
1 NaN viewTime
2 NaN startedTime
3 NaN completedTime
4 NaN timeSpent
5 NaN country_name
6 NaN city
7 NaN ip
8 NaN os
9 NaN browser
10 NaN device
11 NaN externalId
12 NaN warnings
13 It's all about the green arrow! (not that Gree... 1542213646975
14 Make your choice. 2. Hit the green arrow at th... 1542213646983
15 What are you most interested in? (Pick one) 1542213646991
16 We have the threads that you want! What kind o... 1542213647039
17 We have the threads that you want! What kind o... 1542213647039
18 We have the threads that you want! What kind o... 1542213647039
19 We have the threads that you want! What kind o... 1542213647039
20 Almost Done! Enter Your Email Address! 1542213647050
questionOrder subType type value \
0 NaN NaN id -Ll4truw3KbSjVRtXmJy
1 NaN NaN time 2019-07-31T02:41:34.063Z
2 NaN NaN time 2019-07-31T02:44:37.732Z
3 NaN NaN time 2019-07-31T02:44:57.936Z
4 NaN NaN time 00:00:00.000
5 NaN NaN location Unknown
6 NaN NaN location Roslindale
7 NaN NaN location
8 NaN NaN device macOS 10.14
9 NaN NaN device Firefox 68.0
10 NaN NaN device
11 NaN NaN id
12 NaN NaN info []
13 0.0 singleSelection mediaGallery True
14 2.0 singleSelection mediaGallery True
15 4.0 singleSelection mediaGallery True
16 12.0 multipleSelection mediaGallery True
17 12.0 multipleSelection mediaGallery True
18 12.0 multipleSelection mediaGallery True
19 12.0 multipleSelection mediaGallery True
20 14.0 NaN emailBox 123456789#yellow.com
visualType
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 mediaGallery
14 mediaGallery
15 mediaGallery
16 mediaGallery
17 mediaGallery
18 mediaGallery
19 mediaGallery
20 emailBox
How do I cut the dataframe and turn it too look like this:
I tried this:
df.T.stack()
df_stack_test.T.groupby('level_1')[0].apply(lambda x: pd.Series(list(x))).unstack().T
but these are turning the data around without aggregating the data.
At a high level I want to:
flip exportLabel column values into columns and value column values into values under the column values from exportLabel, only where question column is null.
Then I want to flip the question column values into columns where it is not null and values from choicelabel under the question column. Note the questions with same question, are collapsed into one column. The exception is that the last value in question column has the choice under the value column.
We can drop rest of the columns for now. Also I can post the original json string that I am trying to flatten from the API.
EDIT:
Here is the json string:
{"id":"4","survey_id":"-L","response_id":"-L","response_url":"data":[{"type":"id","questionId":"participantId","exportLabel":"Participant ID","value":"-Ll4truw3KbSjVRtXmJy"},{"type":"time","questionId":"viewTime","exportLabel":"Viewed","value":"2019-07-31T02:41:34.063Z"},{"type":"time","questionId":"startedTime","exportLabel":"Started","value":"2019-07-31T02:44:37.732Z"},{"type":"time","questionId":"completedTime","exportLabel":"Completed","value":"2019-07-31T02:44:57.936Z"},{"type":"time","questionId":"timeSpent","exportLabel":"Time spent (HH:MM:SS.SSS)","value":"00:00:00.000"},{"type":"location","questionId":"country_name","exportLabel":"Country","value":"Unknown"},{"type":"location","questionId":"city","exportLabel":"City","value":"Roslindale"},{"type":"location","questionId":"ip","exportLabel":"IP","value":""},{"type":"device","questionId":"os","exportLabel":"Operating System","value":"macOS 10.14"},{"type":"device","questionId":"browser","exportLabel":"Browser","value":"Firefox 68.0"},{"type":"device","questionId":"device","exportLabel":"Device","value":""},{"type":"id","questionId":"externalId","exportLabel":"External ID","value":""},{"type":"info","questionId":"warnings","exportLabel":"Warnings","value":[]},{"logicalType":"singleSelection","choiceId":"1542213646976","choiceLabel":"Give it a shot! Hit the arrow below! Don't be shy!","exportLabel":"It's all about the green arrow! (not that Green Arrow!) 1. Make your choice. 2. Hit the green arrow at the bottom! (Give it a shot! Hit the arrow below! Don't be shy!)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":false,"questionId":1542213646975,"questionOrder":0,"question":"It's all about the green arrow! (not that Green Arrow!) 1. Make your choice. 2. Hit the green arrow at the bottom!","subType":"singleSelection","answerTime":1564541080009},{"logicalType":"singleSelection","choiceId":"1542213646984","choiceLabel":"","exportLabel":"Make your choice. 2. Hit the green arrow at the bottom! ()","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":false,"questionId":1542213646983,"questionOrder":2,"question":"Make your choice. 2. Hit the green arrow at the bottom!","subType":"singleSelection","answerTime":1564541081044},{"logicalType":"singleSelection","choiceId":"1542213646994","choiceLabel":"T-Shirts","exportLabel":"What are you most interested in? (Pick one) (T-Shirts)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213646991,"questionOrder":4,"question":"What are you most interested in? (Pick one)","subType":"singleSelection","answerTime":1564541083354},{"logicalType":"multipleSelection","choiceId":"1542213647040","choiceLabel":"Band / Music","exportLabel":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) (Band / Music)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213647039,"questionOrder":12,"question":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!)","subType":"multipleSelection","answerTime":1564541086280},{"logicalType":"multipleSelection","choiceId":"1542213647041","choiceLabel":"Fun","exportLabel":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) (Fun)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213647039,"questionOrder":12,"question":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!)","subType":"multipleSelection","answerTime":1564541086280},{"logicalType":"multipleSelection","choiceId":"1542213647042","choiceLabel":"TV","exportLabel":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) (TV)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213647039,"questionOrder":12,"question":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!)","subType":"multipleSelection","answerTime":1564541086280},{"logicalType":"multipleSelection","choiceId":"1542213647043","choiceLabel":"Movies","exportLabel":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) (Movies)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213647039,"questionOrder":12,"question":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!)","subType":"multipleSelection","answerTime":1564541086280},{"type":"emailBox","visualType":"emailBox","answerRequired":false,"questionId":1542213647050,"questionOrder":14,"question":"Almost Done! Enter Your Email Address!","answerTime":1564541097466,"logicalType":"text","value":"123456789#yellow.com"}]}
I transform the string into the first df like so:
from pandas.io.json import json_normalize
import pandas as pd
import json
with open('jsonfile') as json_file:
data = json.load(json_normalize(json_file))
df = json_normalize(data['data'])
Idea is filter rows by conditions by boolean indexing and then reshape by GroupBy.cumcount for counter and DataFrame.unstack, if need same order like in original add DataFrame.reindex:
First part is:
df1 = data.loc[data['question'].isna(), ['exportLabel','value']]
print (df1)
exportLabel value
0 Participant ID -Ll4truw3KbSjVRtXmJy
1 Viewed 2019-07-31T02:41:34.063Z
2 Started 2019-07-31T02:44:37.732Z
3 Completed 2019-07-31T02:44:57.936Z
4 Time spent (HH:MM:SS.SSS) 00:00:00.000
5 Country Unknown
6 City Roslindale
7 IP
8 Operating System macOS 10.14
9 Browser Firefox 68.0
10 Device
11 External ID
12 Warnings []
df11 = (df1.set_index([df1.groupby('exportLabel').cumcount(),
'exportLabel'])['value']
.unstack()
.rename_axis(None, axis=1)
.reindex(df1['exportLabel'].unique(), axis=1)
)
print (df11)
Participant ID Viewed Started \
0 -Ll4truw3KbSjVRtXmJy 2019-07-31T02:41:34.063Z 2019-07-31T02:44:37.732Z
Completed Time spent (HH:MM:SS.SSS) Country City IP \
0 2019-07-31T02:44:57.936Z 00:00:00.000 Unknown Roslindale
Operating System Browser Device External ID Warnings
0 macOS 10.14 Firefox 68.0 []
And second:
df2 = data.loc[data['question'].notna(), ['question','value','choiceLabel']]
#if need replace all missing values by value column
#df2['choiceLabel'] = df2['choiceLabel'].fillna(df2['value'])
#if need replace only last value if missing
idx = df2.index[[-1]]
df2.loc[idx,'choiceLabel'] = df2.loc[idx,'choiceLabel'].fillna(df2.loc[idx,'value'])
print (df2)
question value \
13 It's all about the green arrow! (not that Gree... True
14 Make your choice. 2. Hit the green arrow at th... True
15 What are you most interested in? (Pick one) True
16 We have the threads that you want! What kind o... True
17 We have the threads that you want! What kind o... True
18 We have the threads that you want! What kind o... True
19 We have the threads that you want! What kind o... True
20 Almost Done! Enter Your Email Address! 123456789#yellow.com
choiceLabel
13 Give it a shot! Hit the arrow below! Don't be ...
14
15 T-Shirts
16 Band / Music
17 Fun
18 TV
19 Movies
20 123456789#yellow.com
df21 = (df2.set_index([df2.groupby('question').cumcount(),
'question'])['choiceLabel']
.unstack()
.rename_axis(None, axis=1)
.reindex(df2['question'].unique(), axis=1)
)
print (df21)
It's all about the green arrow! (not that Green Arrow!) 1. Make your choice. 2. Hit the green arrow at the bottom! \
0 Give it a shot! Hit the arrow below! Don't be ...
1 NaN
2 NaN
3 NaN
Make your choice. 2. Hit the green arrow at the bottom! \
0
1 NaN
2 NaN
3 NaN
What are you most interested in? (Pick one) \
0 T-Shirts
1 NaN
2 NaN
3 NaN
We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) \
0 Band / Music
1 Fun
2 TV
3 Movies
Almost Done! Enter Your Email Address!
0 123456789#yellow.com
1 NaN
2 NaN
3 NaN
I have a data frame like this,
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14
4 Total 6223 16
5 RRA 7222 18
6 MLQ 5648 45
Now, I need to extract rows/new dataframe that has rows until Total that is in Name column.
Output needed:
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14
I tried this,
df[df.Name.str.contains("Total", na=False)]
This is not helpful for now. Any suggestion would be great.
Select the index where the True value is located and slice using df.iloc:
df_new=df.iloc[:df.loc[df.Name.str.contains('Total',na=False)].index[0]]
or using series.idxmax() which allows you to get the index of max value (max of True/False is True):
df_new=df.iloc[:df.Name.str.contains('Total',na=False).idxmax()]
print(df_new)
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14