Extract Rows Until a Certain Row with a Certain Word in a Column in Pandas - python-3.x

I have a data frame like this,
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14
4 Total 6223 16
5 RRA 7222 18
6 MLQ 5648 45
Now, I need to extract the rows (as a new dataframe) up to, but not including, the row where the Name column contains Total.
Output needed:
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14
I tried this,
df[df.Name.str.contains("Total", na=False)]
This only returns the row that contains Total, which is not what I need. Any suggestion would be great.

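Find the index of the first row where the condition is True and slice up to it with df.iloc (this assumes the default 0..n-1 integer index, so that labels and positions coincide):
df_new = df.iloc[:df.loc[df.Name.str.contains('Total', na=False)].index[0]]
Or use Series.idxmax(), which returns the index label of the first maximum (the max of True/False is True):
df_new = df.iloc[:df.Name.str.contains('Total', na=False).idxmax()]
Note that if no row matches, the first form raises an IndexError and the second returns an empty dataframe, because idxmax() then points at the first row.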
print(df_new)
Name Product Quantity
0 NaN 1010 10
1 NaN 2010 12
2 NaN 4145 18
3 NaN 5225 14
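A minimal sketch of the same idea that works by position, so it does not depend on the default integer index, and that returns all rows when the word is never found. The helper name rows_before is hypothetical, and the frame is rebuilt from the question's data:

```python
import numpy as np
import pandas as pd

# Data rebuilt from the question.
df = pd.DataFrame({
    "Name": [np.nan, np.nan, np.nan, np.nan, "Total", "RRA", "MLQ"],
    "Product": [1010, 2010, 4145, 5225, 6223, 7222, 5648],
    "Quantity": [10, 12, 18, 14, 16, 18, 45],
})

def rows_before(frame, column, word):
    """Return the rows located before the first row whose `column` contains `word`."""
    # regex=False treats `word` as a literal string; na=False maps NaN to False.
    mask = frame[column].str.contains(word, na=False, regex=False).to_numpy()
    if not mask.any():
        # Assumption: when nothing matches, keep every row.
        return frame.copy()
    # argmax on a boolean array gives the position of the first True.
    return frame.iloc[:mask.argmax()]

df_new = rows_before(df, "Name", "Total")
print(df_new)
```

Because the slice uses positions from the NumPy mask, the same call should also work after something like df.set_index("Product").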

Related

create a python function as store_id & date as input and output as the previous date's sales

I have this store_df DataFrame:
store_id date sales
0 1 2023-1-2 11
1 2 2023-1-3 22
2 3 2023-1-4 33
3 1 2023-1-5 44
4 2 2023-1-6 55
5 3 2023-1-7 66
6 1 2023-1-8 77
7 2 2023-1-9 88
8 3 2023-1-10 99
I was not able to solve this in an interview.
This was the exact question asked:
Create a dataset with 3 columns: store_id, date, sales.
Create 3 store_id values. Each store_id has 3 consecutive dates.
Sales are recorded for 9 rows. We are considering the same 9 dates across all stores.
Sales can be any random number.
Write a function that fetches the previous day's sales as output once we give store_id & date as input.
The question can be handled in multiple ways.
If you want to just get the previous row per group, assuming that the values are consecutive and sorted by increasing dates, use a groupby.shift:
store_df['prev_day_sales'] = store_df.groupby('store_id')['sales'].shift()
Output (for an example input in which the dates repeat across stores):
store_id date sales prev_day_sales
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 55.0
8 3 2023-01-04 99 66.0
If you really want the previous calendar day's value (not the previous available day), use a merge:
store_df['date'] = pd.to_datetime(store_df['date'])
store_df.merge(
    store_df.assign(date=lambda d: d['date'].add(pd.Timedelta('1D'))),
    on=['store_id', 'date'], suffixes=(None, '_prev_day'), how='left'
)
Note. This makes it easy to handle other deltas, like business days (replace pd.Timedelta('1D') with pd.offsets.BusinessDay(1)).
Example (with a different input):
store_id date sales sales_prev_day
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 NaN # there is no data for 2023-01-04
8 3 2023-01-04 99 66.0
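Since the interview question asks for a function, here is a minimal sketch of a direct lookup, assuming the example input shown in the table above. The function name previous_day_sales and the choice to return None when no row exists are assumptions, not part of the original answer:

```python
import pandas as pd

# Example input matching the table above (store 2 has no row for 2023-01-04).
store_df = pd.DataFrame({
    "store_id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "date": pd.to_datetime([
        "2023-01-02", "2023-01-02", "2023-01-02",
        "2023-01-03", "2023-01-03", "2023-01-03",
        "2023-01-04", "2023-01-05", "2023-01-04",
    ]),
    "sales": [11, 22, 33, 44, 55, 66, 77, 88, 99],
})

def previous_day_sales(frame, store_id, date):
    """Return the sales of `store_id` on the calendar day before `date`, or None."""
    prev_day = pd.Timestamp(date) - pd.Timedelta(days=1)
    match = frame.loc[
        (frame["store_id"] == store_id) & (frame["date"] == prev_day), "sales"
    ]
    # Assumption: a missing previous day is reported as None rather than NaN.
    return match.iloc[0] if not match.empty else None

print(previous_day_sales(store_df, 1, "2023-01-03"))
print(previous_day_sales(store_df, 2, "2023-01-05"))
```

This does one boolean filter per call; for many lookups the merge above is likely the better fit.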

INDEX and MATCH with multiple criteria in Pandas Python

I am trying to do an index match across two data sets but am having trouble. Here is an example of what I am trying to do. I want to fill in the empty columns "a", "b", and "c" in df with the "Total Count" values from df2 where "Machine" and "Year" match and "Order Type" names the column.
The first dataframe, let's call this one "df":
Machine Year Cost a b c
0 abc 2014 5500 nan nan nan
1 abc 2015 89 nan nan nan
2 abc 2016 600 nan nan nan
3 abc 2017 250 nan nan nan
4 abc 2018 2100 nan nan nan
5 abc 2019 590 nan nan nan
6 dcb 2020 3000 nan nan nan
7 dcb 2021 100 nan nan nan
The second data set is called "df2"
Order Type Machine Year Total Count
0 a abc 2014 1
1 b abc 2014 1
2 c abc 2014 2
4 c dcb 2015 4
3 a abc 2016 3
Final Output is:
Machine Year Cost a b c
0 abc 2014 5500 1 1 2
1 abc 2015 89 nan nan nan
2 abc 2016 600 3 nan nan
3 abc 2017 250 nan nan nan
4 abc 2018 2100 nan nan nan
5 abc 2019 590 1 nan nan
6 dcb 2014 3000 nan nan 4
7 dcb 2015 100 nan nan nan
Thanks for help in advance
Consider DataFrame.pivot to reshape df2 into one column per Order Type, then merge the result with df:
final_df = (
    df.reindex(["Machine", "Year", "Cost"], axis="columns")
    .merge(
        df2.pivot(
            index=["Machine", "Year"],
            columns="Order Type",
            values="Total Count"
        ).reset_index(),
        on=["Machine", "Year"],
        how="left"  # keep the rows of df that have no match in df2
    )
)
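For reference, a self-contained sketch of the pivot-then-merge approach using the sample data from the question. It assumes a pandas version in which pivot accepts a list for index (1.1 or later), and the intermediate name wide is only illustrative:

```python
import pandas as pd

# Sample frames rebuilt from the question (the empty a/b/c columns are omitted).
df = pd.DataFrame({
    "Machine": ["abc"] * 6 + ["dcb"] * 2,
    "Year": [2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021],
    "Cost": [5500, 89, 600, 250, 2100, 590, 3000, 100],
})
df2 = pd.DataFrame({
    "Order Type": ["a", "b", "c", "c", "a"],
    "Machine": ["abc", "abc", "abc", "dcb", "abc"],
    "Year": [2014, 2014, 2014, 2015, 2016],
    "Total Count": [1, 1, 2, 4, 3],
})

# One row per (Machine, Year), one column per Order Type.
wide = df2.pivot(
    index=["Machine", "Year"], columns="Order Type", values="Total Count"
).reset_index()

# A left merge keeps every row of df; unmatched rows get NaN in a/b/c.
final_df = df.merge(wide, on=["Machine", "Year"], how="left")
print(final_df)
```

With this input, the dcb/2015 row of df2 finds no partner because df only has dcb for 2020 and 2021.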

Is it possible to only fill in 50% of the missing values in pandas?

This is the DF:
amount cost
5 NaN
7 NaN
9 78.0
6 80.0
12 NaN
14 NaN
And I only want to fill 50% of the NaNs so that I would get something like this:
amount cost
5 'hello'
7 NaN
9 78.0
6 80.0
12 NaN
14 'hello'
And is it possible to fill, let's say, 28% of the missing data in bigger datasets?
Thanks for help.
We can do this (with numpy imported as np):
idx = df.index[df.cost.isna()]
df.loc[np.random.choice(idx, size=int(len(idx) / 2), replace=False), 'cost'] = 'somevalue'
df
Out[16]:
amount cost
0 5 NaN
1 7 somevalue
2 9 78
3 6 80
4 12 somevalue
5 14 NaN
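Try with df.update(), which here fills the first half of the missing rows:
df['cost'] = df['cost'].astype(object)  # let the column hold strings
nans = df.loc[df.cost.isna()].copy()
# .iloc needs integer positions, so look up the position of the 'cost' column
nans.iloc[:int(len(nans) * 0.5), nans.columns.get_loc('cost')] = 'hello'
df.update(nans.cost)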
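To cover the "28% on a bigger dataset" part of the question, the random-choice approach can be wrapped in a small helper. This is only a sketch: the name fill_fraction and its parameters are hypothetical, and the number of cells to fill is rounded down.

```python
import numpy as np
import pandas as pd

def fill_fraction(df, column, value, frac, seed=None):
    """Return a copy of df with a random `frac` of the NaNs in `column` set to `value`."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Cast to object so a float column can also hold strings such as 'hello'.
    out[column] = out[column].astype(object)
    missing = out.index[out[column].isna()]
    n_fill = int(len(missing) * frac)  # rounded down
    chosen = rng.choice(missing, size=n_fill, replace=False)
    out.loc[chosen, column] = value
    return out

df = pd.DataFrame({
    "amount": [5, 7, 9, 6, 12, 14],
    "cost": [np.nan, np.nan, 78.0, 80.0, np.nan, np.nan],
})
filled = fill_fraction(df, "cost", "hello", 0.5, seed=0)
print(filled)
```

Calling fill_fraction(big_df, "cost", "hello", 0.28) would fill roughly 28% of the missing cells; which ones are chosen depends on the seed.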

How to slice only portions of a dataframe and rotate them into new dataframes regardless of row size?

I have a df that looks like this:
answerRequired answerTime choiceId \
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 False 1.564541e+12 1542213646976
14 False 1.564541e+12 1542213646984
15 True 1.564541e+12 1542213646994
16 True 1.564541e+12 1542213647040
17 True 1.564541e+12 1542213647041
18 True 1.564541e+12 1542213647042
19 True 1.564541e+12 1542213647043
20 False 1.564541e+12 NaN
choiceLabel \
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 Give it a shot! Hit the arrow below! Don't be ...
14
15 T-Shirts
16 Band / Music
17 Fun
18 TV
19 Movies
20 NaN
exportLabel logicalType \
0 Participant ID NaN
1 Viewed NaN
2 Started NaN
3 Completed NaN
4 Time spent (HH:MM:SS.SSS) NaN
5 Country NaN
6 City NaN
7 IP NaN
8 Operating System NaN
9 Browser NaN
10 Device NaN
11 External ID NaN
12 Warnings NaN
13 It's all about the green arrow! (not that Gree... singleSelection
14 Make your choice. 2. Hit the green arrow at th... singleSelection
15 What are you most interested in? (Pick one) (T... singleSelection
16 We have the threads that you want! What kind o... multipleSelection
17 We have the threads that you want! What kind o... multipleSelection
18 We have the threads that you want! What kind o... multipleSelection
19 We have the threads that you want! What kind o... multipleSelection
20 NaN text
question questionId \
0 NaN participantId
1 NaN viewTime
2 NaN startedTime
3 NaN completedTime
4 NaN timeSpent
5 NaN country_name
6 NaN city
7 NaN ip
8 NaN os
9 NaN browser
10 NaN device
11 NaN externalId
12 NaN warnings
13 It's all about the green arrow! (not that Gree... 1542213646975
14 Make your choice. 2. Hit the green arrow at th... 1542213646983
15 What are you most interested in? (Pick one) 1542213646991
16 We have the threads that you want! What kind o... 1542213647039
17 We have the threads that you want! What kind o... 1542213647039
18 We have the threads that you want! What kind o... 1542213647039
19 We have the threads that you want! What kind o... 1542213647039
20 Almost Done! Enter Your Email Address! 1542213647050
questionOrder subType type value \
0 NaN NaN id -Ll4truw3KbSjVRtXmJy
1 NaN NaN time 2019-07-31T02:41:34.063Z
2 NaN NaN time 2019-07-31T02:44:37.732Z
3 NaN NaN time 2019-07-31T02:44:57.936Z
4 NaN NaN time 00:00:00.000
5 NaN NaN location Unknown
6 NaN NaN location Roslindale
7 NaN NaN location
8 NaN NaN device macOS 10.14
9 NaN NaN device Firefox 68.0
10 NaN NaN device
11 NaN NaN id
12 NaN NaN info []
13 0.0 singleSelection mediaGallery True
14 2.0 singleSelection mediaGallery True
15 4.0 singleSelection mediaGallery True
16 12.0 multipleSelection mediaGallery True
17 12.0 multipleSelection mediaGallery True
18 12.0 multipleSelection mediaGallery True
19 12.0 multipleSelection mediaGallery True
20 14.0 NaN emailBox 123456789#yellow.com
visualType
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 mediaGallery
14 mediaGallery
15 mediaGallery
16 mediaGallery
17 mediaGallery
18 mediaGallery
19 mediaGallery
20 emailBox
How do I cut the dataframe and turn it to look like this:
I tried this:
df.T.stack()
df_stack_test.T.groupby('level_1')[0].apply(lambda x: pd.Series(list(x))).unstack().T
but these are turning the data around without aggregating the data.
At a high level I want to:
Flip the exportLabel column values into columns, with the value column values underneath them, only where the question column is null.
Then flip the question column values into columns where it is not null, with the values from choiceLabel underneath. Note that rows with the same question are collapsed into one column. The exception is the last value in the question column, whose answer is in the value column.
We can drop the rest of the columns for now. I can also post the original JSON string that I am trying to flatten from the API.
EDIT:
Here is the json string:
{"id":"4","survey_id":"-L","response_id":"-L","response_url":"data":[{"type":"id","questionId":"participantId","exportLabel":"Participant ID","value":"-Ll4truw3KbSjVRtXmJy"},{"type":"time","questionId":"viewTime","exportLabel":"Viewed","value":"2019-07-31T02:41:34.063Z"},{"type":"time","questionId":"startedTime","exportLabel":"Started","value":"2019-07-31T02:44:37.732Z"},{"type":"time","questionId":"completedTime","exportLabel":"Completed","value":"2019-07-31T02:44:57.936Z"},{"type":"time","questionId":"timeSpent","exportLabel":"Time spent (HH:MM:SS.SSS)","value":"00:00:00.000"},{"type":"location","questionId":"country_name","exportLabel":"Country","value":"Unknown"},{"type":"location","questionId":"city","exportLabel":"City","value":"Roslindale"},{"type":"location","questionId":"ip","exportLabel":"IP","value":""},{"type":"device","questionId":"os","exportLabel":"Operating System","value":"macOS 10.14"},{"type":"device","questionId":"browser","exportLabel":"Browser","value":"Firefox 68.0"},{"type":"device","questionId":"device","exportLabel":"Device","value":""},{"type":"id","questionId":"externalId","exportLabel":"External ID","value":""},{"type":"info","questionId":"warnings","exportLabel":"Warnings","value":[]},{"logicalType":"singleSelection","choiceId":"1542213646976","choiceLabel":"Give it a shot! Hit the arrow below! Don't be shy!","exportLabel":"It's all about the green arrow! (not that Green Arrow!) 1. Make your choice. 2. Hit the green arrow at the bottom! (Give it a shot! Hit the arrow below! Don't be shy!)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":false,"questionId":1542213646975,"questionOrder":0,"question":"It's all about the green arrow! (not that Green Arrow!) 1. Make your choice. 2. Hit the green arrow at the bottom!","subType":"singleSelection","answerTime":1564541080009},{"logicalType":"singleSelection","choiceId":"1542213646984","choiceLabel":"","exportLabel":"Make your choice. 2. 
Hit the green arrow at the bottom! ()","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":false,"questionId":1542213646983,"questionOrder":2,"question":"Make your choice. 2. Hit the green arrow at the bottom!","subType":"singleSelection","answerTime":1564541081044},{"logicalType":"singleSelection","choiceId":"1542213646994","choiceLabel":"T-Shirts","exportLabel":"What are you most interested in? (Pick one) (T-Shirts)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213646991,"questionOrder":4,"question":"What are you most interested in? (Pick one)","subType":"singleSelection","answerTime":1564541083354},{"logicalType":"multipleSelection","choiceId":"1542213647040","choiceLabel":"Band / Music","exportLabel":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) (Band / Music)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213647039,"questionOrder":12,"question":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!)","subType":"multipleSelection","answerTime":1564541086280},{"logicalType":"multipleSelection","choiceId":"1542213647041","choiceLabel":"Fun","exportLabel":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) (Fun)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213647039,"questionOrder":12,"question":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!)","subType":"multipleSelection","answerTime":1564541086280},{"logicalType":"multipleSelection","choiceId":"1542213647042","choiceLabel":"TV","exportLabel":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) 
(TV)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213647039,"questionOrder":12,"question":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!)","subType":"multipleSelection","answerTime":1564541086280},{"logicalType":"multipleSelection","choiceId":"1542213647043","choiceLabel":"Movies","exportLabel":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) (Movies)","value":true,"type":"mediaGallery","visualType":"mediaGallery","answerRequired":true,"questionId":1542213647039,"questionOrder":12,"question":"We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!)","subType":"multipleSelection","answerTime":1564541086280},{"type":"emailBox","visualType":"emailBox","answerRequired":false,"questionId":1542213647050,"questionOrder":14,"question":"Almost Done! Enter Your Email Address!","answerTime":1564541097466,"logicalType":"text","value":"123456789#yellow.com"}]}
I transform the string into the first df like so:
import json
import pandas as pd

with open('jsonfile') as json_file:
    data = json.load(json_file)  # parse the file into a dict
# pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
df = pd.json_normalize(data['data'])
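The idea is to filter rows by a condition with boolean indexing, then reshape using GroupBy.cumcount as a counter together with DataFrame.unstack; if you need the same column order as in the original, add DataFrame.reindex. (In the code that follows, data is the DataFrame called df in the question.)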
First part is:
df1 = data.loc[data['question'].isna(), ['exportLabel','value']]
print (df1)
exportLabel value
0 Participant ID -Ll4truw3KbSjVRtXmJy
1 Viewed 2019-07-31T02:41:34.063Z
2 Started 2019-07-31T02:44:37.732Z
3 Completed 2019-07-31T02:44:57.936Z
4 Time spent (HH:MM:SS.SSS) 00:00:00.000
5 Country Unknown
6 City Roslindale
7 IP
8 Operating System macOS 10.14
9 Browser Firefox 68.0
10 Device
11 External ID
12 Warnings []
df11 = (df1.set_index([df1.groupby('exportLabel').cumcount(),
'exportLabel'])['value']
.unstack()
.rename_axis(None, axis=1)
.reindex(df1['exportLabel'].unique(), axis=1)
)
print (df11)
Participant ID Viewed Started \
0 -Ll4truw3KbSjVRtXmJy 2019-07-31T02:41:34.063Z 2019-07-31T02:44:37.732Z
Completed Time spent (HH:MM:SS.SSS) Country City IP \
0 2019-07-31T02:44:57.936Z 00:00:00.000 Unknown Roslindale
Operating System Browser Device External ID Warnings
0 macOS 10.14 Firefox 68.0 []
And second:
# .copy() avoids a SettingWithCopyWarning on the assignment further down
df2 = data.loc[data['question'].notna(), ['question','value','choiceLabel']].copy()
# to replace all missing choiceLabel values with the value column:
#df2['choiceLabel'] = df2['choiceLabel'].fillna(df2['value'])
# to replace only the last value, if it is missing:
idx = df2.index[[-1]]
df2.loc[idx,'choiceLabel'] = df2.loc[idx,'choiceLabel'].fillna(df2.loc[idx,'value'])
print (df2)
question value \
13 It's all about the green arrow! (not that Gree... True
14 Make your choice. 2. Hit the green arrow at th... True
15 What are you most interested in? (Pick one) True
16 We have the threads that you want! What kind o... True
17 We have the threads that you want! What kind o... True
18 We have the threads that you want! What kind o... True
19 We have the threads that you want! What kind o... True
20 Almost Done! Enter Your Email Address! 123456789#yellow.com
choiceLabel
13 Give it a shot! Hit the arrow below! Don't be ...
14
15 T-Shirts
16 Band / Music
17 Fun
18 TV
19 Movies
20 123456789#yellow.com
df21 = (df2.set_index([df2.groupby('question').cumcount(),
'question'])['choiceLabel']
.unstack()
.rename_axis(None, axis=1)
.reindex(df2['question'].unique(), axis=1)
)
print (df21)
It's all about the green arrow! (not that Green Arrow!) 1. Make your choice. 2. Hit the green arrow at the bottom! \
0 Give it a shot! Hit the arrow below! Don't be ...
1 NaN
2 NaN
3 NaN
Make your choice. 2. Hit the green arrow at the bottom! \
0
1 NaN
2 NaN
3 NaN
What are you most interested in? (Pick one) \
0 T-Shirts
1 NaN
2 NaN
3 NaN
We have the threads that you want! What kind of tees live in your closet? (Pick one or more - we won't judge!) \
0 Band / Music
1 Fun
2 TV
3 Movies
Almost Done! Enter Your Email Address!
0 123456789#yellow.com
1 NaN
2 NaN
3 NaN
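The cumcount-plus-unstack reshape is easier to follow on a toy frame. The sketch below uses made-up question labels (Q1, Q2, Q3) instead of the survey text; everything else mirrors the steps above:

```python
import pandas as pd

# Toy stand-in for the filtered survey rows; Q2 has two answers.
long_df = pd.DataFrame({
    "question": ["Q1", "Q2", "Q2", "Q3"],
    "choiceLabel": ["A", "B", "C", "D"],
})

# Running counter within each question: 0 for the first answer, 1 for the second, ...
counter = long_df.groupby("question").cumcount()

wide = (
    long_df.set_index([counter, "question"])["choiceLabel"]
    .unstack()                                        # questions become columns
    .rename_axis(None, axis=1)                        # drop the 'question' axis label
    .reindex(long_df["question"].unique(), axis=1)    # restore original column order
)
print(wide)
```

Questions with several answers produce several rows, and the other columns are padded with NaN in those rows, which is the pattern visible in df21.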

Number of NaN values before first non NaN value Python dataframe

I have a dataframe with several columns, some of them contain NaN values. I would like for each row to create another column containing the total number of columns minus the number of NaN values before the first non NaN value.
Original dataframe:
ID Value0 Value1 Value2 Value3
1 10 10 8 15
2 NaN 45 52 NaN
3 NaN NaN NaN NaN
4 NaN NaN 100 150
The extra column would look like:
ID NewColumn
1 4
2 3
3 0
4 2
Thanks in advance!
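1. Set the index to ID so that only the value columns remain.
2. Append an always non-null helper column, so that argmax still finds a True on rows that are entirely NaN.
3. Use argmax on the notnull mask to find the position of the first non-null value in each row.
4. Subtract those positions from the number of value columns.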
df.assign(
NewColumn=
df.shape[1] - 1 -
df.set_index('ID').assign(notnull=1).notnull().values.argmax(1)
)
ID Value0 Value1 Value2 Value3 NewColumn
0 1 10.0 10.0 8.0 15.0 4
1 2 NaN 45.0 52.0 NaN 3
2 3 NaN NaN NaN NaN 0
3 4 NaN NaN 100.0 150.0 2
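As a cross-check, the same numbers can be obtained without the helper column. This sketch swaps in a different technique, NumPy's logical_or.accumulate, to count the leading NaNs directly; the frame is rebuilt from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Value0": [10, np.nan, np.nan, np.nan],
    "Value1": [10, 45, np.nan, np.nan],
    "Value2": [8, 52, np.nan, 100],
    "Value3": [15, np.nan, np.nan, 150],
})

values = df.set_index("ID")
not_null = values.notnull().to_numpy()

# seen[i, j] is True once row i has had a non-null value at or before column j.
seen = np.logical_or.accumulate(not_null, axis=1)

# Leading NaNs are the positions where nothing has been seen yet.
leading_nans = (~seen).sum(axis=1)

result = df.assign(NewColumn=values.shape[1] - leading_nans)
print(result)
```

An all-NaN row never flips to True, so all of its columns count as leading NaNs and NewColumn becomes 0 without any special casing.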
