condition after groupby: data science - python-3.x

i have a big df, this is a example to ilustrate my issue. I want to know from this dataframe whichs id by year_of_life are in the first percent in terms of jobs. I want to identify (i am thinking with a dummy) the one percent by years_of_life which has more jobs from the distribution.
for example
id year rap jobs_c jobs year_of_life rap_new
1 2009 0 300 10 NaN 0
2 2012 0 2012 12 0 0
3 2013 0 2012 12 1 1
4 2014 0 2012 13 2 1
5 2015 1 2012 15 3 1
6 2016 0 2012 17 4 0
7 2017 0 2012 19 5 0
8 2009 0 2009 15 0 1
9 2010 0 2009 2 1 1
10 2011 0 2009 3 2 1
11 2012 1 2009 3 3 0
12 2013 0 2009 15 4 0
13 2014 0 2009 12 5 0
14 2015 0 2009 13 6 0
15 2016 0 2009 13 7 0
16 2011 0 2009 3 2 1
17 2012 1 2009 3 3 0
18 2013 0 2009 18 4 0
19 2014 0 2009 12 5 0
20 2015 0 2009 13 6 0
.....
100 2009 0 2007 5 6 1
I want to identify (i am thinking with a dummy) the one percent by years_of_life which has more jobs from the distribution and then sum the jobs from those ids by year_of_life in the first percent
i try something like thi:
df.groupby(['year_of_life']).filter(lambda x : x.jobs>
x.jobs.quantile(.99))['jobs'].sum()
but i have the following error
TypeError: filter function returned a Series, but expected a scalar bool

Is this what you need ?
df.loc[df.groupby(['year_of_life']).jobs.apply(lambda x : x>x.quantile(.99)).fillna(True),'jobs'].sum()
Out[193]: 102

Related

How to insert a pandas series as a new column in DataFrame, matching with the indexes of df with series of different length

I have a dataframe with multiple columns and 700+ rows and a series of 27 rows. I want to create a new column i.e. series in dataframe as per matching indexes with predefined column in df
data frame I have and need to add series which contains the same indexes of "Reason for absence"
ID Reason for absence Month of absence Day of the week Seasons
0 11 26 7 3 1
1 36 0 7 3 1
2 3 23 7 4 1
3 7 7 7 5 1
4 11 23 7 5 1
5 3 23 7 6 1
6 10 22 7 6 1
7 20 23 7 6 1
8 14 19 7 2 1
9 1 22 7 2 1
10 20 1 7 2 1
11 20 1 7 3 1
12 20 11 7 4 1
13 3 11 7 4 1
14 3 23 7 4 1
15 24 14 7 6 1
16 3 23 7 6 1
17 3 21 7 2 1
18 6 11 7 5 1
19 33 23 8 4 1
20 18 10 8 4 1
21 3 11 8 2 1
22 10 13 8 2 1
23 20 28 8 6 1
24 11 18 8 2 1
25 10 25 8 2 1
26 11 23 8 3 1
27 30 28 8 4 1
28 11 18 8 4 1
29 3 23 8 6 1
30 3 18 8 2 1
31 2 18 8 5 1
32 1 23 8 5 1
33 2 18 8 2 1
34 3 23 8 2 1
35 10 23 8 2 1
36 11 24 8 3 1
37 19 11 8 5 1
38 2 28 8 6 1
39 20 23 8 6 1
40 27 23 9 3 1
41 34 23 9 2 1
42 3 23 9 3 1
43 5 19 9 3 1
44 14 23 9 4 1
this is series table s_conditions
0 Not absent
1 Infectious and parasitic diseases
2 Neoplasms
3 Diseases of the blood
4 Endocrine, nutritional and metabolic diseases
5 Mental and behavioural disorders
6 Diseases of the nervous system
7 Diseases of the eye
8 Diseases of the ear
9 Diseases of the circulatory system
10 Diseases of the respiratory system
11 Diseases of the digestive system
12 Diseases of the skin
13 Diseases of the musculoskeletal system
14 Diseases of the genitourinary system
15 Pregnancy and childbirth
16 Conditions from perinatal period
17 Congenital malformations
18 Symptoms not elsewhere classified
19 Injury
20 External causes
21 Factors influencing health status
22 Patient follow-up
23 Medical consultation
24 Blood donation
25 Laboratory examination
26 Unjustified absence
27 Physiotherapy
28 Dental consultation
dtype: object
I tried this
df1.insert(loc=0, column="Reason_for_absence", value=s_conditons)
out- this is wrong because i need the reason_for_absence colum according to the index of reason for absence and s_conditions
Reason_for_absence ID Reason for absence \
0 Not absent 11 26
1 Infectious and parasitic diseases 36 0
2 Neoplasms 3 23
3 Diseases of the blood 7 7
4 Endocrine, nutritional and metabolic diseases 11 23
5 Mental and behavioural disorders 3 23
6 Diseases of the nervous system 10 22
7 Diseases of the eye 20 23
8 Diseases of the ear 14 19
9 Diseases of the circulatory system 1 22
10 Diseases of the respiratory system 20 1
11 Diseases of the digestive system 20 1
12 Diseases of the skin 20 11
13 Diseases of the musculoskeletal system 3 11
14 Diseases of the genitourinary system 3 23
15 Pregnancy and childbirth 24 14
16 Conditions from perinatal period 3 23
17 Congenital malformations 3 21
18 Symptoms not elsewhere classified 6 11
19 Injury 33 23
20 External causes 18 10
21 Factors influencing health status 3 11
22 Patient follow-up 10 13
23 Medical consultation 20 28
24 Blood donation 11 18
25 Laboratory examination 10 25
26 Unjustified absence 11 23
27 Physiotherapy 30 28
28 Dental consultation 11 18
29 NaN 3 23
30 NaN 3 18
31 NaN 2 18
32 NaN 1 23
i am getting output upto 28 rows and NaN values after that. Instead, I need correct order of series according to indexes for all the rows
While this question is a bit confusing, it seems the desire is to match the series index with the dataframe "Reason for Absence" column. If this is correct, below is a small example of how to accomplish. Keep in mind, the resulting dataframe will be sorted based on the 'Reason for Absence Numerical' column. If my understanding is incorrect, please clarify this question so we can better assist you.
d = {'ID': [11,36,3], 'Reason for Absence Numerical': [3,2,1], 'Day of the Week': [4,2,6]}
dataframe = pd.DataFrame(data=d)
s = {0: 'Not absent', 1:'Neoplasms', 2:'Injury', 3:'Diseases of the eye'}
disease_series = pd.Series(data=s)
def add_series_to_df(df, series, index_val):
df_filtered = df[df['Reason for Absence Numerical'] == index_val].copy()
series_filtered = series[series.index == index_val]
if not df_filtered.empty:
df_filtered['Reason for Absence Text'] = series_filtered.item()
return df_filtered
x = [add_series_to_df(dataframe, disease_series, index_val) for index_val in range(len(disease_series.index))]
new_df = pd.concat(x)
print(new_df)

Pyspark Multiple Filter Dataframe [duplicate]

This question already has answers here:
Multiple condition filter on dataframe
(2 answers)
Closed 2 years ago.
My input spark dataframe is;
Year Month Client
2018 1 1
2018 2 1
2018 3 1
2018 4 1
2018 5 1
2018 6 1
2018 7 1
2018 8 1
2018 9 1
2018 10 1
2018 11 1
2018 12 1
2019 1 1
2019 2 1
2019 3 1
2019 4 1
2019 5 1
2019 6 1
2019 7 1
2019 8 1
2019 9 1
2019 10 1
2019 11 1
2019 12 1
2018 1 2
2018 2 2
2018 3 2
2018 4 2
2018 5 2
2018 6 2
2018 7 2
2018 8 2
2018 9 2
2018 10 2
2018 11 2
2018 12 2
2019 1 2
2019 2 2
2019 3 2
2019 4 2
2019 5 2
2019 6 2
2019 7 2
2019 8 2
2019 9 2
2019 10 2
2019 11 2
2019 12 2
Dataframe is ordered by client, year and month. I want to extract the data after 2019-06 for each clients.
I shared the desired output according to the data above;
Year Month Client
2018 1 1
2018 2 1
2018 3 1
2018 4 1
2018 5 1
2018 6 1
2018 7 1
2018 8 1
2018 9 1
2018 10 1
2018 11 1
2018 12 1
2019 1 1
2019 2 1
2019 3 1
2019 4 1
2019 5 1
2019 6 1
2018 1 2
2018 2 2
2018 3 2
2018 4 2
2018 5 2
2018 6 2
2018 7 2
2018 8 2
2018 9 2
2018 10 2
2018 11 2
2018 12 2
2019 1 2
2019 2 2
2019 3 2
2019 4 2
2019 5 2
2019 6 2
Could you please help me about this?
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Did you mean before 2019-06? (you wrote after 2019-06)
If so, you can do a filter:
df2 = df.filter('Year < 2019 or (Year = 2019 and Month <= 6)')

want to assign id to duplicate rows

id name age year
0 khu 12 2018
1 she 21 2019
2 waqar 22 2015
3 khu 12 2018
4 she 21 2018
5 waqar 22 2015
want like this
id name age year
0 khu 12 2018
1 she 21 2019
2 waqar 22 2015
0 khu 12 2018
1 she 21 2018
2 waqar 22 2015
Use GroupBy.ngroup:
df['id'] = df.groupby('name', sort=False).ngroup()
#if need grouping by multiple columns for check duplicates
#df['id'] = df.groupby(['name','age'], sort=False).ngroup()
print (df)
id name age year
0 0 khu 12 2018
1 1 she 21 2019
2 2 waqar 22 2015
3 0 khu 12 2018
4 1 she 21 2018
5 2 waqar 22 2015
Using factorize as well you can check with category and cat.codes, or sklearn LabelEncoder
df['id']=pd.factorize(df['name'])[0]
df
Out[470]:
id name age year
0 0 khu 12 2018
1 1 she 21 2019
2 2 waqar 22 2015
3 0 khu 12 2018
4 1 she 21 2018
5 2 waqar 22 2015

TypeError: strptime() argument 1 must be str, not float

I'm having parsing errors on my code, below is the code and almost understandable dataset
import numpy as np
import pandas as pd
from datetime import datetime as dt
data0 = pd.read_csv('2009-10.csv')
data1 = pd.read_csv('2010-11.csv')
def parse_date(date):
if date == '':
return None
else:
return dt.strptime(date, '%d/%m/%y').date()
data0.Date = data0.Date.apply(parse_date)
data1.Date = data1.Date.apply(parse_date)
TypeError: strptime() argument 1 must be str, not float
Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR Referee HS AS HST AST HF AF HC AC HY AY HR AR B365H B365D B365A
15/08/09 Aston Villa Wigan 0 2 A 0 1 A M Clattenburg 11 14 5 7 15 14 4 6 2 2 0 0 1.67 3.6 5.5
15/08/09 Blackburn Man City 0 2 A 0 1 A M Dean 17 8 9 5 12 9 5 4 2 1 0 0 3.6 3.25 2.1
15/08/09 Bolton Sunderland 0 1 A 0 1 A A Marriner 11 20 3 13 16 10 4 7 2 1 0 0 2.25 3.25 3.25
15/08/09 Chelsea Hull 2 1 H 1 1 D A Wiley 26 7 12 3 13 15 12 4 1 2 0 0 1.17 6.5 21
15/08/09 Everton Arsenal 1 6 A 0 3 A M Halsey 8 15 5 9 11 13 4 9 0 0 0 0 3.2 3.25 2.3
Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR Referee HS AS HST AST HF AF HC AC HY AY HR AR B365H B365D B365A
14/08/10 Aston Villa West Ham 3 0 H 2 0 H M Dean 23 12 11 2 15 15 16 7 1 2 0 0 2 3.3 4
14/08/10 Blackburn Everton 1 0 H 1 0 H P Dowd 7 17 2 12 19 14 1 3 2 1 0 0 2.88 3.25 2.5
14/08/10 Bolton Fulham 0 0 D 0 0 D S Attwell 13 12 9 7 12 13 4 8 1 3 0 0 2.2 3.3 3.4
14/08/10 Chelsea West Brom 6 0 H 2 0 H M Clattenburg 18 10 13 4 10 10 3 1 1 0 0 0 1.17 7 17
14/08/10 Sunderland Birmingham 2 2 D 1 0 H A Taylor 6 13 2 7 13 10 3 6 3 3 1 0 2.1 3.3 3.6
14/08/10 Tottenham Man City 0 0 D 0 0 D A Marriner 22 11 18 7 13 16 10 3 0 2 0 0 2.4 3.3 3
IIUC, I think you are converting strings into datetime dtypes.
You can use Pandas to_datetime:
data0['Date'] = pd.to_datetime(data0['Date'], format='%d/%m/%y')
data1['Date'] = pd.to_datetime(data1['Date'], format='%d/%m/%y')

Sum the values from the days in a specific week in Excel

So I have some rows of data and some columns with dates.
As you can see on the image below.
I want the sum of the week for each row - but the tricky thing is that not every week is 5 days, so there might be weeks with 3 days. So somehow, I want to try to go for the weeknumber and then sum it.
Can anyone help with me a formular (or a VBA macro)?
I am completely lost after trying several approaches.
18-May-15 19-May-15 20-May-15 21-May-15 22-May-15 25-May-15 26-May-15 27-May-15 28-May-15 29-May-15 1-Jun-15 2-Jun-15 3-Jun-15 4-Jun-15 WEEK 1 TOTAL WEEK 2 TOTAL
33 15 10 19 18 8 10 15 10 29 16 24 8 26 74
18 11 8 17 0 6 16 9 16 16 36 9 6 4 55
0 0 1 0 0 1 0 0 1 0 0 3 3 2 8
30 7 4 8 8 11 10 3 0 11 3 4 5 6 18
0 0 0 11 0 0 0 1 0 7 8 1 1 2 12
1 1 4 0 5 1 6 2 1 4 2 4 5 4 15
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
52 27 22 36 23 15 32 26 27 49 54 37 19 34 144
30 50 25 21 34 12 33 32 26 43 54 43 18 32 147
0 0 1 0 3 0 0 0 0 0 0 0 0 0 0
29 5 3 4 4 1 1 2 4 4 3 4 2 3 12
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 4 1 10 9 0 0 0 0 0 1 1 2
1 2 0 0 0 0 0 1 3 0 0 0 2 2 4
15 29 5 17 16 4 18 20 12 28 25 22 4 23 74
11 15 11 3 15 7 11 9 5 12 18 10 5 7 40
1 0 2 1 1 0 0 1 8 1 4 3 2 0 9
3 6 7 0 2 1 4 2 1 2 7 8 7 2 24
21 21 21 21 21 22 22 22 22 22 23 23 23 23
Using SUMIF is one way. But you need to get your references straight in order to make it easy to enter.
Note in the diagram below, the formula:
=SUMIF(Weeknums,M$1,$B2:$K2)
where weeknums is the row of calculated Week Numbers.
Also note that the column headers showing the Week number to be summed could be made more explanatory with custom formatting:
I know you've already accepted an answer but just to show you:
If you transposed your data you would then be able to utilise the pivot tables
You could set up a calculated field to calculate exactly what you wanted (and depending on how you sorted/grouped the date you could sort this by weeks, months, quarters or even years
You would then get all of your final values displayed in an easy to read format grouped by whatever you want. In my opinion this is a lot more powerful solution for the long run.

Resources