Efficient way to populate missing indexes from pandas group by

I grouped a column in a pandas dataframe by the number of occurrences of an event per hour of the day like so:
df_sep.hour.groupby(df_sep.time.dt.hour).size()
Which gives the following result:
time
2 31
3 6
4 7
5 4
6 38
7 9
8 5
9 31
10 8
11 2
12 5
13 30
14 1
15 1
16 28
18 1
20 4
21 29
Name: hour, dtype: int64
For plotting, I would like to complete the series so that it covers every hour of the day. For example, there are no occurrences at midnight (hour 0). For every missing hour, I would like to create that index entry with a value of zero.
To solve this I created two lists (x and y) using the following loop, but it feels a bit hacky. Is there a better way to solve this?
x = []
y = []
for i in range(24):
    if i not in df_sep.hour.groupby(df_sep.time.dt.hour).size().index:
        x.append(i)
        y.append(0)
    else:
        x.append(i)
        y.append(df_sep.hour.groupby(df_sep.time.dt.hour).size().loc[i])
Result:
for i, j in zip(x, y):
    print(i, j)
0 0
1 0
2 31
3 6
4 7
5 4
6 38
7 9
8 5
9 31
10 8
11 2
12 5
13 30
14 1
15 1
16 28
17 0
18 1
19 0
20 4
21 29
22 0
23 0

Use Series.reindex with range(24):
df_sep.hour.groupby(df_sep.time.dt.hour).size().reindex(range(24), fill_value=0)
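A minimal, self-contained sketch of the reindex approach. The timestamps are made up for illustration, and `df_demo` and `counts` are hypothetical stand-ins for `df_sep` and its grouped result; the sketch also counts the `time` column itself rather than a separate `hour` column.

```python
import pandas as pd

# Hypothetical sample data: event timestamps covering only a few hours of the day.
df_demo = pd.DataFrame({
    "time": pd.to_datetime([
        "2021-09-01 02:15", "2021-09-01 02:40", "2021-09-01 06:05",
        "2021-09-02 06:30", "2021-09-02 21:10",
    ])
})

# Count events per hour; hours with no events are absent from the index.
counts = df_demo["time"].groupby(df_demo["time"].dt.hour).size()

# Reindex against all 24 hours, filling the missing hours with zero.
counts = counts.reindex(range(24), fill_value=0)

print(counts)
```

The result is still a Series with one entry per hour, so it can be plotted directly without building separate x and y lists.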

Related

How to insert a pandas series as a new column in DataFrame, matching with the indexes of df with series of different length

I have a dataframe with multiple columns and 700+ rows, and a series of 27 rows. I want to add the series to the dataframe as a new column, matching the series index against the values of an existing column in the dataframe.
This is the dataframe; the new column should hold the series entry whose index equals the value in "Reason for absence":
ID Reason for absence Month of absence Day of the week Seasons
0 11 26 7 3 1
1 36 0 7 3 1
2 3 23 7 4 1
3 7 7 7 5 1
4 11 23 7 5 1
5 3 23 7 6 1
6 10 22 7 6 1
7 20 23 7 6 1
8 14 19 7 2 1
9 1 22 7 2 1
10 20 1 7 2 1
11 20 1 7 3 1
12 20 11 7 4 1
13 3 11 7 4 1
14 3 23 7 4 1
15 24 14 7 6 1
16 3 23 7 6 1
17 3 21 7 2 1
18 6 11 7 5 1
19 33 23 8 4 1
20 18 10 8 4 1
21 3 11 8 2 1
22 10 13 8 2 1
23 20 28 8 6 1
24 11 18 8 2 1
25 10 25 8 2 1
26 11 23 8 3 1
27 30 28 8 4 1
28 11 18 8 4 1
29 3 23 8 6 1
30 3 18 8 2 1
31 2 18 8 5 1
32 1 23 8 5 1
33 2 18 8 2 1
34 3 23 8 2 1
35 10 23 8 2 1
36 11 24 8 3 1
37 19 11 8 5 1
38 2 28 8 6 1
39 20 23 8 6 1
40 27 23 9 3 1
41 34 23 9 2 1
42 3 23 9 3 1
43 5 19 9 3 1
44 14 23 9 4 1
This is the series, s_conditions:
0 Not absent
1 Infectious and parasitic diseases
2 Neoplasms
3 Diseases of the blood
4 Endocrine, nutritional and metabolic diseases
5 Mental and behavioural disorders
6 Diseases of the nervous system
7 Diseases of the eye
8 Diseases of the ear
9 Diseases of the circulatory system
10 Diseases of the respiratory system
11 Diseases of the digestive system
12 Diseases of the skin
13 Diseases of the musculoskeletal system
14 Diseases of the genitourinary system
15 Pregnancy and childbirth
16 Conditions from perinatal period
17 Congenital malformations
18 Symptoms not elsewhere classified
19 Injury
20 External causes
21 Factors influencing health status
22 Patient follow-up
23 Medical consultation
24 Blood donation
25 Laboratory examination
26 Unjustified absence
27 Physiotherapy
28 Dental consultation
dtype: object
I tried this:
df1.insert(loc=0, column="Reason_for_absence", value=s_conditions)
Output (this is wrong, because I need the Reason_for_absence column to follow the values of "Reason for absence" looked up in s_conditions, not the row index):
Reason_for_absence ID Reason for absence \
0 Not absent 11 26
1 Infectious and parasitic diseases 36 0
2 Neoplasms 3 23
3 Diseases of the blood 7 7
4 Endocrine, nutritional and metabolic diseases 11 23
5 Mental and behavioural disorders 3 23
6 Diseases of the nervous system 10 22
7 Diseases of the eye 20 23
8 Diseases of the ear 14 19
9 Diseases of the circulatory system 1 22
10 Diseases of the respiratory system 20 1
11 Diseases of the digestive system 20 1
12 Diseases of the skin 20 11
13 Diseases of the musculoskeletal system 3 11
14 Diseases of the genitourinary system 3 23
15 Pregnancy and childbirth 24 14
16 Conditions from perinatal period 3 23
17 Congenital malformations 3 21
18 Symptoms not elsewhere classified 6 11
19 Injury 33 23
20 External causes 18 10
21 Factors influencing health status 3 11
22 Patient follow-up 10 13
23 Medical consultation 20 28
24 Blood donation 11 18
25 Laboratory examination 10 25
26 Unjustified absence 11 23
27 Physiotherapy 30 28
28 Dental consultation 11 18
29 NaN 3 23
30 NaN 3 18
31 NaN 2 18
32 NaN 1 23
I get values only up to row 28, and NaN after that. Instead, I need the series values placed according to the "Reason for absence" codes, for all rows.

While this question is a bit confusing, it seems the goal is to match the series index with the dataframe's "Reason for Absence" column. If that is correct, below is a small example of how to accomplish it. Keep in mind that the resulting dataframe will be sorted by the 'Reason for Absence Numerical' column. If my understanding is incorrect, please clarify the question so we can better assist you.
import pandas as pd

d = {'ID': [11,36,3], 'Reason for Absence Numerical': [3,2,1], 'Day of the Week': [4,2,6]}
dataframe = pd.DataFrame(data=d)
s = {0: 'Not absent', 1:'Neoplasms', 2:'Injury', 3:'Diseases of the eye'}
disease_series = pd.Series(data=s)

def add_series_to_df(df, series, index_val):
    # Rows whose numeric reason equals this series index value
    df_filtered = df[df['Reason for Absence Numerical'] == index_val].copy()
    series_filtered = series[series.index == index_val]
    if not df_filtered.empty:
        df_filtered['Reason for Absence Text'] = series_filtered.item()
    return df_filtered

# One filtered frame per series index value, concatenated in index order
x = [add_series_to_df(dataframe, disease_series, index_val) for index_val in range(len(disease_series.index))]
new_df = pd.concat(x)
print(new_df)
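As an alternative that is not part of the answer above, the same lookup can probably be done in one step with Series.map, which also keeps the original row order. This is a sketch using the same toy data, on the assumption that every numeric code appears in the series index (codes that do not would come out as NaN).

```python
import pandas as pd

# Same toy data as in the answer above.
dataframe = pd.DataFrame({
    "ID": [11, 36, 3],
    "Reason for Absence Numerical": [3, 2, 1],
    "Day of the Week": [4, 2, 6],
})
disease_series = pd.Series(
    {0: "Not absent", 1: "Neoplasms", 2: "Injury", 3: "Diseases of the eye"}
)

# Look up each numeric code in the series index; row order is unchanged.
dataframe["Reason for Absence Text"] = (
    dataframe["Reason for Absence Numerical"].map(disease_series)
)

print(dataframe)
```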

resampling a pandas dataframe and filling new rows with zero

I have a time series as a dataframe. The first column is the week number, the second are values for that week. The first week (22) and the last week (48), are the lower and upper bounds of the time series. Some weeks are missing, for example, there is no week 27 and 28. I would like to resample this series such that there are no missing weeks. Where a week was inserted, I would like the corresponding value to be zero. This is my data:
week value
0 22 1
1 23 2
2 24 2
3 25 3
4 26 2
5 29 3
6 30 3
7 31 3
8 32 7
9 33 4
10 34 5
11 35 4
12 36 2
13 37 3
14 38 10
15 39 5
16 40 7
17 41 10
18 42 11
19 43 15
20 44 9
21 45 13
22 46 5
23 47 6
24 48 2
I am wondering if this can be achieved in Pandas without creating a loop from scratch. I have looked into pd.resample, but can't achieve the results I am looking for.

I would set week as the index, then reindex with the fill_value option:
import numpy as np
start, end = df['week'].agg(['min','max'])
df.set_index('week').reindex(np.arange(start, end+1), fill_value=0).reset_index()
Output (head):
week value
0 22 1
1 23 2
2 24 2
3 25 3
4 26 2
5 27 0
6 28 0
7 29 3
8 30 3
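For reference, a self-contained sketch of the same idea on a shortened, hypothetical version of the data (only weeks 25 to 30, with 27 and 28 missing). The name `filled` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical shortened data: weeks 27 and 28 are missing.
df = pd.DataFrame({"week": [25, 26, 29, 30], "value": [3, 2, 3, 3]})

start, end = df["week"].min(), df["week"].max()

# Use week as the index, reindex over the full range, then restore the column.
filled = (
    df.set_index("week")
      .reindex(np.arange(start, end + 1), fill_value=0)
      .reset_index()
)

print(filled)
```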

How to swap two rows of a Pandas DataFrame?

Suppose I have this dataframe :
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
I want to swap the position of row 1 and 2.
Is there a native Pandas function that can do this?
Thanks!

Use rename with a custom dict, followed by sort_index:
d = {1: 2, 2: 1}
df_final = df.rename(d).sort_index()
Out[27]:
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 5 6 7 8 9
3 15 16 17 18 19
4 20 21 22 23 24
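A runnable sketch of the rename approach, building the frame from the question. It assumes the index is already sorted and unique; with an unsorted index, sort_index would reorder the other rows as well.

```python
import numpy as np
import pandas as pd

# The 5x5 frame from the question.
df = pd.DataFrame(np.arange(25).reshape(5, -1))

# Relabel row 1 as 2 and row 2 as 1, then sort by label so the rows trade places.
df_final = df.rename({1: 2, 2: 1}).sort_index()

print(df_final)
```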

As far as I am aware, there is no native pandas function for this.
But here is a custom function:
import numpy as np
import pandas as pd

# Input
df = pd.DataFrame(np.arange(25).reshape(5, -1))

# Swap rows i1 and i2 (by position) in place and return the frame
def swap_rows(df, i1, i2):
    a, b = df.iloc[i1, :].copy(), df.iloc[i2, :].copy()
    df.iloc[i1, :], df.iloc[i2, :] = b, a
    return df

print(swap_rows(df, 1, 2))
Output:
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 5 6 7 8 9
3 15 16 17 18 19
4 20 21 22 23 24
Cheers!

Try numpy flip:
df.iloc[1:3] = np.flip(df.to_numpy()[1:3], axis=0)
df
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 5 6 7 8 9
3 15 16 17 18 19
4 20 21 22 23 24

df1 = df.copy()
df1.iloc[1, :], df1.iloc[2, :] = df.iloc[2, :], df.iloc[1, :]
df1

Sum two dataframes for equal entries

I have two dataframes with same entries in column A, but different entries in columns B and C.
One dataframe has multiple entries for one entry in A.
df1
A B C
0 this 3 4
1 is 4 6
2 an 7 9
3 example 12 20
df2
A B C
0 this 11 11
1 this 5 9
2 this 18 7
3 is 12 14
4 an 1 4
5 an 8 12
6 example 3 17
7 example 9 5
8 example 19 6
9 example 7 1
I want to sum the two dataframes for same entries in column A. The result should look like this:
df3
A B C
0 this 14 15
1 this 8 13
2 this 21 11
3 is 16 20
4 an 8 13
5 an 15 21
6 example 15 37
7 example 21 25
8 example 31 26
9 example 19 21
How can I calculate this in a fast way in pandas?

Use DataFrame.merge to left merge the dataframe df2 with df1 on column A then add the columns B, C of df2 to the columns B, C of df3:
df3 = df2[['A']].merge(df1, on='A', how='left')
df3[['B', 'C']] += df2[['B', 'C']]
Result:
print(df3)
A B C
0 this 14 15
1 this 8 13
2 this 21 11
3 is 16 20
4 an 8 13
5 an 15 21
6 example 15 37
7 example 21 25
8 example 31 26
9 example 19 21
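A self-contained version of the merge approach, rebuilding df1 and df2 from the question. It relies on the left merge keeping the row order of df2, so that the in-place addition lines up row by row on the default index.

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": ["this", "is", "an", "example"],
    "B": [3, 4, 7, 12],
    "C": [4, 6, 9, 20],
})
df2 = pd.DataFrame({
    "A": ["this", "this", "this", "is", "an", "an",
          "example", "example", "example", "example"],
    "B": [11, 5, 18, 12, 1, 8, 3, 9, 19, 7],
    "C": [11, 9, 7, 14, 4, 12, 17, 5, 6, 1],
})

# Repeat each df1 row once per matching row of df2, in df2's order.
df3 = df2[["A"]].merge(df1, on="A", how="left")

# Add df2's values on top; both frames share the same 0..9 index.
df3[["B", "C"]] += df2[["B", "C"]]

print(df3)
```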
Or, if the row order is not important, another possible idea:
df3 = df2.set_index('A').add(df1.set_index('A')).reset_index()
print(df3)
A B C
0 an 8 13
1 an 15 21
2 example 15 37
3 example 21 25
4 example 31 26
5 example 19 21
6 is 16 20
7 this 14 15
8 this 8 13
9 this 21 11

Getting a number of quarter from numeric week number and the week number within the quarter in python?

I have a list of numbers from 1 to 53. Using these numeric week numbers, I am trying to calculate 1) the quarter a week falls in and 2) the number of that week within the quarter (53 needs to be qtr 4 wk 14; 27 needs to be qtr 3 wk 1). I got this working in Excel, but not in Python. Any thoughts?
I tried the following, but with each attempt I have an issue with weeks like 13 or 27, depending on the method I'm using.
13 should be qtr 1; 27 should be qtr 3.
df['qtr1'] = df['wk']//13
df['qtr2']=(np.maximum((df['wk']-1),1)/13)+1
df['qtr3']=((df1['wk']-1)//13)
df['qtr4'] = df['qtr2'].astype(int)
The results are awkward:
wk qtr qtr2 qtr3 qtr4
1.0 0 1.076923 -1.0 1
13.0 1(wrong) 1.923077 0.0 1
14.0 1 2.000000 1.0 2
27.0 2 3.000000 1.0 2 (wrong)
28.0 2 3.076923 2.0 3

You can convert your weeks to integers by using astype:
df['wk'] = df['wk'].astype(int)
Subtract one from the week number before dividing, like:
df['qtr'] = ((df['wk']-1) // 13) + 1
df['weekinqtr'] = (df['wk']-1) % 13 + 1
since 13//13 is 1, not 0, so without the subtraction week 13 would land in the second quarter. This gives us:
>>> df
wk qtr weekinqtr
0 1 1 1
1 13 1 13
2 14 2 1
3 26 2 13
4 27 3 1
5 28 3 2
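The question also asks for week 53 to count as quarter 4, week 14, whereas the formula above, applied as written, would put week 53 in a fifth quarter. One possible way to handle that, sketched here as an assumption rather than as part of the answer, is to cap the zero-based quarter at 3. The helper name `q0` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"wk": [1, 13, 14, 26, 27, 52, 53]})

# Zero-based quarter, capped at 3 so that week 53 stays in the fourth quarter.
q0 = ((df["wk"] - 1) // 13).clip(upper=3)

df["qtr"] = q0 + 1
# Week within the quarter: weeks elapsed since the start of that quarter.
df["weekinqtr"] = df["wk"] - q0 * 13

print(df)
```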
If you want an extra column per quarter, you can use pd.get_dummies(..) to obtain a one-hot encoding of the quarter (in pandas 2.0 and later the dummy columns are boolean unless you pass dtype=int):
>>> df.join(pd.get_dummies(df['qtr'], prefix='qtr'))
wk qtr weekinqtr qtr_1 qtr_2 qtr_3
0 1 1 1 1 0 0
1 13 1 13 1 0 0
2 14 2 1 0 1 0
3 26 2 13 0 1 0
4 27 3 1 0 0 1
5 28 3 2 0 0 1

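Floor division // and modulo % get you most of the way, I think. Note that in the output below weeks 13, 26 and 39 are assigned to the following quarter (as week 13); subtracting 1 from the week before dividing avoids that.
In [254]: df = pd.DataFrame({'week':range(1, 53)})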
In [255]: df['qtr'] = (df['week'] // 13) + 1
In [256]: df['qtr_week'] = df['week'] % 13
In [257]: df.loc[(df['qtr_week'] ==0),'qtr_week']=13
In [258]: df
Out[258]:
week qtr qtr_week
0 1 1 1
1 2 1 2
2 3 1 3
3 4 1 4
4 5 1 5
5 6 1 6
6 7 1 7
7 8 1 8
8 9 1 9
9 10 1 10
10 11 1 11
11 12 1 12
12 13 2 13
13 14 2 1
14 15 2 2
15 16 2 3
16 17 2 4
17 18 2 5
18 19 2 6
19 20 2 7
20 21 2 8
21 22 2 9
22 23 2 10
23 24 2 11
24 25 2 12
25 26 3 13
26 27 3 1
27 28 3 2
28 29 3 3
29 30 3 4
30 31 3 5
31 32 3 6
32 33 3 7
33 34 3 8
34 35 3 9
35 36 3 10
36 37 3 11
37 38 3 12
38 39 4 13
39 40 4 1
40 41 4 2
41 42 4 3
42 43 4 4
43 44 4 5
44 45 4 6
45 46 4 7
46 47 4 8
47 48 4 9
48 49 4 10
49 50 4 11
50 51 4 12
