How to find again the index after pivoting dataframe? - python-3.x

I created a dataframe from a CSV file containing the number of deaths by year (running from 1946 to 2021) and by month within each year:
dataD = pd.read_csv('MY_FILE.csv', sep=',')
The first rows of the output (out of 902) are:
dataD
Year Month Deaths
0 2021 2 55500
1 2021 1 65400
2 2020 12 62800
3 2020 11 64700
4 2020 10 56900
As expected, the dataframe has a default index numbered 0, 1, 2, and so on.
Now I pivot this dataframe to get one row per year with the months as columns, using the following code:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths')
The first rows of the result are now:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
My question is:
What do I have to change in the pivoting code above to get the index 0, 1, 2, etc. back when I output the pivoted dataframe? I think I need to specify index=*** for the pivot call to run, but afterwards I would like to recover a "usual" index, exactly like in my first dataframe dataD.
Is that possible?

You can call reset_index() after pivoting:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths').reset_index()
This would give you the following:
Month Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
Note that the "Month" here might look like the index name but is actually the name of the column axis (dataDW.columns.name). You can unset it if preferred:
dataDW.columns.name = None
Which then gives you:
Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
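The two steps can also be chained. Below is a minimal sketch on a small made-up sample (the numbers are illustrative, not taken from the original CSV); `rename_axis(columns=None)` is used here as an alternative to assigning `columns.name` directly.

```python
import pandas as pd

# Small illustrative sample; the values are invented for this sketch.
dataD = pd.DataFrame({
    "Year":   [2021, 2021, 2020, 2020],
    "Month":  [2, 1, 2, 1],
    "Deaths": [55500, 65400, 50000, 60000],
})

dataDW = (
    dataD.pivot(index="Year", columns="Month", values="Deaths")
         .reset_index()              # Year becomes a regular column; the index is 0, 1, 2, ...
         .rename_axis(columns=None)  # drop the "Month" label on the column axis
)
print(dataDW)
```

After this, `dataDW.index` is a default integer index and `Year` can be used like any other column.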

Related

Calculate mean value by interval coordinates in pandas

I have a dataframe such as :
Name Position Value
A 1 10
A 2 11
A 3 10
A 4 8
A 5 6
A 6 12
A 7 10
A 8 9
A 9 9
A 10 9
A 11 9
A 12 9
and I would like to calculate the mean of Value for each interval of 3 positions,
and create a new df with the start and end coordinates of each interval (each of length 3) and a Mean_value column.
Name Start End Mean_value
A 1 3 10.33 <---- here this is (10+11+10)/3 = 10.33
A 4 6 8.7
A 7 9 9.3
A 10 13 9
Does someone have an idea using pandas please ?
To take each run of 3 rows (where they exist) within each Name group, first build a counter with GroupBy.cumcount and integer division, then pass it to groupby together with named aggregations:
g = df.groupby('Name').cumcount() // 3
df = df.groupby(['Name',g]).agg(Start=('Position','first'),
                                End=('Position','last'),
                                Value=('Value','mean')).droplevel(1).reset_index()
print (df)
Name Start End Value
0 A 1 3 10.333333
1 A 4 6 8.666667
2 A 7 9 9.333333
3 A 10 12 9.000000
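To see why this works, it may help to look at the intermediate grouping key. The sketch below rebuilds the sample frame from the question and prints the chunk id that `cumcount() // 3` produces; the names `chunk` and `out` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A"] * 12,
    "Position": range(1, 13),
    "Value": [10, 11, 10, 8, 6, 12, 10, 9, 9, 9, 9, 9],
})

# cumcount() numbers the rows 0, 1, 2, ... within each Name;
# integer division by 3 turns that into a chunk id: 0, 0, 0, 1, 1, 1, ...
chunk = df.groupby("Name").cumcount() // 3
print(chunk.tolist())

out = (
    df.groupby(["Name", chunk])
      .agg(Start=("Position", "first"),
           End=("Position", "last"),
           Mean_value=("Value", "mean"))
      .droplevel(1)      # drop the unnamed chunk level
      .reset_index()     # turn Name back into a column
)
print(out)
```

Note that the chunks are formed by row order within each Name, not by the Position values themselves, so this assumes the rows are sorted by Position and have no gaps.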

For and if loop combination takes lot of time in Pandas (Data manipulation)

I have two datasets, each with about half a million observations. I wrote the code below, and it never seems to finish executing. I would like to know if there is a better way of doing it. Appreciate inputs.
Below are sample formats of my dataframes. Both dataframes share a set of 'sid' values , meaning all the 'sid' values in 'df2' will have a match in 'df1' 'sid' values. The 'tid' values and consequently the 'rid' values (which are a combination of 'sid' and 'tid' values) may not appear in both sets.
The task is simple. I would like to create the 'tv' column in df2. Wherever the 'rid' in df2 matches with the 'rid' in 'df1', the 'tv' column in df2 takes the corresponding 'tv' value from df1. If it does not match, the 'tv' value in 'df2' will be the median 'tv' value for the matching 'sid' subset in 'df1'.
In fact my original task includes creating a few more similar columns like 'tv' in df2 (based on their values in 'df1' ; these columns exist in 'df1').
I believe it takes forever to execute because my code combines a for loop with an if/else statement and multiple value assignments. Appreciate any inputs.
df1
sid tid rid tv
0 0 0 0-0 9
1 0 1 0-1 8
2 0 3 0-3 4
3 1 5 1-5 2
4 1 7 1-7 3
5 1 9 1-9 14
6 1 10 1-10 24
7 1 11 1-11 13
8 2 14 2-14 2
9 2 16 2-16 5
10 3 17 3-17 6
11 3 18 3-18 8
12 3 20 3-20 5
13 3 21 3-21 11
14 4 23 4-23 6
df2
sid tid rid
0 0 0 0-0
1 0 2 0-2
2 1 3 1-3
3 1 6 1-6
4 1 9 1-9
5 2 10 2-10
6 2 12 2-12
7 3 1 3-1
8 3 15 3-15
9 3 1 3-1
10 4 19 4-19
11 4 22 4-22
rids = [rid.split('-') for rid in df1.rid]
for r in df2.rid:
    s,t = r.split('-')
    if [s,t] in rids:
        df2.loc[df2.rid== r,'tv'] = df1.loc[df1.rid == r,'tv']
    else:
        df2.loc[df2.rid== r,'tv'] = df1.loc[df1.sid == int(s),'tv'].median()
The expected df2 shall be as follows:
sid tid rid tv
0 0 0 0-0 9.0
1 0 2 0-2 8.0
2 1 3 1-3 13.0
3 1 6 1-6 13.0
4 1 9 1-9 14.0
5 2 10 2-10 3.5
6 2 12 2-12 3.5
7 3 1 3-1 7.0
8 3 15 3-15 7.0
9 3 1 3-1 7.0
10 4 19 4-19 6.0
11 4 22 4-22 6.0
You can left merge df2 with a subset of df1 on 'rid' (only the tv column is needed from df1), then compute the per-sid median and use it to fill the missing values:
out = df2.merge(df1[['rid', 'tv']], on='rid', how='left')
# df2 has no tv column yet, so the merged column keeps the name tv
out['tv'] = out['tv'].fillna(out['sid'].map(df1.groupby('sid')['tv'].median()))
out
OR
Since you said that:
all the 'sid' values in 'df2' will have a match in 'df1' 'sid' values
So you can also left merge them on ['sid','rid'] and then fillna() the missing tv values with the per-sid median of df1's 'tv' column, looked up with map():
out = df2.merge(df1, on=['sid', 'rid'], how='left')
out['tv'] = out['tv'].fillna(out['sid'].map(df1.groupby('sid')['tv'].median()))
# tid exists in both frames, so the merge suffixes it; keep df2's copy
out = out.drop(columns='tid_y').rename(columns={'tid_x': 'tid'})
out
output of out:
sid tid rid tv
0 0 0 0-0 9.0
1 0 2 0-2 8.0
2 1 3 1-3 13.0
3 1 6 1-6 13.0
4 1 9 1-9 14.0
5 2 10 2-10 3.5
6 2 12 2-12 3.5
7 3 1 3-1 7.0
8 3 15 3-15 7.0
9 3 1 3-1 7.0
10 4 19 4-19 6.0
11 4 22 4-22 6.0
Here is a suggestion without any loops, based on dictionaries:
matched = df2['rid'].isin(df1['rid'])  # rows of df2 whose rid exists in df1
matching_values = dict(zip(df1['rid'], df1['tv']))  # rid -> tv
median_values = df1.groupby('sid')['tv'].median().to_dict()  # sid -> median tv
df2.loc[matched, 'tv'] = df2.loc[matched, 'rid'].map(matching_values)
df2.loc[~matched, 'tv'] = df2.loc[~matched, 'sid'].map(median_values)
This should do the trick. The logic is that we first create two dictionaries: one keyed by "rid" holding the matching "tv" values, and one keyed by "sid" holding the median "tv" values. Next, we look up each row of df2 in the appropriate dictionary with .map(): rows whose rid matches use the rid key, and the remaining rows use the sid key. Assigning through .loc writes into df2 itself rather than into a temporary copy.
Don't use for loops in pandas; they are known to be slow, and you lose the benefit of the internal optimizations that have been made.
Try to use the split-apply-combine pattern:
1. group df1 by sid to calculate the median: df1.groupby('sid')['tv'].median()
2. join df2 on df1: df2.join(df1.set_index('rid')['tv'], on='rid')
3. fill the NaN values with the median calculated in step 1.
(Haven't tested the code).
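The steps above can be sketched as follows, using the sample df1 and df2 from the question. This is one possible reading of the answer rather than the author's exact code: only the `tv` column of df1 is joined, so that the `sid` and `tid` columns do not collide, and the names `sid_median` and `joined` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "sid": [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
    "tid": [0, 1, 3, 5, 7, 9, 10, 11, 14, 16, 17, 18, 20, 21, 23],
    "tv":  [9, 8, 4, 2, 3, 14, 24, 13, 2, 5, 6, 8, 5, 11, 6],
})
df2 = pd.DataFrame({
    "sid": [0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "tid": [0, 2, 3, 6, 9, 10, 12, 1, 15, 1, 19, 22],
})
for frame in (df1, df2):
    frame["rid"] = frame["sid"].astype(str) + "-" + frame["tid"].astype(str)

# Step 1: median tv per sid (a Series indexed by sid)
sid_median = df1.groupby("sid")["tv"].median()

# Step 2: bring over tv for exact rid matches; unmatched rows get NaN
joined = df2.join(df1.set_index("rid")["tv"], on="rid")

# Step 3: fill the gaps with the median of the row's sid
joined["tv"] = joined["tv"].fillna(joined["sid"].map(sid_median))
print(joined)
```

All three steps are vectorized, so the cost grows with the size of the frames rather than with the number of Python-level loop iterations.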

Iterating through a data frame and grouping values in a range

I have a pandas DataFrame of weekly data like this:
Week Val
1 11
2 11
3 11
4 11
5 9
6 9
7 9
8 9
I would like to create an output table like this:
Week 1 Week 2 Val
1 4 11
5 8 9
Apologies, I am quite new to Python and its iterative tools, and I am not sure how to solve this problem.
I tried comparing each value with the adjacent row, but I do not know how to go further:
df['Match'] = df['Val'].eq(df['Val'].shift(-1))
You want to group by the consecutive blocks of Val. Comparing each value with the previous one and taking the cumsum of the changes gives a block id:
blocks = df['Val'].ne(df['Val'].shift(1)).cumsum()
(df.groupby(blocks, as_index=False)
.agg(Week1=('Week','min'), Week2=('Week','max'), Val=('Val', 'first'))
)
Or you can chain:
(df.groupby(df['Val'].ne(df['Val'].shift(1)).cumsum(), as_index=False)
.agg(Week1=('Week','min'), Week2=('Week','max'),Val=('Val', 'first'))
)
Output:
Week1 Week2 Val
0 1 4 11
1 5 8 9
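The intermediate values make the idea easier to follow. The sketch below rebuilds the sample frame and prints the block ids; renaming the key to `block` and calling `reset_index(drop=True)` are choices made for this illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"Week": range(1, 9), "Val": [11, 11, 11, 11, 9, 9, 9, 9]})

# True wherever Val differs from the previous row
# (the first row is compared with NaN, so it always counts as a change)
changed = df["Val"].ne(df["Val"].shift())

# The running total of changes gives one id per consecutive block
blocks = changed.cumsum().rename("block")
print(blocks.tolist())

out = (
    df.groupby(blocks)
      .agg(Week1=("Week", "min"), Week2=("Week", "max"), Val=("Val", "first"))
      .reset_index(drop=True)
)
print(out)
```

Because the block id only increases when the value changes, a value that reappears later (for example 11, 9, 11) would form a separate block each time.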

How to randomly generate an unobserved data in Python3

I have a dataframe which contains the observed data:
import pandas as pd
d = {'humanID': [1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
     'dogID': [1, 2, 1, 5, 4, 6, 7, 20, 9, 7],
     'month': [1, 1, 2, 3, 1, 2, 3, 1, 2, 2]}
df = pd.DataFrame(data=d)
The df is as follows:
humanID dogID month
0 1 1 1
1 1 2 1
2 2 1 2
3 2 5 3
4 2 4 1
5 2 6 2
6 2 7 3
7 2 20 1
8 2 9 2
9 2 7 2
In total we have two humans and twenty dogs, and the df above contains the observed data. For example:
The first row means: human1 adopted dog1 in January
The second row means: human1 adopted dog2 in January
The third row means: human2 adopted dog1 in February
========================================================================
My goal is to randomly generate two unobserved samples for each (human, month) pair, i.e. samples that do not appear in the original observed data.
For example, in January human1 did not adopt dogs [3,4,5,6,7,..20], and I want to randomly create two unobserved samples for that (human, month) pair in triple form:
humanID dogID month
1 20 1
1 10 1
However, the following sample is not allowed, since it appears in the original df:
humanID dogID month
1 2 1
human1 has no activity in February, so we don't need to sample unobserved data for that month.
human2 has activity in January, February and March. Therefore, for each of those months, we want to randomly create unobserved data. For example, in January human2 adopted dog1, dog4 and dog20. The two random unobserved samples could be:
humanID dogID month
2 2 1
2 6 1
The same process can be used for February and March.
I want to put all of the unobserved samples into one dataframe, such as the following unobserved:
humanID dogID month
0 1 20 1
1 1 10 1
2 2 2 1
3 2 6 1
4 2 13 2
5 2 16 2
6 2 1 3
7 2 20 3
Any fast way to do this?
PS: this is a code interview for a start-up company.
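Using groupby and random.sample (which draws without replacement, so the two dogs in each group are distinct; random.choices could return the same dog twice):
import random
dogs = set(range(1, 21))  # all 20 dog IDs
n_sample = 2
dfs = []
for (h_id, month), d in df.groupby(['humanID', 'month']):
    # dogs this human did not adopt in this month
    unobserved = sorted(dogs - set(d['dogID']))
    sample = pd.DataFrame([(h_id, dogID, month) for dogID in random.sample(unobserved, k=n_sample)])
    dfs.append(sample)
new_df = pd.concat(dfs).reset_index(drop=True)
new_df.columns = ['humanID', 'dogID', 'month']
print(new_df)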
humanID dogID month
0 1 11 1
1 1 5 1
2 2 19 1
3 2 18 1
4 2 15 2
5 2 14 2
6 2 16 3
7 2 18 3
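Since the output is random, it can be worth checking that none of the generated triples appears in the observed data. The sketch below is one way to do that with a left merge and `indicator=True`; the seed, the use of `random.sample`, and the names `all_dogs`, `rows`, `unobserved` and `check` are assumptions made for this illustration.

```python
import random
import pandas as pd

df = pd.DataFrame({
    "humanID": [1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    "dogID":   [1, 2, 1, 5, 4, 6, 7, 20, 9, 7],
    "month":   [1, 1, 2, 3, 1, 2, 3, 1, 2, 2],
})
all_dogs = set(range(1, 21))   # assumes dog IDs run from 1 to 20
random.seed(0)                 # fixed seed only to make the sketch repeatable

rows = []
for (human, month), grp in df.groupby(["humanID", "month"]):
    candidates = sorted(all_dogs - set(grp["dogID"]))   # dogs not adopted in this group
    rows += [(human, dog, month) for dog in random.sample(candidates, k=2)]
unobserved = pd.DataFrame(rows, columns=["humanID", "dogID", "month"])

# A left merge on all shared columns with indicator=True marks rows
# that also occur in df as "both"; every row here should be "left_only".
check = unobserved.merge(df.drop_duplicates(), how="left", indicator=True)
assert (check["_merge"] == "left_only").all()
print(unobserved)
```

Only (human, month) pairs that occur in df are sampled, which matches the requirement that human1 gets no samples for February.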
If I understand you correctly, you can use np.random.permutation() on the dogID column to generate a random permutation of that column:
import numpy as np
df_new=df.copy()
df_new['dogID']=np.random.permutation(df.dogID)
print(df_new.sort_values('month'))
humanID dogID month
0 1 1 1
1 1 20 1
4 2 9 1
7 2 1 1
2 2 4 2
5 2 5 2
8 2 2 2
9 2 7 2
3 2 7 3
6 2 6 3
Or, to randomly sample values from the full range of dogID:
df_new=df.copy()
a=np.arange(df_new.dogID.min(), df_new.dogID.max() + 1)  # + 1 so the largest dogID is included
df_new['dogID']=np.random.choice(a,df_new.shape[0])
print(df_new.sort_values('month'))
humanID dogID month
0 1 18 1
1 1 16 1
4 2 1 1
7 2 8 1
2 2 4 2
5 2 2 2
8 2 16 2
9 2 14 2
3 2 4 3
6 2 12 3

How to split rows in pandas with special condition of date?

I have a DataFrame like:
Code Date sales
1 2/2013 10
1 3/2013 11
2 3/2013 12
2 4/2013 14
...
I want to convert it into a DataFrame with a timeline, code, and sales of each type of item:
Date Code Sales1 Code Sales2
2/2013 1 10 NA NA
3/2013 1 11 2 12
4/2013 NA NA 2 14
....
or into a simpler way:
Date Code Sales1 Date Code Sales2 .....
2/2013 1 10 3/2013 2 12
3/2013 1 11 4/2013 2 14
or even into the simplest way, splitting into many small DataFrames
IIUC, you can use concat with the groupby result:
df.index = df.groupby('Code').cumcount()  # create the key for concat
pd.concat([x for _, x in df.groupby('Code')], axis=1)
Out[392]:
Code Date sales Code Date sales
0 1 2/2013 10 2 3/2013 12
1 1 3/2013 11 2 4/2013 14
Actually, splitting the data that way was not a good idea. I rethought the problem and solved it with pivot_table:
pd.pivot_table(df, values = ['sales'], index = ['code'], columns = ['date'])
and the result looks like this:
sum
date 2/2013 3/2013 4/2013 ....
code
1 10 11 NaN
2 NaN 12 14
...
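For reference, here is a small runnable sketch of the pivot_table approach. It uses the column names from the sample frame at the top of the question (Code, Date, sales), and `aggfunc="sum"` is an assumption based on the "sum" label in the output above; the default would be the mean.

```python
import pandas as pd

df = pd.DataFrame({
    "Code":  [1, 1, 2, 2],
    "Date":  ["2/2013", "3/2013", "3/2013", "4/2013"],
    "sales": [10, 11, 12, 14],
})

# One row per Code, one column per Date; missing combinations become NaN
wide = pd.pivot_table(df, values="sales", index="Code", columns="Date", aggfunc="sum")
print(wide)
```

The Date values here are plain strings, so the columns are ordered lexicographically (for example "10/2013" would come before "2/2013"); converting them with `pd.to_datetime(df["Date"], format="%m/%Y")` first may be preferable for longer timelines.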
