How to convert multi-indexed datetime index into integer? - python-3.x

I have a multi-indexed DataFrame that is the result of a groupby (by 'id' and 'date').
                x  y
id  date
abc 3/1/1994  100  7
    9/1/1994   90  8
    3/1/1995   80  9
bka 5/1/1993   50  8
    7/1/1993   40  9
I'd like to convert those dates into integer-like labels, such as
             x  y
id  date
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9
I thought it would be simple but I couldn't get there easily. Is there a simple way to work on this?

Try this:
s = 'day ' + df.groupby(level=0).cumcount().astype(str)
df1 = df.set_index([s], append=True).droplevel(1)
             x  y
id
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9

You can calculate the new level and create a new index:
lvl1 = 'day ' + df.groupby('id').cumcount().astype('str')
df.index = pd.MultiIndex.from_tuples((x,y) for x,y in zip(df.index.get_level_values('id'), lvl1) )
Output:
             x  y
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9
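Both answers above rely on GroupBy.cumcount to number the rows inside each 'id' group. A minimal self-contained sketch of that idea follows; the sample frame is rebuilt by hand from the question, and the names `day` and `result` are illustrative only. It assumes the rows are already in date order within each id, because cumcount follows row order rather than the date values.

```python
import pandas as pd

# Rebuild the sample frame from the question (dates kept as plain strings).
df = pd.DataFrame(
    {"x": [100, 90, 80, 50, 40], "y": [7, 8, 9, 8, 9]},
    index=pd.MultiIndex.from_tuples(
        [
            ("abc", "3/1/1994"),
            ("abc", "9/1/1994"),
            ("abc", "3/1/1995"),
            ("bka", "5/1/1993"),
            ("bka", "7/1/1993"),
        ],
        names=["id", "date"],
    ),
)

# Number the rows within each id (0, 1, 2, ...) and turn them into labels.
day = "day " + df.groupby(level="id").cumcount().astype(str)

# Append the labels as a new index level, then drop the original date level.
result = df.set_index(day.rename("day"), append=True).droplevel("date")
print(result)
```

If the dates might be out of order, sorting the frame by id and a parsed date first would be a reasonable extra step before calling cumcount.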

Related

convert string as 'hours' and 'mins' into minutes

I have a column in my dataframe df:
Time
2 hours 3 mins
5 hours 10 mins
1 hours 40 mins
10 mins
4 hours
6 hours 0 mins
I want to create a new column 'Minutes' in df that converts this column to minutes:
Minutes
123
310
100
10
240
360
Is there a python function to do this?
What I have tried is:
df['Minutes'] = pd.eval(
df['Time'].replace(['hours?', 'mins'], ['*60+', ''], regex=True))
There is an awkward limitation here: pd.eval only handles fewer than roughly 100 rows. To work around it, strip the trailing + and call pd.eval on each value through Series.apply:
df['Minutes'] = (df['Time'].replace(['hours?', 'mins'], ['*60+', ''], regex=True)
.str.strip('+')
.apply(pd.eval))
print (df)
Time Minutes
0 2 hours 3 mins 123
1 5 hours 10 mins 310
2 1 hours 40 mins 100
3 10 mins 10
4 4 hours 240
5 6 hours 0 mins 360
#verify for 120 rows
df = pd.concat([df] * 20, ignore_index=True)
df['Minutes1'] = pd.eval(
df['Time'].replace(['hours?', 'mins'], ['*60+', ''], regex=True).str.strip('+'))
print (df)
ValueError: unknown type object
Another solution with Series.str.extract and Series.add:
h = df['Time'].str.extract(r'(\d+)\s+hours').astype(float).mul(60)
m = df['Time'].str.extract(r'(\d+)\s+mins').astype(float)
df['Minutes'] = h.add(m, fill_value=0).astype(int)
print (df)
Time Minutes
0 2 hours 3 mins 123
1 5 hours 10 mins 310
2 1 hours 40 mins 100
3 10 mins 10
4 4 hours 240
5 6 hours 0 mins 360
jezrael's answer is excellent, but I spent quite some time working on this, so I figured I'd post it.
You can use a regex to capture the hours and minutes from your column, then combine them (hours * 60 + minutes) and assign the result to a new column:
r = r"(?:(\d+) hours ?)?(?:(\d+) mins)?"
hours = df.Time.str.extract(r)[0].astype(float).fillna(0) * 60
minutes = df.Time.str.extract(r)[1].astype(float).fillna(0)
df['minutes'] = hours + minutes
print(df)
Time minutes
0 2 hours 3 mins 123.0
1 5 hours 10 mins 310.0
2 1 hours 40 mins 100.0
3 10 mins 10.0
4 4 hours 240.0
5 6 hours 0 mins 360.0
I enjoy using https://regexr.com/ to test my regex
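For reference, the extract-based approach can be wrapped in a small helper. This is only a sketch: the function name `to_minutes` is hypothetical, and it assumes every value uses the words "hours" and "mins" exactly as in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Time": [
            "2 hours 3 mins",
            "5 hours 10 mins",
            "1 hours 40 mins",
            "10 mins",
            "4 hours",
            "6 hours 0 mins",
        ]
    }
)


def to_minutes(time_col: pd.Series) -> pd.Series:
    """Convert strings like '2 hours 3 mins' to whole minutes (hypothetical helper)."""
    # expand=False returns a Series; rows without a match become NaN.
    hours = time_col.str.extract(r"(\d+)\s*hours?", expand=False)
    mins = time_col.str.extract(r"(\d+)\s*mins?", expand=False)
    # Missing parts count as zero.
    hours = pd.to_numeric(hours).fillna(0)
    mins = pd.to_numeric(mins).fillna(0)
    return (hours * 60 + mins).astype(int)


df["Minutes"] = to_minutes(df["Time"])
print(df)
```

Unlike the pd.eval variant, this does not evaluate strings as expressions, so the number of rows does not matter.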

How to replace a column in dataframe for the result of a function

Currently I have a dataframe with a column named age, which holds each person's age in days. I would like to convert this value to years. How can I achieve that?
At the moment, if one runs this command
df['age']
the result is something like
0 18393
1 20228
2 18857
3 17623
4 17474
5 21914
6 22113
7 22584
8 17668
9 19834
10 22530
11 18815
12 14791
13 19809
I would like to change the value in each row to the current value / 365 (which would convert days to years).
As suggested:
>>> df['age'] / 365
age
0 50.391781
1 55.419178
2 51.663014
3 48.282192
4 47.873973
Or, if you need whole years, use floor division:
>>> df['age'] // 365
age
0 50
1 55
2 51
3 48
4 47
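The expressions above only return a new Series; to actually replace the column, the result has to be assigned back. A minimal sketch using the first five ages from the question (the column name `age_years` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [18393, 20228, 18857, 17623, 17474]})

# Keep the fractional years in a separate column first...
df["age_years"] = df["age"] / 365

# ...then overwrite the original column with whole years (floor division).
df["age"] = df["age"] // 365
print(df)
```

Dividing by 365 ignores leap days, which matches the question's own definition of a year.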

how to group a string in a column in python?

I have a dataframe:
PROD TYPE QUANTI
0 wood i2 20
1 tv ut1 30
2 tabl il3 50
3 rmt z1 40
4 zet u1 60
5 rm t1 60
6 rt t2 80
7 dud i4 40
I want to group the values in the "TYPE" column into categories by their letter prefix (i, u, z, y, etc.).
Expected Output
PROD TYPE QUANTI
0 wood i_group 20
1 tv ut_group 30
2 tabl il_group 50
3 rmt z_group 40
4 zet y_group 60
5 rm t_group 60
6 rt t_group 80
7 dud i_group 40
Use Series.replace to replace the number with _group:
df['TYPE'] = df['TYPE'].replace(r'\d+', '_group', regex=True)
print (df)
PROD TYPE QUANTI
0 wood i_group 20
1 tv ut_group 30
2 tabl il_group 50
3 rmt z_group 40
4 zet u_group 60
5 rm t_group 60
6 rt t_group 80
7 dud i_group 40
If some values might have no number, use:
df['TYPE'] = df['TYPE'].replace(r'\d+', '', regex=True) + '_group'
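A self-contained sketch of the second variant, which also covers values without a number. The extra value "x" and the column name `TYPE_GROUP` are made up for illustration, and the pattern here is anchored to trailing digits only, which is an assumption about how the codes are shaped.

```python
import pandas as pd

# TYPE values from the question plus one hypothetical value without digits.
df = pd.DataFrame({"TYPE": ["i2", "ut1", "il3", "z1", "u1", "t1", "t2", "i4", "x"]})

# Strip the trailing digits, then append the suffix to every value.
df["TYPE_GROUP"] = df["TYPE"].str.replace(r"\d+$", "", regex=True) + "_group"
print(df)
```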

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by month-year, keeps the entry with the latest date in each month-year, and drops the rest. The data runs until the year 2020.
I was only able to get the count per month-year. I have not been able to write code that groups the data by month-year and indicator and returns the correct results.
Use Series.dt.to_period to get monthly periods, find the index of the latest date in each group with idxmax on the grouped Date column, and then pass those index labels to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('M'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('M'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
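Since the question also mentions grouping by indicator, here is a sketch that groups by both the monthly period and the Indicator column. It uses only the year-2000 rows from the sample; the names `month`, `latest_idx` and `latest` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": [
            "2000-01-30", "2000-01-31", "2000-03-30", "2000-02-27",
            "2000-02-28", "2000-03-31", "2000-03-28",
        ],
        "Indicator": ["A", "A", "C", "B", "B", "C", "C"],
        "Value": [30, 40, 50, 60, 70, 90, 100],
    }
)
df["Date"] = pd.to_datetime(df["Date"])

# Monthly period used purely as a grouping key; renamed so it does not
# share a name with the Date column.
month = df["Date"].dt.to_period("M").rename("month")

# Index label of the latest date within each (month, Indicator) group.
latest_idx = df.groupby([month, "Indicator"])["Date"].idxmax()

# Keep only those rows.
latest = df.loc[latest_idx]
print(latest)
```

With this sample each month holds a single indicator, so the result matches grouping by month alone; the extra key only matters when several indicators share a month.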

Pandas - Fill N rows for a specific column with a integer value and increment the integer there after

I have a dataframe to which I added a column named col_1. I want to fill that column with integer values, starting at 1 in the first row and incrementing after every 4th row. The resulting column should look like this:
col_1
1
1
1
1
2
2
2
2
The current approach I have is a very brute force one:
for x in range(len(df)):
    if x <= 3:
        df['col_1'][x] = 1
    if x > 3 and x <= 7:
        df['col_1'][x] = 2
This might work for something small but when moving to something larger it will chew up a lot of time.
If the DataFrame has the default RangeIndex, you can use integer division and add 1:
df['col_1'] = df.index // 4 + 1
Or, for a general solution, use a helper array with the length of the DataFrame:
df['col_1'] = np.arange(len(df)) // 4 + 1
To repeat a pattern of 1 and 2, also take the result modulo 2:
df = pd.DataFrame({'a':range(20, 40)})
df['col_1'] = (np.arange(len(df)) // 4) % 2 + 1
print (df)
a col_1
0 20 1
1 21 1
2 22 1
3 23 1
4 24 2
5 25 2
6 26 2
7 27 2
8 28 1
9 29 1
10 30 1
11 31 1
12 32 2
13 33 2
14 34 2
15 35 2
16 36 1
17 37 1
18 38 1
19 39 1
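The general np.arange variant can be sketched with a non-default index to show that it depends only on row position. The helper name `block_counter` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd


def block_counter(n_rows, block_size=4, start=1):
    """Return start, start, ..., start+1, ... increasing every block_size rows."""
    return np.arange(n_rows) // block_size + start


# A frame with a non-numeric index: df.index // 4 would not work here,
# but the positional helper array does.
df = pd.DataFrame({"a": range(20, 30)}, index=list("abcdefghij"))
df["col_1"] = block_counter(len(df))
print(df)
```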
