How to replace a column in a dataframe with the result of a function - python-3.x

Currently I have a dataframe with a column named age, which holds each person's age in days. I would like to convert this value to years. How could I achieve that?
At the moment, if one runs this command
df['age']
the result would be something like
0 18393
1 20228
2 18857
3 17623
4 17474
5 21914
6 22113
7 22584
8 17668
9 19834
10 22530
11 18815
12 14791
13 19809
I would like to change the value in each row to the current value / 365 (which would convert days to years).

As suggested:
>>> df['age'] / 365
age
0 50.391781
1 55.419178
2 51.663014
3 48.282192
4 47.873973
Or, if you need a whole number of years, use floor division:
>>> df['age'] // 365
age
0 50
1 55
2 51
3 48
4 47
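Neither expression above modifies df; to actually replace the column, the result has to be assigned back. A minimal sketch, assuming a small sample frame built from the first five values in the question (the column name age_years is hypothetical and only used for illustration):

```python
import pandas as pd

# Sample data: the first five ages (in days) shown in the question.
df = pd.DataFrame({"age": [18393, 20228, 18857, 17623, 17474]})

# Keep the fractional years in a new, hypothetical column ...
df["age_years"] = df["age"] / 365

# ... or overwrite the original column with whole years.
df["age"] = df["age"] // 365

print(df)
```

Dividing by 365 ignores leap years; if that matters, dividing by 365.25 is a common approximation.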

Related

taking top 3 in a groupby, and lumping rest into 'other category'

I am currently doing a groupby in pandas like this:
df.groupby(['grade'])['students'].nunique()
and the result I get is this:
grade
grade 1 12
grade 2 8
grade 3 30
grade 4 2
grade 5 600
grade 6 90
Is there a way to get the output so that I see the top 3 groups, with everything else classified under "other"?
This is what I am looking for:
grade
grade 3 30
grade 5 600
grade 6 90
other (3 other grades) 22
I think you can add a helper column to the df and call it something like "Grouping".
Label the rows of the top 3 grades with their original grade name, label the remaining rows "other", and then group by the "Grouping" column.
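The helper-column idea can be sketched as follows. The raw data is not shown in the question, so the frame below is hypothetical (grades A to E with made-up student ids); only the column names grade and students come from the question.

```python
import pandas as pd

# Hypothetical raw data: one row per (grade, student) pair.
df = pd.DataFrame({
    "grade": ["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D", "E"],
    "students": list(range(1, 12)),
})

# Unique students per grade, then the names of the three largest grades.
counts = df.groupby("grade")["students"].nunique()
top3 = counts.nlargest(3).index

# Helper column: keep the grade name for the top 3, label the rest "other".
df["Grouping"] = df["grade"].where(df["grade"].isin(top3), "other")

result = df.groupby("Grouping")["students"].nunique()
print(result)
```

Note that nunique on the "other" group counts a student once even if they appear in several of the lumped grades, so it can differ from the sum of the per-grade counts.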
Can't do much without the actual input data, but if this is your starting dataframe (df) after your groupby -
grade unique
0 grade_1 12
1 grade_2 8
2 grade_3 30
3 grade_4 2
4 grade_5 600
5 grade_6 90
You can do a few more steps to get to your table -
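ddf = df.nlargest(3, 'unique')
# DataFrame.append was removed in pandas 2.0, so build the extra row and use pd.concat
other = pd.DataFrame({'grade': ['Other'], 'unique': [df['unique'].sum() - ddf['unique'].sum()]})
ddf = pd.concat([ddf, other], ignore_index=True)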
grade unique
0 grade_5 600
1 grade_6 90
2 grade_3 30
3 Other 22

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by month-year, keeps the entry with the latest date in each month-year, and drops the rest. The data runs until the year 2020.
So far I was only able to fetch the count by month-year. I have not been able to write code that groups the data by month-year and indicator and gives the correct result.
Use Series.dt.to_period to get monthly periods, get the index of the maximal date per group with DataFrameGroupBy.idxmax, and then pass the result to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('M'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('M'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
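The question also mentions grouping by indicator. A minimal sketch of that variant, assuming the first seven rows of the sample data; the names month and latest are hypothetical, and the period Series is renamed so that it does not clash with the Date column:

```python
import pandas as pd

# First seven rows of the sample data from the question.
df = pd.DataFrame({
    "Date": ["2000-01-30", "2000-01-31", "2000-03-30", "2000-02-27",
             "2000-02-28", "2000-03-31", "2000-03-28"],
    "Indicator": ["A", "A", "C", "B", "B", "C", "C"],
    "Value": [30, 40, 50, 60, 70, 90, 100],
})
df["Date"] = pd.to_datetime(df["Date"])

# Monthly period used as a grouping key alongside the Indicator column.
month = df["Date"].dt.to_period("M").rename("month")

# For each (month, Indicator) group, keep the row with the latest date.
latest = df.loc[df.groupby([month, "Indicator"])["Date"].idxmax()]
print(latest)
```

With this data each month contains a single indicator, so the result matches grouping by month alone; the extra key only matters when several indicators share a month.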

Pandas : merge dataframes with conditions

I'd like to do something fairly complicated, I think.
I have 2 pandas DataFrames,
contact_extrafields (which is a CSV file converted to a DataFrame):
contact_id departement age region size
0 17068CE3 5 19.5
1 788159ED 59 18 ABC
2 4796EDA9 69 100.0
3 2BB080E4 32 DEF 50.5
4 8562B30E 10 GHI 79.95
5 9602758E 67 JKL 23.7
6 3CBBA9F7 65 MNO 14.7
7 DAE5EE44 75 98 159.6
8 5B9E3410 49 10 PQR 890.1
...
datafield_types (which is a dictionary converted to a DataFrame):
name datatype_id datafield_id datatype_name
0 size 1 4 float
1 region 2 3 string
2 age 3 2 integer
3 departement 3 1 integer
I would like a new DataFrame like this :
contact_id datafield_id string_value integer_value boolean_value float_value
0 17068CE3 4 19.5
1 17068CE3 3
2 17068CE3 2 5
3 17068CE3 1
4 788159ED 4
5 788159ED 3 ABC
6 788159ED 2 18
7 788159ED 1 59
....
The DataFrame contact_extrafields contains about 3 million lines.
EDIT (example):
If I take contact_id 788159ED from the DataFrame contact_extrafields,
I take the name of each column and its value,
then look up the type of the value in the DataFrame datafield_types using the column name.
For example, for the column departement the value is 59 and, according to datafield_types, its type is integer (datatype_id 3),
so it should insert a line like this in the new DataFrame that I will create:
contact_id datafield_id string_value integer_value boolean_value float_value
0 788159ED 1 59
....
The datafield_id is retrieved from the DataFrame datafield_types; it tells me that contact 788159ED has the value 59 for the column departement, which is of integer type.
Each column creates one row in the DataFrame I want to build.
Is it possible to do this with pandas?
How can I do it?
The columns in contact_extrafields can change (in which case I will change the names in datafield_types too).
I've tried a lot of things, and they have all ended up exhausting the memory.
My code is running on a machine with 16 GB of RAM.
Thanks a lot!
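One possible way to express the steps described in the edit is to reshape contact_extrafields to long form with melt and then merge it with datafield_types. This is only a sketch under assumptions: the two small frames below are hand-built from the first two contacts in the question, empty cells are assumed to be missing values, and the intermediate name long_df is hypothetical. Whether it fits in memory for about 3 million lines has not been tested here; processing the input in chunks may be necessary.

```python
import pandas as pd

# Hand-built sample: the first two contacts from the question.
contact_extrafields = pd.DataFrame({
    "contact_id": ["17068CE3", "788159ED"],
    "departement": [None, 59],
    "age": [5, 18],
    "region": [None, "ABC"],
    "size": [19.5, None],
})
datafield_types = pd.DataFrame({
    "name": ["size", "region", "age", "departement"],
    "datatype_id": [1, 2, 3, 3],
    "datafield_id": [4, 3, 2, 1],
    "datatype_name": ["float", "string", "integer", "integer"],
})

# One row per (contact, column) pair.
long_df = contact_extrafields.melt(
    id_vars="contact_id", var_name="name", value_name="value"
)

# Attach the datafield_id and type name of each column.
long_df = long_df.merge(
    datafield_types[["name", "datafield_id", "datatype_name"]],
    on="name", how="left",
)

# Put the value into the column matching its type; other columns stay empty.
for type_name in ["string", "integer", "boolean", "float"]:
    long_df[f"{type_name}_value"] = long_df["value"].where(
        long_df["datatype_name"] == type_name
    )

result = (
    long_df.drop(columns=["name", "value", "datatype_name"])
    .sort_values(["contact_id", "datafield_id"], ascending=[True, False])
    .reset_index(drop=True)
)
print(result)
```

Because the melted value column mixes strings and numbers, the typed columns come out with object dtype; converting them to proper dtypes would be a separate step.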

Find occurrences of conditional value from one column and count values from another column in a dataframe

I have a dataframe containing userIds, week number, and a column X as shown below:
I am trying to group by UserIds and keep the users for which X is greater than 3 for 3 weeks.
I have tried using groupby and lambda in pandas, but I am stuck:
weekly_X = df.groupby(['Userid','Week #'], as_index=False)
UserIds Week X
123 14 3
123 15 4
123 16 7
123 17 2
123 18 1
456 14 4
456 15 5
456 16 11
456 17 2
456 18 6
The result I am aiming for is a dataframe containing user 456 and how many weeks the condition occurred.
df_3 = df.groupby('UserIds').apply(lambda x: (x.X > 3).sum() > 3).to_frame('ID_want').reset_index()
df = df[df.UserIds.isin(df_3.loc[df_3.ID_want == 1,'UserIds'])]
Count the values greater than 3 per user with an aggregated sum, and then keep only the users whose count is greater than 3:
s = df['X'].gt(3).astype(int).groupby(df['UserIds']).sum()
out = s[s.gt(3)].reset_index(name='count')
print (out)
UserIds count
0 456 4
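If the goal is to keep the original rows of the qualifying users rather than only their counts, the same count can be broadcast back to every row with transform. A minimal sketch using the sample data from the question; the names threshold, min_weeks, weeks_over and kept are hypothetical:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "UserIds": [123] * 5 + [456] * 5,
    "Week": [14, 15, 16, 17, 18] * 2,
    "X": [3, 4, 7, 2, 1, 4, 5, 11, 2, 6],
})

threshold = 3  # X must exceed this value ...
min_weeks = 3  # ... in more than this many weeks

# Per-row count of weeks in which the same user exceeded the threshold.
weeks_over = (
    df["X"].gt(threshold).astype(int).groupby(df["UserIds"]).transform("sum")
)

# Keep every row of the users that meet the condition.
kept = df[weeks_over.gt(min_weeks)]
print(kept)
```

Changing gt(min_weeks) to ge(min_weeks) would accept users with exactly 3 such weeks, if that is what "for 3 weeks" is meant to cover.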

Select rows with the same value in one column but different values in the other column

I have some duplicates in my data that I need to correct.
This is a sample of a dataframe:
test = pd.DataFrame({'event_id':['1','1','2','3','5','6','9','3','9','10'],
'user_id':[0,0,0,1,1,3,3,4,4,4],
'index':[10,20,30,40,50,60,70,80,90,100]})
I need to select all the rows that have equal values in event_id but differing values in user_id. I tried this (based on a similar question but with no accepted answer):
test.groupby('event_id').filter(lambda g: len(g) > 1).drop_duplicates(subset=['event_id', 'user_id'], keep="first")
out:
event_id user_id index
0 1 0 10
3 3 1 40
6 9 3 70
7 3 4 80
8 9 4 90
But I do not need the first row, where user_id is the same for both duplicates (0).
The second part of the question is - what is the best way to correct the duplicate record? How could I add a suffix to event_id (_new) but only in this row:
event_id user_id index
3 3_new 1 40
6 9_new 3 70
7 3 4 80
8 9 4 90
Here is my attempt at fixing your code:
test.groupby('event_id').filter(lambda x: (len(x['event_id']) == x['user_id'].nunique()) & (len(x['event_id']) > 1))
Out[85]:
event_id user_id index
3 3 1 40
6 9 3 70
7 3 4 80
8 9 4 90
To correct the duplicate rows, you can create a new sub key; personally, I would not recommend modifying your original columns.
test['subkey'] = test.groupby('event_id').cumcount()
Try:
test[test.duplicated(['event_id'], keep=False) &
~test.duplicated(['event_id','user_id'], keep=False)]
Output:
event_id user_id index
3 3 1 40
6 9 3 70
7 3 4 80
8 9 4 90
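For the second part of the question (adding a _new suffix), one possible sketch builds on the mask above and renames only the first row of each conflicting event_id, which reproduces the desired output shown in the question. The names conflict, first_in_event, to_rename and fixed are hypothetical, and the result is written to a copy so that test itself stays unchanged:

```python
import pandas as pd

# Sample data from the question.
test = pd.DataFrame({
    "event_id": ["1", "1", "2", "3", "5", "6", "9", "3", "9", "10"],
    "user_id": [0, 0, 0, 1, 1, 3, 3, 4, 4, 4],
    "index": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# Rows whose event_id repeats while the (event_id, user_id) pair does not.
conflict = (test.duplicated(["event_id"], keep=False)
            & ~test.duplicated(["event_id", "user_id"], keep=False))

# First occurrence of each event_id.
first_in_event = ~test.duplicated(["event_id"], keep="first")

# Rename only the first row of each conflicting event_id.
to_rename = conflict & first_in_event
fixed = test.copy()
fixed.loc[to_rename, "event_id"] = fixed.loc[to_rename, "event_id"] + "_new"

print(fixed[conflict])
```

If an event_id is shared by three or more users, this still renames only the first row, so the remaining rows would keep the same event_id; that case would need a different suffix scheme.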
