Want to assign an id to duplicate rows - python-3.x

id name age year
0 khu 12 2018
1 she 21 2019
2 waqar 22 2015
3 khu 12 2018
4 she 21 2018
5 waqar 22 2015
I want the result to look like this:
id name age year
0 khu 12 2018
1 she 21 2019
2 waqar 22 2015
0 khu 12 2018
1 she 21 2018
2 waqar 22 2015

Use GroupBy.ngroup:
df['id'] = df.groupby('name', sort=False).ngroup()
# to detect duplicates across several columns, group by all of them
#df['id'] = df.groupby(['name','age'], sort=False).ngroup()
print (df)
id name age year
0 0 khu 12 2018
1 1 she 21 2019
2 2 waqar 22 2015
3 0 khu 12 2018
4 1 she 21 2018
5 2 waqar 22 2015

You can also use factorize. Other options are converting the column to category and taking cat.codes, or sklearn's LabelEncoder.
df['id']=pd.factorize(df['name'])[0]
df
Out[470]:
id name age year
0 0 khu 12 2018
1 1 she 21 2019
2 2 waqar 22 2015
3 0 khu 12 2018
4 1 she 21 2018
5 2 waqar 22 2015
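The three pandas approaches mentioned in these answers can be compared side by side. This is a minimal sketch that rebuilds the question's table by hand; the variable names are illustrative only. Note that cat.codes numbers the categories in sorted order rather than in order of first appearance, so it only agrees with the other two here because the names happen to appear alphabetically.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "name": ["khu", "she", "waqar", "khu", "she", "waqar"],
    "age": [12, 21, 22, 12, 21, 22],
    "year": [2018, 2019, 2015, 2018, 2018, 2015],
})

# Group number in order of first appearance (sort=False)
ids_ngroup = df.groupby("name", sort=False).ngroup()

# factorize returns (codes, uniques); the codes follow first appearance
ids_factorize = pd.factorize(df["name"])[0]

# Category codes follow the sorted order of the categories
ids_codes = df["name"].astype("category").cat.codes

df["id"] = ids_ngroup
print(df)
```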


Using Python, I want to calculate the last 6 months' average for each month

I have a dataframe with 3 columns [user_id, year_month and value]. I want to calculate the average of the last 6 months for each unique user_id automatically and assign it to a new column.
user_id value year_month
1 50 2021-01
1 54 2021-02
.. .. ..
1 50 2021-11
1 47 2021-12
2 36 2021-01
2 48.5 2021-05
.. .. ..
2 54 2021-11
2 30.2 2021-12
3 41.4 2021-01
3 48.5 2021-02
3 41.4 2021-05
.. .. ..
3 30.2 2021-12
The data covers 12-24 months in total.
to get jan 2022 value[dec 2021 to july 2021]=[55+32+33+63+54+51]/6
to get feb 2022 value[jan 2022 to aug 2021] =[32+33+37+53+54+51]/6
to get mar 2022 value[feb 2022 to sep 2021] =[45+32+33+63+54+51]/6
to get apr 2022 value[mar 2022 to oct 2021] =[63+54+51+45+32+33]/6
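First, convert year_month to datetime and sort so that each user's rows are in chronological order:
df['year_month'] = pd.to_datetime(df['year_month'])
df = df.sort_values(['user_id', 'year_month'])
Then take the rolling mean over each user's last 6 rows:
df['avg_6m'] = df.groupby('user_id')['value'].transform(lambda s: s.rolling(6).mean())
This assumes one row per user per month with no gaps. A time-based window such as rolling('6M') does not work here, because rolling only accepts fixed frequencies and a month is not one.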
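A small runnable sketch of a per-user rolling six-row mean, on made-up values (the question's real numbers are elided). It assumes exactly one row per user per month; the column names avg_last_6 and avg_prev_6 are hypothetical. The extra shift is there because the question asks for the value of a month to be built from the six months before it.

```python
import pandas as pd

months = [f"2021-{m:02d}" for m in range(6, 13)]  # 2021-06 .. 2021-12
df = pd.DataFrame({
    "user_id": [1] * 7 + [2] * 7,
    "year_month": months * 2,
    "value": [10, 20, 30, 40, 50, 60, 70, 6, 6, 6, 6, 6, 6, 12],
})

df["year_month"] = pd.to_datetime(df["year_month"])
df = df.sort_values(["user_id", "year_month"]).reset_index(drop=True)

# Mean of the current row and the five rows before it, within each user
df["avg_last_6"] = df.groupby("user_id")["value"].transform(
    lambda s: s.rolling(6).mean()
)

# Mean of the six rows strictly before the current one, within each user
df["avg_prev_6"] = df.groupby("user_id")["avg_last_6"].shift(1)

print(df)
```

Rows with fewer than six earlier observations stay NaN; passing min_periods to rolling would relax that if partial windows are acceptable.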

Pyspark Multiple Filter Dataframe [duplicate]

My input spark dataframe is;
Year Month Client
2018 1 1
2018 2 1
2018 3 1
2018 4 1
2018 5 1
2018 6 1
2018 7 1
2018 8 1
2018 9 1
2018 10 1
2018 11 1
2018 12 1
2019 1 1
2019 2 1
2019 3 1
2019 4 1
2019 5 1
2019 6 1
2019 7 1
2019 8 1
2019 9 1
2019 10 1
2019 11 1
2019 12 1
2018 1 2
2018 2 2
2018 3 2
2018 4 2
2018 5 2
2018 6 2
2018 7 2
2018 8 2
2018 9 2
2018 10 2
2018 11 2
2018 12 2
2019 1 2
2019 2 2
2019 3 2
2019 4 2
2019 5 2
2019 6 2
2019 7 2
2019 8 2
2019 9 2
2019 10 2
2019 11 2
2019 12 2
The dataframe is ordered by client, year and month. I want to extract the data after 2019-06 for each client.
The desired output for the data above is:
Year Month Client
2018 1 1
2018 2 1
2018 3 1
2018 4 1
2018 5 1
2018 6 1
2018 7 1
2018 8 1
2018 9 1
2018 10 1
2018 11 1
2018 12 1
2019 1 1
2019 2 1
2019 3 1
2019 4 1
2019 5 1
2019 6 1
2018 1 2
2018 2 2
2018 3 2
2018 4 2
2018 5 2
2018 6 2
2018 7 2
2018 8 2
2018 9 2
2018 10 2
2018 11 2
2018 12 2
2019 1 2
2019 2 2
2019 3 2
2019 4 2
2019 5 2
2019 6 2
Could you please help me with this?
Did you mean before 2019-06? (you wrote after 2019-06)
If so, you can do a filter:
df2 = df.filter('Year < 2019 or (Year = 2019 and Month <= 6)')

How to make Python calendar show 3 letters instead of 2?

I have this code
import calendar
print(calendar.month(2020, 5))
print()
and it shows
May 2020
Mo Tu We Th Fr Sa Su
1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
What I want to know is how I can make it show Mon Tue Wed Thu Fri Sat Sun instead of 2 letters.
Well, I solved it by adding w=3 to the code, so now it looks like this
import calendar
print(calendar.month(2020, 5, w=3))
print()
and the result is
May 2020
Mon Tue Wed Thu Fri Sat Sun
1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
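Why w=3 works: w is the width of each day column, and the weekday header is built from calendar.day_abbr cut to that width, so a width of 3 leaves room for the full three-letter abbreviation. A short sketch comparing both widths (the variable names are illustrative, and the English names assume the default locale):

```python
import calendar

# Default column width is 2, so weekday names are cut to two letters
two_letter = calendar.month(2020, 5)

# With w=3 each column is three characters wide and fits "Mon", "Tue", ...
three_letter = calendar.month(2020, 5, w=3)

print(two_letter)
print(three_letter)
```

Widths of 9 or more switch the header to the full day names.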

How to return a matrix to a vector

Is there any way to turn a matrix into a vector? I don't know the number of elements in the matrix in advance, so let's say the matrix has n elements.
Below is an example of how I want to transform the table.
Any help, guidance or suggestions would be much appreciated.
raw data.csv:
,January,February,March,April,May,June,July,August,September,October,November,December
2019,1,2,3,4,5,6,7,8,9,10,11,12
2018,13,14,15,16,17,18,19,20,21,22,23,24
2017,25,26,27,28,29,30,31,32,33,34,35,36
raw=pd.read_csv('raw data.csv')
raw.head()
Unnamed: 0 January February March April May June July August September October November December
0 2019 1 2 3 4 5 6 7 8 9 10 11 12
1 2018 13 14 15 16 17 18 19 20 21 22 23 24
2 2017 25 26 27 28 29 30 31 32 33 34 35 36
final=pd.read_csv('Final.csv')
final.head(20)
Year&Month Value
0 2019 January 1
1 2019 February 2
2 2019 March 3
3 2019 April 4
4 2019 May 5
5 2019 June 6
6 2019 July 7
7 2019 August 8
8 2019 September 9
9 2019 October 10
10 2019 November 11
11 2019 December 12
12 2018 January 13
13 2018 February 14
14 2018 March 15
15 2018 April 16
16 2018 May 17
17 2018 June 18
18 2018 July 19
19 2018 August 20
You can use pandas stack:
df = pd.read_csv(r'raw data.csv')
df.set_index(df.columns[0]).stack().reset_index()
Out:
Unnamed: 0 level_1 0
0 2019 January 1
1 2019 February 2
2 2019 March 3
3 2019 April 4
4 2019 May 5
5 2019 June 6
6 2019 July 7
7 2019 August 8
8 2019 September 9
9 2019 October 10
10 2019 November 11
11 2019 December 12
12 2018 January 13
13 2018 February 14
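To get from the stacked result to the exact two-column layout in the question, the year and month can be joined into one string. This is a sketch on a shortened copy of the CSV, read from an in-memory string instead of the file; the intermediate names (long_df, final) are hypothetical.

```python
import io
import pandas as pd

# Shortened copy of the question's CSV, kept in memory for the example
csv_text = """,January,February,March
2019,1,2,3
2018,13,14,15
2017,25,26,27
"""

raw = pd.read_csv(io.StringIO(csv_text), index_col=0)

# stack turns the month columns into a second index level: one row per (year, month)
long_df = raw.stack().reset_index()
long_df.columns = ["Year", "Month", "Value"]

# Combine year and month into the single label column the question asks for
long_df["Year&Month"] = long_df["Year"].astype(str) + " " + long_df["Month"]
final = long_df[["Year&Month", "Value"]]

print(final)
```

Because stack works on whatever rows and columns are present, the number of elements does not need to be known in advance.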

condition after groupby: data science

I have a big df; this is an example to illustrate my issue. I want to know which ids, within each year_of_life, are in the top one percent in terms of jobs.
For example:
id year rap jobs_c jobs year_of_life rap_new
1 2009 0 300 10 NaN 0
2 2012 0 2012 12 0 0
3 2013 0 2012 12 1 1
4 2014 0 2012 13 2 1
5 2015 1 2012 15 3 1
6 2016 0 2012 17 4 0
7 2017 0 2012 19 5 0
8 2009 0 2009 15 0 1
9 2010 0 2009 2 1 1
10 2011 0 2009 3 2 1
11 2012 1 2009 3 3 0
12 2013 0 2009 15 4 0
13 2014 0 2009 12 5 0
14 2015 0 2009 13 6 0
15 2016 0 2009 13 7 0
16 2011 0 2009 3 2 1
17 2012 1 2009 3 3 0
18 2013 0 2009 18 4 0
19 2014 0 2009 12 5 0
20 2015 0 2009 13 6 0
.....
100 2009 0 2007 5 6 1
I want to identify (I am thinking of a dummy variable) the one percent within each year_of_life that has the most jobs, and then sum the jobs of those ids.
I tried something like this:
df.groupby(['year_of_life']).filter(lambda x: x.jobs > x.jobs.quantile(.99))['jobs'].sum()
but I get the following error:
TypeError: filter function returned a Series, but expected a scalar bool
Is this what you need?
df.loc[df.groupby(['year_of_life']).jobs.apply(lambda x : x>x.quantile(.99)).fillna(True),'jobs'].sum()
Out[193]: 102
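The TypeError comes from groupby.filter, which expects one True/False per group and keeps or drops whole groups; selecting individual rows needs a row-wise mask instead. One way to build such a mask, sketched here on made-up data, is to broadcast each group's 99th percentile back to its rows with transform. The column top_1pct is a hypothetical name for the dummy, and rows with a missing year_of_life are left out of the example. In recent pandas versions the result of groupby(...).apply may be indexed by the group key, in which case it no longer aligns with df, so transform can be the safer choice.

```python
import pandas as pd

# Made-up data: two year_of_life groups with three ids each
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "year_of_life": [0, 0, 0, 1, 1, 1],
    "jobs": [10, 12, 50, 3, 40, 5],
})

# 99th percentile of jobs within each year_of_life, repeated on every row of the group
threshold = df.groupby("year_of_life")["jobs"].transform(lambda s: s.quantile(0.99))

# Dummy: 1 if the row is above its group's threshold, else 0
df["top_1pct"] = (df["jobs"] > threshold).astype(int)

# Sum of jobs over the flagged rows
total = df.loc[df["top_1pct"] == 1, "jobs"].sum()

print(df)
print(total)
```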
