Get data that occurs in all specified years of a column in pandas

I need only those company names, with their year and estimated group size, that are present in all three years, e.g. 2019, 2020 and 2021:
year Company/Account Name EstimatedGroupSize
0 2019 Unknown 19550
1 2019 Mayo Clinic 7754
2 2019 Deloitte 6432
3 2019 Rizona State 5582
4 2019 Intel Corporation 4595
5 2020 Deloitte 4063
6 2020 Unknown 3490
7 2021 Unknown 3484
8 2020 Intel Corporation 3460
9 2021 Intel Corporation 3433
10 2021 Deloitte 3250
So my output should be:
year Company/Account Name EstimatedGroupSize
0 2019 Unknown 19550
2 2019 Deloitte 6432
4 2019 Intel Corporation 4595
5 2020 Deloitte 4063
6 2020 Unknown 3490
7 2021 Unknown 3484
8 2020 Intel Corporation 3460
9 2021 Intel Corporation 3433
10 2021 Deloitte 3250

Here is a solution: build a year-by-company count table with crosstab, keep only the companies that have at least one row in every year, and filter the original DataFrame with an inner merge:
#if you need to filter only some years first
df = df[df['year'].isin([2019, 2020, 2021])]
df1 = pd.crosstab(df['year'], df['Company/Account Name'])
df = df.merge(df1.loc[:, df1.gt(0).all()].stack().index.to_frame(index=False))
print (df)
year Company/Account Name EstimatedGroupSize
0 2019 Unknown 19550
1 2019 Deloitte 6432
2 2019 Intel Corporation 4595
3 2020 Deloitte 4063
4 2020 Unknown 3490
5 2021 Unknown 3484
6 2020 Intel Corporation 3460
7 2021 Intel Corporation 3433
8 2021 Deloitte 3250

IIUC,
years = [2019, 2020, 2021]
new_df = \
    df.loc[pd.get_dummies(df['year'])
             .groupby(df['Company/Account Name'])[years]
             .transform('sum')
             .gt(0)
             .all(axis=1)]
print(new_df)
year Company/Account Name EstimatedGroupSize
0 2019 Unknown 19550
2 2019 Deloitte 6432
4 2019 Intel Corporation 4595
5 2020 Deloitte 4063
6 2020 Unknown 3490
7 2021 Unknown 3484
8 2020 Intel Corporation 3460
9 2021 Intel Corporation 3433
10 2021 Deloitte 3250
Or:
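import numpy as np

years = [2019, 2020, 2021]
# keep only the groups (companies) whose 'year' values cover every requested year
new_df = \
    df.groupby('Company/Account Name')\
      .filter(lambda x: np.isin(years, x['year']).all())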
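For comparison, here is a minimal self-contained sketch of the same filter using a different technique from the answers above: counting distinct years per company with transform('nunique'). The DataFrame is rebuilt from the sample data in the question; the names in_scope and n_years are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "year": [2019, 2019, 2019, 2019, 2019, 2020, 2020, 2021, 2020, 2021, 2021],
    "Company/Account Name": [
        "Unknown", "Mayo Clinic", "Deloitte", "Rizona State", "Intel Corporation",
        "Deloitte", "Unknown", "Unknown", "Intel Corporation",
        "Intel Corporation", "Deloitte",
    ],
    "EstimatedGroupSize": [19550, 7754, 6432, 5582, 4595, 4063,
                           3490, 3484, 3460, 3433, 3250],
})

years = [2019, 2020, 2021]

# Restrict to the years of interest first, so other years cannot inflate the count
in_scope = df[df["year"].isin(years)]

# For every row, the number of distinct years in which its company appears
n_years = in_scope.groupby("Company/Account Name")["year"].transform("nunique")

# Keep rows of companies that appear in every requested year
new_df = in_scope[n_years == len(years)]
print(new_df)
```

Because the rows are selected with a boolean mask rather than a merge, the original index labels are preserved.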

Related

Is there any way to convert columns to rows using pandas?

I have an Excel file which contains data like this:
                                Prod              Work              Vacation
                    Year        2022  2022  2023  2022  2022  2023  2022
                    Month       10    11    12    10    11    12    10
Name    Business?   Exclusive?  Oct   Nov   Dec   Oct   Nov   Dec   Oct
Robert  Yes         No          100   100   100   150   150   150   1.1
Maria   No          Yes         75    75    50    25    25    25    1
and I want to convert this table into this form:
Name    Business?   Exclusive?  Year  Month  Prod  Work  Vacation
Robert  Yes         No          2022  Oct    100   150   1.1
Maria   No          Yes         2022  Nov    100   150   1
Robert  No          Yes         2023  Dec    100   150   1
Maria   No          Yes         2023  Dec    50    150   1
How can I do this with the Python pandas library? I have been struggling with this problem for many days. Please help!

Pyspark Multiple Filter Dataframe [duplicate]

My input Spark DataFrame is:
Year Month Client
2018 1 1
2018 2 1
2018 3 1
2018 4 1
2018 5 1
2018 6 1
2018 7 1
2018 8 1
2018 9 1
2018 10 1
2018 11 1
2018 12 1
2019 1 1
2019 2 1
2019 3 1
2019 4 1
2019 5 1
2019 6 1
2019 7 1
2019 8 1
2019 9 1
2019 10 1
2019 11 1
2019 12 1
2018 1 2
2018 2 2
2018 3 2
2018 4 2
2018 5 2
2018 6 2
2018 7 2
2018 8 2
2018 9 2
2018 10 2
2018 11 2
2018 12 2
2019 1 2
2019 2 2
2019 3 2
2019 4 2
2019 5 2
2019 6 2
2019 7 2
2019 8 2
2019 9 2
2019 10 2
2019 11 2
2019 12 2
The DataFrame is ordered by client, year and month. I want to extract the data after 2019-06 for each client.
The desired output for the data above is:
Year Month Client
2018 1 1
2018 2 1
2018 3 1
2018 4 1
2018 5 1
2018 6 1
2018 7 1
2018 8 1
2018 9 1
2018 10 1
2018 11 1
2018 12 1
2019 1 1
2019 2 1
2019 3 1
2019 4 1
2019 5 1
2019 6 1
2018 1 2
2018 2 2
2018 3 2
2018 4 2
2018 5 2
2018 6 2
2018 7 2
2018 8 2
2018 9 2
2018 10 2
2018 11 2
2018 12 2
2019 1 2
2019 2 2
2019 3 2
2019 4 2
2019 5 2
2019 6 2
Could you please help me with this?
Did you mean before 2019-06? (you wrote after 2019-06)
If so, you can do a filter:
df2 = df.filter('Year < 2019 or (Year = 2019 and Month <= 6)')

How to return a matrix to a vector

Is there any way to reshape a matrix into a vector? I don't know the number of elements in the matrix in advance, so let's say the matrix has n elements.
Below is an example of how I want to transform the table.
Any help, guidance, suggestion, or recommendation would be much appreciated.
raw data.csv:
,January,February,March,April,May,June,July,August,September,October,November,December
2019,1,2,3,4,5,6,7,8,9,10,11,12
2018,13,14,15,16,17,18,19,20,21,22,23,24
2017,25,26,27,28,29,30,31,32,33,34,35,36
the link for csv files
raw=pd.read_csv('raw data.csv')
raw.head()
Unnamed: 0 January February March April May June July August September October November December
0 2019 1 2 3 4 5 6 7 8 9 10 11 12
1 2018 13 14 15 16 17 18 19 20 21 22 23 24
2 2017 25 26 27 28 29 30 31 32 33 34 35 36
final=pd.read_csv('Final.csv')
final.head(20)
Year&Month Value
0 2019 January 1
1 2019 February 2
2 2019 March 3
3 2019 April 4
4 2019 May 5
5 2019 June 6
6 2019 July 7
7 2019 August 8
8 2019 September 9
9 2019 October 10
10 2019 November 11
11 2019 December 12
12 2018 January 13
13 2018 February 14
14 2018 March 15
15 2018 April 16
16 2018 May 17
17 2018 June 18
18 2018 July 19
19 2018 August 20
You can use pandas stack, after moving the year column into the index:
df = pd.read_csv(r'raw data.csv')
df.set_index(df.columns[0]).stack().reset_index()
Out:
Unnamed: 0 level_1 0
0 2019 January 1
1 2019 February 2
2 2019 March 3
3 2019 April 4
4 2019 May 5
5 2019 June 6
6 2019 July 7
7 2019 August 8
8 2019 September 9
9 2019 October 10
10 2019 November 11
11 2019 December 12
12 2018 January 13
13 2018 February 14
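The stacked result above still carries the default column names (Unnamed: 0, level_1, 0). A minimal sketch of one way to get from there to the Year&Month / Value layout shown in the question follows; it uses a shortened inline copy of the CSV (three months, two years) so that it runs on its own, and the intermediate names long_df and final are only illustrative.

```python
import io
import pandas as pd

# Shortened stand-in for 'raw data.csv' so the example is self-contained
csv_text = """,January,February,March
2019,1,2,3
2018,13,14,15
"""
raw = pd.read_csv(io.StringIO(csv_text))

# Same reshaping as above: year column into the index, months stacked into rows
long_df = raw.set_index(raw.columns[0]).stack().reset_index()
long_df.columns = ["Year", "Month", "Value"]

# Combine year and month into a single label column
long_df["Year&Month"] = long_df["Year"].astype(str) + " " + long_df["Month"]
final = long_df[["Year&Month", "Value"]]
print(final)
```

Since stack walks the table row by row, the values come out in the same order as in the desired output, and the code does not need to know the number of elements beforehand.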

Want to assign the same id to duplicate rows

id name age year
0 khu 12 2018
1 she 21 2019
2 waqar 22 2015
3 khu 12 2018
4 she 21 2018
5 waqar 22 2015
I want it like this:
id name age year
0 khu 12 2018
1 she 21 2019
2 waqar 22 2015
0 khu 12 2018
1 she 21 2018
2 waqar 22 2015
Use GroupBy.ngroup:
df['id'] = df.groupby('name', sort=False).ngroup()
#if you need to group by multiple columns to check for duplicates
#df['id'] = df.groupby(['name','age'], sort=False).ngroup()
print (df)
id name age year
0 0 khu 12 2018
1 1 she 21 2019
2 2 waqar 22 2015
3 0 khu 12 2018
4 1 she 21 2018
5 2 waqar 22 2015
You can also use factorize; alternatively, check the category dtype with cat.codes, or sklearn's LabelEncoder:
df['id']=pd.factorize(df['name'])[0]
df
Out[470]:
id name age year
0 0 khu 12 2018
1 1 she 21 2019
2 2 waqar 22 2015
3 0 khu 12 2018
4 1 she 21 2018
5 2 waqar 22 2015
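As a sketch of the cat.codes alternative mentioned above, rebuilt on the sample data from the question: note that category codes follow the sorted order of the categories, while factorize numbers values in order of first appearance. The two agree here only because khu, she, waqar already appear in alphabetical order. The small Series s is a made-up example to show the difference.

```python
import pandas as pd

# Sample data copied from the question (without the original id column)
df = pd.DataFrame({
    "name": ["khu", "she", "waqar", "khu", "she", "waqar"],
    "age": [12, 21, 22, 12, 21, 22],
    "year": [2018, 2019, 2015, 2018, 2018, 2015],
})

# Category codes: one integer per distinct name, assigned in sorted order
df["id"] = df["name"].astype("category").cat.codes
print(df)

# Where the two approaches differ: values that do not first appear in sorted order
s = pd.Series(["b", "a", "b"])
by_appearance = pd.factorize(s)[0].tolist()            # first-seen order
by_sorted = s.astype("category").cat.codes.tolist()    # sorted-category order
print(by_appearance, by_sorted)
```

If the ids must follow first appearance regardless of sort order, factorize or groupby(..., sort=False).ngroup() is the safer choice.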

How to get week numbers starting from Sunday

Date Day_name WeekNum
1/1/2016 Fri 1
1/2/2016 Sat 1
1/3/2016 Sun 2
1/4/2016 Mon 2
1/5/2016 Tue 2
1/6/2016 Wed 2
1/7/2016 Thu 2
1/8/2016 Fri 2
1/9/2016 Sat 2
1/10/2016 Sun 3
1/11/2016 Mon 3
1/12/2016 Tue 3
1/13/2016 Wed 3
1/14/2016 Thu 3
1/15/2016 Fri 3
1/16/2016 Sat 3
1/17/2016 Sun 4
1/18/2016 Mon 4
1/19/2016 Tue 4
1/20/2016 Wed 4
1/21/2016 Thu 4
1/22/2016 Fri 4
1/23/2016 Sat 4
1/24/2016 Sun 5
1/25/2016 Mon 5
1/26/2016 Tue 5
1/27/2016 Wed 5
1/28/2016 Thu 5
1/29/2016 Fri 5
1/30/2016 Sat 5
1/31/2016 Sun 6
2/1/2016 Mon 6
2/2/2016 Tue 6
2/3/2016 Wed 6
Here the week should start on Sunday and end on Saturday.
Assuming your date is in A1, the formulas for B1 and C1 are as follows:
Day_name is =TEXT(A1,"ddd")
WeekNum starting on Sunday is =WEEKNUM(A1,1)
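If the dates live in pandas rather than in a spreadsheet, the same numbering can be sketched in Python. This is an assumption-laden equivalent, not part of the answer above: the helper week_num_sunday is a hypothetical name, and it follows the convention visible in the table (the week containing January 1 is week 1, and a new week starts every Sunday).

```python
from datetime import date

import pandas as pd


def week_num_sunday(d: date) -> int:
    """Week number where week 1 contains Jan 1 and weeks start on Sunday."""
    jan1 = date(d.year, 1, 1)
    # Days between the Sunday on or before Jan 1 and Jan 1 itself
    # (weekday(): Monday=0 ... Sunday=6, so Sunday maps to 0 here)
    offset = (jan1.weekday() + 1) % 7
    day_of_year = d.timetuple().tm_yday
    return (day_of_year + offset - 1) // 7 + 1


# A few dates taken from the table in the question
dates = pd.to_datetime(
    pd.Series(["1/1/2016", "1/2/2016", "1/3/2016", "1/31/2016", "2/3/2016"]),
    format="%m/%d/%Y",
)
week_nums = dates.map(week_num_sunday)
print(week_nums.tolist())
```

The function works on plain datetime.date values as well as on the pandas Timestamp objects produced by to_datetime.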
