Pyspark Multiple Filter Dataframe [duplicate] - apache-spark

This question already has answers here:
Multiple condition filter on dataframe
(2 answers)
Closed 2 years ago.
My input spark dataframe is;
Year Month Client
2018 1 1
2018 2 1
2018 3 1
2018 4 1
2018 5 1
2018 6 1
2018 7 1
2018 8 1
2018 9 1
2018 10 1
2018 11 1
2018 12 1
2019 1 1
2019 2 1
2019 3 1
2019 4 1
2019 5 1
2019 6 1
2019 7 1
2019 8 1
2019 9 1
2019 10 1
2019 11 1
2019 12 1
2018 1 2
2018 2 2
2018 3 2
2018 4 2
2018 5 2
2018 6 2
2018 7 2
2018 8 2
2018 9 2
2018 10 2
2018 11 2
2018 12 2
2019 1 2
2019 2 2
2019 3 2
2019 4 2
2019 5 2
2019 6 2
2019 7 2
2019 8 2
2019 9 2
2019 10 2
2019 11 2
2019 12 2
The dataframe is ordered by client, year, and month. I want to extract the data after 2019-06 for each client.
The desired output, given the data above, is:
Year Month Client
2018 1 1
2018 2 1
2018 3 1
2018 4 1
2018 5 1
2018 6 1
2018 7 1
2018 8 1
2018 9 1
2018 10 1
2018 11 1
2018 12 1
2019 1 1
2019 2 1
2019 3 1
2019 4 1
2019 5 1
2019 6 1
2018 1 2
2018 2 2
2018 3 2
2018 4 2
2018 5 2
2018 6 2
2018 7 2
2018 8 2
2018 9 2
2018 10 2
2018 11 2
2018 12 2
2019 1 2
2019 2 2
2019 3 2
2019 4 2
2019 5 2
2019 6 2
Could you please help me with this?

Did you mean before 2019-06? (you wrote after 2019-06)
If so, you can do a filter:
# keep all rows up to and including 2019-06
df2 = df.filter('Year < 2019 or (Year = 2019 and Month <= 6)')
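If you prefer column expressions to a SQL string, a minimal equivalent sketch (assuming df is the dataframe shown above, with integer Year and Month columns):

from pyspark.sql import functions as F

# same condition written with column expressions instead of a SQL string
df2 = df.filter(
    (F.col('Year') < 2019) |
    ((F.col('Year') == 2019) & (F.col('Month') <= 6))
)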

Related

How to reshape a matrix into a vector

Is there any way to reshape a matrix into a vector? I don't know the number of elements in the matrix, so let's say the matrix has n elements.
Below is an example of how I want to transform the table.
Any help, guidance, suggestions, or recommendations will be greatly appreciated.
raw data.csv:
,January,February,March,April,May,June,July,August,September,October,November,December
2019,1,2,3,4,5,6,7,8,9,10,11,12
2018,13,14,15,16,17,18,19,20,21,22,23,24
2017,25,26,27,28,29,30,31,32,33,34,35,36
the link for csv files
raw=pd.read_csv('raw data.csv')
raw.head()
Unnamed: 0 January February March April May June July August September October November December
0 2019 1 2 3 4 5 6 7 8 9 10 11 12
1 2018 13 14 15 16 17 18 19 20 21 22 23 24
2 2017 25 26 27 28 29 30 31 32 33 34 35 36
final=pd.read_csv('Final.csv')
final.head(20)
Year&Month Value
0 2019 January 1
1 2019 February 2
2 2019 March 3
3 2019 April 4
4 2019 May 5
5 2019 June 6
6 2019 July 7
7 2019 August 8
8 2019 September 9
9 2019 October 10
10 2019 November 11
11 2019 December 12
12 2018 January 13
13 2018 February 14
14 2018 March 15
15 2018 April 16
16 2018 May 17
17 2018 June 18
18 2018 July 19
19 2018 August 20
You can use pandas stack:
import pandas as pd

# set the year as index, stack the month columns into rows, then flatten
df = pd.read_csv(r'raw data.csv')
df.set_index(df.columns[0]).stack().reset_index()
Out:
Unnamed: 0 level_1 0
0 2019 January 1
1 2019 February 2
2 2019 March 3
3 2019 April 4
4 2019 May 5
5 2019 June 6
6 2019 July 7
7 2019 August 8
8 2019 September 9
9 2019 October 10
10 2019 November 11
11 2019 December 12
12 2018 January 13
13 2018 February 14
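The stacked output comes back with placeholder column names (Unnamed: 0, level_1, 0). A small follow-up sketch, assuming you want the Year&Month / Value layout from Final.csv, that combines and renames those columns:

out = df.set_index(df.columns[0]).stack().reset_index()
out.columns = ['Year', 'Month', 'Value']  # rename the placeholder columns
out['Year&Month'] = out['Year'].astype(str) + ' ' + out['Month']
print(out[['Year&Month', 'Value']].head())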

Replace values from one column in dataframe

import pandas as pd
import numpy as np
import ast
pd.options.display.max_columns = 20
I have dataframe column season that looks like this (first 20 entries):
season
0 2006-07
1 2007-08
2 2008-09
3 2009-10
4 2010-11
5 2011-12
6 2012-13
7 2013-14
8 2014-15
9 2015-16
10 2016-17
11 2017-18
12 2018-19
13 Career
14 season
15 2018-19
16 Career
17 season
18 2017-18
19 2018-19
It starts with season and ends with Career. I want to replace the years with numbers starting at 1 and ending whenever Career appears. I want it to look like this:
season
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 Career
14 season
15 1
16 Career
17 season
18 1
19 2
So the counting should reset every time season appears in the column and stop at every Career.
Create consecutive groups by comparing a mask built with Series.isin against its shifted values, then use GroupBy.cumcount for the counter:
# mask marking the 'Career' / 'season' separator rows
s = df['season'].isin(['Career', 'season'])
# keep separator rows as-is; number everything else within its group
df['new'] = np.where(s, df['season'], df.groupby(s.ne(s.shift()).cumsum()).cumcount() + 1)
print(df)
season new
0 2006-07 1
1 2007-08 2
2 2008-09 3
3 2009-10 4
4 2010-11 5
5 2011-12 6
6 2012-13 7
7 2013-14 8
8 2014-15 9
9 2015-16 10
10 2016-17 11
11 2017-18 12
12 2018-19 13
13 Career Career
14 season season
15 2018-19 1
16 Career Career
17 season season
18 2017-18 1
19 2018-19 2
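To see why this works, a minimal sketch (on hypothetical toy data) that prints the intermediate group ids produced by s.ne(s.shift()).cumsum():

import pandas as pd

toy = pd.DataFrame({'season': ['2006-07', '2007-08', 'Career', 'season',
                               '2018-19', 'Career', 'season', '2017-18', '2018-19']})
s = toy['season'].isin(['Career', 'season'])
# every flip between separator rows and data rows starts a new group id
print(s.ne(s.shift()).cumsum())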
To replace the season column in place:
s = df['season'].isin(['Career', 'season'])
# number only the non-separator rows, leaving 'Career' and 'season' intact
df.loc[~s, 'season'] = df.groupby(s.ne(s.shift()).cumsum()).cumcount() + 1
print(df)
season
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 Career
14 season
15 1
16 Career
17 season
18 1
19 2

want to assign id to duplicate rows

id name age year
0 khu 12 2018
1 she 21 2019
2 waqar 22 2015
3 khu 12 2018
4 she 21 2018
5 waqar 22 2015
I want it like this:
id name age year
0 khu 12 2018
1 she 21 2019
2 waqar 22 2015
0 khu 12 2018
1 she 21 2018
2 waqar 22 2015
Use GroupBy.ngroup:
df['id'] = df.groupby('name', sort=False).ngroup()
#if need grouping by multiple columns for check duplicates
#df['id'] = df.groupby(['name','age'], sort=False).ngroup()
print(df)
id name age year
0 0 khu 12 2018
1 1 she 21 2019
2 2 waqar 22 2015
3 0 khu 12 2018
4 1 she 21 2018
5 2 waqar 22 2015
factorize works as well; you can also get the same codes from the category dtype via cat.codes, or from sklearn's LabelEncoder:
df['id'] = pd.factorize(df['name'])[0]
df
Out[470]:
id name age year
0 0 khu 12 2018
1 1 she 21 2019
2 2 waqar 22 2015
3 0 khu 12 2018
4 1 she 21 2018
5 2 waqar 22 2015
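The two alternatives mentioned above, sketched briefly (the second assumes scikit-learn is installed); note that unlike ngroup(sort=False) and factorize, both assign codes by sorted label order rather than order of appearance:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'name': ['khu', 'she', 'waqar', 'khu', 'she', 'waqar']})

# category dtype: integer codes taken from the sorted categories
df['id_cat'] = df['name'].astype('category').cat.codes
# sklearn LabelEncoder: integer codes, also by sorted label order
df['id_le'] = LabelEncoder().fit_transform(df['name'])
print(df)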

condition after groupby: data science

I have a big df; this is an example to illustrate my issue. I want to know, from this dataframe, which ids by year_of_life are in the top one percent in terms of jobs. I want to identify (I am thinking with a dummy) the one percent by year_of_life which has the most jobs in the distribution.
For example:
id year rap jobs_c jobs year_of_life rap_new
1 2009 0 300 10 NaN 0
2 2012 0 2012 12 0 0
3 2013 0 2012 12 1 1
4 2014 0 2012 13 2 1
5 2015 1 2012 15 3 1
6 2016 0 2012 17 4 0
7 2017 0 2012 19 5 0
8 2009 0 2009 15 0 1
9 2010 0 2009 2 1 1
10 2011 0 2009 3 2 1
11 2012 1 2009 3 3 0
12 2013 0 2009 15 4 0
13 2014 0 2009 12 5 0
14 2015 0 2009 13 6 0
15 2016 0 2009 13 7 0
16 2011 0 2009 3 2 1
17 2012 1 2009 3 3 0
18 2013 0 2009 18 4 0
19 2014 0 2009 12 5 0
20 2015 0 2009 13 6 0
.....
100 2009 0 2007 5 6 1
I want to identify (I am thinking with a dummy) the one percent by year_of_life which has the most jobs in the distribution, and then sum the jobs from those ids by year_of_life in the top percent.
I tried something like this:
df.groupby(['year_of_life']).filter(lambda x: x.jobs > x.jobs.quantile(.99))['jobs'].sum()
but I get the following error:
TypeError: filter function returned a Series, but expected a scalar bool
Is this what you need?
df.loc[df.groupby(['year_of_life']).jobs.apply(lambda x: x > x.quantile(.99)).fillna(True), 'jobs'].sum()
Out[193]: 102
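An equivalent sketch using transform broadcasts each group's 99th percentile back onto the rows; it also shows why filter failed above, since filter expects one boolean per group rather than a Series per row:

# per-row threshold: each group's 99th percentile of jobs
threshold = df.groupby('year_of_life')['jobs'].transform(lambda x: x.quantile(.99))
# rows with NaN year_of_life fall outside every group; keep them, as fillna(True) did
df.loc[(df['jobs'] > threshold) | threshold.isna(), 'jobs'].sum()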

Excel formulas for range criteria date, arranged in columns

I want to write a formula for a large data chart. The criteria I have to match are on both rows and columns.
I attach the data with the manually calculated results.
|PRODUCT|01-feb|02-feb|03-feb|04-feb|05-feb|06-feb|07-feb|08-feb|09-feb|10-feb|11-feb|12-feb|
|PRODUCT 1|4|3|1|5|2|9|1|3|5|8|0|5|
|PRODUCT 3|2|5|7|4|4|8|3|5|7|4|4|8|
|PRODUCT 1|1|0|5|3|1|1|8|0|5|3|1|1|
|PRODUCT 2|5|4|6|6|0|7|4|4|6|6|0|7|
|PRODUCT 5|8|7|8|7|1|9|2|7|8|7|1|9|
|PRODUCT 4|4|2|9|3|5|1|7|2|9|3|5|1|
|PRODUCT 1|9|8|1|4|4|6|5|8|1|4|4|6|
|PRODUCT 2|6|4|4|7|2|8|6|4|4|7|2|8|
|PRODUCT 5|2|6|1|8|3|9|3|6|1|8|3|9|
|PRODUCT 3|3|9|5|1|7|4|7|9|5|1|7|4|
|PRODUCT 4|7|6|5|5|8|2|1|6|5|5|8|2|
The compact chart that I have to get:
|PRODUCT|04-feb|08-feb|12-feb|
|PRODUCT 1|44|48|43|
|PRODUCT 2|42|35|40|
|PRODUCT 3|36|47|40|
|PRODUCT 4|41|32|38|
|PRODUCT 5|47|40|46|
The formula that should work is:
=SUMAR.SI.CONJUNTO(C5:N15,B5:B15,H20,C4:N4,"=<"&J19)
because I want to sum the range of dates from 01-feb to 04-feb of the first chart into the new 04-feb column.
Please, help me.
The following might help you. The formula in the upper-left cell of the summary table is
{=SUM((($B$1:$M$1<=B$14)*($B$1:$M$1>=A$14)*$B$2:$M$13)*($A15=$A$2:$A$13))}
and can be copied over to the other cells. The 31. Jan in the summary table is used as a "helper cell", so that you don't have to alter the formula for the different cells.
Product 01. Feb 02. Feb 03. Feb 04. Feb 05. Feb 06. Feb 07. Feb 08. Feb 09. Feb 10. Feb 11. Feb 12. Feb
Product1 5 2 3 3 5 5 3 3 5 3 3 5
Product3 5 4 2 4 5 1 5 3 3 5 3 3
Product4 3 1 2 2 4 5 5 1 5 5 1 5
Product1 4 1 4 3 4 1 4 1 3 4 1 3
Product3 1 2 2 4 5 2 5 1 1 5 1 1
Product4 3 2 4 1 1 4 3 5 2 3 5 2
Product1 4 3 5 1 1 1 2 2 2 2 2 2
Product3 3 2 4 3 5 1 1 1 4 1 1 4
Product4 2 1 4 2 2 1 4 4 3 4 4 3
Product1 4 5 5 2 3 4 3 4 5 3 4 5
Product3 4 2 3 1 4 1 1 3 1 1 3 1
Product4 3 5 3 3 1 4 1 1 3 1 1 3
31. Jan 04. Feb 08. Feb 12. Feb
Product1 54 55 62
Product2 0 0 0
Product3 46 56 46
Product4 41 54 61
Product5 0 0 0
You can use SUMPRODUCT for this. B2:E12 is the range of data for Feb 1 through Feb 4, and O2 holds the criterion you are searching for; in my case O2 was Product 1. When you want the range through Feb 8, just change B2:E12 to the range corresponding to Feb 5 to Feb 8.
=SUMPRODUCT(B2:E12*(A2:A12=O2))
