Calculate selection as percentage of total - spotfire

I need to let users select some (1+) categories and then calculate % of values in that category compared to all values in current filtering.
I could achieve to show marked vs total using subsets, however I can't write an expression across subsets.
e.g. in first column I would need to get result 6 / 20 = 0.3
Notes
I need a generic solution, not one based on structure of my sample dataset
Total should be total of current filtering, not total of all data
Sample data:
country
year
category
FR
2001
4
FR
2002
3
FR
2003
5
FR
2004
1
FR
2005
3
FR
2006
2
FR
2007
3
FR
2008
3
FR
2009
2
FR
2010
1
FR
2011
2
FR
2012
3
FR
2013
5
FR
2014
3
FR
2015
3
FR
2016
2
FR
2017
5
FR
2018
2
FR
2019
4
FR
2020
5
DE
2001
4
DE
2002
2
DE
2003
2
DE
2004
2
DE
2005
3
DE
2006
5
DE
2007
3
DE
2008
4
DE
2009
3
DE
2010
4
DE
2011
4
DE
2012
2
DE
2013
4
DE
2014
4
DE
2015
4
DE
2016
2
DE
2017
4
DE
2018
4
DE
2019
3
DE
2020
3
CH
2001
2
CH
2002
1
CH
2003
1
CH
2004
2
CH
2005
5
CH
2006
4
CH
2007
1
CH
2008
4
CH
2009
2
CH
2010
3
CH
2011
2
CH
2012
2
CH
2013
1
CH
2014
1
CH
2015
3
CH
2016
1
CH
2017
4
CH
2018
3
CH
2019
4
CH
2020
3
IT
2001
3
IT
2002
5
IT
2003
1
IT
2004
5
IT
2005
4
IT
2006
5
IT
2007
5
IT
2008
1
IT
2009
1
IT
2010
4
IT
2011
2
IT
2012
4
IT
2013
3
IT
2014
5
IT
2015
4
IT
2016
2
IT
2017
3
IT
2018
3
IT
2019
3
IT
2020
2
ES
2001
2
ES
2002
1
ES
2003
2
ES
2004
4
ES
2005
1
ES
2006
1
ES
2007
4
ES
2008
1
ES
2009
1
ES
2010
2
ES
2011
2
ES
2012
1
ES
2013
2
ES
2014
1
ES
2015
5
ES
2016
4
ES
2017
5
ES
2018
4
ES
2019
1
ES
2020
2

Related

Pyspark Multiple Filter Dataframe [duplicate]

This question already has answers here:
Multiple condition filter on dataframe
(2 answers)
Closed 2 years ago.
My input spark dataframe is;
Year Month Client
2018 1 1
2018 2 1
2018 3 1
2018 4 1
2018 5 1
2018 6 1
2018 7 1
2018 8 1
2018 9 1
2018 10 1
2018 11 1
2018 12 1
2019 1 1
2019 2 1
2019 3 1
2019 4 1
2019 5 1
2019 6 1
2019 7 1
2019 8 1
2019 9 1
2019 10 1
2019 11 1
2019 12 1
2018 1 2
2018 2 2
2018 3 2
2018 4 2
2018 5 2
2018 6 2
2018 7 2
2018 8 2
2018 9 2
2018 10 2
2018 11 2
2018 12 2
2019 1 2
2019 2 2
2019 3 2
2019 4 2
2019 5 2
2019 6 2
2019 7 2
2019 8 2
2019 9 2
2019 10 2
2019 11 2
2019 12 2
Dataframe is ordered by client, year and month. I want to extract the data after 2019-06 for each clients.
I shared the desired output according to the data above;
Year Month Client
2018 1 1
2018 2 1
2018 3 1
2018 4 1
2018 5 1
2018 6 1
2018 7 1
2018 8 1
2018 9 1
2018 10 1
2018 11 1
2018 12 1
2019 1 1
2019 2 1
2019 3 1
2019 4 1
2019 5 1
2019 6 1
2018 1 2
2018 2 2
2018 3 2
2018 4 2
2018 5 2
2018 6 2
2018 7 2
2018 8 2
2018 9 2
2018 10 2
2018 11 2
2018 12 2
2019 1 2
2019 2 2
2019 3 2
2019 4 2
2019 5 2
2019 6 2
Could you please help me about this?
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Did you mean before 2019-06? (you wrote after 2019-06)
If so, you can do a filter:
df2 = df.filter('Year < 2019 or (Year = 2019 and Month <= 6)')

How to return a matrix to a vector

Is there any way to return a matrix to a vector? I don't know the number of elements in the matrix, so let's say,matrix has n elements.
Below, it is an example of how I want to transform the table.
Any help, guidance, suggesting, recommendation will be very appreciated.
raw data.csv:
,January,February,March,April,May,June,July,August,September,October,November,December
2019,1,2,3,4,5,6,7,8,9,10,11,12
2018,13,14,15,16,17,18,19,20,21,22,23,24
2017,25,26,27,28,29,30,31,32,33,34,35,36
the link for csv files
raw=pd.read_csv('raw data.csv')
raw.head()
Unnamed: 0 January February March April May June July August September October November December
0 2019 1 2 3 4 5 6 7 8 9 10 11 12
1 2018 13 14 15 16 17 18 19 20 21 22 23 24
2 2017 25 26 27 28 29 30 31 32 33 34 35 36
final=pd.read_csv('Final.csv')
final.head(20)
Year&Month Value
0 2019 January 1
1 2019 February 2
2 2019 March 3
3 2019 April 4
4 2019 May 5
5 2019 June 6
6 2019 July 7
7 2019 August 8
8 2019 September 9
9 2019 October 10
10 2019 November 11
11 2019 December 12
12 2018 January 13
13 2018 February 14
14 2018 March 15
15 2018 April 16
16 2018 May 17
17 2018 June 18
18 2018 July 19
19 2018 August 20```
You can use pandas stack
df = pd.read_csv(r'raw data.csv')
df.set_index(df.columns[0]).stack().reset_index()
Out:
Unnamed: 0 level_1 0
0 2019 January 1
1 2019 February 2
2 2019 March 3
3 2019 April 4
4 2019 May 5
5 2019 June 6
6 2019 July 7
7 2019 August 8
8 2019 September 9
9 2019 October 10
10 2019 November 11
11 2019 December 12
12 2018 January 13
13 2018 February 14

want to assign id to duplicate rows

id name age year
0 khu 12 2018
1 she 21 2019
2 waqar 22 2015
3 khu 12 2018
4 she 21 2018
5 waqar 22 2015
want like this
id name age year
0 khu 12 2018
1 she 21 2019
2 waqar 22 2015
0 khu 12 2018
1 she 21 2018
2 waqar 22 2015
Use GroupBy.ngroup:
df['id'] = df.groupby('name', sort=False).ngroup()
#if need grouping by multiple columns for check duplicates
#df['id'] = df.groupby(['name','age'], sort=False).ngroup()
print (df)
id name age year
0 0 khu 12 2018
1 1 she 21 2019
2 2 waqar 22 2015
3 0 khu 12 2018
4 1 she 21 2018
5 2 waqar 22 2015
Using factorize as well you can check with category and cat.codes, or sklearn LabelEncoder
df['id']=pd.factorize(df['name'])[0]
df
Out[470]:
id name age year
0 0 khu 12 2018
1 1 she 21 2019
2 2 waqar 22 2015
3 0 khu 12 2018
4 1 she 21 2018
5 2 waqar 22 2015

condition after groupby: data science

i have a big df, this is a example to ilustrate my issue. I want to know from this dataframe whichs id by year_of_life are in the first percent in terms of jobs. I want to identify (i am thinking with a dummy) the one percent by years_of_life which has more jobs from the distribution.
for example
id year rap jobs_c jobs year_of_life rap_new
1 2009 0 300 10 NaN 0
2 2012 0 2012 12 0 0
3 2013 0 2012 12 1 1
4 2014 0 2012 13 2 1
5 2015 1 2012 15 3 1
6 2016 0 2012 17 4 0
7 2017 0 2012 19 5 0
8 2009 0 2009 15 0 1
9 2010 0 2009 2 1 1
10 2011 0 2009 3 2 1
11 2012 1 2009 3 3 0
12 2013 0 2009 15 4 0
13 2014 0 2009 12 5 0
14 2015 0 2009 13 6 0
15 2016 0 2009 13 7 0
16 2011 0 2009 3 2 1
17 2012 1 2009 3 3 0
18 2013 0 2009 18 4 0
19 2014 0 2009 12 5 0
20 2015 0 2009 13 6 0
.....
100 2009 0 2007 5 6 1
I want to identify (i am thinking with a dummy) the one percent by years_of_life which has more jobs from the distribution and then sum the jobs from those ids by year_of_life in the first percent
i try something like thi:
df.groupby(['year_of_life']).filter(lambda x : x.jobs>
x.jobs.quantile(.99))['jobs'].sum()
but i have the following error
TypeError: filter function returned a Series, but expected a scalar bool
Is this what you need ?
df.loc[df.groupby(['year_of_life']).jobs.apply(lambda x : x>x.quantile(.99)).fillna(True),'jobs'].sum()
Out[193]: 102

Automatically fill data from another sheet

Main Question
I would like to auto-fill Sheet A with values from Sheet B in Excel 2013. My data are in two sheets in the same workbook.
Example
=========== Sheet 1 =========== =========== Sheet 2 ===========
location year val1 val2 location year val1 val2
USA.VT 1999 USA.VT 1999 6 3
USA.VT 2000 USA.VT 2000 3 2
USA.VT 2001 USA.VT 2001 4 1
USA.VT 2002 USA.VT 2002 9 5
USA.NH 1999 USA.NH 1999 3 6
USA.NH 2000 USA.NH 2002 12 56
USA.NH 2001 USA.ME 1999 3 16
USA.NH 2002 USA.ME 2002 4 5
USA.ME 1999
USA.ME 2000
USA.ME 2001
USA.ME 2002
I would like to use some function or formula to automatically populate Sheet 1 based on the values in Sheet 2 according to: location, year, and the column (val1 or val2). Non-matches would be zero-filled.
This would result in the following:
=========== Sheet 1 ===========
location year val1 val2
USA.VT 1999 6 3
USA.VT 2000 3 2
USA.VT 2001 4 1
USA.VT 2002 9 5
USA.NH 1999 3 6
USA.NH 2000 0 0
USA.NH 2001 0 0
USA.NH 2002 12 56
USA.ME 1999 3 16
USA.ME 2000 0 0
USA.ME 2001 0 0
USA.ME 2002 4 5
I have tried VLOOKUP, INDEX, and MATCH, but I'm having no luck.
Any help would be greatly appreciated!
Put this Array formula in C2:
=IFERROR(INDEX(Sheet2!C$2:C$9,MATCH($A2&$B2,Sheet2!$A$2:$A$9&Sheet2!$B$2:$B$9,0)),0)
Being an array formula you must confirm with Ctrl-Shift-Enter to exit the edit mode instead of Enter.
Then copy over one column and down.
The picture is not exact because I left it on one sheet.

Resources