Best practice for converting a less-than symbol into an integer - statistics

I have an anonymised dataset that mainly contains integer values, but a number of values are expressed as <5 for anonymity purposes (the true values fall between 1 and 4). I would like to express them as integers.
I don't want to remove them or set them to 0 if possible, as they still count as at least one instance.
I thought about using the range of possible values (1, 2, 3, 4) and replacing them with 3, but I cannot find any information on this particular topic. I have found information on null values etc., but since the values lie within the range 1-4, is there any way to distribute them evenly?
Thank you!
Example of some of the data:
age year value
20-24 2000 <5
20-24 2001 <5
20-24 2002 <5
20-24 2003 <5
20-24 2004 <5
20-24 2005 <5
20-24 2006 <5
20-24 2007 <5
20-24 2008 <5
20-24 2009 17
20-24 2010 0
25-29 2000 5
25-29 2001 0
25-29 2002 14
25-29 2003 12
25-29 2004 14
25-29 2005 13
25-29 2006 17
25-29 2007 9
25-29 2008 13
25-29 2009 17
25-29 2010 17
30-34 2000 41
30-34 2001 46
30-34 2002 47
30-34 2003 30
30-34 2004 58
30-34 2005 34
30-34 2006 41
30-34 2007 37
30-34 2008 38
30-34 2009 49
30-34 2010 46
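
The two options mentioned in the question (a fixed replacement, or spreading the suppressed values evenly over 1-4) could be sketched roughly as follows. This is only an illustration, not an established best practice: it assumes the data is in a pandas DataFrame, uses a handful of rows from the example, and the column names `value_fixed` and `value_random` and the seed are made up here.

```python
import numpy as np
import pandas as pd

# A few rows from the example data; "<5" marks a suppressed count (1-4).
df = pd.DataFrame({
    "age": ["20-24"] * 4,
    "year": [2007, 2008, 2009, 2010],
    "value": ["<5", "<5", "17", "0"],
})

suppressed = df["value"].eq("<5")
# Non-numeric entries such as "<5" become NaN here.
numeric = pd.to_numeric(df["value"], errors="coerce")

# Option 1: replace every suppressed value with one fixed number (3, as in the question).
df["value_fixed"] = numeric.where(~suppressed, 3).astype(int)

# Option 2: draw each suppressed value uniformly from {1, 2, 3, 4}.
# The seed is arbitrary and only makes the draw reproducible.
rng = np.random.default_rng(seed=0)
draws = rng.integers(1, 5, size=len(df))  # upper bound is exclusive
df["value_random"] = numeric.where(~suppressed, draws).astype(int)
```

Either column keeps the known counts untouched; which option is appropriate depends on how the imputed values are used afterwards.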

Related

Convert string date column with format of ordinal numeral day, abbreviated month name, and normal year to %Y-%m-%d

Given the following df with string date column with ordinal numbers for day, abbreviated month name for month, and normal year:
date oil gas
0 1st Oct 2021 428 99
1 10th Sep 2021 401 101
2 2nd Oct 2020 189 74
3 10th Jan 2020 659 119
4 1st Nov 2019 691 130
5 30th Aug 2019 742 162
6 10th May 2019 805 183
7 24th Aug 2018 860 182
8 1st Sep 2017 759 183
9 10th Mar 2017 617 151
10 10th Feb 2017 591 149
11 22nd Apr 2016 343 88
12 10th Apr 2015 760 225
13 23rd Jan 2015 1317 316
How can I parse the date column into the standard %Y-%m-%d format?
My ideas so far: 1. strip the ordinal indicators ('st', 'nd', 'rd', 'th') from the day string with re while keeping the day number; 2. convert the abbreviated month name to a number (apparently not with %b); 3. finally, convert the result to %Y-%m-%d.
Code may be useful for the first step:
df['date'].str.replace(r"(?<=\d)(st|nd|rd|th)", "", regex=True)
References:
https://metacpan.org/release/DROLSKY/DateTime-Locale-0.46/view/lib/DateTime/Locale/en_US.pm#Months
pd.to_datetime already handles this case if you don't specify the format parameter:
>>> pd.to_datetime(df['date'])
0 2021-10-01
1 2021-09-10
2 2020-10-02
3 2020-01-10
4 2019-11-01
5 2019-08-30
6 2019-05-10
7 2018-08-24
8 2017-09-01
9 2017-03-10
10 2017-02-10
11 2016-04-22
12 2015-04-10
13 2015-01-23
Name: date, dtype: datetime64[ns]
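
If you would rather not rely on format inference, the steps from the question can also be spelled out explicitly. A minimal sketch, assuming English month abbreviations and a small subset of the rows above (the `date_iso` column name is only for illustration):

```python
import pandas as pd

# A few of the dates from the question.
df = pd.DataFrame({
    "date": ["1st Oct 2021", "10th Sep 2021", "2nd Oct 2020", "23rd Jan 2015"],
})

# Step 1: drop the ordinal suffix that directly follows a digit.
cleaned = df["date"].str.replace(r"(?<=\d)(st|nd|rd|th)", "", regex=True)

# Steps 2 and 3: parse day / abbreviated month / year, then format as %Y-%m-%d.
df["date_iso"] = pd.to_datetime(cleaned, format="%d %b %Y").dt.strftime("%Y-%m-%d")
```

Note that `date_iso` holds strings; drop the `.dt.strftime(...)` call to keep a datetime64 column instead.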

Calculate selection as percentage of total

I need to let users select one or more categories and then calculate the percentage of values in the selected categories relative to all values in the current filtering.
I managed to show marked vs. total using subsets, but I can't write an expression across subsets.
E.g. in the first column I would need to get the result 6 / 20 = 0.3.
Notes
I need a generic solution, not one based on the structure of my sample dataset.
The total should be the total of the current filtering, not the total of all data.
Sample data:
country year category
FR 2001 4
FR 2002 3
FR 2003 5
FR 2004 1
FR 2005 3
FR 2006 2
FR 2007 3
FR 2008 3
FR 2009 2
FR 2010 1
FR 2011 2
FR 2012 3
FR 2013 5
FR 2014 3
FR 2015 3
FR 2016 2
FR 2017 5
FR 2018 2
FR 2019 4
FR 2020 5
DE 2001 4
DE 2002 2
DE 2003 2
DE 2004 2
DE 2005 3
DE 2006 5
DE 2007 3
DE 2008 4
DE 2009 3
DE 2010 4
DE 2011 4
DE 2012 2
DE 2013 4
DE 2014 4
DE 2015 4
DE 2016 2
DE 2017 4
DE 2018 4
DE 2019 3
DE 2020 3
CH 2001 2
CH 2002 1
CH 2003 1
CH 2004 2
CH 2005 5
CH 2006 4
CH 2007 1
CH 2008 4
CH 2009 2
CH 2010 3
CH 2011 2
CH 2012 2
CH 2013 1
CH 2014 1
CH 2015 3
CH 2016 1
CH 2017 4
CH 2018 3
CH 2019 4
CH 2020 3
IT 2001 3
IT 2002 5
IT 2003 1
IT 2004 5
IT 2005 4
IT 2006 5
IT 2007 5
IT 2008 1
IT 2009 1
IT 2010 4
IT 2011 2
IT 2012 4
IT 2013 3
IT 2014 5
IT 2015 4
IT 2016 2
IT 2017 3
IT 2018 3
IT 2019 3
IT 2020 2
ES 2001 2
ES 2002 1
ES 2003 2
ES 2004 4
ES 2005 1
ES 2006 1
ES 2007 4
ES 2008 1
ES 2009 1
ES 2010 2
ES 2011 2
ES 2012 1
ES 2013 2
ES 2014 1
ES 2015 5
ES 2016 4
ES 2017 5
ES 2018 4
ES 2019 1
ES 2020 2
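
The tool is not named in the question, so the following is only a tool-agnostic illustration of the arithmetic being asked for, written in pandas: count the rows in the selected categories and divide by the number of rows that survive the current filter. The function name `selection_share`, its parameters, and the small data subset are assumptions made for this sketch.

```python
import pandas as pd

def selection_share(df, filter_mask, selected_categories):
    """Share of filtered rows whose category is among the selected ones."""
    filtered = df[filter_mask]
    if filtered.empty:
        return float("nan")  # nothing left after filtering
    # Mean of a boolean column = selected count / filtered total.
    return filtered["category"].isin(selected_categories).mean()

# First five years of FR and DE from the sample data.
df = pd.DataFrame({
    "country": ["FR"] * 5 + ["DE"] * 5,
    "year": [2001, 2002, 2003, 2004, 2005] * 2,
    "category": [4, 3, 5, 1, 3, 4, 2, 2, 2, 3],
})

# Filter to FR only, select category 3: 2 of 5 rows.
share_fr = selection_share(df, df["country"].eq("FR"), [3])

# No filtering, select categories 3 and 4: 5 of 10 rows.
share_all = selection_share(df, pd.Series(True, index=df.index), [3, 4])
```

The denominator always comes from the filtered rows, so the result follows the current filtering rather than the full dataset.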

How to return a matrix to a vector

Is there any way to reshape a matrix into a vector? I don't know the number of elements in the matrix, so let's say the matrix has n elements.
Below is an example of how I want to transform the table.
Any help, guidance, or suggestions would be much appreciated.
raw data.csv:
,January,February,March,April,May,June,July,August,September,October,November,December
2019,1,2,3,4,5,6,7,8,9,10,11,12
2018,13,14,15,16,17,18,19,20,21,22,23,24
2017,25,26,27,28,29,30,31,32,33,34,35,36
raw=pd.read_csv('raw data.csv')
raw.head()
Unnamed: 0 January February March April May June July August September October November December
0 2019 1 2 3 4 5 6 7 8 9 10 11 12
1 2018 13 14 15 16 17 18 19 20 21 22 23 24
2 2017 25 26 27 28 29 30 31 32 33 34 35 36
final=pd.read_csv('Final.csv')
final.head(20)
Year&Month Value
0 2019 January 1
1 2019 February 2
2 2019 March 3
3 2019 April 4
4 2019 May 5
5 2019 June 6
6 2019 July 7
7 2019 August 8
8 2019 September 9
9 2019 October 10
10 2019 November 11
11 2019 December 12
12 2018 January 13
13 2018 February 14
14 2018 March 15
15 2018 April 16
16 2018 May 17
17 2018 June 18
18 2018 July 19
19 2018 August 20
You can use pandas stack
df = pd.read_csv(r'raw data.csv')
df.set_index(df.columns[0]).stack().reset_index()
Out:
Unnamed: 0 level_1 0
0 2019 January 1
1 2019 February 2
2 2019 March 3
3 2019 April 4
4 2019 May 5
5 2019 June 6
6 2019 July 7
7 2019 August 8
8 2019 September 9
9 2019 October 10
10 2019 November 11
11 2019 December 12
12 2018 January 13
13 2018 February 14
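
To get the exact two-column layout of Final.csv (a combined "Year&Month" label plus "Value"), one alternative sketch is to flatten the values row by row and build the labels from the index and the column names. It assumes the first CSV column holds the years; the inline CSV text below is a shortened stand-in for raw data.csv.

```python
import io
import pandas as pd

# Shortened stand-in for 'raw data.csv' (three months, two years).
csv_text = """,January,February,March
2019,1,2,3
2018,13,14,15
"""

# Use the first (unnamed) column as the index so only the values remain.
raw = pd.read_csv(io.StringIO(csv_text), index_col=0)

# ravel() flattens row by row, so the values come out year by year.
values = raw.to_numpy().ravel()

# Build the labels in the same row-major order.
labels = [f"{year} {month}" for year in raw.index for month in raw.columns]

final = pd.DataFrame({"Year&Month": labels, "Value": values})
```

This works for any number of rows and columns, since nothing depends on the matrix size.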

condition after groupby: data science

I have a big df; this is an example to illustrate my issue. For each year_of_life, I want to know which ids are in the top one percent in terms of jobs. I am thinking of flagging them with a dummy variable.
For example:
id year rap jobs_c jobs year_of_life rap_new
1 2009 0 300 10 NaN 0
2 2012 0 2012 12 0 0
3 2013 0 2012 12 1 1
4 2014 0 2012 13 2 1
5 2015 1 2012 15 3 1
6 2016 0 2012 17 4 0
7 2017 0 2012 19 5 0
8 2009 0 2009 15 0 1
9 2010 0 2009 2 1 1
10 2011 0 2009 3 2 1
11 2012 1 2009 3 3 0
12 2013 0 2009 15 4 0
13 2014 0 2009 12 5 0
14 2015 0 2009 13 6 0
15 2016 0 2009 13 7 0
16 2011 0 2009 3 2 1
17 2012 1 2009 3 3 0
18 2013 0 2009 18 4 0
19 2014 0 2009 12 5 0
20 2015 0 2009 13 6 0
.....
100 2009 0 2007 5 6 1
I want to identify (I am thinking of a dummy variable) the top one percent of the jobs distribution within each year_of_life, and then sum the jobs of those ids by year_of_life.
I tried something like this:
df.groupby(['year_of_life']).filter(lambda x : x.jobs>
x.jobs.quantile(.99))['jobs'].sum()
but I get the following error:
TypeError: filter function returned a Series, but expected a scalar bool
Is this what you need?
df.loc[df.groupby(['year_of_life']).jobs.apply(lambda x : x>x.quantile(.99)).fillna(True),'jobs'].sum()
Out[193]: 102
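
On the error itself: `filter` expects the function to return a single True/False per group (keep or drop the whole group), so a row-wise comparison cannot be used there. A row-wise flag can also be built with `transform`, which returns a result aligned with the original index. The sketch below is one possible way to do that, with made-up toy data and a hypothetical `top_1pct` column name; rows with a missing year_of_life are not treated specially here.

```python
import pandas as pd

# Toy data: two year_of_life groups with four ids each.
df = pd.DataFrame({
    "id": range(1, 9),
    "year_of_life": [0, 0, 0, 0, 1, 1, 1, 1],
    "jobs": [12, 15, 3, 40, 2, 3, 18, 5],
})

# 99th percentile of jobs within each year_of_life, broadcast back to every row.
threshold = df.groupby("year_of_life")["jobs"].transform(lambda s: s.quantile(0.99))

# Dummy variable: 1 if the row is above its group's 99th percentile.
df["top_1pct"] = (df["jobs"] > threshold).astype(int)

# Sum of jobs of the flagged ids, per year_of_life.
total_by_year = df.loc[df["top_1pct"].eq(1)].groupby("year_of_life")["jobs"].sum()

# Overall sum across all flagged rows.
total = df.loc[df["top_1pct"].eq(1), "jobs"].sum()
```

Keeping the dummy as a column makes it easy to reuse the flag for other aggregations.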

Automatically fill data from another sheet

Main Question
I would like to auto-fill Sheet A with values from Sheet B in Excel 2013. My data are in two sheets in the same workbook.
Example
=========== Sheet 1 =========== =========== Sheet 2 ===========
location year val1 val2 location year val1 val2
USA.VT 1999 USA.VT 1999 6 3
USA.VT 2000 USA.VT 2000 3 2
USA.VT 2001 USA.VT 2001 4 1
USA.VT 2002 USA.VT 2002 9 5
USA.NH 1999 USA.NH 1999 3 6
USA.NH 2000 USA.NH 2002 12 56
USA.NH 2001 USA.ME 1999 3 16
USA.NH 2002 USA.ME 2002 4 5
USA.ME 1999
USA.ME 2000
USA.ME 2001
USA.ME 2002
I would like to use some function or formula to automatically populate Sheet 1 based on the values in Sheet 2 according to: location, year, and the column (val1 or val2). Non-matches would be zero-filled.
This would result in the following:
=========== Sheet 1 ===========
location year val1 val2
USA.VT 1999 6 3
USA.VT 2000 3 2
USA.VT 2001 4 1
USA.VT 2002 9 5
USA.NH 1999 3 6
USA.NH 2000 0 0
USA.NH 2001 0 0
USA.NH 2002 12 56
USA.ME 1999 3 16
USA.ME 2000 0 0
USA.ME 2001 0 0
USA.ME 2002 4 5
I have tried VLOOKUP, INDEX, and MATCH, but I'm having no luck.
Any help would be greatly appreciated!
Put this Array formula in C2:
=IFERROR(INDEX(Sheet2!C$2:C$9,MATCH($A2&$B2,Sheet2!$A$2:$A$9&Sheet2!$B$2:$B$9,0)),0)
Because this is an array formula, you must confirm it with Ctrl-Shift-Enter instead of Enter when exiting edit mode.
Then copy over one column and down.
The picture is not exact because I left it on one sheet.
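
For comparison only, and outside Excel: the same lookup logic (match on location and year together, fall back to 0 when there is no match) can be sketched in pandas as a left merge. The two frames below are hypothetical stand-ins for Sheet 1 and Sheet 2, limited to the USA.NH rows.

```python
import pandas as pd

# Stand-in for Sheet 1: the keys to fill in.
sheet1 = pd.DataFrame({
    "location": ["USA.NH"] * 4,
    "year": [1999, 2000, 2001, 2002],
})

# Stand-in for Sheet 2: the rows that actually have values.
sheet2 = pd.DataFrame({
    "location": ["USA.NH", "USA.NH"],
    "year": [1999, 2002],
    "val1": [3, 12],
    "val2": [6, 56],
})

# Left merge keeps every Sheet 1 row; unmatched rows get NaN, then 0.
filled = (
    sheet1.merge(sheet2, on=["location", "year"], how="left")
    .fillna(0)
    .astype({"val1": int, "val2": int})
)
```

Matching on the two key columns plays the role of the concatenated `$A2&$B2` lookup value in the formula.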
