Extractall dates - how to separate single years with regex in Python?

I have some dates included within the text in one of the columns of my dataframe.
For example:
sr = pd.Series(['04/20/2009', '04/20/09', '4/20/09', '4/3/09', '6/2008','12/2009','2010'])
When I try to extract these dates, half of my year ends up in the 'month' and 'day' columns:
result = sr.str.extractall(r'(?P<month>\d{,2})[/]?(?P<day>\d{,2})[/]?(?P<year>\d{2,4})')
result
         month  day  year
  match
0 0         04   20  2009
1 0         04   20    09
2 0          4   20    09
3 0          4    3    09
4 0          6   20    08
5 0         12   20    09
6 0         20  NaN    10
How can I fix this?
I can only think of processing '6/2008', '12/2009', '2010' separately from '04/20/2009', '04/20/09', '4/20/09', and then appending the results.

You could make the match a bit more specific for the months and days.
As there is always a year, you can make the whole group for the month and day optional.
In that optional group, you can match a month with an optional day.
(?<!\S)(?:(?P<month>1[0-2]|0?[1-9])/(?:(?P<day>3[01]|[12][0-9]|0?[1-9])/)?)?(?P<year>(?:20|19)?\d{2})(?!\S)
In parts:
(?<!\S)                                Negative lookbehind: assert that what is directly to the left is not a non-whitespace character (a whitespace boundary on the left)
(?:                                    Non-capture group
  (?P<month>1[0-2]|0?[1-9])/           Group "month" followed by /
  (?:                                  Non-capture group
    (?P<day>3[01]|[12][0-9]|0?[1-9])/  Group "day" followed by /
  )?                                   Close the group and make it optional
)?                                     Close the group and make it optional
(?P<year>(?:20|19)?\d{2})              Group "year": optionally match either 20 or 19, then 2 digits
(?!\S)                                 Negative lookahead: assert that what is directly to the right is not a non-whitespace character (a whitespace boundary on the right)
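Applied with pandas, this pattern keeps year-only and month/year entries intact instead of splitting the year across columns. A minimal sketch:
import pandas as pd

sr = pd.Series(['04/20/2009', '04/20/09', '4/20/09', '4/3/09',
                '6/2008', '12/2009', '2010'])

pattern = (r'(?<!\S)'
           r'(?:(?P<month>1[0-2]|0?[1-9])/'
           r'(?:(?P<day>3[01]|[12][0-9]|0?[1-9])/)?)?'
           r'(?P<year>(?:20|19)?\d{2})(?!\S)')

result = sr.str.extractall(pattern)
print(result)
# '6/2008' and '12/2009' now keep their month with day as NaN,
# and '2010' matches as year only.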

Related

Convert Daily dates to weekly using a specific first day of the week

I am currently working on grouping/aggregating data based on a date range for a weekly plot.
Below is what my dataframe looks like for daily data:
daily_dates    registered    attended
02/10/2022     0             0
02/09/2022     0             0
02/08/2022     1             0
02/07/2022     1             0
02/06/2022     20            06
02/05/2022     05            03
02/04/2022     15            12
02/03/2022     10            08
02/02/2022     10            05
The first day of the week I'd want is Sunday.
My current code to perform weekly group is:
weekly_df = weekly_df.resample('w').sum().reset_index()
The output I desire is:
weekly_dates    registered    attended
02/06/2022      22            06
01/30/2022      40            28
A bit of explanation about the desired output: the reason for 02/06/2022 & 01/30/2022 is that both dates are the start date of their respective week, which is a Sunday. And for the week of 01/30/2022, only the dates 02/05/2022, 02/04/2022, 02/03/2022, and 02/02/2022 are considered, as those are the ones present in the daily dataframe.
My current implementation follows the instructions provided here.
I am looking for any suggestion to achieve my desired output.
Try:
df.resample('W-SUN', label='left', closed='left').sum().reset_index()
Output:
daily_dates registered attended
0 2022-01-30 40 28
1 2022-02-06 22 6
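For a runnable sketch, assuming daily_dates starts out as mm/dd/yyyy strings, parse them and set them as the index before resampling:
import pandas as pd

df = pd.DataFrame({
    'daily_dates': ['02/10/2022', '02/09/2022', '02/08/2022', '02/07/2022',
                    '02/06/2022', '02/05/2022', '02/04/2022', '02/03/2022',
                    '02/02/2022'],
    'registered': [0, 0, 1, 1, 20, 5, 15, 10, 10],
    'attended': [0, 0, 0, 0, 6, 3, 12, 8, 5],
})

# resample() needs a DatetimeIndex
df['daily_dates'] = pd.to_datetime(df['daily_dates'], format='%m/%d/%Y')
df = df.set_index('daily_dates').sort_index()

# 'W-SUN' bins by weeks ending on Sunday; closed='left'/label='left'
# makes each bin start on, and be labelled with, its Sunday
weekly_df = df.resample('W-SUN', label='left', closed='left').sum().reset_index()
print(weekly_df)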

Can anybody tell me the regex that matches twenty (20) day and 28 days but not in 27 days

I am writing a Python script to match the duration of an activity. There are 2 choices: X days/months/years or (X) days/months/years.
I wrote the regex \w*\s*['(']*\d{1,4}[')']*\s*\w{3,6} and the sentence is
Ujjwal in 28 days and 40 months and 2 years or twenty (20) day
I want to match only 28 days, 40 months, 2 years and twenty (20) day, but my regex is matching in 28 days, and 40 months, and 2 years.
Please help me.
It is probably easier to be more specific with your regex, trying to match either a word before digits in parentheses or just digits:
(?:\w+\s+\(\d+\)|\b\d+)
followed by a space and one of the date type words:
\s+(?:year|month|day)s?
In python:
import re
text = 'Ujjwal in 28 days and 40 months and 2 years or twenty (20) day'
print(re.findall(r'(?:\w+\s+\(\d+\)|\b\d+)\s+(?:year|month|day)s?', text))
Output:
['28 days', '40 months', '2 years', 'twenty (20) day']

Count rows in multiple columns left to right until specific criteria is met

I have the following table below. I will be referencing a specific number based on other extraneous information. Let's say the specific number is 30. I first need to count 30 numbers down my September list, then move to October, then November, until the count has reached 30. Then I need to count all the missing values until the next value would reach the 30th count from the previous task. So for this example the 30th number would be November 19th, and the count of the missing values should be 55, ending at November 15th (if I counted that right). That value would then be stored in a cell.
I obtained the missed days with the following formula: =IFERROR(SMALL(IF(ISERROR(MATCH(ROW(L$1:INDEX(L:L,N$2)),M$2:INDEX(M:M,COUNT(M:M)+ROW(M$1)),0)),ROW(L$1:INDEX(L:L,N$2))),ROW()-ROW(L$1)),"") (see table 2 for column reference)
The max column value will be blank if there is no data in the month column, therefore the missed column will also have no data. I set that up with the following formula:
=IF(COUNTA(M:M)>1,31,"") (see table 2 for column reference)
Table 1
September  max  missed    October  max  missed    November  max  missed
1          30   4         1        31   2         2         30   1
2               6         3             6         7              3
3               7         4             7         9              4
5               11        5             8         10             5
8               12        12            9         11             6
9               13        15            10        16             8
10              14        20            11        17             12
15              16        28            13        18             13
22              17        30            14        19             14
23              18        31            16        20             15
24              19                      17        22             21
25              20                      18        27             23
29              21                      19        28             24
                26                      21                       25
                27                      22                       26
                28                      23                       29
                30                      24                       30
                                        25
                                        26
                                        27
                                        29
Table 2
L M N O
(blank) September max missed
I have an idea of how I would write this, but do not know the syntax:
x = Select(Range("G8").Value)    'value that holds the specific number (30 for the above example)
If x < 31 Then                   '30 days in September
    y = Count(M2:M32) Until = x                   'values in September
    z = Count(O2:O32) Until = value of y - 1
    'What if the last value is the 30th of September? How would you stop on August 31st?
    Range("A1").Value = z                         'value of z is stored in cell A1
ElseIf x < 62 Then               '61 days in September and October
    y2 = Count(M2:M32) & Count(Q2:Q32) Until = x  'values in September and October
    z2 = Count(R2:R32) & (S2:S32) Until = value of y2 - 1
    'Again, if the last value is the 31st of October, how would you stop on September 30th?
    Range("A1").Value = z2                        'value of z2 is stored in cell A1
ElseIf ...
    'continue for each month (12 times)
End If
There are a couple of things that could cause some problems here with my suggestions (that I just thought of). How would I dictate my starting month? Let's say I wanted to reference a specific cell and that cell contains the number 4; I would then want to start in April, even if I had data in March. Another way of thinking about this is that March is in year 2019 and April is in 2018. So then how could I get the code to jump from, say, December back to January? Say column Z is December and column A is January. I wouldn't necessarily want my code to only read left to right. It would need to start in reference to another cell and then jump back to the start if the year changes.
I apologize for the lengthiness, but that's my best effort at explaining. Let me know if you have any questions or if I can provide more examples, pictures, etc.
I think you should reorganize your data table to something like this:
Day Status
01.09.2018 ok
02.09.2018 ok
03.09.2018 ok
04.09.2018 missed
05.09.2018 ok
06.09.2018 missed
07.09.2018 missed
08.09.2018 ok
09.09.2018 ok
10.09.2018 ok
11.09.2018 missed
12.09.2018 missed
13.09.2018 missed
14.09.2018 missed
15.09.2018 ok
16.09.2018 missed
17.09.2018 missed
18.09.2018 missed
19.09.2018 missed
20.09.2018 missed
21.09.2018 missed
22.09.2018 ok
23.09.2018 ok
24.09.2018 ok
25.09.2018 ok
26.09.2018 missed
27.09.2018 missed
28.09.2018 missed
29.09.2018 ok
30.09.2018 missed
01.10.2018 ok
02.10.2018 ok
03.10.2018 ok
04.10.2018 ok
05.10.2018 ok
06.10.2018 ok
07.10.2018 ok
08.10.2018 ok
09.10.2018 ok
10.10.2018 ok
11.10.2018 ok
12.10.2018 ok
13.10.2018 ok
14.10.2018 ok
15.10.2018 ok
16.10.2018 ok
17.10.2018 ok
18.10.2018 ok
19.10.2018 ok
20.10.2018 ok
21.10.2018 ok
22.10.2018 ok
23.10.2018 ok
24.10.2018 ok
25.10.2018 ok
26.10.2018 ok
27.10.2018 ok
28.10.2018 ok
29.10.2018 ok
30.10.2018 ok
31.10.2018 missed
After that, you could easily manage your counts and find anything you want via filtering, specifying a start date, and so on.
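The question is about Excel/VBA, but to illustrate why the one-row-per-day layout makes the counting trivial, here is a rough pandas sketch of the "find the Nth present day, then count the misses before it" logic (the column names and the random statuses are purely hypothetical):
import numpy as np
import pandas as pd

# hypothetical long-format table, one row per day as suggested above
days = pd.date_range('2018-09-01', '2018-11-30', freq='D')
status = np.random.default_rng(0).choice(['ok', 'missed'], size=len(days))
df = pd.DataFrame({'Day': days, 'Status': status})

n = 30  # the "specific number" (read from a cell in the real workbook)

ok_days = df.loc[df['Status'] == 'ok', 'Day'].reset_index(drop=True)
nth_ok_day = ok_days.iloc[n - 1]  # date of the nth present day
missed_before = ((df['Status'] == 'missed') & (df['Day'] < nth_ok_day)).sum()
print(nth_ok_day.date(), missed_before)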

Parsing Data Output in Python

So I have this code:
si.get_stats("aapl")
which returns this junk:
0 Market Cap (intraday) 5 877.04B
1 Enterprise Value 3 966.56B
2 Trailing P/E 15.52
3 Forward P/E 1 12.46
4 PEG Ratio (5 yr expected) 1 1.03
5 Price/Sales (ttm) 3.30
6 Price/Book (mrq) 8.20
7 Enterprise Value/Revenue 3 3.64
8 Enterprise Value/EBITDA 6 11.82
9 Fiscal Year Ends Sep 29, 2018
10 Most Recent Quarter (mrq) Sep 29, 2018
11 Profit Margin 22.41%
12 Operating Margin (ttm) 26.69%
13 Return on Assets (ttm) 11.96%
14 Return on Equity (ttm) 49.36%
15 Revenue (ttm) 265.59B
16 Revenue Per Share (ttm) 53.60
17 Quarterly Revenue Growth (yoy) 19.60%
18 Gross Profit (ttm) 101.84B
19 EBITDA 81.8B
20 Net Income Avi to Common (ttm) 59.53B
21 Diluted EPS (ttm) 11.91
22 Quarterly Earnings Growth (yoy) 31.80%
23 Total Cash (mrq) 66.3B
24 Total Cash Per Share (mrq) 13.97
25 Total Debt (mrq) 114.48B
26 Total Debt/Equity (mrq) 106.85
27 Current Ratio (mrq) 1.12
28 Book Value Per Share (mrq) 22.53
29 Operating Cash Flow (ttm) 77.43B
30 Levered Free Cash Flow (ttm) 48.42B
31 Beta (3Y Monthly) 1.21
32 52-Week Change 3 5.27%
33 S&P500 52-Week Change 3 4.97%
34 52 Week High 3 233.47
35 52 Week Low 3 150.24
36 50-Day Moving Average 3 201.02
37 200-Day Moving Average 3 203.28
38 Avg Vol (3 month) 3 38.6M
39 Avg Vol (10 day) 3 42.36M
40 Shares Outstanding 5 4.75B
41 Float 4.62B
42 % Held by Insiders 1 0.07%
43 % Held by Institutions 1 61.16%
44 Shares Short (Oct 31, 2018) 4 36.47M
45 Short Ratio (Oct 31, 2018) 4 1.06
46 Short % of Float (Oct 31, 2018) 4 0.72%
47 Short % of Shares Outstanding (Oct 31, 2018) 4 0.77%
48 Shares Short (prior month Sep 28, 2018) 4 40.2M
49 Forward Annual Dividend Rate 4 2.92
50 Forward Annual Dividend Yield 4 1.51%
51 Trailing Annual Dividend Rate 3 2.72
52 Trailing Annual Dividend Yield 3 1.52%
53 5 Year Average Dividend Yield 4 1.73
54 Payout Ratio 4 22.84%
55 Dividend Date 3 Nov 15, 2018
56 Ex-Dividend Date 4 Nov 8, 2018
57 Last Split Factor (new per old) 2 1/7
58 Last Split Date 3 Jun 9, 2014
This is a third party function, scraping data off of Yahoo Finance. I need something like this
def func( si.get_stats("aapl") ):
**magic**
return Beta (3Y Monthly)
Specifically, I want it to return the number associated with Beta, not the actual text.
I'm assuming that the function call returns a single string, or a list of strings (one per line in the table), and is not writing to stdout.
To get the value associated with Beta (3Y Monthly) or any of the other parameter names:
1) If the return is a single string formatted to print as the table above, it should have \n at the end of each line, so you can split this string into a list, then iterate over it to find the parameter name and split again to fetch the numeric value associated with it:
def func():
    # Split the single formatted string into a list of elements;
    # each element is one line in the table
    str_lst = si.get_stats("aapl").split('\n')
    for line in str_lst:
        # change Beta (3Y Monthly) to any other parameter required
        if 'Beta (3Y Monthly)' in line:
            # split this line on the default whitespace separator; this gives
            # a list of elements, e.g. ['31', 'Beta', '(3Y', 'Monthly)', '1.21'],
            # where the numeric value is the last element; strip() removes any
            # trailing space/newline
            num_value_asStr = line.split()[-1].strip()
            return num_value_asStr
2) If it is already a list that is returned, then just iterate over the list items, use the same if condition as above, and split the matching list element to get the numeric value associated with the parameter:
def func():
    str_lst = si.get_stats("aapl")
    for line in str_lst:
        # change Beta (3Y Monthly) to any other parameter required
        if 'Beta (3Y Monthly)' in line:
            # split this line on the default whitespace separator,
            # e.g. ['31', 'Beta', '(3Y', 'Monthly)', '1.21'] - the numeric
            # value is the last element; strip() removes any trailing
            # space/newline
            num_value_asStr = line.split()[-1].strip()
            return num_value_asStr
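One caveat: the numbered two-column dump above looks like the repr of a pandas DataFrame rather than a plain string, and yahoo_fin's si.get_stats does, as far as I know, return a DataFrame. If so, no string splitting is needed at all; a hedged sketch (column positions assumed from the printout):
stats = si.get_stats("aapl")  # assuming this is a two-column DataFrame
attr_col, val_col = stats.columns[0], stats.columns[1]
row = stats[stats[attr_col].str.contains('Beta', na=False)]
beta = float(row[val_col].iloc[0])  # e.g. 1.21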

How to find a pattern and perform an operation on another field in awk?

I have a file with 4 columns separated by spaces, like this below:
1_86500000 50 1_87500000 19
1_87500000 13 1_89500000 42
1_89500000 25 1_90500000 10
1_90500000 3 1_91500000 11
1_91500000 23 1_92500000 29
1_92500000 34 1_93500000 4
1_93500000 39 1_94500000 49
1_94500000 35 1_95500000 26
2_35500000 1 2_31500000 81
2_31500000 12 2_4150000 50
The first and third columns are not in phase, so I cannot simply divide the value of one by the other.
Since a coordinate can appear in $1, in $3, or in both, a solution would be to look for the matching coordinate in the other column and divide the values, or fall back to 0 if there is none, as this expected result shows.
P.S. The second field in this expected result is just illustrative, to show the division.
1_86500000 0/50 0
1_87500000 19/13 1.46154
1_89500000 42/25 1.68
1_90500000 10/3 3.333
1_91500000 11/23 0.47826
1_92500000 29/34 0.85294
1_93500000 4/39 0.10256
1_94500000 49/35 1.4
2_35500000 0/1 0
2_31500000 81/12 6.75
2_4150000 50/0 50
I have not achieved anything by myself other than this, so I do not have a real starting point.
I tried separating the fields merged with _ to see if I could match by subtracting the coordinates; if I got 0, it would mean the columns were in phase and correct. But I could not get further:
awk '{if( ($5-$2)==0) print $1,$2,$3,$4,$5,$6}' file
I also tried to match both columns directly, but that only returns the rows that are already in phase:
awk '{if(($1==$3)) print $1,$4/$2}' file
Can you help me?
awk to the rescue!
$ awk '{d[$1]=$2; n[$3]=$4}
END {for(k in n)
if(k in d) {print k,n[k]"/"d[k],n[k]/d[k]; delete d[k]}
else print k,n[k]"/0",n[k];
for(k in d) print k,"0/"d[k],0}' file | sort
1_86500000 0/50 0
1_87500000 19/13 1.46154
1_89500000 42/25 1.68
1_90500000 10/3 3.33333
1_91500000 11/23 0.478261
1_92500000 29/34 0.852941
1_93500000 4/39 0.102564
1_94500000 49/35 1.4
1_95500000 26/0 26
2_31500000 81/12 6.75
2_35500000 0/1 0
2_4150000 50/0 50
Your division-by-zero result is a little strange, though!
Explanation: keep two arrays, one for the numerators and one for the denominators. Once the file is scanned, go over the numerator array, find the corresponding denominator, and do the division. For the entries without a counterpart, apply the convention given.
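For comparison, a minimal Python sketch of the same two-dictionary idea, assuming the four whitespace-separated columns shown in the question (the file name 'file' is assumed):
den, num = {}, {}
with open('file') as fh:
    for line in fh:
        c1, v1, c3, v3 = line.split()
        den[c1] = int(v1)  # columns 1-2: denominator per coordinate
        num[c3] = int(v3)  # columns 3-4: numerator per coordinate

for k in sorted(set(den) | set(num)):
    n = num.get(k, 0)
    d = den.get(k, 0)
    ratio = n / d if d else n  # same 'no denominator' convention as the awk answer
    print(k, f'{n}/{d}', f'{ratio:g}')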
