Creating missed records - Hive/PySpark - apache-spark

I need to identify the most recent available record and duplicate it whenever the record for a given month is not available.
Here is the table structure:
metrics table:
Metric_id  Frequency  Metrics_results
293        Monthly    151
293        Monthly    152
293        Monthly    153
294        quarterly  173
294        quarterly  174
295        Annually   195
Metrics_results table:
Metrics_results  Month  year  value
151              Jan    2020  98
152              Feb    2020  98
153              Mar    2020  99
173              Dec    2019  87
174              Mar    2020  86
195              Jan    2020  90
The metrics and metrics_results tables are joined on the Metrics_results column for every month, and the value is averaged per month.
Expected results:
Metric_id  Metrics_results  Month  year  value  Flag
293        151              Jan    2020  98     Existing
293        152              Feb    2020  98     Existing
293        153              Mar    2020  99     Existing
294        173              Dec    2019  87     Existing
294        173              Jan    2020  87     Copied
294        173              Feb    2020  87     Copied
294        174              Mar    2020  86     Existing
295        195              Jan    2020  90     Existing
295        195              Feb    2020  90     Copied
295        195              Mar    2020  90     Copied
Metrics that are evaluated monthly have a corresponding record in the metrics_results table for every month. Metrics that are evaluated quarterly (Mar, Jun, Sep, Dec) or annually (Jan) only have records for those months. For such metrics, whenever the current month has no record, the most recent available month's record has to be copied over so that it is included in the averaging.
E.g.:
For Metric_id = 294, there is no record for Jan 2020 or Feb 2020. In this case the Dec 2019 record, which is the last value available, has to be copied for both months, with the month changed to Jan 2020 and Feb 2020 respectively.
For Metric_id = 295, there is no record for any month other than Jan 2020. The Jan 2020 record must be copied for the rest of the year, with the month replaced each time.
I'm looking for a solution either as a Hive query or in PySpark. Any ideas or suggestions will be appreciated.
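One possible way to prototype the logic is sketched below. It uses pandas rather than Hive or PySpark, so it only illustrates the approach: join the two tables, build the full month range per metric, and forward-fill the gaps. The function name fill_missing_months and the last_month cut-off (Mar 2020, the last month in the expected results) are assumptions made for this sketch, not part of the question.

```python
import pandas as pd

# Sample data copied from the question.
metrics = pd.DataFrame({
    "Metric_id": [293, 293, 293, 294, 294, 295],
    "Frequency": ["Monthly", "Monthly", "Monthly", "quarterly", "quarterly", "Annually"],
    "Metrics_results": [151, 152, 153, 173, 174, 195],
})
results = pd.DataFrame({
    "Metrics_results": [151, 152, 153, 173, 174, 195],
    "Month": ["Jan", "Feb", "Mar", "Dec", "Mar", "Jan"],
    "year": [2020, 2020, 2020, 2019, 2020, 2020],
    "value": [98, 98, 99, 87, 86, 90],
})


def fill_missing_months(metrics, results, last_month="2020-03"):
    """Copy the latest available record into every month that has none.

    last_month is an assumed reporting cut-off (hypothetical parameter).
    """
    joined = metrics.merge(results, on="Metrics_results")
    # Turn "Jan" + 2020 into a monthly Period so months can be ordered.
    joined["period"] = pd.to_datetime(
        joined["Month"] + " " + joined["year"].astype(str), format="%b %Y"
    ).dt.to_period("M")
    end = pd.Period(last_month, freq="M")

    filled = []
    for _, grp in joined.groupby("Metric_id"):
        grp = grp.set_index("period").sort_index()
        # Every month from the metric's first record up to the cut-off.
        months = pd.period_range(grp.index.min(), end, freq="M")
        out = grp.reindex(months)
        # Rows created by the reindex have no value yet: those are the copies.
        out["Flag"] = out["value"].notna().map({True: "Existing", False: "Copied"})
        out = out.ffill()
        # Overwrite month and year with the month the row now stands for.
        out["Month"] = out.index.strftime("%b")
        out["year"] = out.index.year
        filled.append(out)

    cols = ["Metric_id", "Metrics_results", "Month", "year", "value", "Flag"]
    result = pd.concat(filled).reset_index(drop=True)[cols]
    return result.astype({"Metric_id": int, "Metrics_results": int, "value": int})


expected = fill_missing_months(metrics, results)
print(expected.to_string(index=False))
```

In PySpark the same idea would typically be expressed with a generated month calendar, a left join, and a window function that carries the last non-null value forward; the pandas version is just easier to run locally while checking the rules.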

Related

Is there any way to convert columns to rows using pandas?

I have an Excel file which contains data like this:
                               Prod  Prod  Prod  Work  Work  Work  Vacation
Year                           2022  2022  2023  2022  2022  2023  2022
Month                          10    11    12    10    11    12    10
Name    Business?  Exclusive?  Oct   Nov   Dec   Oct   Nov   Dec   Oct
Robert  Yes        No          100   100   100   150   150   150   1.1
Maria   No         Yes         75    75    50    25    25    25    1
and I want to convert this table into this form:
Name    Business?  Exclusive?  Year  Month  Prod  Work  Vacation
Robert  Yes        No          2022  Oct    100   150   1.1
Maria   No         Yes         2022  Nov    100   150   1
Robert  No         Yes         2023  Dec    100   150   1
Maria   No         Yes         2023  Dec    50    150   1
I would like to do this with the Python pandas library. I have been struggling with this problem for many days. Please help!
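A minimal sketch of one way to do this reshape is shown below. It assumes the sheet has already been loaded into a frame whose columns are a three-level MultiIndex (category, year, month name) and whose index holds Name, Business? and Exclusive?; the frame is built by hand here, and the level names Category, Year and Month are chosen for the example. The values are the ones from the input table, so the output has one row per person and (Year, Month) pair rather than the exact rows listed in the target table.

```python
import pandas as pd

# Value columns keyed by (category, year, month), mirroring the header rows.
columns = pd.MultiIndex.from_tuples(
    [("Prod", 2022, "Oct"), ("Prod", 2022, "Nov"), ("Prod", 2023, "Dec"),
     ("Work", 2022, "Oct"), ("Work", 2022, "Nov"), ("Work", 2023, "Dec"),
     ("Vacation", 2022, "Oct")],
    names=["Category", "Year", "Month"],
)
index = pd.MultiIndex.from_tuples(
    [("Robert", "Yes", "No"), ("Maria", "No", "Yes")],
    names=["Name", "Business?", "Exclusive?"],
)
wide = pd.DataFrame(
    [[100, 100, 100, 150, 150, 150, 1.1],
     [75, 75, 50, 25, 25, 25, 1]],
    index=index, columns=columns,
)

# Step 1: melt every value cell into its own row, keeping the person index.
long = wide.melt(ignore_index=False).reset_index()

# Step 2: spread the categories back out into Prod / Work / Vacation columns.
keys = ["Name", "Business?", "Exclusive?", "Year", "Month"]
tidy = long.pivot(index=keys, columns="Category", values="value").reset_index()
tidy.columns.name = None
tidy = tidy[keys + ["Prod", "Work", "Vacation"]]

print(tidy.to_string(index=False))
```

Months that have no Vacation column in the sheet (Nov and Dec here) come out as NaN in the Vacation column.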

Using Python to calculate the last 6 months' average for each month

I have a dataframe with 3 columns [user_id, year_month, value]. I want to automatically calculate the average of the last 6 months, for each month and for each unique user_id, and assign it to a new column.
user_id value year_month
1 50 2021-01
1 54 2021-02
.. .. ..
1 50 2021-11
1 47 2021-12
2 36 2021-01
2 48.5 2021-05
.. .. ..
2 54 2021-11
2 30.2 2021-12
3 41.4 2021-01
3 48.5 2021-02
3 41.4 2021-05
.. .. ..
3 30.2 2021-12
In total there are 12 to 24 months of data.
To get the Jan 2022 value [Dec 2021 to Jul 2021] = [55+32+33+63+54+51]/6
To get the Feb 2022 value [Jan 2022 to Aug 2021] = [32+33+37+53+54+51]/6
To get the Mar 2022 value [Feb 2022 to Sep 2021] = [45+32+33+63+54+51]/6
To get the Apr 2022 value [Mar 2022 to Oct 2021] = [63+54+51+45+32+33]/6
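First, convert year_month to datetime and make it the index:
df['year_month'] = pd.to_datetime(df['year_month'])
df = df.set_index('year_month').sort_index()
Then take a rolling mean per user. A time-based window has to be a fixed length, so '6M' is rejected; a day-based offset such as '183D' is the closest equivalent:
df.groupby('user_id')['value'].rolling('183D').mean()
This is the most correct way, but here is a more intuitive one that uses a window of 6 rows and so assumes one row per user per month:
df.sort_values('year_month').groupby('user_id')['value'].rolling(6).mean()  # returns the wanted series
As paul h said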
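The question's examples average the six months before the month being computed, so here is a small sketch that shifts the series by one row before taking the rolling mean. The values are made up for illustration, the column name avg_prev_6m is hypothetical, and the sketch assumes each user has exactly one row per month with no gaps.

```python
import pandas as pd

# Made-up values for one user; only the column names come from the question.
df = pd.DataFrame({
    "user_id": [1] * 8,
    "year_month": pd.period_range("2021-05", periods=8, freq="M").astype(str),
    "value": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
})

df = df.sort_values(["user_id", "year_month"])

# shift(1) drops the current month, so each row averages the six rows before it.
# Rows with fewer than six earlier months stay NaN.
df["avg_prev_6m"] = (
    df.groupby("user_id")["value"]
      .transform(lambda s: s.shift(1).rolling(6).mean())
)

print(df)
```

If some users skip months, the row-based window no longer lines up with calendar months, and the frame would first need to be reindexed to a complete monthly range per user.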

Horizontal SUMIFS with two vertical criteria

I am given the following sales table, which provides the sales that each employee made. Instead of their name I have their ID, and each ID may have more than one row.
To map the ID back to the name, I have a look up table with each employee's name and ID.
Sales Table:
Year  ID  North  South  West  East
2020  A   58     30     74    72
2020  A   85     40     90    79
2020  B   9      82     20    5
2020  B   77     13     49    21
2020  C   85     55     37    11
2020  C   29     70     21    22
2021  A   61     37     21    42
2021  A   22     39     2     34
2021  B   62     55     9     72
2021  B   59     11     2     37
2021  C   41     22     64    47
2021  C   83     18     56    83
ID table:
ID  Name
A   Allison
B   Brandon
C   Chris
I am trying to sum up each employee's sales by a given year, and aggregate all their transactions by their name (rather than ID), so that my result looks like the following:
Result:
Report   2021
Allison  258
Brandon  307
Chris    414
I want the user to be able to select the year, and the report would automatically sum up each person's sales by the year and their name.
Any ideas on how I can accomplish this?
With FILTER:
=SUM(FILTER($C$2:$F$13,($B$2:$B$13=INDEX($I$2:$I$4,MATCH(N3,$J$2:$J$4,0)))*($A$2:$A$13=$N$2)))
With SUMPRODUCT:
=SUMPRODUCT($C$2:$F$13*($B$2:$B$13=INDEX($I$2:$I$4,MATCH(N3,$J$2:$J$4,0)))*($A$2:$A$13=$N$2))

Convert string date column with format of ordinal numeral day, abbreviated month name, and normal year to %Y-%m-%d

Given the following df with string date column with ordinal numbers for day, abbreviated month name for month, and normal year:
date oil gas
0 1st Oct 2021 428 99
1 10th Sep 2021 401 101
2 2nd Oct 2020 189 74
3 10th Jan 2020 659 119
4 1st Nov 2019 691 130
5 30th Aug 2019 742 162
6 10th May 2019 805 183
7 24th Aug 2018 860 182
8 1st Sep 2017 759 183
9 10th Mar 2017 617 151
10 10th Feb 2017 591 149
11 22nd Apr 2016 343 88
12 10th Apr 2015 760 225
13 23rd Jan 2015 1317 316
How can I parse the date column into the standard %Y-%m-%d format?
My ideas so far: 1. strip the ordinal indicators ('st', 'nd', 'rd', 'th') from the day string with re while keeping the day number; 2. convert the abbreviated month name to a number (%b does not seem to work); 3. finally convert the result to %Y-%m-%d.
Code that may be useful for the first step:
df['date'].str.replace(r"(?<=\d)(st|nd|rd|th)", "", regex=True)
References:
https://metacpan.org/release/DROLSKY/DateTime-Locale-0.46/view/lib/DateTime/Locale/en_US.pm#Months
pd.to_datetime already handles this case if you don't specify the format parameter:
>>> pd.to_datetime(df['date'])
0 2021-10-01
1 2021-09-10
2 2020-10-02
3 2020-01-10
4 2019-11-01
5 2019-08-30
6 2019-05-10
7 2018-08-24
8 2017-09-01
9 2017-03-10
10 2017-02-10
11 2016-04-22
12 2015-04-10
13 2015-01-23
Name: date, dtype: datetime64[ns]
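If you would rather not rely on format inference, the three steps from the question can also be written out explicitly. This is only a sketch on a few rows of the sample data; it assumes English month abbreviations, which is what %b matches under an English or C locale.

```python
import pandas as pd

df = pd.DataFrame(
    {"date": ["1st Oct 2021", "10th Sep 2021", "2nd Oct 2020", "23rd Jan 2015"]}
)

# Step 1: drop the ordinal suffix that directly follows the day number.
cleaned = df["date"].str.replace(r"(?<=\d)(st|nd|rd|th)", "", regex=True)

# Steps 2 and 3: parse "1 Oct 2021" with an explicit format, then render as %Y-%m-%d.
df["date"] = pd.to_datetime(cleaned, format="%d %b %Y").dt.strftime("%Y-%m-%d")

print(df)
```

Note that the final column holds strings; drop the strftime call if a datetime64 column is more useful downstream.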

Rearrange and regroup stacked data

I have Excel data as follows:
Mon 34
Mon 76
Mon 86
Tue 24
Tue 34
Tue 66
Wed 88
Wed 89
Wed 87
Is there a way with a formula to rewrite this data as follows:
Mon Tue Wed
34 24 88
76 66 89
86 66 87
Assuming 76 is in B2, insert a column on the left and a row above. Label the columns (say ID, day and value) and in A2 enter 1 and series fill down to A4. Then select A2:A4 and series fill down to suit.
Build a PivotTable with ID for ROWS, day for COLUMNS and value for VALUES.
This won't give quite the result you show, because the sample output does not match the sample data (for example, the Tue column comes out as 24, 34, 66).
