I have the following rows in a Hive table (on HDFS) and am using Presto as the query engine.
1,#markbutcher72 #charlottegloyn Not what Belinda Carlisle thought. And yes, she was singing about Edgbaston.
2,#tomkingham #markbutcher72 #charlottegloyn It's true the garden of Eden is currently very green...
3,#MrRhysBenjamin #gasuperspark1 #markbutcher72 Actually it's Springfield Park, the (occasional) home of the might
The requirement is to get the following output through a Presto query. How can we do this?
1,markbutcher72
1,charlottegloyn
2,tomkingham
2,markbutcher72
2,charlottegloyn
3,MrRhysBenjamin
3,gasuperspark1
3,markbutcher72
select t.id,
       u.token
from mytable as t
cross join unnest(regexp_extract_all(t.text, '(?<=#)\S+')) as u(token);
+----+----------------+
| id | token |
+----+----------------+
| 1 | markbutcher72 |
| 1 | charlottegloyn |
| 2 | tomkingham |
| 2 | markbutcher72 |
| 2 | charlottegloyn |
| 3 | MrRhysBenjamin |
| 3 | gasuperspark1 |
| 3 | markbutcher72 |
+----+----------------+
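For reference, regexp_extract_all returns an array of every match, the lookbehind (?<=#) keeps the # itself out of each token, and unnest expands that array into one row per element. A quick standalone check in Presto:

select regexp_extract_all('#markbutcher72 #charlottegloyn some text', '(?<=#)\S+') as tokens;
-- tokens => [markbutcher72, charlottegloyn]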
I have a user table as follows
|------------|-----------------|
| user_id | visited |
|------------|-----------------|
| 1 | 12-23-2021 |
| 1 | 11-23-2021 |
| 1 | 10-23-2021 |
| 2 | 01-21-2021 |
| 3 | 02-19-2021 |
| 3 | 02-25-2021 |
|------------|-----------------|
I'm trying to create an incremental model to get each user's most recent visited date.
Since the incremental model needs a unique key, I'm concatenating user_id||visited -> unique_id.
DBT + Spark
{{ config(
materialized='incremental',
file_format='delta',
unique_key='unique_id',
incremental_strategy='merge'
) }}
with CTE as (
select user_id,
visited,
user_id||visited as unique_id
from my_table
{% if is_incremental() %}
where visited >= date_add(current_date, -1)
{% endif %}
)
select user_id,
unique_id,
max(visited) as recent_visited_date
from CTE
group by 1,2
The above model gives me the following result:
|------------|------------------|-----------------------|
| user_id | unique_id |recent_visited_date |
|------------|------------------|-----------------------|
| 1 | 112-23-2021 | 12-23-2021 |
| 1 | 111-23-2021 | 11-23-2021 |
| 1 | 110-23-2021 | 10-23-2021 |
| 2 | 201-21-2021 | 01-21-2021 |
| 3 | 302-19-2021 | 02-19-2021 |
| 3 | 302-25-2021 | 02-25-2021 |
|------------|------------------|-----------------------|
The output I want is:
|------------|------------------------|
| user_id | recent_visited_date |
|------------|------------------------|
| 1 | 12-23-2021 |
| 2 | 01-21-2021 |
| 3 | 02-25-2021 |
|------------|------------------------|
I know that for an incremental model with the merge strategy, the unique_id has to exist in the final table so the merge has something to compare against,
but grouping by unique_id gives the wrong output: since user_id||visited is unique per row, the model keeps one row per visit instead of one per user.
Is there any other way to get the max(visited) per user?
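One way around this (a sketch, not tested against your project) is to drop the concatenated key and make user_id itself the unique key: the final table then holds one row per user, and the merge strategy updates recent_visited_date in place whenever a user shows up again.

{{ config(
    materialized='incremental',
    file_format='delta',
    unique_key='user_id',
    incremental_strategy='merge'
) }}

select user_id,
       max(visited) as recent_visited_date
from my_table
{% if is_incremental() %}
where visited >= date_add(current_date, -1)
{% endif %}
group by 1

If late-arriving rows could carry a visited older than what is already stored, you would additionally need to compare against the current value in {{ this }} before merging.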
I have a Hive table with the structure below:
+---------------+--------------+----------------------+
| column_value | metric_name | key |
+---------------+--------------+----------------------+
| A37B | Mean | {0:"202006",1:"1"} |
| ACCOUNT_ID | Mean | {0:"202006",1:"2"} |
| ANB_200 | Mean | {0:"202006",1:"3"} |
| ANB_201 | Mean | {0:"202006",1:"4"} |
| AS82_RE | Mean | {0:"202006",1:"5"} |
| ATTR001 | Mean | {0:"202007",1:"2"} |
| ATTR001_RE | Mean | {0:"202007",1:"3"} |
| ATTR002 | Mean | {0:"202007",1:"4"} |
| ATTR002_RE | Mean | {0:"202007",1:"5"} |
| ATTR003 | Mean | {0:"202008",1:"3"} |
| ATTR004 | Mean | {0:"202008",1:"4"} |
| ATTR005 | Mean | {0:"202008",1:"5"} |
| ATTR006 | Mean | {0:"202009",1:"4"} |
| ATTR006       | Mean         | {0:"202009",1:"5"}   |
+---------------+--------------+----------------------+
I need to write a Spark SQL query that filters on the key column with a NOT IN condition on the combination of both map entries.
The following query works fine in HiveQL via Beeline:
select * from your_data where key[0] between '202006' and '202009' and key NOT IN ( map(0,"202009",1,"5") );
But when I try the same query in Spark SQL, I get this error:
cannot resolve due to data type mismatch: map<int,string>
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:115)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
Please help!
I got the answer from a different question I raised before. Spark SQL does not support equality comparisons on map types, which is what NOT IN requires, so comparing the individual entries key[0] and key[1] avoids the issue. This query works fine:
select * from your_data where key[0] between 202006 and 202009 and NOT (key[0]="202009" and key[1]="5" );
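For reference, a minimal sketch of the same filter through the DataFrame API (the table name your_data is assumed, as above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("your_data")

# key is map<int,string>, so key[0] / key[1] are lookups by map key
filtered = df.filter(
    F.col("key")[0].between("202006", "202009")
    & ~((F.col("key")[0] == "202009") & (F.col("key")[1] == "5"))
)
filtered.show()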
I have a dataframe like this:
+---+--------------------------------------+-----------+
| | envelopeid | message |
+---+--------------------------------------+-----------+
| 1 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00002 |
| 2 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00004 |
| 3 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.11001 |
| 4 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00002 |
| 5 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00001 |
| 6 | f4260b99-6579-4607-bfae-f601cc13ff0c | CMN.00202 |
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 8 | fee98470-aa8f-4ec5-8bcd-1683f85727c2 | TKP.00001 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I've grouped it with grouped = df.groupby('envelopeid'),
and I need to drop every group except those whose messages are only CMN.00002, or only CMN.00002 and CMN.00004.
Desired dataframe:
+---+--------------------------------------+-----------+
| | envelopeid | message |
+---+--------------------------------------+-----------+
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I tried:
(grouped.message.transform(lambda x: x.eq('CMN.00001').any() or (x.eq('CMN.00002').any() and x.ne('CMN.00002' or 'CMN.00004').any()) or x.ne('CMN.00002').all()))
but it is not working properly.
Try:
grouped = df.loc[df['message'].isin(['CMN.00002', 'CMN.00004'])].groupby('envelopeid')
Try this: df[df.message == 'CMN.00002']
outdf = df.groupby('envelopeid').filter(lambda x: tuple(x.message) == ('CMN.00002',) or tuple(x.message) == ('CMN.00002', 'CMN.00004'))
So I figured it out: I used filter instead of transform.
The resulting dataframe keeps only the groups that have only the CMN.00002 message, or CMN.00002 and CMN.00004. This is what I need.
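As a variant, a short sketch (assuming message order and duplicates within a group should not matter) that compares the set of distinct messages per group instead of a tuple:

import pandas as pd

# keep only groups whose distinct messages are exactly {CMN.00002}
# or exactly {CMN.00002, CMN.00004}
allowed = [{'CMN.00002'}, {'CMN.00002', 'CMN.00004'}]
outdf = df.groupby('envelopeid').filter(lambda g: set(g['message']) in allowed)

Unlike the tuple comparison, this keeps a group even when its qualifying messages arrive in a different order or repeat.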
I am working in Excel 2016. I am trying to figure out how many of my projects have not had any part of them started. For instance, if project 203784 has 3 parts, where 2 are Complete and 1 is Not Started, I would not want to count it. If a project had 3 parts where 2 were Not Started and 1 was assigned, I would want to count it as 1. Thank you in advance for your assistance.
+----+------------+------------------+-------------+
| | A | B | C |
+----+------------+------------------+-------------+
| 1 | Project ID | Position | Status |
| 2 | 203784 | Staff | Complete |
| 3 | 203784 | Staff | Complete |
| 4 | 203784 | Staff | Not Started |
| 5 | 203785 | Maintenance | Complete |
| 6 | 203785 | Maintenance | In Progress |
| 7 | 203786 | Grounds | Complete |
| 8 | 203787 | Nurse | Complete |
| 9 | 203788 | Teacher | Complete |
| 10 | 203788 | Teacher | Complete |
| 11 | 203788 | Teacher | Complete |
| 12 | 203789 | Transportation | Complete |
| 13 | 203789 | Transportation | Complete |
| 14 | 203789 | Transportation | Complete |
| 15 | 203790 | Evacuation | Complete |
| 16 | 203790 | Evacuation | Complete |
| 17 | 203791 | Implementation | Complete |
| 18 | 203792 | Knowledge Base | Not Started |
| 19 | 203792 | Knowledge Base | Not Started |
| 20 | 203793 | Janitor | Not Started |
| 21 | 203794 | Public Relations | In Progress |
| 22 | 203795 | HR | Complete |
| 23 | 203796 | Admin | Complete |
+----+------------+------------------+-------------+
In this example, I would want the count to show a total of 2: project numbers 203792 and 203793.
One way would be to add a column (say Count) populated as:
=COUNTIFS(A:A,A2,C:C,"Complete")+COUNTIFS(A:A,A2,C:C,"In Progress")
and then create a PivotTable with Count as Filters and Project ID for Rows. Select 0 for the filter.
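If a single-cell answer is preferred, a SUMPRODUCT sketch following the same logic (assuming the data sits in A2:C23) counts each project once when it has no Complete and no In Progress parts:

=SUMPRODUCT(((COUNTIFS(A2:A23,A2:A23,C2:C23,"Complete")+COUNTIFS(A2:A23,A2:A23,C2:C23,"In Progress"))=0)/COUNTIF(A2:A23,A2:A23))

Each row of a qualifying project contributes 1/n, where n is that project's row count, so every qualifying project sums to exactly 1; the sample data returns 2.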
I have a data table as below. I would like to get the last [Action] of each [Stage] of each [ID] based on [Time].
I tried Last([Action]) OVER (Intersect([ID],[Stage],[Time])), but it is not giving me what I want. Does anyone have any idea?
+----+---------+-----------------+------------+-----------------+
| ID | Stage | Action | Time | Last_Action |
+----+---------+-----------------+------------+-----------------+
| 1 | CEO | Decline | 11/01/2016 | requestmoreinfo |
| 1 | CEO | Approve | 11/02/2016 | requestmoreinfo |
| 1 | CEO | Approve | 11/03/2016 | requestmoreinfo |
| 1 | CEO | requestmoreinfo | 11/04/2016 | requestmoreinfo |
| 1 | Manager | requestmoreinfo | 11/05/2016 | Decline |
| 1 | Manager | requestmoreinfo | 11/06/2016 | Decline |
| 1 | Manager | Approve | 11/07/2016 | Decline |
| 1 | Manager | Decline | 11/08/2016 | Decline |
| 2 | User | Decline | 11/09/2016 | Approve |
| 2 | User | Decline | 11/10/2016 | Approve |
| 2 | User | Approve | 11/11/2016 | Approve |
+----+---------+-----------------+------------+-----------------+
This one probably isn't as obvious as most.
We have to find the [Action] for the Max([Time]) over the [ID] and [Stage]. You are close using Last(), but that method gets the logical last row; if your data isn't sorted, it would give undesired results. Thus, use the Max() method to get the most recent date:
Max([Time]) OVER (Intersect([ID],[Stage]))
Now... this would put the [Time] in your calculated column. Since you want the correlated [Action], we need to nest that in an If() statement to find the [Action]:
If([Time]=Max([Time]) OVER (Intersect([ID],[Stage])),[Action])
Now, this would put the correct [Action] in your calculated column, but only on the row containing the Max([Time]).
The last step is to apply this value across your [ID],[Stage] grouping with one more OVER() method:
First(If([Time]=Max([Time]) OVER (Intersect([ID],[Stage])),[Action])) OVER (Intersect([ID],[Stage]))
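For comparison, the same logic written as a SQL window query (a sketch; the table name approvals is assumed):

select id,
       stage,
       action,
       time,
       max(case when time = max_time then action end)
           over (partition by id, stage) as last_action
from (
    select *,
           max(time) over (partition by id, stage) as max_time
    from approvals
) t;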