Find all occurrences in a string - Presto

I have the following rows in Hive (HDFS) and am using Presto as the query engine.
1,#markbutcher72 #charlottegloyn Not what Belinda Carlisle thought. And yes, she was singing about Edgbaston.
2,#tomkingham #markbutcher72 #charlottegloyn It's true the garden of Eden is currently very green...
3,#MrRhysBenjamin #gasuperspark1 #markbutcher72 Actually it's Springfield Park, the (occasional) home of the might
The requirement is to get the following through a Presto query. How can we do this?
1,markbutcher72
1,charlottegloyn
2,tomkingham
2,markbutcher72
2,charlottegloyn
3,MrRhysBenjamin
3,gasuperspark1
3,markbutcher72

You can extract the hashtags with regexp_extract_all and expand them into rows with CROSS JOIN UNNEST:
select t.id
      ,u.token
from mytable as t
cross join unnest (regexp_extract_all(text, '(?<=#)\S+')) as u(token)
;
+----+----------------+
| id | token          |
+----+----------------+
| 1 | markbutcher72 |
| 1 | charlottegloyn |
| 2 | tomkingham |
| 2 | markbutcher72 |
| 2 | charlottegloyn |
| 3 | MrRhysBenjamin |
| 3 | gasuperspark1 |
| 3 | markbutcher72 |
+----+----------------+
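As a variant: if a hashtag can be immediately followed by punctuation, the \S+ pattern would swallow it. A sketch using a capture group (assuming tags consist of word characters only; the third argument of regexp_extract_all selects capturing group 1):
select t.id
      ,u.token
from mytable as t
cross join unnest (regexp_extract_all(text, '#(\w+)', 1)) as u(token)
;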

Related

Creating an incremental model in DBT+Spark with no unique_key

I have a user table as follows:
|------------|-----------------|
| user_id | visited |
|------------|-----------------|
| 1 | 12-23-2021 |
| 1 | 11-23-2021 |
| 1 | 10-23-2021 |
| 2 | 01-21-2021 |
| 3 | 02-19-2021 |
| 3 | 02-25-2021 |
|------------|-----------------|
I'm trying to create an incremental model to get each user's most recent visited date.
Since the incremental model needs a unique key, I'm concatenating user_id||visited -> unique_id
DBT + Spark
{{ config(
    materialized='incremental',
    file_format='delta',
    unique_key='unique_id',
    incremental_strategy='merge'
) }}
with CTE as (
    select user_id,
           visited,
           user_id||visited as unique_id
    from my_table
    {% if is_incremental() %}
    where visited >= date_add(current_date, -1)
    {% endif %}
)
select user_id,
       unique_id,
       max(visited) as recent_visited_date
from CTE
group by 1,2
The above model gives me the following result:
|------------|------------------|-----------------------|
| user_id | unique_id |recent_visited_date |
|------------|------------------|-----------------------|
| 1 | 112-23-2021 | 12-23-2021 |
| 1 | 111-23-2021 | 11-23-2021 |
| 1 | 110-23-2021 | 10-23-2021 |
| 2 | 201-21-2021 | 01-21-2021 |
| 3 | 302-19-2021 | 02-19-2021 |
| 3 | 302-25-2021 | 02-25-2021 |
|------------|------------------|-----------------------|
The output I want is:
|------------|------------------------|
| user_id | recent_visited_date |
|------------|------------------------|
| 1 | 12-23-2021 |
| 2 | 01-21-2021 |
| 3 | 02-25-2021 |
|------------|------------------------|
I know that for an incremental model with the merge strategy, the unique_key has to be in the final table for the comparison,
but having the unique_id there gives the wrong output.
Is there any other way to get the max(visited) per user?
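One possible reworking (a sketch, untested): since the goal is one row per user, key the model on user_id itself and fold the previously stored maximum back in from {{ this }}, so it is not lost when it falls outside the one-day lookback window:
{{ config(
    materialized='incremental',
    file_format='delta',
    unique_key='user_id',
    incremental_strategy='merge'
) }}
with all_visits as (
    select user_id, visited
    from my_table
    {% if is_incremental() %}
    where visited >= date_add(current_date, -1)
    {% endif %}
    {% if is_incremental() %}
    union all
    -- carry forward the max already stored in the target table
    select user_id, recent_visited_date as visited
    from {{ this }}
    {% endif %}
)
select user_id,
       max(visited) as recent_visited_date
from all_visits
group by 1
The merge then matches on user_id and simply overwrites each user's single row with the new maximum.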

Error while querying a Hive table with a map datatype in Spark SQL, but it works when executed in HiveQL

I have a Hive table with the structure below:
+---------------+--------------+----------------------+
| column_value | metric_name | key |
+---------------+--------------+----------------------+
| A37B | Mean | {0:"202006",1:"1"} |
| ACCOUNT_ID | Mean | {0:"202006",1:"2"} |
| ANB_200 | Mean | {0:"202006",1:"3"} |
| ANB_201 | Mean | {0:"202006",1:"4"} |
| AS82_RE | Mean | {0:"202006",1:"5"} |
| ATTR001 | Mean | {0:"202007",1:"2"} |
| ATTR001_RE | Mean | {0:"202007",1:"3"} |
| ATTR002 | Mean | {0:"202007",1:"4"} |
| ATTR002_RE | Mean | {0:"202007",1:"5"} |
| ATTR003 | Mean | {0:"202008",1:"3"} |
| ATTR004 | Mean | {0:"202008",1:"4"} |
| ATTR005 | Mean | {0:"202008",1:"5"} |
| ATTR006 | Mean | {0:"202009",1:"4"} |
| ATTR006       | Mean         | {0:"202009",1:"5"}   |
+---------------+--------------+----------------------+
I need to write a Spark SQL query that filters on the key column with a NOT IN condition on the combination of both map entries.
The following query works fine in HiveQL via Beeline:
select * from your_data where key[0] between '202006' and '202009' and key NOT IN ( map(0,"202009",1,"5") );
But when I try the same query in Spark SQL, I get this error:
cannot resolve due to data type mismatch: map<int,string>
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:115)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
Please help!
I got the answer from a different question I raised before. This query works fine:
select * from your_data where key[0] between 202006 and 202009 and NOT (key[0]="202009" and key[1]="5" );
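For context: the "data type mismatch: map<int,string>" error arises because Spark SQL does not let you compare a map column for equality, so NOT IN against a map literal fails analysis. If several key combinations must be excluded, one sketch (assuming both entries are always present and ':' never appears in the values) is to compare a concatenated form instead:
select *
from your_data
where key[0] between '202006' and '202009'
  and concat(key[0], ':', key[1]) not in ('202009:5');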

Remove groups from pandas where {condition}

I have a dataframe like this:
+---+--------------------------------------+-----------+
| | envelopeid | message |
+---+--------------------------------------+-----------+
| 1 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00002 |
| 2 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00004 |
| 3 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.11001 |
| 4 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00002 |
| 5 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00001 |
| 6 | f4260b99-6579-4607-bfae-f601cc13ff0c | CMN.00202 |
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 8 | fee98470-aa8f-4ec5-8bcd-1683f85727c2 | TKP.00001 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I've grouped it with grouped = df.groupby('envelopeid')
I need to remove groups from the dataframe and keep only the groups whose messages are (CMN.00002) alone or (CMN.00002 and CMN.00004) only.
Desired dataframe:
+---+--------------------------------------+-----------+
| | envelopeid | message |
+---+--------------------------------------+-----------+
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I tried:
(grouped.message.transform(lambda x: x.eq('CMN.00001').any() or (x.eq('CMN.00002').any() and x.ne('CMN.00002' or 'CMN.00004').any()) or x.ne('CMN.00002').all()))
but it does not work properly.
Try:
grouped = df.loc[df['message'].isin(['CMN.00002', 'CMN.00004'])].groupby('envelopeid')
Try this: df[df.message == 'CMN.00002']
outdf = df.groupby('envelopeid').filter(lambda x: tuple(x.message) == ('CMN.00002',) or tuple(x.message) == ('CMN.00002', 'CMN.00004'))
So I figured it out.
The resulting dataframe keeps only the groups that have only the CMN.00002 message, or CMN.00002 and CMN.00004. This is what I need.
I used filter instead of transform.
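A related sketch: comparing sets instead of tuples makes the check independent of row order within each group (note it also treats repeated messages as one):
allowed = ({'CMN.00002'}, {'CMN.00002', 'CMN.00004'})
# keep a group only when its distinct messages match an allowed combination
outdf = df.groupby('envelopeid').filter(lambda g: set(g['message']) in allowed)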

Excel Formula to count all items in a group to see if the status is an Open status

I am working in Excel 2016. I am trying to figure out how many of my projects have not had any part started. For instance, if project ID 203784 has 3 parts, where 2 are Complete and 1 is Not Started, I would not want to count it. If a project had 3 parts where 2 were Not Started and 1 was Assigned, I would want to count it as 1. Thank you in advance for your assistance.
+----+------------+------------------+-------------+
| | A | B | C |
+----+------------+------------------+-------------+
| 1 | Project ID | Position | Status |
| 2 | 203784 | Staff | Complete |
| 3 | 203784 | Staff | Complete |
| 4 | 203784 | Staff | Not Started |
| 5 | 203785 | Maintenance | Complete |
| 6 | 203785 | Maintenance | In Progress |
| 7 | 203786 | Grounds | Complete |
| 8 | 203787 | Nurse | Complete |
| 9 | 203788 | Teacher | Complete |
| 10 | 203788 | Teacher | Complete |
| 11 | 203788 | Teacher | Complete |
| 12 | 203789 | Transportation | Complete |
| 13 | 203789 | Transportation | Complete |
| 14 | 203789 | Transportation | Complete |
| 15 | 203790 | Evacuation | Complete |
| 16 | 203790 | Evacuation | Complete |
| 17 | 203791 | Implementation | Complete |
| 18 | 203792 | Knowledge Base | Not Started |
| 19 | 203792 | Knowledge Base | Not Started |
| 20 | 203793 | Janitor | Not Started |
| 21 | 203794 | Public Relations | In Progress |
| 22 | 203795 | HR | Complete |
| 23 | 203796 | Admin | Complete |
+----+------------+------------------+-------------+
In this example, I would want the count to show a total of 2, for project numbers 203792 and 203793.
One way would be to add a column (say Count) populated as:
=COUNTIFS(A:A,A2,C:C,"Complete")+COUNTIFS(A:A,A2,C:C,"In Progress")
and then create a PivotTable with Count as a Filter and Project ID for Rows. Select 0 for the filter.
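Alternatively, a single-cell sketch (assuming the data sits in A2:C23 as shown): each row of a project with no Complete or In Progress parts contributes 1 divided by that project's row count, so every such project sums to exactly 1:
=SUMPRODUCT(((COUNTIFS(A2:A23,A2:A23,C2:C23,"Complete")+COUNTIFS(A2:A23,A2:A23,C2:C23,"In Progress"))=0)/COUNTIF(A2:A23,A2:A23))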

Spotfire: how to get the last value in a column?

I have a data table as below. I would like to get the last [Action] of each [Stage] of each [ID], based on the [Time].
I tried last(action) over intersect([id],[stage],[time]), but it is not giving me what I want. Does anyone have any idea?
+----+---------+-----------------+------------+-----------------+
| ID | Stage | Action | Time | Last_Action |
+----+---------+-----------------+------------+-----------------+
| 1 | CEO | Decline | 11/01/2016 | requestmoreinfo |
| 1 | CEO | Approve | 11/02/2016 | requestmoreinfo |
| 1 | CEO | Approve | 11/03/2016 | requestmoreinfo |
| 1 | CEO | requestmoreinfo | 11/04/2016 | requestmoreinfo |
| 1 | Manager | requestmoreinfo | 11/05/2016 | Decline |
| 1 | Manager | requestmoreinfo | 11/06/2016 | Decline |
| 1 | Manager | Approve | 11/07/2016 | Decline |
| 1 | Manager | Decline | 11/08/2016 | Decline |
| 2 | User | Decline | 11/09/2016 | Approve |
| 2 | User | Decline | 11/10/2016 | Approve |
| 2 | User | Approve | 11/11/2016 | Approve |
+----+---------+-----------------+------------+-----------------+
This one probably isn't as obvious as most.
We have to find out what the status is for the Max([Time]) over the [ID] and [Stage]. You are close using Last(), but that method gets the logical last row; if your data isn't sorted, it would give undesired results. Thus, use Max() to get the most recent date.
Max([Time]) OVER (Intersect([ID],[Stage]))
Now, this would put the [Time] in your calculated column; since you want the correlated [Action], we need to nest it in an If() statement to find the [Action]:
If([Time]=Max([Time]) OVER (Intersect([ID],[Stage])),[Action])
Now, this would put the correct [Action] in your calculated column, but only on the row containing the Max([Time]).
The last step is to apply this value across your [ID],[Stage] grouping with one more Over() method:
First(If([Time]=Max([Time]) OVER (Intersect([ID],[Stage])),[Action])) OVER (Intersect([ID],[Stage]))
