Creating an incremental model in DBT+Spark with no unique_key - databricks

I have a user table as follows
|------------|-----------------|
| user_id | visited |
|------------|-----------------|
| 1 | 12-23-2021 |
| 1 | 11-23-2021 |
| 1 | 10-23-2021 |
| 2 | 01-21-2021 |
| 3 | 02-19-2021 |
| 3 | 02-25-2021 |
|------------|-----------------|
I'm trying to create an incremental model that returns each user's most recent visit date.
Since the incremental model needs a unique key, I'm concatenating user_id||visited into unique_id.
DBT + Spark model:
{{ config(
    materialized='incremental',
    file_format='delta',
    unique_key='unique_id',
    incremental_strategy='merge'
) }}
with CTE as (
    select user_id,
           visited,
           user_id||visited as unique_id
    from my_table
    {% if is_incremental() %}
    where visited >= date_add(current_date, -1)
    {% endif %}
)
select user_id,
       unique_id,
       max(visited) as recent_visited_date
from CTE
group by 1,2
The above model gives me the following result:
|------------|------------------|-----------------------|
| user_id | unique_id |recent_visited_date |
|------------|------------------|-----------------------|
| 1 | 112-23-2021 | 12-23-2021 |
| 1 | 111-23-2021 | 11-23-2021 |
| 1 | 110-23-2021 | 10-23-2021 |
| 2 | 201-21-2021 | 01-21-2021 |
| 3 | 302-19-2021 | 02-19-2021 |
| 3 | 302-25-2021 | 02-25-2021 |
|------------|------------------|-----------------------|
The output I want is:
|------------|------------------------|
| user_id | recent_visited_date |
|------------|------------------------|
| 1 | 12-23-2021 |
| 2 | 01-21-2021 |
| 3 | 02-25-2021 |
|------------|------------------------|
I know that for an incremental model with the merge strategy, the unique_id has to be in the final table so that rows can be compared,
but including the unique_id gives the wrong output.
Is there another way to get max(visited) for each user?
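One way to see why the concatenated key defeats the merge: a merge upserts on the unique key, so a key that contains visited turns every visit into its own row. The sketch below simulates this in plain Python rather than in dbt or Spark. The merge helper is a hypothetical stand-in for a MERGE statement, and keying on user_id alone is an assumption about what the model could use instead.

```python
from datetime import date

def merge(target, batch, key):
    """Upsert batch rows into target, matching on row[key].

    Hypothetical stand-in for MERGE: a matched key is overwritten,
    an unmatched key is inserted.
    """
    for row in batch:
        target[row[key]] = row
    return target

visits = [
    (1, date(2021, 12, 23)), (1, date(2021, 11, 23)), (1, date(2021, 10, 23)),
    (2, date(2021, 1, 21)),
    (3, date(2021, 2, 19)), (3, date(2021, 2, 25)),
]

# Key that includes the visit date: every visit is a distinct key.
per_visit = merge({}, [
    {"unique_id": f"{u}{v:%m-%d-%Y}", "user_id": u, "recent_visited_date": v}
    for u, v in visits
], key="unique_id")

# Aggregate the batch to one row per user, then key on user_id alone.
latest = {}
for u, v in visits:
    latest[u] = max(v, latest.get(u, v))
per_user = merge({}, [
    {"user_id": u, "recent_visited_date": v} for u, v in latest.items()
], key="user_id")

print(len(per_visit))  # one row per visit
print({u: r["recent_visited_date"].isoformat() for u, r in per_user.items()})
```

In the model this would roughly correspond to grouping by user_id only and setting unique_key='user_id', assuming each incremental batch contains the user's latest visit.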

How to use order_by with group_by in the Django ORM and return all fields

I'm using the Django ORM with PostgreSQL. Is it possible to query with both group_by and order_by?
This is the table:
| id | b_id | others |
| 1 | 2 | hh |
| 2 | 2 | hhh |
| 3 | 6 | h |
| 4 | 7 | hi |
| 5 | 7 | i |
I want the query result to be like this
| id | b_id | others |
| 1 | 2 | hh |
| 3 | 6 | h |
| 4 | 7 | hi |
or
| id | b_id | others |
| 4 | 7 | hi |
| 3 | 6 | h |
| 1 | 2 | hh |
I tried
Table.objects.annotate(count=Count('b_id')).values('b_id', 'id', 'others')
Table.objects.values('b_id', 'id', 'others').annotate(count=Count('b_id'))
Table.objects.extra(order_by=['id']).values('b_id','id', 'others')
Try a window function with a subquery:
from django.db.models import Count, F, Subquery, Window
from django.db.models.functions import FirstValue
# For each b_id partition, take the lowest id; then keep only those rows.
first_ids = Table.objects.annotate(
    first_id=Window(
        expression=FirstValue('id'),
        partition_by=[F('b_id')],
        order_by=F('id').asc(),  # ordering within each partition
    )
).values('first_id')
queryset = Table.objects.annotate(count=Count('b_id')).filter(pk__in=Subquery(first_ids))
You can try this:
from django.db.models import Count
# One row per b_id, with the number of rows in each group.
result = (
    Table.objects
    .values('b_id')
    .annotate(count=Count('b_id'))
)

Select rows from array of uuid when dealing with two tables

I have products and providers. Each product has a UUID, and each provider has an array of the UUIDs of the products it can provide.
How do I select all the products that a given provider (identified by its UUID) can offer?
Products:
+------+------+------+
| uuid | date | name |
+------+------+------+
| 0 | - | - |
| 1 | - | - |
| 2 | - | - |
+------+------+------+
Providers:
+------+----------------+
| uuid | array_products |
+------+----------------+
| 0 | [...] |
| 1 | [...] |
| 2 | [...] |
+------+----------------+
select p.name, u.product_uuid
from products p
join
(
    select unnest(array_products) as product_uuid
    from providers where uuid = :target_provider_uuid
) u on p.uuid = u.product_uuid;
Please note however that your data design is not efficient and much harder to work with than a normalized one.
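To illustrate the remark about normalization, here is a minimal sketch using Python's built-in sqlite3 module rather than PostgreSQL. The junction table provider_products, its column names, and the sample rows are all hypothetical. It replaces the array column with one row per (provider, product) pair, so the lookup becomes a plain join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products  (uuid TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE providers (uuid TEXT PRIMARY KEY);
    -- Hypothetical junction table replacing providers.array_products.
    CREATE TABLE provider_products (
        provider_uuid TEXT REFERENCES providers(uuid),
        product_uuid  TEXT REFERENCES products(uuid),
        PRIMARY KEY (provider_uuid, product_uuid)
    );
    INSERT INTO products VALUES ('p0', 'bolt'), ('p1', 'nut'), ('p2', 'washer');
    INSERT INTO providers VALUES ('v0'), ('v1');
    INSERT INTO provider_products VALUES ('v0', 'p0'), ('v0', 'p2'), ('v1', 'p1');
""")

def products_for(provider_uuid):
    """Return (name, uuid) for every product the given provider offers."""
    return conn.execute(
        """
        SELECT p.name, p.uuid
        FROM products p
        JOIN provider_products pp ON pp.product_uuid = p.uuid
        WHERE pp.provider_uuid = ?
        ORDER BY p.uuid
        """,
        (provider_uuid,),
    ).fetchall()

print(products_for("v0"))
```

With this layout the reverse question (which providers offer a given product) is the same join filtered on product_uuid, and both columns can be indexed.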

Oracle: update table where number column in a string variable

Here is what I want to do:
current table:
+----+-------------+
| id | data |
+----+-------------+
| 1 | max |
| 2 | linda |
| 3 | sam |
| 4 | henry |
+----+-------------+
I have a string id_str = '1,3,4'
Mystery Query - something like:
UPDATE table SET data = 'jen' where id in (id_str)
resulting table:
+----+-------------+
| id | data |
+----+-------------+
| 1 | jen |
| 2 | linda |
| 3 | jen |
| 4 | jen |
+----+-------------+
Starting from a list of ids given as a CSV string, say :id_str, you can do:
update mytable
set data = 'jen'
where ',' || :id_str || ',' like '%,' || id || ',%'
An alternative is a regex function:
where regexp_like(:id_str, '(^|,)' || id || '(,|$)')
Both solutions work, but are rather inefficient. A much better solution would be to pass the search parameters as a proper list of values rather than as a CSV string.
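As a sketch of passing a proper list instead of a CSV string, the example below uses Python's built-in sqlite3 module rather than Oracle, so the placeholder syntax (?) is sqlite3's and the helper name update_by_ids is hypothetical. The idea is to split the string in the application and bind each id as its own parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "max"), (2, "linda"), (3, "sam"), (4, "henry")],
)

def update_by_ids(conn, id_str, new_value):
    """Set data = new_value for every id listed in a CSV string.

    Returns the number of rows updated.
    """
    ids = [int(part) for part in id_str.split(",") if part.strip()]
    if not ids:
        return 0
    # One bind placeholder per id; the values never enter the SQL text.
    placeholders = ", ".join("?" for _ in ids)
    cur = conn.execute(
        f"UPDATE mytable SET data = ? WHERE id IN ({placeholders})",
        [new_value, *ids],
    )
    return cur.rowcount

updated = update_by_ids(conn, "1,3,4", "jen")
rows = conn.execute("SELECT id, data FROM mytable ORDER BY id").fetchall()
print(updated, rows)
```

Because the ids are compared as numbers, the database can use the index on id, which the LIKE and regex variants cannot.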

Remove groups from pandas where {condition}

I have dataframe like this:
+---+--------------------------------------+-----------+
| | envelopeid | message |
+---+--------------------------------------+-----------+
| 1 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00002 |
| 2 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00004 |
| 3 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.11001 |
| 4 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00002 |
| 5 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00001 |
| 6 | f4260b99-6579-4607-bfae-f601cc13ff0c | CMN.00202 |
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 8 | fee98470-aa8f-4ec5-8bcd-1683f85727c2 | TKP.00001 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I've grouped it with grouped = df.groupby('envelopeid').
I need to drop every group except those whose messages are exactly (CMN.00002) or exactly (CMN.00002 and CMN.00004).
Desired dataframe:
+---+--------------------------------------+-----------+
| | envelopeid | message |
+---+--------------------------------------+-----------+
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I tried
(grouped.message.transform(lambda x: x.eq('CMN.00001').any() or (x.eq('CMN.00002').any() and x.ne('CMN.00002' or 'CMN.00004').any()) or x.ne('CMN.00002').all()))
but it does not work properly.
Try:
grouped = df.loc[df['message'].isin(['CMN.00002', 'CMN.00004'])].groupby('envelopeid')
Try this: df[df.message== 'CMN.00002']
I figured it out. I used filter instead of transform:
outdf = df.groupby('envelopeid').filter(lambda x: tuple(x.message) == ('CMN.00002',) or tuple(x.message) == ('CMN.00002', 'CMN.00004'))
The resulting dataframe keeps only the groups whose messages are exactly CMN.00002, or CMN.00002 followed by CMN.00004, which is what I need.
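The tuple comparison in the filter depends on the row order inside each group. A minimal variant that compares sets instead is sketched below. The short envelope ids are placeholders for the UUIDs in the question, and the last group is deliberately stored in reverse order to show the difference.

```python
import pandas as pd

df = pd.DataFrame(
    {
        # Placeholder ids standing in for the UUIDs in the question.
        "envelopeid": ["a", "a", "a", "b", "b", "c", "d", "e", "f", "f"],
        "message": [
            "CMN.00002", "CMN.00004", "CMN.11001",
            "CMN.00002", "CMN.00001",
            "CMN.00202",
            "CMN.00002",
            "TKP.00001",
            "CMN.00004", "CMN.00002",  # reversed order on purpose
        ],
    },
    index=range(1, 11),
)

# The only message combinations a group may consist of.
ALLOWED = ({"CMN.00002"}, {"CMN.00002", "CMN.00004"})

# Keep a group only if its set of messages equals one of the allowed sets.
outdf = df.groupby("envelopeid").filter(lambda g: set(g["message"]) in ALLOWED)
print(outdf)
```

A set also ignores repeated messages within a group, so this is only equivalent to the tuple version if duplicates should be tolerated.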

Find all occurrences from a string - Presto

I have the following rows in Hive (HDFS) and am using Presto as the query engine.
1,#markbutcher72 #charlottegloyn Not what Belinda Carlisle thought. And yes, she was singing about Edgbaston.
2,#tomkingham #markbutcher72 #charlottegloyn It's true the garden of Eden is currently very green...
3,#MrRhysBenjamin #gasuperspark1 #markbutcher72 Actually it's Springfield Park, the (occasional) home of the might
The requirement is to get the following output with a Presto query. How can we do this?
1,markbutcher72
1,charlottegloyn
2,tomkingham
2,markbutcher72
2,charlottegloyn
3,MrRhysBenjamin
3,gasuperspark1
3,markbutcher72
select t.id
,u.token
from mytable as t
cross join unnest (regexp_extract_all(text,'(?<=#)\S+')) as u(token)
;
+----+----------------+
| id | token |
+----+----------------+
| 1 | markbutcher72 |
| 1 | charlottegloyn |
| 2 | tomkingham |
| 2 | markbutcher72 |
| 2 | charlottegloyn |
| 3 | MrRhysBenjamin |
| 3 | gasuperspark1 |
| 3 | markbutcher72 |
+----+----------------+
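The same extract-then-unnest idea can be checked outside Presto. The sketch below uses Python's re module with the same lookbehind pattern; explode_tokens is a hypothetical helper name, and Python's regex engine is assumed to treat this pattern the same way Presto's does.

```python
import re

rows = [
    (1, "#markbutcher72 #charlottegloyn Not what Belinda Carlisle thought. "
        "And yes, she was singing about Edgbaston."),
    (2, "#tomkingham #markbutcher72 #charlottegloyn It's true the garden of "
        "Eden is currently very green..."),
    (3, "#MrRhysBenjamin #gasuperspark1 #markbutcher72 Actually it's "
        "Springfield Park, the (occasional) home of the might"),
]

# Match the run of non-whitespace characters that follows a '#'.
HASHTAG = re.compile(r"(?<=#)\S+")

def explode_tokens(rows):
    """Return one (id, token) pair per hashtag, like CROSS JOIN UNNEST."""
    return [
        (row_id, token)
        for row_id, text in rows
        for token in HASHTAG.findall(text)
    ]

for row_id, token in explode_tokens(rows):
    print(f"{row_id},{token}")
```

Note that \S+ also captures punctuation attached to a tag (for example a trailing comma), so a stricter character class may be preferable if the text is less tidy than this sample.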
