Oracle: update table where number column in a string variable - string

Here is what I want to do:
current table:
+----+-------+
| id | data  |
+----+-------+
| 1  | max   |
| 2  | linda |
| 3  | sam   |
| 4  | henry |
+----+-------+
I have an id_str = '1,3,4'
Mystery Query - something like:
UPDATE table SET data = 'jen' where id in (id_str)
resulting table:
+----+-------+
| id | data  |
+----+-------+
| 1  | jen   |
| 2  | linda |
| 3  | jen   |
| 4  | jen   |
+----+-------+

Starting from a list of ids given as a CSV string, say :id_str, you can do:
update mytable
set data = 'jen'
where ',' || :id_str || ',' like ',%' || id || ',%'
An alternative is a regex function:
where regexp_like(:id_str, '(^|,)' || id || '(,|$)')
Both solutions work, but are rather inefficient. A much better solution would be to pass the search parameters as a proper list of values rather than as a CSV string.
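For example, a minimal sketch using an Oracle collection type (here the built-in SYS.ODCINUMBERLIST; the bind name :id_list below is only illustrative):

update mytable
set data = 'jen'
where id in (select column_value
             from table(sys.odcinumberlist(1, 3, 4)))

From a client program, the whole collection can be bound as a single parameter (where id in (select column_value from table(:id_list))), so the statement stays the same no matter how many ids are passed.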

Related

Creating an incremental model in DBT+Spark with no unique_key

I have a user table as follows
|---------|------------|
| user_id | visited    |
|---------|------------|
| 1       | 12-23-2021 |
| 1       | 11-23-2021 |
| 1       | 10-23-2021 |
| 2       | 01-21-2021 |
| 3       | 02-19-2021 |
| 3       | 02-25-2021 |
|---------|------------|
I'm trying to create an incremental model to get the user's recent visited date.
Since the incremental model needs a unique key, I'm concatenating user_id||visited -> unique_id
DBT + Spark
{{ config(
    materialized='incremental',
    file_format='delta',
    unique_key='unique_id',
    incremental_strategy='merge'
) }}

with CTE as (
    select user_id,
           visited,
           user_id||visited as unique_id
    from my_table
    {% if is_incremental() %}
    where visited >= date_add(current_date, -1)
    {% endif %}
)

select user_id,
       unique_id,
       max(visited) as recent_visited_date
from CTE
group by 1,2
The above model gives me the following result:
|---------|-------------|---------------------|
| user_id | unique_id   | recent_visited_date |
|---------|-------------|---------------------|
| 1       | 112-23-2021 | 12-23-2021          |
| 1       | 111-23-2021 | 11-23-2021          |
| 1       | 110-23-2021 | 10-23-2021          |
| 2       | 201-21-2021 | 01-21-2021          |
| 3       | 302-19-2021 | 02-19-2021          |
| 3       | 302-25-2021 | 02-25-2021          |
|---------|-------------|---------------------|
The output I want is:
|---------|---------------------|
| user_id | recent_visited_date |
|---------|---------------------|
| 1       | 12-23-2021          |
| 2       | 01-21-2021          |
| 3       | 02-25-2021          |
|---------|---------------------|
I know that for an incremental model with the merge strategy the unique_key has to be present in the final table so rows can be compared, but keeping the unique_id gives the wrong output.
Is there any other way to get the max(visited) per user?
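One possible direction, purely as a sketch (assuming newly loaded visits are never older than the date already stored for a user; otherwise the new value would need to be compared against the existing one with greatest()): merge on user_id itself and keep a single row per user.

{{ config(
    materialized='incremental',
    file_format='delta',
    unique_key='user_id',
    incremental_strategy='merge'
) }}

select user_id,
       max(visited) as recent_visited_date
from my_table
{% if is_incremental() %}
where visited >= date_add(current_date, -1)
{% endif %}
group by user_id

With user_id as the unique_key, the merge rewrites each user's single row on every incremental run, so the concatenated unique_id is no longer needed.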

Split column on condition in dataframe

The data frame I am working on has a column named "Phone" and I want to split it on / or , so that I get the data frame shown below, with the parts in separate columns. For example, the first row is 0674-2537100/101 and I want to split it on "/" into two columns with the values 0674-2537100 and 0674-2537101.
Input:
+-----------------------------+
| Phone                       |
+-----------------------------+
| 0674-2537100/101            |
| 0674-2725627                |
| 0671 – 2647509              |
| 2392229                     |
| 2586198/2583361             |
| 0663-2542855/2405168        |
| 0674 – 2563832/0674-2590796 |
| 0671-6520579/3200479        |
+-----------------------------+
Output:
+----------------+--------------+
| Phone          | Phone1       |
+----------------+--------------+
| 0674-2537100   | 0674-2537101 |
| 0674-2725627   |              |
| 0671 – 2647509 |              |
| 2392229        |              |
| 2586198        | 2583361      |
| 0663-2542855   | 0663-2405168 |
| 0674 – 2563832 | 0674-2590796 |
| 0671-6520579   | 0671-3200479 |
+----------------+--------------+
Here is the solution I came up with: take the lengths of the strings on both sides of the separator (/), take their difference, and copy the substring of the first column from character position [:difference-1] onto the second column.
So far my progress is,
df['Phone'] = df['Phone'].str.replace(' ', '')
df['Phone'] = df['Phone'].str.replace('–', '-')
df[['Phone','Phone1']] = df['Phone'].str.split("/",expand=True)
df["Phone1"].fillna(value=np.nan, inplace=True)
m2 = (df["Phone1"].str.len() < 12) & (df["Phone"].str.len() > 7)
m3 = df["Phone"].str.len() - df["Phonenew"].str.len()
df.loc[m2, "Phone1"] = df["Phone"].str[:m3-1] + df["Phonenew"]
It gives an error and the column has only NaN values after I run this. Please help me out here.
Considering you're only going to have at most two values separated by '/' in the 'Phone' column, here's what you can do:
import pandas as pd

'''
This function takes a row of the dataframe as input and returns the row with
appropriate values.
'''
def split_phone_number(row):
    split_str = row['Phone'].split('/')
    # Considering that you're only going to have 2 or fewer values, update
    # the passed row's columns with appropriate values.
    if len(split_str) > 1:
        row['Phone'] = split_str[0]
        row['Phone1'] = split_str[1]
    else:
        row['Phone'] = split_str[0]
        row['Phone1'] = ''
    # Return the updated row.
    return row

# Making a dummy dataframe.
d = {'Phone': ['0674-2537100/101', '0674-257349', '0671-257349', '257349', '257349/100', '101/100', '5688343/438934']}
dataFrame = pd.DataFrame(data=d)
# Considering you're only going to have one extra column, add that column to the dataframe.
dataFrame = dataFrame.assign(Phone1=['' for i in range(dataFrame.shape[0])])
# Applying the split_phone_number function to the dataframe.
dataFrame = dataFrame.apply(split_phone_number, axis=1)
# Printing the dataframe.
print(dataFrame)
Input:
+--------------------+
| Phone              |
+--------------------+
| 0 0674-2537100/101 |
| 1 0674-257349      |
| 2 0671-257349      |
| 3 257349           |
| 4 257349/100       |
| 5 101/100          |
| 6 5688343/438934   |
+--------------------+
Output:
+------------------------+
|   Phone        Phone1  |
+------------------------+
| 0 0674-2537100 101     |
| 1 0674-257349          |
| 2 0671-257349          |
| 3 257349               |
| 4 257349       100     |
| 5 101          100     |
| 6 5688343      438934  |
+------------------------+
For further reading:
dataframe.apply()
Hope this helps. Cheers!
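If the missing prefix should also be copied over, as in the expected output in the question, here is a minimal vectorized sketch along the lines the question describes (length difference, then prefix copy); it assumes pandas, the column names used above, and at most one '/' per value:

import pandas as pd

df = pd.DataFrame({'Phone': ['0674-2537100/101', '0674-2725627',
                             '2586198/2583361', '0663-2542855/2405168']})

# Normalise separators, then split on '/' into two columns.
cleaned = df['Phone'].str.replace(' ', '', regex=False).str.replace('–', '-', regex=False)
parts = cleaned.str.split('/', expand=True).reindex(columns=[0, 1])
df['Phone'], df['Phone1'] = parts[0], parts[1]

# Where the second number is shorter than the first, copy the missing
# leading characters (the length difference) from the first number.
short = df['Phone1'].notna() & (df['Phone1'].str.len() < df['Phone'].str.len())
diff = (df['Phone'].str.len() - df['Phone1'].str.len().fillna(0)).astype(int)
df.loc[short, 'Phone1'] = [p[:d] + p1 for p, p1, d in
                           zip(df.loc[short, 'Phone'],
                               df.loc[short, 'Phone1'],
                               diff[short])]
print(df)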

PySpark getting distinct values over a wide range of columns

I have data with a large number of custom columns, the content of which I poorly understand. The columns are named evar1 to evar250. What I'd like to get is a single table with all distinct values, a count of how often they occur, and the name of the column.
|------------|------------------|---------|
| columnname | value            | count   |
|------------|------------------|---------|
| evar1      | en-GB            | 7654321 |
| evar1      | en-US            | 1234567 |
| evar2      | www.myclient.com | 123     |
| evar2      | app.myclient.com | 456     |
| ...
The best way I can think of doing this feels terrible, as I believe I have to read this data once per column (there are actually about 400 such columns).
import pyspark.sql.functions as fn

i = 1
df_evars = None
while i <= 30:
    colname = "evar" + str(i)
    # Count rows per distinct value of this column, then tag the result
    # with the column name so the unions can be told apart.
    df_temp = df.groupBy(colname).agg(fn.count("*").alias("rows"))\
        .withColumn("colName", fn.lit(colname))
    if df_evars:
        df_evars = df_evars.union(df_temp)
    else:
        df_evars = df_temp
    i += 1
display(df_evars)
Am I missing a better solution?
Update
This has been marked as a duplicate but the two responses IMO only solve part of my question.
I am looking at potentially very wide tables with potentially a large number of values. I need a simple result (i.e. 3 columns that show the source column, the value, and the count of that value in the source column).
The first of the responses only gives me an approximation of the number of distinct values, which is pretty useless to me.
The second response seems less relevant than the first. To clarify, source data like this:
|-------|-------|-----|
| evar1 | evar2 | ... |
|-------|-------|-----|
| A     | A     | ... |
| B     | A     | ... |
| B     | B     | ... |
| B     | B     | ... |
| ...
Should result in the output
|------------|-------|-------|
| columnname | value | count |
|------------|-------|-------|
| evar1      | A     | 1     |
| evar1      | B     | 3     |
| evar2      | A     | 2     |
| evar2      | B     | 2     |
| ...
Using melt borrowed from here:
from pyspark.sql.functions import col

melt(
    df.select([col(c).cast("string") for c in df.columns]),
    id_vars=[], value_vars=df.columns
).groupBy("variable", "value").count()
Adapted from the answer by user6910411.
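For reference, the melt helper used above is not defined in the snippet itself; the pattern from the referenced answer can be sketched roughly like this (explode an array of (variable, value) structs, one per value column):

from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # One (variable, value) struct per value column, collected into an array.
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Explode the array so each column/value pair becomes its own row.
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    cols = list(id_vars) + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return tmp.select(*cols)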

Find all occurrences from a string - Presto

I have the following rows in Hive (HDFS) and am using Presto as the query engine.
1,#markbutcher72 #charlottegloyn Not what Belinda Carlisle thought. And yes, she was singing about Edgbaston.
2,#tomkingham #markbutcher72 #charlottegloyn It's true the garden of Eden is currently very green...
3,#MrRhysBenjamin #gasuperspark1 #markbutcher72 Actually it's Springfield Park, the (occasional) home of the might
The requirement is to get the following output through a Presto query. How can we do this, please?
1,markbutcher72
1,charlottegloyn
2,tomkingham
2,markbutcher72
2,charlottegloyn
3,MrRhysBenjamin
3,gasuperspark1
3,markbutcher72
select t.id
,u.token
from mytable as t
cross join unnest (regexp_extract_all(text,'(?<=#)\S+')) as u(token)
;
+----+----------------+
| id | token          |
+----+----------------+
| 1  | markbutcher72  |
| 1  | charlottegloyn |
| 2  | tomkingham     |
| 2  | markbutcher72  |
| 2  | charlottegloyn |
| 3  | MrRhysBenjamin |
| 3  | gasuperspark1  |
| 3  | markbutcher72  |
+----+----------------+

How to pivot data using Informatica when you have variable amount of pivot rows?

Based on my earlier questions, how can I pivot data using Informatica PowerCenter Designer when I have a variable number of addresses in my data? I would like to pivot e.g. four addresses from my data. This is the structure of the source data file:
+---------+--------------+-----------------+
| ADDR_ID | NAME         | ADDRESS         |
+---------+--------------+-----------------+
| 1       | John Smith   | JohnsAddress1   |
| 1       | John Smith   | JohnsAddress2   |
| 1       | John Smith   | JohnsAddress3   |
| 2       | Adrian Smith | AdriansAddress1 |
| 2       | Adrian Smith | AdriansAddress2 |
| 3       | Ivar Smith   | IvarAddress1    |
+---------+--------------+-----------------+
And this should be the resulting table:
+---------+--------------+-----------------+-----------------+---------------+----------+
| ADDR_ID | NAME         | ADDRESS1        | ADDRESS2        | ADDRESS3      | ADDRESS4 |
+---------+--------------+-----------------+-----------------+---------------+----------+
| 1       | John Smith   | JohnsAddress1   | JohnsAddress2   | JohnsAddress3 | NULL     |
| 2       | Adrian Smith | AdriansAddress1 | AdriansAddress2 | NULL          | NULL     |
| 3       | Ivar Smith   | IvarAddress1    | NULL            | NULL          | NULL     |
+---------+--------------+-----------------+-----------------+---------------+----------+
I guess I can use
SOURCE --> SOURCE_QUALIFIER --> SORTER --> AGGREGATOR --> EXPRESSION --> TARGET TABLE
But what kind of port should I use in AGGREGATOR and EXPRESSION transforms?
You should use something along the lines of this:
Source->Expression->Aggregator->Target
In the expression, add a variable port:
v_count expr: IIF(ISNULL(v_COUNT) OR v_COUNT=3, 1, v_COUNT + 1)
OR
v_count expr: IIF(ADDR_ID=v_PREVIOUS_ADDR_ID, v_COUNT + 1, 1)
And 3 output ports:
o_addr1 expr: DECODE(TRUE, v_COUNT=1, ADDR_IN, NULL)
o_addr2 expr: DECODE(TRUE, v_COUNT=2, ADDR_IN, NULL)
o_addr3 expr: DECODE(TRUE, v_COUNT=3, ADDR_IN, NULL)
Then use the Aggregator, group by ID and always select the MAX, e.g.:
agg_addr1: expr: MAX(O_ADDR1)
agg_addr2: expr: MAX(O_ADDR2)
agg_addr3: expr: MAX(O_ADDR3)
If you need more denormalized ports, add additional ports and set the initial state
of the v_count variable accordingly.
Try this:
SOURCE --> SOURCE_QUALIFIER --> RANK --> AGGREGATOR -->TARGET
In the RANK transformation, group by ADDR_ID and select ADDRESS as the rank port. In the Properties tab, set Number of Ranks to 4.
In the AGGREGATOR transformation, group by ADDR_ID and use the following output port expressions (RANKINDEX is generated by the RANK transformation):
ADDRESS1 = MAX(ADDRESS,RANKINDEX=1)
ADDRESS2 = MAX(ADDRESS,RANKINDEX=2)
ADDRESS3 = MAX(ADDRESS,RANKINDEX=3)
ADDRESS4 = MAX(ADDRESS,RANKINDEX=4)
