AWS Athena: Get part of the String after last delimiter - presto

I have this table in AWS Athena:
+------------------------------------------------------------------------------+
| URL                                                                          |
+------------------------------------------------------------------------------+
| stag.v1.abc.in/beauty/hair/go-abc-girl-a57-20200001?ref=home_feed_1          |
| stag.v1.abc.in/                                                              |
| stag.v1.abc.ph/eatdrink/cheap/76027/dairy-free-upsize-a1046-20190515?ref=ar  |
| stag.v1.abc.in/beauty/hair/go-abc-girl-a57-20200003?ref=home_feed_1          |
+------------------------------------------------------------------------------+
I need to extract the part (the id) of the string between two delimiters (after the last '-' and before the '?').
I should get:
+----------+
| ID       |
+----------+
| 20200001 |
| -        |
| 20190515 |
| 20200003 |
+----------+
I tried SUBSTRING_INDEX(), but Athena does not support it. Could anyone help me out with this? Thanks in advance.

url_extract_path + regexp_extract
select regexp_extract(url_extract_path(url),'([^-]*)$') from "tableabc"
limit 5;
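In case url_extract_path returns NULL for these scheme-less URLs (an assumption worth checking on your data), a single regexp_extract with a capture group between the last '-' and the '?' is another option; note it returns NULL for rows without a '?':
-- capture the digits after the last '-' and before the '?'
select regexp_extract(url, '-([^-?]+)\?', 1) as id
from "tableabc"
limit 5;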

Related

Error while querying a Hive table with map datatype in Spark SQL, but working when executed in HiveQL

I have a Hive table with the below structure:
+--------------+-------------+---------------------+
| column_value | metric_name | key                 |
+--------------+-------------+---------------------+
| A37B         | Mean        | {0:"202006",1:"1"}  |
| ACCOUNT_ID   | Mean        | {0:"202006",1:"2"}  |
| ANB_200      | Mean        | {0:"202006",1:"3"}  |
| ANB_201      | Mean        | {0:"202006",1:"4"}  |
| AS82_RE      | Mean        | {0:"202006",1:"5"}  |
| ATTR001      | Mean        | {0:"202007",1:"2"}  |
| ATTR001_RE   | Mean        | {0:"202007",1:"3"}  |
| ATTR002      | Mean        | {0:"202007",1:"4"}  |
| ATTR002_RE   | Mean        | {0:"202007",1:"5"}  |
| ATTR003      | Mean        | {0:"202008",1:"3"}  |
| ATTR004      | Mean        | {0:"202008",1:"4"}  |
| ATTR005      | Mean        | {0:"202008",1:"5"}  |
| ATTR006      | Mean        | {0:"202009",1:"4"}  |
| ATTR006      | Mean        | {0:"202009",1:"5"}  |
+--------------+-------------+---------------------+
I need to write a Spark SQL query to filter on the key column with a NOT IN condition on the combination of both keys.
The following query works fine in HiveQL in Beeline:
select * from your_data where key[0] between '202006' and '202009' and key NOT IN ( map(0,"202009",1,"5") );
But when I try the same query in Spark SQL, I get this error:
cannot resolve due to data type mismatch: map<int,string>
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:115)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
Please help!
I got the answer from a different question which I raised before. This query works fine:
select * from your_data where key[0] between 202006 and 202009 and NOT (key[0]="202009" and key[1]="5" );
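Spark SQL cannot compare whole map values, which is why the NOT IN against map(...) fails with the data type mismatch; comparing the individual entries key[0] and key[1] avoids it. If several (key[0], key[1]) combinations need to be excluded, one sketch is to concatenate the entries and apply NOT IN to the result (the second pair below is made up purely for illustration):
select *
from your_data
where key[0] between '202006' and '202009'
  -- exclude specific (key[0], key[1]) combinations; '202008_3' is only an example
  and concat(key[0], '_', key[1]) not in ('202009_5', '202008_3');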

Oracle: update a table where a number column is in a string variable

Here is what I want to do:
current table:
+----+-------+
| id | data  |
+----+-------+
| 1  | max   |
| 2  | linda |
| 3  | sam   |
| 4  | henry |
+----+-------+
I have an id_str = '1,3,4'
Mystery Query - something like:
UPDATE table SET data = 'jen' where id in (id_str)
resulting table:
+----+-------+
| id | data  |
+----+-------+
| 1  | jen   |
| 2  | linda |
| 3  | jen   |
| 4  | jen   |
+----+-------+
Starting from a list of ids given as a CSV string, say :id_str, you can do:
update mytable
set data = 'jen'
where ',' || :id_str || ',' like ',%' || id || ',%'
An alternative is a regex function:
where regexp_like(:id_str, '(^|,)' || id || '(,|$)')
Both solutions work, but are rather inefficient. A much better solution would be to pass the search parameters as a proper list of values rather than a CSV string.
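If the CSV string cannot be avoided, a common workaround (a sketch, assuming Oracle 11g or later for regexp_count) is to split :id_str into rows first and use a plain IN:
update mytable
set data = 'jen'
where id in (
  -- split the comma-separated :id_str into one number per row
  select to_number(regexp_substr(:id_str, '[^,]+', 1, level))
  from dual
  connect by level <= regexp_count(:id_str, ',') + 1
);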

I want to count if both conditions are true (logical AND count) in Excel

I have two columns, both categorical: Age_group and Engagement_category. I want to get the count of each Engagement_category in each Age_group.
This is like the GROUP BY function in SQL.
| Engagement_category | Age_group |
|:-------------------:|:---------:|
| Nearly Engaged | 21-26 |
| Not Engaged | 31-36 |
| Disengaged | 36-41 |
| Nearly Engaged | 21-26 |
| Engaged | 21-26 |
| Engaged | 26-31 |
I tried the Excel COUNTIFS function but it shows the count of each unique value in the criteria range that I provided.
The expected output is something like this:
| Age_group | Engaged | Nearly Engaged | Not Engaged | Disengaged |
|:---------:|:-------:|----------------|---------------|------------|
| 21-26 | 1 | | | |
| 26-31 | | | | |
| 31-36 | | | | |
| 36-41 | | | | |
| 41-46 | | | | |
| 46-51 | | | | |
Thanks!
Use the COUNTIFS function; see the documentation: https://support.office.com/en-us/article/countifs-function-dda3dc6e-f74e-4aee-88bc-aa8c2a866842
Please try:
=COUNTIFS(B:B, "21-26", A:A, "Engaged")
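To fill the whole expected grid without retyping the criteria, the same formula can be written once with mixed references and copied across; this assumes (purely for illustration) that the Age_group labels start in F2 and the category headers in G1:
=COUNTIFS($B:$B, $F2, $A:$A, G$1)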
Try inserting a pivot table:
Highlight the source data (the 2-column table including headers), go to the Insert tab and click the PivotTable button, then set up the Rows, Columns and Values fields as follows:
The key is to drag Engagement_Category to both the Columns and the Values fields.

Find all occurrences from a string - Presto

I have the following rows in Hive (HDFS) and am using Presto as the query engine.
1,#markbutcher72 #charlottegloyn Not what Belinda Carlisle thought. And yes, she was singing about Edgbaston.
2,#tomkingham #markbutcher72 #charlottegloyn It's true the garden of Eden is currently very green...
3,#MrRhysBenjamin #gasuperspark1 #markbutcher72 Actually it's Springfield Park, the (occasional) home of the might
The requirement is to get the following through a Presto query. How can we do this, please?
1,markbutcher72
1,charlottegloyn
2,tomkingham
2,markbutcher72
2,charlottegloyn
3,MrRhysBenjamin
3,gasuperspark1
3,markbutcher72
select t.id
,u.token
from mytable as t
cross join unnest (regexp_extract_all(text,'(?<=#)\S+')) as u(token)
;
+----+----------------+
| id | token |
+----+----------------+
| 1 | markbutcher72 |
| 1 | charlottegloyn |
| 2 | tomkingham |
| 2 | markbutcher72 |
| 2 | charlottegloyn |
| 3 | MrRhysBenjamin |
| 3 | gasuperspark1 |
| 3 | markbutcher72 |
+----+----------------+
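If the lookbehind gives trouble in your Presto/Athena version (an assumption worth checking), the same result can be produced with a capture group, since regexp_extract_all accepts a group index:
select t.id
      ,u.token
from mytable as t
-- group 1 keeps only the text after the '#'
cross join unnest (regexp_extract_all(text, '#(\S+)', 1)) as u(token);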

'where' in Apache Spark

df:
+-----------+
|       word|
+-----------+
|       1609|
|           |
|        the|
|    sonnets|
|           |
|         by|
|    william|
|shakespeare|
|           |
|         fg|
+-----------+
This is my data frame. How do I remove the empty rows (the rows that contain '') using the 'where' clause?
code:
df.where(trim(df.word) == "").show()
output:
+----+
|word|
+----+
|    |
|    |
|    |
|    |
|    |
|    |
|    |
|    |
|    |
+----+
Any help is appreciated.
You can trim and check that the result is not empty:
>>> from pyspark.sql.functions import trim
>>> df.where(trim(df.word) != "")
Apart from where, you can also use filter to achieve this.
from pyspark.sql.functions import trim
df.filter(trim(df.word) != "").show()
df.where(trim(df.word) != "").show()
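If the column can also contain nulls (not shown in the question, so only an assumption), combine the trim check with a null check:
from pyspark.sql.functions import trim

# keep rows where word is neither null nor blank after trimming
df.filter(df.word.isNotNull() & (trim(df.word) != "")).show()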
