AWS Athena working with nested arrays, trying to search for a field within the array - presto

I have a sql query:
SELECT id_str, entities.hashtags
FROM tweets, unnest(entities.hashtags) as t(hashtag)
WHERE cardinality(entities.hashtags)=2 and id_str='1248585590573948928'
limit 5
which returns:
id_str hashtags
1248585590573948928 [{text=LUCAS, indices=[75, 81]}, {text=WayV, indices=[83, 88]}]
1248585590573948928 [{text=LUCAS, indices=[75, 81]}, {text=WayV, indices=[83, 88]}]
The unnesting returned the original single row twice because the array contains 2 objects.
The next thing I wanted to add to the SQL query was
hashtag['text'] as htag in the existing select list, which should still return 2 rows, but this time with LUCAS and WayV in separate rows of a single column named htag.
But I get this error. Any idea what I am doing wrong?
Your query has the following error(s):
SYNTAX_ERROR: line 1:8: '[]' cannot be applied to row(text varchar,indices array(bigint)), varchar(4)
I assume it is because I have another array within this array?
Thanks in advance

I'm not entirely sure where you're adding the hashtag['text'] expression, so I can't say with confidence what your problem is, but I have two suggestions for you to try:
The error says that hashtag is of type row(text varchar, …), which suggests that hashtag.text should work.
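If that doesn't work, you can try using element_at, e.g. element_at(hashtag, 'text'). Note that element_at is defined for MAP and ARRAY values, so it only helps if hashtag turns out to be a map rather than a row.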

I came across this issue as well, and since no solution has been provided I'd like to chip in:
After you unnest an array of rows, you can address a field of the result with a . reference instead of ['']:
WITH dataset AS (
SELECT ARRAY[
CAST(ROW('Bob', 38) AS ROW(name VARCHAR, age INTEGER)),
CAST(ROW('Alice', 35) AS ROW(name VARCHAR, age INTEGER)),
CAST(ROW('Jane', 27) AS ROW(name VARCHAR, age INTEGER))
] AS users
)
SELECT
user,
user.name
FROM dataset
cross join unnest (users) as t(user)
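
As a rough illustration of why the dot reference works, here is a plain-Python model of CROSS JOIN UNNEST applied to the hashtags array from the question. It is only a sketch of the semantics, not Athena code: the namedtuple stands in for the Presto ROW type, and the names tweets and unnest_hashtags are made up for this example.

```python
from collections import namedtuple

# Stand-in for row(text varchar, indices array(bigint)).
Hashtag = namedtuple("Hashtag", ["text", "indices"])

# One tweet with a two-element hashtags array, as in the question.
tweets = [
    {
        "id_str": "1248585590573948928",
        "hashtags": [Hashtag("LUCAS", [75, 81]), Hashtag("WayV", [83, 88])],
    }
]


def unnest_hashtags(rows):
    """Yield one (id_str, htag) pair per array element, like UNNEST does."""
    for row in rows:
        for hashtag in row["hashtags"]:
            # Fields of a row are read by name (hashtag.text),
            # not with a string subscript such as hashtag['text'].
            yield row["id_str"], hashtag.text


result = list(unnest_hashtags(tweets))
print(result)
```

Much like a Presto ROW, the namedtuple rejects hashtag['text'] (Python raises a TypeError), while hashtag.text returns the field.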


mariadb python - executemany using SELECT

I'm trying to insert many rows into a table in MariaDB.
To do this I want to use executemany() to increase speed.
Each inserted row depends on a value from another table, which is looked up with a SELECT.
I have found statements saying that SELECT doesn't work in executemany().
Are there other ways to solve this problem?
import mariadb
connection = mariadb.connect(host=HOST,port=PORT,user=USER,password=PASSWORD,database=DATABASE)
cursor = connection.cursor()
query="""INSERT INTO [db].[table1] ([col1], [col2] ,[col3])
VALUES ((SELECT [colX] from [db].[table2] WHERE [colY]=? and
[colZ]=(SELECT [colM] from [db].[table3] WHERE [colN]=?)),?,?)
ON DUPLICATE KEY UPDATE
[col2]= ?,
[col3] =?;"""
values=[input_tuplets]
When running the code, I get the same value for [col1] (the SELECT statement) in every row, and it corresponds to the values from the first tuple.
If SELECT doesn't work in executemany(), is there another workaround for what I'm trying to do?
Thanks a lot!
I think the way to go is to read out the tables needed,
do the lookup in Python,
and then use executemany() to insert all the data.
It will require 2 more queries (to read the tables) but will be OK when it comes to calculation time.
Thanks for your first question on stackoverflow which identified a bug in MariaDB Server.
Here is a simple script to reproduce the problem:
CREATE TABLE t1 (a int);
CREATE TABLE t2 LIKE t1;
INSERT INTO t2 VALUES (1),(2);
Python:
>>> cursor.executemany("INSERT INTO t1 VALUES \
((SELECT a FROM t2 WHERE a=?))", [(1,),(2,)])
>>> cursor.execute("SELECT a FROM t1")
>>> cursor.fetchall()
[(1,), (1,)]
I have filed an issue in MariaDB Bug tracking system.
As a workaround, I would suggest reading the country table once into a dictionary (according to Wikipedia there are 195 different countries) and using these values instead of a subquery.
e.g.
countries = {}
cursor.execute("SELECT country, id FROM countries")
for row in cursor:
    countries[row[0]] = row[1]
and then in executemany
cursor.executemany("INSERT INTO region (region,id_country) values ('south', ?)", [(countries["fra"],), (countries["ger"],)])
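
The workaround above can be sketched end to end as follows. To keep the example runnable without a server, it uses Python's built-in sqlite3 module as a stand-in for the mariadb connector (an assumption of this sketch; both accept ? placeholders here). The table contents and the wanted list are hypothetical sample data.

```python
import sqlite3  # stand-in for the mariadb connector so the sketch runs anywhere

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema and sample data mirroring the tables in the answer.
cur.execute("CREATE TABLE countries (id INTEGER PRIMARY KEY, country TEXT)")
cur.execute("CREATE TABLE region (region TEXT, id_country INTEGER)")
cur.executemany(
    "INSERT INTO countries (id, country) VALUES (?, ?)",
    [(1, "fra"), (2, "ger")],
)

# Read the lookup table once into a dict: country code -> id.
cur.execute("SELECT country, id FROM countries")
countries = {country: country_id for country, country_id in cur.fetchall()}

# Resolve the ids in Python, then bulk-insert plain values (no subquery).
wanted = [("south", "fra"), ("north", "ger")]
params = [(region, countries[code]) for region, code in wanted]
cur.executemany("INSERT INTO region (region, id_country) VALUES (?, ?)", params)
conn.commit()

cur.execute("SELECT region, id_country FROM region ORDER BY id_country")
rows = cur.fetchall()
print(rows)
```

Because every parameter tuple already carries the resolved id, the statement passed to executemany() contains no SELECT, so the behaviour described in the bug report does not come into play.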

extract array of arrays in presto

I have a table in Athena (presto) with just one column named individuals and this is the type of the column:
array(row(individual_id varchar, ids array(row(type varchar, value varchar, score integer))))
I want to extract value from inside the ids and return them as a new array. As an example:
[{individual_id=B1Q, ids=[{type=H, value=efd3, score=1}, {type=K, value=NpS, score=1}]}, {individual_id=D6n, ids=[{type=A, value=178, score=6}, {type=K, value=NuHV, score=8}]}]
and I want to return
ids
[efd3, NpS, 178, NuHV]
I tried multiple solutions like
select * from "test"
CROSS JOIN UNNEST(individuals.ids.value) AS t(i)
but it always returns
Expression individuals is not of type ROW
select
array_agg(ids.value)
from test
cross join unnest(test.individuals) t(ind)
cross join unnest(ind.ids) t(ids)
result:
[efd3, NpS, 178, NuHV]
That will return all the id values as one row, which may or may not be what you want.
If you want to return an array of values per individual_id:
select
ind.individual_id,
array_agg(ids.value)
from test
cross join unnest(test.individuals) t(ind)
cross join unnest(ind.ids) t(ids)
group by
ind.individual_id
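
To make the two levels of unnesting concrete, here is a plain-Python model of both queries, using the sample value from the question. It only illustrates the logic and is not Athena code; the variable names are chosen for this sketch.

```python
from collections import defaultdict

# The sample "individuals" array from the question, as Python dicts.
individuals = [
    {"individual_id": "B1Q", "ids": [
        {"type": "H", "value": "efd3", "score": 1},
        {"type": "K", "value": "NpS", "score": 1},
    ]},
    {"individual_id": "D6n", "ids": [
        {"type": "A", "value": "178", "score": 6},
        {"type": "K", "value": "NuHV", "score": 8},
    ]},
]

# First query: unnest individuals, unnest each ids array, collect every value.
all_values = [entry["value"] for ind in individuals for entry in ind["ids"]]

# Second query: same unnesting, but aggregated per individual_id (GROUP BY).
grouped = defaultdict(list)
for ind in individuals:
    for entry in ind["ids"]:
        grouped[ind["individual_id"]].append(entry["value"])
by_individual = dict(grouped)

print(all_values)
print(by_individual)
```

The Python lists keep the input order. In SQL, array_agg without an ORDER BY clause does not promise any particular element order, so treat the order of the aggregated array as unspecified.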

RedShift Correlated Sub-query

Need your help. I am trying to run the SQL query below in Redshift, but I get the error message "Invalid operation: This type of correlated subquery pattern is not supported yet"
SELECT
Comp_Key,
Comp_Reading_Key,
Row_Num,
Prev_Reading_Date,
( SELECT MAX(X) FROM (
SELECT CAST(dateadd(day, 1, Prev_Reading_Date) AS DATE) AS X
UNION ALL
SELECT dim_date.calendar_date
) a
) as start_dt
FROM stage5
JOIN dim_date ON calendar_date BETWEEN '2020-04-01' and '2020-04-15'
WHERE Comp_Key =50906055
The same query works fine in SQL Server. Could you please help me to run it in RedShift?
Regards,
Kiru
Kiru - you need to convert the correlated query into a join structure. Not knowing the data content of your tables and the exact expected output, I'm just guessing, but here's a rough attempt:
SELECT
Comp_Key,
Comp_Reading_Key,
Row_Num,
Prev_Reading_Date,
Max_X
FROM stage5
JOIN dim_date ON calendar_date BETWEEN '2020-04-01' and '2020-04-15'
JOIN ( SELECT MAX(X) as Max_X, MAX(calendar_date) as date FROM (
SELECT CAST(dateadd(day, 1, Prev_Reading_Date) AS DATE) AS X FROM stage5
cross join
SELECT dim_date.calendar_date from dim_date
) a
) as start_dt ON a.date = dim_date.calendar_date
WHERE Comp_Key =50906055
This is just a starting guess but might get you started.
However, you are likely better off rewriting this query to use window functions as they are the fastest way to perform these types of looping queries in Redshift.
Thanks Bill. It won't work in Redshift as it still has a correlated sub-query.
However, I have rewritten the query another way and it works fine.
I am closing the ticket.
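
For readers landing here, it may help to see what the scalar subquery in the question computes: for each joined row it takes the maximum of two values, Prev_Reading_Date plus one day and calendar_date. The Python sketch below models only that row-wise logic with hypothetical sample dates; the asker did not post the final rewrite, so this is not their solution.

```python
from datetime import date, timedelta


def start_dt(prev_reading_date, calendar_date):
    """Later of (previous reading date + 1 day) and the calendar date."""
    return max(prev_reading_date + timedelta(days=1), calendar_date)


# Hypothetical previous reading date for one stage5 row.
prev = date(2020, 4, 5)

# dim_date rows for 2020-04-01 through 2020-04-15, as in the JOIN condition.
calendar = [date(2020, 4, 1) + timedelta(days=i) for i in range(15)]

starts = [start_dt(prev, day) for day in calendar]
print(starts[0], starts[-1])
```

Seen this way, the subquery is a two-argument maximum evaluated per row. Redshift provides a GREATEST function for that kind of expression, so one possible rewrite (untested against the asker's tables, and an assumption of this note) replaces the subquery with GREATEST over the same two expressions.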

cassandra : name provided was not in the list of valid column labels error

I'm using Cassandra 1.2.8. I have a column family like the one below:
CREATE TABLE word_probability (
word text,
category text,
probability double,
PRIMARY KEY (word,category)
);
When I use a query like this:
String query = "SELECT * FROM word_probability WHERE word='%s' AND category='%s';";
it works well, but for some words I get this message:
name provided was not in the list of valid column labels error
Everything looks OK and I don't know why I get this error :(
You're not doing anything wrong except mixing up CQL with SQL. CQL doesn't support % wildcards.

Cannot link MS Access query with subquery

I have created a query with a subquery in Access, and cannot link it in Excel 2003: when I use the menu Data -> Import External Data -> Import Data... and select the mdb file, the query is not present in the list. If I use the menu Data -> Import External Data -> New Database Query..., I can see my query in the list, but at the end of the import wizard I get this error:
Too few parameters. Expected 2.
My guess is that the query syntax is causing the problem, in fact the query contains a subquery. So, I'll try to describe the query goal and the resulting syntax.
Table Positions
ID (Autonumber, Primary Key)
position (double)
currency_id (long) (references Currency.ID)
portfolio (long)
Table Currency
ID (Autonumber, Primary Key)
code (text)
Query Goal
Join the 2 tables
Filter by portfolio = 1
Filter by currency.code in ("A", "B")
Group by currency and calculate the sum of the positions for each currency group and call the result: sumOfPositions
Calculate abs(sumOfPositions) on each currency group
Calculate the sum of the previous results as a single result
Query
The query without the final sum can be created using the Design View. The resulting SQL is:
SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")));
In order to calculate the final SUM I did the following (in the SQL View):
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
So, the question is: is there a better way for structuring the query in order to make the export work?
I can't see too much wrong with it, but I would take out some of the junk Access puts in and scale down the query to this, hopefully this should run ok:
SELECT Sum(Abs(A.SumOfPosition)) As SumAbs
FROM (SELECT C.code, Sum(P.position) AS SumOfposition
FROM Currency As C INNER JOIN Positions As P ON C.ID = P.currency_id
WHERE P.portfolio=1
GROUP BY C.code
HAVING C.code In ("A","B")) As A
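
To check that the simplified query computes the intended figure, it can be tried against a small sample outside Access. The sketch below uses Python's built-in sqlite3 module purely as a stand-in engine (an assumption of this example, not the Access/Excel setup from the question), with hypothetical sample rows. String literals use single quotes here because SQLite can treat double-quoted tokens as identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical sample data following the table layout in the question.
cur.executescript("""
CREATE TABLE Currency (ID INTEGER PRIMARY KEY, code TEXT);
CREATE TABLE Positions (ID INTEGER PRIMARY KEY, position REAL,
                        currency_id INTEGER, portfolio INTEGER);
INSERT INTO Currency (ID, code) VALUES (1, 'A'), (2, 'B'), (3, 'C');
INSERT INTO Positions (position, currency_id, portfolio) VALUES
    (10.0, 1, 1), (-4.0, 1, 1), (-7.5, 2, 1), (100.0, 3, 1), (50.0, 1, 2);
""")

# Inner query: sum per currency for portfolio 1, codes A and B only.
# Outer query: sum of the absolute values of those per-currency sums.
cur.execute("""
SELECT Sum(Abs(A.SumOfPosition)) AS SumAbs
FROM (SELECT C.code, Sum(P.position) AS SumOfPosition
      FROM Currency AS C INNER JOIN Positions AS P ON C.ID = P.currency_id
      WHERE P.portfolio = 1
      GROUP BY C.code
      HAVING C.code IN ('A', 'B')) AS A
""")
sum_abs = cur.fetchone()[0]
print(sum_abs)
```

With this sample, currency A sums to 6.0 and currency B to -7.5, so the outer query returns 13.5; the row for currency C and the row in portfolio 2 are filtered out.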
It might be worth trying to declare your parameters in the MS Access query definition and define their datatypes. This is especially important when you are trying to use the query outside of MS Access itself, since it can't auto-detect the parameter types. This approach is sometimes hit or miss, but worth a shot.
PARAMETERS [[Positions].[portfolio]] Long, [[Currency].[code]] Text ( 255 );
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
I have solved my problems thanks to the fact that the outer query is doing a trivial sum. When choosing New Database Query... in Excel, at the end of the process, after pressing Finish, an Import Data form pops up, asking
Where do you want to put the data?
There you can click on Create a PivotTable report... . If you define the PivotTable properly, Excel will display only the outer sum.
