AWS Athena / Presto - nest two tables

I'd like to use AWS Athena to nest two Parquet tables such that:
Table A
|document_id| name|
+-----------+-----+
|          1|  aaa|
|          2|  bbb|

Table B
| topic_id| name|document_id|
+---------+-----+-----------+
|        1|  xxx|          1|
|        2|  yyy|          2|
|        3|  zzz|          2|
Nest table B into table A to get something like
[
  {
    "document_id": 1,
    "name": "aaa",
    "topics": [
      {
        "topic_id": 1,
        "name": "xxx"
      }
    ]
  },
  {
    "document_id": 2,
    "name": "bbb",
    "topics": [
      {
        "topic_id": 2,
        "name": "yyy"
      },
      {
        "topic_id": 3,
        "name": "zzz"
      }
    ]
  }
]
Is it possible? Any ideas?
Note: I got the example from this Stack Overflow thread

This query should give you the result you're after:
SELECT
  a.document_id,
  ARBITRARY(a.name) AS name,
  ARRAY_AGG(
    CAST(
      ROW(topic_id, b.name)
      AS ROW(topic_id INTEGER, name VARCHAR)
    )
  ) AS topics
FROM table_a a
LEFT JOIN table_b b USING (document_id)
GROUP BY a.document_id
If you run that query and convert the result to a JSON array, you should get the desired result from your question.
The core of the solution is to build up topic structs with ROW and aggregate these structs for each document. This is a confusing detail with Athena: DDL statements use Hive SQL, where there's a type called struct, while queries use Presto SQL, where the equivalent concept is ROW, and it doesn't help that the names of the integer and string types are also different. In DDL the type would be struct<topic_id:int,name:string>, but in DML it's ROW(topic_id INTEGER, name VARCHAR).
I'm using ARBITRARY for the document name, but grouping by both document ID and name works too; the result will be the same.
When I run the query above on the data from your question I get this result:
document_id | name | topics
------------+------+--------------------------------------------------
          1 | aaa  | [{topic_id=1, name=xxx}]
          2 | bbb  | [{topic_id=3, name=zzz}, {topic_id=2, name=yyy}]
And if you read that result set as JSON, you should get exactly what you posted, modulo ordering.

I think it's possible using map_concat and array_agg - but I needed to cast topic_id as varchar:
with intermediate as
(
  select a.document_id, a.name
    , MAP(array['topic_id'], array[cast(b.topic_id as varchar)]) topic_map
    , MAP(array['name'], array[b.name]) name_map
  from table_a a
  left join table_b b
    on a.document_id = b.document_id
)
select i.document_id, i.name
  , array_agg(cast(map_concat(topic_map, name_map) as json)) topics
from intermediate i
group by i.document_id, i.name;

Related

Prisma - Update one resource under multiple conditions

I would like to know if it is possible to update a resource under multiple conditions.
For example, consider the following two tables:
+----------+----------+
| Table1   | Table2   |
+----------+----------+
| id       | id       |
| param1T1 | param1T2 |
| param2T1 | param2T2 |
| idTable2 |          |
+----------+----------+
I would like to update a record in table Table1 where I know the field id (the primary key) AND where idTable2 has a specific value.
Using the .update() method I get a compile-time error, but I don't understand why I can't update a single entity using more than one condition.
This is my update:
table1.update({
  where: {
    AND: [
      { id: 1 },
      { Table2: { id: 1 } }
    ]
  },
  data: toUpdate
});
For now, I have worked around the problem by using .updateMany() instead of .update(). Is there a proper solution, or is it not possible?

How to get the COUNT of records returned by measures in a JSON query of CubeJS

I want to get a count of the records returned by the JSON query of CubeJS below.
{
  "measures": [
    "Employee.salaryTotal"
  ],
  "timeDimensions": [
    {
      "dimension": "Employee.createdat"
    }
  ],
  "dimensions": [
    "Employee.isactive"
  ],
  "filters": []
}
This JSON query generates the SQL below in CubeJS:
SELECT
  `employee`.`isActive` `employee__isactive`,
  SUM(salary) `employee__salary_total`
FROM
  DEMO.Employee AS `employee`
GROUP BY
  1
ORDER BY
  2 DESC
LIMIT
  10000
Output of SQL is:
+--------------------+------------------------+
| employee__isactive | employee__salary_total |
+--------------------+------------------------+
| Y                  |                  17451 |
| N                  |                   1249 |
+--------------------+------------------------+
But what if I want the count of records returned by the above SQL?
For example:
SELECT COUNT(*) FROM
  (SELECT
    `employee`.`isActive` `employee__isactive`,
    SUM(salary) `employee__salary_total`
  FROM
    DEMO.Employee AS `employee`
  GROUP BY
    1
  ORDER BY
    2 DESC
  LIMIT
    10000) AS EMPSAL
Expected result should be like this:
+------------+
| # COUNT(*) |
+------------+
|          2 |
+------------+
Currently there's no way to perform a COUNT operation on top of the final Cube.js query. If you're trying to implement paging functionality, you can use the offset and limit query parameters and check whether there's a next page by comparing the limit with the number of rows returned: https://cube.dev/docs/query-format#query-properties.
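For example, a paged variant of the query from the question might look like the sketch below (the page size of 100 is only illustrative); if fewer than limit rows come back, there is no next page:
{
  "measures": ["Employee.salaryTotal"],
  "dimensions": ["Employee.isactive"],
  "limit": 100,
  "offset": 0
}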

Maintain the sequence order of the key from a MapType when extracting the key value

How do I maintain the same sequence order of the keys from a MapType when extracting the key values? The data is loaded from an Avro file, and its structure is reproduced below.
from pyspark.sql.functions import explode

df = spark.createDataFrame(
    [
        (
            "a-key",
            {"FName": "John", "LName": "Citizen", "Age": "30", "Mobile": "00000000"},
            "John"
        )
    ],
    ["somekey", "metadata", "name"]
)
df.select(explode(df.metadata)).show()
I believe the out-of-sequence order of the keys is due to partitioning:
+------+--------+
|   key|   value|
+------+--------+
| LName| Citizen|
|Mobile|00000000|
| FName|    John|
|   Age|      30|
+------+--------+
I am expecting the output below, i.e. the same sequence as defined in the DataFrame:
+------+--------+
|   key|   value|
+------+--------+
| FName|    John|
| LName| Citizen|
|   Age|      30|
|Mobile|00000000|
+------+--------+
The change in order is due to the Python dict type. A Python dictionary is not an ordered object, so the data are sent to Spark in an order that may differ from the one you wanted.
If you read a file or a table directly with Spark, this issue should not appear.
Even so, explode does not respect the order in the map, so you need to use the posexplode function.
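A minimal sketch of that suggestion, reusing the df from the question (for data built from a Python dict the order is already scrambled before Spark sees it, so pos only reflects the order in which the map is stored):

from pyspark.sql.functions import posexplode

# posexplode also returns each entry's position in the map,
# so sorting on it keeps the map's order stable through the explode.
(
    df.select(posexplode(df.metadata).alias("pos", "key", "value"))
      .orderBy("pos")
      .select("key", "value")
      .show()
)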

How to add multiple empty columns to a PySpark Dataframe at specific locations

I have researched this a lot but I am unable to find a way to add multiple columns to a PySpark DataFrame at specific positions.
I have a DataFrame that looks like this:
Customer_id First_Name Last_Name
I want to add 3 empty columns at 3 different positions, and my final resulting DataFrame needs to look like this:
Customer_id Address First_Name Email_address Last_Name Phone_no
Is there an easy way around it, like you can do with reindex in pandas?
# Creating a DataFrame.
from pyspark.sql.functions import col, lit

df = sqlContext.createDataFrame(
    [('1', 'Moritz', 'Schulz'), ('2', 'Sandra', 'Schröder')],
    ('Customer_id', 'First_Name', 'Last_Name')
)
df.show()
+-----------+----------+---------+
|Customer_id|First_Name|Last_Name|
+-----------+----------+---------+
|          1|    Moritz|   Schulz|
|          2|    Sandra| Schröder|
+-----------+----------+---------+
You can use the lit() function to add empty columns, and once they are created you can use select to reorder the columns in the order you wish.
df = df.withColumn('Address', lit(''))\
       .withColumn('Email_address', lit(''))\
       .withColumn('Phone_no', lit(''))\
       .select(
           'Customer_id', 'Address', 'First_Name',
           'Email_address', 'Last_Name', 'Phone_no'
       )
df.show()
+-----------+-------+----------+-------------+---------+--------+
|Customer_id|Address|First_Name|Email_address|Last_Name|Phone_no|
+-----------+-------+----------+-------------+---------+--------+
|          1|       |    Moritz|             |   Schulz|        |
|          2|       |    Sandra|             | Schröder|        |
+-----------+-------+----------+-------------+---------+--------+
As suggested by user @Pault, a more concise and succinct way:
df = df.select(
    "Customer_id", lit('').alias("Address"), "First_Name",
    lit("").alias("Email_address"), "Last_Name", lit("").alias("Phone_no")
)
df.show()
+-----------+-------+----------+-------------+---------+--------+
|Customer_id|Address|First_Name|Email_address|Last_Name|Phone_no|
+-----------+-------+----------+-------------+---------+--------+
|          1|       |    Moritz|             |   Schulz|        |
|          2|       |    Sandra|             | Schröder|        |
+-----------+-------+----------+-------------+---------+--------+
If you want something even more succinct, which I feel is shorter:
from pyspark.sql import functions as F

for col in ["mycol1", "mycol2", "mycol3", "mycol4", "mycol5", "mycol6"]:
    df = df.withColumn(col, F.lit(None))
You can then select the same list of columns to set the order (see the sketch after the next snippet).
(edit) Note: withColumn in a for loop is usually quite slow. Don't do that for a large number of columns; prefer a select statement, like:
select_statement = []
for col in ["mycol1", "mycol2", "mycol3", "mycol4", "mycol5", "mycol6"]:
    select_statement.append(F.lit(None).alias(col))
df = df.select(*df.columns, *select_statement)
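As a small sketch of that reordering step, assuming the empty columns were created with the names from the question rather than mycol1..mycol6:

# Illustrative target layout taken from the question.
desired_order = [
    "Customer_id", "Address", "First_Name",
    "Email_address", "Last_Name", "Phone_no",
]
df = df.select(*desired_order)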

Why does PySpark SQL not count correctly with a group by clause?

I load a Parquet file into the SQL context like this:
sqlCtx = SQLContext(sc)
rdd_file = sqlCtx.read.parquet("hdfs:///my_file.parquet")
rdd_file.registerTempTable("type_table")
Then I run this simple query:
sqlCtx.sql('SELECT count(name), name from type_table group by name order by count(name)').show()
The result:
+-----------+----+
|count(name)|name|
+-----------+----+
|          0|null|
|     226307|   x|
+-----------+----+
However, if I use groupBy on the result set, I get a different result:
sqlCtx.sql("SELECT name FROM type_table").groupBy("name").count().show()
+----+------+
|name| count|
+----+------+
|   x|226307|
|null|586822|
+----+------+
The count of x is the same for the two methods, but the null count is quite different. It seems like the SQL statement doesn't count nulls correctly with group by. Can you point out what I did wrong?
Thanks,
count(name) will exclude null values; if you use count(*) it will count the null rows as well.
Try the query below.
sqlCtx.sql('SELECT count(*), name from type_table group by name order by count(*)').show()
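If it helps to see the same distinction through the DataFrame API, here is a small sketch (assuming the type_table temp table registered in the question); a count over a constant behaves like COUNT(*) and counts every row, while count("name") skips nulls:

from pyspark.sql import functions as F

df = sqlCtx.table("type_table")
df.groupBy("name").agg(
    F.count(F.lit(1)).alias("count_all"),   # behaves like COUNT(*): counts rows with null name too
    F.count("name").alias("count_name")     # excludes rows where name is null
).orderBy("count_all").show()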
