AWS Athena query question:
I have a nested map in my rows, whose keys I would like to transpose to columns.
I could name the columns explicitly like items['label_a'], but in this case the keys are actually dynamic...
From these rows:
{id=1, items={label_a=foo, label_b=foo}}
{id=2, items={label_a=bar, label_c=bar}}
{id=3, items={label_b=baz, label_c=baz}}
I would like to get a table like so:
| id | label_a | label_b | label_c |
------------------------------------
| 1 | foo | foo | |
| 2 | bar | | bar |
| 3 | | baz | baz |
Is that possible, and how can I do this in AWS Athena (Presto version 0.172)?
Thanks!
This is not possible in a dynamic manner, because the output columns need to be known to the planner before query execution starts.
See the previous discussion here: https://github.com/prestosql/presto/issues/2448 and https://github.com/prestosql/presto/issues/1206.
I've been trying to find a formula that would count the matches between two tables (like an inner join) in Excel.
I have table1 with columns (ID, UserName, Function) and table2 with (UserName, Function, etc...). I need to count the exact matches of table1 (UserName & Function) against table2 (UserName & Function).
I tried SUMPRODUCT(--(table1[UserName:Function]=table2[UserName:Function])), but it seems to compare the tables column by column and returns an incorrect value. I also tried concatenating those columns within SUMPRODUCT, but that still doesn't work.
Is it possible to do this in one formula, or should I build a UDF with a SQL query?
Would it be possible to return the matching records and list them as an array by using the FILTERXML formula?
sample data:
table1:
| ID | UserName | Function |
| -- | -------- | ----------|
| 1 | oopz | FCA4001 |
| 2 | oopz | FCA4002 |
| 3 | arronT | FCA4001 |
table2:
| UserName | Function |
| -------- | ----------|
| randalO | FCA4001 |
| oopz | FCA4001 |
| arronT | FCA4005 |
Thanks in advance!:)
I have two pandas DataFrames:
df1 from database A with connection parameters {"host":"hostname_a","port": "5432", "dbname":"database_a", "user": "user_a", "password": "secret_a"}. The column key is the primary key.
df1:
| | key | create_date | update_date |
|---:|------:|:-------------|:--------------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 |
| 1 | 57248 | | 2018-01-21 |
| 2 | 57249 | 1992-12-22 | 2016-01-31 |
| 3 | 57250 | | 2015-01-21 |
| 4 | 57251 | 1991-12-23 | 2015-01-21 |
| 5 | 57262 | | 2015-01-21 |
| 6 | 57263 | | 2014-01-21 |
df2 from database B with connection parameters {"host": "hostname_b","port": "5433", "dbname":"database_b", "user": "user_b", "password": "secret_b"}. The column id is the primary key (these values are originally the same as the ones in the column key of df1; it's only a renaming of df1's primary key column).
df2:
| | id | create_date | update_date | user |
|---:|------:|:-------------|:--------------|:------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 | |
| 1 | 57248 | | 2018-01-21 | |
| 2 | 57249 | 1992-12-24 | 2020-10-11 | klm |
| 3 | 57250 | 2001-07-14 | 2019-21-11 | ptl |
| 4 | 57251 | 1991-12-23 | 2015-01-21 | |
| 5 | 57262 | | 2015-01-21 | |
| 6 | 57263 | | 2014-01-21 | |
Notice that rows [2] and [3] in df2 have more recent update_date values (2020-10-11 and 2019-21-11 respectively) than their counterparts in df1 (where id = key), because their create_date has been modified (by the given users).
I would like to update the rows of df1 (in concrete terms, the create_date and update_date values) where update_date in df2 is more recent than its original value in df1 (for the same primary keys).
This is how I'm tackling this for the moment, using sqlalchemy and psycopg2 + the .to_sql() method of pandas' DataFrame:
import psycopg2
from sqlalchemy import create_engine

# creator must be a callable that returns a new DBAPI connection
engine = create_engine(
    'postgresql+psycopg2://',
    creator=lambda: psycopg2.connect(**database_parameters_dictionary),
)

df1.update(df2)  # 1) maybe there is something better to do here?

with engine.connect() as connection:
    df1.to_sql(
        name="database_table_name",
        con=connection,
        schema="public",
        if_exists="replace",  # 2) maybe there is also something better to do here?
        index=True,
    )
The problem I have is that, according to the documentation, the if_exists argument can only do three things:
if_exists{‘fail’, ‘replace’, ‘append’}, default ‘fail’
Therefore, to update these two rows, I have to:
1) use the .update() method on df1 with df2 as an argument, and
2) replace the whole table inside the .to_sql() call, which means "drop + recreate".
As the tables are really large (more than 500'000 entries), I have the feeling that this would involve a lot of unnecessary work!
How could I efficiently update only those two newly updated rows? Do I have to generate custom SQL queries that compare the dates for each row and only take the ones that have really changed? Here again, I have the intuition that looping through all rows to compare the update dates will take "a lot" of time. What is the most efficient way to do that? (It would have been easier in pure SQL if the two tables were on the same host/database, but unfortunately that's not the case.)
Pandas can't do partial updates of a table, no. There is a longstanding open bug for supporting sub-whole-table-granularity updates in .to_sql(), but you can see from the discussion there that it's a very complex feature to support in the general case.
However, limiting it to just your situation, I think there's a reasonable approach you could take.
Instead of using df1.update(df2), put together an expression that yields only the changed records with their new values (I don't use pandas often so I don't know this offhand); then iterate over the resulting dataframe and build the UPDATE statements yourself (or with the SQLAlchemy expression layer, if you're using that). Then, use the connection to DB A to issue all the UPDATEs as one transaction. With an indexed PK, it should be as fast as this would ever be expected to be.
BTW, I don't think df1.update(df2) is exactly correct - from my reading, that would update all rows with any differing fields, not just those where update_date is greater than the previous update_date. But it's a moot point if the update_date values in df2 are only ever more recent than those in df1.
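Here's a minimal sketch of what I mean, reusing the names from your snippet (database_parameters_dictionary for database A's connection and database_table_name for the target table) and assuming the update_date values compare correctly as ISO-formatted strings or datetimes - a starting point rather than a finished solution:

import psycopg2
from sqlalchemy import create_engine, text

# Engine for database A (the one df1 was read from).
engine_a = create_engine(
    'postgresql+psycopg2://',
    creator=lambda: psycopg2.connect(**database_parameters_dictionary),
)

# Join the two frames on the primary key and keep only the rows where
# df2 is strictly newer than df1.
merged = df1.merge(df2.rename(columns={"id": "key"}), on="key", suffixes=("_old", "_new"))
changed = merged.loc[
    merged["update_date_new"] > merged["update_date_old"],
    ["key", "create_date_new", "update_date_new"],
]

# One parameterized UPDATE per changed row, all inside a single transaction.
stmt = text(
    "UPDATE database_table_name "
    "SET create_date = :create_date, update_date = :update_date "
    "WHERE key = :key"
)
with engine_a.begin() as connection:
    for row in changed.itertuples(index=False):
        connection.execute(stmt, {
            "key": row.key,
            "create_date": row.create_date_new,
            "update_date": row.update_date_new,
        })

With key indexed as the primary key, a handful of targeted UPDATEs like this should stay fast even on a table with 500'000+ rows, since nothing is dropped or rewritten.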
Assume I have a table like this:
table: qa_list
id | question_id | question | answer |
---------+--------------+------------+-------------
1 | 100 | question1 | answer |
2 | 101 | question2 | answer |
3 | 102 | question3 | answer |
4 | ...
... | ...
and a query that gives the result below (since I couldn't find a direct way to transpose the table):
table: qa_map
id | qa_map
--------+---------
1 | {question1=answer,question2=answer,question3=answer, ....}
where qa_map is the result of a map_agg over an arbitrary number of questions and answers.
Is there a way to UNNEST qa_map to an arbitrary number of columns as shown below?
id | Question_1 | Answer_1 | Question_2 | Answer_2 | Question_3 | ....
---------+-------------+-----------+-------------+-----------+-------------+
1 | question | answer | question | answer | question | ....
AWS Athena/Presto-0.172
No, there is no way to write a query that results in a different number of columns depending on the data. The columns must be known before query execution starts. The map you have is as close as you are going to get.
If you include your motivation for wanting to do this, there may be other ways we can help you achieve your end goal.
I'm using MySQL. I want a column to have unique values only in some cases.
For example, the table can have the following values:
+----+-----------+----------+------------+
| id | user_id | col1 | col2 |
+----+-----------+----------+------------+
| 1 | 2 | no | no |
| 2 | 2 | no | no |
| 3 | 3 | no | yes |
| 4 | 2 | yes | no |
| 5 | 2 | no | yes |
+----+-----------+----------+------------+
I want the no|no combination to be able to repeat for the same user, but not the yes|no combination. Is this possible in MySQL? And with Knex?
My migration for that table looks like this:
return knex.schema.createTable('myTable', table => {
  table.increments('id').unsigned().primary();
  table.integer('user_id').unsigned().notNullable().references('id').inTable('table_user').onDelete('CASCADE').index();
  table.string('col1').defaultTo('yes');
  table.string('col2').defaultTo('no');
});
That doesn't seem to be an easy task to do. You would need a partial unique index over multiple columns.
I couldn't find any indication that MySQL supports partial indexes: https://dev.mysql.com/doc/refman/8.0/en/create-index.html
So it could be done with something like what is described here, but using triggers for that seems a bit of an overkill: https://dba.stackexchange.com/questions/41030/creating-a-partial-unique-constraint-for-mysql
In customDimensions I have an arbitrary number of key-value pairs (currently only two, Name and Channel, as an example in the screenshot below),
and I would like to project them to columns without explicitly specifying the names of the keys, so that in the future, if a new key-value pair is added to the log, I don't have to go back and modify my query in order to display it as a new column.
Thank you!
The Kusto query language includes the bag_unpack() plugin: https://learn.microsoft.com/en-us/azure/kusto/query/bag-unpackplugin
Here's an example:
datatable(anotherColumn:int, customDimensions:dynamic)
[
    1, dynamic({"Name":"mfdg", "Channel":"wer"}),
    2, dynamic({"Name":"mfdg2", "Channel":"wer2"}),
    3, dynamic({"NotAName":2.22, "NotAChannel":7}),
]
| evaluate bag_unpack(customDimensions)
Which yields:
| anotherColumn | Name | Channel | NotAName | NotAChannel |
|---------------|-------|---------|----------|-------------|
| 1 | mfdg | wer | | |
| 2 | mfdg2 | wer2 | | |
| 3 | | | 2.22 | 7 |