Athena/Presto - UNNEST MAP to columns

Assume I have a table like this:
table: qa_list
 id | question_id | question  | answer
----+-------------+-----------+--------
  1 | 100         | question1 | answer
  2 | 101         | question2 | answer
  3 | 102         | question3 | answer
  4 | ...         | ...       | ...
and a query that gives the result below (since I couldn't find a direct way to transpose the table):
table: qa_map
 id | qa_map
----+-------------------------------------------------------------
  1 | {question1=answer, question2=answer, question3=answer, ...}
Here, qa_map is the result of a map_agg over an arbitrary number of questions and answers.
Is there a way to UNNEST qa_map into an arbitrary number of columns, as shown below?
 id | Question_1 | Answer_1 | Question_2 | Answer_2 | Question_3 | ...
----+------------+----------+------------+----------+------------+----
  1 | question   | answer   | question   | answer   | question   | ...
AWS Athena/Presto-0.172

No, there is no way to write a query that results in a different number of columns depending on the data. The columns must be known before query execution starts. The map you have is as close as you are going to get.
If you include your motivation for wanting to do this, there may be other ways we can help you achieve your end goal.


How to do a rolling sum in PySpark? [duplicate]

This question already has answers here:
Python Spark Cumulative Sum by Group Using DataFrame
(4 answers)
Closed 2 years ago.
Given the column A as shown in the following example, I'd like to have a column B where each record is the sum of the current record in A and the previous record in B:
+---+---+
| A | B |
+---+---+
| 0 | 0 |
| 0 | 0 |
| 1 | 1 |
| 0 | 1 |
| 1 | 2 |
| 1 | 3 |
| 0 | 3 |
| 0 | 3 |
+---+---+
So, in a way, I'd like to take the previous record into account in my operation. I'm aware of the F.lag function, but I don't see how it can be used this way. Any ideas on how to get this done?
I'm open to rephrasing if the idea can be expressed in a better way.
It seems you're trying to do a rolling sum of A. You can do a sum over a window, e.g.
from pyspark.sql import functions as F, Window
df2 = df.withColumn('B', F.sum('A').over(Window.orderBy('ordering_col')))
But you would need a column to order by, otherwise the "previous record" is not well-defined because Spark dataframes are unordered.
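For a concrete, self-contained sketch (the ordering_col column and sample rows here are hypothetical stand-ins; substitute whatever column actually defines row order in your data):

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Sample data matching the example above; ordering_col stands in for a real ordering column.
df = spark.createDataFrame(
    [(1, 0), (2, 0), (3, 1), (4, 0), (5, 1), (6, 1), (7, 0), (8, 0)],
    ["ordering_col", "A"],
)

# Running (cumulative) sum of A, row by row, in the order given by ordering_col.
w = Window.orderBy("ordering_col").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df2 = df.withColumn("B", F.sum("A").over(w))

df2.show()  # B comes out as 0, 0, 1, 1, 2, 3, 3, 3, matching the expected column above.

Using rowsBetween makes the frame explicit; without it, rows that tie on the ordering column would all receive the same cumulative value.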

add uniqueness to a table column just for some cases in mysql using knex

I'm using MySQL. I want a column to have unique values only in some cases.
For example, the table can have the following values:
+----+---------+------+------+
| id | user_id | col1 | col2 |
+----+---------+------+------+
|  1 |       2 | no   | no   |
|  2 |       2 | no   | no   |
|  3 |       3 | no   | yes  |
|  4 |       2 | yes  | no   |
|  5 |       2 | no   | yes  |
+----+---------+------+------+
I want the no|no combination to be able to repeat for the same user, but not the yes|no combination. Is this possible in MySQL? And with knex?
My migration for that table looks like this:
return knex.schema.createTable('myTable', table => {
  table.increments('id').unsigned().primary();
  table.integer('user_id').unsigned().notNullable().references('id').inTable('table_user').onDelete('CASCADE').index();
  table.string('col1').defaultTo('yes');
  table.string('col2').defaultTo('no');
});
That doesn't seem to be an easy task. You would need a partial unique index over multiple columns.
I couldn't find any indication that MySQL supports partial indexes: https://dev.mysql.com/doc/refman/8.0/en/create-index.html
So it could be something like what is described here, but using triggers for that seems a bit of an overkill: https://dba.stackexchange.com/questions/41030/creating-a-partial-unique-constraint-for-mysql

Excel: How to use AVERAGE with INDIRECT?

I've been trying to use AVERAGE with INDIRECT, but it keeps giving me errors.
Right now I am using AVERAGE like this:
AVERAGE(Results!C2:C51)
I need to get data from another sheet, "Results". But in my current sheet I have the row range set in two cells:
    | ... |  E  |  F  |
+---+-----+-----+-----+
| 2 | ... |   2 |  51 |
| 3 | ... |  52 | 101 |
| 4 | ... |     |     |
+---+-----+-----+-----+
I've tried it like this, but it's not working:
AVERAGE(Results!INDIRECT("C"&E2):INDIRECT("C"&F2))
This should do it:
=AVERAGE(INDIRECT("Results!C"&E2&":C"&F2))
The answer posted by zipa is correct. Here is an alternative that will allow you to avoid INDIRECT() entirely:
=AVERAGE(INDEX(Results!C:C,E2):INDEX(Results!C:C,F2))
This is based on Scott Craner's answer to a question I asked previously.

PySpark getting distinct values over a wide range of columns

I have data with a large number of custom columns, the content of which I poorly understand. The columns are named evar1 to evar250. What I'd like to get is a single table with all distinct values, a count of how often each occurs, and the name of the column.
-------------------------------------------
| columnname | value            | count   |
|------------|------------------|---------|
| evar1      | en-GB            | 7654321 |
| evar1      | en-US            | 1234567 |
| evar2      | www.myclient.com |     123 |
| evar2      | app.myclient.com |     456 |
| ...        |                  |         |
The best way I can think of doing this feels terrible, as I believe I have to read this data once per column (there are actually about 400 such columns).
import pyspark.sql.functions as fn

i = 1
df_evars = None
while i <= 30:
    colname = "evar" + str(i)
    df_temp = df.groupBy(colname).agg(fn.count("*").alias("rows"))\
        .withColumn("colName", fn.lit(colname))
    if df_evars:
        df_evars = df_evars.union(df_temp)
    else:
        df_evars = df_temp
    i += 1
display(df_evars)
Am I missing a better solution?
Update
This has been marked as a duplicate, but the two responses IMO only solve part of my question.
I am looking at potentially very wide tables with potentially a large number of values. I need a simple way to get this, i.e. 3 columns that show the source column, the value, and the count of that value in the source column.
The first of the responses only gives me an approximation of the number of distinct values, which is pretty useless to me.
The second response seems less relevant than the first. To clarify, source data like this:
-------------------------
| evar1 | evar2 | ...   |
|-------|-------|-------|
| A     | A     | ...   |
| B     | A     | ...   |
| B     | B     | ...   |
| B     | B     | ...   |
| ...   |       |       |
should result in this output:
--------------------------------
| columnname | value | count   |
|------------|-------|---------|
| evar1      | A     |       1 |
| evar1      | B     |       3 |
| evar2      | A     |       2 |
| evar2      | B     |       2 |
| ...        |       |         |
Using melt borrowed from here:
from pyspark.sql.functions import col

melt(
    df.select([col(c).cast("string") for c in df.columns]),
    id_vars=[], value_vars=df.columns
).groupBy("variable", "value").count()
Adapted from the answer by user6910411.
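The melt helper itself isn't shown in the post. Here is a minimal sketch of one way to write it, using the usual explode-an-array-of-structs pattern (the parameter and output column names mirror the call above; this is an assumption about the linked implementation, not a verbatim copy):

from pyspark.sql import DataFrame, functions as F

def melt(df: DataFrame, id_vars, value_vars,
         var_name="variable", value_name="value") -> DataFrame:
    """Unpivot value_vars into (variable, value) rows, keeping id_vars."""
    # One struct per melted column: (column name, column value).
    # The cast to string in the call above ensures all structs share one schema.
    pairs = F.array(*[
        F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
        for c in value_vars
    ])
    # Explode the array so each (column name, value) pair becomes its own row.
    exploded = df.withColumn("_pair", F.explode(pairs))
    cols = list(id_vars) + [
        F.col("_pair")[name].alias(name) for name in (var_name, value_name)
    ]
    return exploded.select(*cols)

With the sample data above, melt(df, id_vars=[], value_vars=df.columns).groupBy("variable", "value").count() produces the three-column (columnname, value, count) shape, with the source column name appearing as "variable".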

Divide two "Calculated Values" within Spotfire Graphical Table

I have a Spotfire question: is it possible to divide two "calculated value" columns in a "graphical table"?
I have a Count([Type]) calculated value. I then limit the data within the second calculated value to arrive at a different Count([Type]).
I would like to divide the two in a third calculated value column, i.e.:
Calculated value column 1:
Count([Type]) = 100 (NOT LIMITED)
Calculated value column 2:
Count([Type]) = 50 (Limited to [Type]="Good")
Now I would like to say 50/100 = 0.5 in the third calculated value column.
If it is possible to do this all within one calculated value column, that is even better. Graphical tables do not let you have if statements in the custom expression; the only way is to limit data. So I am struggling; any help is appreciated.
Graphical tables do allow IF() in custom expressions. In order to accomplish this, you are going to have to move your logic away from Limit Data Using Expressions and into your expressions directly. Here are the three axis expressions you need:
Count([Type])
Count(If([Type]="Good",[Type]))
Count(If([Type]="Good",[Type])) / Count([Type])
Data Set
+----+------+
| ID | Type |
+----+------+
|  1 | Good |
|  1 | Good |
|  1 | Good |
|  1 | Good |
|  1 | Good |
|  1 | Bad  |
|  1 | Bad  |
|  1 | Bad  |
|  1 | Bad  |
|  2 | Good |
|  2 | Good |
|  2 | Good |
|  2 | Good |
|  2 | Bad  |
|  2 | Bad  |
|  2 | Bad  |
|  2 | Bad  |
+----+------+
Results
