What is the best way to paginate Redshift query results?
SELECT * FROM my_table ORDER BY N LIMIT N OFFSET N
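For context, a minimal sketch of driving that LIMIT/OFFSET template from client code, assuming a psycopg2 connection (Redshift speaks the PostgreSQL wire protocol); the host, credentials, table, and sort column below are placeholders:

import psycopg2

# Placeholder connection details; Redshift accepts standard PostgreSQL drivers.
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="user", password="...")

page_size = 1000
offset = 0
with conn.cursor() as cur:
    while True:
        # A stable ORDER BY is required for OFFSET paging to be deterministic.
        cur.execute(
            "SELECT * FROM my_table ORDER BY sort_col LIMIT %s OFFSET %s",
            (page_size, offset),
        )
        rows = cur.fetchall()
        if not rows:
            break
        # ... process this page of rows ...
        offset += page_size

Note that OFFSET pagination gets slower for deep pages, because Redshift still has to read and discard all of the skipped rows.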
Related
As I am just starting out in the Big Data field, I am looking for advice on the most efficient way to get some data into Spark in order to analyze it.
The SQL query is rather large, with multiple sub-queries, each with its own "where", "group by", etc.
The final data would have somewhere between 1 million and 20 million rows.
Is it the same thing (performance-wise) if I run a Spark SQL query and save it into a DataFrame using PySpark, or if I extract each subquery into a different Spark DataFrame and use Spark to do the grouping / filtering / etc.?
For example, are these two methods equivalent in the amount of resources / time they use to process my data?
method 1:
df_final = spark.sql("""
WITH subquery1 AS (...),
     subquery2 AS (...),
     subquery3 AS (...),
     ...
SELECT * FROM subqueryN
""")
method 2:
df1 = spark.sql("subquery 1")
df2 = spark.sql("subquery 2")
...
df_final = ...  # Spark manipulation of df1, df2, ... here
I would appreciate any advice. Thanks
Spark will create a DAG which should be equivalent in both cases, so performance should be the same.
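If you want to check this yourself, you can compare the plans Spark produces for both styles; a minimal sketch, assuming an existing SparkSession named spark and a placeholder table some_table:

from pyspark.sql import functions as F

# method 1: one SQL statement with a CTE
df_sql = spark.sql("""
    WITH sub AS (SELECT id, value FROM some_table WHERE value > 0)
    SELECT id, SUM(value) AS total FROM sub GROUP BY id
""")

# method 2: the same logic built step by step from DataFrames
sub = spark.table("some_table").where(F.col("value") > 0).select("id", "value")
df_df = sub.groupBy("id").agg(F.sum("value").alias("total"))

# The optimized and physical plans should come out essentially identical.
df_sql.explain(True)
df_df.explain(True)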
There is a 20 GB table in Parquet format, and a simple query will return results by scanning only 1 GB of data.
select columns from table1 where id in (id1, id2, idn)
If the same query is executed with a sub-query such as
select columns from table1 where id in (select id from table2 limit n)
this query will return results by scanning 20 GB, the whole table, even if n is a very small number such as 10, 50 or 5000.
The same happens with a LEFT JOIN.
SELECT table1.* FROM
table2 LEFT JOIN table1
ON table2.id=table1.id
Is there a way to achieve this by running a single query, instead of fetching and saving the result of the sub-query and passing it as arguments into another query?
Are there any best practices for how users currently run a LEFT JOIN or sub-query on Athena without a full table scan?
Similar questions: Question 1, Question 2
Is there a way to achieve this by running a single query, instead of fetching and saving the result of the sub-query and passing it as arguments into another query?
This is most commonly covered by "Dynamic filtering".
Currently there is no way to do this.
Athena is based on Presto and Presto doesn't support dynamic filtering yet, but will likely support it in the next release (Presto 321). You can track the issue here: https://github.com/prestosql/presto/issues/52
Athena is based on Presto 0.172 currently, so it still needs to upgrade.
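Until then, the usual workaround is the two-step approach the question already describes: run the sub-query first, then inline its result into the outer query so it behaves like the literal IN example above. A rough sketch, assuming the pyathena DB-API driver; the bucket, region, and table names are placeholders, and inlining literals like this is only reasonable for trusted, numeric ids:

from pyathena import connect

conn = connect(s3_staging_dir="s3://my-athena-results/",  # placeholder bucket
               region_name="us-east-1")                   # placeholder region
cur = conn.cursor()

# Step 1: run the small sub-query and collect the ids.
cur.execute("SELECT id FROM table2 LIMIT 50")
ids = [row[0] for row in cur.fetchall()]

# Step 2: inline the ids as literals so Athena only reads the matching data
# from table1 instead of scanning the whole 20 GB.
id_list = ", ".join(str(i) for i in ids)
cur.execute("SELECT * FROM table1 WHERE id IN ({})".format(id_list))
rows = cur.fetchall()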
I have a query in spark-sql with a lot of values in the IN clause:
select * from table where x in (<long list of values>)
When I run this query I get a TransportException from the MetastoreClient in Spark.
Column x is the partition column of the table. The Hive metastore is on Oracle.
Is there a hard limit on how many values can be in the IN clause?
Or can I maybe set the timeout value higher to give the metastore more time to answer?
Yes, you can pass up to 1,000 values inside an IN clause.
However, you can combine multiple IN clauses with the OR operator, slicing the list of values into chunks of 1,000.
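A small sketch of that slicing, assuming a list of literal values and an existing SparkSession named spark; the table and column names are placeholders:

def build_in_predicate(column, values, chunk_size=1000):
    # Split the values into chunks of at most chunk_size and join the
    # resulting IN clauses with OR.
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return " OR ".join(
        "{} IN ({})".format(column, ", ".join(str(v) for v in chunk))
        for chunk in chunks
    )

long_list = list(range(5000))  # placeholder for the real list of values
query = "SELECT * FROM table WHERE " + build_in_predicate("x", long_list)
df = spark.sql(query)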
Say I have an RDBMS table with 10,000 records which has a column (pk_key) whose values are a sequence from 1 to 10,000. I am planning to read it via Spark.
I am planning to split it into 10 partitions.
So in the DataFrameReader jdbc method, my columnName will be "pk_key" and numPartitions will be 10.
What should the lowerBound and upperBound be for these?
PS: My actual record count is much higher; I just need to understand how it works.
Do you have a natural key? It may be non-unique. It's hard to hard-code lowerBound and upperBound as fixed Long values, since they will differ from day to day.
One thing you can do is to run two queries:
select min(pk_key) from table;
select max(pk_key) from table;
via a normal JDBC connection. The first query will return the lowerBound, the second one the upperBound.
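Putting it together for the 10,000-row example, the read would look roughly like this in PySpark; the JDBC URL, credentials, and table name are placeholders:

url = "jdbc:postgresql://db-host:5432/mydb"   # placeholder JDBC URL
props = {"user": "user", "password": "password"}

# Fetch the bounds with a single-row query pushed down to the database.
bounds = spark.read.jdbc(
    url, "(SELECT MIN(pk_key) AS lo, MAX(pk_key) AS hi FROM my_table) b",
    properties=props,
).collect()[0]

# lowerBound/upperBound only control how the pk_key range is split into
# numPartitions ranges; rows outside the bounds are still read.
df = spark.read.jdbc(
    url, "my_table",
    column="pk_key",
    lowerBound=bounds["lo"],   # 1 in the 10,000-row example
    upperBound=bounds["hi"],   # 10,000 in the example
    numPartitions=10,
    properties=props,
)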
Given a simple table with the columns
id (partition key), timestamp (clustering column) and value (a long),
what's the best way to get the sum of values for each id? I'd try to select all distinct ids in one query and then use this list of ids to run a query for each id:
SELECT sum(value) FROM mytable WHERE id = ?
Unfortunately, I can't figure out how to write the Spark job, and I am not really sure this is the best way. This is how far I got:
sc.cassandraTable("mykeyspace", "mytable")
.select("select distinct id")
.select("select sum(value)")
.where("id=?", ???)
Any hints on how I should proceed would be really appreciated.
Edit: Also, here is a working example of how I currently do the aggregation: https://gist.github.com/Phil-Ba/72a7e762c8ab1ff1f3c9e8cff92cb223#file-cassandrasum-scala
The performance is lackluster though :/
This is called a group by.
It can be achieved with SQL:
select sum(value) from mytable group by id
It can also be achieved with a function call in Spark:
import org.apache.spark.sql.functions._
import sqlContext.implicits._ // needed for the $"column" syntax

val df = sqlContext.table("mytable")
df.groupBy("id").agg(sum($"value"))