Computing the size of a derived table in Spark SQL query - apache-spark

Is it possible to approximate the size of a derived table (in KB/MB/GB etc.) in a Spark SQL query? I don't need the exact size; an approximate value will do. It would allow me to plan my queries better by determining whether a table is small enough to broadcast in a join, or whether a filtered subquery in a join would be better than using the entire table, etc.
For example, in the following query, is it possible to approximate the size (in MB) of the derived table named b? This will help me figure out whether it is better to use the derived table in the join or to use the entire table with the filter applied outside:
select
    a.id, b.name, b.cust
from a
left join (
    select id, name, cust
    from tbl
    where size > 100
) b
on a.id = b.id
We use Spark SQL 2.4. Any comments appreciated.

I have had to do something similar before (to work out how many partitions to split into when writing).
What we ended up doing was working out an average row size, doing a count on the DataFrame, and then multiplying the average row size by the row count.
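A minimal PySpark sketch of that approach, using the derived table from the question; the string length of each sampled Row is only a rough proxy for its serialized size, and the sample size is arbitrary:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The derived table from the question.
b = spark.sql("select id, name, cust from tbl where size > 100")

# Estimate an average row size from a small sample. len(str(row)) is only a
# rough proxy for the real serialized size, but it is usually close enough to
# decide whether the table is broadcast-sized.
sample = b.limit(1000).collect()
avg_row_bytes = sum(len(str(row)) for row in sample) / max(len(sample), 1)

row_count = b.count()
approx_mb = avg_row_bytes * row_count / (1024.0 * 1024.0)
print("approx size: %.1f MB (%d rows, ~%d bytes/row)"
      % (approx_mb, row_count, avg_row_bytes))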

Related

Fetch distinct field values from frozen set column in Cassandra columnfamily

Hi, please help me write a CQL query for the following requirement.
- The column family contains the columns deptid (datatype: uuid) and emplList (datatype: set<frozen<employee>>).
How would I get all distinct employee names from the employee objects, given that they are stored as a set in the emplList column value?
Such queries can't be expressed in pure CQL: Cassandra is optimized to read data by primary key, and its aggregation operations are very limited. You have 2 choices:
- Read all the data from the table in your program and extract the distinct values yourself.
- Use Spark with the Spark Cassandra Connector. It will also read all the data from the table, but you get a higher-level abstraction to work with the data, and it can scan your table in a more optimized way.
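A rough PySpark sketch of the second option, assuming the Spark Cassandra Connector package is available; the keyspace name is a placeholder and the table/column names follow the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

# Assumes the connector is on the classpath, e.g. started with
# --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.3
spark = SparkSession.builder.getOrCreate()

# Placeholder keyspace/table names for illustration.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="department")
      .load())

# The set<frozen<employee>> column is exposed as an array of structs, so
# explode it into one employee per row and take the distinct names.
distinct_names = (df
                  .select(explode("emplList").alias("emp"))
                  .select("emp.name")
                  .distinct())
distinct_names.show()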

what is the row id equivalent in pyspark?

In our legacy DWH process, we find duplicates and track those duplicate records based on the rowid in a traditional RDBMS.
For example,
select pkey_columns, max(rowid) from table group by pkey_columns
will return only the max rowid for each group of duplicate records. Once we have identified the duplicate records, this helps in identifying/tracking them.
Is there an equivalent in PySpark? How is this handled in DWH-to-PySpark translation projects?
I would suggest that you use the analytic (window) function library, perhaps a
ROW_NUMBER()
OVER (PARTITION BY pkey_columns
      ORDER BY sort_columns)
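A PySpark sketch of that suggestion, with placeholder table and column names; row_number() gives every row a deterministic number within its key group, which takes over the tracking role that rowid played in the RDBMS:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Placeholder table and column names.
df = spark.table("my_table")

# Number the rows within each primary-key group; anything with rn > 1 is a
# duplicate that can be tracked or dropped.
w = Window.partitionBy("pkey_col1", "pkey_col2").orderBy(col("load_ts").desc())

with_rn = df.withColumn("rn", row_number().over(w))
duplicates = with_rn.filter(col("rn") > 1)
duplicates.show()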

Athena sub-query and LEFT JOIN data scanned optimization

There is a 20 GB table in Parquet format, and a simple query returns results by scanning only 1 GB of data:
select columns from table1 where id in (id1, id2, idn)
If the same query is executed with a sub-query, such as
select columns from table1 where id in (select id from table2 limit n)
then it returns results only after scanning 20 GB, i.e. the whole table, even when n is a very small number such as 10, 50 or 5000.
The same happens with a LEFT JOIN:
SELECT table1.* FROM
table2 LEFT JOIN table1
ON table2.id=table1.id
Is there a way to achieve this with a single query, instead of fetching and saving the result of the sub-query and passing it as arguments into another query?
Are there any best practices for how users currently run LEFT JOINs or sub-queries on Athena without a full table scan?
Similar questions: Question 1, Question 2
Is there a way to achieve this with a single query, instead of fetching and saving the result of the sub-query and passing it as arguments into another query?
This is most commonly covered by "Dynamic filtering".
Currently there is no way to do this.
Athena is based on Presto and Presto doesn't support dynamic filtering yet, but will likely support it in the next release (Presto 321). You can track the issue here: https://github.com/prestosql/presto/issues/52
Athena is based on Presto 0.172 currently, so it still needs to upgrade.
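Until then, the only option is the two-step workaround already mentioned in the question: run the sub-query on its own, then inline its result into the main query. A rough Python sketch, where run_athena_query is a hypothetical helper (e.g. built on boto3's Athena client or PyAthena) that executes a query and returns its rows:

# Hypothetical helper that runs an Athena query and returns its rows.
def run_athena_query(sql):
    raise NotImplementedError("wire this up to your Athena client")

# Step 1: run the small sub-query on its own.
ids = [row[0] for row in run_athena_query("SELECT id FROM table2 LIMIT 50")]

# Step 2: inline the resulting ids as literals, so the main query scans only
# the data it needs (like the fast query from the question).
id_list = ", ".join("'%s'" % i for i in ids)
rows = run_athena_query("SELECT columns FROM table1 WHERE id IN (%s)" % id_list)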

Spark sql limit in IN clause

I have a query in spark-sql with a lot of values in the IN clause:
select * from table where x in (<long list of values>)
When I run this query I get a TransportException from the MetastoreClient in Spark.
Column x is the partition column of the table. The Hive metastore is on Oracle.
Is there a hard limit on how many values can be in the IN clause?
Or can I maybe set the timeout value higher to give the metastore more time to answer?
Yes, you can pass up to 1000 values inside an IN clause.
However, you can slice the list of values into chunks of 1000 and combine multiple IN clauses with the OR operator, as sketched below.
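A small Python sketch of that workaround; the value list and the table/column names are placeholders:

# Split the value list into chunks of at most 1000 and OR the resulting
# IN predicates together.
values = [str(v) for v in range(5000)]  # the long list of partition values

chunks = [values[i:i + 1000] for i in range(0, len(values), 1000)]
predicate = " or ".join(
    "x in (%s)" % ", ".join(chunk) for chunk in chunks
)

query = "select * from table where " + predicate
# df = spark.sql(query)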

Range based Search on date in astyanax

I have, however, one more situation. In my column family I have rows with columns such as name, salary, and dob (date of birth), and all the columns are indexed. I want to do a range-based index search on dob. I would appreciate it if you could let me know how we can do it.
You could move from astyanax to playOrm and just do
@NoSqlQuery(name="findByDate", query="PARTITIONS p(:partitionId) SELECT p FROM TABLE p where p.date > :date and p.date <= :data");
You do need a schema that can be partitioned, though, if you want to scale, and then you just query that single partition. Some partition by customer, and some by time. There are many ways to divide up the schema.
Or, if you really want to use astyanax, playOrm does a batched range query where it fetches the columns in batches each time (so you don't blow out memory). The code is on line 326 for setting up the astyanax range builder and on line 385 for using the builder to create your query.
https://github.com/deanhiller/playorm/blob/8a4f3405631ad78e6822795633da8c59cb25bb29/input/javasrc/com/alvazan/orm/layer9z/spi/db/cassandra/CassandraSession.java
Note that playOrm is doing batching as well, so you will see it setting the batch sizes too.
later,
Dean
