I use PySpark 2.1 to compute table and column statistics from Hive tables.
When I do:
spark.sql("ANALYZE TABLE <table name> COMPUTES STATISTICS")
I am able to collect the stats with a DESCRIBE EXTENDED SQL command:
spark.sql("DESCRIBE EXTENDED <table name>").show()
However, when computing column statistics like so:
spark.sql("ANALYZE TABLE <table name> COMPUTES STATISTICS FOR COLUMNS")
The job is submitted and completes successfully, but I am unable to collect the stats using the SQL command advised in this answer:
spark.sql("DESCRIBE EXTENDED <table name> <column name>").show()
I get:
ParseException Unsupported SQL statement
Reissuing the same DESCRIBE EXTENDED query (without the column name) does not
return any changes in the CatalogTable: I can only see the table statistics (i.e. sizeInBytes and rowCount).
This other answer suggests retrieving Spark statistics from a "table" in the metastore, but that is cryptic to me...
How can I access these column-level statistics within Spark?
Edit: I have investigated this further, which allows me to refine the scope of my question:
It looks like my Hive client (Hive View 2.0) and Spark SQL do not write the statistics in the same location.
When using Spark SQL's ANALYZE TABLE method, only table statistics show up in a Statistics key of the CatalogTable:
Statistics: sizeInBytes=15196959528, rowCount=39763901, isBroadcastable=false
However, Hive View is blind to these stats, which are not listed in its table statistics.
Conversely, when computing table or column statistics within Hive View, I can collect the table statistics with Spark SQL's DESCRIBE EXTENDED method, but they appear in the Properties key of my CatalogTable. It also indicates whether some column statistics have been computed:
Properties: [numFiles=20, transient_lastDdlTime=1540381765, totalSize=15196959528, COLUMN_STATS_ACCURATE={"COLUMN_STATS":{"bubble_level":"true","bubble_level_n1s":"true","timestamp":"true"}}]
Thus these pieces of information appear to be independent, and my question then becomes: which piece can Spark actually use to optimize the execution plan? I understand that some of these questions could be solved by upgrading to the latest version of Spark, but that is not on my schedule for the moment.
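For reference, here is a minimal PySpark sketch of how I look at both locations at once; the table name my_table is a placeholder, and the exact row layout of DESCRIBE EXTENDED differs between Spark 2.1 and later versions:
# Minimal sketch, hypothetical table name `my_table`.
# Spark-written stats appear under the Statistics entry of the CatalogTable,
# while Hive-written stats end up in the Properties entry.
for row in spark.sql("DESCRIBE EXTENDED my_table").collect():
    line = " ".join(str(c) for c in row if c is not None)
    if ("Statistics" in line or "Properties" in line
            or "Detailed Table Information" in line):
        print(line)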
The above-mentioned answer by Jacek Laskowski suggests that Hive's statistics can be used as long as they are available through Spark SQL's DESCRIBE EXTENDED method.
Can anybody confirm this ?
Many thanks in advance for helping me to clear this up.
I am using Spark 2.4.7 and I have set up the normal PySpark Cassandra connector, but there is a use case where I need to implement a key-based read. I am not finding useful blogs/tutorials about it; could someone please help me with it?
I have tried the normal pyspark-cassandra connector and it is working well.
Now I want to implement a key-based read, which I am unable to find documentation for.
Normally Cassandra loads the entire table, but I do not want to load the entire table; I want to run a query on the source and fetch only the required data.
By key-based I mean getting data using some keys, i.e. using a WHERE condition like:
SELECT *
FROM <table_name>
WHERE <column_name> != 0
which should run on the source and load only the data that satisfies this condition.
To have this functionality you need to understand how Spark & Cassandra work separately & together:
When you do spark.read, Spark doesn't load all the data - it just fetches metadata, such as the table structure, column names & types, partitioning scheme, etc.
When you perform a query with a condition (where or filter), the Spark Cassandra Connector tries to perform so-called predicate pushdown - converting the Spark SQL query into a corresponding CQL query - but whether that works really depends on the condition. If it's not possible, the connector goes through all the data and performs the filtering on the Spark side. For example, if you have a condition on a column that is a partition key, it will be converted into the CQL expression SELECT ... FROM table WHERE pk = XXX. Similarly, there are some optimizations for queries on the clustering columns - Spark will still need to go through all partitions, but it will be more efficient as it may filter data based on the clustering columns. Use the link above to understand which conditions can be pushed down into Cassandra and which cannot. The rule of thumb is: if you can execute the query in CQLSH without ALLOW FILTERING, then it will be pushed down.
In your specific example, you're using an inequality predicate (<> or !=), which isn't supported by Cassandra, so the Spark Cassandra Connector will need to go through all the data, and the filtering will happen on the Spark side.
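To illustrate (a minimal sketch, not the connector's official example: the keyspace ks, table my_table and the column names pk and some_column are assumptions), you can call explain() and look for PushedFilters in the physical plan to see whether a given predicate was pushed down:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Assumes the spark-cassandra-connector package is on the classpath and
# spark.cassandra.connection.host is configured.
spark = SparkSession.builder.appName("pushdown-check").getOrCreate()

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="ks", table="my_table")   # hypothetical keyspace/table
      .load())

# Equality on a partition key column: the connector can push this down to CQL.
df.filter(col("pk") == "some_value").explain()

# Inequality (!=): not expressible in CQL, so Spark reads everything
# and filters on its side.
df.filter(col("some_column") != 0).explain()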
I know we could explicitly ANALYZE the table in Spark SQL so we could get some exact statistics.
However, are there utilities in Catalyst that do not require explicitly scanning the entire table but could give me some rough statistics? I don't really care about the real size of a table; I only care about the relative sizes between tables, so that I can decide which table is larger than the others during query compilation.
There are two utilities in Catalyst:
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.{BasicStatsPlanVisitor, SizeInBytesOnlyStatsPlanVisitor}
But it looks like they both require explicitly scanning the table.
Thanks.
There are two ways: either the stats are taken from the metastore, which requires running ANALYZE in advance (a scan over the data), or the stats (only sizeInBytes, actually) are estimated using InMemoryFileIndex, which does not require scanning the data - instead Spark gathers the size of each file through the Hadoop API.
Which of these methods is used depends on several settings. For example, if sizeInBytes is available in the metastore and CBO (cost-based optimization) is enabled via the configuration setting
spark.sql.cbo.enabled
Spark will take it from the metastore. If CBO is off (which is the default in Spark 2.4), Spark will use InMemoryFileIndex. If sizeInBytes is not available in the metastore, Spark can still use either CatalogFileIndex or InMemoryFileIndex. CatalogFileIndex will be used, for example, if your table is partitioned - more specifically, if this condition is satisfied (taken directly from the Spark source code):
val useCatalogFileIndex = sparkSession.sqlContext.conf.manageFilesourcePartitions && catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog && catalogTable.get.partitionColumnNames.nonEmpty
In this case, if the stats are not in the metastore, Spark will use defaultSizeInBytes from the configuration setting:
spark.sql.defaultSizeInBytes
which is Long.MaxValue by default, so the size will be overestimated to the maximum value. I guess this is the worst scenario: the stats are not in the metastore, but Spark looks for them there using CatalogFileIndex, does not find them, and thus uses a very large, unrealistic value.
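As a rough illustration (a minimal sketch; the table name events is hypothetical), you can populate the metastore stats and check what the planner will see:
# Minimal sketch, hypothetical Hive table `events`.
spark.conf.set("spark.sql.cbo.enabled", "true")       # let the planner use metastore stats
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")  # writes sizeInBytes / rowCount to the metastore

# The 'Statistics' entry shows the values the planner can use.
spark.sql("DESCRIBE EXTENDED events").show(100, truncate=False)

# Fallback used when a partitioned catalog table has no stats at all.
print(spark.conf.get("spark.sql.defaultSizeInBytes", "not set"))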
We are trying to leverage CBO to get better plan results for a few critical queries run through spark-sql or through the Thrift server using the JDBC driver. The following settings were added to spark-defaults.conf:
spark.sql.cbo.enabled true
spark.experimental.extrastrategies intervaljoin
spark.sql.cbo.joinreorder.enabled true
The tables that we are using are not partitioned.
Please let me know if you need further details.
You provide little detail. Please check whether all the steps set out below have been followed.
As of Spark 2.2, when I last looked at this, this excellent article covers it: https://www.waitingforcode.com/apache-spark-sql/spark-sql-cost-based-optimizer/read
It says the following:
Spark SQL implementation
At the time of writing (2.2.0 released) Spark SQL Cost Based Optimization is disabled by default and can be activated through spark.sql.cbo.enabled property. When enabled, it applies in: filtering, projection, joins and aggregations, as we can see in corresponding estimation objects from org.apache.spark.sql.catalyst.plans.logical.statsEstimation package: FilterEstimation, ProjectEstimation, JoinEstimation and AggregateEstimation.
Even if at first glance the use of estimation objects seems to be conditioned only by the configuration property, that's not always the case. Spark's CBO is applied only when the statistics about the manipulated data are known (read more about them in the post devoted to Statistics in Spark SQL). This condition is expressed by the EstimationUtils method:
def rowCountsExist(conf: SQLConf, plans: LogicalPlan*): Boolean =
plans.forall(_.stats(conf).rowCount.isDefined)
Filtering is an exception because it's checked only against the existence of a row count:
if (childStats.rowCount.isEmpty) return None
The statistics can be gathered by executing the ANALYZE TABLE $TABLE_NAME COMPUTE STATISTICS command before the processing execution. When the ANALYZE command is called, it's executed by
org.apache.spark.sql.execution.command.AnalyzeTableCommand#run(SparkSession), which updates the
org.apache.spark.sql.catalyst.catalog.SessionCatalog statistics of the processed data.
The only problem with the ANALYZE command is that it can be called only for Hive and in-memory data stores.
Also, CBO does not work properly with partitioned Hive Parquet tables; CBO only gives the estimated size and not the estimated number of rows.
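Putting the above together, a minimal PySpark sketch of the usual steps (the table and column names are hypothetical, and the same settings can go into spark-defaults.conf as in the question):
# Minimal sketch, hypothetical table/column names.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

# Table-level stats (sizeInBytes, rowCount) required by the estimations.
spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS")

# Column-level stats (distinct counts, min/max, nulls) used by filter/join estimation.
spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS order_id, customer_id")

# Verify what was stored before running the critical queries.
spark.sql("DESCRIBE EXTENDED orders").show(100, truncate=False)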
I'm trying to compute Hive table statistics from Apache Spark:
sqlCtx.sql('ANALYZE TABLE t1 COMPUTE STATISTICS')
I also execute a statement to see what was collected:
sqlCtx.sql('DESC FORMATTED t1')
I can see my stats were collected.
However, when I execute the same statement in a Hive client (Ambari), no statistics are displayed. Are they available only to Spark if they were collected by Spark?
Does Spark store them somewhere else?
Another question.
I am also computing stats for all columns in that table:
sqlCtx.sql('ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS c1,c2')
But when I want to see these stats in Spark, it fails with an unsupported SQL statement exception:
sqlCtx.sql('DESC FORMATTED t1 c1')
According to the docs these are valid Hive queries.
What is wrong here?
Thanks for the help.
Apache Spark stores statistics as "table parameters".
To retrieve these stats, we need to connect to the Hive metastore and execute a query like the following:
select param_key, param_value
from table_params tp, tbls t
where tp.tbl_id=t.tbl_id and tbl_name = '<table_name>'
and param_key like 'spark.sql.stat%';
Just uppercasing the table names will make it work:
select param_key, param_value
from TABLE_PARAMS tp, TBLS t
where tp.tbl_id=t.tbl_id and tbl_name = '<table_name>'
and param_key like 'spark.sql.stat%';
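If you cannot query the metastore database directly, a minimal sketch from the Spark side (reusing the table t1 from the question) is to list the table parameters with SHOW TBLPROPERTIES and look for the spark.sql.statistics.* keys:
# Minimal sketch: these are the same "table parameters" queried above,
# read through Spark instead of the metastore database.
for row in sqlCtx.sql("SHOW TBLPROPERTIES t1").collect():
    if row.key.startswith("spark.sql.statistics"):
        print(row.key, "=", row.value)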
After days of thinking about it I'm still stuck with this problem: I have one table where "timestamp" is the partition key. This table contains billions of rows.
I also have "timeseries" tables that contain timestamps related to specific measurement processes.
With Spark I want to analyze the content of the big table. Of course it is not efficient to do a full table scan, and with a rather fast lookup in the timeseries table I should be able to target only, say, 10k partitions.
What is the most efficient way to achieve this?
Is Spark SQL smart enough to optimize something like this?
sqlContext.sql("""
SELECT timeseries.timestamp, bigtable.value1 FROM timeseries
JOIN bigtable ON bigtable.timestamp = timeseries.timestamp
WHERE timeseries.parameter = 'xyz'
""")
Ideally I would expect Cassandra to fetch the timestamps from the timeseries table and then use that to query only that subset of partitions from bigtable.
If you add an "Explain" call to your query you'll see what the Catalyst planner will do for your query but I know it will not do the optimizations you want.
Currently Catalyst has no support for pushing down joins to DataSources which means the structure of your query is most likely got to look like.
Read Data From Table timeseries with predicate parameter = 'xyz'
Read Data From Table bigtable
Join these two results
Filter on bigtable.timestamp == timeseries.timestamp
The Spark Cassandra Connector will be given the predicate from the timeseries table read and will be able to optimize it if it is on a clustering key or a partition key. See the Spark Cassandra Connector docs. If it doesn't fit into one of those pushdown categories, it will require a full table scan followed by a filter in Spark.
Since the read from bigtable has no restrictions on it, Spark will instruct the connector to read the entire table (a full table scan).
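A common workaround (a minimal PySpark sketch; the keyspace and column names are assumptions based on the question, and it only stays practical while the collected key list is small, e.g. the ~10k timestamps mentioned above) is to look up the matching timestamps first and push them back as an IN predicate on bigtable's partition key, which the connector can push down:
from pyspark.sql.functions import col

timeseries = (sqlContext.read.format("org.apache.spark.sql.cassandra")
              .options(keyspace="ks", table="timeseries").load())
bigtable = (sqlContext.read.format("org.apache.spark.sql.cassandra")
            .options(keyspace="ks", table="bigtable").load())

# Step 1: small driver-side lookup of the relevant partition keys.
ts_values = [r["timestamp"] for r in
             timeseries.filter(col("parameter") == "xyz")
                       .select("timestamp").distinct().collect()]

# Step 2: IN on the partition key can be pushed down by the connector,
# so only the matching partitions of bigtable are read.
result = (bigtable.filter(col("timestamp").isin(ts_values))
                  .select("timestamp", "value1"))
result.explain()  # check PushedFilters in the physical plan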
I can only guess at the optimizations done by the driver, but I'd certainly expect a query like that to restrict the JOIN based on the WHERE clause, which means your simple query will be optimized.
I will also point you in the general direction of optimizing Spark SQL. Have a look at Catalyst for Spark SQL, which optimizes queries all the way down to the physical level.
Here is a breakdown of how it works:
Deep Dive into Spark SQL Catalyst Optimizer
And the link to the git-repo: Catalyst repo