I'm trying to compute Hive table statistics from Apache Spark:
`sqlCtx.sql('ANALYZE TABLE t1 COMPUTE STATISTICS')`
I also execute a statement to see what was collected:
sqlCtx.sql('DESC FORMATTED t1')
I can see my stats were collected.
However, when I execute the same statement in the Hive client (Ambari), there are no statistics displayed. Are they available only to Spark if they're collected by Spark?
Does Spark store them somewhere else?
Another question.
I'm also computing stats for all columns in that table:
sqlCtx.sql('ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS c1,c2')
But when I want to see these stats in Spark, it fails with an unsupported SQL statement exception:
sqlCtx.sql('DESC FORMATTED t1 c1')
According to the docs this is a valid Hive query.
What is wrong with it?
Thanks for the help.
Apache Spark stores statistics as "Table parameters" in the Hive metastore.
To retrieve these stats, we need to connect to the Hive metastore and execute a query like the following:
select param_key, param_value
from table_params tp, tbls t
where tp.tbl_id=t.tbl_id and tbl_name = '<table_name>'
and param_key like 'spark.sql.stat%';
Just uppercasing the table names will work:
select param_key, param_value
from TABLE_PARAMS tp, TBLS t
where tp.tbl_id=t.tbl_id and tbl_name = '<table_name>'
and param_key like 'spark.sql.stat%';
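If you prefer to stay inside Spark rather than querying the metastore database directly, the same parameters can usually be listed with SHOW TBLPROPERTIES. This is a sketch, assuming Spark 2.x; "t1" follows the question, and the exact spark.sql.statistics.* key names can differ across Spark versions.
# List the table parameters Spark wrote to the metastore for t1
props = spark.sql("SHOW TBLPROPERTIES t1")
props.filter(props.key.startswith("spark.sql.statistics")).show(truncate=False)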
Related
I wanted to do pagination on a Hive table having ~1.5 billion rows using PySpark. I came across one solution using ROW_NUMBER(). When I tried it, I ran out of memory. I'm not sure whether Spark is trying to bring the complete table into its memory and then do the pagination.
After that, I came across the LIMIT clause in Hive SQL (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause) and tried it. But it failed in Spark; the reason, as I figured out, is that HiveQL is not completely supported in spark.sql(). Spark SQL's LIMIT does not support a second argument for the offset -> https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-limit.html
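For illustration (the table name is just a placeholder): a plain LIMIT parses, but the Hive-style offset form does not:
# Works: plain LIMIT is supported by Spark SQL
spark.sql("SELECT * FROM my_table LIMIT 100").show()

# Fails in Spark 3.0: the Hive-style "LIMIT offset, rows" form is not supported
spark.sql("SELECT * FROM my_table LIMIT 1000, 100").show()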
Is there a good approach with which I can do pagination using Spark?
PS: The Hive table does not have an ID column with which I can sort and do pagination. :)
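For reference, the ROW_NUMBER() attempt looked roughly like the sketch below (table and column names are placeholders); a window with orderBy but no partitionBy moves every row into a single partition to number them, which could explain the memory blow-up:
from pyspark.sql import functions as F, Window

df = spark.read.table("my_big_hive_table")   # placeholder table name

# No partitionBy here, so Spark shuffles all ~1.5B rows into ONE partition
# just to assign consecutive row numbers -- this is where memory runs out.
w = Window.orderBy("some_col")               # placeholder ordering column
page = (df.withColumn("rn", F.row_number().over(w))
          .where((F.col("rn") > 1000000) & (F.col("rn") <= 1001000)))
page.show()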
Basic use of Spark:
# Extract the data
df = spark.read.table("my_table")
# Transform the data
df = df.withColumn("new_col", some_transformation())
# Load the data
df.write ... # write wherever you want
I want to execute a Cassandra CQL query using PySpark, but I am not finding a way to do it. I can load the whole table into a dataframe, create a temp view, and query it.
df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="country_production2", keyspace="country").load()
df.createOrReplaceTempView("Test")
Please suggest a better way so that I can execute a CQL query in PySpark.
Spark SQL doesn't support Cassandra's CQL dialect directly. It only allows you to load the table as a Dataframe and operate on it.
If you are concerned about reading a whole table to query it, then you may use filters as given below to let Spark push down the predicates and load only the data you need.
from pyspark.sql.functions import col

df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table=table_name, keyspace=keys_space_name) \
    .load() \
    .filter(col("id") == "A")

df.createOrReplaceTempView("Test")
In PySpark you're using SQL, not CQL. If the SQL query somehow matches the CQL, i.e., you're querying by partition or primary key, then the Spark Cassandra Connector (SCC) will transform the query into CQL and execute it (so-called predicate pushdown). If it doesn't match, then Spark will load all data via the SCC and perform the filtering on the Spark level.
So after you've registered the temporary view, you can do:
result = spark.sql("select ... from Test where ...")
and work with the results in the result variable. To check whether predicate pushdown happens, execute result.explain() and look for the * marker on the conditions in the PushedFilters section.
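A minimal sketch of that check, reusing the Test view registered above (the id column is just an example of a partition-key predicate):
result = spark.sql("SELECT * FROM Test WHERE id = 'A'")
result.explain()
# In the physical plan, look at the scan node's PushedFilters list:
# filters marked with * were handed to Cassandra by the connector,
# unmarked ones are applied afterwards on the Spark side.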
I have a simple Hive external table which is created on top of S3 (the files are in CSV format). When I run the Hive query it shows all records and partitions.
However, when I use the same table in Spark (where the Spark SQL has a where condition on the partition column), it does not show that a partition filter is applied. For a Hive managed table, on the other hand, Spark is able to use the partition information and apply the partition filter.
Is there any flag or setting that can help me make use of partitions of Hive external tables in Spark? Thanks.
Update:
For some reason, it is only the Spark plan that does not show the partition filters. However, when you look at the data loaded, it is only loading the data needed from the partitions.
Ex: with rating=0 it loads only one file of 1 MB; when I don't have the filter it reads all 3 partitions, about 3 MB.
tl;dr: set the following before running the SQL for the external table
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
The difference in behaviour is not because of external/managed tables.
The behaviour depends on two factors:
1. Where the table was created (Hive or Spark)
2. The file format (I believe it is ORC in this case, from the screen capture)
Where the table was created (Hive or Spark)
If the table was created using Spark APIs, it is considered a datasource table.
If the table was created using HiveQL, it is considered a Hive native table.
The metadata of both these tables is stored in the Hive metastore; the only difference is in the provider field of TBLPROPERTIES of the tables (describe extended <tblName>). The value of the property is orc or empty for a Spark table and hive for a Hive table.
How Spark uses this information
When the provider is not hive (a datasource table), Spark uses its native way of processing the data.
If provider is hive, Spark uses Hive code to process the data.
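To check which case a given table falls into, one can look at the provider recorded in the metastore. This is a sketch; the table name is an example and the exact row label can vary slightly by Spark version.
# 'hive' means the Hive-native path; 'orc'/'parquet' (or empty) means a
# Spark datasource table handled by the built-in readers.
spark.sql("DESCRIBE EXTENDED my_table") \
     .filter("col_name = 'Provider'") \
     .show(truncate=False)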
File format
Spark provides config flags to instruct the engine to use the datasource way of processing the data for the following file formats: ORC and Parquet.
Flags:
Orc
val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
.doc("When set to true, the built-in ORC reader and writer are used to process " +
"ORC tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(true)
Parquet
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
.doc("When set to true, the built-in Parquet reader and writer are used to process " +
"parquet tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(true)
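To see the effect, one can set the flag and inspect the physical plan. This is a sketch: the table name is hypothetical and the rating column comes from the question's update.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

# With the built-in ORC reader the FileScan node should now show a
# PartitionFilters entry for the rating predicate.
spark.sql("SELECT * FROM my_external_table WHERE rating = 0").explain()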
I also ran into this kind of problem, having multiple joins of internal and external tables.
None of the tricks work, including:
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
Does anyone know how to solve this problem?
I use PySpark 2.1 to compute table and column statistics out of Hive tables.
When I do:
spark.sql("ANALYZE TABLE <table name> COMPUTES STATISTICS")
I am able to collect the stats with a DESCRIBE EXTENDED SQL command:
spark.sql("DESCRIBE EXTENDED <table name>").show()
However, when computing column statistics like so:
spark.sql("ANALYZE TABLE <table name> COMPUTES STATISTICS FOR COLUMNS")
The job gets sent and successfully done, but I am unable to collect the stats using the SQL command as advised by this answer:
spark.sql("DESCRIBE EXTENDED <table name> <column name>").show()
I get:
ParseException Unsupported SQL statement
Reissuing the same DESCRIBE EXTENDED query (without the column name) does not return any changes in the CatalogTable: I can only see the table statistics (i.e. sizeInBytes and rowCount).
This other answer suggests retrieving Spark statistics from a "table" in the metastore, but this is cryptic to me...
How can I access these column-level statistics within Spark?
Edit: I have investigated this further, which allows me to refine the scope of my question:
It looks like my Hive client (Hive View 2.0) and Spark SQL do not write the statistics in the same location.
When using Spark SQL's ANALYZE TABLE method, only table statistics show up in a Statistics key of the CatalogTable:
Statistics: sizeInBytes=15196959528, rowCount=39763901, isBroadcastable=false
However, Hive View is blind to these stats, which are not listed in my table statistics.
Conversely, when computing table or column statistics within Hive View, I can collect the table statistics with Spark SQL's DESCRIBE EXTENDED method, but they appear in the Properties key of my CatalogTable. It also indicates whether some column statistics have been computed:
Properties: [numFiles=20, transient_lastDdlTime=1540381765, totalSize=15196959528, COLUMN_STATS_ACCURATE={"COLUMN_STATS":{"bubble_level":"true","bubble_level_n1s":"true","timestamp":"true"}}]
Thus these pieces of information appear to be independent, and my question then becomes: which piece can be used by Spark to optimize the execution plan? I understand that some of these questions could be solved by upgrading to the latest version of Spark, but this is not on my schedule for the moment.
The above-mentioned answer by Jacek Laskowski suggests Hive's statistics can be used if they are available through Spark SQL's DESCRIBE EXTENDED method.
Can anybody confirm this ?
Many thanks in advance for helping me to clear this up.
After days of thinking about it, I'm still stuck with this problem: I have one table where "timestamp" is the partition key. This table contains billions of rows.
I also have "timeseries" tables that contain timestamps related to specific measurement processes.
With Spark I want to analyze the content of the big table. Of course it is not efficient to do a full table scan, and with a rather fast lookup in the timeseries table I should be able to target only, say, 10k partitions.
What is the most efficient way to achieve this?
Is Spark SQL smart enough to optimize something like this?
sqlContext.sql("""
SELECT timeseries.timestamp, bigtable.value1 FROM timeseries
JOIN bigtable ON bigtable.timestamp = timeseries.timestamp
WHERE timeseries.parameter = 'xyz'
""")
Ideally I would expect Cassandra to fetch the timestamps from the timeseries table and then use that to query only that subset of partitions from bigtable.
If you add an "Explain" call to your query you'll see what the Catalyst planner will do for your query but I know it will not do the optimizations you want.
Currently Catalyst has no support for pushing down joins to DataSources which means the structure of your query is most likely got to look like.
1. Read data from table timeseries with predicate parameter = 'xyz'
2. Read data from table bigtable
3. Join these two results
4. Filter on bigtable.timestamp == timeseries.timestamp
The Spark Cassandra Connector will be given the predicate from the timeseries table read and will be able to optimize it if it is a clustering key or a partition key. See the Spark Cassandra Connector docs. If it doesn't fit into one of those pushdown categories, it will require a full table scan followed by a filter in Spark.
Since the read of table bigtable (step 2) has no restrictions on it, Spark will instruct the Connector to read the entire table (full table scan).
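One workaround that often helps in this situation (a sketch; it assumes the filtered timeseries result is small enough to collect to the driver, and that both tables are available to Spark as timeseries and bigtable): collect the matching timestamps and feed them back as an IN-style predicate on bigtable, since an isin() on the partition key column is a predicate the connector can generally push down.
from pyspark.sql.functions import col

# Step 1: resolve the relevant timestamps from the (much smaller) timeseries table
ts_values = [r["timestamp"] for r in
             spark.table("timeseries")
                  .where(col("parameter") == "xyz")
                  .select("timestamp")
                  .distinct()
                  .collect()]

# Step 2: restrict bigtable to those partition keys; an IN predicate on the
# partition key can be pushed down to Cassandra instead of scanning the
# whole table.
result = spark.table("bigtable").where(col("timestamp").isin(ts_values))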
I can only guess at the optimizations done by the driver, but I'd surely expect a query such as this to restrict the JOIN using the WHERE clause, which means that your simple query will be optimized.
What I will also do is point you in the general direction of optimizing Spark SQL. Have a look at Catalyst for Spark SQL, which greatly optimizes queries all the way down to the physical level.
Here is a breakdown of how it works:
Deep Dive into Spark SQL Catalyst Optimizer
And the link to the git-repo: Catalyst repo