This page, StatsDev - Apache Hive - Apache Software Foundation, tells us we can get the following statistics once we have the input tables & partitions information in Hive:
Number of rows
Number of files
Size in Bytes
However, can we get those statistics directly from an unparsed HQL statement? Or can we only get them after parsing the input tables & partitions information out of the HQL?
The EXPLAIN keyword is what I want: LanguageManual Explain - Apache Hive - Apache Software Foundation
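For a quick illustration (a minimal sketch only): the same EXPLAIN keyword can be run from the Hive CLI or Beeline, or, if you are on Spark SQL, against the raw query string; the table name my_table below is purely hypothetical.

import org.apache.spark.sql.SparkSession

// Assumes a Hive-enabled SparkSession; my_table is an illustrative table name.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// EXPLAIN takes the unparsed query text, so no table/partition metadata has to be extracted by hand first.
spark.sql("EXPLAIN EXTENDED SELECT count(*) FROM my_table").show(truncate = false)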
Related
We are trying to leverage CBO to get better plan results for a few critical queries run through spark-sql or through the Thrift server using the JDBC driver. The following settings were added to spark-defaults.conf:

spark.sql.cbo.enabled true
spark.experimental.extrastrategies intervaljoin
spark.sql.cbo.joinReorder.enabled true
The tables that we are using are not partitioned.
Please let me know if you need further details.
You provide little detail. Please check whether all the steps set out below have been followed.
When I last looked at this (around Spark 2.2), this excellent article: https://www.waitingforcode.com/apache-spark-sql/spark-sql-cost-based-optimizer/read said the following:
Spark SQL implementation
At the time of writing (2.2.0 released), Spark SQL Cost-Based Optimization is disabled by default and can be activated through the spark.sql.cbo.enabled property. When enabled, it applies to filtering, projection, joins and aggregations, as we can see in the corresponding estimation objects from the org.apache.spark.sql.catalyst.plans.logical.statsEstimation package: FilterEstimation, ProjectEstimation, JoinEstimation and AggregateEstimation.
Even if at first glance the use of estimation objects seems to be conditioned only by the configuration property, that's not always the case. Spark's CBO is applied only when statistics about the manipulated data are known (read more about them in the post devoted to Statistics in Spark SQL). This condition is expressed by the EstimationUtils method:
def rowCountsExist(conf: SQLConf, plans: LogicalPlan*): Boolean =
  plans.forall(_.stats(conf).rowCount.isDefined)
Filtering is an exception because it checks only for the existence of a row count:
if (childStats.rowCount.isEmpty) return None
The statistics can be gathered by executing the ANALYZE TABLE $TABLE_NAME COMPUTE STATISTICS command before the processing is executed. When the ANALYZE command is called, it is handled by
org.apache.spark.sql.execution.command.AnalyzeTableCommand#run(SparkSession), which updates the
org.apache.spark.sql.catalyst.catalog.SessionCatalog statistics for the processed data.
The only problem with the ANALYZE command is that it can be called only for Hive and in-memory data stores.
Also, CBO does not work properly with partitioned Hive Parquet tables; CBO only estimates the size and not the number of rows.
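Putting the pieces above together, a minimal sketch of the workflow (assuming an existing Hive-backed table; the table name sales and the column names are illustrative only):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Enable CBO and join reordering (these can also live in spark-defaults.conf).
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Gather table-level and column-level statistics before the heavy queries run.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")

// Verify that the statistics were recorded (look for sizeInBytes / rowCount).
spark.sql("DESCRIBE EXTENDED sales").show(100, truncate = false)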
I'm trying to learn the whole open-source big data stack, and I've started with HDFS, Hadoop MapReduce and Spark. I'm more or less limited to MapReduce and Spark (SQL?) for "ETL" and HDFS for storage, with no other limitation for the other parts.
I have a situation like this:
My Data Sources
Data Source 1 (DS1): Lots of data - totaling around 1 TB. Each row contains an ID (let's call it ID1) used as a key. Format: thousands of JSON files.
Data Source 2 (DS2): Additional "metadata" for Data Source 1. Each row contains an ID (let's call it ID2) used as a key. Format: a single TXT file.
Data Source 3 (DS3): Mapping between Data Source 1 and 2. Only pairs of ID1, ID2 in CSV files.
My workspace
I currently have a VM with enough data space, about 128 GB of RAM and 16 CPUs to handle my problem (the whole project is for research, not a production use case). I have CentOS 7 and Cloudera 6.x installed. Currently, I'm using HDFS, MapReduce and Spark.
The task
I need only some attributes (the ID and a few strings) from Data Source 1. My guess is that this comes to less than 10% of the data size.
I need to connect ID1s from DS3 (pairs: ID1, ID2) to IDs in DS1 and ID2s from DS3 (pairs: ID1, ID2) to IDs in DS2.
I need to add attributes from DS2 (using "mapping" from the previous bullet) to my extracted attributes from DS1
I need to make some "queries", like:
Find the most used words by years
Find the most common words, used by a certain author
Find the most common words, used by a certain author, on a yearly basis
etc.
I need to visualize data (i.e. wordclouds, histograms, etc.) at the end.
My questions:
Which tool should I use to extract data from the JSON files in the most efficient way - MapReduce or Spark (SQL?)?
I have arrays inside the JSON. I know the explode function in Spark can flatten my arrays into rows, but what is the best way to go here? Is it best to extract the IDs from DS1, put the exploded data next to them, and write them to new files? Or is it better to combine everything? How do I achieve this - Hadoop or Spark?
My current idea was to create something like this:
Extract attributes needed (except arrays) from DS1 with Spark and write them to CSV files.
Extract attributes needed (exploded arrays only + IDs) from DS1 with Spark and write them to CSV files - each exploded attribute to own file(s).
This means I have extracted all the data I need, and I can easily connect them with only one ID. I then wanted to make queries for specific questions and run MapReduce jobs.
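For what it's worth, a minimal sketch of that extraction plan in Spark (the paths and the column names id1, title, author, year, keywords are hypothetical placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().appName("ds1-extract").getOrCreate()

// Read the thousands of JSON files from DS1 in one pass.
val ds1 = spark.read.json("hdfs:///data/ds1/*.json")

// Step 1: scalar attributes only (no arrays), written as CSV keyed by ID1.
ds1.select("id1", "title", "author", "year")
  .write.option("header", "true").csv("hdfs:///extract/ds1_attributes")

// Step 2: each array attribute exploded into rows in its own file set, still keyed by ID1.
ds1.select(ds1("id1"), explode(ds1("keywords")).as("keyword"))
  .write.option("header", "true").csv("hdfs:///extract/ds1_keywords")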
The question: Is this a good idea? If not, what can I do better? Should I insert data into a database? If yes, which one?
Thanks in advance!
Thanks for asking!! Having been a big data developer for the last 1.5 years, with experience in both MR and Spark, I think I can guide you in the right direction.
The final goals you want to achieve can be reached with either MapReduce or Spark. For visualization purposes you can use Apache Zeppelin, which can run on top of your final data.
Spark jobs are memory-expensive, i.e., the whole computation for a Spark job runs in memory (RAM), and only the final result is written to HDFS. MapReduce, on the other hand, uses less memory and writes intermediate stage results to HDFS, which means more I/O operations and more time.
You can use Spark's DataFrame feature. You can load structured data (it can be a plain-text file as well) directly into a DataFrame, which will give you the required data in a tabular format. You can write the DataFrame back to a plain-text file, or you can store it in a Hive table from where you can visualize the data. With MapReduce, on the other hand, you would first have to store the data in a Hive table, then write Hive operations to manipulate it, and store the final data in another Hive table. Writing native MapReduce jobs can be very tedious, so I would suggest refraining from that option.
In the end, I would suggest using Spark as the processing engine (128 GB and 16 cores is enough for Spark) to get your final result as soon as possible.
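As an illustration of the DataFrame-to-Hive route described above (a sketch only; the paths, join keys and the table name research.enriched_ds1 are assumptions, including an existing Hive database called research):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ds-join")
  .enableHiveSupport() // needed so saveAsTable targets the Hive metastore
  .getOrCreate()

// Load the extracted DS1 attributes, the DS2 metadata and the DS3 mapping.
val ds1Attrs = spark.read.option("header", "true").csv("hdfs:///extract/ds1_attributes")
val ds2Meta  = spark.read.option("header", "true").option("delimiter", "\t").csv("hdfs:///data/ds2/metadata.txt")
val mapping  = spark.read.option("header", "true").csv("hdfs:///data/ds3/*.csv")

// Join everything on the ID columns and persist it as a Hive table that
// Zeppelin (or Spark SQL itself) can query for the word/author/year questions.
ds1Attrs
  .join(mapping, "id1")
  .join(ds2Meta, "id2")
  .write.mode("overwrite").saveAsTable("research.enriched_ds1")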
I'm using Apache NiFi 1.9.2 to load data from a relational database into Google Cloud Storage. The purpose is to write the outcome into Parquet files, as Parquet stores data in a columnar way. To achieve this I make use of the ConvertAvroToParquet (default settings) processor in NiFi (followed by the PutGCSObject processor). The problem with the resulting files is that I cannot read decimal-typed columns when consuming the files in Spark 2.4.0 (Scala 2.11.12): Parquet column cannot be converted ... Column: [ARHG3A], Expected: decimal(2,0), Found: BINARY
Links to parquet/avro example files:
https://drive.google.com/file/d/1PmaP1qanIZjKTAOnNehw3XKD6-JuDiwC/view?usp=sharing
https://drive.google.com/file/d/138BEZROzHKwmSo_Y-SNPMLNp0rj9ci7q/view?usp=sharing
Since NiFi works with the Avro format between processors within the flowfile, I have also written out the Avro file (as it exists just before the ConvertAvroToParquet processor), and that file I can read in Spark.
It is also possible to not use logical types in Avro, but then I lose the column types in the end and all columns are Strings (not preferred).
I have also experimented with the PutParquet processor without success.
val arhg_parquet = spark.read.format("parquet").load("ARHG.parquet")
arhg_parquet.printSchema()
arhg_parquet.show(10,false)
printSchema() gives proper result, indicating ARHG3A is a decimal(2,0)
Executing the show(10,false) results in an ERROR: Parquet column cannot be converted in file file:///C:/ARHG.parquet. Column: [ARHG3A], Expected: decimal(2,0), Found: BINARY
To achieve this I make use of the ConvertAvroToParquet (default settings) processor in Nifi (followed by the PutGCSObject processor)
Try upgrading to NiFi 1.12.1, our latest release. Some improvements were made to handling decimals that might be applicable here. Also, you can use the Parquet reader and writer services to convert from Avro to Parquet now as of ~1.10.0. If that doesn't work, it may be a bug that should have a Jira ticket filed against it.
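If upgrading NiFi is not immediately possible, one possible workaround (not suggested in this thread, just a sketch) is to let Spark itself do the Avro-to-Parquet conversion, since the question notes that the intermediate Avro file reads fine in Spark. This assumes Spark 2.4 with the spark-avro module on the classpath and purely illustrative paths:

// Launch with e.g.: spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-to-parquet").getOrCreate()

// Read the Avro file that NiFi produces just before ConvertAvroToParquet;
// the decimal logical types survive the built-in Avro data source.
val df = spark.read.format("avro").load("/tmp/ARHG.avro")

df.printSchema() // ARHG3A should appear as decimal(2,0)

// Write Parquet with Spark's own writer instead of NiFi's converter,
// then upload the result to GCS as before.
df.write.mode("overwrite").parquet("/tmp/ARHG_spark.parquet")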
I use PySpark 2.1 to compute table and column statistics out of Hive tables.
When I do:
spark.sql("ANALYZE TABLE <table name> COMPUTES STATISTICS")
I am able to collect the stats with a DESCRIBE EXTENDED SQL command:
spark.sql("DESCRIBE EXTENDED <table name>").show()
However, when computing column statistics like so:
spark.sql("ANALYZE TABLE <table name> COMPUTES STATISTICS FOR COLUMNS")
The job gets submitted and completes successfully, but I am unable to collect the stats using the SQL command as advised by this answer:
spark.sql("DESCRIBE EXTENDED <table name> <column name>").show()
I get:
ParseException Unsupported SQL statement
Reissuing the same DESCRIBE EXTENDED query (without the column name) does not return any changes in the CatalogTable: I can only see the table statistics (i.e. sizeInBytes and rowCount).
This other answer suggests retrieving Spark statistics from a "table" in the metastore, but this is cryptic to me...
How can I access these column-level statistics within Spark?
Edit: I have investigated this further, which allows me to refine the scope of my question:
It looks like my Hive client (Hive View 2.0) and Spark SQL do not write the statistics in the same location.
When using Spark SQL's ANALYZE TABLE method, only table statistics show up in the Statistics key of the CatalogTable:
Statistics: sizeInBytes=15196959528, rowCount=39763901, isBroadcastable=false
However, Hive View is blind to these stats, which are not listed in my table statistics.
Conversely, when computing table or column statistics within Hive View, I can collect the table statistics with Spark SQL's DESCRIBE EXTENDED method, but they appear in the Properties key of my CatalogTable. It also indicates whether some column statistics have been computed:
Properties: [numFiles=20, transient_lastDdlTime=1540381765, totalSize=15196959528, COLUMN_STATS_ACCURATE={"COLUMN_STATS":{"bubble_level":"true","bubble_level_n1s":"true","timestamp":"true"}}]
Thus these pieces of information appear to be independent, and my question then becomes: which piece can be used by Spark to optimize the execution plan? I understand that some of these questions could be solved by upgrading to the latest version of Spark, but this is not on my schedule for the moment.
The above-mentioned answer from Jacek Laskowski suggests that Hive's statistics can be used if they are available through Spark SQL's DESCRIBE EXTENDED method.
Can anybody confirm this ?
Many thanks in advance for helping me to clear this up.
I find that Apache Spark is much slower than a MySQL server for the same query on the same table (run as a query on a Spark DataFrame).
So where would Spark be more efficient than MySQL?
Note: tried on a table with 1 million rows, all 10 columns of type text.
The size of the table in JSON is about 10 GB.
Using a standalone PySpark notebook on a 16-core Xeon with 64 GB RAM, with MySQL on the same server.
In general, I would like guidelines on when to use Spark vs. a SQL server, in terms of the size of the target data, to get really snappy results from analytic queries.
OK, so I'm going to try to help here even though it's still very difficult to answer this without knowing more. Assuming there is no contention for resources, there are a number of things going on here. If you're running on YARN and your JSON is stored in HDFS, it is likely split into many blocks, and those blocks are then processed in different partitions. Since JSON doesn't split very well, you'd lose a lot of parallel capability. Also, Spark isn't really meant for super-low-latency queries like a tuned RDBMS. Where you benefit from Spark is heavy data processing over large amounts of data (TB or PB). If you are looking for low-latency queries, you should use Impala or Hive with Tez. You should also consider changing your file format to Avro, Parquet or ORC.
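To make the file-format suggestion concrete, a small sketch (the paths, column name and query are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-to-parquet").getOrCreate()

// One-time conversion: scan the JSON once, then persist it as Parquet
// so later analytic queries hit a columnar, splittable format.
val raw = spark.read.json("hdfs:///data/events_json")
raw.write.mode("overwrite").parquet("hdfs:///data/events_parquet")

// Subsequent queries read the Parquet copy instead of the raw JSON.
val events = spark.read.parquet("hdfs:///data/events_parquet")
events.createOrReplaceTempView("events")
spark.sql("SELECT some_column, count(*) AS cnt FROM events GROUP BY some_column").show()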