I hope you will be able to help me with my question. Basically, I found that there is a significant difference in Hive query (Spark SQL) execution time when I execute it directly on the Edge Node versus when I execute it on my local machine connected to the remote Hive metastore.
When I execute a query like:
select max(column) from table
It seems like it first fetches the whole table to my PC and then executes the MAX over it, because nothing happens for 2 minutes or so, and then it goes to the stage phase, which takes just 2 seconds. When I looked at the query execution in the Web UI, it appeared that nothing happened for 2 minutes and then the query started executing locally.
I am wondering if you could advise how Spark processes remote queries. Is it the way I suspect, i.e. it fetches all the data from the table first and then executes the query on it locally? This seems like a real bottleneck in my case.
Thanks
Tom
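One way to check where the aggregation actually runs is to look at the plan Spark builds for the statement; a minimal sketch, with the table and column names used as placeholders:

# Inspect the plan: if it shows partial/final aggregates running on the cluster,
# the MAX is computed by the executors rather than after pulling rows to the client.
df = spark.sql("select max(column) from table")  # placeholder table/column
df.explain(True)  # prints the parsed, analyzed, optimized, and physical plans
df.show()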
Related
I have some simple code which reads an entire Hive table and loads it into SQL Server from Azure Databricks.
df = spark.sql("select * from Mechanics")
df.write.format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("url", sql_connection_properties) \
    .option("dbtable", glb.mechanics_invoices) \
    .option("user", username) \
    .option("password", pwd) \
    .option("bulkCopyBatchSize", 10000) \
    .save()
On executing this cell, the command keeps running for minutes and I am unable to kill it, because in the Spark UI neither a Job nor a Stage gets created. I can only see Running Queries (1), with no option to kill.
If I try the same for another table, or for glb.mechanics_invoices_temp, it runs successfully.
I tried to check locks in SQL Server, but I don't have the privileges to do so.
What mistake am I making here? The Hive table contains 2,700,490 rows.
How can I kill such processes, either from PySpark code or from the UI?
Any suggestions for making the above code more performant and fail-safe, especially using options like bulkCopyTableLock, bulkCopyTimeout, partitions, etc.?
Any resources to understand this situation better?
I appreciate your feedback and suggestions.
Thank You
How to kill such processes either by PySpark code or from the UI
You can kill the Spark application from the Web UI by accessing the application master page of the Spark job:
Open the Spark cluster UI (master page).
Expand Running Applications.
Find the job you want to kill.
Select kill to stop the job.
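To address the other half of the question (killing the work from PySpark code rather than the UI), one option is to tag the work with a job group and cancel that group. A minimal sketch; the group name "mechanics-load" is just a placeholder, and this only takes effect once Spark has actually scheduled jobs for the query:

# Tag everything triggered from this point with a job group so it can be cancelled later.
spark.sparkContext.setJobGroup("mechanics-load", "Load Mechanics table into SQL Server")

# ... kick off the long-running write here ...

# From another cell or thread, cancel all jobs in that group:
spark.sparkContext.cancelJobGroup("mechanics-load")

# Last resort: cancel every active job on this SparkContext.
spark.sparkContext.cancelAllJobs()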
I am running Spark SQL queries on a Hive table stored on a remote HDFS, but I am observing that the same SQL query takes different amounts of time to execute.
Now I want to do a POC comparing our old configuration and our new configuration, but I cannot figure out how to do this if the execution times vary by that much.
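One common way to get comparable numbers despite run-to-run variance (caching, cluster load, JVM warm-up) is to run the same query several times on each configuration and compare medians rather than single runs. A rough sketch; the query text and the number of iterations are placeholders:

import time
import statistics

QUERY = "select max(column) from table"  # placeholder query
timings = []

for _ in range(5):
    start = time.time()
    spark.sql(QUERY).collect()  # force full execution of the query
    timings.append(time.time() - start)

print("runs:", [round(t, 2) for t in timings])
print("median:", round(statistics.median(timings), 2), "seconds")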
I develop Spark SQL to run against Hadoop. Today I must run a Spark job that invokes my query. Is there another way to do this? I find I spend too much time fixing side issues with running the jobs in Spark. Ideally, I want to be able to compose and execute Spark SQL queries directly against Hadoop/HBase and bypass the Spark job altogether. This would permit much more rapid iteration when debugging or trying alternate queries.
Note that my queries are often 100 lines long or more, so working from the command line is challenging.
I have to do this from a Windows workstation.
The best thing you can use for HBase is Apache Phoenix. It provides a SQL interface.
As an example, on my last project I used NiFi with Phoenix to read and mutate HBase data. It worked great from the command line. I did discover a bug in my usage of it.
See https://phoenix.apache.org/Phoenix-in-15-minutes-or-less.html. You can use a SQL file. In addition, you can use Hue.
I have never tried the following on Windows, but it is possible. See https://community.cloudera.com/t5/Community-Articles/How-to-connect-a-Windows-JDBC-client-to-Cluster-enabled-with/ta-p/247787
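If you want to fire ad hoc SQL at Phoenix from a script rather than from sqlline, one option is the Phoenix Query Server together with the third-party phoenixdb package. A sketch only; the host, port, and file name are placeholders, and the file is assumed to contain a single SELECT statement:

import phoenixdb

# Connect to the Phoenix Query Server (default port 8765); the URL is a placeholder.
conn = phoenixdb.connect("http://queryserver-host:8765/", autocommit=True)
cur = conn.cursor()

# Long queries can live in a .sql file, which avoids pasting 100+ line
# statements on the command line.
with open("my_query.sql") as f:
    cur.execute(f.read())

for row in cur.fetchall():
    print(row)

conn.close()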
Here is the short story:
A BI tool (Power BI) connects to a Spark cluster and uses the HiveThriftServer2 application to get aggregated data via Hive queries.
However, each query takes a lot of time, since it reads the data from files every time. I would like to cache my table in this application and am looking for a way to send the query "cache table myTable" through the same channel, so that subsequent queries will run quickly.
What would be a solution for sending a Hive query to a specific application? If it matters, the application is Spark's Thrift service.
Thanks a lot!
Looks like I succeeded in doing it by installing the Spark ODBC driver, using it to connect to the Thrift server, and sending the SQL query "cache table xxx". I wonder if there is a more elegant way.
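For reference, the same statement can also be sent from a small script over that ODBC connection; a minimal sketch, assuming the pyodbc package and a DSN named "SparkThrift" (a placeholder) configured for the Spark ODBC driver:

import pyodbc

# "SparkThrift" is a placeholder DSN pointing at the Spark Thrift server.
conn = pyodbc.connect("DSN=SparkThrift", autocommit=True)
cur = conn.cursor()

# Cache the table inside the Thrift server application so later BI queries hit memory.
cur.execute("CACHE TABLE myTable")
conn.close()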
I have a Hadoop job running on HDInsight that sources data from Azure DocumentDB. This job runs once a day, and as new data comes into DocumentDB every day, my Hadoop job filters out old records and only processes the new ones (this is done by storing a timestamp somewhere). However, if new records come in while the Hadoop job is running, I don't know what happens to them. Are they fed to the running job or not? How does the throttling mechanism in DocumentDB play a role here?
If new records come in while the Hadoop job is running, I don't know what happens to them. Are they fed to the running job or not?
The answer to this depends on what phase or step the Hadoop job is in. Data gets pulled once, at the beginning. Documents added while the data is being pulled will be included in the Hadoop job results. Documents added after the data has finished being pulled will not be included in the Hadoop job results.
Note: ORDER BY _ts is needed for consistent behavior, as the Hadoop job simply follows the continuation token when paging through query results.
"How does the throttling mechanism in DocumentDB play a role here?"
The DocumentDB Hadoop connector will automatically retry when throttled.
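To make the daily filter and the ordering explicit, the query handed to the connector can combine the stored timestamp watermark with the ORDER BY _ts mentioned above. A sketch only; the watermark value and how it is persisted are placeholders:

# last_run_ts is the epoch-seconds watermark saved by the previous daily run (placeholder).
last_run_ts = 1700000000

# Pull only documents newer than the watermark, ordered by _ts so the
# continuation-token paging described above behaves consistently.
documentdb_query = (
    "SELECT * FROM c "
    f"WHERE c._ts > {last_run_ts} "
    "ORDER BY c._ts"
)
print(documentdb_query)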