I wanted to analyze the SQL queries executed by users through Spark, so I checked the Spark History Server logs. It seems the logging is only partial. For example, SELECT statements are logged, but statements like CREATE or DROP are not, and when I run INSERT INTO TABLE SELECT ..., only the SELECT part is logged; it doesn't say which table the data was inserted into. I am wondering whether something is wrong in my log settings or this is the correct behaviour. If it is, do you know what would be the best way to get historical data of queries running through Spark?
Thanks
I want to write the data from a PySpark DataFrame to external databases, say an Azure MySQL database. So far, I have managed to do this using .write.jdbc(),
spark_df.write.jdbc(url=mysql_url, table=mysql_table, mode="append", properties={"user":mysql_user, "password": mysql_password, "driver": "com.mysql.cj.jdbc.Driver" })
Here, if I am not mistaken, the only options available for mode are append and overwrite; however, I want to have more control over how the data is written. For example, I want to be able to perform update and delete operations.
How can I do this? Is it possible to say, write SQL queries to write data to the external databases? If so, please give me an example.
First, I suggest you use the dedicated Azure SQL connector: https://learn.microsoft.com/en-us/azure/azure-sql/database/spark-connector.
Then I recommend you use bulk mode, as row-by-row mode is slow and can incur unexpected charges if you have Log Analytics turned on.
Lastly, for any kind of data transformation, you should use an ELT pattern:
Load raw data into an empty staging table
Run SQL code, or even better a stored procedure, that performs the required logic, for example merging the staging data into a final table (a sketch of this pattern follows below)
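A rough sketch of that load-then-merge idea, here using plain JDBC rather than the dedicated connector mentioned above; the URL variables, the staging table and the stored procedure names are placeholders, and this assumes the SQL Server JDBC driver is on the classpath:

# 1. Bulk-load the raw DataFrame into an empty staging table
spark_df.write.jdbc(
    url=sql_url,
    table="dbo.staging_events",  # hypothetical staging table
    mode="overwrite",
    properties={"user": sql_user, "password": sql_password,
                "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"})

# 2. Run a stored procedure that merges/updates/deletes into the final table.
#    This executes on the driver through the JVM's DriverManager (py4j).
conn = spark._jvm.java.sql.DriverManager.getConnection(sql_url, sql_user, sql_password)
try:
    conn.createStatement().execute("EXEC dbo.usp_merge_staging_into_final")  # hypothetical proc
finally:
    conn.close()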
I am trying to run a Hive query with pyspark. I am using Hortonworks, so I need to use the Hive Warehouse Connector.
Running one or even multiple queries is easy and works. My problem is that I want to issue set commands beforehand, for instance to set the DAG name in the Tez UI (set hive.query.name=something relevant) or to set some memory configuration (set hive.tez.container.size=8192). For these statements to take effect, they need to run in the same session as the main query, and that's my issue.
I tried 2 ways:
The first one was to generate a new Hive session for each query, with a properly set up url, e.g.:
url='jdbc:hive2://hiveserver:10000/default?hive.query.name=relevant'
builder = HiveWarehouseSession.session(self.spark)
builder.hs2url(url)
hive = builder.build()
hive.execute("select * from whatever")
It works well for the first query, but the same url is reused for the next one (even if I try to manually delete builder and hive), so it does not work.
The second way is to set spark.sql.hive.thriftServer.singleSession=true globally on the Spark Thrift Server. This does seem to work, but I find it a shame to restrict the global Spark Thrift Server for the benefit of one application only.
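For reference, the way I set it (the start script location and other flags depend on your install) is simply:

./sbin/start-thriftserver.sh --conf spark.sql.hive.thriftServer.singleSession=true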
Is there a way to achieve what I am looking for? Maybe there could be a way to pin a query to one executor, so hopefully one session?
This has been a big pet peeve of mine... still is, actually.
The solution that resolved this issue for me was putting all the queries in a query file, with each query separated by a semicolon, and then running that file with beeline from within a Python script.
Unfortunately, it does not work with queries that return results... it's only suitable for set, overwrite, and insert kinds of queries.
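In case it helps, this is roughly what my script does (the JDBC URL, user and file path below are placeholders for your cluster):

import subprocess

# All statements in the file run in one beeline session, so the set
# commands apply to the queries that follow them.
hql = (
    "set hive.query.name=something_relevant;\n"
    "set hive.tez.container.size=8192;\n"
    "INSERT OVERWRITE TABLE target_table SELECT * FROM source_table;\n"
)
with open("/tmp/queries.hql", "w") as f:
    f.write(hql)

subprocess.run(
    ["beeline", "-u", "jdbc:hive2://hiveserver:10000/default",
     "-n", "hive_user", "-f", "/tmp/queries.hql"],
    check=True)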
In case you might have discovered a more efficient way to do this, please do share.
I'm trying to optimize a program with Spark SQL; it is basically one HUGE SQL query (it joins something like 10 tables with many CASE expressions, etc.). I'm more used to DataFrame-API-oriented programs, and those did show the different stages much better.
It's quite well structured and I understand it more or less. However, I have a problem: I always use the Spark UI SQL view to get hints on where to focus the optimizations, but for this kind of program it shows nothing. Is there a reason for this (or a way to force it to show)?
I'm expecting to see each join/scan with the number of output rows after it and so on, but I only see a full "WholeStageCodegen" for a "Parsed logical plan" which is something like 800 lines long.
I can't show the code, but it has the following characteristics:
1- The action triggering it is show(20)
2- It takes around 1 hour to execute (with few executors so far)
3- There is a persist before the show/action
4- It uses Kudu, Hive and in-memory tables (registered before this query)
5- The logical plan is around 700 lines long
Is there a way to improve the tracing here? (Maybe disabling WholeStageCodegen? But that might hurt performance...)
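For what it's worth, the switch I had in mind for that last point (assuming spark.sql.codegen.wholeStage is the right knob) is just:

spark.conf.set("spark.sql.codegen.wholeStage", "false")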
Thanks!
I get the correct count after I run the ANALYZE statement.
But my problem is that it needs to be run every time the data changes; technically, the count for the same partition should be updated on its own. Instead, it keeps returning the same count unless I execute the ANALYZE statement. This is the query I execute to get the count updated.
ANALYZE TABLE bi_events_identification_carrier_sam PARTITION(year, month, day) COMPUTE STATISTICS;
And executing it every time is not convenient at all. Any ideas?
Your count(*) query is using stats to get the result.
If you are using spark to write data, then you can set spark.sql.statistics.size.autoUpdate.enabled to true. This makes sure that Spark updates the table stats automatically after the write is done.
If you are using Hive, you can run set hive.stats.autogather=true;.
Once you enable these settings, then the write query will automatically update the stats and the subsequent read query will work fine.
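A minimal sketch of where these settings could go, assuming a PySpark session writing to a Hive-backed table (the first setting needs Spark 2.3+, and the Hive one can also live in hive-site.xml):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stats-auto-update")
         # Have Spark refresh table size stats after each write
         .config("spark.sql.statistics.size.autoUpdate.enabled", "true")
         # Hive-side setting from above; may also be set in hive-site.xml
         .config("hive.stats.autogather", "true")
         .enableHiveSupport()
         .getOrCreate())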
I'm trying to figure out why activity that I know is occurring isn't showing up in the SQL tab of the Spark UI. I am using Spark 1.6.0.
For example, we have a load of activity occurring today between 11:06 and 13:17, and I know for certain that the code being executed uses the Spark DataFrame API. Yet if I hop over to the SQL tab, I don't see any activity between those times.
So... I'm trying to figure out what influences whether or not activity appears in that SQL tab, because the information presented there is (arguably) the most useful in the whole UI, and when there's activity occurring that isn't showing up it becomes rather annoying. The only distinguishing characteristic seems to be that the jobs which do show up in the SQL tab use actions that don't write any data (e.g. count()), whereas the jobs that do write data don't seem to show up. I'm puzzled as to why.
Any pearls of wisdom?