Spark SHC Core - Log region/regionserver - apache-spark

I'm using the SHC Spark connector by Hortonworks to read an HBase table:
https://github.com/hortonworks-spark/shc
I have some tasks that take a very long time to complete, and I suspect it's because of region size skew. I would like to confirm this by logging which region/region server each task is reading.
I tried turning on debug logs by doing the following in the driver:
Logger.getLogger("org").setLevel(Level.DEBUG);
Logger.getLogger("akka").setLevel(Level.DEBUG);
But it didn't seem to have any effect.
Is it possible to log the above somehow?

"it didn't seem to have any effect."
Yes, unfortunately SHC itself does not log the region/region server information anywhere during execution, which is why enabling DEBUG logging does not help.
"Is it possible to log the above somehow?"
Yes, but only if you know where and how to customize SHC's source code. You would need to insert your own logging statement, then rebuild, test, package, and ship it with your application.
Where to add it depends on your goal; for example, you might call logDebug() or logInfo() with the region name during a table-scan task. Here is the source code: HBaseTableScan.
The build, test, and ship details are in SHC's repo documentation.
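If the immediate goal is just to confirm region-size skew, one option that avoids patching SHC is to list the regions and their hosting servers from the driver with the plain HBase client API before kicking off the scan. A minimal sketch, assuming the HBase client jars are on the driver classpath; "my_table" is a placeholder for your table name:
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()  // picks up hbase-site.xml from the classpath
val connection = ConnectionFactory.createConnection(hbaseConf)
val locator = connection.getRegionLocator(TableName.valueOf("my_table"))  // placeholder table name
locator.getAllRegionLocations.asScala.foreach { loc =>
  // one line per region: name, hosting region server, and key range
  println(s"region=${loc.getRegionInfo.getRegionNameAsString} " +
    s"server=${loc.getHostnamePort} " +
    s"startKey=${Bytes.toStringBinary(loc.getRegionInfo.getStartKey)} " +
    s"endKey=${Bytes.toStringBinary(loc.getRegionInfo.getEndKey)}")
}
connection.close()
This does not tag each Spark task with its region, but cross-checking the key ranges against the slow tasks in the Spark UI is usually enough to spot a skewed region.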

Related

Databricks Connect: DependencyCheckWarning: The java class may not be present on the remote cluster

I was performing yet another execution of local Scala code against the remote Spark cluster on Databricks and got this:
Exception in thread "main" com.databricks.service.DependencyCheckWarning: The java class <something> may not be present on the remote cluster. It can be found in <something>/target/scala-2.11/classes. To resolve this, package the classes in <something>/target/scala-2.11/classes into a jar file and then call sc.addJar() on the package jar. You can disable this check by setting the SQL conf spark.databricks.service.client.checkDeps=false.
I have tried reimporting, cleaning, and recompiling the sbt project, to no avail.
Does anyone know how to deal with this?
Apparently the documentation has that covered:
spark.sparkContext.addJar("./target/scala-2.11/hello-world_2.11-1.0.jar")
I guess it makes sense that everything you write as code external to Spark is considered a dependency. So a simple sbt publishLocal and then pointing the command above at the resulting jar path will sort you out.
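For completeness, the exception message itself offers two routes; a short sketch reusing the spark session and the jar path from the question (your jar name and path will differ):
// Option 1: ship the packaged classes, as the documentation suggests
spark.sparkContext.addJar("./target/scala-2.11/hello-world_2.11-1.0.jar")
// Option 2: turn off the dependency check named in the exception message
// (this only silences the check; missing classes would still fail at runtime)
spark.conf.set("spark.databricks.service.client.checkDeps", "false")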
My main confusion came from the fact that I didn't need to do this for a long while, until at some point this mechanism kicked in. Rather inconsistent behavior, I'd say.
A personal observation after working with this setup: it seems you only need to publish the jar a single time. I have changed my code multiple times and the changes are reflected even though I have not been republishing jars for each change, which makes the whole task a one-off. Still confusing, though.

How do I pass in the google cloud project to the SHC BigTable connector at runtime?

I'm trying to access BigTable from Spark (Dataproc). I tried several different methods, and SHC seems to be the cleanest for what I am trying to do, and it performs well.
https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc
However, this approach requires that I put the Google Cloud project ID in hbase-site.xml, which means I need to build separate versions of the fat jar (containing my Spark code) for each environment I run in (prod, staging, etc.), which is something I'd like to avoid.
Is there a way for me to pass in the Google Cloud project ID at runtime?
As far as I can tell, the SHC library does not let you pass through HBase configs (looking in here).
The easiest thing would be to run an initialization action that gets the VM's project ID from the VM metadata and sets it in hbase-site.xml. We are working on an initialization action that does that and installs the HBase client for Bigtable. Check out the in-progress pull request, which would be a good starting point if you need to write one immediately. Otherwise, I expect the PR to get merged in the next couple of weeks.
Alternatively, consider adding an option in SHC for passing properties through to the HBaseConfiguration it creates. That would be a valuable feature for the broader community.
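For reference, the project-ID lookup such an init action does boils down to one HTTP call against the GCE metadata server. A rough Scala sketch (the endpoint and the Metadata-Flavor header are standard GCE metadata conventions; the hbase-site.xml property named in the comment is the one the Bigtable HBase client reads):
import java.net.URL
import scala.io.Source

val url = new URL("http://metadata.google.internal/computeMetadata/v1/project/project-id")
val conn = url.openConnection()
conn.setRequestProperty("Metadata-Flavor", "Google")  // required header for the metadata server
val projectId = Source.fromInputStream(conn.getInputStream).mkString.trim
println(s"Detected project: $projectId")
// An init action would write this value into google.bigtable.project.id in hbase-site.xml
// before the Spark job starts, so the fat jar never needs to hard-code it.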

Stop Spark executor logs from getting gzipped

I have a Spark job with some very long-running tasks. When the tasks start I can go to the executors tab, see all my executors and their tasks, and click on the stderr link to see the logs for those tasks, which helps a lot for monitoring. However, after a few hours the stderr link stops working; clicking it gives "java.lang.Exception: Cannot find this log on the local disk.".

I dug into it a bit, and the issue seems to be that something has decided to gzip the logs. That is, I can still find the log manually by ssh-ing to the worker node and looking in the correct directory (e.g. /mnt/var/log/hadoop-yarn/containers/application_1486407288470_0005/container_1486407288470_0005_01_000002/stderr.gz). It's annoying that this happens, since I now can't monitor my job from the UI. The files are also pretty tiny, so the compression doesn't seem helpful (40k uncompressed).

It seems like a lot of things could be causing this: YARN, a logroller cron job, the log4j config in my YARN/Spark distro, AWS (since EMR zips logs and saves them to S3), etc. I'm hoping someone can point me in the right direction so I don't have to search a ton of docs.
I'm using AWS EMR at emr-5.3.0 without any custom bootstrap steps.
I just had a similar issue. I haven't looked into how to stop the gzipping, but you can access the logs through the Hadoop interface.
In the left menu, go to Tools > Local logs.
Then browse to find the log you are interested in.
In my case, the gzipped log in the GUI was at /node/containerlogs/container_1498033803655_0037_01_000001/hadoop/stderr.gz/?start=-4096
and via the Local logs menu it was at
/logs/containers/application_1498033803655_0037/container_1498033803655_0037_01_000001/stderr.gz
Hope it helps.
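As a small supplement: once you have ssh-ed to the worker node, the compressed files are still readable directly with zcat, or programmatically. A minimal Scala sketch using the stderr.gz path from the question:
import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source

val path = "/mnt/var/log/hadoop-yarn/containers/application_1486407288470_0005/container_1486407288470_0005_01_000002/stderr.gz"
val in = new GZIPInputStream(new FileInputStream(path))
Source.fromInputStream(in).getLines().foreach(println)  // stream the decompressed log to stdout
in.close()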

How can I view CruiseControl.Net logs in real time?

I use CruiseControl.Net for continuous integration and I would like to read the log output of the current project in real time. For example, if it is running a compile command, I want to be able to see all the compile output so far. I can see where the log files are stored but it looks like they are only created once the project finishes. Is there any way to get the output in real time?
The CCTray app will let you see a snapshot of the last 5 or so lines of output of any command at a regular interval.
It's not a live update, as that would be too resource-intensive, as would be a full output of the log to date.
Unless you write something to capture and store the snapshots, you're out of luck. Doing this also presents the possibility of missing messages that appear between snapshots, so it would not be entirely reliable. It would, however, give you a slightly better idea of what is going on.
You can run ccnet.exe as a command line application instead of running ccservice as a Windows service. It will output to the terminal as it runs. It's useful for debugging.

Script to incrementally back up MySQL Workbench in Linux

I have a question about how to do incremental backups of a MySQL database (managed with MySQL Workbench) on Linux.
Can anyone tell me how to script this?
I want to back up every day and keep the incremental changes as difference files.
Can anyone give me a sample script for that?
Thanks,
Veasna.
The binary log (mysql-bin.log) is essentially an incremental backup. It allows you to restore the database to a previously stable state.
See "Making Incremental Backups by Enabling the Binary Log":
http://dev.mysql.com/doc/mysql-backup-excerpt/5.0/en/backup-policy.html
http://dev.mysql.com/doc/refman/5.6/en/backup-methods.html
May I know from where you are restarting the service: through the command prompt or from your control panel?
Share your error message here; you will get further details if anyone knows.
