Databricks Connect: DependencyCheckWarning: The java class may not be present on the remote cluster - apache-spark

I was running yet another execution of local Scala code against the remote Spark cluster on Databricks and got this:
Exception in thread "main" com.databricks.service.DependencyCheckWarning: The java class <something> may not be present on the remote cluster. It can be found in <something>/target/scala-2.11/classes. To resolve this, package the classes in <something>/target/scala-2.11/classes into a jar file and then call sc.addJar() on the package jar. You can disable this check by setting the SQL conf spark.databricks.service.client.checkDeps=false.
I have tried reimporting, cleaning and recompiling the sbt project to no avail.
Anyone know how to deal with this?

Apparently the documentation has that covered:
spark.sparkContext.addJar("./target/scala-2.11/hello-world_2.11-1.0.jar")
I guess it makes sense that any code you write outside of Spark itself is considered a dependency. So a simple sbt publishLocal and then pointing to the jar path in the above command will sort you out.
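For reference, here is a minimal sketch of that workflow, assuming a standard sbt layout (the jar name and path are just the hello-world example from the command above):
// build the jar first, e.g. with sbt package or sbt publishLocal
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    // register the locally built jar so Databricks Connect can ship these classes to the cluster
    spark.sparkContext.addJar("./target/scala-2.11/hello-world_2.11-1.0.jar")
    // ... code that uses classes from that jar ...
    spark.stop()
  }
}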
My main confusion came from the fact that I didn't need to do this for a very long while, until at some point this mechanism kicked in. Rather inconsistent behavior, I'd say.
A personal observation after working with this setup is that you seem to only need to publish the jar a single time. I have been changing my code multiple times and the changes are reflected, even though I have not been republishing jars for them. That makes the whole task a one-off. Still confusing, though.

Related

How to update classes of functional objects (Callable) in hazelcast without restarting

I found two options for adding classes to Hazelcast:
User code deployment (a minimal client-side sketch is shown after these two options):
clientUserCodeDeploymentConfig.addClass(cz.my.DemoTask.class);
The problem is that when I change the code in this task I get this exception:
java.lang.IllegalStateException: Class com.model.myclass is already in a local cache and conflicting byte code representation
Use some serialization mechanism like IdentifiedDataSerializable or Portable and add the jar to both the Hazelcast client and server via configuration.
Even though this is versioned, when you need to change the Task you have to update the jar and restart the server.
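For context, a rough sketch (in Scala) of how the client side of option 1 is typically wired up, assuming Hazelcast 3.9+ and that user code deployment is also enabled on the members; the class name is the one from the snippet above. This only shows the wiring and does not solve the redeploy conflict from the exception:
import com.hazelcast.client.HazelcastClient
import com.hazelcast.client.config.ClientConfig

val clientConfig = new ClientConfig()
clientConfig.getUserCodeDeploymentConfig
  .setEnabled(true)
  .addClass("cz.my.DemoTask") // ships this class's bytecode to the members
val client = HazelcastClient.newHazelcastClient(clientConfig)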
So are there any other options?
I found a similar issue, almost two years old, where it is mentioned:
For the functional objects, we don't have a solution in place but it is on the road map.
So I am curious whether there is any update on this.

gitlab runner errors occasionally

I have a GitLab setup with runners on a dedicated VM (24 GB, 12 vCPUs and a very low runner concurrency of 6).
Everything worked fine until I added more browser tests - 11 at the moment.
These tests are in the browser-test stage and start properly.
My problem is that they sometimes succeed and sometimes fail, with totally random errors.
Sometimes the runner cannot resolve a host, other times it is unable to find an element on the page.
If I rerun the failed tests, everything always goes green.
Does anyone have an idea of what is going wrong here?
BTW, I've checked: this dedicated VM is not overloaded.
I have resolved all of my initial issues (not yet tested under full machine load); however, I've decided to post some of my experiences.
First of all, I was experimenting with gitlab-runner concurrency (to speed things up) and it turned out that it filled my storage space really quickly. So for anybody experiencing storage shortages, I suggest installing this package.
Secondly, I was using the runner cache and artifacts, which in the end were cluttering my tests a bit, and I believe that was the root cause of my problems.
My observations:
If you want to take advantage of the cache in gitlab-runner, remember that by default it is only accessible on the host where the runner starts, and that the cache is restored on top of your working directory, meaning it overrides files from your project.
Artifacts are a little more flexible, because they are stored on and fetched from your GitLab installation. You should develop your own naming convention (using variables) for them, to control what is fetched/cached between stages and to make sure everything works as you would expect.
Cache and artifacts in your tests should be used with caution and understanding, because they can introduce plenty of problems if not used properly.
Side note:
Although my VM was not overloaded, certain storage lags were causing timeouts in the network and, finally, in Dusk when running multiple gitlab-runners concurrently.
Update as of 2019-02:
Finally, I have tested this under full load, and I can confirm that my earlier side note about machine overload is more than true.
After tweaking Linux parameters to handle heavy load (max open files, connections, sockets, timeouts, etc.) on the hosts running gitlab-runners, all concurrent tests pass green, without any strange, occasional errors.
Hope this helps anybody configuring gitlab-runners.

How do I pass in the google cloud project to the SHC BigTable connector at runtime?

I'm trying to access BigTable from Spark (Dataproc). I tried several different methods and SHC seems to be the cleanest for what I am trying to do and performs well.
https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc
However, this approach requires that I put the Google Cloud project ID in hbase-site.xml, which means I need to build separate versions of the fat jar containing my Spark code for each environment I run in (prod, staging, etc.), which is something I'd like to avoid.
Is there a way for me to pass in the Google Cloud project ID at runtime?
As far as I can tell, the SHC library does not let you pass through HBase configs (looking in here).
The easiest thing would be to run an init action that gets the VM's project ID from VM metadata and sets it in hbase-site.xml. We are working on an initialization action that does that and installs the HBase client for Bigtable. Check out the in-progress pull request, which would be a good starting point if you need to write one immediately. Otherwise, I expect the PR to get merged in the next couple of weeks.
Alternatively, consider adding an option in SHC for passing through properties to the HBaseConfiguration it creates. That would be a valuable feature for the broader community.
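To make that second idea concrete, such an option would essentially just have to copy a runtime-supplied value onto the HBase configuration. A rough sketch of the general idea in Scala (the spark.bigtable.project.id conf key is made up for illustration; as noted above, SHC does not currently expose such a hook):
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// e.g. submitted with --conf spark.bigtable.project.id=my-staging-project
val projectId = spark.conf.get("spark.bigtable.project.id")
val hbaseConf = HBaseConfiguration.create() // still picks up any hbase-site.xml on the classpath
hbaseConf.set("google.bigtable.project.id", projectId)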

Node-Windows service starts multiple instances

I'm running some file management tasks through a Node script. The node-windows package is included to allow me to run this script as a Windows service. I encountered a serious error this morning when I realized that the service had started a duplicate instance of the same script. This is very bad: it corrupted 24 hours' worth of data because both scripts were trying to process the same data sets and ended up shredding them. I've never seen a Windows service allow something like this. Has anyone else had this problem, or any idea what is causing it?
See my comment about node-windows instances.
The real problem, the data corruption, doesn't have anything to do with node-windows. The Node script should have fault tolerance for this. More specifically, it should implement file locking, which is a standard practice to prevent exactly this scenario.
There are a couple of file locking modules available. lockfile is what npm uses. There is also another project called proper-lockfile, which solves the problem in a slightly different (more Windows-friendly) way.
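For illustration, the pattern those modules implement looks roughly like this. The sketch below uses plain JVM file locking via java.nio rather than the Node modules, purely to show the shape of the idea:
import java.io.RandomAccessFile

// acquire an exclusive lock on a sentinel file before touching the shared data set
val channel = new RandomAccessFile("data.lock", "rw").getChannel
val lock = channel.tryLock() // returns null if another process already holds the lock
if (lock == null) {
  println("another instance is already processing the data set; skipping this run")
} else {
  try {
    // ... process the data set ...
  } finally {
    lock.release()
    channel.close()
  }
}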

Running mesos-local for testing a framework fails with Permission denied

I am sharing a Linux box with some coworkers, all of them developing in the Mesos ecosystem. The most convenient way to test a framework that I am hacking on is commonly to run mesos-local.sh (which combines both master and slaves in one).
That works great as long as none of my coworkers does the same. As soon as one of them has used that shortcut, nobody else can anymore, because the master-specific temp files are stored in /tmp/mesos and the user that ran that instance of Mesos owns those files and folders. So when another user tries to do the same thing, something like the following will happen when trying to run any task from a framework:
F0207 05:06:02.574882 20038 paths.hpp:344] CHECK_SOME(mkdir): Failed to create executor directory '/tmp/mesos/0/slaves/201402051726-3823062160-5050-31807-0/frameworks/201402070505-3823062160-5050-20015-0000/executors/default/runs/d46e7a7d-29a2-4f66-83c9-b5863e018fee' Permission denied
Unfortunately, mesos-local.sh does not offer a flag for overriding that path, whereas mesos-master.sh does via --work_dir=VALUE.
Hence the obvious workaround is to not use mesos-local.sh but to run master and slave as separate instances. Not too convenient, though.
The easiest workaround for preventing that problem, no matter whether you run mesos-master.sh or mesos-local.sh, is to patch the environment setup within bin/mesos-master-flags.sh.
That file is used by both the mesos-master itself and mesos-local, hence it is the perfect place to override the work directory.
Edit bin/mesos-master-flags.sh and add the following to it:
export MESOS_WORK_DIR=/tmp/mesos-"$USER"
Now run bin/mesos-local.sh and you should see something like this at the beginning of its log output:
I0207 05:36:58.791069 20214 state.cpp:33] Recovering state from '/tmp/mesos-tillt/0/meta'
With that, all users who have patched their mesos-master-flags.sh accordingly get their own personal work-dir setup, and there is no more stepping on each other's feet.
And if you prefer not to patch any files, you can just as well set the environment variable manually when starting that Mesos instance:
MESOS_WORK_DIR=/tmp/mesos-foo bin/mesos-local.sh
