Databricks: JAR file changed but the change is not reflected on the cluster - apache-spark

I created job 'a' on cluster 'A' and uploaded 'former.jar' as a dependent library.
I found that I had a typo in 'former.jar', so I replaced it with 'latter.jar' in the dependent libraries.
However, when I look at the log, I still see the same error log from 'former.jar'.
I tried creating a new job named 'b' on cluster 'A', but it did not pick up the change to 'latter.jar' either. However, when I created a new job named 'c' on cluster 'B', it worked.
Then I made a 'latterlatter.jar' and uploaded it to 'c', and that change was not picked up either.
I don't know what the problem is.

Related

How can I install flashtext on every executor?

I am using the flashtext library in a couple of UDFs. It works when I run it locally in client mode, but once I try to run it in the Cloudera Workbench with several executors, I get a ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help, the same error remains.
According to the Spark Configuration doc, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand what these py-files to be added would have to look like.
Would it be enough to just create a Python script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way how I can install flashtext on every node?
Edit:
In the meantime, I figured out that not only Flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. In order to fix it, I followed this article. I also took the source code from Flashtext and imported it into the main file without installing the actual library.
I think that in order to point the Spark executors to the Python modules extracted from your archive, you will need to add another config setting that adds their location to PYTHONPATH. Something like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
    .config('spark.executorEnv.PYTHONPATH', './myUDFs') \
    .getOrCreate()
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName] ... Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables.
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench), but if you're trying to run Spark on YARN, you may need to use a slightly different setting, spark.yarn.dist.archives.
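For reference, a minimal sketch of how that might look if you create the session yourself in YARN client mode; the archive name is the one from the question, everything else here is illustrative and not from the original answer:
from pyspark.sql import SparkSession

# Distribute the archive through YARN and expose the unpacked directory to executors
spark = SparkSession.builder \
    .master('yarn') \
    .config('spark.yarn.dist.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
    .config('spark.executorEnv.PYTHONPATH', './myUDFs') \
    .getOrCreate()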
Also, please make sure that your driver log contains a message confirming that the archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:
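Once the archive is distributed and PYTHONPATH points at the unpacked directory, the import should resolve on the executors. As a rough sketch only (the keyword, the column handling, and the assumption that flashtext is directly importable from ./myUDFs are mine, not from the question):
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

@udf(returnType=ArrayType(StringType()))
def extract_keywords(text):
    # Import inside the UDF so it is resolved against the executor's PYTHONPATH
    from flashtext import KeywordProcessor
    kp = KeywordProcessor()
    kp.add_keyword('spark')
    return kp.extract_keywords(text or '')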

Change temporary path for an individual job from Spark code

I have multiple jobs that I want to execute in parallel that append daily data into the same path using dynamic partitioning.
The problem I am facing is with the temporary path that Spark creates during job execution. Multiple jobs end up sharing the same temp folder, which causes conflicts: one job can delete temp files, and the other job then fails with an error saying an expected temp file doesn't exist.
Can we change the temporary path for an individual job, or is there any alternative way to avoid this issue?
To change the temp location you can do this:
/opt/spark/bin/spark-shell --conf "spark.local.dir=/local/spark-temp"
spark.local.dir changes where all temp files are read and written. I would advise creating this directory and setting its permissions from the command line before the first session is run with this argument.
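Since the question asks about doing this from Spark code, the same setting can also be supplied per job through the session builder. A minimal sketch, assuming the path and app name are placeholders you would adjust (and noting that cluster managers such as YARN can override spark.local.dir with their own environment variables):
from pyspark.sql import SparkSession

# Give this job its own scratch directory (placeholder path)
spark = SparkSession.builder \
    .appName('daily-append-job-a') \
    .config('spark.local.dir', '/local/spark-temp/job-a') \
    .getOrCreate()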

Junk Spark output file on S3 with dollar signs

I have a simple Spark job that reads a file from S3, takes five records, and writes them back to S3.
What I see is that there is always an additional file in S3, next to my output "directory", called output_$folder$.
What is it? How I can prevent spark from creating it?
Here is some code to show what I am doing...
x = spark.sparkContext.textFile("s3n://.../0000_part_00")
five = x.take(5)
five = spark.sparkContext.parallelize(five)
five.repartition(1).saveAsTextFile("s3n://prod.casumo.stu/dimensions/output/")
After the job I have an S3 "directory" called output, which contains the results, and another S3 object called output_$folder$, which I don't know the purpose of.
Changing S3 paths in the application from s3:// to s3a:// seems to have done the trick for me. The $folder$ files are no longer getting created since I started using s3a://.
OK, it seems I found out what it is.
It is some kind of marker file, probably used to determine whether the S3 directory object exists or not.
How did I reach this conclusion?
First, I found this link that shows the source of the org.apache.hadoop.fs.s3native.NativeS3FileSystem#mkdir method: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-td15078.html
Then I searched other source repositories to see whether I could find a different version of the method. I didn't.
In the end, I did an experiment: I reran the same Spark job after removing the S3 output directory object but leaving the output_$folder$ file. The job failed, saying that the output directory already exists.
My conclusion: this is Hadoop's way of knowing whether a directory with the given name exists in S3, and I will have to live with that.
All of the above happens when I run the job from my local dev machine, i.e. my laptop. If I run the same job from an AWS Data Pipeline, output_$folder$ does not get created.
s3n:// and s3a:// don't generate a marker directory like <output>_$folder$.
If you are using Hadoop with AWS EMR, I found that moving from s3 to s3n is straightforward since they both use the same file system implementation, whereas s3a involves AWS credential-related code changes.
('fs.s3.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3n.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
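If you want to confirm which implementation each scheme maps to in your own cluster, a small sketch like this should print values similar to the ones above. It assumes an existing SparkSession named spark and uses the Hadoop configuration that PySpark exposes through its JVM gateway (an internal handle, so treat it as a quick check rather than a stable API):
# Print the filesystem implementation configured for each S3 scheme
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
for scheme in ('fs.s3.impl', 'fs.s3n.impl', 'fs.s3a.impl'):
    print(scheme, hadoop_conf.get(scheme))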

GridGain zero deployment for GridRunnable class

I am using GridGain 6.5.6. On the same desktop, I am running 2 nodes: one is partitioned, and one is client_only, running from the IDE (Eclipse). I am using 'CONTINUOUS' as the deployment mode, with only one cache, named 'partitioned'.
My issue is: I have a static GridRunnable class, defined in the class that starts the 'client_only' node, whose run method only prints 'hello world'. The first time, it runs fine: the hello string prints out on the 'partitioned' node. Keeping the 'partitioned' node running, I then changed the string to 'hello world x', saved it in my IDE, and restarted 'client_only'; the 'partitioned' node still printed 'hello world'. After restarting 'client_only' once more, it started printing 'hello world x' on my 'client_only' node.
It looks like it should redeploy the code change in GridRunnable by itself. I am not sure where I went wrong. Please help!
For the GridRunnable to be redeployed automatically every time you change the code, you should change the deployment mode to SHARED. This means that once all cluster members that initiated the deployment are gone, the closure will be undeployed - in your case it is the CLIENT_ONLY node.
However, the same will apply to the data residing in caches - it will be undeployed, as well and the cache will be cleared. To avoid it, you should include the data you plan to cache in the classpath of each data node. Since classes on the local classpath do not get undeployed, the caches will not be cleared in this case.

Old logs are not imported into ES by logstash

When I start Logstash, the old logs are not imported into ES.
Only the new request logs are recorded in ES.
Now, I've seen this in the doc.
Even if I set start_position=>"beginning", old logs are not inserted.
This only happens when I run Logstash on Linux.
If I run it on Windows with the same config, old logs are imported.
I don't even need to set start_position=>"beginning" on Windows.
Any idea about this ?
When Logstash reads an input log file, it keeps a record of the position it has read up to in that file; that record is called the sincedb.
Where to write the sincedb database (keeps track of the current position of monitored log files).
The default will write sincedb files to some path matching "$HOME/.sincedb*"
So, if you want to import old log files, you must delete all the .sincedb* files in your $HOME.
Then, you need to set
start_position=>"beginning"
in your configuration file.
Hope this can help you.
Please see this line also.
This option only modifies "first contact" situations where a file is new and not seen before. If a file has already been seen before, this option has no effect.
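Putting it together, a minimal sketch of a file input that re-reads old logs; the log path is just a placeholder, and pointing sincedb_path at /dev/null is an alternative to deleting the .sincedb* files, so that Logstash never remembers its position:
input {
  file {
    path => "/var/log/myapp/*.log"      # placeholder path to your old logs
    start_position => "beginning"       # only applies to files not seen before
    sincedb_path => "/dev/null"         # don't persist read positions, so old logs are re-read
  }
}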
