I am new to Spark and trying to understand the code in my project in order to work on it. While creating the Spark session, I see one config entry in the code: .config("spark.yarn.jars", "local:/cloudera/opt/xx/xxjars/*").
I could not understand the URI scheme given as "local:/". What does it mean? Can someone please help?
I did some googling and found one page mentioning it as a scheme, but couldn't find any detail on what it refers to.
As I understand it, "local://path/to/file" means that the file path is expected to be on the local filesystem of each worker node, as opposed to, for example, HDFS (hdfs:///path/to/file).
So in the former case the file has to reside on each node's individual filesystem; in the latter case it is enough if it is somewhere in HDFS, and it will be downloaded to the nodes when the Spark context is fired up.
The behaviour is explained in the Spark Documentation:
Spark uses the following URL scheme to allow different strategies for disseminating jars:
file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
For large files it is better to use the local: scheme, or to keep them in HDFS with a replication factor equal to the number of nodes, so that the HDFS replica of the file is always on the same node your container is running on.
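As a toy illustration in plain Python (this is not Spark's actual implementation; the helper name and the strategy strings are made up for this sketch), the part before the first colon is the URI scheme, and it selects the dissemination strategy described above:

```python
from urllib.parse import urlparse

# Hypothetical lookup mirroring the documented schemes; illustration only.
STRATEGIES = {
    "file": "served by the driver's HTTP file server",
    "hdfs": "pulled down from the URI",
    "http": "pulled down from the URI",
    "https": "pulled down from the URI",
    "ftp": "pulled down from the URI",
    "local": "expected to already exist on each worker node (no network IO)",
}

def dissemination_strategy(uri: str) -> str:
    # Bare paths have no scheme and behave like file: paths.
    scheme = urlparse(uri).scheme or "file"
    return STRATEGIES[scheme]

print(dissemination_strategy("local:/cloudera/opt/xx/xxjars/app.jar"))
```

So the local: prefix in your config tells YARN that the jars are already present at that path on every node, and nothing needs to be shipped over the network.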
Question - Can I share the same Hazelcast cluster (cache) between multiple applications while using the write-behind and read-through functionality via MapStore and MapLoader?
Details
I have an enterprise environment with multiple applications and want to use a single cache.
I have multiple applications (microservices), i.e. APP_A, APP_B and APP_C, independent of each other.
I am running one instance of each application, and each node will be a member node of the cluster.
APP_A has MAP_A, APP_B has MAP_B and APP_C has MAP_C. Each application has a MapStore for its respective map.
If a client sends a command instance.getMap("MAP_A").put("Key","Value"), the behaviour is inconsistent: sometimes I see the data persisted in the database, sometimes not.
Note - I want to use the same Hazelcast instance across all applications, so that app A can access data from app B and vice versa.
I am assuming this depends on which node handles the request. If the request is handled by node A it works fine, but it fails if the request is handled by node B or C. I am assuming this is because the MapStore_A implementation is not available on nodes B and C.
Am I doing something wrong? Is there something we can do to overcome this issue?
Thanks in advance.
Hazelcast is a clustered solution. If you have multiple nodes in the cluster, the data in each may get moved from place to place when data rebalancing occurs.
As a consequence of this, map store and map loader operations can occur from any node.
So all nodes in the cluster need the same ability to connect to the database.
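A toy Python simulation (hypothetical names, not the Hazelcast API) of why this matters: the member that owns a key's partition is the one that invokes the MapStore, so if only node A has MapStore_A wired up, puts whose keys hash to B or C never reach the database; exactly the "sometimes persists, sometimes not" symptom described above:

```python
import zlib

# Toy partition-ownership model; names are made up for this sketch.
nodes = ["A", "B", "C"]
stores_configured_on = {"A"}   # only node A has the MapStore wired up

database = {}                  # stands in for the real backing database

def owner_of(key: str) -> str:
    # Deterministic stand-in for partition ownership by key hash.
    return nodes[zlib.crc32(key.encode()) % len(nodes)]

def put(key: str, value: str) -> str:
    owner = owner_of(key)
    if owner in stores_configured_on:   # write-behind runs only on the owner
        database[key] = value
    return owner
```

Keys owned by B or C are silently never persisted. Configuring the same MapStore (and database connectivity) on every member removes the inconsistency.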
In our system, we classically have two components: a Cloudera Hadoop cluster (CDH) and an OpenShift "backend" system. In HDFS, we have some huge .parquet files.
We now have a business requirement to "export the data by a user-given filter criterion" to a user in "realtime" as a downloadable file. The flow is: the user enters a SQL-like filter string, for instance user='Theo' and command='execution'. He then sends a GET /export request to our backend service with the filter string as a parameter. The user shall now get a "download file" in his web browser and immediately start downloading it as CSV (even if it is multiple terabytes, or even petabytes, in size; that's the user's choice if he wants to try it and wait that long). In fact, the cluster should respond synchronously, but not cache the entire response on a single node before sending the result; it should receive data only at the "internet speed" of the user and stream it directly to the user (with a buffer of e.g. 10 or 100 MB).
I now face the problem on how to best approach this requirement. My considerations:
I wanted to use Spark for this. Spark would read the Parquet files, apply the filter easily, and then "coalesce" the filtered result to the driver, which in turn streams the data back to the requesting backend/client. During this task, the driver should of course not run out of memory if the data is sent back to the backend/user too slowly; it should just have the executors deliver the data at the same speed as it is "consumed".
However, I face some problems here:
The standard use case is that the user has fine-grained filters, so that his exported file contains something like 1000 lines only. If I submitted a new Spark job via spark-submit for each request, I would already incur latencies of multiple seconds due to initialization and query-plan creation (even if the job is as simple as reading and filtering the data). I'd like to avoid that.
The cluster and the backend are strictly isolated. The operations guys ideally don't want us to reach the cluster from the backend at all; rather, the cluster should just call the backend. We are able to "open" maybe one port, but we will possibly not be able to argue for something like "our backend will run the Spark driver but be connected to the cluster as execution backend".
Is it a "bad design smell" if we run a "server Spark job", i.e. we submit an application in "client" mode to the cluster master which also opens a port for HTTP requests and only runs a Spark pipeline on request, but holds the Spark context open all the time (and is reachable from our backend via a fixed URL)? I know there is the "spark-job-server" project which does this, but it still feels a bit weird given the nature of Spark and jobs, where "naturally" a job would be to download a file, not a 24h-running server waiting to execute some pipeline steps from time to time.
I have no idea how to limit Spark's result fetching so that the executors send at a pace at which the driver won't run out of memory if the user requested petabytes. Any suggestions on this?
Is Spark a good choice for this task after all, or do you have suggestions for better tooling here? (Ideally in a CDH 5.14 environment, as we won't get the operations team to install any additional tool.)
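The bounded-buffer behaviour asked for above can be sketched independently of Spark: a producer (standing in for the executors' results) blocks on a fixed-size queue until the consumer (standing in for the HTTP response to the user) drains it. This is plain Python with made-up sizes, not an actual Spark API:

```python
import queue
import threading

buf = queue.Queue(maxsize=100)   # bounded buffer: producer blocks when full

def produce(rows):
    for row in rows:
        buf.put(row)             # blocks if the consumer is slow (backpressure)
    buf.put(None)                # sentinel marking the end of the result set

def consume():
    out = []
    while (row := buf.get()) is not None:
        out.append(row)          # in the real flow: stream this chunk to the client
    return out

producer = threading.Thread(target=produce, args=(range(1000),))
producer.start()
result = consume()
producer.join()
```

In Spark itself, something in this spirit is what `toLocalIterator()` on a DataFrame gives you: partitions are fetched to the driver one at a time as the iterator is consumed, instead of collecting the whole result at once, so driver memory is bounded by the largest partition rather than the full result.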
I'm working with Spark and YARN as my resource manager. I'm trying to find a way to gather the resources allocated for a job after a run. The resource manager only reports current usage, so after the job is complete it's zeroed out.
If I can't get them after the fact, is there a way to have the Spark job accumulate stats as it runs, to output/store at the end?
Try using the Spark History Server:
Viewing After the Fact
It is still possible to construct the UI of an application through Spark’s history server, provided that the application’s event logs exist. You can start the history server by executing:
./sbin/start-history-server.sh
This creates a web interface at http://<server-url>:18080 by default, listing incomplete and completed applications and attempts.
When using the file-system provider class (see spark.history.provider below), the base logging directory must be supplied in the spark.history.fs.logDirectory configuration option, and should contain sub-directories that each represents an application’s event logs.
The spark jobs themselves must be configured to log events, and to log them to the same shared, writable directory. For example, if the server was configured with a log directory of hdfs://namenode/shared/spark-logs, then the client-side options would be:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode/shared/spark-logs
I have the Spark FHIR server set up locally, and I've tried all sorts of ways to query (GET) a patient resource that is referenced in Observation, Condition, Immunization etc., so that the resulting bundle contains the index Patient resource as well as all resource instances containing a reference to that Patient resource.
This comes close (online SPARK server):
http://fhir3.healthintersections.com.au/open/Patient/1/$everything
This is what I'm looking for (online HAPI server):
GET http://fhirtest.uhn.ca/baseDstu2/Patient?_id=14676&_revinclude=Immunization:patient&_revinclude=MedicationStatement:patient&_format=xml
Neither works on my localhost SPARK server. Any help?
Neither the $everything operation nor the _revinclude functionality is implemented in Spark at this time. So, other than modifying your own copy of the server, Spark cannot do this for you right now.
I have a NodeJS service which talks to a Couchbase cluster to fetch data. The Couchbase cluster has 4 nodes (running on ip1, ip2, ip3, ip4), and the service also runs on the same 4 servers. On all the NodeJS services, my connection string looks like this:
couchbase://ip1,ip2,ip3,ip4
But whenever I try to fetch some document from bucket X, the console shows the node on ip4 doing that operation. No matter which NodeJS application makes the request, the same ip4 serves all requests.
I want each NodeJS server to use its own Couchbase node so that RAM and CPU consumption is equal on all the servers, so I changed the order of the IPs in the connection string, but every time the request is served by the same ip4.
I created another bucket, put my data in it, and tried to fetch it, but again the request went to the same ip4. Can someone explain why this is happening, and can it cause high load on one of the nodes?
What do you mean by "I want each NodeJS server to use their couchbase node"?
In Couchbase, part of the active dataset is on each node in the cluster, and the sharding is automatic. When you have a cluster, the 1024 active vBuckets (shards) of each bucket are spread out across all the nodes of the cluster, so with your 4 nodes there will be 256 vBuckets per node. Given the consistent hashing algorithm used by the Couchbase SDK, it can tell from the key which vBucket the object goes into, and, combined with the cluster map it got from the cluster, it knows which node that vBucket lives on. So if things are configured correctly, an app will be getting data from each of the nodes in the cluster, as the data is evenly spread out.
On the filesystem, the Couchbase install includes a CLI tool called vbuckettool that takes an object ID and a cluster map as arguments. All it does is run the consistent hashing algorithm against the cluster map, so you can predict where an object will go even if it does not exist yet.
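A simplified sketch of the key-to-vBucket mapping in plain Python (Couchbase's SDKs use a CRC32-based hash, but the exact bit manipulation and map layout below are approximations for illustration, not the SDK's implementation):

```python
import zlib

NUM_VBUCKETS = 1024

def vbucket_for(key: str) -> int:
    # CRC32-style hash of the key, folded into the vBucket range.
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS

def node_for(key: str, vbucket_map) -> str:
    # vbucket_map: one entry per vBucket naming the node that owns it.
    return vbucket_map[vbucket_for(key)]

# Illustrative 4-node cluster map: 256 vBuckets per node, as described above.
vbucket_map = [f"ip{(vb % 4) + 1}" for vb in range(NUM_VBUCKETS)]
```

Because the mapping depends only on the key and the cluster map, which client node issues the request has no influence on which Couchbase node serves it; that is decided entirely by the key's vBucket, regardless of the IP order in the connection string.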
On a different note, best practice in production is not to run your application on the same nodes as Couchbase. It really is supposed to be separate, to get the most out of its shared-nothing architecture, among other reasons.