I have a Spark FHIR server set up locally, and I've tried all sorts of ways to query (GET) a Patient resource that is referenced in Observation, Condition, Immunization, etc., so that the resulting bundle contains the index Patient resource as well as all resource instances containing a reference to that Patient resource.
This comes close (online Spark server):
http://fhir3.healthintersections.com.au/open/Patient/1/$everything
This is what I'm looking for (online HAPI server):
GET http://fhirtest.uhn.ca/baseDstu2/Patient?_id=14676&_revinclude=Immunization:patient&_revinclude=MedicationStatement:patient&_format=xml
Neither works on my localhost Spark server. Any help?
Neither the $everything operation nor the _revinclude functionality is implemented in Spark at this time. So short of modifying your own copy of the server, Spark cannot do this for you right now.
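In the meantime, a client can approximate _revinclude by issuing one search per referencing resource type and merging the returned bundles itself. A hypothetical sketch (base URL, port, and patient id are placeholders for your local Spark endpoint):
GET http://localhost:1234/fhir/Patient/14676
GET http://localhost:1234/fhir/Immunization?patient=Patient/14676
GET http://localhost:1234/fhir/MedicationStatement?patient=Patient/14676
The reference-style search parameters (patient=...) are standard FHIR search semantics, so they should work on servers that implement ordinary search even where _revinclude does not.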
Okay. Where to start? I am deploying a set of Spark applications to a Kubernetes cluster. I have one Spark Master, 2 Spark Workers, MariaDB, a Hive Metastore (that uses MariaDB - and it's not a full Hive install - it's just the Metastore), and a Spark Thrift Server (that talks to Hive Metastore and implements the Hive API).
So this setup is working pretty well for everything except the Thrift Server job (start-thriftserver.sh in the Spark sbin directory on the Thrift Server pod). By "working well" I mean that from outside my cluster I can create Spark jobs and submit them to the master, and then using the Web UI I can watch my test app run to completion utilizing both workers.
Now the problem. When you launch start-thriftserver.sh, it submits a job to the cluster with itself as the driver (I believe, which is correct behavior). And when I look at the related Spark job via the Web UI, I see it has workers that repeatedly get launched and then exit shortly thereafter. When I look at the workers' stderr logs, I see that every worker launches and tries to connect back to the Thrift Server pod at spark.driver.port. This is correct behavior, I believe. The gotcha is that the connection fails with an unknown host exception: it uses the raw Kubernetes pod name (not a service name, and with no IP in the name) of the Thrift Server pod when it says it can't find the Thrift Server that initiated the connection. Now, Kubernetes DNS registers service names, and pod names only in their dashed-private-IP form. In other words, the raw name of the pod (without an IP) is never registered with DNS. That is not how Kubernetes works.
So my question: I am struggling to figure out why the Spark worker pod is using a raw pod name to try to find the Thrift Server. It seems it should never do this, and that it should be impossible to ever satisfy that request. I have wondered if there is some Spark config setting that would tell the workers that the (Thrift) driver it needs to be searching for is actually spark-thriftserver.my-namespace.svc, but despite much searching I can't find anything.
There are so many settings that go into a cluster like this that I don't want to barrage you with info. One thing that might clarify my setup: the following string is dumped at the top of a worker log that fails. Notice the raw pod name of the Thrift Server in driver-url. If anyone has any clue what steps to take to fix this, please let me know. I'll edit this post and share settings etc. as people request them. Thanks for helping.
Spark Executor Command: "/usr/lib/jvm/java-1.8-openjdk/jre/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx512M" "-Dspark.master.port=7077" "-Dspark.history.ui.port=18081" "-Dspark.ui.port=4040" "-Dspark.driver.port=41617" "-Dspark.blockManager.port=41618" "-Dspark.master.rest.port=6066" "-Dspark.master.ui.port=8080" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#spark-thriftserver-6bbb54768b-j8hz8:41617" "--executor-id" "12" "--hostname" "172.17.0.6" "--cores" "1" "--app-id" "app-20220408001035-0000" "--worker-url" "spark://Worker#172.17.0.6:37369"
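For what it's worth, Spark does expose settings for the address the driver advertises to executors: spark.driver.host (the hostname executors connect back to) and spark.driver.bindAddress (the address the driver binds locally). A hedged sketch, assuming your Thrift Server pod is fronted by a Service named spark-thriftserver in namespace my-namespace (start-thriftserver.sh accepts spark-submit-style --conf flags):
./sbin/start-thriftserver.sh \
  --conf spark.driver.host=spark-thriftserver.my-namespace.svc.cluster.local \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.port=41617 \
  --conf spark.blockManager.port=41618
The Service would also need to expose the pinned driver and block manager ports so the workers' connections can land.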
In our system, we classically have two components: A Cloudera Hadoop cluster (CDH) and an OpenShift "backend" system. In HDFS, we have some huge .parquet files.
We now have a business requirement to "export the data by a user-given filter criterion" to a user in "realtime" as a downloadable file. So the flow is: the user enters a SQL-like filter string, for instance user='Theo' and command='execution'. He then sends a GET /export request to our backend service with the filter string as a parameter. The user shall now get a "download file" in his web browser and immediately start downloading that file as CSV (even if it's multiple terabytes or even petabytes in size; that's the user's choice if he wants to try it and wait that long). In fact, the cluster should respond synchronously, but not cache the entire response on a single node before sending the result; it should only receive data at the "internet speed" of the user and stream it directly to the user (with a buffer of e.g. 10 or 100 MB).
I now face the problem on how to best approach this requirement. My considerations:
I wanted to use Spark for that. Spark would read the Parquet file, apply the filter easily, and then "coalesce" the filtered result to the driver, which in turn streams the data back to the requesting backend/client. During this task, the driver should of course not run out of memory if the data is consumed by the backend/user too slowly, but should just have the executors deliver the data at the same speed as it is "consumed".
However, I face some problems here:
The standard use case is that the user has fine-grained filters, so his exported file contains something like only 1000 lines. If I submitted a new Spark job via spark-submit for each request, I'd already incur latencies of multiple seconds due to initialization and query-plan creation (even if the job is as simple as reading and filtering the data). I'd like to avoid that.
The cluster and the backend are strictly isolated. The operations guys ideally don't want us to reach the cluster from the backend at all; rather, the cluster should just call the backend. We are able to "open" maybe one port, but we'll possibly not be able to argue for something like "our backend will run the Spark driver while being connected to the cluster as execution backend".
Is it a "bad design smell" if we run a "server Spark job", i.e. we submit an application in "client" mode to the cluster master, where the application also opens a port for HTTP requests and only runs a Spark pipeline on request, but holds the Spark context open all the time (and is reachable from our backend via a fixed URL)? I know there is the "spark-job-server" project which does this, but it still feels a bit weird given the nature of Spark and jobs, where "naturally" a job would be to download a file, not to be a 24-hour running server waiting to execute some pipeline steps from time to time.
I have no idea how to limit Spark's result fetching so that the executors send at a speed that keeps the driver from running out of memory if the user requested petabytes. Any suggestion on this? (See the sketch after this list.)
Is Spark a good choice for this task after all, or do you have any suggestions for better tooling here? (At best in a CDH 5.14 environment, as we can't get the operations team to install any additional tool.)
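On the driver-memory point, one primitive worth knowing is Dataset.toLocalIterator(), which fetches one partition at a time to the driver, so memory stays bounded by partition size rather than result size, and executors effectively deliver only as fast as the consumer drains the iterator. A minimal sketch, assuming a long-lived Spark 2.x application (the path, filter, and CSV sink are placeholders):
import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession

object ExportServer {
  def main(args: Array[String]): Unit = {
    // One long-lived session, so each request pays only query time, not startup.
    val spark = SparkSession.builder().appName("export-server").getOrCreate()

    // Hypothetical path and filter; in a real server these come from the request.
    val df = spark.read.parquet("hdfs:///data/events.parquet")
      .filter("user = 'Theo' AND command = 'execution'")

    // toLocalIterator() brings one partition at a time to the driver, so driver
    // memory is bounded by the largest partition, not by the total result size.
    df.toLocalIterator().asScala.foreach { row =>
      // Stream each row out as a CSV line here (e.g. into the open HTTP response).
      println(row.mkString(","))
    }
    spark.stop()
  }
}
Note that each partition still triggers a job, so this only helps when paired with the long-lived-context approach that avoids the per-request startup cost.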
I am new to Spark and am trying to understand the code in my project so I can work on it. While creating the Spark session, I see in the code one config entry: .config("spark.yarn.jars", "local:/cloudera/opt/xx/xxjars/*").
I could not understand the URI scheme given as "local:/". What does it mean? Can someone please help?
I did some googling and found one page mentioning it as a scheme, but couldn't find any detail on what it refers to.
As I understand it, local:/path/to/file means that the file path is expected to exist on the local filesystem of each worker node, as opposed to, for example, HDFS (hdfs:///path/to/file).
So in the former case the file has to reside on each node's individual filesystem; in the latter case it is enough for it to be somewhere in HDFS, and it will be downloaded to the nodes when the Spark context fires up.
The behaviour is explained in the Spark Documentation:
Spark uses the following URL scheme to allow different strategies for disseminating jars:
file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
For large files it is better to use the local: scheme, or to keep them in HDFS with a replication factor equal to the number of nodes, so that the HDFS replica location of the file is always the same node your container is running on.
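To make the schemes concrete, here are hypothetical spark-submit invocations (class name, jar names, and paths are placeholders):
spark-submit --class com.example.App --jars local:/opt/libs/big-dep.jar app.jar (jar must already exist on every node)
spark-submit --class com.example.App --jars hdfs:///libs/big-dep.jar app.jar (jar is pulled down from HDFS)
spark-submit --class com.example.App --jars file:/home/me/libs/big-dep.jar app.jar (jar is served by the driver's file server)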
I have a requirement: I need to write a Spark job that connects to Prod (Source Hive, Server A), gets the data into Local (a temp Hive server), does the transformation, and loads it back into TargetProd (Server B).
In earlier cases, our target DB was Oracle, so we used to write something like the following, which overwrites the table:
AAA.write.format("jdbc").option("url", "jdbc:oracle:thin:@//uuuuuuu:0000/gsahgjj.yyy.com").option("dbtable", "TeST.try_hty").option("user", "aaaaa").option("password", "dsfdss").option("truncate", "true").mode("overwrite").save()
In terms of a Spark overwrite from Server A to Server B, what syntax do we need to use?
When I try to establish the connection through JDBC from one Hive (Server A) to Server B, it is not working. Please help.
You can connect to Hive using JDBC if it's a remote one. Get your Hive Thrift server (HiveServer2) URL and port details and connect via JDBC. It should work.
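A minimal sketch of what that might look like from Spark, assuming HiveServer2 on Server B listens on port 10000, the Hive JDBC driver is on the classpath, and the database, table, and credentials are placeholders (Spark's JDBC writer support against Hive's driver varies by version, so treat this as a starting point, not a guarantee):
resultDF.write
  .format("jdbc")
  .option("url", "jdbc:hive2://serverB:10000/target_db")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "target_db.try_hty")
  .option("user", "aaaaa")
  .option("password", "dsfdss")
  .mode("overwrite")
  .save()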
We're using a slightly modified community Mesosphere cluster. This has mesos-dns installed, so we can resolve master.mesos and x.marathon.mesos, no problem.
The question is which name we should use to access the Cassandra database (whether with cqlsh or with another application)?
I've found the following in the documentation: cassandra-dcos-node.cassandra.dcos.mesos (https://docs.mesosphere.com/services/cassandra/), but what if we change the cluster name (to, say, "foo")? Which bit gets modified? I've played around with all combos, but haven't worked it out.
In the case of Cassandra running on DCOS (which the docs refer to), the cluster name is dcos. The framework name registered with Mesos is cassandra.dcos. The task name for a running Cassandra server is cassandra.dcos.node.
If you were to change the cluster name to "foo", the framework name would now be cassandra.foo and the server task names would now be cassandra.foo.node.
To access your "foo" Cassandra cluster you would use cassandra-foo-node.cassandra.foo.mesos.
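For example, connecting with cqlsh would then look like this (hedged: 9042 is the default native CQL port, and yours may differ depending on how the framework configured Cassandra):
cqlsh cassandra-foo-node.cassandra.foo.mesos 9042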
Now, an explanation of how this works:
The DNS names that are created by mesos-dns follow a specific schema, all of which can be found in the official documentation[1].
To summarize the documentation here, mesos-dns creates a DNS name with the following format: taskName.frameworkName.mesos.
In the case of Cassandra, the task name is cassandra.dcos.node, which mesos-dns turns into cassandra-dcos-node, since it doesn't allow . in task names. The framework name cassandra.dcos is allowed to have . in it, so that stays the same. And mesos is the default value for the TLD.
When we put it all together, this gives cassandra-dcos-node.cassandra.dcos.mesos.
The original intent was to have a name of node.dcos.cassandra.mesos but due to time constraints and a misunderstanding of how mesos-dns worked, this is what we're left with. Hopefully it can be cleaned up in the future.
[1] http://mesosphere.github.io/mesos-dns/docs/naming.html
The default DNS name for the Cassandra server provided by the cassandra-mesos framework is cassandra.dcos.node. Per the mesos-dns spec, that name (dashed as cassandra-dcos-node) is prepended to the service cassandra and then to the domain dcos.mesos to form cassandra-dcos-node.cassandra.dcos.mesos.
If you are still unclear, the way to confirm the services name is to:
ssh into the server with mesos-dns (I'll assume it is the mesos-master)
follow the dns service: journalctl -u mesos-dns -f
register a cassandra-mesos service
you are looking for an A record entry similar to:
Jul 14 13:43:09 ip-10-0-7-2.us-west-2.compute.internal mesos-dns[1331]: VERY VERBOSE: 2015/07/14 13:43:09 generator.go:364: [A] cassandra-dcos-node.cassandra.dcos.mesos.: 10.0.1.171
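Alternatively, you can ask mesos-dns directly for the record with dig, assuming the resolver runs on the master (the IP here is taken from the log line above and is just an example):
dig @10.0.7.2 cassandra-dcos-node.cassandra.dcos.mesos A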