Spark Thrift Server force metadata refresh - apache-spark

I'm using Spark to create a table in the Hive metastore, then connecting from MSSQL to the Spark Thrift Server to query that table.
The table is created with:
df.write.mode("overwrite").saveAsTable("TableName")
The problem is that every time after I overwrite the table (it's a daily job), I get an error when I connect with MSSQL. If I restart the Thrift Server it works fine, but I want to automate this, and restarting the server every time seems a bit extreme.
The most likely culprit is the Thrift Server's cached metadata, which is no longer valid after the table overwrite. How can I force Thrift to refresh the metadata after I overwrite the table, before it's accessed by any of the clients?
I could settle for a solution for MSSQL, but there are other "clients" of the table, not just MSSQL. I would prefer to force the metadata refresh from Spark (or a Linux terminal) after I finish the overwrite, rather than ask each client to run a refresh command before it requests the data.
Note:
spark.catalog.refreshTable("TableName")
Does not work for all clients, just for Spark
SQL REFRESH TABLE `TableName`;
Works for Qlik, but again, if I ask each client to refresh, it might mean extra work for Thrift and mistakes can happen (such as a dev forgetting to add the refresh).
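One way to automate this from the job side (instead of asking every client) would be to open a JDBC session against the Thrift Server right after the overwrite finishes and issue the refresh there, so the server's cached metadata is updated before any client connects. A minimal sketch in Java, assuming the Thrift Server listens at thrift-host:10000, the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath, and the credentials are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Run this right after df.write.mode("overwrite").saveAsTable("TableName") completes
try (Connection conn = DriverManager.getConnection(
        "jdbc:hive2://thrift-host:10000/default", "user", "password");
     Statement stmt = conn.createStatement()) {
    stmt.execute("REFRESH TABLE TableName");  // the same statement that already works when issued from Qlik
}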

Related

Connection vs Query in Excel data model

What is the difference between a 'connection' and a 'query' as Excel defines them in its data model? For example, if I load a CSV file from a local path, Excel stores that as a query (it shows a 'select *' as the query against the file). What, then, would be considered a connection, and what's the difference between the two? The only thing I can think of is that a connection would not return data without specifying the table/query to use: for example, a database connection with multiple tables, or possibly a connection to another Excel file that has more than one tab.
Reference: https://support.microsoft.com/en-us/office/create-edit-and-manage-connections-to-external-data-89d44137-f18d-49cf-953d-d22a2eea2d46
Every query is also a connection, but not all connections are queries.
Connections have been in Excel for a long time, typically associated with commands that access data sources like SQL Server, etc. The page you linked to has more links at the bottom; you may want to read up there.
The term "query" is now typically associated with Power Query, where the data connection to the data source is made via the Power Query engine, and then further refined in the query editor.
So, each query (Power Query) also has a connection, but you can only edit Power Queries in the Power Query editor, whereas legacy connections can be edited in the properties dialog of the connection.
Edit: Let's put it this way: The connection is just that. It connects your workbook to a data source. Like a highway connecting two cities.
A query is the request for the actual data that you spell out, calling from your workbook (via the connection) into the data source. The data source then sends the data back (via the connection). The mechanics of asking for, receiving, and manipulating the received data (e.g. cleaning it up, storing it in the workbook) are what the query does, but it can't do this without the connection. The query is the actual traffic on the highway.
Before Power Query, you could also connect to SQL Server and return data. The query details are visible in a tab in the connections dialog, so connection and query were used synonymously. These legacy data tools are now hidden by default and must be activated in the Excel Advanced options.
With Power Query, the brand name influences the use of the terminology. The term "query", more often than not now means Power Query, whereas some people may use "connection" (which is always a part of any query) for old style, legacy data connections (which also contain queries).
However, when you use Power Query, each of these queries will use connections. These are established when you first create the query. Your workbook may have a number of connections to different data sources. The credentials for each data source are stored with the connections (on your computer), not in the Power Query. This is like your toll fee for the highway. By storing the credentials with the connection, you establish the permission to use the connection and it doesn't matter how many people you bring back in your bus.
You can even use the same connection (to a specific SQL Server) for several different queries. When you create the first query to the SQL Server, you are prompted for credentials for that new connection (your toll for the highway). When you create another query to the same SQL Server, the connection already exists and you are not prompted for your credentials.
You can drive your bus along that same highway several times and pick up people from different suburbs of the city that the highway connects you to.
Your highway toll fee is only valid for a limited time. You can take as many trips as you want, but it will expire after some time. (This happens with SharePoint credentials after 90 days, after which you have to provide your credentials again. I don't know about SQL Server, though.)
When you send a workbook with a query to SQL Server to someone else, they need to provide their own credentials in order to use the connection. Your toll fee does not cover their bus.
I'm going to stop now before this turns into a children's book.
Hope it helps.
In addition, a Connection is a dynamic link and can be set to refresh:
in the background
when the file is opened
every X minutes
or
when the queries are refreshed.
However, a Query is a more static link and needs to be refreshed manually to load the latest data.

How to change a property key in Gremlin with JanusGraph as the graph engine and Cassandra as the storage backend?

I am using Cassandra 4.0 with JanusGraph 6.0. I have n nodes with the label "januslabel" and the property "janusproperty", and I want to change the property name to "myproperty". I have tried the answer from this link, Rename property with Gremlin in Azure Cosmos DB, but I was not able to make the change permanent. What I mean by permanent is that whenever I restart Cassandra or JanusGraph, I get the old property name "janusproperty" back.
How can I change this permanently?
When using JanusGraph, if no transaction is currently open, one will be automatically started once a Gremlin query is issued. Subsequent queries are also part of that transaction. Transactions need to be explicitly committed for any changes to be persisted using something like graph.tx().commit(). Transactions that are not committed will eventually time out and changes will be lost.
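A rename along the lines of the linked answer, followed by an explicit commit, could then look roughly like this (a sketch, assuming an embedded JanusGraph instance named graph and the label/property names from the question; the traversal calls are standard TinkerPop API):

// Copy each value to the new key, drop the old key, then commit so the change survives restarts
graph.traversal().V()
    .hasLabel("januslabel")
    .has("janusproperty")
    .forEachRemaining(v -> {
        v.property("myproperty", v.value("janusproperty"));  // write the new property
        v.property("janusproperty").remove();                // remove the old property
    });
graph.tx().commit();  // without this, the change is rolled back when the transaction times out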

Persistent storage in JanusGraph using Cassandra

I'm playing with JanusGraph and Cassandra backend but I have some doubts.
I have a Cassandra server running on my machine (using Docker) and in my API I have this code:
GraphTraversalSource g = JanusGraphFactory.build()
.set("storage.backend", "cql")
.set("storage.hostname", "localhost")
.open()
.traversal();
Then, through my API, I'm saving and fetching data using Gremlin. It works fine, and I see data saved in Cassandra database.
The problem comes when I restart my API and try to fetch data. Data is still stored in Cassandra but JanusGraph query returns empty. Why?
Do I need to load backend storage data into memory or something like that? I'm trying to understand how it works.
EDIT
This is how I add an item:
Vertex vertex = g.addV("User")
.property("username", username)
.property("email", email)
.next();
And to fetch all:
List<Vertex> all = g.V().toList();
Commit your Transactions
You are currently using JanusGraph embedded as a library in your application, which gives you access to the full API of JanusGraph. This means that you have to manage transactions on your own, which includes committing your transactions in order to persist your modifications to the graph.
You can simply do this by calling:
g.tx().commit();
after you have iterated your traversal with the modifications (the addV() traversal in your case).
Without the commit, the changes are only available locally in your transaction. When you restart your Docker container(s), all data will be lost as you haven't committed it.
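Putting that together with the code from the question, a sketch of the embedded usage (variable names taken from the question) would be:

GraphTraversalSource g = JanusGraphFactory.build()
    .set("storage.backend", "cql")
    .set("storage.hostname", "localhost")
    .open()
    .traversal();

g.addV("User").property("username", username).property("email", email).next();
g.tx().commit();  // persists the vertex to Cassandra so it is still there after a restart

List<Vertex> all = g.V().toList();  // returns the committed vertices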
The Recommended Approach: Connecting via Remote
If you don't have a good reason to embed JanusGraph as a library in your JVM application, then it's recommended to deploy it independently as JanusGraph Server to which you can send your traversals for execution.
This has the benefit that you can scale JanusGraph independently of your application and also that you can use it from non-JVM languages.
JanusGraph Server then also manages transactions for you transparently by executing each traversal in its own transaction. If the traversal succeeds, then the results are committed and they are also rolled back automatically if an exception occurs.
The JanusGraph docs contain a section about how to connect to JanusGraph Server from Java but the important part is this code to create a graph traversal source g connected to your JanusGraph Server(s):
Graph graph = EmptyGraph.instance();
GraphTraversalSource g = graph.traversal().withRemote("conf/remote-graph.properties");
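With such a remote g, the traversals from the question then work without any explicit commit, since the server commits each one on success (a sketch reusing the question's variables):

g.addV("User").property("username", username).property("email", email).next();  // committed by JanusGraph Server
List<Vertex> all = g.V().toList();  // sees the persisted data, also after restarting the application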
You can start JanusGraph Server of course also as a Docker container:
docker run --rm janusgraph/janusgraph:latest
More information about the JanusGraph Docker image and how it can be configured to connect to your Cassandra backend can be found here.
The part below is not directly relevant for this question any more given the comments to my first version of the answer. I am still leaving it here in case that others have a similar problem where this could actually be the cause.
Persistent Storage with Docker Containers
JanusGraph stores the data in your storage backend which is Cassandra in your case. That means that you have to ensure that Cassandra persists the data. If you start Cassandra in a Docker container, then you have to mount a volume where Cassandra stores the data to persist it beyond restarts of the container.
Otherwise, the data will be lost once you stop the Cassandra container.
To do this, you can start the Cassandra container for example like this:
docker run -v /my/own/datadir:/var/lib/cassandra -d cassandra
where /my/own/datadir is the directory of your host system where you want the Cassandra data to be stored.
This is explained in the docs of the official Cassandra Docker image under Caveats > Where to Store Data.

Cluster design for downloading/streaming a dataset to a user

In our system, we classically have two components: a Cloudera Hadoop cluster (CDH) and an OpenShift "backend" system. In HDFS, we have some huge .parquet files.
We now have a business requirement to "export the data by a user-given filter criterion" to a user in "real time" as a downloadable file. The flow is: the user enters an SQL-like filter string, for instance user='Theo' and command='execution'. He then sends a GET /export request to our backend service with the filter string as a parameter. The user shall now get a "download file" in his web browser and immediately start downloading that file as CSV (even if it's multiple terabytes or even petabytes in size; that's the user's choice if he wants to try it out and wait that long). In fact, the cluster should respond synchronously, but not cache the entire response on a single node before sending the result; it should only receive data at the "internet speed" of the user and stream it directly to the user (with a buffer of e.g. 10 or 100 MB).
I now face the problem on how to best approach this requirement. My considerations:
I wanted to use Spark for that. Spark would read the Parquet file, apply the filter easily and then "coalesce" the filtered result to the driver, which in turn streams the data back to the requesting backend/client. During this task, the driver should of course not run out of memory if the data is sent back to the backend/user too slowly, but should just have the executors deliver the data at the same speed as it is "consumed".
However, I face some problems here:
The standard use case is that the user has fine-grained filters so that his exported file contains only something like 1000 lines. If I submitted a new Spark job via spark-submit for each request, I would already incur latencies of multiple seconds due to initialization and query plan creation (even if it's just as simple as reading and filtering the data). I'd like to avoid that.
The cluster and the backend are strictly isolated. The operations guys ideally don't want us to reach the cluster from the backend at all; instead, the cluster should call the backend. We are able to "open" maybe one port, but we'll possibly not be able to argue for something like "our backend will run the Spark driver while being connected to the cluster as the execution backend".
Is it a "bad design smell" if we run a "server spark job", i.e. we submit an application with mode "client" to the cluster master which also opens a port for HTTP requests and only runs a spark pipeline on requests, but holds the spark context open all the time (and is reachable from our backend via a fixed URL)? I know that there is "spark-job-server" project which does this, but it still feels a bit weird due to the nature of Spark and Jobs, where "naturally" a job would be to download a file and not be a 24h running server waiting to execute some pipeline steps from time to time.
I have no idea how to limit Spark's result fetching so that the executors send data at a speed at which the driver won't run out of memory if the user requested petabytes. Any suggestions on this? (See the sketch after the last point below.)
Is Spark a good choice for this task after all, or do you have any suggestions for better tooling here? (Ideally in a CDH 5.14 environment, as we can't get the operations team to install any additional tool.)
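One building block for the memory concern mentioned above could be Dataset.toLocalIterator(), which pulls the result to the driver one partition at a time instead of collecting everything at once. A rough sketch (the path, the filter string, a long-lived SparkSession named spark, and the writeCsvLine/outputStream helpers are all assumptions for illustration):

// Stream the filtered rows to the HTTP response partition by partition,
// so the driver only needs to hold one partition in memory at a time.
Dataset<Row> filtered = spark.read()
    .parquet("hdfs:///data/events.parquet")
    .filter("user = 'Theo' AND command = 'execution'");  // user-supplied filter string

Iterator<Row> rows = filtered.toLocalIterator();
while (rows.hasNext()) {
    writeCsvLine(rows.next(), outputStream);  // hypothetical helper writing to the client stream
}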

Spark job to work in two different HDFS environments

I have a requirement: I need to write a Spark job that connects to Prod (source Hive, Server A), pulls the data into a local (temp) Hive server, does the transform, and loads it back into the target Prod (Server B).
In earlier cases, our target DB was Oracle, so we used to write it like below, which overwrites the table:
AAA.write.format("jdbc").option("url", "jdbc:oracle:thin:#//uuuuuuu:0000/gsahgjj.yyy.com").option("dbtable", "TeST.try_hty").option("user", "aaaaa").option("password", "dsfdss").option("Truncate","true").mode("Overwrite").save().
In terms of a Spark overwrite from Server A to B, what syntax do we need to use?
When I try to establish the connection through JDBC from one Hive (Server A) to Server B, it does not work. Please help.
You can connect to Hive by using JDBC if it's a remote one. Get your Hive Thrift Server URL and port details and connect via JDBC. It should work.
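As a rough illustration of that suggestion (the host, port, database and table names are placeholders, and the Hive JDBC driver org.apache.hive.jdbc.HiveDriver needs to be on the classpath):

// Read the table on Server A through its HiveServer2/Thrift endpoint via JDBC
Dataset<Row> source = spark.read()
    .format("jdbc")
    .option("url", "jdbc:hive2://server-a-host:10000/default")
    .option("driver", "org.apache.hive.jdbc.HiveDriver")
    .option("dbtable", "source_db.source_table")
    .load();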
