Spark job to work in two different HDFS environments - apache-spark

I have a requirement to write a Spark job that connects to Prod (source Hive, Server A), pulls the data into a local/temp Hive server, does the transformations, and loads the result back into the target Prod (Server B).
In earlier cases our target DB was Oracle, so we used something like the line below, which overwrites the table:
AAA.write.format("jdbc").option("url", "jdbc:oracle:thin:#//uuuuuuu:0000/gsahgjj.yyy.com").option("dbtable", "TeST.try_hty").option("user", "aaaaa").option("password", "dsfdss").option("truncate", "true").mode("overwrite").save()
In terms of a Spark overwrite from Server A to Server B, what syntax should we use?
When I try to establish the connection through JDBC from one Hive (Server A) to Server B, it is not working. Please help.

You can connect to Hive using JDBC if it's a remote one. Get your Hive Thrift server (HiveServer2) URL and port details and connect via JDBC; it should work.
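A minimal PySpark sketch of that approach, assuming HiveServer2 on Server B is reachable at serverB:10000 and the Hive JDBC driver jar is on the Spark classpath; the host, database, table names and credentials below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read from the local/temp Hive that this Spark session points at
df = spark.table("temp_db.try_hty")

# ... transformations ...

# Write the result to Server B over the HiveServer2 JDBC endpoint
(df.write
   .format("jdbc")
   .option("url", "jdbc:hive2://serverB:10000/target_db")
   .option("driver", "org.apache.hive.jdbc.HiveDriver")
   .option("dbtable", "try_hty")
   .option("user", "aaaaa")
   .option("password", "dsfdss")
   .mode("overwrite")
   .save())

Note that writing through the Hive JDBC driver can be slow or limited for bulk loads, so many setups instead point the job directly at Server B's metastore/HDFS and use saveAsTable; the JDBC route above just mirrors the Oracle pattern from the question.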

Related

ConfiguredGraphFactory.open() on JanusGraph returned Cassandra DriverTimeoutException

I am new to JanusGraph. We have a JanusGraph setup with Cassandra as the backend.
We are using ConfiguredGraphFactory to dynamically create graphs at runtime, but when trying to open a created graph with ConfiguredGraphFactory.open("graphName") we get the error below:
com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2S
at com.datastax.oss.driver.api.core.DriverTimeoutException.copy(DriverTimeoutException.java:34)
at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54)
We are using a single Cassandra node, not a cluster. If we don't use ConfiguredGraphFactory we are able to connect to Cassandra, so it is not a network or wrong-port issue.
Any help would be appreciated.
JanusGraph uses the Java driver to connect to Cassandra. This error comes from the driver and indicates that the nodes didn't respond:
com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2S
The DriverTimeoutException is different from a read or write timeout. It is thrown when a request from the driver times out because it did not get a response from the Cassandra nodes within 2 seconds (PT2S).
You'll need to check if there's a network route between the JanusGraph server and the Cassandra nodes. One thing to check for is that a firewall is not blocking access to the CQL client port on the C* nodes (default is 9042). Cheers!
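A quick way to confirm basic network reachability from the JanusGraph server is a plain TCP check against the CQL port (9042 by default); the hostname below is a placeholder:

import socket

# Run this from the JanusGraph host; it only verifies that the port is open,
# not that Cassandra itself is healthy.
try:
    with socket.create_connection(("cassandraHost", 9042), timeout=5):
        print("CQL port 9042 is reachable")
except OSError as exc:
    print("Cannot reach CQL port 9042:", exc)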

Is there a way to select the cassandra keyspace to use in a gremlin query?

Normally, the janusgraph.properties file specifies the storage backend parameters that the JanusGraph instance points to:
# For a Cassandra backend
storage.backend=cql
storage.hostname=cassandraHost
storage.cql.keyspace=myKeyspace
# ... port, password, username, and so on
Once the JanusGraph instance is created, any Gremlin query sent to JanusGraph will create/read the graph data in that specified keyspace, "myKeyspace".
Since I need to use a JanusGraph instance and a Cassandra instance that are already running (I cannot change the keyspace), but I need the queries to return the graph contained in another keyspace called "secondKeySpace", my question is:
Is there a way to select a different Cassandra keyspace to which to point the Janusgraph gremlin queries within the gremlin query itself?
Instead of doing
g.V().has(label, 'service').has('serviceId','%s').out().has(label,'user')
Can I do something like the following?
g.keySpace('secondKeySpace').V().has(label, 'service').has('serviceId','%s').out().has(label,'user')
Thanks in advance for any help; I'm new to JanusGraph and I don't know if this is even possible.

How to query Cassandra from a certain node and get only the data that node owns?

Cassandra uses consistent hashing to manage data, and after we use a Cassandra driver to connect to the cluster, the node we connect to may query other nodes in the cluster to get the result. For my current situation, I'm doing some testing for my algorithm: I want to give a certain token range and query the data in that token range on a certain node, and if some data in the token range isn't on this node, I don't want the node to query other nodes for the result. Is this possible, and how can I achieve it?
I found "Cassandra Python driver: force using a single node", but that solution only makes the client's connection pool connect to a certain node; the node will still query other nodes.
Use the WhiteListRoundRobinPolicy and CL.ONE, as linked in the other question.
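A minimal sketch of that with the Python driver, whitelisting a single node and querying at consistency ONE; the contact point, keyspace, table and token range are placeholders, and note that this pins the coordinator but does not by itself limit results to data that node owns:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import WhiteListRoundRobinPolicy
from cassandra.query import SimpleStatement

# Only ever route requests to this one node
profile = ExecutionProfile(
    load_balancing_policy=WhiteListRoundRobinPolicy(["10.0.0.1"]))
cluster = Cluster(["10.0.0.1"],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect("my_keyspace")

# CL.ONE lets a single replica answer; restrict the scan to a token range
stmt = SimpleStatement(
    "SELECT * FROM my_table WHERE token(pk) > %s AND token(pk) <= %s",
    consistency_level=ConsistencyLevel.ONE)
rows = session.execute(stmt, (-9223372036854775808, 0))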
You can also extend the Statement to include a host, and use a custom load balancing policy to send the request to the host in the wrapper. Extend a policy and override make_query_plan, something like the following (untested scratch, treat it as pseudo-code):
from cassandra.policies import DCAwareRoundRobinPolicy

class StatementSingleHostRouting(DCAwareRoundRobinPolicy):
    def make_query_plan(self, working_keyspace=None, query=None):
        # Route to the statement's pinned host when one is set
        if query is not None and getattr(query, "host", None):
            return [query.host]
        return DCAwareRoundRobinPolicy.make_query_plan(self, working_keyspace, query)
If that host doesn't own the data it will still query other replicas though.
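The query.host attribute used above is not part of the driver's Statement API; a hypothetical wrapper along these lines (names are illustrative only) would carry the target host for the policy to read:

from cassandra.query import SimpleStatement

class HostAwareStatement(SimpleStatement):
    # Hypothetical wrapper: 'host' is expected to be a Host object, e.g. one of
    # cluster.metadata.all_hosts(), which StatementSingleHostRouting returns.
    def __init__(self, query_string, host=None, **kwargs):
        super().__init__(query_string, **kwargs)
        self.host = host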

Apache Cassandra 3.10 : CQL query to check remote application connections in Cassandra DB

I want to know if there is a cqlsh query to check remote application connections in a Cassandra DB, just like V$SESSION in Oracle or SHOW PROCESSLIST in MySQL.
I don't think there is a cqlsh query to do that, but you can use the Cassandra Java driver to monitor the connection pool manually. This link: http://docs.datastax.com/en/developer/java-driver/3.0/manual/pooling/#monitoring-and-tuning-the-pool gives a simple example that prints the number of open connections, active requests, and maximum capacity for each host every 5 seconds.

Spark Thrift Server force metadata refresh

I'm using Spark to create a table in the Hive metastore, then connecting from MSSQL through the Spark Thrift Server to query that table.
The table is created with:
df.write.mode("overwrite").saveAsTable("TableName")
The problem is that every time after I overwrite the table (it's a daily job), I get an error when I connect with MSSQL. If I restart the Thrift Server it works fine, but I want to automate this, and restarting the server every time seems a bit extreme.
The most likely culprit is the Thrift Server's cached metadata, which is no longer valid after the table overwrite. How can I force Thrift to refresh the metadata after I overwrite the table, before it's accessed by any of the clients?
I could settle for a solution for MSSQL, but there are other "clients" of the table, not just MSSQL. If I can force the metadata refresh from Spark (or a Linux terminal) after I finish the overwrite, rather than ask each client to run a refresh command before it requests the data, I would prefer that.
Note:
spark.catalog.refreshTable("TableName")
does not work for all clients, just for Spark.
The SQL statement
REFRESH TABLE `TableName`;
works for Qlik, but again, if I ask each client to refresh, it might mean extra work for the Thrift Server, and mistakes can happen (such as a dev forgetting to add the refresh).
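One option, assuming the stale state lives in the Thrift Server's cached catalog and that a refresh issued through the Thrift Server itself clears it (this may depend on settings such as spark.sql.hive.thriftServer.singleSession), is to run REFRESH TABLE over the Thrift/JDBC interface at the end of the daily overwrite job, so no client has to do it. A sketch using the PyHive client; host, port and table name are placeholders:

from pyhive import hive

# Connect to the Spark Thrift Server (it speaks the HiveServer2 protocol)
conn = hive.connect(host="thrift-server-host", port=10000)
cursor = conn.cursor()

# Ask the Thrift Server to drop its cached metadata for the freshly overwritten table
cursor.execute("REFRESH TABLE TableName")

cursor.close()
conn.close()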
