We are running a map reduce/spark job to bulk load hbase data in One of the environment - apache-spark

We are running a map reduce/spark job to bulk load hbase data in one of the environments.
While running it, connection to the hbase zookeeper cannot initialize throwing the following error.
16/05/10 06:36:10 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=c321shu.int.westgroup.com:2181,c149jub.int.westgroup.com:2181,c167rvm.int.westgroup.com:2181 sessionTimeout=90000 watcher=hconnection-0x74b47a30, quorum=c321shu.int.westgroup.com:2181,c149jub.int.westgroup.com:2181,c167rvm.int.westgroup.com:2181, baseZNode=/hbase
16/05/10 06:36:10 INFO zookeeper.ClientCnxn: Opening socket connection to server c321shu.int.westgroup.com/10.204.152.28:2181. Will not attempt to authenticate using SASL (unknown error)
16/05/10 06:36:10 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.204.24.16:35740, server: c321shu.int.westgroup.com/10.204.152.28:2181
16/05/10 06:36:10 INFO zookeeper.ClientCnxn: Session establishment complete on server c321shu.int.westgroup.com/10.204.152.28:2181, sessionid = 0x5534bebb441bd3f, negotiated timeout = 60000
16/05/10 06:36:11 INFO mapreduce.HFileOutputFormat2: Looking up current regions for table ecpdevv1patents:NormNovusDemo
Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=35, exceptions:
Tue May 10 06:36:11 CDT 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller#3927df20, java.io.IOException: Call to c873gpv.int.westgroup.com/10.204.67.9:60020 failed on local exception: java.io.EOFException
We have executed the same job in Titan DEV too but facing the same problem. Please let us know if anyone has faced the same problem before.
Details are,
• Earlier job was failing to connect to localhost/127.0.0.1:2181. Hence only the property hbase.zookeeper.quorum has been set in map reduce code with c149jub.int.westgroup.com,c321shu.int.westgroup.com,c167rvm.int.westgroup.com which we got from hbase-site.xml.
• We are using jars of cdh version 5.3.3.

Related

Why does Cassandra Kerberos Connection to second keyspace fail?

We are trying to connect to two keyspaces of Cassandra (3.x) in the same application with the same Kerberos credentials. The application is able to connect to one keyspace but no the other. Access to the keyspaces has been verified.
Error on connection:
2022-08-22 13:15:10,972 [cluster-reconnection-0] DEBUG c.d.d.c.ControlConnection [--]- [Control connection] error on 169.24.167.109:9042 connection, trying next host
javax.security.auth.login.LoginException: No LoginModules configured for CassandraJavaClient
at javax.security.auth.login.LoginContext.init(LoginContext.java:264)
at javax.security.auth.login.LoginContext.<init>(LoginContext.java:417)
The ticket cache is :
CassandraJavaClient {
com.sun.security.auth.module.Krb5LoginModule required useTicketCache=true ticketCache="/var//krb5cc_userlogin";
};
The same ticket cache file is used by the first connection - which succeeds. While the second connection fails. I am not even sure as to how to debug it (tried remote debugging and since the initial control connection is an Async call, unable to get to the actual error).
We are using com.datastax.cassandra:cassandra-driver-core:jar:3.6.0
Any ideas/help to debug / resolve this will be highly appreciated

Elasticsearch spark connection for structured-streaming

I'm trying to make a connection to elasticsearch from my spark program.
My elasticsearch host is https and found no connection property for that.
We are using spark structred streaming Java API and the connection details are as follows,
SparkSession spark = SparkSession.builder()
.config(ConfigurationOptions.ES_NET_HTTP_AUTH_USER, "username")
.config(ConfigurationOptions.ES_NET_HTTP_AUTH_PASS, "password")
.config(ConfigurationOptions.ES_NODES, "my_host_url")
.config(ConfigurationOptions.ES_PORT, "9200")
.config(ConfigurationOptions.ES_NET_SSL_TRUST_STORE_LOCATION,"C:\\certs\\elastic\\truststore.jks")
.config(ConfigurationOptions.ES_NET_SSL_TRUST_STORE_PASS,"my_password") .config(ConfigurationOptions.ES_NET_SSL_KEYSTORE_TYPE,"jks")
.master("local[2]")
.appName("spark_elastic").getOrCreate();
spark.conf().set("spark.sql.shuffle.partitions",2);
spark.conf().set("spark.default.parallelism",2);
And I'm getting the following error
19/07/01 12:26:00 INFO HttpMethodDirector: I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server 10.xx.xxx.xxx failed to respond
19/07/01 12:26:00 INFO HttpMethodDirector: Retrying request
19/07/01 12:26:00 ERROR NetworkClient: Node [10.xx.xxx.xxx:9200] failed (The server 10.xx.xxx.xxx failed to respond); no other nodes left - aborting...
19/07/01 12:26:00 ERROR StpMain: Error
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:344)
Probably it's because it tries to initiate connection by http protocol but in my case I need https connection and not sure how to configure that
The error happened as spark was not able to locate the truststore file. It seems we need to add "file:\\" for the path to be accepted.

zookeeper connection timing out, kafa-spark streaming

I'm trying some exercise with spark streaming with kafka. If I use kafka producer and consumer in command line, I can publish and consume the messages in kafka. When I try to do it using pyspark in jupyter notebook. I am getting zookeeper connection timeout error.
Client session timed out, have not heard from server in 6004ms for sessionid 0x0, closing socket connection and attempting reconnect
[2017-08-04 15:49:37,494] INFO Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient#158da8e (org.apache.zookeeper.ZooKeeper)
[2017-08-04 15:49:37,524] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
[2017-08-04 15:49:37,527] INFO Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2017-08-04 15:49:37,533] WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
[2017-08-04 15:49:38,637] INFO Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2017-08-04 15:49:38,639] WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
`
Zookeeper has issues when using localhost (127.0.0.1). Described in https://issues.apache.org/jira/browse/ZOOKEEPER-1661?focusedCommentId=13599352
This little program explains the following things:
ZooKeeper does call InetAddress.getAllByName (see StaticHostProvider:60) on the connect string "localhost:2181" => as a result it gets 3 different addresses for localhost which then get shuffled (Collections.shuffle(this.serverAddresses): L72
Because of the shuffling (random), the call to StaticHostProvider.next will sometime return the fe80:0:0:0:0:0:0:1%1 address which as you can see from this small program times out after 5s => this explains the randomness I am experiencing.
It really seems to me that what I am experiencing is a reverse dns lookup issue with IPv6. Whether this reverse dns lookup is actually useful and required by ZooKeeper, I do not know. It did not behave this way in 3.3.3.
Solution, specify your zookeeper address as a FQDN and make sure the reverse lookup works or use 0.0.0.0 instead of localhost.

Deploying cassandra-mesos framework with marathon

I'm using the latest jar files of cassandra-mesos framework (by using this jason file: https://teamcity.mesosphere.io/repository/download/Oss_Mesos_Cassandra_CassandraFramework/97399:id/marathon.json), but getting the following errors:
I0310 13:19:34.699774 16389 sched.cpp:264] No credentials provided.
Attempting to register without authentication I0310 13:19:34.701026
16389 sched.cpp:819] Got error 'Completed framework attempted to
re-register' I0310 13:19:34.701038 16389 sched.cpp:1625] Asked to
abort the driver I0310 13:19:34.701364 16389 sched.cpp:861] Aborting
framework '20160309-183453-2497969674-5050-19271-0001' I0310
13:19:34.719744 16373 sched.cpp:1591] Asked to stop the driver I0310
13:19:34.719784 16389 sched.cpp:835] Stopping framework
'20160309-183453-2497969674-5050-19271-0001'
Any idea?
The error says Completed framework attempted to re-register which means the framework keeps its state somewhere (probably in Zookeeper, but cannot access your URL with marathon.json to verify), and thus tries to start with the framework ID stored in this state. However, that framework ID is already deregistered, and Mesos does not allow you to start the framework with the same ID again.
The solution to this would be either to pick a different znode for framework storage or remove the existing znode before starting the framework.
Thanks a lot :-). It's working now. but when I tried to check zookeeper for cassandra-mesos, I got the following error: mesos-resolve zk://mesos-master-2:2181/cassandra-mesos/cassandra-mesos-fw
2016-03-13 12:46:22,428:26613(0x7fa4fa843700):ZOO_INFO#log_env#712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2016-03-13 12:46:22,428:26613(0x7fa4fa843700):ZOO_INFO#log_env#716: Client environment:host.name=mesos-slave-1
2016-03-13 12:46:22,428:26613(0x7fa4fa843700):ZOO_INFO#log_env#723: Client environment:os.name=Linux
2016-03-13 12:46:22,428:26613(0x7fa4fa843700):ZOO_INFO#log_env#724: Client environment:os.arch=3.10.0-327.4.4.el7.x86_64
2016-03-13 12:46:22,428:26613(0x7fa4fa843700):ZOO_INFO#log_env#725: Client environment:os.version=#1 SMP Tue Jan 5 16:07:00 UTC 2016
2016-03-13 12:46:22,428:26613(0x7fa4fa843700):ZOO_INFO#log_env#733: Client environment:user.name=root
2016-03-13 12:46:22,428:26613(0x7fa4fa843700):ZOO_INFO#log_env#741: Client environment:user.home=/root
2016-03-13 12:46:22,428:26613(0x7fa4fa843700):ZOO_INFO#log_env#753: Client environment:user.dir=/ephemeral/cassandra-mesos
2016-03-13 12:46:22,428:26613(0x7fa4fa843700):ZOO_INFO#zookeeper_init#786: Initiating client connection, host=mesos-master-2:2181 sessionTimeout=10000 watcher=0x7fa5023200b0 sessionId=0 sessionPasswd= context=0x7fa4d8001ec0 flags=0
2016-03-13 12:46:22,429:26613(0x7fa4f6628700):ZOO_INFO#check_events#1703: initiated connection to server [10.254.227.148:2181]
2016-03-13 12:46:22,434:26613(0x7fa4f6628700):ZOO_INFO#check_events#1750: session establishment complete on server [10.254.227.148:2181], sessionId=0x25364fb9f3a0020, negotiated timeout=10000
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0313 12:46:22.434587 26616 group.cpp:313] Group process (group(1)#10.254.235.46:56890) connected to ZooKeeper
I0313 12:46:22.434659 26616 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0313 12:46:22.434670 26616 group.cpp:385] Trying to create path '/cassandra-mesos/cassandra-mesos-fw' in ZooKeeper
Failed to detect master from 'zk://mesos-master-2:2181/cassandra-mesos/cassandra-mesos-fw' within 5secs

Cassandra and defuncting connection

I've got a question about Cassandra. I haven't found any "understable answer" yet...
I made a cluster build on 3 nodes (RackInferringSnitch) on differents VM. I'm using Datastax's Java Driver to read and update my keyspace (with CSVs).
When one node is down (ie : 10.10.6.172), I've got this debug warning:
INFO 00:47:37,195 New Cassandra host /10.10.6.172:9042 added
INFO 00:47:37,246 New Cassandra host /10.10.6.122:9042 added
DEBUG 00:47:37,264 [Control connection] Refreshing schema
DEBUG 00:47:37,384 [Control connection] Successfully connected to /10.10.6.171:9042
DEBUG 00:47:37,391 Adding /10.10.6.172:9042 to list of queried hosts
DEBUG 00:47:37,395 Defuncting connection to /10.10.6.172:9042
com.datastax.driver.core.TransportException: [/10.10.6.172:9042] Channel has been closed
at com.datastax.driver.core.Connection$Dispatcher.channelClosed(Connection.java:621)
at
[...]
[...]
DEBUG 00:47:37,400 [/10.10.6.172:9042-1] Error connecting to /10.10.6.172:9042 (Connection refused: /10.10.6.172:9042)
DEBUG 00:47:37,407 Error creating pool to /10.10.6.172:9042 ([/10.10.6.172:9042] Cannot connect)
DEBUG 00:47:37,408 /10.10.6.172:9042 is down, scheduling connection retries
DEBUG 00:47:37,409 First reconnection scheduled in 1000ms
DEBUG 00:47:37,410 Adding /10.10.6.122:9042 to list of queried hosts
DEBUG 00:47:37,423 Adding /10.10.6.171:9042 to list of queried hosts
DEBUG 00:47:37,427 Adding /10.10.6.122:9042 to list of queried hosts
DEBUG 00:47:37,435 Shutting down pool
DEBUG 00:47:37,439 Adding /10.10.6.171:9042 to list of queried hosts
DEBUG 00:47:37,443 Shutting down pool
DEBUG 00:47:37,459 Connected to cluster: WormHole
I wanted to know if I need to handle this exception or it will be handled by itself (I mean, when the node will be back again cassandra will do the correct write if the batch was a write...)
EDIT : Current consistency level is ONE.
The DataStax driver keeps track of which nodes are available at all times and routes queries (load balacing) based on this information. The way it does this is based on your reconnection policy.
You will see debug level messages when nodes are detected as down, etc. This is no cause for concern as the driver will re-route to other available nodes, it will also re-try the nodes periodically to find out if they are back up. If you had a problem and the data was not getting saved to Cassandra you would see timeout errors. No action necessary in this case.

Categories

Resources