I am trying to run the spark-submit command below, reading a config file passed as a command-line argument. When I run the code locally it works fine, but when I run it on YARN it fails with the error below.
Spark submit:
time spark-submit --files /etc/hive/conf/hive-site.xml --master yarn --deploy-mode cluster --class IntegrateAD /home/username/s3ReadWrite-assembly-1.1.jar "day0" "/home/username/config.properties"
Error:
INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: User class threw exception: java.io.FileNotFoundException: /home/username/config.properties (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
Code I am running:
val propFile = new File(args(1))
val properties: Properties = new Properties()
if (propFile != null) {
  // val source = Source.fromFile(propFile)
  // properties.load(source.bufferedReader())
  properties.load(new FileInputStream(propFile))
  properties
} else {
  logger.error("properties file cannot be loaded at path " + propFile)
  throw new FileNotFoundException("Properties file cannot be loaded")
}
Please help me figure out what may be wrong here: my code, my spark-submit command, or something else?
Thanks in advance for your help.
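For context (this is not from the original question, just a hedged sketch): with --deploy-mode cluster the driver runs inside a YARN container, so a local path like /home/username/config.properties is resolved on that container's host, where the file usually does not exist. One common pattern is to ship the file with --files and read it by its bare name from the container's working directory; the file name below is only an example:
import java.io.{File, FileInputStream}
import java.util.Properties

// Assumption: the job was submitted with an extra
//   --files /home/username/config.properties
// so YARN localizes "config.properties" into the container's working directory.
val propFile = new File("config.properties") // bare name, resolved in the working dir
val properties = new Properties()
if (propFile.exists()) {
  properties.load(new FileInputStream(propFile))
}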
My end goal is to insert data from HDFS into Elasticsearch, but the issue I am facing is connectivity.
I am able to connect to my Elasticsearch node using the curl command below:
curl -u username -X GET 'https://xx.xxx.xx.xxx:9200/_cat/indices?v' --insecure
but when it comes to connecting from Spark, I am unable to do so. My command to insert the data is:
df.write.mode("append").format('org.elasticsearch.spark.sql') \
    .option("es.net.http.auth.user", "username") \
    .option("es.net.http.auth.pass", "password") \
    .option("es.index.auto.create", "true") \
    .option('es.nodes', 'https://xx.xxx.xx.xxx') \
    .option('es.port', '9200') \
    .save('my-index/my-doctype')
The error I am getting is:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
....
....
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings) - all nodes failed; tried [[xx.xxx.xx.xxx:9200]]
....
...
Here, what would be the PySpark equivalent of curl --insecure?
Thanks
After many attempts and different config options, I found a way to connect to Elasticsearch running on HTTPS insecurely:
dfToEs.write.mode("append").format('org.elasticsearch.spark.sql') \
.option("es.net.http.auth.user", username) \
.option("es.net.http.auth.pass", password) \
.option("es.net.ssl", "true") \
.option("es.net.ssl.cert.allow.self.signed", "true") \
.option("mergeSchema", "true") \
.option('es.index.auto.create', 'true') \
.option('es.nodes', 'https://{}'.format(es_ip)) \
.option('es.port', '9200') \
.option('es.batch.write.retry.wait', '100s') \
.save('{index}/_doc'.format(index=index))
The key setting is (es.net.ssl, true). Along with it, we also have to allow the self-signed certificate with (es.net.ssl.cert.allow.self.signed, true).
I checked a lot of things and finally I can write to the AWS Elasticsearch Service (ES), but with Scala/Spark.
In a VPC, create security groups to allow access from EMR to ES on port 443 (inbound rules in ES for the EMR security group, and inbound rules in EMR on the same port).
Check connectivity from the EMR master node with a telnet command:
telnet xyz.eu-west-1.es.amazonaws.com 443
Once the above checks out, check at the application level with a curl command:
curl 'https://xyz.eu-west-1.es.amazonaws.com:443/domainname/_search?pretty=true&q=*'
Afterwards, on to the code. In my case I tested with spark-shell, but the server confs were included at startup like this:
spark-shell --jars elasticsearch-spark-20_2.11-7.1.1.jar --conf spark.es.nodes="xyz.eu-west-1.es.amazonaws.com" --conf spark.es.port=443 --conf spark.es.nodes.wan.only=true --conf spark.es.nodes.discovery="false" --conf spark.es.index.auto.create="true" --conf spark.es.resource="domain/doc" --conf spark.es.scheme="https"
Finally the code to write:
import java.util.Date
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.lit
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql._
val dateformat = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
val currentdate = dateformat.format(new Date)
val colorsDF = spark.read.json("multilinecolors.json")
val mcolors = colorsDF.withColumn("Date",lit(currentdate))
mcolors.write.mode("append").format("org.elasticsearch.spark.sql")
  .option("es.net.http.auth.user", "")
  .option("es.net.http.auth.pass", "")
  .option("es.net.ssl", "true")
  .option("es.net.ssl.cert.allow.self.signed", "true")
  .option("mergeSchema", "true")
  .option("es.index.auto.create", "true")
  .option("es.nodes", "https://xyz.eu-west-1.es.amazonaws.com")
  .option("es.port", "443")
  .option("es.batch.write.retry.wait", "100")
  .save("domainname/_doc")
Can you try with the SparkConf settings below?
val sparkConf = new SparkConf()
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.es.index.auto.create", "true")
.set("spark.es.nodes", "yourESaddress")
.set("spark.es.port", "9200")
.set("spark.es.net.http.auth.user","")
.set("spark.es.net.http.auth.pass", "")
.set("spark.es.resource", indexName)
.set("spark.es.nodes.wan.only", "true")
If you still face the problem, then also set es.net.ssl = true and see.
If you still get the error, try adding the configs below (a hedged sketch of applying them follows the list):
'es.resource' = 'ctrl_rater_resumen_lla/hb',
'es.nodes' = 'localhost',
'es.port' = '9200',
'es.index.auto.create' = 'true',
'es.index.read.missing.as.empty' = 'true',
'es.nodes.discovery' = 'true',
'es.net.ssl' = 'false',
'es.nodes.client.only' = 'false',
'es.nodes.wan.only' = 'true',
'es.net.http.auth.user' = 'xxxxx',
'es.net.http.auth.pass' = 'xxxxx',
'es.nodes.discovery' = 'false'
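For reference, here is a hedged sketch (not from the original answer) of how option keys like these can be passed to the connector from Scala; df, the host, the credentials, and the index are placeholders:
// Placeholder values; adjust nodes, credentials and index to your cluster.
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost")
  .option("es.port", "9200")
  .option("es.index.auto.create", "true")
  .option("es.index.read.missing.as.empty", "true")
  .option("es.nodes.wan.only", "true")
  .option("es.nodes.discovery", "false")
  .option("es.net.ssl", "false")
  .option("es.net.http.auth.user", "xxxxx")
  .option("es.net.http.auth.pass", "xxxxx")
  .mode("append")
  .save("ctrl_rater_resumen_lla/hb")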
I am using the balance transfer sample.
I have enabled ORDERER_GENERAL_TLS_CLIENTAUTHREQUIRED=true in the orderer container.
While creating a new channel (mychannel), it was throwing the error Handshake failed with fatal error.
After the error, I configured the client with client.setTlsClientCertAndKey(cert, key):
let adminClient = JSON.parse(
  fs.readFileSync(path.join(__dirname, "../fabric-client-kv-org1/admin"))
);
logger.info(adminClient);
client.setTlsClientCertAndKey(
  adminClient.enrollment.identity.certificate,
  adminClient.enrollment.signingIdentity
);
I am importing the admin identity and then using its signingIdentity and certificate to set the TLS client.
Now it is throwing an Invalid private key error:
E0619 17:15:44.135000000 139448 ssl_transport_security.cc:671] Invalid private key.
E0619 17:15:44.136000000 139448 security_connector.cc:1087] Handshaker factory creation failed with TSI_INVALID_ARGUMENT.
E0619 17:15:44.137000000 139448 secure_channel_create.cc:121] Failed to create secure subchannel for secure name 'localhost:7050'
E0619 17:15:44.137000000 139448 secure_channel_create.cc:154] Failed to create subchannel arguments during subchannel creation.
2019-06-19T11:45:47.132Z - error: [Remote.js]: Error: Failed to connect before the deadline URL:grpcs://localhost:7050
2019-06-19T11:45:47.133Z - error: [Orderer.js]: Orderer grpcs://localhost:7050 has an error Error: Failed to connect before the deadline URL:grpcs://localhost:7050
What is the cause of the error, and am I using the correct client certificate and key? The docs are confusing:
https://fabric-sdk-node.github.io/tutorial-network-config.html
I figured out the reason for the invalid private key. The signing identity is not the private key.
After registering the user, I enroll it and save its private key and certificate locally:
// Enroll with the "tls" profile so the CA issues a TLS certificate and key
let req = {
  enrollmentID: "admin",
  enrollmentSecret: "adminpw",
  profile: "tls"
};
const enrollment = await caClient.enroll(req);
// Use the returned certificate and private key (not the signing identity)
client.setTlsClientCertAndKey(
  enrollment.certificate,
  enrollment.key.toBytes()
);
I am facing an issue with a Kerberos-enabled Hadoop cluster.
We are trying to run a streaming job on yarn-cluster that interacts with Kafka (direct stream) and HBase.
Somehow, we are not able to connect to HBase in cluster mode. We use a keytab to log in to HBase.
This is what we do:
spark-submit --master yarn-cluster --keytab "dev.keytab" --principal "dev@IO-INT.COM" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j_executor_conf.properties -XX:+UseG1GC" --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j_driver_conf.properties -XX:+UseG1GC" --conf spark.yarn.stagingDir=hdfs:///tmp/spark/ --files "job.properties,log4j_driver_conf.properties,log4j_executor_conf.properties" service-0.0.1-SNAPSHOT.jar job.properties
To connect to HBase:
def getHbaseConnection(properties: SerializedProperties): (Connection, UserGroupInformation) = {
  val config = HBaseConfiguration.create();
  config.set("hbase.zookeeper.quorum", HBASE_ZOOKEEPER_QUORUM_VALUE);
  config.set("hbase.zookeeper.property.clientPort", "2181");
  config.set("hadoop.security.authentication", "kerberos");
  config.set("hbase.security.authentication", "kerberos");
  config.set("hbase.cluster.distributed", "true");
  config.set("hbase.rpc.protection", "privacy");
  config.set("hbase.regionserver.kerberos.principal", "hbase/_HOST@IO-INT.COM");
  config.set("hbase.master.kerberos.principal", "hbase/_HOST@IO-INT.COM");
  UserGroupInformation.setConfiguration(config);
  var ugi: UserGroupInformation = null;
  if (SparkFiles.get(properties.keytab) != null
      && (new java.io.File(SparkFiles.get(properties.keytab)).exists)) {
    ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(properties.kerberosPrincipal,
      SparkFiles.get(properties.keytab));
  } else {
    ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(properties.kerberosPrincipal,
      properties.keytab);
  }
  val connection = ConnectionFactory.createConnection(config);
  return (connection, ugi);
}
and we connect to HBase like this:
….
foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    //var ugi: UserGroupInformation = Utils.getHbaseConnection(properties)._2
    rdd.foreachPartition { partition =>
      val connection = Utils.getHbaseConnection(propsObj)._1
      val table = …
      partition.foreach { json =>
      }
      table.put(puts)
      table.close()
      connection.close()
    }
  }
}
The keytab file is not getting copied to the YARN staging/temp directory, so we are not getting it in SparkFiles.get…, and if we pass the keytab with --files, spark-submit fails because it is already there in --keytab.
The error is:
This server is in the failed servers list: myserver.test.com/120.111.25.45:60020
RpcRetryingCaller{globalStartTime=1497943263013, pause=100, retries=5}, org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: myserver.test.com/120.111.25.45:60020
RpcRetryingCaller{globalStartTime=1497943263013, pause=100, retries=5}, org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: myserver.test.com/120.111.25.45:60020 at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:147)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:935)
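One workaround sometimes used for this --keytab/--files clash (not part of the original post, shown only as a hedged sketch) is to ship a second copy of the keytab under a different file name with --files, so it can be resolved on the executors via SparkFiles.get; the file name below is hypothetical:
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.SparkFiles

// Assumption: the job was submitted with both
//   --keytab dev.keytab  and  --files dev_copy.keytab,job.properties,...
// so "dev_copy.keytab" is localized into each container's working directory.
def loginFromShippedKeytab(principal: String): UserGroupInformation = {
  val keytabPath = SparkFiles.get("dev_copy.keytab") // hypothetical file name
  UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytabPath)
}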
I have set up the DataStax Cassandra service and created a keyspace, and my DB is running fine.
Below is the output from the nodetool status command:
C:\Users\xxx>cd C:\Program Files\DataStax Community\apache-cassandra\bin
C:\Program Files\DataStax Community\apache-cassandra\bin>nodetool status
Datacenter: datacenter1
========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 229 KB 256 100.0% d5229669-f8f2-4b06-a887-4ab91a883a74 rack1
Also, the data has been created in the keyspace:
cqlsh:axiaglobal> use axiaglobal;
cqlsh:axiaglobal> describe tables;
greetings
cqlsh:axiaglobal> select * from greetings;
user | id | creation_date | greet
------+----+---------------+-------
Now, when I try to connect to Cassandra via Java, I get the following exception:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:9042 (null))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:196)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:79)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1104)
at com.datastax.driver.core.Cluster.init(Cluster.java:121)
at com.datastax.driver.core.Cluster.connect(Cluster.java:198)
at com.datastax.driver.core.Cluster.connect(Cluster.java:226)
at com.axia.global.dao.cassandra.service.CassandraApp.main(CassandraApp.java:29)
The piece of code which makes the call to Cassandra is listed below:
package com.axia.global.dao.cassandra.service;
import java.net.InetAddress;
import java.net.UnknownHostException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.data.cassandra.core.CassandraOperations;
import org.springframework.data.cassandra.core.CassandraTemplate;
import com.axia.global.model.cassandra.Person;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.querybuilder.QueryBuilder;
import com.datastax.driver.core.querybuilder.Select;
public class CassandraApp {
    private static final Logger LOG = LoggerFactory.getLogger(CassandraApp.class);
    private static Cluster cluster;
    private static Session session;

    public static void main(String[] args) {
        try {
            cluster = Cluster.builder().addContactPoints("127.0.0.1").build();
            session = cluster.connect("axiaglobal");
            CassandraOperations cassandraOps = new CassandraTemplate(session);
            Select s = QueryBuilder.select().from("greetings");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I am unable to understand where I am going wrong and why my connection to Cassandra is failing.
Can someone help me out with it?
I have tried setting the following in cassandra.yaml:
rpc_address: 0.0.0.0 and broadcast_rpc_address: 1.2.3.4, and even that did not work.
Did you check that you have the correct Java driver for your Cassandra version?
What version of the driver and of Cassandra are you using?
Check here
https://docs.datastax.com/en/developer/driver-matrix/doc/javaDrivers.html
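Not from the original comment, but as a hedged illustration (assuming the same DataStax Java driver API used in the question), the driver version actually on the classpath can be printed and compared against that matrix:
import com.datastax.driver.core.Cluster

// Prints the version of the Java driver on the classpath,
// to compare against the compatibility matrix linked above.
object DriverVersionCheck {
  def main(args: Array[String]): Unit = {
    println("DataStax Java driver version: " + Cluster.getDriverVersion)
  }
}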