Elasticsearch pyspark connection in insecure mode - apache-spark

My end goal is to insert data from hdfs to elasticsearch but the issue i am facing is the connectivity
I am able to connect to my elasticsearch node using below curl command
curl -u username -X GET https://xx.xxx.xx.xxx:9200/_cat/indices?v' --insecure
but when it comes to connection with spark I am unable to do so. My command to insert data is
df.write.mode("append").format('org.elasticsearch.spark.sql').option("es.net.http.auth.user", "username").option("es.net.http.auth.pass", "password").option("es.index.auto.create","true").option('es.nodes', 'https://xx.xxx.xx.xxx').option('es.port','9200').save('my-index/my-doctype')
Error i am getting is
org.elastisearch.hadoop.EsHadoopIllegalArgumentException:Cannot detect ES version - typical this happens if then network/Elasticsearch cluster is not accessible or when targetting a Wan/Cloud instance without the proper setting 'es.nodes.wan.only'
....
....
Caused by: org.elasticseach.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proy settings)- all nodes failed; tried [[xx.xxx.xx.xxx:9200]]
....
...
Here, What would be the pyspark equivalent of curl --insecure
Thanks

After many attempt and different config options. I found a way how to connect elastisearch running on https insecurely
dfToEs.write.mode("append").format('org.elasticsearch.spark.sql') \
.option("es.net.http.auth.user", username) \
.option("es.net.http.auth.pass", password) \
.option("es.net.ssl", "true") \
.option("es.net.ssl.cert.allow.self.signed", "true") \
.option("mergeSchema", "true") \
.option('es.index.auto.create', 'true') \
.option('es.nodes', 'https://{}'.format(es_ip)) \
.option('es.port', '9200') \
.option('es.batch.write.retry.wait', '100s') \
.save('{index}/_doc'.format(index=index))
with the
(es.net.ssl, true)
We also have to provide self signed certificate like below
(es.net.ssl.cert.allow.self.signed, true)

I did check a lot of things and finally i can write in AWS ElasticSearch service (ES), but with scala/spark.
In a VPC, create security groups to access from EMR to ES with port 443 (inbound rules in ES to SG of EMR and inbound rules in EMR to same port)
Check connectivity from EMR master node, with a telnet command
telnet xyz.eu-west-1.es.amazonaws.com 443
Once check above, check app level with curl command
curl https://xyz.eu-west-1.es.amazonaws.com:443/domainname/_search?pretty=true&?q=*```
After, goes to the code, in my case i did test with spark-shell, but server confs was included in start like this:
spark-shell --jars elasticsearch-spark-20_2.11-7.1.1.jar --conf spark.es.nodes="xyz.eu-west-1.es.amazonaws.com" --conf spark.es.port=443 --conf spark.es.nodes.wan.only=true --conf spark.es.nodes.discovery="false" --conf spark.es.index.auto.create="true" --conf spark.es.resource="domain/doc" --conf spark.es.scheme="https"
Finally the code to write:
import java.util.Date
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql._
val dateformat = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
val currentdate = dateformat.format(new Date)
val colorsDF = spark.read.json("multilinecolors.json")
val mcolors = colorsDF.withColumn("Date",lit(currentdate))
mcolors.write.mode("append").format("org.elasticsearch.spark.sql").option("es.net.http.auth.user", "").option("es.net.http.auth.pass", "").option("es.net.ssl", "true").option("es.net.ssl.cert.allow.self.signed", "true").option("mergeSchema", "true").option("es.index.auto.create", "true").option("es.nodes","https://xyz.eu-west-1.es.amazonaws.com").option("es.port", "443").option("es.batch.write.retry.wait", "100").save("domainname/_doc")```

can you try with the below sparkConfs,
val sparkConf = new SparkConf()
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.es.index.auto.create", "true")
.set("spark.es.nodes", "yourESaddress")
.set("spark.es.port", "9200")
.set("spark.es.net.http.auth.user","")
.set("spark.es.net.http.auth.pass", "")
.set("spark.es.resource", indexName)
.set("spark.es.nodes.wan.only", "true")
still you face the problem then, es.net.ssl = true and see.
If still you get the error try adding the below configs,
'es.resource' = 'ctrl_rater_resumen_lla/hb',
'es.nodes' = 'localhost',
'es.port' = '9200',
'es.index.auto.create' = 'true',
'es.index.read.missing.as.empty' = 'true',
'es.nodes.discovery'='true',
'es.net.ssl'='false'
'es.nodes.client.only'='false',
'es.nodes.wan.only' = 'true'
'es.net.http.auth.user'='xxxxx',
'es.net.http.auth.pass' = 'xxxxx'
'es.nodes.discovery' = 'false'

Related

PySpark write data to Ceph returns 400 Bad Request

I have a problem with pySpark configuration when writing data inside a ceph bucket.
With the following Python code snippet I can read data from the Ceph bucket but when I try to write inside the bucket, I get the following error:
22/07/22 10:00:58 DEBUG S3ErrorResponseHandler: Failed in parsing the error response :
org.apache.hadoop.shaded.com.ctc.wstx.exc.WstxEOFException: Unexpected EOF in prolog
at [row,col {unknown-source}]: [1,0]
at org.apache.hadoop.shaded.com.ctc.wstx.sr.StreamScanner.throwUnexpectedEOF(StreamScanner.java:701)
at org.apache.hadoop.shaded.com.ctc.wstx.sr.BasicStreamReader.handleEOF(BasicStreamReader.java:2217)
at org.apache.hadoop.shaded.com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:2123)
at org.apache.hadoop.shaded.com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1179)
at com.amazonaws.services.s3.internal.S3ErrorResponseHandler.createException(S3ErrorResponseHandler.java:122)
at com.amazonaws.services.s3.internal.S3ErrorResponseHandler.handle(S3ErrorResponseHandler.java:71)
at com.amazonaws.services.s3.internal.S3ErrorResponseHandler.handle(S3ErrorResponseHandler.java:52)
[...]
22/07/22 10:00:58 DEBUG request: Received error response: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: null; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null
22/07/22 10:00:58 DEBUG AwsChunkedEncodingInputStream: AwsChunkedEncodingInputStream reset (will reset the wrapped stream because it is mark-supported).
Pyspark code (not working):
from pyspark.sql import SparkSession
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages com.amazonaws:aws-java-sdk-bundle:1.12.264,org.apache.spark:spark-sql-kafka-0-10_2.13:3.3.0,org.apache.hadoop:hadoop-aws:3.3.3 pyspark-shell"
spark = (
SparkSession.builder.appName("app") \
.config("spark.hadoop.fs.s3a.access.key", access_key) \
.config("spark.hadoop.fs.s3a.secret.key", secret_key) \
.config("spark.hadoop.fs.s3a.connection.timeout", "10000") \
.config("spark.hadoop.fs.s3a.endpoint", "http://HOST_NAME:88") \
.config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
.config("spark.hadoop.fs.s3a.path.style.access", "true") \
.config("spark.hadoop.fs.s3a.endpoint.region", "default") \
.getOrCreate()
)
spark.sparkContext.setLogLevel("TRACE")
# This works
spark.read.csv("s3a://test-data/data.csv")
# This throws the provided error
df_to_write = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
df_to_write.write.csv("s3a://test-data/with_love.csv")
Also, referring to the same ceph bucket, I am able to read and write data to the bucket via boto3:
import boto3
from botocore.exceptions import ClientError
from botocore.client import Config
config = Config(connect_timeout=20, retries={'max_attempts': 0})
s3_client = boto3.client('s3', config=config,
aws_access_key_id=access_key,
aws_secret_access_key=secret_key,
region_name="defaut",
endpoint_url='http://HOST_NAME:88',
verify=False
)
response = s3_client.list_buckets()
# Read
print('Existing buckets:')
for bucket in response['Buckets']:
print(f' {bucket["Name"]}')
# Write
dummy_data = b'Dummy string'
s3_client.put_object(Body=dummy_data, Bucket='test-spark', Key='awesome_key')
Also s3cmd with the same configuration is working fine.
I think I'm missing some pyspark (hadoop-aws) configuration, could anyone help me in identifying the configuration problem? Thanks.
After some research on the web, I was able to solve the problem using this hadoop-aws configuration:
fs.s3a.signing-algorithm: S3SignerType
I configured this property in pySpark with:
spark = (
SparkSession.builder.appName("app") \
.config("spark.hadoop.fs.s3a.access.key", access_key) \
.config("spark.hadoop.fs.s3a.secret.key", secret_key) \
.config("spark.hadoop.fs.s3a.connection.timeout", "10000") \
.config("spark.hadoop.fs.s3a.endpoint", "http://HOST_NAME:88") \
.config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
.config("spark.hadoop.fs.s3a.path.style.access", "true") \
.config("spark.hadoop.fs.s3a.endpoint.region", "default") \
.config("spark.hadoop.fs.s3a.signing-algorithm", "S3SignerType") \
.getOrCreate()
)
From what I understand, the version of ceph I am using (16.2.3), does not support the default signing algorithm used in Spark version v3.3.0 on hadoop version 3.3.2.
For further details see this documentation.

Spark on Kubernetes with Minio - Postgres -> Minio -unable to create executor due to

Hi I am facing an error with providing dependency jars for spark-submit in kubernetes.
/usr/middleware/spark-3.1.1-bin-hadoop3.2/bin/spark-submit --master k8s://https://112.23.123.23:6443 --deploy-mode cluster --name spark-postgres-minio-kubernetes --jars file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar --driver-class-path file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar --conf spark.executor.instances=1 --conf spark.kubernetes.namespace=spark --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.file.upload.path=s3a://daci-dataintegration/spark-operator-on-k8s/code --conf spark.hadoop.fs.s3a.fast.upload=true --conf spark.kubernetes.container.image=hostname:5000/spark-py:spark3.1.2 file:///AirflowData/kubernetes/python/postgresminioKube.py
Below is the code to execute. The jars needed for the S3 minio and configurations are placed in the spark_home/conf and spark_home/jars and the docker image is created.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("Postgres-Minio-Kubernetes").getOrCreate()
import json
#spark = SparkSession.builder.config('spark.driver.extraClassPath', '/hadoop/externalJars/db2jcc4.jar').getOrCreate()
jdbcUrl = "jdbc:postgresql://{0}:{1}/{2}".format("hosnamme", "port", "db")
connectionProperties = {
"user" : "username",
"password" : "password",
"driver": "org.postgresql.Driver",
"fetchsize" : "100000"
}
pushdown_query = "(select * from public.employees) emp_als"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, column="employee_id", lowerBound=1, upperBound=100, numPartitions=2, properties=connectionProperties)
df.write.format('csv').options(delimiter=',').mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
df.write.format('parquet').options(delimiter='|').options(header=True).mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
Error is below . It is trying to execute the jar for some reason
21/11/09 17:05:44 INFO SparkContext: Added JAR file:/tmp/spark-d987d7e7-9d49-4523-8415-1e438da1730e/postgresql-42.2.14.jar at spark://spark-postgres-minio-kubernetes-49d7d77d05a980e5-driver-svc.spark.svc:7078/jars/postgresql-42.2.14.jar with timestamp 1636477543573
21/11/09 17:05:49 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.216.12: Unable to create executor due to ./postgresql-42.2.14.jar
The external jars are getting added to the /opt/spark/work-dir and it didnt had access. So i changed the dockerfile to have access to the folder and then it worked.
RUN chmod 777 /opt/spark/work-dir

How do I use spark xml data source in .net?

Is there a way to use spark-xml (https://github.com/databricks/spark-xml) in a spark .net/c# job?
I was able to use spark-xml data source from .Net.
Here is the test program:
using Microsoft.Spark.Sql;
namespace MySparkApp
{
class Program
{
static void Main(string[] args)
{
SparkSession spark = SparkSession
.Builder()
.AppName("spark-xml-example")
.GetOrCreate();
DataFrame df = spark.Read()
.Option("rowTag", "book")
.Format("xml")
.Load("books.xml");
df.Show();
df.Select("author", "_id")
.Write()
.Format("xml")
.Option("rootTag", "books")
.Option("rowTag", "book")
.Save("newbooks.xml");
spark.Stop();
}
}
}
Checkout https://github.com/databricks/spark-xml and build an assembly jar using 'sbt assembly' command, copy the assembly jar to the dotnet project workspace.
Build project: dotnet build
Submit Spark job:
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--jars scala-2.11/spark-xml-assembly-0.10.0.jar \
--master local bin/Debug/netcoreapp3.1/microsoft-spark-2.4.x-0.10.0.jar \
dotnet bin/Debug/netcoreapp3.1/sparkxml.dll

Spark, kerberos, yarn-cluster -> connection to hbase

Facing one issue with Kerberos enabled Hadoop cluster.
We are trying to run a streaming job on yarn-cluster, which interacts with Kafka (direct stream), and hbase.
Somehow, we are not able to connect to hbase in the cluster mode. We use keytab to login to hbase.
This is what we do:
spark-submit --master yarn-cluster --keytab "dev.keytab" --principal "dev#IO-INT.COM" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j_executor_conf.properties -XX:+UseG1GC" --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j_driver_conf.properties -XX:+UseG1GC" --conf spark.yarn.stagingDir=hdfs:///tmp/spark/ --files "job.properties,log4j_driver_conf.properties,log4j_executor_conf.properties" service-0.0.1-SNAPSHOT.jar job.properties
To connect to hbase:
def getHbaseConnection(properties: SerializedProperties): (Connection, UserGroupInformation) = {
val config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", HBASE_ZOOKEEPER_QUORUM_VALUE);
config.set("hbase.zookeeper.property.clientPort", 2181);
config.set("hadoop.security.authentication", "kerberos");
config.set("hbase.security.authentication", "kerberos");
config.set("hbase.cluster.distributed", "true");
config.set("hbase.rpc.protection", "privacy");
config.set("hbase.regionserver.kerberos.principal", “hbase/_HOST#IO-INT.COM”);
config.set("hbase.master.kerberos.principal", “hbase/_HOST#IO-INT.COM”);
UserGroupInformation.setConfiguration(config);
var ugi: UserGroupInformation = null;
if (SparkFiles.get(properties.keytab) != null
&& (new java.io.File(SparkFiles.get(properties.keytab)).exists)) {
ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(properties.kerberosPrincipal,
SparkFiles.get(properties.keytab));
} else {
ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(properties.kerberosPrincipal,
properties.keytab);
}
val connection = ConnectionFactory.createConnection(config);
return (connection, ugi);
}
and we connect to hbase:
….
foreachRDD { rdd =>
if (!rdd.isEmpty()) {
//var ugi: UserGroupInformation = Utils.getHbaseConnection(properties)._2
rdd.foreachPartition { partition =>
val connection = Utils.getHbaseConnection(propsObj)._1
val table = …
partition.foreach { json =>
}
table.put(puts)
table.close()
connection.close()
}
}
}
Keytab file is not getting copied to yarn staging/temp directory, we are not getting that in SparkFiles.get… and if we pass keytab with --files, spark-submit is failing because it’s there in --keytab already.
error is:
This server is in the failed servers list: myserver.test.com/120.111.25.45:60020
RpcRetryingCaller{globalStartTime=1497943263013, pause=100, retries=5}, org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: myserver.test.com/120.111.25.45:60020
RpcRetryingCaller{globalStartTime=1497943263013, pause=100, retries=5}, org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: myserver.test.com/120.111.25.45:60020 at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:147)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:935)

Yarn support on Ooyala Spark JobServer

Just started experimenting with the JobServer and would like to use it in our production environment.
We usually run spark jobs individually in yarn-client mode and would like to shift towards the paradigm offered by the Ooyala Spark JobServer.
I am able to run the WordCount examples shown in the official page.
I tried running submitting our custom spark job to the Spark JobServer and I got this error:
{
"status": "ERROR",
"result": {
"message": "null",
"errorClass": "scala.MatchError",
"stack": ["spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:220)",
"scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)",
"scala.concurrent.impl.Future $PromiseCompletingRunnable.run(Future.scala:24)",
"akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)",
"akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)",
"scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)",
"scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java 1339)",
"scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)",
"scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)"]
}
I had made the necessary code modifications like extending SparkJob and implementing the runJob() method.
This is the dev.conf file that I used:
# Spark Cluster / Job Server configuration
spark {
# spark.master will be passed to each job's JobContext
master = "yarn-client"
# Default # of CPUs for jobs to use for Spark standalone cluster
job-number-cpus = 4
jobserver {
port = 8090
jar-store-rootdir = /tmp/jobserver/jars
jobdao = spark.jobserver.io.JobFileDAO
filedao {
rootdir = /tmp/spark-job-server/filedao/data
}
context-creation-timeout = "60 s"
}
contexts {
my-low-latency-context {
num-cpu-cores = 1
memory-per-node = 512m
}
}
context-settings {
num-cpu-cores = 2
memory-per-node = 512m
}
home = "/data/softwares/spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041"
}
spray.can.server {
parsing.max-content-length = 200m
}
spark.driver.allowMultipleContexts = true
YARN_CONF_DIR=/home/spark/conf/
Also how can I give run-time parameters for the spark job, such as --files, --jars ?
For example, I usually run our custom spark job like this:
./spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041/bin/spark-submit --class com.demo.SparkDriver --master yarn-cluster --num-executors 3 --jars /tmp/api/myUtil.jar --files /tmp/myConfFile.conf,/tmp/mySchema.txt /tmp/mySparkJob.jar
Number of executors and extra jars are passed in a different way, through the config file (see dependent-jar-uris config setting).
YARN_CONF_DIR should be set in the environment and not in the .conf file.
As for other issues, the google group is the right place to ask. You may want to search it for yarn-client issues, as several other folks have figured out how to get it to work.

Resources