I am running Spark inside Glue to write to AWS Elasticsearch with the following Spark configuration:
conf.set("es.nodes", s"$nodes/$indexName")
conf.set("es.port", "443")
conf.set("es.batch.write.retry.count", "200")
conf.set("es.batch.size.bytes", "512kb")
conf.set("es.batch.size.entries", "500")
conf.set("es.index.auto.create", "false")
conf.set("es.nodes.wan.only", "true")
conf.set("es.net.ssl", "true")
However, what I get is the following error:
diagnostics: User class threw exception: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
....
I know which VPC my Elasticsearch instance is running in, but I am not sure how to set that for Glue/Spark, or whether it is a different problem altogether. Any ideas?
I have also tried to add a Glue JDBC connection, which should use the proper VPC, but I am not sure how to set it up properly:
import scala.reflect.runtime.universe._

def saveToEs[T <: Product : TypeTag](index: String, data: RDD[T]) =
  SparkProvider.glueContext.getJDBCSink(
    catalogConnection = "my-elasticsearch-connection",
    options = JsonOptions(
      "WHAT HERE?"
    ),
    transformationContext = "SinkToElasticSearch"
  ).writeDynamicFrame(DynamicFrame(
    SparkProvider.sqlContext.createDataFrame[T](data),
    SparkProvider.glueContext))
Try creating a dummy JDBC connection. The dummy connection will tell Glue the VPC, subnet, and security group of the ES domain. A test of the connection might not work, but when you run your job with the connection attached, Glue will use the connection metadata to launch an elastic network interface in your VPC to facilitate this communication. More on connections can be found here:
[1] https://docs.aws.amazon.com/glue/latest/dg/start-connecting.html
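Once the job runs with that connection attached, the write itself could look roughly like the sketch below (untested; it assumes the elasticsearch-spark artifact is on the job's classpath, reuses nodes and indexName from the question, and keeps the index name out of es.nodes):

import org.elasticsearch.spark._

val esCfg = Map(
  "es.nodes"          -> nodes,   // domain endpoint only, without the index name
  "es.port"           -> "443",
  "es.nodes.wan.only" -> "true",
  "es.net.ssl"        -> "true"
)

// data: RDD[T <: Product], as in saveToEs above; on older ES versions the
// resource would be s"$indexName/$docType" rather than just the index name
data.saveToEs(indexName, esCfg)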
I am trying to read from (and eventually write to) Azurite (version 3.18.0) using Spark (3.1.1).
I can't work out which Spark configurations and file URI I need to set to make this work properly.
For example, these are the containers and files I have inside Azurite:
/devstoreaccount1/container1/file1.avro
/devstoreaccount1/container2/file2.avro
This is the code I'm running; the uri val is one of the values discussed below:
val uri = ...
val spark = SparkSession.builder()
  .appName(appName)
  .master("local")
  .config("spark.driver.host", "127.0.0.1")
  .getOrCreate()

spark.conf.set("spark.hadoop.fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set("spark.hadoop.fs.azure.account.auth.type.devstoreaccount1.blob.core.windows.net", "SharedKey")
spark.conf.set("spark.hadoop.fs.azure.account.key.devstoreaccount1.blob.core.windows.net", <azurite account key>)
spark.read.format("avro").load(uri)
Which uri value is the correct one?
http://127.0.0.1:10000/container1/file1.avro
With this I get an UnsupportedOperationException when I call spark.read.format("avro").load(uri), because Spark uses the HttpFileSystem implementation, which doesn't support listStatus.
wasb://container1@devstoreaccount1.blob.core.windows.net/file1.avro
With this, Spark tries to authenticate against the Azure servers (and fails, for obvious reasons).
I have tried to follow this Stack Overflow post without success.
I have also tried removing the blob.core.windows.net suffix from the configuration keys, but then I don't know how to give Spark the endpoint for the Azurite container.
So my question is: what are the correct configurations to give Spark so that it can read from Azurite, and what is the correct file path format to pass as the URI?
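One direction worth trying, as an untested sketch: hadoop-azure has a storage-emulator mode keyed off the fs.azure.storage.emulator.account.name setting, and the assumption here is that it also works against Azurite, with the wasb URI dropping the blob.core.windows.net suffix for that account:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("azurite-test")
  .master("local")
  .config("spark.driver.host", "127.0.0.1")
  // assumption: requests for this account name are routed to the local emulator
  // endpoint (127.0.0.1:10000) instead of <account>.blob.core.windows.net
  .config("spark.hadoop.fs.azure.storage.emulator.account.name", "devstoreaccount1")
  .getOrCreate()

// emulator-style URI: account name only, no blob.core.windows.net suffix
val df = spark.read.format("avro").load("wasb://container1@devstoreaccount1/file1.avro")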
I'm using Spark 2.4 on a Kerberos-enabled cluster, where I'm trying to run a query via the spark-sql shell.
The simplified setup basically looks like this: spark-sql shell running on one host in a YARN cluster -> external Hive metastore running on another host -> S3 storing the table data.
When I launch the spark-sql shell with DEBUG logging enabled, this is what I see in the logs:
> bin/spark-sql --proxy-user proxy_user
...
DEBUG HiveDelegationTokenProvider: Getting Hive delegation token for proxy_user against hive/_HOST#REALM.COM at thrift://hive-metastore:9083
DEBUG UserGroupInformation: PrivilegedAction as:spark/spark_host#REALM.COM (auth:KERBEROS) from:org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130)
This means that Spark made a call to fetch the delegation token from the Hive metastore and then added it to the list of credentials for the UGI. This is the piece of code in Spark which does that. I also verified in the metastore logs that the get_delegation_token() call was being made.
Now, when I run a simple query like create table test_table (id int) location "s3://some/prefix"; I get an AWS credentials error. I modified the Hive metastore code and added the following right before the Hadoop file system is initialized (org/apache/hadoop/hive/metastore/Warehouse.java):
public static FileSystem getFs(Path f, Configuration conf) throws MetaException {
  ...
  try {
    // get the current user
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    LOG.info("UGI information: " + ugi);
    Collection<Token<? extends TokenIdentifier>> tokens = ugi.getCredentials().getAllTokens();
    // print all the tokens it has
    for (Token<? extends TokenIdentifier> token : tokens) {
      LOG.info(token.toString());
    }
  } catch (IOException e) {
    e.printStackTrace();
  }
  ...
}
In the metastore logs, this does print the correct UGI information:
UGI information: proxy_user (auth:PROXY) via hive/hive-metastore#REALM.COM (auth:KERBEROS)
but there are no tokens present in the UGI. It looks like the Spark code adds the token with the alias hive.server2.delegation.token, but I don't see it in the UGI. This makes me suspect that the UGI scope is somehow isolated and not shared between spark-sql and the Hive metastore. How do I go about solving this?
Spark is not picking up your Kerberos identity; it asks each filesystem to issue a "delegation token" which lets the caller interact with that service and that service alone. This is more restricted, and so more secure.
The problem here is that Spark collects delegation tokens from every filesystem which can issue them, and as your S3 connector isn't issuing any, nothing is coming down.
Now, Apache Hadoop 3.3.0's S3A connector can be set to issue your AWS credentials inside a delegation token or, for bonus security, to ask AWS for session credentials and send only those over. But (a) you need a Spark build with those dependencies, and (b) Hive needs to be using those credentials to talk to S3.
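For reference, a minimal sketch of enabling that on the Spark side, assuming a Hadoop 3.3.0 build with S3A delegation-token support and using the binding class name from the Hadoop documentation (verify it against your build):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-delegation-tokens")
  // hand out short-lived session credentials inside the S3A delegation token;
  // FullCredentialsTokenBinding would ship the long-lived keys instead
  .config("spark.hadoop.fs.s3a.delegation.token.binding",
          "org.apache.hadoop.fs.s3a.auth.delegation.SessionTokenBinding")
  .getOrCreate()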
I am trying to run a simple Apache Spark (Cloudera) read operation against a local object store that is fully S3 SDK/API compatible, but I cannot figure out how to make Spark understand that I am trying to access a local S3 bucket and not remote AWS S3.
Here's what I've tried...
pyspark2 --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/myusername/awskeyfile.jceks --conf fs.s3a.endpoint=https://myenvironment.domain.com
df = spark.read.parquet("s3a://mybucket/path1/")
Error message...
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to mybucket.s3.amazonaws.com:443 [mybucket.s3.amazonaws.com/12.345.678.90] failed: Connection refused (Connection refused)
I can list the local bucket contents without issue on the command line, so I know I have the access/secret key correct, but I need to make Spark understand that it should not reach out to AWS to resolve the bucket URL.
Thanks.
Update / Resolution:
The fix was a missing prerequisite jar, at Maven coordinates org.apache.hadoop:hadoop-aws:2.6.0.
So the final pyspark call looked like:
pyspark2 --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/myusername/awskeyfile.jceks --conf fs.s3a.endpoint=https://myenvironment.domain.com --jars hadoop-aws-2.6.0.jar
df = spark.read.parquet("s3a://mybucket/path1/")
This is covered in HDP docs, Working with third party object stores.
Settings are the same for CDH.
It comes down to setting the endpoint (fs.s3a.endpoint = hostname), disabling the DNS-to-bucket mapping (fs.s3a.path.style.access = true), and playing with the signing options.
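Put together, a minimal sketch of those settings (in Scala; the same spark.hadoop.* keys apply from the pyspark --conf flags in the question, and the bucket and endpoint names are the question's placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("local-object-store")
  .config("spark.hadoop.fs.s3a.endpoint", "https://myenvironment.domain.com") // the store's endpoint
  .config("spark.hadoop.fs.s3a.path.style.access", "true")                    // don't map bucket names to DNS hostnames
  .getOrCreate()

val df = spark.read.parquet("s3a://mybucket/path1/")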
There are a few other switches you can turn for better compatibility; they're in those docs.
You might find the Cloudstore storediag command useful.
I have a Glue job script that does this (not showing imports and setup here) and it inserts the row into SQL Server RDS just fine:
columns = ['test']
vals = [("test",)]  # one-element tuple per row; ("test") on its own is just a string
df = sqlContext.createDataFrame(vals, columns)
test = DynamicFrame.fromDF(df, glueContext, "test")
datasink = glueContext.write_dynamic_frame.from_catalog(frame = test,
    database = "database-name", table_name = "table-name")
job.commit()
When I run with this same connection but a larger test load (about 100 rows), I get this error:
An error occurred while calling o596.pyWriteDynamicFrame. The TCP/IP connection to the host , port 1433 has failed. Error: "Connection timed out: no further information. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall
The thing is that I know there's no firewall or security group issue since one row inserts just fine. I've tried adding a loginTimeout parameter to the JDBC connection like so:
jdbc:sqlserver://<host>:<port>;databaseName=dbName;loginTimeout=600;
As indicated here, you can do so. But with Glue the connection fails when I add that parameter and succeeds when I remove it.
I've also checked the remote timeout configuration on my SQL Server instance, and it shows 600 seconds, which is longer than any of my failed jobs, so it can't be that.
How can I get around this connection timeout error? It seems to be a limitation built into Glue.
To make a JDBC connection with Glue, you need to follow the steps in this documentation: https://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html
We had done that, but it turned out that our self-referencing security group wasn't actually self-referencing. Once we changed that, the issue was resolved.
I also had to create the connection as an Amazon RDS connection and not as a JDBC connection even though it's doing the same thing under the hood.
Even after doing all that I still had issues. It turns out that you need to add the SQL connection to the job itself, outside of the script. If you hit "Edit Job" you'll see a list of SQL connections there. If the connection you're trying to reach isn't in the list of required connections, you will always time out.
Thanks for your time. I am trying to access a remote Cassandra DB in order to complete my assertions. I can see that the server is running:
Cassandra V 3.0.8.1293
Driver Type: Cassandra CQL
Datastax Java Driver for Apache Cassandra - Core [3.0.5]
So I am trying to access the DB with the following simple code:
import com.datastax.driver.core.*;

Cluster cluster = null;
try {
    cluster = Cluster.builder()
        .addContactPoint("x.x.x.x")
        .withCredentials("xxxxxxx", "xxxxxx")
        .withPort(9042)
        .build();
    Session session = cluster.connect();
    ResultSet rs = session.execute("select * from TABLE");
    Row row = rs.one();
} finally {
    if (cluster != null) cluster.close();
}
When I use cassandra-driver-core-2.0.1.jar I get the error:
ERROR:com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /x.x.x.x(null))
I read the documentation and a lot of posts here and on other blogs, and saw that there may be an incompatibility with the driver version, so I tried upgrading the driver to several versions (cassandra-driver-core-2.5, cassandra-driver-core-3, cassandra-driver-core-3.2), but with those I get the following:
ERROR:java.lang.ExceptionInInitializerError
I have also tried to connect using JDBC, to no avail, using the configuration proposed in this thread:
SoapUI JDBC connection with Apache Cassandra
I am running out of ideas. Can anyone suggest a direction for how to achieve this, or point me to a relevant tutorial?
Thank you very much
I think you haven't enabled remote access to Cassandra.
Try enabling remote access using the configuration below:
File path: /etc/cassandra/default.conf/cassandra.yaml
rpc_address: 0.0.0.0
broadcast_rpc_address: <serverIPAddress>
After that, restart the Cassandra service.