How to run Spark distributed in cluster mode, but take a file locally? - apache-spark

Is it possible to have Spark take a local file as input, but process it in a distributed way?
I have sc.textFile("file:///path-to-file-locally") in my code, and I know that the exact path to the file is correct. Yet, I am still getting
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 14, spark-slave11.ydcloud.net): java.io.FileNotFoundException: File file:/<path to file> does not exist
I am running Spark distributed, not locally. Why does this error occur?

It is possible, but when you declare a local path as an input it has to be present on each worker machine and on the driver. That means you have to distribute it first, either manually or using built-in tools like SparkFiles; a sketch of both options follows.
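A minimal PySpark sketch of those two options; the file names and paths below are hypothetical placeholders:

from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="local-file-example")

# Option 1: read the file on the driver and distribute its lines yourself.
with open("/path/on/driver/data.txt") as f:          # hypothetical path
    rdd = sc.parallelize(f.read().splitlines())
print(rdd.count())

# Option 2: ship the file to every executor with SparkFiles and open the
# node-local copy inside each task.
sc.addFile("/path/on/driver/lookup.txt")             # hypothetical path

def tag_with_lookup(rows):
    with open(SparkFiles.get("lookup.txt")) as f:    # resolved per executor
        lookup = set(f.read().splitlines())
    return ((row, row in lookup) for row in rows)

tagged = rdd.mapPartitions(tag_with_lookup).collect()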

The files must be located in a centralized location that is accessible to all the nodes. This can be achieved by using a distributed file system. DSE (DataStax Enterprise) provides a replacement for HDFS called CFS (Cassandra File System). CFS is available when DSE is started in analytics mode using the -k option.
For further details on setting up and using CFS, have a look at the following link: http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/ana/anaCFS.html
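For example, once the file has been copied into CFS, the job can read it through the cfs:// scheme; the path below is an assumption for illustration:

from pyspark import SparkContext

sc = SparkContext(appName="cfs-example")
# Hypothetical CFS path; the file is assumed to have been uploaded to CFS already.
rdd = sc.textFile("cfs:///data/input.txt")
print(rdd.count())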

Related

Apache Spark SQL can not SELECT Cassandra timestamp columns

I created Docker containers in which I installed Apache Spark 3.1.2 (Hadoop 3.2); they host a ThriftServer which is configured to access Cassandra via the spark-cassandra-connector (3.1.0). Each of these services runs in its own container, so I have 5 containers up (1x Spark master, 2x Spark worker, 1x Spark ThriftServer, 1x Cassandra) which are configured to live in the same network via docker-compose.
I use the beeline client from Apache Hive (1.2.1) to query the database. Everything works fine, except for querying a field in Cassandra with the type timestamp.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 0.0 failed 4 times, most recent failure: Lost task 9.3 in stage 0.0 (TID 53) (192.168.80.5 executor 0): java.lang.ClassCastException: java.sql.Timestamp cannot be cast to java.time.Instant
I checked the Spark and spark-cassandra-connector documentation but didn't find much except for a configuration property called spark.sql.datetime.java8API.enabled, which is described as follows:
If the configuration property is set to true, java.time.Instant and java.time.LocalDate classes of Java 8 API are used as external types for Catalyst's TimestampType and DateType. If it is set to false, java.sql.Timestamp and java.sql.Date are used for the same purpose.
I think this property could help in my case. Although the docs say the default value is false, the value is always true in my case. I don't set it anywhere, and I've tried overwriting it with false in the $SPARK_HOME/conf/spark-defaults.conf file and via the --conf command-line parameter when starting up the ThriftServer (and the master/worker instances), but the environment tab (at localhost:4040) always shows it as true.
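As a side note, a plain PySpark session shows how the flag would normally be set programmatically; this is only a sketch and does not reproduce the ThriftServer setup:

from pyspark.sql import SparkSession

# Sketch: set the Java 8 time API flag explicitly and read back the
# effective value.
spark = (SparkSession.builder
         .appName("java8api-check")
         .config("spark.sql.datetime.java8API.enabled", "false")
         .getOrCreate())
print(spark.conf.get("spark.sql.datetime.java8API.enabled"))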
Is there a way to make Spark convert the timestamps in a way that doesn't lead to an exception? It would be important to do that in SQL since I want to connect software for data visualization later on.
I found this JIRA, which mentions a bug in time conversion that is not fixed in 3.1.2 (3.1.3 is not released yet) but is fixed in 3.0.3.
I downgraded to Apache Spark (3.0.3) and spark-cassandra-connector (3.0.1), which seems to solve the problem for now.

Is AWS EMR suited for an HA spark direct streaming application

I am trying to run an Apache Spark direct streaming application on AWS EMR.
The application receives data from and sends data to AWS Kinesis and needs to be running the whole time.
Of course, if a core node is killed, it stops. But it should self-heal when the core node is replaced.
Now I noticed: when I kill one of the core nodes (simulating a problem), it is replaced by AWS EMR. But the application stops working (no output is sent to Kinesis anymore) and it also does not continue working unless I restart it.
What I get in the logs is:
ERROR YarnClusterScheduler: Lost executor 1 on ip-10-1-10-100.eu-central-1.compute.internal: Slave lost
Which is expected. But then I get:
20/11/02 13:15:32 WARN TaskSetManager: Lost task 193.1 in stage 373.0 (TID 37583, ip-10-1-10-225.eu-central-1.compute.internal, executor 2): FetchFailed(null, shuffleId=186, mapIndex=-1, mapId=-1, reduceId=193, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 186
These are just warnings, yet the application does not produce any output anymore.
So I stop the application and start it again. Now it produces output again.
My question: Is AWS EMR suited for a self-healing application like the one I need, or am I using the wrong tool?
If it is, how do I get my Spark application to continue after a core node is replaced?
It's recommended to use On-Demand instances for CORE nodes, and at the same time use TASK nodes to leverage Spot instances.
Have a look at:
Amazon Emr - What is the need of Task nodes when we have Core nodes?
AWS DOC on Master, Core, and Task Nodes

java.io.FileNotFoundException: Item not found Concurrent read/write on ORC table

When I try concurrent read/write on a table using a Spark application, I get the following error:
19/10/28 15:26:49 WARN TaskSetManager: Lost task 213.0 in stage 6.0 (TID 407, prod.internal, executor 3): java.io.FileNotFoundException: Item not found: 'gs://bucket/db_name/table_name/p1=xxx/part-1009-54ad3fbb-5eed-43ba-a7da-fb875382897c.c000'. If you enabled STRICT generation consistency, it is possible that the live version is still available but the intended generation is deleted.
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.getFileNotFoundException(GoogleCloudStorageExceptions.java:38)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.open(GoogleCloudStorageImpl.java:631)
I am using Google Cloud Dataproc Version 1.4 and stock hadoop component versions.
I was previously writing to and reading from the same partition of a PARQUET table, but it used to throw a refresh-table error. Now I'm using an ORC format table, but the error stays much the same. Are there any solutions for concurrent read/write on Hive tables using Spark applications?
You can try running the statement
spark.sql("refresh table your_table")
before your read/write operation; it can work "occasionally". A sketch of where it fits is below.
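A minimal sketch of where the refresh fits relative to the read; the table and column names are hypothetical and the table is assumed to already exist:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("refresh-example")
         .enableHiveSupport()
         .getOrCreate())

# Writer side: overwrite data in the (pre-existing, hypothetical) table.
df = spark.createDataFrame([("xxx", 1)], ["p1", "value"])
df.write.mode("overwrite").insertInto("db_name.table_name")

# Reader side: invalidate Spark's cached file listing before reading, so
# the job does not request file generations that were just replaced.
spark.sql("refresh table db_name.table_name")
fresh = spark.read.table("db_name.table_name")
print(fresh.count())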
The first error line indicates that the file was not found in your bucket; you may want to look into this. Make sure the folders exist and that the files and the requested generations are accessible.
As for the "STRICT generation consistency" part, this is most probably related to Cloud Storage and produced by the connector, more precisely related to strongly consistent operations:
https://cloud.google.com/storage/docs/consistency
Have you looked into your error logs to see why this error occurs? What type of environment are you running your application on?
This may be more of a Hive issue relating to the concurrency mechanism you want to implement:
https://cwiki.apache.org/confluence/display/Hive/Locking
Also, I would advise you to look more into the recommendations and functionality of using Apache Hive on Cloud Dataproc. You can also consider using a multi-regional bucket if the Hive data needs to be accessed from Hive servers located in multiple regions.
https://cloud.google.com/solutions/using-apache-hive-on-cloud-dataproc

Change ulmit value in Spark

I am running Spark code on an EC2 instance. I am running into the "Too many open files" issue (logs below). I searched online, and it seems I need to set ulimit to a higher number. Since I am running the Spark job in AWS and I don't know where the config file is, how can I pass that value in my Spark code?
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 255 in stage 19.1 failed 4 times, most recent failure: Lost task 255.3 in stage 19.1 (TID 749786, 172.31.20.34, executor 207): java.io.FileNotFoundException: /media/ebs0/yarn/local/usercache/data-platform/appcache/application_1559339304634_2088/blockmgr-90a63e4a-dace-4246-a158-270b0b34c1f9/20/broadcast_13 (Too many open files)
Apart from changing the ulimit, you should also look for connection leakages. For example, check whether your I/O connections are properly closed. We saw the "Too many open files" exception even with a 655k ulimit on every node; later we found connection leakages in the code. A short sketch of the pattern to check for follows.
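A minimal PySpark sketch of the kind of leak to check for; the file paths are hypothetical:

from pyspark import SparkContext

sc = SparkContext(appName="leak-check-example")

# A handle opened per record and never closed will eventually exhaust the
# descriptor limit. A context manager releases it deterministically, even
# if processing a record raises.
def first_lines(paths):
    for path in paths:
        with open(path) as f:        # closed when the block exits
            yield f.readline()

paths = sc.parallelize(["/tmp/a.txt", "/tmp/b.txt"])   # hypothetical files
print(paths.mapPartitions(first_lines).collect())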

Unable to open native connection with spark sometimes

I'm running a Spark job with Spark version 1.4 and Cassandra 2.18. I can telnet from the master to the Cassandra machine and it works. Sometimes the job runs fine and sometimes I get the following exception. Why would this happen only sometimes?
"Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, 172.28.0.162): java.io.IOException: Failed to open native connection to Cassandra at {172.28.0.164}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:155)
"
It sometimes also gives me this exception along with the one above:
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.28.0.164:9042 (com.datastax.driver.core.TransportException: [/172.28.0.164:9042] Connection has been closed))
I had the second error, "NoHostAvailableException", happen to me quite a few times this week as I was porting Python Spark code to Java Spark.
I was having issues with the driver thread being nearly out of memory, and the GC was taking up all my cores (98% of all 8 cores), pausing the JVM all the time.
In Python, when this happens, it's much more obvious (to me), so it took me a bit of time to realize what was going on, and I got this error quite a few times.
I had two theories on the root cause, but the solution was to stop the GC from going crazy.
The first theory was that because the JVM was pausing so often, I just couldn't connect to Cassandra.
The second theory was that Cassandra was running on the same machine as Spark and the JVM was taking 100% of all CPU, so Cassandra just couldn't answer in time and it looked to the driver as if there were no Cassandra hosts.
Hope this helps!
