Amazon EMR pyspark unable to read a JSON file


I am new to EMR and big data.
We have an EMR step that was working fine until last month; now I am getting the error below.
--- Logging error ---
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/src.zip/src/source/Data_Extraction.py", line 59, in process_job_description
df_job_desc = spark.read.schema(schema_jd).option('multiline',"true").json(self.filepath)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 274, in json
return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o115.json.
: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:421)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:654)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:625)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:473)
at
These JSON files are stored in S3. I downloaded some of them to reproduce the issue locally,
but with the smaller set of data it works fine, so I am unable to reproduce the EMR failure.
I also checked the application details in EMR for this step.
It shows an undefined status, with the details below.
Details:org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3285)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:282)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:750)
Spark session creation:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

spark_builder = (
    SparkSession
    .builder
    .config(conf=SparkConf())
    .appName("test")
)
spark = spark_builder.getOrCreate()
I am not sure what suddenly went wrong with this step. Please help.

Your error indicates a failed TLS handshake. Search results for this message mostly point to the remote host throttling or rejecting incoming TLS connections, which is the kind of failure a backoff strategy is meant to handle.
You can try retrying with an exponential backoff strategy, and limiting your request rate by utilising the AMID.
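The exponential backoff idea mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not an EMR or AWS SDK feature: the helper name retry_with_backoff and its parameters (max_attempts, base_delay, max_delay, sleep) are hypothetical, and it catches every Exception, which you would probably narrow to the error you actually see.

```python
import random
import time


def retry_with_backoff(action, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call action() until it succeeds or max_attempts is exhausted.

    Before retry n the upper bound of the wait is base_delay * 2**(n-1),
    capped at max_delay. The actual wait is drawn uniformly between 0 and
    that bound so that parallel callers do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, bound))
```

With the reader from the question, a call might look like retry_with_backoff(lambda: spark.read.schema(schema_jd).option("multiline", "true").json(filepath)), where spark, schema_jd and filepath come from your own job.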
Additionally, check your DNS quotas to make sure they are not limiting your requests or being exhausted.
Please also add your application environment to the question, so we can check whether an outdated version might be causing this:
EMR Release version
Spark Versions
AWS SDK Version
AMI (Amazon Machine Image) versions
Java & JVM Details
Hadoop Details
The recommended environment would be AMI 2.x and EMR 5.3x, with SDKs compatible with them (preferably AWSS3JavaClient 1.11x).
More info about EMR releases can be found here.
Additionally, provide a clear snippet showing exactly how you read your JSON files from S3: iteratively, one after the other, or in bulk or in batches.
References used -
https://github.com/aws/aws-sdk-java/issues/2269
javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake during web service communicaiton
https://github.com/aws/aws-sdk-java/issues/1405

From your error message (...SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake), it seems that either you are using a security protocol the host does not accept, or the connection was closed on the service side before the SDK was able to complete the handshake. You should add a try/except block with some delay between retries to handle those errors:
import time

errors = 0
while errors < 5:
    try:
        df_job_desc = spark.read.schema(schema_jd).option('multiline', "true").json(self.filepath)
        break  # success: stop retrying
    except Exception:
        errors += 1
        if errors == 5:
            raise  # give up after five failed attempts
        time.sleep(1)  # brief pause before the next attempt

Related

Failed to get broadcast_1_piece0 of broadcast_1 in pyspark application

I am building an application on Apache Spark 2.0.0 with Python 3.4. It loads some CSV files from HDFS (Hadoop 2.7) and computes KPIs from the CSV data.
The application randomly fails with a "Failed to get broadcast_1_piece0 of broadcast_1" error and then stops.
After a lot of searching on Google and Stack Overflow, the only workaround I found is to manually delete the files the Spark application created in the /tmp directory. The error generally happens when an application has been running for a long time and is not responding properly, but its files are still in /tmp.
I don't declare any broadcast variables myself, but maybe Spark creates them on its own.
In my case, the error occurs when Spark tries to load a CSV file from HDFS.
I have captured low-level logs for my application and attached them here. I would appreciate suggestions and best practices for resolving the problem.
Sample (details are Attached here):
Traceback (most recent call last): File
"/home/hadoop/development/kpiengine.py", line 258, in
df_ho_raw =
sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(HDFS_BASE_URL
+ HDFS_WORK_DIR + filename) File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
line 147, in load File
"/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
line 933, in call File
"/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line
63, in deco File
"/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
line 312, in get_return_value py4j.protocol.Py4JJavaError: An error
occurred while calling o44.load. : org.apache.spark.SparkException:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.26.7.192):
java.io.IOException: org.apache.spark.SparkException: Failed to get
broadcast_1_piece0 of broadcast_1

Your class should extend Serializable.
To check whether the problem is in the framework or in your code, try running one of the bundled examples:
$SPARK_HOME/examples/src/main/scala/org/apache/spark/examples/
If those run fine, you should check your own code.

module error in multi-node spark job on google cloud cluster

This code runs perfectly when I set the master to localhost. The problem occurs when I submit it to a cluster with two worker nodes.
All the machines have the same version of Python and the same packages. I have also set the path to point to the desired Python version, i.e. 3.5.1. When I submit my Spark job from an SSH session on the master, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, .c..internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 419, in loads
return pickle.loads(obj, encoding=encoding)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/mllib/init.py", line 25, in
import numpy
ImportError: No module named 'numpy'
I saw other posts where people did not have access to their worker nodes. I do, and I get the same message for the other worker node. I am not sure whether I am missing some environment setting. Any help will be much appreciated.

I am not sure this qualifies as a solution, but I submitted the same job using Dataproc on Google Cloud Platform and it worked without any problem. I believe the best way to run jobs on a Google cluster is through the utilities Google provides. Dataproc seems to iron out any environment-related issues.
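If you want to see what the executors' Python environment actually looks like before switching tools, a small probe can help. The sketch below is only an illustration: probe_module is a hypothetical helper, not part of PySpark, and it reports the host, the interpreter path and whether a module can be imported by whichever interpreter runs it.

```python
import importlib
import socket
import sys


def probe_module(module_name):
    """Report whether module_name can be imported by the current interpreter."""
    try:
        importlib.import_module(module_name)
        importable = True
    except ImportError:
        importable = False
    return {
        "host": socket.gethostname(),        # machine the probe ran on
        "python": sys.executable,            # interpreter that ran the probe
        "version": sys.version.split()[0],   # e.g. "3.5.1"
        "module": module_name,
        "importable": importable,
    }
```

Run through Spark, for example sc.parallelize(range(8), 8).map(lambda _: probe_module("numpy")).collect() with your own SparkContext sc, it shows which interpreter each task used. Tasks are not guaranteed to land on every worker, so treat the result as a spot check.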

"KeyError: 'SPARK_HOME' ", "can't load main class from JAR" in running PySpark as an Oozie workflow job

This issue is a continuation of my previous question here, which was seemingly resolved but leads to here as another issue.
I am using Spark 1.4.0 on Cloudera QuickstartVM CDH-5.4.0.
When I run my PySpark script as a SparkAction in Oozie, I encounter this error in the Oozie job / container logs:
KeyError: 'SPARK_HOME'
Then I came across this solution and this, which are actually for Spark 1.3.0, although I tried them anyway. The documentation seems to say that this issue is already fixed in Spark 1.3.2 and 1.4.0 (but here I am, encountering the same issue).
The suggested solution in the link was to set spark.yarn.appMasterEnv.SPARK_HOME and spark.executorEnv.SPARK_HOME to anything, even a path that does not point to the actual SPARK_HOME (e.g. /bogus, although I did set these to the actual SPARK_HOME).
Here's my workflow afterwards:
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${resourceManager}</job-tracker>
<name-node>${nameNode}</name-node>
<master>local[2]</master>
<mode>client</mode>
<name>${name}</name>
<jar>${workflowRootLocal}/lib/my_pyspark_job.py</jar>
<spark-opts>--conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark spark.executorEnv.SPARK_HOME=/usr/lib/spark</spark-opts>
</spark>
This seems to solve the original problem above. However, it leads to another error, which I see when I inspect the stderr of the Oozie container log:
Error: Cannot load main class from JAR file:/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/cloudera/appcache/application_1437103727449_0011/container_1437103727449_0011_01_000001/spark.executorEnv.SPARK_HOME=/usr/lib/spark
Since I am using Python, it should not expect a main class, right? Please note from my previous related post that the Oozie job example shipped with Cloudera QuickstartVM CDH-5.4.0, which features a SparkAction written in Java, was working in my tests. It seems that the issue occurs only with Python.
I would greatly appreciate any help.

Rather than setting the spark.yarn.appMasterEnv.SPARK_HOME and spark.executorEnv.SPARK_HOME variables, try adding the following lines to your Python script before creating your SparkConf():
import os
os.environ["SPARK_HOME"] = "/path/to/spark/installed/location"
I found the reference here.
This resolved the error you are facing for me, but afterwards I ran into the following error:
Traceback (most recent call last):
File "/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py", line 129, in <module>
main()
File "/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py", line 60, in main
sc = SparkContext(conf=conf)
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 107, in __init__
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 155, in _do_init
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 201, in _initialize_context
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/java_gateway.py", line 701, in __call__
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package

Spark standalone mode : failed on connection exception:

I am running a Spark (1.2.1) standalone cluster on my virtual machines (Ubuntu 12.04). I can run examples such as als.py and pi.py successfully, but I can't run the wordcount.py example because a connection error occurs.
bin/spark-submit --master spark://192.168.1.211:7077 /examples/src/main/python/wordcount.py ~/Documents/Spark_Examples/wordcount.py
The error message is as below:
15/03/13 22:26:02 INFO BlockManagerMasterActor: Registering block manager a12:45594 with 267.3 MB RAM, BlockManagerId(0, a12, 45594)
15/03/13 22:26:03 INFO Client: Retrying connect to server: a11/192.168.1.211:9000. Already tried 4 time(s).
......
Traceback (most recent call last):
File "/home/spark/spark/examples/src/main/python/wordcount.py", line 32, in <module>
.reduceByKey(add)
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1349, in reduceByKey
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1559, in combineByKey
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1942, in _defaultReducePartitions
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 297, in getNumPartitions
......
py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
java.lang.RuntimeException: java.net.ConnectException: Call to a11/192.168.1.211:9000 failed on connection exception: java.net.ConnectException: Connection refused
......
I am not using YARN or ZooKeeper. All the virtual machines can connect to each other via SSH without a password. I have also set SPARK_LOCAL_IP for the master and the workers.

I think the wordcount.py example is accessing HDFS to read the lines of a file (and then count the words).
Something like:
sc.textFile("hdfs://<master-hostname>:9000/path/to/whatever")
Port 9000 is commonly used for HDFS.
Please make sure that this file is accessible, or do not use HDFS for that example :).
I hope it helps.
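A quick way to confirm whether anything is listening on that port is a plain TCP connection test. The following is a minimal sketch using only the standard library; port_is_open is a hypothetical helper name, and the host and port you pass in are whatever appears in your own log (a11 and 9000 in the question).

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, timeout and DNS failure.
        return False
```

If port_is_open("a11", 9000) returns False from the machine that submits the job, that would be consistent with the "Connection refused" in the log, for example because no HDFS NameNode is running there.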

Cassandra - NullPointerException on new ColumnFamily creation

I'm running into the exact issue described here: https://issues.apache.org/jira/browse/CASSANDRA-4363, but with Cassandra 1.1.2, using cqlsh --cql3.
When I try to create a column family, I get this error:
Traceback (most recent call last):
File "./cqlsh", line 1008, in perform_statement
self.cursor.execute(statement, decoder=decoder)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cursor.py", line 117, in execute
response = self.handle_cql_execution_errors(doquery, prepared_q, compress)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cursor.py", line 132, in handle_cql_execution_errors
return executor(*args, **kwargs)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cassandra/Cassandra.py", line 1583, in execute_cql_query
self.send_execute_cql_query(query, compression)
File "./../lib/cql-internal-only-1.0.10.zip/cql-1.0.10/cql/cassandra/Cassandra.py", line 1593, in send_execute_cql_query
self._oprot.trans.flush()
File "./../lib/thrift-python-internal-only-0.7.0.zip/thrift/transport/TTransport.py", line 293, in flush
self.__trans.write(buf)
File "./../lib/thrift-python-internal-only-0.7.0.zip/thrift/transport/TSocket.py", line 117, in write
plus = self.handle.send(buff)
error: [Errno 32] Broken pipe
Sometimes I simply get "TSocket read 0 bytes" instead.
The server-side log is also the same as the one in the JIRA ticket:
ERROR [Thrift:12] 2012-09-05 12:06:10,999 CustomTThreadPoolServer.java (line 204) Error occurred during processing of message.
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.NullPointerException
at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:373)
at org.apache.cassandra.service.MigrationManager.announce(MigrationManager.java:188)
at org.apache.cassandra.service.MigrationManager.announceNewColumnFamily(MigrationManager.java:139)
at org.apache.cassandra.cql3.statements.CreateColumnFamilyStatement.announceMigration(CreateColumnFamilyStatement.java:83)
at org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:99)
at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:108)
at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:121)
at org.apache.cassandra.thrift.CassandraServer.execute_cql_query(CassandraServer.java:1237)
at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql_query.getResult(Cassandra.java:3542)
at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql_query.getResult(Cassandra.java:3530)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:186)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:369)
... 15 more
Caused by: java.lang.NullPointerException
at org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:167)
I tried deleting the /var/lib/cassandra directory and restarting the server, but I still get the error.
The bug in JIRA has been marked as Cannot Reproduce, with Fixed version None.
So what do I do to get my Cassandra working again?

https://issues.apache.org/jira/browse/CASSANDRA-4526 suggests that this was fixed in 1.1.3. I'd upgrade (to 1.1.4, the most recent release).
