IllegalStateException on Google Cloud Dataproc - apache-spark

I am trying to write a simple vanilla collaborative filtering application, running on Google Cloud Dataproc.
The Data is located in BigQuery.
I have implemented this according to this tutorial: https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml
Now the problem is that when running this (slightly modified) example I get an IllegalStateException. More specifically here is the stacktrace:
17/09/25 10:55:37 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
Traceback (most recent call last):
File "/tmp/af84ad68-0259-4ca1-b464-a118a96f0742/marketing-pages-collaborative-filtering.py", line 109, in <module>
compute_recommendations()
File "/tmp/af84ad68-0259-4ca1-b464-a118a96f0742/marketing-pages-collaborative-filtering.py", line 59, in compute_recommendations
conf=conf)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 646, in newAPIHadoopRDD
File "/usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, marketing-pages-collaborative-filtering-w-1.c.dg-dev-personalization.internal): java.lang.IllegalStateException: Found known file 'data-000000000002.json' with index 2, which isn't less than or equal to than endFileNumber 1!
at com.google.cloud.hadoop.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:197)
at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.setEndFileMarkerFile(DynamicFileListRecordReader.java:327)
at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:177)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:182)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1324)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1324)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1324)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.take(RDD.scala:1298)
at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:203)
at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:582)
at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Found known file 'data-000000000002.json' with index 2, which isn't less than or equal to than endFileNumber 1!
at com.google.cloud.hadoop.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:197)
at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.setEndFileMarkerFile(DynamicFileListRecordReader.java:327)
at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:177)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:182)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1324)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1324)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
17/09/25 10:55:37 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector#1dfdb336{HTTP/1.1}{0.0.0.0:4040}
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [af84ad68-0259-4ca1-b464-a118a96f0742] entered state [ERROR] while waiting for [DONE].
I think I have identified the problem, but I cannot find the cause of the problem. The relevant code snippet is this:
table_rdd = spark.sparkContext.newAPIHadoopRDD(
"com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
"org.apache.hadoop.io.LongWritable",
"com.google.gson.JsonObject",
conf=conf)
table_json = table_rdd.map(lambda x: x[1])
visit_data = sparkSession.read.json(table_json)
First I create the RDD according to the tutorial from Google. The next step is to extract the JSON elements from the RDD, and this then is read into a table, that we can query.
The stacktrace shows that the exception happens when assigning conf, but the code works until I call sparkSession.read.json(table_json) because as I understood it spark works lazily and only then tries to access the actual JSON files that were exported from BigQuery.
Now the problem is that Spark finds more JSON files than there should be.
According to this comment in the code of the BigQuery Hadoop Library, even if everything fits into one shard, the minimum is two, such that BigQuery recognizes the export as such. Also it says there that it generates a so called end-marker file, which as far as I can say, is just an empty JSON file.
But when running the code the export that is generated by BigQuery has more than the 2 necessary files (1 containing data and 1 as the end-marker). It generates up to 5 JSON files, that sometimes contain just 1 or 2 rows from BigQuery.
I am pretty sure that this is the problem, that the export somehow is wrong. But I cannot find out why this happens and how to fix it. Any help is appreciated.
UPDATE:
I tried something else. I deleted the table in BigQuery and populated it again from scratch. This solved the problem with the export. There are only two files now. But I think the problem still persists. I will try adding some rows via Cloud Functions (which would happen in my application) and then update on the behaviour.
UPDATE 2:
So after waiting a day and adding some rows via streaming inserts using a Cloud Function, the issue happens again. Somehow the exports are partitioned by day. That would not be a problem if each day gets its own shard, but this does not happen unfortunately.

This is a bug in BigQuery (that it returns the output file count statistics that does not include the zero-record file). The fix for this issue has been submitted, and its rollout will complete in about a week.
In the meantime, a workaround of the issue is maybe set the flag "mapred.bq.input.sharded.export.enable" (a.k.a. ENABLE_SHARDED_EXPORT_KEY) to false in your hadoop config when configuring your DataProc job.
UPDATE:
As of today Oct 6 2017, the fix is now 100% rolled out on BigQuery.

Related

Azure Databricks sentiment analysis doesn't work

I'm using Azure Databricks and I want to do a text sentiment analysis with the following code:
from pyspark.sql.functions import col
import synapse.ml
from synapse.ml.cognitive import *
# Create a dataframe that's tied to it's column names
df_sentences = spark.createDataFrame([
("I am so happy today, its sunny!", "en-US"),
("this is a dog", "en-US"),s
("I am frustrated by this rush hour traffic!", "en-US")
], ["text", "language"])
# Run the Text Analytics service with options
sentiment = (TextSentiment()
.setTextCol("text")
.setLocation("eastasia") # Set the location of your cognitive service
.setSubscriptionKey(cognitive_service_key)
.setOutputCol("sentiment")
.setErrorCol("error")
.setLanguageCol("language"))
# Show the results of your text query in a table format
display(sentiment.transform(df_sentences).select("text", col("sentiment")[0].getItem("sentiment").alias("sentiment")))
This doesn't work.. Here is the full error with the details:
java.lang.NoClassDefFoundError: Could not initialize class com.microsoft.azure.synapse.ml.param.ServiceParamJsonProtocol$
Py4JJavaError: An error occurred while calling None.com.microsoft.azure.synapse.ml.cognitive.TextSentiment.
: java.lang.NoClassDefFoundError: Could not initialize class com.microsoft.azure.synapse.ml.param.ServiceParamJsonProtocol$
at com.microsoft.azure.synapse.ml.param.ServiceParam.<init>(JsonEncodableParam.scala:62)
at com.microsoft.azure.synapse.ml.cognitive.HasSubscriptionKey.$init$(CognitiveServiceBase.scala:130)
at com.microsoft.azure.synapse.ml.cognitive.CognitiveServicesBaseNoHandler.<init>(CognitiveServiceBase.scala:306)
at com.microsoft.azure.synapse.ml.cognitive.TextAnalyticsBase.<init>(TextAnalytics.scala:53)
at com.microsoft.azure.synapse.ml.cognitive.TextSentiment.<init>(TextAnalytics.scala:288)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:250)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
I'm running 10.4 (includes Apache Spark 3.2.1, Scala 2.12) cluster and have synapseml_2.12:0.10.0 installed. Does someone know what goes wrong?
Generally this error is caused by incompatible versions of libraries with spark and Scala.
To get rid of this error you need to Uninstall synapseml_2.12:0.10.0 and install synapseml_2.12:0.10.1
Maven coordinates - com.microsoft.azure:synapseml_2.12:0.10.1

Spark magic output committer settings not recognized

I'm trying to play around with different Spark output committer settings for s3, and wanted to try out the magic committer. So far I didn't manage to get my jobs to use the magic committer, and they always seem to fall back on the file output committer.
The Spark job I'm running is a simple PySpark test job that runs a simple query, repartitions the data and outputs parquet to s3:
df = spark.sql("select * from some_table where some_condition")
df.write \
.partitionBy("some_column") \
.parquet("s3://some-bucket/some-folder", mode="overwrite")
The relevant spark settings are (taken from the Spark UI, job's environment tab):
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.magic.enabled true
spark.hadoop.fs.s3a.committer.name magic
spark.hadoop.fs.s3a.committer.staging.tmp.path tmp/staging
spark.hadoop.fs.s3a.committer.staging.unique-filenames true
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
mapreduce.output.fileoutputformat.compress false
mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.DefaultCodec
mapreduce.output.fileoutputformat.compress.type RECORD
mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
mapreduce.fileoutputcommitter.algorithm.version 1
mapreduce.fileoutputcommitter.task.cleanup.enabled false
mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
Hadoop properties:
fs.s3a.committer.magic.enabled true
fs.s3a.committer.name magic
(Let me know if any other settings are relevant)
I'm basing the observation of file committer being used instead of magic committer on a couple of things:
Different log lines produced by the spark job seem to indicate the file output committer being used:
"class":"org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter","file_line":"FileOutputCommitter.java:601","func":"commitTask","message":"Saved output of task 'attempt_2021...' to s3://some-bucket/some-folder/_temporary/0/
task_2021..."
"class":"org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat","file_line":"ParquetFileFormat.scala:54","message":"U
sing user defined output committer for Parquet: org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"
"class":"org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter","file_line":"FileOutputCommitter.java:141","func":"<init>","message":"File Outpu
t Committer Algorithm version is 1"
"class":"org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter","file_line":"FileOutputCommitter.java:156","func":"<init>","message":"FileOutput
Committer skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false"
When setting the file committer's algo to an invalid number, like so:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version -7
an exception is raised from the file committer's constructor saying the value is invalid - implicating that the file committer was initialized instead of the magic committer.
I'm not seeing any logs indicating usage of the magic committer, or any failure to initialize a committer which could explain falling back on the file committer.
Spark version is 3.1.2 using this spark-hadoop-cloud JAR. Let me know if there's any other officially published JAR I can try or if there are any other log indications that may be relevant.
Any thoughts?
===== EDIT:
Below is the stack trace I see when setting the file committer algo to an invalid value. It seems that the call to org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.setupCommitter ends up calling org.apache.hadoop.mapreduce.lib.output.FileOutputCommitterFactory.createOutputCommitter which in turn initializes the incorrect type org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter instead of the configured type org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
Py4JJavaError: An error occurred while calling o259.parquet.
: java.io.IOException: Only 1 or 2 algorithm version is supported
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:143)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:117)
at org.apache.hadoop.mapreduce.lib.output.PathOutputCommitterFactory.createFileOutputCommitter(PathOutputCommitterFactory.java:134)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitterFactory.createOutputCommitter(FileOutputCommitterFactory.java:35)
at org.apache.hadoop.mapreduce.lib.output.PathOutputCommitterFactory.createCommitter(PathOutputCommitterFactory.java:201)
at org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.setupCommitter(PathOutputCommitProtocol.scala:88)
at org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.setupCommitter(PathOutputCommitProtocol.scala:49)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:177)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:874)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Mystery solved - the failure to initialize the magic committer was due to a mismatch between the committer factory scheme setting to the scheme of the actual destination URL. Consider this:
The committer factory configuration was set using the key: spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a - meaning that the setting is made for s3a protocol URLs.
While th URL sent to the write method was: s3://some-bucket/some-folder - using s3 protocol instead of s3a.
The PathOutputCommitterFactory hadoop class searches for a config key with pattern mapreduce.outputcommitter.factory.scheme.%s to recognize which factory to use for the given output URL. In case the pattern set in the config key (in this case s3a) does not match the pattern in the destination URL (in this case s3) - the committer factory setting will not be recognized and the factory type will fall back on FileOutputCommitter.
Solution - make sure the outputcommitter.factory.scheme.<protocol> setting matches the protocol in the destination URL. I've successfully tested using both s3 and s3a in the URL & config key.
this does sound like a binding problem but I cannot see immediately where it is. At a glance you have all the right settings.
The easiest way to check that an S3 a committee is being used is to look at the _SUCCESS file . If it is a piece of JSON then a new committer was used… The text inside will then tell you more about the committer.
a 0 byte file means that the classic file output committer was still used

Can you translate (or alias) s3:// to s3a:// in Spark/ Hadoop?

We have some code that we run on Amazon's servers that loads parquet using the s3:// scheme as advised by Amazon. However, some developers want to run code locally using a spark installation on Windows, but stubbornly spark insists on using the s3a:// scheme.
We can read files just fine using s3a, but we get an java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException.
SparkSession available as 'spark'.
>>> spark.read.parquet('s3a://bucket/key')
DataFrame[********************************************]
>>> spark.read.parquet('s3://bucket/key')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\readwriter.py", line 316, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.parquet.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
at org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:99)
at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:644)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: org.jets3t.service.S3ServiceException
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 24 more
Is there a way to get hadoop or spark or pyspark to "translate" the URI scheme from s3 to s3a via some sort of magic configuration? Changing the code is not an option we entertain as it would involve quite a lot of testing.
The local environment is windows 10, pyspark2.4.4 with hadoop2.7 (prebuilt), python3.7.5, and the right aws libs installed.
EDIT: One hack I used - since we're not supposed to use s3:// paths is to just convert them to s3a:// in pyspark.
I've added the following function in readwriter.py and just invoked it wherever there was a call out to the jvm with paths. Works fine, but would be nice if this was a config option.
def massage_paths(paths):
if isinstance(paths, basestring):
return 's3a' + x[2:] if x.startswith('s3:') else x
if isinstance(paths, list):
t = list
else:
t = tuple
return t(['s3a' + x[2:] if x.startswith('s3:') else x for x in paths])
cricket007 is correct.
spark.hadoop.fs.s3.impl org.apache.fs.s3a.S3AFileSystem
There's some code in org.apache.hadoop.FileSystem which looks up from a schema "s3" to an implementation class, loads it and instantiates it with the full URL.
Warning There's no specific code in the core S3A FS which looks for an FS schema being s3a, but you will encounter problems if you use the DynamoDB consistency layer "S3Guard" -that's probably a bit of overkill someone could fix
Ideally, you could refactor the code to detect the runtime environment, or externalize the paths to a config file that could be used in the respective areas.
Otherwise, you would need to edit the hdfs-site.xml to configure the fs.s3a.impl key to rename s3a to s3, and you might be able to keep the value the same. That change would need done for all Spark workers
You probably won't be able to configure Spark to help you "translate".
Instead, this is more like a design issue. The code should be made configurable to choose different protocol for different environment(that was what I did for a similar situation). If you insist on working locally, some code refactoring may not be avoidable...

spark-excel dataype issues

I am using spark-excel package for processing ms excel files using spark 2.2. Some of the files are getting failed to load as a spark dataframe with below exception. If someone have faced this issue can you please help to fix such data type issues?
After analyzing I found at that if column name is not a string, it ends up giving below exception, If I manually change the column name to string from integer, it works fine.
Code:
val excelDF = spark.read.
format("com.crealytics.spark.excel").
option("useHeader", "true").
option("treatEmptyValuesAsNulls", "true").
option("inferSchema", "true").
option("addColorColumns", "False").
option("sheetName", sheetName).
load(filePath)
Exception:
java.lang.IllegalStateException: Cannot get a STRING value from a NUMERIC cell
at org.apache.poi.xssf.usermodel.XSSFCell.typeMismatch(XSSFCell.java:1077)
at org.apache.poi.xssf.usermodel.XSSFCell.getRichStringCellValue(XSSFCell.java:395)
at org.apache.poi.xssf.usermodel.XSSFCell.getStringCellValue(XSSFCell.java:347)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1$$anonfun$10.apply(ExcelRelation.scala:206)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1$$anonfun$10.apply(ExcelRelation.scala:205)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:205)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:204)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:204)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:91)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:37)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
The new version of com.crealytics:spark-excel_2.11:0.12.5 library works with non string column/header names as well.
There is probably a more elegant answer for this I would post this as comment, but do not have the required reputation.
I always try to ensure that my column headers are strings.
Also as a rule I do not have numeric characters in the column headers and we have a simple script that replaces the numeric with alphabetical characters (i.e. 1 by one).

Can't read files via FTP using SparkContext.textFile(...) on Google Dataproc

I am running a Spark cluster on Google Dataproc and I'm experiencing some issues while trying to read GZipped file from FTP using sparkContext.textFile(...).
The code I am running is:
object SparkFtpTest extends App {
val file = "ftp://username:password#host:21/filename.txt.gz"
val lines = sc.textFile(file)
lines.saveAsTextFile("gs://my-bucket-storage/tmp123")
}
The error that I get is:
Exception in thread "main" org.apache.commons.net.ftp.FTPConnectionClosedException: Connection closed without indication.
I see some people have suggested that the credentials are wrong, so I've tried entering wrong credentials and the error was different, i.e. Invalid login credentials.
It also works if I copy the URL into the browser - the file is being downloaded properly.
It's also worth mentioning that I've tried using Apache commons-net library (the same version as the one in Spark - 2.2) and it worked - I was able to stream the data (from both Master and Worker nodes). I wasn't able to decompress it though (by using Java's GZipInputStream; I can't remember the failure but if you think it's important I can try and reproduce it). I think this suggests that it's not some firewall issue on the cluster, though I wasn't able to use curl to download the file.
I think I was running the same code a few months ago from my local machine and if I remember correctly it worked just fine.
Do you have any ideas what is causing this problem?
Could it be that it's some kind of dependency conflict problem and if so which one?
I have a couple of dependencies in the project such as google-sdk, solrj, ... However, I'd expect to see something like ClassNotFoundException or NoSuchMethodError if it was a dependency problem.
The whole stack trace looks like this:
16/12/05 23:53:46 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/
16/12/05 23:53:47 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/ - removing from cache
16/12/05 23:53:49 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/0/
16/12/05 23:53:50 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/0/ - removing from cache
16/12/05 23:53:50 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/
16/12/05 23:53:51 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/ - removing from cache
Exception in thread "main" org.apache.commons.net.ftp.FTPConnectionClosedException: Connection closed without indication.
at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:298)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:495)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:537)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:586)
at org.apache.commons.net.ftp.FTP.quit(FTP.java:794)
at org.apache.commons.net.ftp.FTPClient.logout(FTPClient.java:788)
at org.apache.hadoop.fs.ftp.FTPFileSystem.disconnect(FTPFileSystem.java:151)
at org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:395)
at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1701)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1647)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:222)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1906)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1219)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1161)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1064)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1030)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1030)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:956)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:956)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:956)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:955)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1459)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1438)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1438)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1438)
It looks like this may be a known unresolved issue in Spark/Hadoop: https://issues.apache.org/jira/browse/HADOOP-11886 and https://github.com/databricks/learning-spark/issues/21 both allude to a similar stack trace.
If you were able to manually use the Apache commons-net library, you could achieve the same effect as sc.textFile by obtaining a list of the files, parallelizing that list of files as an RDD, and using flatMap where each task takes a filename and reads the file line-by-line, generating the output collection of lines for each file.
Alternatively, if the amount of data you have in FTP is small (up to maybe 10 GB or so) then parallel reads won't be helping too much compared to a single thread copying from your FTP server onto HDFS or GCS in your Dataproc cluster before then processing using an HDFS or GCS path in your Spark job.

Resources