I'm working on some data (~200 GB) using Spark in Azure Databricks. I am able to read the dataset (from Blob Storage) and modify it in various ways. However, every time I try to store it, whether through .saveAsTable(), .csv(), etc., I get the error below. I am also getting it while using .select() (and various other functions), as seen in the stack trace.
Any suggestions as to what could be wrong here would be highly appreciated!
Note: I'm not a Spark, Azure, or Python expert, so the more explicit the better :)
EDIT
Example of code returning this error:
Example 1
df = spark.read.csv(f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/")
df.write.mode("overwrite").saveAsTable("test")
Example 2
df.write.csv("test")
Example 3
df.select().where(df['installationId'] == '').count()
All of these are giving the same error
Py4JJavaError Traceback (most recent call last)
<command-2838615377381489> in <cell line: 1>()
----> 1 numInstallationId = df_android.select().where((df_android['installationId'] == '') | (df_android['installationId'] == None)).count()
/databricks/spark/python/pyspark/sql/dataframe.py in count(self)
807 2
808 """
--> 809 return int(self._jdf.count())
810
811 def collect(self) -> List[Row]:
/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
1319
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1323
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
194 def deco(*a: Any, **kw: Any) -> Any:
195 try:
--> 196 return f(*a, **kw)
197 except Py4JJavaError as e:
198 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o1449.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 234 in stage 18.0 failed 4 times, most recent failure: Lost task 234.3 in stage 18.0 (TID 1145) (10.139.64.18 executor 15): com.databricks.sql.io.FileReadException: Error while reading file wasbs:REDACTED_LOCAL_PART#**REMOVED BY ME**.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:544)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:513)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:637)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.$anonfun$hasNext$1(FileScanRDD.scala:359)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:354)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:168)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:136)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:96)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:870)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1690)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:873)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:728)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - null
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Auto-closing enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Delimiters for detection=null
Empty value=
Escape unquoted values=false
Header extraction enabled=null
Headers=null
Ignore leading whitespaces=false
Ignore leading whitespaces in quotes=false
Ignore trailing whitespaces=false
Ignore trailing whitespaces in quotes=false
Input buffer size=1048576
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=1000
Line separator detection enabled=false
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Processor=none
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=field selection: [0]
Skip bits as whitespace=true
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
CsvFormat:
Comment character=#
Field delimiter=)
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Internal state when error was thrown: line=0, column=20481, record=0, charIndex=5890569
at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:707)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:288)
at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:446)
at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:92)
at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:454)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:478)
... 33 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3247)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3181)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3175)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3175)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1412)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1412)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1412)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3456)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3397)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3385)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)
Caused by: com.databricks.sql.io.FileReadException: Error while reading file wasbs:REDACTED_LOCAL_PART#**REMOVED BY ME**
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:544)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:513)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:637)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.$anonfun$hasNext$1(FileScanRDD.scala:359)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:354)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:168)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:136)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:96)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:870)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1690)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:873)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:728)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - null
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Auto-closing enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Delimiters for detection=null
Empty value=
Escape unquoted values=false
Header extraction enabled=null
Headers=null
Ignore leading whitespaces=false
Ignore leading whitespaces in quotes=false
Ignore trailing whitespaces=false
Ignore trailing whitespaces in quotes=false
Input buffer size=1048576
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=1000
Line separator detection enabled=false
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Processor=none
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=field selection: [0]
Skip bits as whitespace=true
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
CsvFormat:
Comment character=#
Field delimiter=)
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Internal state when error was thrown: line=0, column=20481, record=0, charIndex=5890569
at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:707)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:288)
at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:446)
at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:92)
at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:454)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:478)
... 33 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
I reproduced the same thing in my environment and got the same error. To overcome it, follow the syntax and reference below.
Create a mount point for your storage account in Databricks. A mount gives you a separate storage location on which you can perform both read and write operations:
dbutils.fs.mount(
    source = "wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/",
    mount_point = "/mnt/<Folder_name>",
    extra_configs = {"fs.azure.account.key.<storage_account_name>.blob.core.windows.net": "<access_key>"})
# Reading data
file_location = "/mnt/demo123"
df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").option("sep", ",").load(file_location)
display(df)
# Writing data
df.write.mode("overwrite").option("header", "true").csv("/mnt/demo123")
Alternatively, you can read and write without a mount; see the sketch below. For more information, refer to this link and the MS documentation.
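A minimal sketch of the unmounted route, assuming account-key authentication (all placeholders are illustrative):
# Hypothetical direct access without a mount: register the storage account key
# for this session, then address the container with a wasbs:// URI.
spark.conf.set(
    "fs.azure.account.key.<storage_account_name>.blob.core.windows.net",
    "<access_key>")

df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/<path>"))

df.write.mode("overwrite").option("header", "true") \
    .csv("wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/<output_path>")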
Ok, so it seems like the problem was related to reading the blob as CSV. The content of the blob storage was probably not in correct CSV format, so reading it as text solved the issue!
df = spark.read.option("header", "true").text(file_location)
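As an aside, the internal parser state in the trace (column=20481) is exactly one past the CSV reader's default column cap ("Maximum number of columns=20480" above), so if the file really had been valid CSV with very wide rows, raising that cap would have been another thing to try; an untested sketch:
# Untested alternative, assuming the data is genuine CSV with very wide rows:
df = (spark.read
      .option("header", "true")
      .option("maxColumns", 50000)  # Spark CSV option; the default is 20480
      .csv(file_location))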
Related
I'm trying to optimize a process for saving a dataframe to Cassandra. The code I wrote works just fine when I run it, but it takes a lot of time (4 seconds for each iteration, which is a lot given the amount of data).
I tried to optimize it using persist() and cache() at some points in the script, and now when I try to save to Cassandra with this code
input.write.format("org.apache.spark.sql.cassandra").mode('append').options(**{
    "table": "<table>",
    "keyspace": "<keyspace>",
    "spark.cassandra.connection.host": "<host>",
    "spark.cassandra.auth.username": "<username>",
    "spark.cassandra.auth.password": "<password>"
}).save()
I receive the following error:
Py4JJavaError: An error occurred while calling o2254.save.
: java.lang.IllegalStateException: Unexpected partitioning: DataSourcePartitioning
and the last error raised is this one:
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
I understand that this is something related to the memory used, but I cannot find any reference to it on the internet, given my limited experience with Databricks. The cluster is 1 driver with 14 GB memory and 4 cores (Standard_DS3_v2).
EDIT: Apache Spark 3.2.1 and Cassandra connector: com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0
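For reference, a hedged sketch of wiring those connector coordinates into a session you build yourself (illustrative only; on Databricks the connector is normally attached as a cluster library, and spark.jars.packages only takes effect before the session starts):
# Illustrative session setup using the connector coordinates from the EDIT above.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-write")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0")
         .config("spark.cassandra.connection.host", "<host>")
         .getOrCreate())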
Here is the full trace of the error:
Py4JJavaError: An error occurred while calling o2254.save.
: java.lang.IllegalStateException: Unexpected partitioning: DataSourcePartitioning
at org.apache.spark.sql.catalyst.plans.physical.Partitioning.createShuffleSpec(partitioning.scala:245)
at org.apache.spark.sql.catalyst.plans.physical.Partitioning.createShuffleSpec$(partitioning.scala:244)
at org.apache.spark.sql.execution.datasources.v2.DataSourcePartitioning.createShuffleSpec(DataSourcePartitioning.scala:27)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$ensureDistributionAndOrdering$5(EnsureRequirements.scala:129)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$ensureDistributionAndOrdering$5$adapted(EnsureRequirements.scala:124)
at scala.collection.immutable.List.map(List.scala:293)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:124)
at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:471)
at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:441)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:629)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:167)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:629)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:626)
at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1226)
at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1225)
at org.apache.spark.sql.execution.ProjectExec.mapChildren(basicPhysicalOperators.scala:45)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:626)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:626)
at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1226)
at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1225)
at org.apache.spark.sql.execution.ProjectExec.mapChildren(basicPhysicalOperators.scala:45)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:626)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:626)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:689)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:626)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:602)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:441)
at com.databricks.sql.optimizer.EnsureRequirementsDP.apply(EnsureRequirementsDP.scala:718)
at com.databricks.sql.optimizer.EnsureRequirementsDP.apply(EnsureRequirementsDP.scala:550)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$3(AdaptiveSparkPlanExec.scala:1089)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:1089)
at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.applyPhysicalRules(AdaptiveSparkPlanExec.scala:1088)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$initialPlan$1(AdaptiveSparkPlanExec.scala:293)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.<init>(AdaptiveSparkPlanExec.scala:292)
at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.applyInternal(InsertAdaptiveSparkPlan.scala:83)
at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.apply(InsertAdaptiveSparkPlan.scala:48)
at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.$anonfun$applyInternal$1(InsertAdaptiveSparkPlan.scala:64)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.applyInternal(InsertAdaptiveSparkPlan.scala:64)
at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.apply(InsertAdaptiveSparkPlan.scala:48)
at org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan.apply(InsertAdaptiveSparkPlan.scala:42)
at org.apache.spark.sql.execution.QueryExecution$.$anonfun$prepareForExecution$2(QueryExecution.scala:596)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.sql.execution.QueryExecution$.$anonfun$prepareForExecution$1(QueryExecution.scala:596)
at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
at org.apache.spark.sql.execution.QueryExecution$.prepareForExecution(QueryExecution.scala:595)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$2(QueryExecution.scala:232)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:268)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:265)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:265)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:228)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:222)
at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:298)
at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:361)
at org.apache.spark.sql.execution.QueryExecution.explainStringLocal(QueryExecution.scala:325)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:202)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:386)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:186)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:141)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:336)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:160)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:156)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:575)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:167)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:575)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:268)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:264)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:551)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$eagerlyExecuteCommands$1(QueryExecution.scala:156)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:324)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:156)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:141)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:132)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:186)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:959)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:346)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:258)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
So, for a personal project I tried using this PySpark SMOTE implementation to handle imbalanced flight data that looks like this:
CRS_DEP_TIME | DEP_TIME | CRS_ARR_TIME | DISTANCE | YEAR_IDX | DAY_OF_WEEK_IDX | DEST_IDX | ORIGIN_IDX | MONTH_OF_YEAR_IDX | OP_CARRIER_IDX | y
715 | 709 | 1014 | 630 | 2 | 1 | 15 | 30 | 0 | 1 | 0
2141 | 2153 | 2234 | 630 | 2 | 1 | 30 | 15 | 0 | 1 | 1
I got the data from this Kaggle dataset; after the split, the training data is around 2.1 GB and 34M rows, with roughly a 1:7 imbalance. I know it is not that big of an imbalance, but I just wanted to try the implementation.
The main goal is to build a classification model that can predict whether a flight will arrive late or not, with 1 as late and 0 as not late in the label column y.
I've tried running this on Google Colab. Note that both the smote and pre_smote_df_process functions are the ones from the PySpark SMOTE implementation.
class SmoteConfig:
    def __init__(self, seed, bucketLength, k, multiplier):
        self.seed = seed                  # random seed
        self.bucketLength = bucketLength  # bucket length for the LSH step (per the linked implementation)
        self.k = k                        # number of nearest neighbours to draw from
        self.multiplier = multiplier      # synthetic rows generated per minority row
config = SmoteConfig(76, 200, 3, 2)
train_smoted = smote(pre_smote_df_process(train,['CRS_DEP_TIME','DEP_TIME','CRS_ARR_TIME','DISTANCE'],['YEAR_IDX','DAY_OF_WEEK_IDX','DEST_IDX','ORIGIN_IDX','MONTH_OF_YEAR_IDX','OP_CARRIER_IDX'],'y',index_suffix="_index"), config)
train_smoted.show()
When I tried using a fraction of the data (up to 1%) it works within a reasonable time. However, when I ran it on 10% of the data it took more than 14 hours to show the result, and on 100% of the data the following error came up.
Py4JJavaError Traceback (most recent call last)
<ipython-input-14-6919ba1953f2> in <module>()
----> 1 disassembled.coalesce(1).write.format('com.databricks.spark.csv').save("/content/drive/MyDrive/Tugas/Tugas Akhir/Eksperimen/lsh_smoted_18_05_21.csv",header = 'true')
3 frames
/usr/local/lib/python3.7/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o12989.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 28 in stage 745.0 failed 1 times, most recent failure: Lost task 28.0 in stage 745.0 (TID 34904) (55fc2b589d8c executor driver): java.io.IOException: No space left on device
at java.base/java.io.FileOutputStream.writeBytes(Native Method)
at java.base/java.io.FileOutputStream.write(FileOutputStream.java:354)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
at net.jpountz.lz4.LZ4BlockOutputStream.finish(LZ4BlockOutputStream.java:263)
at net.jpountz.lz4.LZ4BlockOutputStream.close(LZ4BlockOutputStream.java:193)
at java.base/java.io.FilterOutputStream.close(FilterOutputStream.java:188)
at java.base/java.io.FilterOutputStream.close(FilterOutputStream.java:188)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$1.close(UnsafeRowSerializer.scala:96)
at org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:175)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:163)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Suppressed: java.io.IOException: No space left on device
at java.base/java.io.FileOutputStream.writeBytes(Native Method)
at java.base/java.io.FileOutputStream.write(FileOutputStream.java:354)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
at net.jpountz.lz4.LZ4BlockOutputStream.flush(LZ4BlockOutputStream.java:243)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:143)
at java.base/java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$1.flush(UnsafeRowSerializer.scala:91)
at org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:173)
at org.apache.spark.storage.DiskBlockObjectWriter.$anonfun$close$1(DiskBlockObjectWriter.scala:156)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.storage.DiskBlockObjectWriter.close(DiskBlockObjectWriter.scala:158)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:164)
... 10 more
Suppressed: java.io.IOException: No space left on device
at java.base/java.io.FileOutputStream.writeBytes(Native Method)
at java.base/java.io.FileOutputStream.write(FileOutputStream.java:354)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
at java.base/java.io.FilterOutputStream.close(FilterOutputStream.java:182)
at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.org$apache$spark$storage$DiskBlockObjectWriter$ManualCloseOutputStream$$super$close(DiskBlockObjectWriter.scala:108)
at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseOutputStream.manualClose(DiskBlockObjectWriter.scala:65)
at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseOutputStream.manualClose$(DiskBlockObjectWriter.scala:64)
at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.manualClose(DiskBlockObjectWriter.scala:108)
at org.apache.spark.storage.DiskBlockObjectWriter.$anonfun$closeResources$1(DiskBlockObjectWriter.scala:135)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.storage.DiskBlockObjectWriter.closeResources(DiskBlockObjectWriter.scala:136)
at org.apache.spark.storage.DiskBlockObjectWriter.$anonfun$close$2(DiskBlockObjectWriter.scala:158)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1448)
... 12 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContex
On the resource monitor, disk usage constantly creeps up until the disk is full, just before the error above occurs.
Running it locally also shows the same error. Is there anything I'm doing wrong? Is the code unoptimized? Or might there be something wrong with the data? Any insight will be much appreciated. Thanks.
Since you say it works fine when you use a fraction of the data, and considering the error I see here, it seems like a memory problem. I didn't use this code; I used SMOTE() from the imblearn library, and that works fine with my PySpark project (maybe consider using that too). But for this kind of error, increasing memory worked for me.
You can increase memory this way:
I used Google Colab like you, and when I wanted to create a SparkSession instance I changed it to:
spark = SparkSession.builder.master("local")\
    .appName("your_project_Name")\
    .config("spark.ui.port", '4050')\
    .config("spark.driver.memory", "32G")\
    .config("spark.executor.memory", "32G")\
    .getOrCreate()
It worked for me; hope it helps you and others too.
But generally, if the data is still very big, and even if you don't want to use plain undersampling, you can use a combination of undersampling with other methods (basic oversampling or others); that way you can reduce some of the memory problems. A sketch of that combination follows.
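A hedged sketch of such a combination using imblearn (this assumes the sampled data fits in driver memory; the ratios are illustrative, and only the column name y comes from the question):
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Collect a (sampled) Spark dataframe to the driver -- only viable when it fits in memory.
pdf = train.toPandas()
X, y = pdf.drop(columns=["y"]), pdf["y"]

# Oversample the minority class first, then undersample the majority class.
resampler = Pipeline([
    ("over", SMOTE(sampling_strategy=0.5, random_state=76)),
    ("under", RandomUnderSampler(sampling_strategy=0.8, random_state=76)),
])
X_res, y_res = resampler.fit_resample(X, y)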
I am trying to load a Spark dataframe, which has two attributes with collection datatypes, into a Cassandra table.
In the incoming feed file these attributes are text/String. I used the code below to convert the String type to List and Map types respectively:
// parse() is assumed here to come from json4s (e.g. org.json4s.jackson.JsonMethods.parse)
spark.udf.register("getLst", (input: String) => input.split(",").toList)
spark.udf.register("getMap", (input: String) => parse(input).values.asInstanceOf[Map[String, String]])
val ofr_data_final=spark.sql("""select
...
getLst(acct_nb_ls) as acct_nb_ls,
getMap(brw_eci_and_sts_mp) as brw_eci_and_sts_mp,
.....""")
The printSchema output of the Spark dataframe shows those two attributes as below:
|-- acct_nb_ls: array (nullable = true)
| |-- element: string (containsNull = true)
|-- brw_eci_and_sts_mp: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
In Cassandra, those two attributes are defined as shown below:
acct_nb_ls FROZEN<LIST<text>>,
brw_eci_and_sts_mp FROZEN<MAP<text, text>>,
Here is my load statement:
ofr_data_final.rdd.saveToCassandra(Config.keySpace,offerTable, writeConf = WriteConf(ttl = TTLOption.perRow("ttl")))
However the load fails with the below error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 140 in stage 24.0 failed 4 times, most recent failure: Lost task 140.3 in stage 24.0 (TID 1741, bdtcstr70n12.svr.us.jpmchase.net, executor 9): java.io.IOException: Failed to write statements to mars_offerdetails.offer_detail_2.
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:167)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:135)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:140)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:110)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:135)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:37)
at com.jpmc.mars.LoadOfferData$.delayedEndpoint$com$jpmc$mars$LoadOfferData$1(LoadOfferData.scala:246)
at com.jpmc.mars.LoadOfferData$delayedInit$body.apply(LoadOfferData.scala:22)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.jpmc.mars.LoadOfferData$.main(LoadOfferData.scala:22)
at com.jpmc.mars.LoadOfferData.main(LoadOfferData.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Failed to write statements to mars_offerdetails.offer_detail_2.
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:167)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:135)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:140)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:110)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:135)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I suspect the issue might be because the attribute acct_nb_ls is inferred as 'array' and not as 'list', but I am not sure how to make Spark infer it as 'list' instead of 'array'. In my UDF I had defined
input.split(",").toList
but it still gets inferred as array.
Loading collection data types using the spark-cassandra-connector in batch mode worked as expected with the record-level TTL option via rdd.saveToCassandra. The issue was with the data: it was old and had past expiration dates, which generated negative TTL values, and hence the load failed.
The Spark error message should be enhanced to indicate that.
I cannot properly connect to Kudu from Spark; the error says "Kudu master has no leader".
CDH 5.14
Kudu 1.6
Spark 1.6.0 standalone and 2.2.0
When I use Impala in Hue to create and query Kudu tables, it works flawlessly.
However, connecting from Spark throws some errors I cannot decipher.
I have tried using both pyspark and spark-shell. With spark-shell I had to use Spark 1.6 instead of 2.2 because of some Maven dependency problems, which I have localized but not been able to fix. More info here.
Case 1: using pyspark2 (Spark 2.2.0)
$ pyspark2 --master yarn --jars /opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/kudu/kudu-spark2_2.11.jar
> df = sqlContext.read.format('org.apache.kudu.spark.kudu').options(**{"kudu.master":"172.17.0.43:7077", "kudu.table":"impala::default.test"}).load()
18/03/02 10:23:27 WARN client.ConnectToCluster: Error receiving response from 172.17.0.43:7077
org.apache.kudu.client.RecoverableException: [peer master-172.17.0.43:7077] encountered a read timeout; closing the channel
at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:412)
at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
at org.apache.kudu.client.Connection.handleUpstream(Connection.java:239)
at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.exceptionCaught(SimpleChannelUpstreamHandler.java:153)
at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:536)
at org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.readTimedOut(ReadTimeoutHandler.java:236)
at org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler$ReadTimeoutTask$1.run(ReadTimeoutHandler.java:276)
at org.apache.kudu.shaded.org.jboss.netty.channel.socket.ChannelRunnableWrapper.run(ChannelRunnableWrapper.java:40)
at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.apache.kudu.shaded.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.apache.kudu.shaded.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutException
at org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.<clinit>(ReadTimeoutHandler.java:84)
at org.apache.kudu.client.Connection$ConnectionPipeline.init(Connection.java:782)
at org.apache.kudu.client.Connection.<init>(Connection.java:199)
at org.apache.kudu.client.ConnectionCache.getConnection(ConnectionCache.java:133)
at org.apache.kudu.client.AsyncKuduClient.newRpcProxy(AsyncKuduClient.java:248)
at org.apache.kudu.client.AsyncKuduClient.newMasterRpcProxy(AsyncKuduClient.java:272)
at org.apache.kudu.client.ConnectToCluster.run(ConnectToCluster.java:157)
at org.apache.kudu.client.AsyncKuduClient.getMasterTableLocationsPB(AsyncKuduClient.java:1350)
at org.apache.kudu.client.AsyncKuduClient.exportAuthenticationCredentials(AsyncKuduClient.java:651)
at org.apache.kudu.client.KuduClient.exportAuthenticationCredentials(KuduClient.java:293)
at org.apache.kudu.spark.kudu.KuduContext$$anon$1.run(KuduContext.scala:97)
at org.apache.kudu.spark.kudu.KuduContext$$anon$1.run(KuduContext.scala:96)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.kudu.spark.kudu.KuduContext.<init>(KuduContext.scala:96)
at org.apache.kudu.spark.kudu.KuduRelation.<init>(DefaultSource.scala:162)
at org.apache.kudu.spark.kudu.DefaultSource.createRelation(DefaultSource.scala:75)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
... 1 more
18/03/02 10:23:27 WARN client.ConnectToCluster: Unable to find the leader master 172.17.0.43:7077; will retry
Py4JJavaError Traceback (most recent call last)
<ipython-input-1-e1dfaec7a544> in <module>()
----> 1 df = sqlContext.read.format('org.apache.kudu.spark.kudu').options(**{"kudu.master":"172.17.0.43:7077", "kudu.table":"impala::default.logika_dataset_kudu"}).load()
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
163 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
164 else:
--> 165 return self._df(self._jreader.load())
166
167 #since(1.4)
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o59.load.
: java.security.PrivilegedActionException: org.apache.kudu.client.NoLeaderFoundException: Master config (172.17.0.43:7077) has no leader. Exceptions received: org.apache.kudu.client.RecoverableException: [peer master-172.17.0.43:7077] encountered a read timeout; closing the channel
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.kudu.spark.kudu.KuduContext.<init>(KuduContext.scala:96)
at org.apache.kudu.spark.kudu.KuduRelation.<init>(DefaultSource.scala:162)
at org.apache.kudu.spark.kudu.DefaultSource.createRelation(DefaultSource.scala:75)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kudu.client.NoLeaderFoundException: Master config (172.17.0.43:7077) has no leader. Exceptions received: org.apache.kudu.client.RecoverableException: [peer master-172.17.0.43:7077] encountered a read timeout; closing the channel
at org.apache.kudu.client.ConnectToCluster.incrementCountAndCheckExhausted(ConnectToCluster.java:272)
at org.apache.kudu.client.ConnectToCluster.access$100(ConnectToCluster.java:49)
at org.apache.kudu.client.ConnectToCluster$ConnectToMasterErrCB.call(ConnectToCluster.java:349)
at org.apache.kudu.client.ConnectToCluster$ConnectToMasterErrCB.call(ConnectToCluster.java:338)
at com.stumbleupon.async.Deferred.doCall(Deferred.java:1280)
at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1259)
at com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1315)
at com.stumbleupon.async.Deferred.doCall(Deferred.java:1286)
at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1259)
at com.stumbleupon.async.Deferred.callback(Deferred.java:1002)
at org.apache.kudu.client.KuduRpc.handleCallback(KuduRpc.java:238)
at org.apache.kudu.client.KuduRpc.errback(KuduRpc.java:292)
at org.apache.kudu.client.RpcProxy.failOrRetryRpc(RpcProxy.java:388)
at org.apache.kudu.client.RpcProxy.responseReceived(RpcProxy.java:217)
at org.apache.kudu.client.RpcProxy.access$000(RpcProxy.java:60)
at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:132)
at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:128)
at org.apache.kudu.client.Connection.cleanup(Connection.java:694)
at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:439)
at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
at org.apache.kudu.client.Connection.handleUpstream(Connection.java:239)
at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.exceptionCaught(SimpleChannelUpstreamHandler.java:153)
at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:536)
at org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.readTimedOut(ReadTimeoutHandler.java:236)
at org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler$ReadTimeoutTask$1.run(ReadTimeoutHandler.java:276)
at org.apache.kudu.shaded.org.jboss.netty.channel.socket.ChannelRunnableWrapper.run(ChannelRunnableWrapper.java:40)
at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.apache.kudu.shaded.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.apache.kudu.shaded.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: org.apache.kudu.client.RecoverableException: [peer master-172.17.0.43:7077] encountered a read timeout; closing the channel
at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:412)
... 21 more
Caused by: org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutException
at org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.<clinit>(ReadTimeoutHandler.java:84)
at org.apache.kudu.client.Connection$ConnectionPipeline.init(Connection.java:782)
at org.apache.kudu.client.Connection.<init>(Connection.java:199)
at org.apache.kudu.client.ConnectionCache.getConnection(ConnectionCache.java:133)
at org.apache.kudu.client.AsyncKuduClient.newRpcProxy(AsyncKuduClient.java:248)
at org.apache.kudu.client.AsyncKuduClient.newMasterRpcProxy(AsyncKuduClient.java:272)
at org.apache.kudu.client.ConnectToCluster.run(ConnectToCluster.java:157)
at org.apache.kudu.client.AsyncKuduClient.getMasterTableLocationsPB(AsyncKuduClient.java:1350)
at org.apache.kudu.client.AsyncKuduClient.exportAuthenticationCredentials(AsyncKuduClient.java:651)
at org.apache.kudu.client.KuduClient.exportAuthenticationCredentials(KuduClient.java:293)
at org.apache.kudu.spark.kudu.KuduContext$$anon$1.run(KuduContext.scala:97)
at org.apache.kudu.spark.kudu.KuduContext$$anon$1.run(KuduContext.scala:96)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.kudu.spark.kudu.KuduContext.<init>(KuduContext.scala:96)
at org.apache.kudu.spark.kudu.KuduRelation.<init>(DefaultSource.scala:162)
at org.apache.kudu.spark.kudu.DefaultSource.createRelation(DefaultSource.scala:75)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
... 1 more
Case 2: using spark-shell (Spark 1.6.0 standalone):
$ spark-shell --master spark://localhost:7077 --packages org.apache.kudu:kudu-spark_2.10:1.1.0
> import org.apache.kudu.spark.kudu._
> import org.apache.kudu.client._
> import collection.JavaConverters._
> val df = sqlContext.read.options(Map("kudu.master" -> "localhost:7051","kudu.table" -> "impala::default.test")).kudu
df: org.apache.spark.sql.DataFrame = [dataset: string, id: string, itemnumber: string, srcid: string, timestamp: string, year: string, month: string, day: string, week: string, quarter: string, season: string, city: string, region1: string, region2: string, region3: string, region4: string, locality: string, itemname: string, itembqu: string, product_category: string, amount: string, mapped_zipcode: string, latitude: string, longitude: string, depositor_code: string, depositor_name: string, customer_code: string, is_island: string]
It seems to be connecting, since it is able to show the column names. But if I then register a temporary table and use SQL:
df.registerTempTable("test")
val filteredDF = sqlContext.sql("select count(*) from test").show()
bang!
[Stage 0:> (0 + 6) / 6]
Lost task 1.0 in stage 0.0 (TID 1, tt-slave-2.novalocal, executor 1): org.apache.kudu.client.NonRecoverableException: RPC can not complete before timeout: KuduRpc(method=GetTableSchema, tablet=null, attempt=30, DeadlineTracker(timeout=30000, elapsed=27307), Traces:
[0ms] querying master,
[48ms] Sub rpc: GetMasterRegistration sending RPC to server Kudu Master - localhost:7051,
[71ms] Sub rpc: GetMasterRegistration received from server Kudu Master - localhost:7051 response
Network error:
[Peer Kudu Master - localhost:7051] Connection reset,
[75ms] delaying RPC due to Service unavailable: Master config (localhost:7051) has no leader.
Exceptions received: org.apache.kudu.client.RecoverableException:
[Peer Kudu Master - localhost:7051] Connection reset,
...
(SAME MESSAGE REPEATS 25 TIMES)
...
[24262ms] querying master,
[24262ms] Sub rpc: GetMasterRegistration sending RPC to server Kudu Master - localhost:7051,
[24263ms] Sub rpc: GetMasterRegistration received from server Kudu Master - localhost:7051 response
Network error:
[Peer Kudu Master - localhost:7051] Connection reset,
[24263ms] delaying RPC due to Service unavailable: Master config (localhost:7051) has no leader.
Exceptions received: org.apache.kudu.client.RecoverableException:
[Peer Kudu Master - localhost:7051] Connection reset,
[24661ms] trace too long, truncated)
at org.apache.kudu.client.AsyncKuduClient.tooManyAttemptsOrTimeout(AsyncKuduClient.java:961)
at org.apache.kudu.client.AsyncKuduClient.delayedSendRpcToTablet(AsyncKuduClient.java:1203)
at org.apache.kudu.client.AsyncKuduClient.access$800(AsyncKuduClient.java:110)
at org.apache.kudu.client.AsyncKuduClient$RetryRpcErrback.call(AsyncKuduClient.java:764)
at org.apache.kudu.client.AsyncKuduClient$RetryRpcErrback.call(AsyncKuduClient.java:754)
at com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)
at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)
at com.stumbleupon.async.Deferred.callback(Deferred.java:1005)
at org.apache.kudu.client.GetMasterRegistrationReceived.incrementCountAndCheckExhausted(GetMasterRegistrationReceived.java:156)
at org.apache.kudu.client.GetMasterRegistrationReceived.access$300(GetMasterRegistrationReceived.java:45)
at org.apache.kudu.client.GetMasterRegistrationReceived$GetMasterRegistrationErrCB.call(GetMasterRegistrationReceived.java:236)
at org.apache.kudu.client.GetMasterRegistrationReceived$GetMasterRegistrationErrCB.call(GetMasterRegistrationReceived.java:225)
at com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)
at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)
at com.stumbleupon.async.Deferred.callback(Deferred.java:1005)
at org.apache.kudu.client.KuduRpc.handleCallback(KuduRpc.java:220)
at org.apache.kudu.client.KuduRpc.errback(KuduRpc.java:274)
at org.apache.kudu.client.TabletClient.failOrRetryRpc(TabletClient.java:770)
at org.apache.kudu.client.TabletClient.failOrRetryRpcs(TabletClient.java:747)
at org.apache.kudu.client.TabletClient.cleanup(TabletClient.java:736)
at org.apache.kudu.client.TabletClient.channelClosed(TabletClient.java:698)
at org.apache.kudu.client.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
at org.apache.kudu.client.TabletClient.handleUpstream(TabletClient.java:679)
at org.apache.kudu.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.apache.kudu.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.apache.kudu.client.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.channelClosed(ReadTimeoutHandler.java:176)
at org.apache.kudu.client.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
at org.apache.kudu.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.apache.kudu.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.apache.kudu.client.shaded.org.jboss.netty.channel.Channels.fireChannelClosed(Channels.java:468)
at org.apache.kudu.client.shaded.org.jboss.netty.channel.Channels$6.run(Channels.java:457)
at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.ChannelRunnableWrapper.run(ChannelRunnableWrapper.java:40)
at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.apache.kudu.client.shaded.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.apache.kudu.client.shaded.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kudu.client.NoLeaderFoundException: Master config (localhost:7051) has no leader. Exceptions received: org.apache.kudu.client.RecoverableException: [Peer Kudu Master - localhost:7051] Connection reset
at org.apache.kudu.client.GetMasterRegistrationReceived.incrementCountAndCheckExhausted(GetMasterRegistrationReceived.java:154)
... 32 more
Caused by: org.apache.kudu.client.RecoverableException: [Peer Kudu Master - localhost:7051] Connection reset
at org.apache.kudu.client.TabletClient.cleanup(TabletClient.java:734)
... 21 more
As I said, the Kudu service is up and running, and I am able to query Kudu tables from Hue using Impala.
What am I missing here? Is this the right approach to interfacing Spark with Kudu?
Thanks
We're evaluating AWS Glue for a big data project, with some ETL. We added a crawler, which is correctly picking up a CSV file from S3. Initially, we simply want to transform that CSV to JSON, and drop the file in another S3 location (same bucket, different path).
We used the script as provided by AWS (no custom script here) and just mapped all the columns.
The target folder is empty (the job has just been created), but the job fails with "File already exists":
(snapshot omitted)
The S3 location where we intend to drop the output was empty before starting the job. However, after the error we do see two files, but those seem to be partial:
(snapshot omitted)
Any ideas on what might be going on?
Here's the full stack trace:
Container: container_1513099821372_0007_01_000001 on ip-172-31-49-38.ec2.internal_8041
LogType:stdout
Log Upload Time:Tue Dec 12 19:12:04 +0000 2017
LogLength:8462
Log Contents:
Traceback (most recent call last):
File "script_2017-12-12-19-11-08.py", line 30, in
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://primero-viz/output/tcw_entries"}, format = "json", transformation_ctx = "datasink2")
File "/mnt/yarn/usercache/root/appcache/application_1513099821372_0007/container_1513099821372_0007_01_000001/PyGlue.zip/awsglue/dynamicframe.py", line 523, in from_options
File "/mnt/yarn/usercache/root/appcache/application_1513099821372_0007/container_1513099821372_0007_01_000001/PyGlue.zip/awsglue/context.py", line 175, in write_dynamic_frame_from_options
File "/mnt/yarn/usercache/root/appcache/application_1513099821372_0007/container_1513099821372_0007_01_000001/PyGlue.zip/awsglue/context.py", line 198, in write_from_options
File "/mnt/yarn/usercache/root/appcache/application_1513099821372_0007/container_1513099821372_0007_01_000001/PyGlue.zip/awsglue/data_sink.py", line 32, in write
File "/mnt/yarn/usercache/root/appcache/application_1513099821372_0007/container_1513099821372_0007_01_000001/PyGlue.zip/awsglue/data_sink.py", line 28, in writeFrame
File "/mnt/yarn/usercache/root/appcache/application_1513099821372_0007/container_1513099821372_0007_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1513099821372_0007/container_1513099821372_0007_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1513099821372_0007/container_1513099821372_0007_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o86.pyWriteDynamicFrame.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-63-141.ec2.internal, executor 1): java.io.IOException: File already exists:s3://primero-viz/output/tcw_entries/run-1513105898742-part-r-00000
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:604)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:915)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:896)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:793)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:176)
at com.amazonaws.services.glue.hadoop.TapeOutputFormat.getRecordWriter(TapeOutputFormat.scala:65)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1158)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:1005)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:996)
at com.amazonaws.services.glue.HadoopDataSink$$anonfun$2.apply$mcV$sp(DataSink.scala:192)
at com.amazonaws.services.glue.HadoopDataSink.writeDynamicFrame(DataSink.scala:202)
at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:48)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: File already exists:s3://primero-viz/output/tcw_entries/run-1513105898742-part-r-00000
For me, a similar error message turned out to be unrelated to the file already existing; the message can be misleading. The real problem was in an earlier stage: I was reading data from a MySQL database, and the source DB contained an invalid date, so the task crashed after partial output had already been written.
I would suggest checking the other stages leading up to this write.
See also this other StackOverflow answer.
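One cheap way to confirm that is to force the upstream frame to be fully evaluated before the write, so that a bad source row fails the job at the read rather than mid-write. A minimal sketch, reusing the applymapping1 frame and the sink call from the generated script in the question (the count exists only to trigger a full scan, and it does add an extra pass over the data):
# Force a complete pass over the mapped data. Any unreadable source value
# (e.g. an invalid date) should make this count fail before anything is
# written to the sink.
row_count = applymapping1.toDF().count()
print("rows after mapping:", row_count)
# Only reach the sink once the scan above has succeeded.
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = applymapping1,
    connection_type = "s3",
    connection_options = {"path": "s3://primero-viz/output/tcw_entries"},
    format = "json",
    transformation_ctx = "datasink2")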
Set the write mode to "append" if your load is incremental, or to "overwrite" if it's a full load.
One example could be:
events.toDF().write.mode("append").partitionBy("partition_0", "partition_1").json(events_dir)
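Note that this works by converting the DynamicFrame to a plain DataFrame first: mode is a Spark DataFrameWriter option, and the Glue DynamicFrame sink used in the generated script does not appear to expose an equivalent.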
The target folder is empty
Empty is not the same as nonexistent. It doesn't look like write_dynamic_frame supports write modes, so you might have to drop the directory first, e.g. as sketched below.
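A minimal sketch of clearing the output prefix from inside the job, assuming boto3 is available (it ships with Glue's Python environment) and reusing the bucket and prefix from the question; the s3:DeleteObject permission on the job role is also an assumption:
import boto3

# Delete everything under the output prefix so the Glue sink starts from an
# empty "directory" on S3.
s3 = boto3.resource("s3")
bucket = s3.Bucket("primero-viz")  # bucket taken from the question's output path
bucket.objects.filter(Prefix="output/tcw_entries/").delete()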