how to monitor multi directories in spark streaming task - apache-spark

I want to use use fileStream in spark streaming to monitor multi hdfs directories, such as:
val list_join_action_stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("/user/root/*/*", check_valid_file(_), false).map(_._2.toString).print
Buy the way, i could not under the meaning of the three class : LongWritable, Text, TextInputFormat
but it doesn't work...
java.io.FileNotFoundException: File /user/root/*/*
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:697)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1485)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1525)
at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:176)
at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:134)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:243)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:241)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:241)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:177)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:86)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1.aroundReceive(JobGenerator.scala:84)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

From the Spark documentation:
def fileStream[K, V, F <: InputFormat[K, V]](directory: String, filter: (Path) ⇒ Boolean, newFilesOnly: Boolean, conf: Configuration)(implicit arg0: ClassTag[K], arg1: ClassTag[V], arg2: ClassTag[F]): InputDStream[(K, V)]
Create a input stream that monitors a Hadoop-compatible filesystem for new files and reads them using the given key-value types and input format. Files must be written to the monitored directory by "moving" them from another location within the same file system. File names starting with . are ignored.
K Key type for reading HDFS file
V Value type for reading HDFS file
F Input format for reading HDFS file
directory HDFS directory to monitor for new file
filter Function to filter paths to process
newFilesOnly Should process only new files and ignore existing files in the directory
conf Hadoop configuration
filter parameter takes a method which identifies the input files for processing.
new Function<Path, Boolean>() {
#Override
public Boolean call(Path v1) throws Exception {
return Boolean.TRUE;
}
}
TextInputFormat is the default input format and used for plain or compressed text files.
Files are broken into lines.
Keys are the position (offset) in the file, and values are the line of text.
ssc.fileStream[LongWritable, Text, TextInputFormat] will work exactly as ssc.textFileStream(directory)
If you want to customize your file reading process then you need to define a custom input format and specify what are the key-value pair returned from
To implement custom input format you can refer defining Hadoop mapreduce custom input format.
Spark API Reference
Source code

Related

Spark 3.1.1 | Unable to run job : IDENTIFIER expected instead of '['

I'm trying to run a code that runs in spark 2.4.4 and I'm having the following error:
21/08/20 08:35:05 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 75, Column 32: IDENTIFIER expected instead of '['
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 75, Column 32: IDENTIFIER expected instead of '['
at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:196)
at org.codehaus.janino.Parser.read(Parser.java:3705)
at org.codehaus.janino.Parser.parseQualifiedIdentifier(Parser.java:446)
at org.codehaus.janino.Parser.parseReferenceType(Parser.java:2569)
at org.codehaus.janino.Parser.parseType(Parser.java:2549)
at org.codehaus.janino.Parser.parseFormalParameter(Parser.java:1688)
at org.codehaus.janino.Parser.parseFormalParameterList(Parser.java:1639)
at org.codehaus.janino.Parser.parseFormalParameters(Parser.java:1620)
at org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:1518)
at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:1028)
at org.codehaus.janino.Parser.parseClassBody(Parser.java:841)
at org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:736)
at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:941)
at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:234)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:205)
at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1427)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1524)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1521)
at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1375)
at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:721)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:33)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:586)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$15(MicroBatchExecution.scala:584)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:584)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:226)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194)
at org.apache.spark.sql.execution.streaming.OneTimeExecutor.execute(TriggerExecutor.scala:39)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:333)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244)
I read something in the official documentation regarding the way that spark 3.1.1 handles the maps, arrays, etc:
In Spark 3.1, structs and maps are wrapped by the {} brackets in
casting them to strings. For instance, the show() action and the CAST
expression use such brackets. In Spark 3.0 and earlier, the []
brackets are used for the same purpose. To restore the behavior before
Spark 3.1, you can set
spark.sql.legacy.castComplexTypesToString.enabled to true.
In Spark 3.1, NULL elements of structures, arrays and maps are
converted to “null” in casting them to strings. In Spark 3.0 or
earlier, NULL elements are converted to empty strings. To restore the
behavior before Spark 3.1, you can set
spark.sql.legacy.castComplexTypesToString.enabled to true.
I have enabled this, but still no luck. I'm removing all the maps and arrays from my dataframe meanwhile.
Any idea?

How to pass "WriteThroughputBudget" config in Spark cosmosdb connector

I am using the spark cosmosdb connector to write data in bulk to cosmosdb container. Since this is a bulk upload/write, and there are read operations happening at the same time. I want to restrict the RUs used by the write operation by spark connector. As per the wiki of the connector i found the config WriteThroughputBudget can be used to limit the write RUs consumption.
https://github.com/Azure/azure-cosmosdb-spark/wiki/Configuration-references
As per the wiki, WriteThroughputBudget is an integer value defining the RU budget the ingestion operations in a certain Spark job should not exceed.
I tried setting this config using the option in write dataframe as below
inputDataset.write.mode(SaveMode.Overwrite).format("com.microsoft.azure.cosmosdb.spark").option("WriteThroughputBudget", 400).options(writeConfig).save()
Also, tried using the options in write dataframe
val writeConfig: Map[String, String] = Map(
"Endpoint" -> accountEndpoint,
"Masterkey" -> accountKey,
"Database" -> database,
"Collection" -> collection,
"Upsert" -> "true",
"ConnectionMode" -> "Gateway",
"WritingBatchSize" -> "1000",
"WriteThroughputBudget" -> "1000")
In both the cases write operation fails with excpetion
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at com.microsoft.azure.cosmosdb.spark.CosmosDBConnectionCache$.getOrCreateBulkExecutor(CosmosDBConnectionCache.scala:104)
at com.microsoft.azure.cosmosdb.spark.CosmosDBConnection.getDocumentBulkImporter(CosmosDBConnection.scala:57)
at com.microsoft.azure.cosmosdb.spark.CosmosDBSpark$.bulkImport(CosmosDBSpark.scala:264)
at com.microsoft.azure.cosmosdb.spark.CosmosDBSpark$.com$microsoft$azure$cosmosdb$spark$CosmosDBSpark$$savePartition(CosmosDBSpark.scala:389)
at com.microsoft.azure.cosmosdb.spark.CosmosDBSpark$$anonfun$1.apply(CosmosDBSpark.scala:152)
at com.microsoft.azure.cosmosdb.spark.CosmosDBSpark$$anonfun$1.apply(CosmosDBSpark.scala:152)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I understand the value of the config WriteThroughputBudget, is Integer value. Even though I am passing the Integer as String, it should have implicitly casted it but if fails.
Is there any other way I can specify WriteThroughputBudget option.
There is something wrong with the specific release version 3.0.5. azure-cosmosdb-spark_2.4.0_2.11-3.0.5-uber.jar.
WriteThroughputBudget can work with below versions:
azure-cosmosdb-spark_2.4.0_2.11-3.4.0-uber.jar
azure-cosmosdb-spark_2.4.0_2.11-2.1.2-uber.jar
azure-cosmosdb-spark_2.4.0_2.11-3.6.1-uber.jar

Read CSV file from Azure Blob storage in Rstudio Server with spark_read_csv()

I have provisioned an Azure HDInsight cluster type ML Services (R Server), operating system Linux, version ML Services 9.3 on Spark 2.2 with Java 8 HDI 3.6.
Within Rstudio Server I am trying to read in a csv file from my blob storage.
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
Sys.setenv(YARN_CONF_DIR="/etc/hadoop/conf")
Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop/conf")
Sys.setenv(SPARK_CONF_DIR="/etc/spark/conf")
options(rsparkling.sparklingwater.version = "2.2.28")
library(sparklyr)
library(dplyr)
library(h2o)
library(rsparkling)
sc <- spark_connect(master = "yarn-client",
version = "2.2.0")
origins <-file.path("wasb://MYDefaultContainer#MyStorageAccount.blob.core.windows.net",
"user/RevoShare")
df2 <- spark_read_csv(sc,
path = origins,
name = 'Nov-MD-Dan',
memory = FALSE)```
When I run this I get the following error
Error: java.lang.IllegalArgumentException: invalid method csv
for object 235
at sparklyr.Invoke$.invoke(invoke.scala:122)
at sparklyr.StreamHandler$.handleMethodCall(stream.scala:97)
at sparklyr.StreamHandler$.read(stream.scala:62)
at sparklyr.BackendHandler.channelRead0(handler.scala:52)
at sparklyr.BackendHandler.channelRead0(handler.scala:14)
at
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleCh annelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:748)
Any help would be awesome!
The path origins should point to a CSV file or a directory of CSVs. Are you sure that origins points to a directory of files or a file? There's typically at least one more directory under /user/RevoShare/ for each HDFS user, i.e., /user/RevoShare/sshuser/.
Here's an example that may help:
sample_file <- file.path("/example/data/", "yellowthings.txt")
library(sparklyr)
library(dplyr)
cc <- rxSparkConnect(interop = "sparklyr")
sc <- rxGetSparklyrConnection(cc)
fruits <- spark_read_csv(sc, path = sample_file, name = "fruits", header = FALSE)
You can use RxHadoopListFiles("/example/data/") or use hdfs dfs -ls /example/data to inspect your directories on HDFS / Blob.
HTH!

Apache Spark: SparkFiles.get(fileName.txt) - Unable to retrieve the file contents from SparkContext

I used SparkContext.addFile("hdfs://host:54310/spark/fileName.txt") and added a file to SparkContext. I verified its presence using org.apache.spark.SparkFiles.get(fileName.txt). It showed an absolute path, something like /tmp/spark-xxxx/userFiles-xxxx/fileName.txt.
Now I want to read that file from the above given absolute path
location from SparkContext. I tried
sc.textFile(org.apache.spark.SparkFiles.get("fileName.txt")).collect().foreach(println)
It considers the path returned by SparkFiles.get() as a HDFS
path, which is incorrect.
I searched extensively to find any helpful reads on this, but ran out of luck.
Is there anything wrong in the approach? Any help is really appreciated.
Here is the code and the outcome:
scala> sc.addFile("hdfs://localhost:54310/spark/fileName.txt")
scala> org.apache.spark.SparkFiles.get("fileName.txt")
res23: String = /tmp/spark-3646b5fe-0a67-4a16-bd25-015cc73533cd/userFiles-a7d54640-fab2-4dfa-a94f-7de6f74a0764/fileName.txt
scala> sc.textFile(org.apache.spark.SparkFiles.get("fileName.txt")).collect().foreach(println)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:54310/tmp/spark-3646b5fe-0a67-4a16-bd25-015cc73533cd/userFiles-a7d54640-fab2-4dfa-a94f-7de6f74a0764/fileName.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
... 49 elided
Refer to local file using the "file://" syntax.
sc.textFile("file://" + org.apache.spark.SparkFiles.get("fileName.txt"))
.collect()
.foreach(println)

Spark SQL coalesce function fails to evaluate

I am doing an outer join between a source dataframe and a smaller "overrides" dataframe, and I'd like to use the coalesce function:
val outputColumns: Array[Column] = dimensionColumns.map(dc => etlDf(dc)).union(attributeColumns.map(ac => coalesce(overrideDf(ac), etlDf(ac))))
etlDf.join(overrideDf, childColumns, "left").select(outputColumns:_*)
When it comes time to write the resulting dataframe to a parquet file, I am receiving the following exception:
org.apache.spark.sql.AnalysisException: Attribute name "coalesce(top_customer_fg, top_customer_fg)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:581)
at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldName(ParquetSchemaConverter.scala:567)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:431)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:431)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:431)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:115)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:484)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:520)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:494)
at com.mycompany.customattributes.ProgramImplementation$StandardProgram.createAttributeFiles(ProgramImplementation.scala:63)
So even though the coalesce function returns a column, it appears it is evaluated as a literal column name. This seems unexpected to me.
Is there a syntax mistake I'm making here, or do I need to take a different approach?
Thanks.

Resources