add path for pipelinemodel in spark - apache-spark

I want to add path for Pipelinemodel in spark to load model from my local file system but it returns the following exception.
import org.apache.spark.ml.PipelineModel
val pipeline = PipelineModel.load("C:/Users/meh/Desktop/PARC_ACTIF_OM/Partie1_OM/Models_save")
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/meh/Desktop/PARC_ACTIF_OM/Partie1_OM/Models_save/model_final.sav/metadata
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)

If C:/Users/meh/Desktop/PARC_ACTIF_OM/Partie1_OM/Models_save/model_final.sav/metadata does not exist then you likely have a format issue. The format you are saving the model in is not the format load is looking for the data. (It clearly is calling out that that metadata folder is missing and that seems critical to the load.)

Related

Spark magic output committer settings not recognized

I'm trying to play around with different Spark output committer settings for s3, and wanted to try out the magic committer. So far I didn't manage to get my jobs to use the magic committer, and they always seem to fall back on the file output committer.
The Spark job I'm running is a simple PySpark test job that runs a simple query, repartitions the data and outputs parquet to s3:
df = spark.sql("select * from some_table where some_condition")
df.write \
.partitionBy("some_column") \
.parquet("s3://some-bucket/some-folder", mode="overwrite")
The relevant spark settings are (taken from the Spark UI, job's environment tab):
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.magic.enabled true
spark.hadoop.fs.s3a.committer.name magic
spark.hadoop.fs.s3a.committer.staging.tmp.path tmp/staging
spark.hadoop.fs.s3a.committer.staging.unique-filenames true
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
mapreduce.output.fileoutputformat.compress false
mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.DefaultCodec
mapreduce.output.fileoutputformat.compress.type RECORD
mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
mapreduce.fileoutputcommitter.algorithm.version 1
mapreduce.fileoutputcommitter.task.cleanup.enabled false
mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
Hadoop properties:
fs.s3a.committer.magic.enabled true
fs.s3a.committer.name magic
(Let me know if any other settings are relevant)
I'm basing the observation of file committer being used instead of magic committer on a couple of things:
Different log lines produced by the spark job seem to indicate the file output committer being used:
"class":"org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter","file_line":"FileOutputCommitter.java:601","func":"commitTask","message":"Saved output of task 'attempt_2021...' to s3://some-bucket/some-folder/_temporary/0/
task_2021..."
"class":"org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat","file_line":"ParquetFileFormat.scala:54","message":"U
sing user defined output committer for Parquet: org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"
"class":"org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter","file_line":"FileOutputCommitter.java:141","func":"<init>","message":"File Outpu
t Committer Algorithm version is 1"
"class":"org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter","file_line":"FileOutputCommitter.java:156","func":"<init>","message":"FileOutput
Committer skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false"
When setting the file committer's algo to an invalid number, like so:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version -7
an exception is raised from the file committer's constructor saying the value is invalid - implicating that the file committer was initialized instead of the magic committer.
I'm not seeing any logs indicating usage of the magic committer, or any failure to initialize a committer which could explain falling back on the file committer.
Spark version is 3.1.2 using this spark-hadoop-cloud JAR. Let me know if there's any other officially published JAR I can try or if there are any other log indications that may be relevant.
Any thoughts?
===== EDIT:
Below is the stack trace I see when setting the file committer algo to an invalid value. It seems that the call to org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.setupCommitter ends up calling org.apache.hadoop.mapreduce.lib.output.FileOutputCommitterFactory.createOutputCommitter which in turn initializes the incorrect type org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter instead of the configured type org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
Py4JJavaError: An error occurred while calling o259.parquet.
: java.io.IOException: Only 1 or 2 algorithm version is supported
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:143)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:117)
at org.apache.hadoop.mapreduce.lib.output.PathOutputCommitterFactory.createFileOutputCommitter(PathOutputCommitterFactory.java:134)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitterFactory.createOutputCommitter(FileOutputCommitterFactory.java:35)
at org.apache.hadoop.mapreduce.lib.output.PathOutputCommitterFactory.createCommitter(PathOutputCommitterFactory.java:201)
at org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.setupCommitter(PathOutputCommitProtocol.scala:88)
at org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.setupCommitter(PathOutputCommitProtocol.scala:49)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:177)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:874)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Mystery solved - the failure to initialize the magic committer was due to a mismatch between the committer factory scheme setting to the scheme of the actual destination URL. Consider this:
The committer factory configuration was set using the key: spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a - meaning that the setting is made for s3a protocol URLs.
While th URL sent to the write method was: s3://some-bucket/some-folder - using s3 protocol instead of s3a.
The PathOutputCommitterFactory hadoop class searches for a config key with pattern mapreduce.outputcommitter.factory.scheme.%s to recognize which factory to use for the given output URL. In case the pattern set in the config key (in this case s3a) does not match the pattern in the destination URL (in this case s3) - the committer factory setting will not be recognized and the factory type will fall back on FileOutputCommitter.
Solution - make sure the outputcommitter.factory.scheme.<protocol> setting matches the protocol in the destination URL. I've successfully tested using both s3 and s3a in the URL & config key.
this does sound like a binding problem but I cannot see immediately where it is. At a glance you have all the right settings.
The easiest way to check that an S3 a committee is being used is to look at the _SUCCESS file . If it is a piece of JSON then a new committer was used… The text inside will then tell you more about the committer.
a 0 byte file means that the classic file output committer was still used

Problem when calling createOrReplaceGlobalTempView on a dataframe

On my new windows machine, I am using Spark 3.1.1 (using winutils for Hadoop) to create from a csv file a global temp view like so :
DF.createOrReplaceGlobalTempView("firstTable");
Where DF is a Dataset<Row> containing the csv data.
I get the following error on the global view creation :
Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: Error while running command to get file permissions : ExitCodeException exitCode=-1073741515:
What's the problem ?
Thanks

Unable to read parquet file locally in spark

I am running Pyspark locally and trying to read a parquet file and load into a data frame from notebook.
df = spark.read.parquet("metastore_db/tmp/userdata1.parquet")
I am getting this exception
An error occurred while calling o738.parquet.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Does anyone know how to do it?
Assuming that you are running spark on your local, you should be doing something like
df = spark.read.parquet("file:///metastore_db/tmp/userdata1.parquet")

Unable to save data frame as Hive table, throwing file not found exception

When I am trying to save data frame as Hive table in pyspark
df_writer.saveAsTable('hive_table', format='parquet', mode='overwrite')
I am getting following error:
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path
does not exist:
hdfs://hostname:8020/apps/hive/warehouse/testdb.db/hive_table at
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
I have the path till 'hdfs://hostname:8020/apps/hive/warehouse/testdb.db/'
Please provide your inputs
Try using DataFrameWriter as
df.write.mode(SaveMode.Append).insertInto(s"${dbName}.${t.table}")

Can't read files via FTP using SparkContext.textFile(...) on Google Dataproc

I am running a Spark cluster on Google Dataproc and I'm experiencing some issues while trying to read GZipped file from FTP using sparkContext.textFile(...).
The code I am running is:
object SparkFtpTest extends App {
val file = "ftp://username:password#host:21/filename.txt.gz"
val lines = sc.textFile(file)
lines.saveAsTextFile("gs://my-bucket-storage/tmp123")
}
The error that I get is:
Exception in thread "main" org.apache.commons.net.ftp.FTPConnectionClosedException: Connection closed without indication.
I see some people have suggested that the credentials are wrong, so I've tried entering wrong credentials and the error was different, i.e. Invalid login credentials.
It also works if I copy the URL into the browser - the file is being downloaded properly.
It's also worth mentioning that I've tried using Apache commons-net library (the same version as the one in Spark - 2.2) and it worked - I was able to stream the data (from both Master and Worker nodes). I wasn't able to decompress it though (by using Java's GZipInputStream; I can't remember the failure but if you think it's important I can try and reproduce it). I think this suggests that it's not some firewall issue on the cluster, though I wasn't able to use curl to download the file.
I think I was running the same code a few months ago from my local machine and if I remember correctly it worked just fine.
Do you have any ideas what is causing this problem?
Could it be that it's some kind of dependency conflict problem and if so which one?
I have a couple of dependencies in the project such as google-sdk, solrj, ... However, I'd expect to see something like ClassNotFoundException or NoSuchMethodError if it was a dependency problem.
The whole stack trace looks like this:
16/12/05 23:53:46 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/
16/12/05 23:53:47 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/ - removing from cache
16/12/05 23:53:49 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/0/
16/12/05 23:53:50 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/0/ - removing from cache
16/12/05 23:53:50 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/
16/12/05 23:53:51 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/ - removing from cache
Exception in thread "main" org.apache.commons.net.ftp.FTPConnectionClosedException: Connection closed without indication.
at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:298)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:495)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:537)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:586)
at org.apache.commons.net.ftp.FTP.quit(FTP.java:794)
at org.apache.commons.net.ftp.FTPClient.logout(FTPClient.java:788)
at org.apache.hadoop.fs.ftp.FTPFileSystem.disconnect(FTPFileSystem.java:151)
at org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:395)
at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1701)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1647)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:222)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1906)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1219)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1161)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1064)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1030)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1030)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:956)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:956)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:956)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:955)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1459)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1438)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1438)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1438)
It looks like this may be a known unresolved issue in Spark/Hadoop: https://issues.apache.org/jira/browse/HADOOP-11886 and https://github.com/databricks/learning-spark/issues/21 both allude to a similar stack trace.
If you were able to manually use the Apache commons-net library, you could achieve the same effect as sc.textFile by obtaining a list of the files, parallelizing that list of files as an RDD, and using flatMap where each task takes a filename and reads the file line-by-line, generating the output collection of lines for each file.
Alternatively, if the amount of data you have in FTP is small (up to maybe 10 GB or so) then parallel reads won't be helping too much compared to a single thread copying from your FTP server onto HDFS or GCS in your Dataproc cluster before then processing using an HDFS or GCS path in your Spark job.

Resources