Spark saveAsTable with LOCATION at the S3 bucket's root causes NullPointerException

I am working with Spark 3.0.1 and my partitioned table is stored in S3. Here is a description of the issue.
Create Table
CREATE TABLE root_table_test_spark_3_0_1 (
  id string,
  name string
)
USING PARQUET
PARTITIONED BY (id)
LOCATION 's3a://MY_BUCKET_NAME/'
Code that is causing the NullPointerException on the second run
Seq(MinimalObject("id_1", "name_1"), MinimalObject("id_2", "name_2"))
.toDS()
.write
.partitionBy("id")
.mode(SaveMode.Append)
.saveAsTable("root_table_test_spark_3_0_1")
When the Hive metastore is empty everything is fine, but the issue appears when Spark calls getCustomPartitionLocations during the InsertIntoHadoopFsRelationCommand phase (on the second run, for example).
Indeed, it calls the method below (from org.apache.hadoop.fs.Path):
/** Adds a suffix to the final name in the path.*/
public Path suffix(String suffix) {
return new Path(getParent(), getName()+suffix);
}
But getParent() returns null when the path is the root, resulting in a NullPointerException. The only option I can think of at the moment is to override this method to do something like:
/** Adds a suffix to the final name in the path.*/
public Path suffix(String suffix) {
return (isRoot()) ? new Path(uri.getScheme(), uri.getAuthority(), suffix) : new Path(getParent(), getName()+suffix);
}
Is anyone else having issues when the LOCATION of a Spark Hive table is at the root level? Is there any workaround? Are there any known issues open for this?
My runtime does not allow me to override the Path class and fix the suffix method, and I can't move my data out of the bucket's root because it has been there for two years now.
The issue appears because I'm migrating from Spark 2.1.0 to Spark 3.0.1, and the behavior of checking custom partition locations was introduced in Spark 2.2.0 (https://github.com/apache/spark/pull/16460).
This whole context helps to understand the problem, but you can reproduce it easily with:
val path: Path = new Path("s3a://MY_BUCKET_NAME/")
println(path.suffix("/id=id"))
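For reference, the same root-safe logic can be written as a standalone Scala helper. This is only a sketch, and it only helps at call sites you control; it does not change what Spark itself calls internally:

import org.apache.hadoop.fs.Path

// Hypothetical helper mirroring the proposed fix: avoid Path.suffix's NPE on root paths.
def suffixSafe(p: Path, suffix: String): Path = {
  if (p.getParent == null) {               // root path: Path.suffix would NPE here
    val uri = p.toUri
    new Path(uri.getScheme, uri.getAuthority, suffix)
  } else {
    new Path(p.getParent, p.getName + suffix)
  }
}

// suffixSafe(new Path("s3a://MY_BUCKET_NAME/"), "/id=id")
// should return s3a://MY_BUCKET_NAME/id=id instead of throwing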
FYI, the hadoop-common version is 2.7.4. Here is the full stack trace:
NullPointerException
at org.apache.hadoop.fs.Path.<init>(Path.java:104)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
at org.apache.hadoop.fs.Path.suffix(Path.java:361)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:262)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.getCustomPartitionLocations(InsertIntoHadoopFsRelationCommand.scala:260)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:107)
at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:575)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:218)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:166)
Thanks

Looks like a situation where the Spark code calls Path.suffix("something") and, because the root path has no parent, an NPE is triggered.
Long-term fix:
1. File a JIRA on issues.apache.org against HADOOP; provide a patch with a test that fixes suffix() to degrade gracefully when called on the root path. Best for everyone.
2. Don't use the root path as the destination of a table.
3. Do both of these.
Option #2 should avoid other surprises about how tables are created/committed etc. Some of the code may fail because an attempt to delete the root of the path (here "s3a://some-bucket") won't delete the root, will it?
Put differently: root directories have "odd" semantics everywhere; most of the time you don't notice this on a local FS because you never try to use / as a destination of work, and you get surprised that rm -rf / is different from rm -rf /subdir, and so on. Spark, Hive etc. were never written to use / as a destination of work, so you get to see the failures.
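A minimal sketch of option #2, with the LOCATION moved to a prefix under the bucket instead of the bucket root (bucket and prefix names are placeholders; existing data would have to be copied or re-registered under the new prefix):

// Same DDL as in the question, but the table location is a sub-prefix, not the bucket root.
spark.sql("""
  CREATE TABLE root_table_test_spark_3_0_1 (
    id string,
    name string
  )
  USING PARQUET
  PARTITIONED BY (id)
  LOCATION 's3a://MY_BUCKET_NAME/tables/root_table_test_spark_3_0_1/'
""")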

Related

Where are the spark intermediate files stored on the disk?

During a shuffle, the mappers dump their outputs to the local disk, from where they get picked up by the reducers. Where exactly on the disk are those files dumped? I am running a PySpark cluster on YARN.
What I have tried so far:
I think the possible locations where the intermediate files could be are (in decreasing order of likelihood):
1. hadoop/spark/tmp, as per the documentation of the LOCAL_DIRS env variable that gets defined by YARN. However, after starting the cluster (I am passing --master yarn) I couldn't find any LOCAL_DIRS env variable using os.environ, but I can see SPARK_LOCAL_DIRS, which according to the documentation should happen only in the case of Mesos or standalone (any idea why that might be the case?). Anyhow, my SPARK_LOCAL_DIRS is hadoop/spark/tmp.
2. tmp, the default value of spark.local.dir.
3. /home/username. I have tried passing a custom value to spark.local.dir when starting PySpark using --conf spark.local.dir=/home/username.
4. hadoop/yarn/nm-local-dir, the value of the yarn.nodemanager.local-dirs property in yarn-site.xml.
I am running the following code and checking for any intermediate files being created at the above 4 locations by navigating to each location on a worker node.
The code I am running:
from pyspark import storagelevel
df_sales = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_products = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/products_parquet")
df_merged = df_sales.join(df_products,df_sales.product_id==df_products.product_id,'inner')
df_merged.persist(storagelevel.StorageLevel.DISK_ONLY)
df_merged.count()
No files are being created at any of the 4 locations I have listed above.
As suggested in one of the answers, I have tried getting the directory info in the terminal the following way:
At the end of the log4j.properties file located at $SPARK_HOME/conf/, add log4j.logger.org.apache.spark.api.python.PythonGatewayServer=INFO
This did not help, even with logging set to INFO.
Where are the spark intermediate files (output of mappers, persist etc) stored?
Without getting into the weeds of Spark source, perhaps you can quickly check it live. Something like this:
>>> irdd = spark.sparkContext.range(0,100,1,10)
>>> def wherearemydirs(p):
... import os
... return os.getenv('LOCAL_DIRS')
...
>>>
>>> irdd.map(wherearemydirs).collect()
>>>
...will show local dirs in terminal
/data/1/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/10/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/11/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,...
But yes, it will basically point to the parent dir (created by YARN) of UUID-randomized subdirs created by DiskBlockManager, as #KoedIt mentioned:
:
23/01/05 10:15:37 INFO storage.DiskBlockManager: Created local directory at /data/1/yarn/nm/usercache/<your-user-id>/appcache/application_xxxxxxxxx_xxxxxxx/blockmgr-d4df4512-d18b-4dcf-8197-4dfe781b526a
:
This is going to depend on what your cluster setup is and your Spark version, but you're more or less looking at the correct places.
For this explanation, I'll be talking about Spark v3.3.1, which is the latest version as of the time of this post.
There is an interesting method in org.apache.spark.util.Utils called getConfiguredLocalDirs and it looks like this:
/**
* Return the configured local directories where Spark can write files. This
* method does not create any directories on its own, it only encapsulates the
* logic of locating the local directories according to deployment mode.
*/
def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
  val shuffleServiceEnabled = conf.get(config.SHUFFLE_SERVICE_ENABLED)
  if (isRunningInYarnContainer(conf)) {
    // If we are in yarn mode, systems can have different disk layouts so we must set it
    // to what Yarn on this system said was available. Note this assumes that Yarn has
    // created the directories already, and that they are secured so that only the
    // user has access to them.
    randomizeInPlace(getYarnLocalDirs(conf).split(","))
  } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
    conf.getenv("SPARK_LOCAL_DIRS").split(",")
  } else if (conf.getenv("MESOS_SANDBOX") != null && !shuffleServiceEnabled) {
    // Mesos already creates a directory per Mesos task. Spark should use that directory
    // instead so all temporary files are automatically cleaned up when the Mesos task ends.
    // Note that we don't want this if the shuffle service is enabled because we want to
    // continue to serve shuffle files after the executors that wrote them have already exited.
    Array(conf.getenv("MESOS_SANDBOX"))
  } else {
    if (conf.getenv("MESOS_SANDBOX") != null && shuffleServiceEnabled) {
      logInfo("MESOS_SANDBOX available but not using provided Mesos sandbox because " +
        s"${config.SHUFFLE_SERVICE_ENABLED.key} is enabled.")
    }
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
  }
}
This is interesting, because it makes us understand the order of precedence each config setting has. The order is:
if running in Yarn, getYarnLocalDirs should give you your local dir, which depends on the LOCAL_DIRS environment variable
if SPARK_EXECUTOR_DIRS is set, it's going to be one of those
if SPARK_LOCAL_DIRS is set, it's going to be one of those
if MESOS_SANDBOX and !shuffleServiceEnabled, it's going to be MESOS_SANDBOX
if spark.local.dir is set, it's going to be that
ELSE (catch-all) it's going to be java.io.tmpdir
IMPORTANT: In case you're using Kubernetes, all of this is disregarded and this logic is used.
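As a rough driver-side check of that same precedence from a spark-shell (a sketch only: the Mesos branch is omitted, and executors resolve their own values from their container environment, so this only tells you what the driver would pick):

// Mirrors the precedence order of getConfiguredLocalDirs, minus the Mesos branch.
val conf = spark.sparkContext.getConf
val resolvedLocalDirs =
  Option(System.getenv("LOCAL_DIRS"))                       // set by YARN inside containers
    .orElse(Option(System.getenv("SPARK_EXECUTOR_DIRS")))
    .orElse(Option(System.getenv("SPARK_LOCAL_DIRS")))
    .orElse(conf.getOption("spark.local.dir"))
    .getOrElse(System.getProperty("java.io.tmpdir"))
println(resolvedLocalDirs)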
Now, how do we find this directory?
Luckily, there is a nicely placed logging line in DiskBlockManager.createLocalDirs which prints out this directory if your logging level is INFO.
So, set your default logging level to INFO in log4j.properties (like so), restart your spark application and you should be getting a line saying something like
Created local directory at YOUR-DIR-HERE

Databricks Autoloader throws IllegalArgumentException

I'm trying the simplest Auto Loader example included on the Databricks website:
https://databricks.com/notebooks/Databricks-Data-Integration-Demo.html
df = (spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.load(input_data_path))
(df.writeStream.format("delta")
.option("checkpointLocation", chkpt_path)
.table("iot_stream"))
I keep getting this message:
IllegalArgumentException: cloudFiles.schemaLocation Could not find required option: schemaLocation. Please provide a schema location using cloudFiles.schemaLocation for storing inferred schema and supporting schema evolution.
If providing cloudFiles.schemaLocation is required, why do the examples everywhere omit it? What's the underlying issue here?
I suspect what is going on is that you are not explicitly setting the .option("cloudFiles.schemaEvolutionMode")
Which means it is being set to the default which is "addNewColumns" as per https://docs.databricks.com/ingestion/auto-loader/options.html
And that requires you set the .option("cloudFiles.schemaLocation", path) in the reader.
Thus you are inadvertently requiring it and not setting it.
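A sketch of the same pipeline with the option added (a Scala sketch; all paths and the table name are placeholders, and the toTable call is the Spark 3.1+ API):

// Same Auto Loader read as in the question, plus a schema location for inferred schemas.
val inputDataPath = "dbfs:/tmp/iot/input"        // placeholder
val chkptPath     = "dbfs:/tmp/iot/checkpoint"   // placeholder
val schemaPath    = "dbfs:/tmp/iot/schema"       // placeholder

val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", schemaPath)  // required when schema inference/evolution is used
  .load(inputDataPath)

df.writeStream
  .format("delta")
  .option("checkpointLocation", chkptPath)
  .toTable("iot_stream")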

Chaining Delta Streams programmatically raising AnalysisException

Situation: I am producing a Delta folder with data from a previous streaming query A, and later reading from it into another DataFrame, as shown here:
DF_OUT.writeStream.format("delta").(...).start("path")
(...)
DF_IN = spark.readStream.format("delta").load("path")
1 - When I try to read it this way in a subsequent readStream (chaining queries for an ETL pipeline) from the same program, I end up getting the exception below.
2 - When I run it in the Scala REPL, however, it runs smoothly.
I am not sure what is happening there, but it sure is puzzling.
org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;
at org.apache.spark.sql.delta.DeltaErrors$.schemaNotSetException(DeltaErrors.scala:365)
at org.apache.spark.sql.delta.sources.DeltaDataSource.sourceSchema(DeltaDataSource.scala:74)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:209)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:95)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:95)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:171)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:225)
at org.apache.spark.ui.DeltaPipeline$.main(DeltaPipeline.scala:114)
From the Delta Lake Quick Guide - Troubleshooting:
Table schema is not set error
Problem:
When the path of the Delta table does not exist and you try to stream data from it, you will get the following error.
org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;
Solution:
Make sure the path of a Delta table is created.
After reading the error message, I did try to be a good boy and follow the advice, so I tried to make sure there actually IS valid data in the delta folder I am trying to read from BEFORE calling the readStream, and voila !
import java.io.File

def hasFiles(dir: String): Boolean = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.filter(_.isFile).size > 0
  } else false
}
DF_OUT.writeStream.format("delta").(...).start(DELTA_DIR)
while (!hasFiles(DELTA_DIR)) {
  print("DELTA FOLDER STILL EMPTY")
  Thread.sleep(10000)
}
print("FOUND DATA ON DELTA A - WAITING 30 SEC")
Thread.sleep(30000)
DF_IN = spark.readStream.format("delta").load(DELTA_DIR)
It ended up working, but I had to make sure to wait long enough for "something to happen" (I don't know exactly what, TBH, but it seems that reading from Delta needs some writes to be complete first, maybe metadata?).
However, this is still a hack. I wish it were possible to start reading from an empty Delta folder and wait for content to start pouring into it.
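If you want to keep that approach, a slightly more targeted sketch is to wait for the Delta transaction log rather than for just any file: the schema the reader complains about lives in _delta_log, so checking for at least one commit file there is closer to what the stream actually needs. This is a local-filesystem sketch only (for object stores you would go through the Hadoop FileSystem API), and deltaLogReady is a hypothetical helper:

import java.io.File

// Wait until the Delta transaction log contains at least one JSON commit.
def deltaLogReady(dir: String): Boolean = {
  val log = new File(dir, "_delta_log")
  log.isDirectory && log.listFiles().exists(f => f.isFile && f.getName.endsWith(".json"))
}

while (!deltaLogReady(DELTA_DIR)) {
  println("Delta log not written yet, waiting...")
  Thread.sleep(5000)
}
val DF_IN = spark.readStream.format("delta").load(DELTA_DIR)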
In my case I couldn't find the absolute path, so a simple solution was to use this alternative:
spark.readStream.format("delta").table("tableName")

What is the purpose of global temporary views?

Trying to understand how to use the Spark Global Temporary Views.
In one spark-shell session I've created a view
spark = SparkSession.builder.appName('spark_sql').getOrCreate()
df = (
spark.read.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.csv("/user/root/data/cars.csv"))
df.createGlobalTempView("my_cars")
# works without any problem
spark.sql("SELECT * FROM global_temp.my_cars").show()
And in another session I tried to access it, without success (table or view not found).
#second Spark Shell
spark = SparkSession.builder.appName('spark_sql').getOrCreate()
spark.sql("SELECT * FROM global_temp.my_cars").show()
That's the error I receive :
pyspark.sql.utils.AnalysisException: u"Table or view not found: `global_temp`.`my_cars`; line 1 pos 14;\n'Project [*]\n+- 'UnresolvedRelation `global_temp`.`my_cars`\n"
I've read that each spark-shell has its own context, and that's why one spark-shell cannot see the other's views. So I don't understand: what is the use of global temporary views, and where are they useful?
Thanks
In the Spark documentation you can see:
If you want to have a temporary view that is shared among all sessions
and keep alive until the Spark application terminates, you can create
a global temporary view.
The global table remains accessible as long as the application is alive.
Opening a new shell and giving it the same application name will just create a new application.
You can try and test it within the same shell:
spark.newSession.sql("SELECT * FROM global_temp.my_cars").show()
please see my answer on a similar question for a more detailed example as well as a short definition of a Spark Application and Spark Session
Temporary views in Spark SQL are session-scoped and will disappear if the session that creates it terminates. If you want to have a temporary view that is shared among all sessions and keep alive until the Spark application terminates, you can create a global temporary view. Global temporary view is tied to a system preserved database global_temp, and we must use the qualified name to refer it,
df.createGlobalTempView("people")
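For example, within a single spark-shell (a Scala sketch; the view name and data are made up for illustration):

// The view created in one session is visible from a second session of the same
// application via the global_temp database.
val df = spark.range(3).toDF("id")
df.createGlobalTempView("people")

val otherSession = spark.newSession()
otherSession.sql("SELECT * FROM global_temp.people").show()  // works: same application

// A separate spark-shell is a separate application, so the view is not visible there.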

Could not retrieve endpoint ranges: java.lang.IllegalArgumentException

I am trying to load SSTables into Cassandra using the sstableloader utility, but I am getting the following error.
> java.lang.IllegalArgumentException
java.lang.RuntimeException: Could not retrieve endpoint ranges:
at org.apache.cassandra.tools.BulkLoader$ExternalClient.init(BulkLoader.java:338)
at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:156)
at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:106)
Caused by: java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:275)
at org.apache.cassandra.utils.ByteBufferUtil.readBytes(ByteBufferUtil.java:543)
at org.apache.cassandra.serializers.CollectionSerializer.readValue(CollectionSerializer.java:124)
at org.apache.cassandra.serializers.MapSerializer.deserializeForNativeProtocol(MapSerializer.java:101)
at org.apache.cassandra.serializers.MapSerializer.deserializeForNativeProtocol(MapSerializer.java:30)
at org.apache.cassandra.serializers.CollectionSerializer.deserialize(CollectionSerializer.java:50)
at org.apache.cassandra.db.marshal.AbstractType.compose(AbstractType.java:68)
at org.apache.cassandra.cql3.UntypedResultSet$Row.getMap(UntypedResultSet.java:287)
at org.apache.cassandra.config.CFMetaData.fromSchemaNoTriggers(CFMetaData.java:1824)
at org.apache.cassandra.config.CFMetaData.fromThriftCqlRow(CFMetaData.java:1117)
at org.apache.cassandra.tools.BulkLoader$ExternalClient.init(BulkLoader.java:330)
... 2 more
The command I am using to load the SSTables is:
$bin/sstableloader -d nodename -u username -pw password path/to/sstable/keyspacename/tablename
This was working a few days back. I am not sure what's changed or how to debug it.
I am using DataStax.
I am loading the SSTable from a node that is in the cluster, i.e. my source and destination nodes are the same.
Has anyone seen this error before?
Cassandra version: 2.1
Any help is appreciated.
The exception in the stack trace comes from this piece of code:
if (version >= Server.VERSION_3)
{
    int size = input.getInt();
    if (size < 0)
        return null;

    return ByteBufferUtil.readBytes(input, size); // HERE !
}
I'm wondering if you're loading SSTables that were generated by Cassandra 2.1 or an older version, because the issue seems to be at the byte-encoding level.
There is also a possibility that your SSTables are corrupted.
How did you get those SSTables? From a copy of another Cassandra instance? Generated by CQLSSTableWriter?
I had this problem again, so I debugged it a little to find the root cause. The problem is that if at any time you have altered your Cassandra table by dropping some column, it triggers a bug in sstableloader. That's why dropping the table and creating it again works.
