Java code executed by Databricks/Spark job - settings for one executor - apache-spark

In our current system there is a Java program that reads one file and generates many JSON documents for a full day (24h); all the JSON docs are written to CosmosDB. When I execute it from the console everything is OK. I tried to schedule a Databricks job using the uber-jar file and it failed with the following error:
"Resource with specified id or name already exists."
That seems understandable, IMO: the default settings of the existing cluster include many executors, so each executor will try to write the same set of JSON docs to CosmosDB.
So I changed the main method as below:
public static void main(String[] args) {
    SparkConf conf01 = new SparkConf().set("spark.executor.instances", "1").set("spark.executor.cores", "1");
    SparkSession spark = SparkSession.builder().config(conf01).getOrCreate();
    ...
}
But I received the same error "Resource with specified id or name already exists" from the CosmosDB.
I would like to have only one executor for this specific Java code. How can I use only one Spark executor?
Any help (link/doc/url/code) will be appreciated.
Thank you !
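For illustration only (this is not part of the original job), a hedged sketch of a common way to avoid parallel duplicate writes without touching executor counts: funnel the output through a single partition so only one task performs the CosmosDB writes. Note that on an existing, already running cluster, setting spark.executor.instances inside main() after getOrCreate() generally has no effect, since the executors are already allocated. The jsonDocs dataset, the connector format, and the options below are placeholders, not the original code:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Hedged sketch: coalesce(1) forces all rows into one partition, so a single
// task writes them, regardless of how many executors the cluster has.
Dataset<Row> singleWriter = jsonDocs.coalesce(1);

singleWriter.write()
        .format("cosmos.oltp")                                  // assumes the Azure Cosmos DB Spark 3 connector
        .option("spark.cosmos.accountEndpoint", "<endpoint>")   // placeholder
        .option("spark.cosmos.accountKey", "<key>")             // placeholder
        .option("spark.cosmos.database", "<database>")          // placeholder
        .option("spark.cosmos.container", "<container>")        // placeholder
        .mode(SaveMode.Append)
        .save();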

Related

Schema disagreements with Cassandra 4.0 using the Java driver

We have a 3-node dev Cassandra cluster running 3.11.13 that we have upgraded to 4.0.7. We have basically been sending DDL statements through our Java applications using spring-data-cassandra:3.4.6, which uses DataStax Java Driver version 4.14.1, and we had never faced any issues with it until the upgrade to 4.0.7.
The main issue we're facing with 4.0.7 is the schema disagreements we've been seeing for tables created programmatically, which had been a non-issue for us since 3.11.x. DDL statements made through cqlsh work as expected; it's only with programmatic creation that we see the schema disagreements.
We’ve tried different cluster setups, C* versions, and Ubuntu versions, but we still face the same issue:
3-node, single-rack DC (Ubuntu 18.04, 20.04, 22.04) (4.0.x, 4.1.x)
3-node, 3-rack DC (Ubuntu 18.04, 20.04, 22.04) (4.0.x, 4.1.x) — This is the setup we’ve been using since 3.11.x
We've also tried fiddling with the driver configuration, like adjusting the timeouts and disabling debouncing, but with no luck; we face the same issue.
advanced.control-connection {
  schema-agreement {
    interval = 500 milliseconds
    timeout = 10 seconds
    warn-on-failure = true
  }
},
advanced.metadata {
  topology-event-debouncer {
    window = 1 milliseconds
    max-events = 1
  }
  schema {
    request-timeout = 5 seconds
    debouncer {
      window = 1 milliseconds
      max-events = 1
    }
  }
}
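For context (this is not from the original post), a hedged sketch of how overrides like the ones above are typically wired into the driver, assuming they live under the standard datastax-java-driver { ... } root of an application.conf on the classpath:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

// Hedged sketch: explicitly load the classpath application.conf containing the
// overrides shown above (the driver also picks it up by default if present).
CqlSession session = CqlSession.builder()
        .withConfigLoader(DriverConfigLoader.fromClasspath("application"))
        .build();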
We’re creating tables programmatically through the following snippets:
@Override
protected abstract List<String> getStartupScripts();

@Bean
SessionFactoryInitializer sessionFactoryInitializer(SessionFactory sessionFactory) {
    SessionFactoryInitializer initializer = new SessionFactoryInitializer();
    initializer.setSessionFactory(sessionFactory);
    final ResourceKeyspacePopulator resourceKeyspacePopulator = new ResourceKeyspacePopulator();
    getStartupScripts().forEach(script -> {
        resourceKeyspacePopulator.addScript(scriptOf(script));
    });
    initializer.setKeyspacePopulator(resourceKeyspacePopulator);
    return initializer;
}
And create one like:
@Override
protected List<String> getStartupScripts() {
    return Arrays.asList(testTable());
}

private String testTable() {
    return "CREATE TABLE IF NOT EXISTS test_table ("
        + "test text, "
        + "test2 text, "
        + "createdat bigint, "
        + "PRIMARY KEY(test, test2))";
}
But we end up in a loop until it times out due to the schema disagreement, with the following errors:
DEBUG com.datastax.oss.driver.internal.core.metadata.SchemaAgreementChecker - [s1] Schema agreement not reached yet ([09989a2c-7348-3117-8b4a-d5cad549bc09, f4c8755d-6fec-38fe-984f-4083f4a0a0a0]), rescheduling in 500 ms
WARN org.springframework.context.support.GenericApplicationContext - Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'sessionFactoryInitializer' defined in com.bitcoin.wallet.config.CassandraConfig: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [org.springframework.data.cassandra.core.cql.session.init.SessionFactoryInitializer]: Factory method 'sessionFactoryInitializer' threw exception; nested exception is org.springframework.data.cassandra.core.cql.session.init.ScriptStatementFailedException: Failed to execute CQL script statement #1 of Byte array resource [resource loaded from byte array]: CREATE TABLE IF NOT EXISTS test_table (test text,test2 text,createdat bigint,PRIMARY KEY(test, test2)); nested exception is com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT10S
So two things come to mind when reading through this:
Schema disagreements are often a symptom of some larger issue.
Does the node have its CPU pegged at 100%? Schema disagreement. Inefficient network routing? Schema disagreement. Disk IOPS maxed-out causing write back-pressure? Schema disagreement.
I'd have a look at the activity on the nodes and see if any of the above stand out.
Programmatic schema changes are often problematic.
Each node needs to store the complete schema, so each schema change gets sent to all nodes, essentially making schema changes run at an asynchronous ALL level of consistency. Because of that, there's no margin for error. And programmatic schema changes are often sent from within an application much faster than Cassandra can reconcile them.
My recommendations for making any schema changes:
Execute during off-peak times.
Only run when all nodes are UN.
Run them using cqlsh (not from application code).
Verify each individual change using nodetool describecluster (or programmatically, as sketched below).
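If DDL really has to be issued from application code, a hedged sketch (not from the original post; names are placeholders) of waiting for schema agreement after each statement before sending the next one, using the driver's built-in check:

import java.time.Duration;

import com.datastax.oss.driver.api.core.CqlSession;

final class SchemaDdlHelper {
    // Execute one DDL statement and block until all live nodes report the same
    // schema version, or give up after ~30 seconds.
    static void executeDdlAndWaitForAgreement(CqlSession session, String ddl) throws InterruptedException {
        session.execute(ddl);
        long deadline = System.nanoTime() + Duration.ofSeconds(30).toNanos();
        while (!session.checkSchemaAgreement()) {
            if (System.nanoTime() > deadline) {
                throw new IllegalStateException("Schema agreement not reached after: " + ddl);
            }
            Thread.sleep(500);
        }
    }
}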

Where are the spark intermediate files stored on the disk?

During a shuffle, the mappers dump their outputs to the local disk, from where they get picked up by the reducers. Where exactly on the disk are those files dumped? I am running a pyspark cluster on YARN.
What I have tried so far:
I think the possible locations where the intermediate files could be are (in decreasing order of likelihood):
hadoop/spark/tmp. As per the documentation, this is the LOCAL_DIRS env variable that gets defined by YARN.
However, after starting the cluster (I am passing --master yarn) I couldn't find any LOCAL_DIRS env variable using os.environ, but I can see SPARK_LOCAL_DIRS, which should only happen in the case of Mesos or standalone as per the documentation (any idea why that might be the case?). Anyhow, my SPARK_LOCAL_DIRS is hadoop/spark/tmp.
tmp. Default value of spark.local.dir
/home/username. I have tried passing a custom value to spark.local.dir while starting pyspark, using --conf spark.local.dir=/home/username
hadoop/yarn/nm-local-dir. This is the value of yarn.nodemanager.local-dirs property in yarn-site.xml
I am running the following code and checking for any intermediate files being created at the above 4 locations by navigating to each location on a worker node.
The code I am running:
from pyspark import storagelevel
df_sales = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_products = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/products_parquet")
df_merged = df_sales.join(df_products,df_sales.product_id==df_products.product_id,'inner')
df_merged.persist(storagelevel.StorageLevel.DISK_ONLY)
df_merged.count()
No files are being created at any of the 4 locations that I have listed above.
As suggested in one of the answers, I have tried getting the directory info in the terminal the following way:
At the end of the log4j.properties file located at $SPARK_HOME/conf/, add log4j.logger.org.apache.spark.api.python.PythonGatewayServer=INFO
This did not help. The following is the screenshot of my terminal with logging set to INFO
Where are the spark intermediate files (output of mappers, persist etc) stored?
Without getting into the weeds of Spark source, perhaps you can quickly check it live. Something like this:
>>> irdd = spark.sparkContext.range(0,100,1,10)
>>> def wherearemydirs(p):
...   import os
...   return os.getenv('LOCAL_DIRS')
...
>>>
>>> irdd.map(wherearemydirs).collect()
>>>
...will show local dirs in terminal
/data/1/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/10/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/11/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,...
But yes, it will basically point to the parent dir (created by YARN) of the UUID-randomized subdirs created by DiskBlockManager, as @KoedIt mentioned:
:
23/01/05 10:15:37 INFO storage.DiskBlockManager: Created local directory at /data/1/yarn/nm/usercache/<your-user-id>/appcache/application_xxxxxxxxx_xxxxxxx/blockmgr-d4df4512-d18b-4dcf-8197-4dfe781b526a
:
This is going to depend on what your cluster setup is and your Spark version, but you're more or less looking at the correct places.
For this explanation, I'll be talking about Spark v3.3.1, which is the latest version as of the time of this post.
There is an interesting method in org.apache.spark.util.Utils called getConfiguredLocalDirs and it looks like this:
/**
 * Return the configured local directories where Spark can write files. This
 * method does not create any directories on its own, it only encapsulates the
 * logic of locating the local directories according to deployment mode.
 */
def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
  val shuffleServiceEnabled = conf.get(config.SHUFFLE_SERVICE_ENABLED)
  if (isRunningInYarnContainer(conf)) {
    // If we are in yarn mode, systems can have different disk layouts so we must set it
    // to what Yarn on this system said was available. Note this assumes that Yarn has
    // created the directories already, and that they are secured so that only the
    // user has access to them.
    randomizeInPlace(getYarnLocalDirs(conf).split(","))
  } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
    conf.getenv("SPARK_LOCAL_DIRS").split(",")
  } else if (conf.getenv("MESOS_SANDBOX") != null && !shuffleServiceEnabled) {
    // Mesos already creates a directory per Mesos task. Spark should use that directory
    // instead so all temporary files are automatically cleaned up when the Mesos task ends.
    // Note that we don't want this if the shuffle service is enabled because we want to
    // continue to serve shuffle files after the executors that wrote them have already exited.
    Array(conf.getenv("MESOS_SANDBOX"))
  } else {
    if (conf.getenv("MESOS_SANDBOX") != null && shuffleServiceEnabled) {
      logInfo("MESOS_SANDBOX available but not using provided Mesos sandbox because " +
        s"${config.SHUFFLE_SERVICE_ENABLED.key} is enabled.")
    }
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
  }
}
This is interesting, because it shows us the order of precedence of the config settings. The order is:
if running in Yarn, getYarnLocalDirs should give you your local dir, which depends on the LOCAL_DIRS environment variable
if SPARK_EXECUTOR_DIRS is set, it's going to be one of those
if SPARK_LOCAL_DIRS is set, it's going to be one of those
if MESOS_SANDBOX and !shuffleServiceEnabled, it's going to be MESOS_SANDBOX
if spark.local.dir is set, it's going to be that
ELSE (catch-all) it's going to be java.io.tmpdir
IMPORTANT: In case you're using Kubernetes, all of this is disregarded and this logic is used.
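To quickly see which of these branches would win on a given setup (not the final directory itself), a hedged sketch in Java of checking the inputs that getConfiguredLocalDirs consults. It assumes a running SparkSession named spark on the driver; executor containers may see different values, since YARN sets LOCAL_DIRS per container:

// Hedged sketch: print the environment variables and config that the
// getConfiguredLocalDirs precedence chain above looks at.
System.out.println("LOCAL_DIRS=" + System.getenv("LOCAL_DIRS"));
System.out.println("SPARK_EXECUTOR_DIRS=" + System.getenv("SPARK_EXECUTOR_DIRS"));
System.out.println("SPARK_LOCAL_DIRS=" + System.getenv("SPARK_LOCAL_DIRS"));
System.out.println("spark.local.dir=" + spark.sparkContext().getConf()
        .get("spark.local.dir", System.getProperty("java.io.tmpdir")));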
Now, how do we find this directory?
Luckily, there is a nicely placed logging line in DiskBlockManager.createLocalDirs which prints out this directory if your logging level is INFO.
So, set your default logging level to INFO in log4j.properties (like so), restart your Spark application, and you should get a line saying something like:
Created local directory at YOUR-DIR-HERE

Spark saveAsTable with location at s3 bucket's root causes NullPointerException

I am working with Spark 3.0.1 and my partitioned table is stored in s3. Please find here the description of the issue.
Create Table
CREATE TABLE root_table_test_spark_3_0_1 (
  id string,
  name string
)
USING PARQUET
PARTITIONED BY (id)
LOCATION 's3a://MY_BUCKET_NAME/'
Code that is causing the NullPointerException on the second run
Seq(MinimalObject("id_1", "name_1"), MinimalObject("id_2", "name_2"))
  .toDS()
  .write
  .partitionBy("id")
  .mode(SaveMode.Append)
  .saveAsTable("root_table_test_spark_3_0_1")
When the Hive metastore is empty everything is OK, but the issue happens when Spark tries to do the getCustomPartitionLocations in the InsertIntoHadoopFsRelationCommand phase (on the second run, for example).
Indeed, it calls the method below (from org.apache.hadoop.fs.Path):
/** Adds a suffix to the final name in the path.*/
public Path suffix(String suffix) {
    return new Path(getParent(), getName()+suffix);
}
But getParent() will return null when we are at the root, resulting in a NullPointerException. The only option I can think of at the moment is to override this method to do something like:
/** Adds a suffix to the final name in the path.*/
public Path suffix(String suffix) {
    return (isRoot()) ? new Path(uri.getScheme(), uri.getAuthority(), suffix) : new Path(getParent(), getName()+suffix);
}
Is anyone having issues when the LOCATION of a Spark Hive table is at root level? Any workaround? Are there any known issues opened?
My runtime does not allow me to override the Path class and fix the suffix method, and I can't move my data from the bucket's root, as it has been there for 2 years now.
The issue happens because I'm migrating from Spark 2.1.0 to Spark 3.0.1, and the behavior of checking the custom partitions appeared in Spark 2.2.0 (https://github.com/apache/spark/pull/16460).
This whole context helps to understand the problem, but basically you can reproduce it easily by doing:
val path: Path = new Path("s3a://MY_BUCKET_NAME/")
println(path.suffix("/id=id"))
FYI, the hadoop-common version is 2.7.4; please find here the full stacktrace:
NullPointerException
at org.apache.hadoop.fs.Path.<init>(Path.java:104)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
at org.apache.hadoop.fs.Path.suffix(Path.java:361)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:262)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.getCustomPartitionLocations(InsertIntoHadoopFsRelationCommand.scala:260)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:107)
at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:575)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:218)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:166)
Thanks
Looks like a situation where the Spark code calls Path.suffix("something") and, because the root path has no parent, an NPE is triggered.
Long term fix: file a JIRA on issues.apache.org against HADOOP and provide a patch with a test that fixes suffix() to degrade properly when called on the root path. Best for all.
Don't use the root path as the destination of a table.
Do both of these.
Option #2 should avoid other surprises about how tables are created/committed etc. ...some of the code may fail because an attempt to delete the root of the path (here "s3a://some-bucket") won't delete the root, will it?
Put differently: root directories have "odd" semantics everywhere; most of the time you don't notice this on a local FS because you never try to use / as a destination of work and so never get surprised that rm -rf / is different from rm -rf /subdir, etc. Spark, Hive, etc. were never written to use / as a destination of work, so you get to see the failures.
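As a hedged illustration of option #2 (not from the original post; the sub-prefix and the ds dataset name are placeholders), registering the table under a prefix instead of the bucket root avoids calling Path.suffix on a parent-less root path. In Java:

import org.apache.spark.sql.SaveMode;

// Hedged sketch: same kind of append as in the question, but the table's files
// live under a hypothetical "warehouse/" prefix rather than at the bucket root.
ds.write()
  .partitionBy("id")
  .mode(SaveMode.Append)
  .option("path", "s3a://MY_BUCKET_NAME/warehouse/root_table_test_spark_3_0_1")
  .saveAsTable("root_table_test_spark_3_0_1");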

MemSQL Spark Job

I am trying to read a CSV file in a Spark job using the MemSQL Extractor, do some enrichment using a Transformer, and load it into a MemSQL database, all in Java.
I see there is a memsql-spark interface jar, but I am not finding any useful Java API documentation or examples.
I have started writing an extractor to read from the CSV, but I don't know how to move further.
public Option<RDD<byte[]>> nextRDD(SparkContext sparkContext, UserExtractConfig config, long batchInterval, PhaseLogger logger) {
    RDD<String> inputFile = sparkContext.textFile(filePath, minPartitions);
    RDD<byte[]> bytes = inputFile.map(ByteUtils.utf8StringToBytes(filePath), String.class); //compilation error
    return bytes; //compilation error
}
Would appreciate it if someone could point me in the right direction to get started...
thanks...
First, configure the Spark connector in Java using the following code:
SparkConf conf = new SparkConf();
conf.set("spark.datasource.singlestore.clientEndpoint", "singlestore-host");
conf.set("spark.datasource.singlestore.user", "admin");
conf.set("spark.datasource.singlestore.password", "s3cur3-pa$$word");
After running the above code, Spark will be connected to the database, and you can then read the CSV into a Spark DataFrame. You can transform and manipulate the data according to your requirements and then write the DataFrame to a database table.
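A hedged sketch of those read/transform/write steps (assuming the current singlestore-spark-connector, the successor of the memsql-spark interface; the file path, the added column, and the database.table name are placeholders):

import static org.apache.spark.sql.functions.current_timestamp;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

// Read the CSV into a DataFrame (placeholder path).
Dataset<Row> csv = spark.read()
        .option("header", "true")
        .csv("/path/to/input.csv");

// Example enrichment: tag each row with a load timestamp.
Dataset<Row> enriched = csv.withColumn("loaded_at", current_timestamp());

// Write to a SingleStore/MemSQL table (placeholder database.table).
enriched.write()
        .format("singlestore")
        .mode(SaveMode.Append)
        .save("my_database.my_table");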
Also attaching a link for your reference: spark-singlestore.

Logging Spark Configuration Properties

I'm trying to log the properties for each Spark application that runs in one YARN cluster (properties like spark.shuffle.compress, spark.reducer.maxMbInFlight, spark.executor.instances, and so on).
However, I don't know if this information is logged anywhere. I know that we can access the YARN logs through the "yarn" command, but the properties I'm talking about are not stored there.
Is there any way to access this kind of info? The idea is to have a trace of all the applications that run in the cluster together with their properties, to identify which ones have the most impact on their execution time.
You could log it yourself... use sc.getConf.toDebugString, sqlContext.getConf("") or sqlContext.getAllConfs.
scala> sqlContext.getConf("spark.sql.shuffle.partitions")
res129: String = 200
scala> sqlContext.getAllConfs
res130: scala.collection.immutable.Map[String,String] = Map(hive.server2.thrift.http.cookie.is.httponly -> true, dfs.namenode.resource.check.interval ....
scala> sc.getConf.toDebugString
res132: String =
spark.app.id=local-1449607289874
spark.app.name=Spark shell
spark.driver.host=10.5.10.153
Edit: However, I could not find the properties you specified among the 1200+ properties in sqlContext.getAllConfs :( Otherwise the documentation says:
The application web UI at http://<driver>:4040 lists Spark properties in the "Environment" tab. This is a useful place to check to make sure that your properties have been set correctly. Note that only values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. For all other configuration properties, you can assume the default value is used.
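If the goal is a per-application trace in the YARN logs, a hedged sketch (in Java, assuming a SparkSession named spark) of dumping the resolved configuration at application start:

// Hedged sketch: print every resolved Spark property at startup so it lands in
// the driver's YARN container log for that application. Note that any secrets
// set through configuration would be printed as well, so filter if needed.
for (scala.Tuple2<String, String> entry : spark.sparkContext().getConf().getAll()) {
    System.out.println(entry._1() + "=" + entry._2());
}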
