Path for persistent, managed hive tables in Spark 1.4 - apache-spark

I am new to Spark and am working on the JavaSqlNetworkWordCount example, trying to append the word count to a persistent table. I understand that I can only do this via HiveContext. HiveContext, however, keeps trying to save the table in /user/hive/warehouse/. I have tried changing the path by adding
hiveContext.setConf("hive.metastore.warehouse.dir", "/home/user_name");
and by adding the property
<property><name>hive.metastore.warehouse.dir</name>
<value>/home/user_name</value></property>
to $SPARK_HOME/conf/hive-site.xml, but nothing seems to work. If anyone else has faced this problem, please let me know if/how you resolved it. I am using Spark 1.4 on my local RHEL5 machine.

I think I solved the problem. It looks like spark-submit was creating a metastore_db directory in the root directory of the jar file. If metastore_db exists, then the hive-site.xml values are ignored. As soon as I removed that directory, the code picked up the values from hive-site.xml. I still cannot set the value of the hive.metastore.warehouse.dir property from the code, though.
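A minimal sketch of the two steps described above, in plain Python rather than Spark code: writing the hive.metastore.warehouse.dir property into a hive-site.xml, and checking whether a leftover metastore_db directory is present before submitting. The helper names write_hive_site and has_stale_metastore, and the directories passed to them, are hypothetical and not part of Spark or Hive. That a stale metastore_db causes hive-site.xml to be ignored is the observation reported above, not something this sketch verifies.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_hive_site(conf_dir, warehouse_dir):
    """Write a minimal hive-site.xml that sets hive.metastore.warehouse.dir."""
    root = ET.Element("configuration")
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = "hive.metastore.warehouse.dir"
    ET.SubElement(prop, "value").text = warehouse_dir
    path = os.path.join(conf_dir, "hive-site.xml")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    return path


def has_stale_metastore(work_dir):
    """Return True if a metastore_db directory already exists in work_dir."""
    return os.path.isdir(os.path.join(work_dir, "metastore_db"))


if __name__ == "__main__":
    # Hypothetical usage: a temporary directory stands in for $SPARK_HOME/conf.
    conf_dir = tempfile.mkdtemp()
    print(write_hive_site(conf_dir, "/home/user_name"))
    print(has_stale_metastore(os.getcwd()))
```

If has_stale_metastore returns True for the directory spark-submit runs from, removing that directory first matches the fix described above.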

Related

How can I install flashtext on every executor?

I am using the flashtext library in a couple of UDFs. It works when I run it locally in client mode, but once I try to run it in the Cloudera Workbench with several executors, I get a ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help, the same error remains.
According to the Spark Configuration docs, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand what the py-files to be added would have to look like.
Would it be enough to just create a Python script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way to install flashtext on every node?
Edit:
In the meantime, I figured out that not only flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. In order to fix it, I followed this article. I also took the source code of flashtext and included it in the main file without installing the actual library.
I think in order to point Spark executors to python modules extracted from your archive, you will need to add another config setting, that adds their location to PYTHONPATH. Something like this:
SparkSession.builder \
.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
.config('spark.executorEnv.PYTHONPATH', './myUDFs')
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName]...Add the environment
variable specified by EnvironmentVariableName to the Executor process.
The user can specify multiple of these to set multiple environment
variables.
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench), but if you're trying to run Spark on YARN, you may need to use a slightly different setting, spark.yarn.dist.archives.
Also, please make sure that your driver log contains a message confirming that the archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:
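On the spark.submit.pyFiles question: that setting takes a comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH of Python apps. A minimal sketch, using only the standard library, of building such a zip from a local package directory so that the package name is the top-level entry in the archive. The function name zip_package and the example paths are hypothetical, and whether this alone is enough on Cloudera Workbench is an assumption.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip the .py files of a package directory, keeping the package name as the top-level folder."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for folder, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(folder, name)
                    # Archive names are relative to the parent directory,
                    # so the zip contains e.g. flashtext/__init__.py.
                    zf.write(full, os.path.relpath(full, parent))
    return zip_path
```

The resulting file could then be shipped with spark.sparkContext.addPyFile('flashtext.zip'), or listed in spark.submit.pyFiles, instead of relying on the conda archive.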

Azure Databricks - Can not create the managed table The associated location already exists

I have the following problem in Azure Databricks. Sometimes when I try to save a DataFrame as a managed table:
SomeData_df.write.mode('overwrite').saveAsTable("SomeData")
I get the following error:
"Can not create the managed table('SomeData'). The associated
location('dbfs:/user/hive/warehouse/somedata') already exists.;"
I used to fix this problem by running a %fs rm command to remove that location but now I'm using a cluster that is managed by a different user and I can no longer run rm on that location.
For now the only fix I can think of is using a different table name.
What makes things even more peculiar is the fact that the table does not exist. When I run:
%sql
SELECT * FROM SomeData
I get the error:
Error in SQL statement: AnalysisException: Table or view not found:
SomeData;
How can I fix it?
Seems there are a few others with the same issue.
A temporary workaround is to use
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData/", true)
to remove the table before re-creating it.
This generally happens when a cluster is shut down while writing a table. The recommended solution from the Databricks documentation:
This flag deletes the _STARTED directory and returns the process to the original state. For example, you can set it in the notebook
%py
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
All of the other recommended solutions here are either workarounds or do not work. The mode is specified as overwrite, meaning you should not need to delete or remove the db or use legacy options.
Instead, try specifying the fully qualified path in the options when writing the table:
df.write \
.option("path", "hdfs://cluster_name/path/to/my_db") \
.mode("overwrite") \
.saveAsTable("my_db.my_table")
For a more context-free answer, run this in your notebook:
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData", recurse=True)
Per Databricks's documentation, this will work in a Python or Scala notebook, but you'll have to use the magic command %python at the beginning of the cell if you're using an R or SQL notebook.
I have the same issue. I am using
create table if not exists USING delta
If I first delete the files as suggested, it creates the table once, but the second time the problem repeats. It seems that create table if not exists does not recognize the table and tries to create it anyway.
I don't want to delete the table every time; I'm actually trying to use MERGE and keep the table.
Well, this happens because you're trying to write data to the default location (without specifying the 'path' option) with the mode 'overwrite'.
As Mike said, you can set "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" to "true", but this option was removed in Spark 3.0.0.
If you try to set this option in Spark 3.0.0 you will get the following exception:
Caused by: org.apache.spark.sql.AnalysisException: The SQL config 'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation' was removed in the version 3.0.0. It was removed to prevent loosing of users data for non-default value.;
To avoid this problem you can explicitly specify the path where you're going to save with the 'overwrite' mode.

Hdfs file access in spark

I am developing an application where I read a file from Hadoop, process it, and store the data back to Hadoop.
I am confused about what the proper HDFS file path format should be. When reading an HDFS file from the Spark shell like
val file=sc.textFile("hdfs:///datastore/events.txt")
it works fine and I am able to read it.
But when I submit a jar containing the same code to YARN, it fails with the error
org.apache.hadoop.HadoopIllegalArgumentException: Uri without authority: hdfs:/datastore/events.txt
When I add the namenode address, as in hdfs://namenodeserver/datastore/events.txt, everything works.
I am a bit confused about this behaviour and need some guidance.
Note: I am using an AWS EMR setup and all the configurations are default.
If you want to use sc.textFile("hdfs://...") you need to give the full (absolute) path; in your example that would be "nn1home:8020/.."
If you want to keep it simple, just use sc.textFile("hdfs:/input/war-and-peace.txt")
That's only one /
I think it will work.
Problem solved. As I debugged further, I found that the fs.defaultFS property from core-site.xml was not used when I just passed the path as hdfs:///path/to/file, even though all the Hadoop config properties were loaded (as I saw by logging the sparkContext.hadoopConfiguration object).
As a workaround I manually read the property with sparkContext.hadoopConfiguration().get("fs.defaultFS") and prepended it to the path.
I don't know whether this is the correct way of doing it.
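The workaround above amounts to filling in the missing authority of the hdfs URI from the default filesystem. A minimal sketch of that step in plain Python; the function name qualify_hdfs_path is hypothetical, and default_fs stands for whatever value fs.defaultFS holds on the cluster (the host and port in the usage line are assumptions for illustration).

```python
from urllib.parse import urlsplit, urlunsplit


def qualify_hdfs_path(path, default_fs):
    """Fill in the authority of an hdfs URI from default_fs when it is missing."""
    parts = urlsplit(path)
    if parts.scheme != "hdfs" or parts.netloc:
        # Not an hdfs URI, or it already names a namenode: leave it untouched.
        return path
    default = urlsplit(default_fs)
    return urlunsplit(
        (default.scheme, default.netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    # Hypothetical namenode address; in practice it comes from fs.defaultFS.
    print(qualify_hdfs_path("hdfs:///datastore/events.txt", "hdfs://namenodeserver:8020"))
```

Paths that already carry a namenode, and paths with other schemes, pass through unchanged, so the helper can be applied to every input path.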

Spark: How to overwrite data in partitions but not the root folder while saving to disk?

Regarding the following code:
spark.sql(sqlStatement).write.partitionBy("city", "dataset", "origin").mode(SaveMode.Overwrite).parquet(rootPath)
It deletes everything under the rootPath before writing data to it. If the code is changed to:
spark.sql(sqlStatement).write.partitionBy("city", "dataset", "origin").mode(SaveMode.Append).parquet(rootPath)
then it does not delete anything. What we want is a mode that will not delete the data under rootPath but delete the data under a city/dataset/origin before writing to it. How can this be done?
Try the basePath option. Partition discovery will then only be pointed towards children of '/city/dataset/origin'.
According to the documentation:
Spark SQL’s partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if
path="/my/data/x=1" then x=1 will no longer be considered a partition
but only children of x=1.) This behavior can be overridden by manually
specifying the basePath that partitioning discovery should start with
(SPARK-11678).
spark.sql(sqlStatement)\
.write.partitionBy("city", "dataset","origin")\
.option("basePath","/city/dataset/origin") \
.mode(SaveMode.Append).parquet(rootPath)
Let me know if this doesn't work. I'll remove my answer.
Have a look at the spark.sql.sources.partitionOverwriteMode="dynamic" setting, which was introduced in Spark 2.3.0.
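To make the difference concrete, here is a small pure-Python model (not Spark code) of which partition directories an overwrite replaces. With the default static mode, an overwrite to rootPath clears every existing partition; with dynamic mode, only partitions that receive rows in the current write are replaced. The function name partitions_replaced and the sample partition values are hypothetical.

```python
def partitions_replaced(existing, incoming, mode="static"):
    """Model which partition directories an overwrite removes before writing.

    existing and incoming are sets of (city, dataset, origin) tuples.
    """
    if mode == "static":
        # Static overwrite clears everything under the root path.
        return set(existing)
    if mode == "dynamic":
        # Dynamic overwrite only touches partitions present in the new data.
        return set(existing) & set(incoming)
    raise ValueError("mode must be 'static' or 'dynamic'")


if __name__ == "__main__":
    existing = {("nyc", "a", "x"), ("sf", "a", "x")}
    incoming = {("nyc", "a", "x"), ("la", "b", "y")}
    print(partitions_replaced(existing, incoming, "dynamic"))
```

In Spark 2.3.0 or later the real setting would be applied with spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") before the write, keeping SaveMode.Overwrite.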

Spark job keeps having output folder already exists exception

I am running a Spark job, and it keeps failing with "output folder already exists" exceptions. I did remove the output folder before the job. It looks like the folder is created during the job and this confuses other nodes/threads. It happens randomly, not every time.
rdd.write().format("parquet").mode(SaveMode.Overwrite).save("location");
This should solve the issue of file already exists.
If you are using a local filesystem path, then be aware that the folder gets created on all workers. So you probably have to delete it from all of them.
