Error While inserting rows into Kudu using Spark Shell - apache-spark

I am new to Apache Kudu. I installed it on my Ubuntu system and later created a table in it using the Apache Spark shell. Now I am trying to insert data into that table using insertRows(), with the command below:
kuduContext.insertRows(customersDF, "spark_kudu_tbl")
where customersDF is a DataFrame and spark_kudu_tbl is a table in the Kudu database. I am getting the error below:
java.lang.NoSuchMethodError: org.apache.kudu.spark.kudu.KuduContext.insertRows(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;)V
... 70 elided
I have tried different options but none of them worked for me. Can anyone suggest a solution?

From the error message it appears that you are using the wrong kudu-spark artifact; you should use kudu-spark2_2.11. Please start your spark-shell as below (replace the last bit with your Kudu version):
spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.3.0
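If it helps to cross-check the insert itself, here is a minimal sketch of the same write through the Kudu data source in PySpark. The Kudu master address localhost:7051 and the sample rows are assumptions, and depending on your kudu-spark version the DataFrame write path may behave slightly differently from KuduContext.insertRows:
from pyspark.sql import SparkSession

# Assumes the session was started with the matching kudu-spark2_2.11 package
spark = SparkSession.builder.appName("kudu-insert-check").getOrCreate()

# A couple of sample rows, just for illustration
customersDF = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")],
    ["customer_id", "name"])

# "append" mode inserts the rows into the existing Kudu table
customersDF.write \
    .format("org.apache.kudu.spark.kudu") \
    .option("kudu.master", "localhost:7051") \
    .option("kudu.table", "spark_kudu_tbl") \
    .mode("append") \
    .save()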

Related

How can I install flashtext on every executor?

I am using the flashtext library in a couple of UDFs. It works when I run it locally in client mode, but once I try to run it in the Cloudera Workbench with several executors, I get a ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help, the same error remains.
According to the Spark Configuration doc, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand what these py-files to be added would have to look like.
Would it be enough to just create a Python script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way how I can install flashtext on every node?
Edit:
In the meantime, I figured out that not only Flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. In order to fix it, I followed this article. I also took the source code from Flashtext and imported it into the main file without installing the actual library.
I think that in order to point Spark executors to the Python modules extracted from your archive, you will need to add another config setting that adds their location to PYTHONPATH. Something like this:
SparkSession.builder \
.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
.config('spark.executorEnv.PYTHONPATH', './myUDFs')
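For completeness, a minimal runnable sketch of this idea (the archive name and the myUDFs alias are the ones from above; depending on how the archive is laid out internally, the PYTHONPATH value may need to point at a subdirectory of ./myUDFs):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Ship the archive to every executor, unpack it as ./myUDFs in the
# container working directory, and put that directory on the executors' PYTHONPATH.
spark = (
    SparkSession.builder
    .config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs')
    .config('spark.executorEnv.PYTHONPATH', './myUDFs')
    .getOrCreate()
)

# Import inside the UDF so the import happens on the executor,
# once the archive has been distributed.
@udf(returnType=StringType())
def first_keyword(text):
    from flashtext import KeywordProcessor
    kp = KeywordProcessor()
    kp.add_keyword('spark')
    hits = kp.extract_keywords(text or '')
    return hits[0] if hits else None

df = spark.createDataFrame([("spark on yarn",), ("plain text",)], ["text"])
df.select(first_keyword("text").alias("keyword")).show()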
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName]...Add the environment
variable specified by EnvironmentVariableName to the Executor process.
The user can specify multiple of these to set multiple environment
variables.
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench), but if you're trying to run Spark on YARN, you may need to use a slightly different setting, spark.yarn.dist.archives.
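A sketch of that YARN variant, under the same assumptions about the archive name and alias:
spark = (
    SparkSession.builder
    .config('spark.yarn.dist.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs')
    .config('spark.executorEnv.PYTHONPATH', './myUDFs')
    .getOrCreate()
)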
Also, please make sure that your driver log contains a message confirming that the archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:

Failed to execute 'table' on org.apache.spark.sql.SparkSession

I have a Spark + Hive application.
It works fine. But at some point I had to create another Hive environment.
So I ran show create table ... and recreated the same view (with the underlying tables), and added some data.
I can query the data from the Hive CLI, etc.,
but whenever I run my application it fails with
ERROR Failed to execute 'table' on 'org.apache.spark.sql.SparkSession' with args=([Type=java.lang.String, Value: <view name>])
I believe it refers to the line of code where I call sparkSession.table(<view-name>).
What steps can I take to troubleshoot such an issue?
UPD
Session declaration (I have also tried to create a session without this configuration):
.Config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
.Config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "some.file")
.Config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
.Config("spark.sql.debug.maxToStringFields", int64 2048)
.Config("spark.debug.maxToStringFields", int64 2048)
Maybe a bit trivial, but when it comes to troubleshooting this kind of issue, really try to get to the root of the problem with a minimal setup:
I generally start off by starting the spark-shell.
Check whether it is possible to run spark.sql("SHOW DATABASES").show(20, false). If this fails, it's probably something with your Hive configuration, indeed.
Try and see whether you can run spark.table("your_table"). If not, it'll probably give you a clearer error (such as Table or view not found: ...).
If all of the above works, try to strip your application such that it only does that spark.table, which did work in your spark-shell at that point in time. If that suddenly doesn't work, it might have to do with how the SparkSession is created in your application.
If that works, try to uncomment your code piece by piece until you're back at your original code, to better pinpoint where it fails.
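If the spark-shell checks pass but the application still fails, one common culprit is a SparkSession built without Hive support. A minimal sketch of the stripped-down check described above (the view name is a placeholder for yours):
from pyspark.sql import SparkSession

# enableHiveSupport() makes the session use the Hive metastore;
# without it, spark.table() only sees Spark's own default catalog.
spark = (
    SparkSession.builder
    .appName("table-check")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show(20, False)  # is the metastore reachable?
spark.table("your_view_name").show(5)        # can the view be resolved?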

Push in existing local table failure (windows): InvalidRegionNumberException then IllegalArgumentException

I want to push data into an already existing table with a single column family and no records.
I am using shc-core:1.1.1-2.1-s_2.11 on a Windows machine. I have HBase 1.2.6 installed and use Scala 2.11.8.
When I try to push data, I first got the following error: org.apache.spark.sql.execution.datasources.hbase.InvalidRegionNumberException: Number of regions specified for new table must be greater than 3.
After following the advice of this link https://github.com/hortonworks-spark/shc/issues/249#issue-318285217, I added HBaseTableCatalog.newTable -> "5" to my options.
It still failed, but with: java.lang.IllegalArgumentException: Can not create a Path from a null string.
Following this link https://github.com/hortonworks-spark/shc/issues/151#issuecomment-313800739, I added "tableCoder":"PrimitiveType" to my catalog.
I am still facing the same error.
I saw people are expecting some clarification about this issue (https://github.com/hortonworks-spark/shc/issues/249#issuecomment-463528032).
It is a known issue and apparently it has been fixed (https://github.com/hortonworks-spark/shc/issues/155#issuecomment-315236736).
I do not know what to do next.
Is there a solution for this?

Azure Databricks - Can not create the managed table The associated location already exists

I have the following problem in Azure Databricks. Sometimes when I try to save a DataFrame as a managed table:
SomeData_df.write.mode('overwrite').saveAsTable("SomeData")
I get the following error:
"Can not create the managed table('SomeData'). The associated
location('dbfs:/user/hive/warehouse/somedata') already exists.;"
I used to fix this problem by running a %fs rm command to remove that location, but now I'm using a cluster that is managed by a different user and I can no longer run rm on that location.
For now, the only fix I can think of is using a different table name.
What makes things even more peculiar is the fact that the table does not exist. When I run:
%sql
SELECT * FROM SomeData
I get the error:
Error in SQL statement: AnalysisException: Table or view not found:
SomeData;
How can I fix it?
It seems there are a few others with the same issue.
A temporary workaround is to use
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData/", true)
to remove the table before re-creating it.
This generally happens when a cluster is shut down while a table is being written. The recommended solution from the Databricks documentation:
This flag deletes the _STARTED directory and returns the process to the original state. For example, you can set it in the notebook:
%py
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
All of the other recommended solutions here are either workarounds or do not work. The mode is specified as overwrite, meaning you should not need to delete or remove the db or use legacy options.
Instead, try specifying the fully qualified path in the options when writing the table:
df.write \
.option("path", "hdfs://cluster_name/path/to/my_db") \
.mode("overwrite") \
.saveAsTable("my_db.my_table")
For a more context-free answer, run this in your notebook:
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData", recurse=True)
Per Databricks's documentation, this will work in a Python or Scala notebook, but you'll have to use the magic command %python at the beginning of the cell if you're using an R or SQL notebook.
I have the same issue. I am using
create table if not exists USING delta
If I first delete the files like suggested, it creates the table once, but the second time the problem repeats. It seems create table if not exists does not recognize the table and tries to create it anyway.
I don't want to delete the table every time; I'm actually trying to use MERGE and keep the table.
Well, this happens because you're trying to write data to the default location (without specifying the 'path' option) with the mode 'overwrite'.
As Mike said, you can set "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" to "true", but this option was removed in Spark 3.0.0.
If you try to set this option in Spark 3.0.0 you will get the following exception:
Caused by: org.apache.spark.sql.AnalysisException: The SQL config 'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation' was removed in the version 3.0.0. It was removed to prevent loosing of users data for non-default value.;
To avoid this problem you can explicitly specify the path where you're going to save with the 'overwrite' mode.
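Concretely, for the DataFrame from the question this would look something like the sketch below (the DBFS path is only an example location; point it at your own storage):
SomeData_df.write \
    .option("path", "dbfs:/some/explicit/location/somedata") \
    .mode("overwrite") \
    .saveAsTable("SomeData")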

Unable to type anything in spark console

On my machine, when I start the Spark console in the command prompt, I am unable to type anything in it. When I type something it is not visible, and when I press Enter or Control + Enter, it shows the result of what was typed. But whatever is typed is not visible.
Please suggest a remedy for this.
The Spark version I am using is Spark 1.2.0.
It also shows me this line while starting Spark in the command prompt:
Failed to created SparkJLineReader: java.io.IOException: No such file or directory.
Could that be a problem?
The issue seems to be related to permissions on the home directory. The same issue is resolved here.
