I want to push data into an existing HBase table that has a single column family and no records.
I am using shc-core:1.1.1-2.1-s_2.11 on a Windows machine, with HBase 1.2.6 installed and Scala 2.11.8.
When I first tried to push data, I got the following error: org.apache.spark.sql.execution.datasources.hbase.InvalidRegionNumberException: Number of regions specified for new table must be greater than 3.
Following the advice in https://github.com/hortonworks-spark/shc/issues/249#issue-318285217, I added HBaseTableCatalog.newTable -> "5" to my options.
It still failed, but with a different error: java.lang.IllegalArgumentException: Can not create a Path from a null string.
Following https://github.com/hortonworks-spark/shc/issues/151#issuecomment-313800739, I added "tableCoder":"PrimitiveType" to my catalog.
I am still facing the same error.
I have seen other people asking for clarification on this issue (https://github.com/hortonworks-spark/shc/issues/249#issuecomment-463528032).
It is a known issue and it apparently seems to be fixed (https://github.com/hortonworks-spark/shc/issues/155#issuecomment-315236736).
I do not know what to do next.
Is there a solution for this?
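For reference, the catalog and the two options fit together roughly like this (a PySpark-style sketch with placeholder table and column names; the plain string keys "catalog" and "newtable" are assumed to correspond to HBaseTableCatalog.tableCatalog and HBaseTableCatalog.newTable):
# Placeholder catalog: a single column family "cf1" and a string row key.
# The "tableCoder":"PrimitiveType" entry is the addition suggested in issue #151.
catalog = """{
  "table": {"namespace": "default", "name": "my_table", "tableCoder": "PrimitiveType"},
  "rowkey": "key",
  "columns": {
    "key":  {"cf": "rowkey", "col": "key",  "type": "string"},
    "col1": {"cf": "cf1",    "col": "col1", "type": "string"}
  }
}"""

# df stands for the DataFrame being written; "newtable" is the option
# added following issue #249.
df.write \
    .options(catalog=catalog, newtable="5") \
    .format("org.apache.spark.sql.execution.datasources.hbase") \
    .save()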
I have the following problem in Azure Databricks. Sometimes when I try to save a DataFrame as a managed table:
SomeData_df.write.mode('overwrite').saveAsTable("SomeData")
I get the following error:
"Can not create the managed table('SomeData'). The associated
location('dbfs:/user/hive/warehouse/somedata') already exists.;"
I used to fix this problem by running a %fs rm command to remove that location, but now I'm using a cluster managed by a different user and I can no longer run rm on that location.
For now the only fix I can think of is using a different table name.
What makes things even more peculiar is the fact that the table does not exist. When I run:
%sql
SELECT * FROM SomeData
I get the error:
Error in SQL statement: AnalysisException: Table or view not found:
SomeData;
How can I fix it?
It seems a few others are having the same issue.
A temporary workaround is to use
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData/", true)
to remove the table before re-creating it.
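Putting the whole sequence together, a sketch of the workaround might look like this (table name and warehouse path are the ones from the question; the DROP TABLE line is only there to clear any stale metastore entry, so adjust everything to your own names):
# Sketch of the full workaround, using the names from the question.
spark.sql("DROP TABLE IF EXISTS SomeData")                   # clear any stale metastore entry
dbutils.fs.rm("dbfs:/user/hive/warehouse/somedata/", True)   # remove the leftover files at the reported location
SomeData_df.write.mode("overwrite").saveAsTable("SomeData")  # re-create the managed table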
This generally happens when a cluster is shut down while a table is being written. The recommended solution from the Databricks documentation is to set spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation to true.
This flag deletes the _STARTED directory and returns the process to the original state. For example, you can set it in the notebook:
%py
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
All of the other recommended solutions here are either workarounds or do not work. The mode is specified as overwrite, meaning you should not need to delete or remove the database or use legacy options.
Instead, try specifying the fully qualified path in the options when writing the table:
df.write \
.option("path", "hdfs://cluster_name/path/to/my_db") \
.mode("overwrite") \
.saveAsTable("my_db.my_table")
For a more context-free answer, run this in your notebook:
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData", recurse=True)
Per Databricks's documentation, this will work in a Python or Scala notebook, but you'll have to use the magic command %python at the beginning of the cell if you're using an R or SQL notebook.
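For example, a cell in a SQL or R notebook would look something like this:
%python
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData", recurse=True)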
I have the same issue; I am using
create table if not exists USING delta
If I first delete the files like suggested, it creates the table once, but the second time the problem repeats. It seems that create table if not exists does not recognize the existing table and tries to create it anyway.
I don't want to delete the table every time; I'm actually trying to use MERGE and keep the table.
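For context, the pattern I'm trying to get working is roughly the following (a sketch with placeholder names; updates_view stands for a temporary view over the incoming DataFrame):
# Idempotent create: only creates the Delta table if it is not already there.
spark.sql("""
    CREATE TABLE IF NOT EXISTS some_db.some_table (id INT, value STRING)
    USING delta
""")

# Upsert the new rows instead of overwriting the whole table.
spark.sql("""
    MERGE INTO some_db.some_table AS target
    USING updates_view AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")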
Well, this happens because you're trying to write data to the default location (without specifying the 'path' option) with the mode 'overwrite'.
As Mike said, you can set "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" to "true", but this option was removed in Spark 3.0.0.
If you try to set this option in Spark 3.0.0 you will get the following exception:
Caused by: org.apache.spark.sql.AnalysisException: The SQL config 'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation' was removed in the version 3.0.0. It was removed to prevent loosing of users data for non-default value.;
To avoid this problem you can explicitly specify the path where you're going to save with the 'overwrite' mode.
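For example (the path below is only an illustration; point it at wherever your external data should live):
# Writing with an explicit path makes the table external, so the
# managed-table location check no longer applies.
df.write \
    .option("path", "dbfs:/mnt/my_storage/some_db/some_table") \
    .mode("overwrite") \
    .saveAsTable("some_db.some_table")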
I am new to Apache Kudu. I installed it on my Ubuntu system and later created a table in it using the Apache Spark shell. Now I am trying to insert data into that table using insertRows(), with the command below:
kuduContext.insertRows(customersDF, "spark_kudu_tbl")
Here, customersDF is a DataFrame and spark_kudu_tbl is a table in the Kudu database. I am getting the error below:
java.lang.NoSuchMethodError: org.apache.kudu.spark.kudu.KuduContext.insertRows(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;)V
... 70 elided
I have tried different options but none of them have worked. Can anyone suggest a solution?
From the error message it appears that you are using the wrong kudu-spark artifact; you should use kudu-spark2_2.11. Please start your spark-shell as below (replace the last bit with your Kudu version):
spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.3.0
I was trying to create a user-defined function using DevCenter.
http://docs.datastax.com/en/cql/3.3/cql/cql_using/useCreateUDF.html
The following example was taken from the above link.
CREATE OR REPLACE FUNCTION fLog (input double) CALLED ON NULL INPUT RETURNS double LANGUAGE java AS 'return Double.valueOf(Math.log(input.doubleValue()));';
It works in cqlsh but fails in DevCenter with the error "No viable alternative at input 'input'". Based on how fast I get the response, I think DevCenter does a local syntax check and aborts execution because it thinks the query is wrong.
I am using Cassandra 2.2.3 and DevCenter 1.4.1.
DevCenter does perform its own grammar syntax checking as well as semantic validation.
This is a DevCenter bug: "input" is also a CQL keyword, and the grammar is incorrectly rejecting it as a parameter name in this context.
A suggested workaround until a fix is available is to change the name of the parameter, e.g. to "inputvalue".
As the error says, "No viable alternative at input 'input'".
It seems 'input' is treated as a reserved keyword in DevCenter but not in cqlsh.
Try using another name for the variable.
Setup: I am using Python 3.3 on a Windows 2012 client.
I have a SELECT query running through pyodbc which is not returning any results via fetchall(). I know the query works fine because I can take it out and run it from Microsoft SQL Server Management Studio without any issues.
I can also remove one column from the select list and the query will return results. For the database row in question, that column contains a large amount of XML data (> 10,000 characters), so it seems as though some buffer overflow issue is causing fetchall() to fail, even though it doesn't throw any exceptions. I have googled around and seen rumors of a config option to raise the buffer size, but I haven't been able to nail down exactly how to set it, or what a workaround would be.
Is there a configuration option that I can use, or any alternative to pyodbc?
Disclaimer: I have only been using Python for about two weeks now, so I am still quite the noob; I have made every attempt to research my problems thoroughly, but this one has proven to be elusive.
On a side note, I tried using odbc instead of pyodbc, but the same query throws the oddball error below, which Google isn't helping me solve either:
[ERROR] An exception while executing the Select query: [][Negative size passed to PyBytes_FromStringAndSize]
It seems this issue was resolved by changing my SQL connection string
FROM:
DRIVER={SQL Server Native Client 11.0}
TO:
DRIVER={SQL Server}
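For completeness, the working setup looks something like this (server, database, credentials, and the query are placeholders):
import pyodbc

# The older "SQL Server" driver returned the large XML column fine,
# whereas "SQL Server Native Client 11.0" silently gave me no rows.
conn = pyodbc.connect(
    "DRIVER={SQL Server};"
    "SERVER=myserver;"
    "DATABASE=mydb;"
    "UID=myuser;"
    "PWD=mypassword"
)

cursor = conn.cursor()
cursor.execute("SELECT id, big_xml_column FROM some_table")
rows = cursor.fetchall()
print(len(rows))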
I have a counter column family in Cassandra. When I try to view the data from CQL I get an error, even though there is data in the column family.
SELECT * from userstats;
Generates the following error:
'int' object has no attribute 'replace'
I can confirm that the data is in the column family and is working properly, since I can view it with the DataStax OpsCenter data explorer.
It sounds like you're using an older version of cqlsh. Upgrading it (just copying the bin/cqlsh file from the Cassandra 1.1 branch head, along with everything under the pylib directory, into place) ought to solve this.
If it doesn't, running cqlsh with --debug would help a lot in diagnosing the problem.