How to create a SQL table inside the project path (instead of outside it) in Azure Data Lake from Databricks?

I am working on a project in which I want to create a SQL table and save it to a project path. Here is what my project path looks like:
abfss://dev@xyz.dfs.core.windows.net/hkay/project_name/
When I use the SQL code below in Databricks, it saves the data in another folder instead of the project path given above. I tried using LOCATION {project_path} inside the CREATE statement, but it failed due to wrong syntax.
spark.sql(f"""
CREATE OR REPLACE TABLE {database}.table_name
SELECT * FROM {database}.table_name_temp
WHERE 1=0
""")
This creates a folder in abfss://dev@xyz.dfs.core.windows.net/hkay/ outside the project path, which I don't want. Any idea?
Edit 1:
This is what I was using. Not sure if the syntax is correct.
spark.sql(f"""
CREATE {database}.table_name
LOCATION 'abfss://dev@xyz.dfs.core.windows.net/hkay/project_name/'
SELECT * FROM {database}.table_name_temp
WHERE 1=0
""")

If you just want to create an empty table with the same structure as another table, then you need to use slightly different syntax (see docs) - note the AS clause:
CREATE TABLE database.table_name
USING delta
LOCATION 'abfss://dev@xyz.dfs.core.windows.net/hkay/project_name/'
AS SELECT * FROM database.table_name_temp LIMIT 0
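If you want to keep the database name and project path parameterized, as in the spark.sql attempts above, the same statement can be issued from Python. This is a minimal sketch; the database and project_path values below are placeholders for your own values:
database = "database"  # placeholder database name
project_path = "abfss://dev@xyz.dfs.core.windows.net/hkay/project_name/"  # placeholder project path
spark.sql(f"""
CREATE TABLE {database}.table_name
USING delta
LOCATION '{project_path}'
AS SELECT * FROM {database}.table_name_temp LIMIT 0
""")
Note that giving the table an explicit LOCATION makes it an external (unmanaged) table, so dropping the table later will not delete the files under the project path.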

Related

Azure Databricks - Can not create the managed table The associated location already exists

I have the following problem in Azure Databricks. Sometimes when I try to save a DataFrame as a managed table:
SomeData_df.write.mode('overwrite').saveAsTable("SomeData")
I get the following error:
"Can not create the managed table('SomeData'). The associated
location('dbfs:/user/hive/warehouse/somedata') already exists.;"
I used to fix this problem by running a %fs rm command to remove that location, but now I'm using a cluster that is managed by a different user and I can no longer run rm on that location.
For now the only fix I can think of is using a different table name.
What makes things even more peculiar is the fact that the table does not exist. When I run:
%sql
SELECT * FROM SomeData
I get the error:
Error in SQL statement: AnalysisException: Table or view not found:
SomeData;
How can I fix it?
Seems there are a few others with the same issue.
A temporary workaround is to use
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData/", true)
to remove the table before re-creating it.
This generally happens when a cluster is shut down while a table is being written. The recommended solution from the Databricks documentation:
This flag deletes the _STARTED directory and returns the process to the original state. For example, you can set it in the notebook:
%py
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
All of the other recommended solutions here are either workarounds or do not work. The mode is specified as overwrite, meaning you should not need to delete or remove the db or use legacy options.
Instead, try specifying the fully qualified path in the options when writing the table:
df.write \
.option("path", "hdfs://cluster_name/path/to/my_db") \
.mode("overwrite") \
.saveAsTable("my_db.my_table")
For a more context-free answer, run this in your notebook:
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData", recurse=True)
Per the Databricks documentation, this will work in a Python or Scala notebook, but you'll have to use the magic command %python at the beginning of the cell if you're using an R or SQL notebook.
I have the same issue; I am using
create table if not exists USING delta
If I first delete the files like suggested, it creates the table once, but the second time the problem repeats. It seems that create table if not exists does not recognize the table and tries to create it anyway.
I don't want to delete the table every time; I'm actually trying to use MERGE and keep the table.
Well, this happens because you're trying to write data to the default location (without specifying the 'path' option) with the mode 'overwrite'.
As Mike said, you can set "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" to "true", but this option was removed in Spark 3.0.0.
If you try to set this option in Spark 3.0.0 you will get the following exception:
Caused by: org.apache.spark.sql.AnalysisException: The SQL config 'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation' was removed in the version 3.0.0. It was removed to prevent loosing of users data for non-default value.;
To avoid this problem you can explicitly specify the path where you're going to save with the 'overwrite' mode.
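As a minimal sketch of that (df stands for whatever DataFrame you are saving, and the abfss path, database, and table names are placeholders; the delta format is assumed):
# Placeholder target location - replace with your own container, account, and path.
target_path = "abfss://container@account.dfs.core.windows.net/path/to/my_db/my_table"
df.write \
    .format("delta") \
    .option("path", target_path) \
    .mode("overwrite") \
    .saveAsTable("my_db.my_table")
Because the data lives at an explicit path rather than in the default warehouse location, the 'overwrite' mode no longer collides with a leftover dbfs:/user/hive/warehouse/... directory.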

Bcp must be installed error when loading AdventureWorks Azure data warehouse

I'm trying to test the Azure Data Warehouse. I successfully created and connected to the database, but I've run into a snag as I attempt to load the tables. I'm trying to execute the following instructions:
To install AdventureWorksSQLDW2012:
-----------------------------------
4. Extract files from AdventureWorksSQLDW2012.zip file into a directory.
5. Edit aw_create.bat setting the following variables:
a. server=<servername> from step 1. e.g. mylogicalserver.database.windows.net
b. user=<username> from step 1 or another user with proper permissions
c. password=<passwordname> for user in step 5b
d. database=<database> created in step 1
e. schema=<schema> this schema will be created if it does not yet exist
6. Run aw_create.bat from a cmd prompt, running from the directory where the files were unzipped to.
This script will...
a. Drop any Adventure Works tables or views that already exist in the schema
b. Create the Adventure Works tables and views in the schema specified
c. Load each table using bcp
d. Validate the row counts for each table
e. Collect statistics on every column for each table
I completed the prerequisites of installing bcp and sqlcmd and used the -? command to confirm the installations.
Unfortunately, when I try to complete step 6 above I get the following error:
REM AdventureWorksSQLDW2012 sample database version 3.0 for DW Service Tue 06/27/2017 20:31:01.99 Bcp must be installed.
Has anyone else come across this error, or can anyone suggest a potential solution?
UPDATE: I've also added the path where BCP is located to my PATH environment variable. Still no luck.
The aw_create.bat file contains a line where you need to provide the path to the bcp program. Once I provided it and saved the script, it worked like a charm.

Writing data to the filesystem from Hive queries in HDInsight

I see that it's viable to write query results to the filesystem in Hadoop: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
How do I save a query result, in the case of HDInsight, to a folder that is accessible from blob storage?
I tried something like the below but was not successful.
INSERT OVERWRITE LOCAL DIRECTORY '/example/distinctconsumers' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select consumerid from distinctconsumers;
Thanks
The language manual clearly states:
If the LOCAL keyword is used, Hive will write data to the directory on the local file system.
If you remove 'LOCAL' from your query, then it will work.
NOTE: the result might not be a single file but a list of files (one from each task)

How can I reference another DB from a VS DB project?

We have several databases, say DB1, DB2, DB3 etc.
They have to have an identical code base, so we use a DB project in Visual Studio 2012 and generate a SQL script for deployment based on a comparison between the project and the UAT/Prod DB1. Then this script is applied to DB1-DBn.
For the very first time in the history of this DB project, I had to create a function that contained a hardcoded database name, for example:
inner join DB1.schema.table1 as t1 on
Now the project cannot be built, the comparison cannot be updated, and the script cannot be generated (the Update and Generate Script buttons are disabled) due to a number of errors pertaining to that database reference, as VS seems to believe that DB1 does not exist.
I tried to add a project-level SQLCMD variable $(DB) with a default value of DB1 and use it as
inner join [$(DB1)].schema.table1 as t1 on
to work around the errors, but it did not seem to make any difference.
Edit:
A suggestion was made to add a circular project reference to itself and assign to it the same variable I was trying to add manually; I'm not sure how to accomplish that.
As per this article, the reference should be added to a manually extracted .dacpac file, as follows:
Extracted a .dacpac file from the target DB with the following command:
"C:\Program Files (x86)\Microsoft Visual Studio 11.0\Common7\IDE\Extensions\Microsoft\SQLDB\DAC\120\sqlpackage.exe" /SourcePassword:p /SourceUser:u /Action:Extract /ssn:192.168.2.1 /sdn:DB1 /tf:DB1.dacpac
Included that as a database reference. It automatically assigned the correct SQLCMD variable name and the error disappeared.
From the source control point of view, even though adding a database reference to a .dacpac file automatically creates a SQLCMD variable, it does not add the file to the project. The .dacpac file used still has to be added to the project as an existing item, which is kind of lame. Doing that in Solution Explorer I encountered an error, so I had to do it through Team Explorer instead, where it worked.

In hive how to insert data into a single file

This works:
INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/' SELECT * from table1;
but when we give a command like
INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/sample.csv' SELECT * from table1;
Failed with exception Unable to rename: wasb://incrementalhive-1@crmdbs.blob.core.windows.net/hive/scratch/hive_2015-06-08_10-01-03_930_4881174794406290153-1/-ext-10000 to: wasb:/hiveblob/sample.csv
So, is there any way in which we can insert data into a single file?
I don't think you can tell hive to write to a specific file like wasb:///hiveblob/foo.csv directly.
What you can do is:
Tell hive to merge the output files into one before you run the query.
This way you can have as many reducers as you want and still have a single output file.
Run your query, e.g. INSERT OVERWRITE DIRECTORY ...
Then use dfs -mv within hive to rename the file to whatever.
This is probably less painful than using a separate hadoop fs -getmerge /your/src/folder /your/dest/folder/yourFileName as suggested by Ramzy.
The way to instruct to merge the files may be different depending on the runtime engine you are using.
For example, if you use tez as the runtime engine in your hive queries, you can do this:
-- Set the tez execution engine
-- And instruct to merge the results
set hive.execution.engine=tez;
set hive.merge.tezfiles=true;
-- Your query goes here.
-- The results should end up in wasb:///hiveblob/000000_0 file.
INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/' SELECT * from table1;
-- Rename the output file into whatever you want
dfs -mv 'wasb:///hiveblob/000000_0' 'wasb:///hiveblob/foo.csv'
(The above worked for me with these versions: HDP 2.2, Tez 0.5.2, and Hive 0.14.0)
For MapReduce engine (which is the default), you can try these, although I haven't tried them myself:
-- Try this if you use MapReduce engine.
set hive.execution.engine=mr;
set hive.merge.mapredfiles=true;
You can coerce Hive to build one file by forcing the number of reducers to one. This will copy any fragmented files in one table and combine them in another location in HDFS. Of course, forcing one reducer breaks the benefit of parallelism. If you plan on doing any transformation of the data, I recommend doing that first, then doing this in a final and separate phase.
To produce a single file using hive you can try:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.intermediate=false;
set hive.exec.compress.output=false;
set hive.exec.reducers.max=1;
create table if not exists db.table
stored as textfile as
select * from db.othertable;
db.othertable is the table that has multiple fragmented files. db.table will have a single text file containing the combined data.
By default you will have multiple output files, equal to the number of reducers; that is decided by Hive. However, you can configure the number of reducers (look here). Performance can take a hit if you reduce the reducers, and you will run into longer execution times. Alternatively, once the files are present, you can use getmerge to combine all the files into one file.
hadoop fs -getmerge /your/src/folder /your/dest/folder/yourFileName
The src folder contains all the files to be merged.
