I see that it's possible to write query results to the filesystem in Hadoop: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
How do I save a query result in HDInsight to a folder that is accessible from blob storage?
I tried the following but was not successful:
INSERT OVERWRITE LOCAL DIRECTORY '/example/distinctconsumers' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select consumerid from distinctconsumers;
Thanks
The language manual clearly states:
If the LOCAL keyword is used, Hive will write data to the directory on the local file system.
If you remove 'LOCAL' from your query, then Hive writes to the default file system (which on HDInsight is backed by blob storage) and it will work.
NOTE: the result might not be a single file but a list of files (one from each task)
I have multiple jobs that I want to execute in parallel that append daily data into the same path using dynamic partitioning.
The problem I am facing is the temporary path that gets created by Spark during job execution. Multiple jobs end up sharing the same temp folder, which causes conflicts: one job can delete temp files, and the other job then fails with an error saying an expected temp file doesn't exist.
Can we change the temporary path for each individual job, or is there any alternative way to avoid this issue?
To change the temp location you can do this:
/opt/spark/bin/spark-shell --conf "spark.local.dir=/local/spark-temp"
spark.local.dir changes where all temp files are read and written. I would advise creating this location and setting its permissions from the command line before the first session with this argument is run.
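If each daily job runs as its own Spark application, you can also set this when the session is created, so every job gets its own scratch directory. A minimal sketch in PySpark; the app name and directory paths are placeholders, and note that some cluster managers (e.g. YARN) override spark.local.dir with their own local-dir setting:
from pyspark.sql import SparkSession

# spark.local.dir must be set before the SparkContext starts,
# so pass it when the session for this job is first built.
spark = SparkSession.builder \
    .appName("daily-append-job-A") \
    .config("spark.local.dir", "/local/spark-temp/job_A") \
    .getOrCreate()

# ... build this job's DataFrame and append it with dynamic partitioning ...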
I have the following problem in Azure Databricks. Sometimes when I try to save a DataFrame as a managed table:
SomeData_df.write.mode('overwrite').saveAsTable("SomeData")
I get the following error:
"Can not create the managed table('SomeData'). The associated
location('dbfs:/user/hive/warehouse/somedata') already exists.;"
I used to fix this problem by running a %fs rm command to remove that location but now I'm using a cluster that is managed by a different user and I can no longer run rm on that location.
For now the only fix I can think of is using a different table name.
What makes things even more peculiar is the fact that the table does not exist. When I run:
%sql
SELECT * FROM SomeData
I get the error:
Error in SQL statement: AnalysisException: Table or view not found:
SomeData;
How can I fix it?
Seems there are a few others with the same issue.
A temporary workaround is to use
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData/", true)
to remove the table before re-creating it.
This generally happens when a cluster is shut down while writing a table. The recommended solution from the Databricks documentation:
The following flag deletes the _STARTED directory and returns the process to the original state. For example, you can set it in the notebook:
%py
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
All of the other recommended solutions here are either workarounds or do not work. The mode is specified as overwrite, meaning you should not need to delete or remove the db or use legacy options.
Instead, try specifying the fully qualified path in the options when writing the table:
df.write \
.option("path", "hdfs://cluster_name/path/to/my_db") \
.mode("overwrite") \
.saveAsTable("my_db.my_table")
For a more context-free answer, run this in your notebook:
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData", recurse=True)
Per Databricks's documentation, this will work in a Python or Scala notebook, but you'll have to use the magic command %python at the beginning of the cell if you're using an R or SQL notebook.
I have the same issue. I am using:
create table if not exists USING delta
If I first delete the files like suggested, it creates the table once, but the second time the problem repeats. It seems CREATE TABLE IF NOT EXISTS does not recognize the table and tries to create it anyway.
I don't want to delete the table every time; I'm actually trying to use MERGE and keep the table.
Well, this happens because you're trying to write data to the default location (without specifying the 'path' option) with the mode 'overwrite'.
As Mike said, you can set "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" to "true", but this option was removed in Spark 3.0.0.
If you try to set this option in Spark 3.0.0 you will get the following exception:
Caused by: org.apache.spark.sql.AnalysisException: The SQL config 'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation' was removed in the version 3.0.0. It was removed to prevent loosing of users data for non-default value.;
To avoid this problem, you can explicitly specify the path where you're going to save the data when using the 'overwrite' mode.
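For example, a minimal sketch based on the DataFrame from the question above; the path is a placeholder, and format("delta") reflects the Delta case mentioned in the comment above (drop it if you are not using Delta):
# Writing to an explicit external path avoids the managed-table location check
SomeData_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("path", "dbfs:/mnt/somedata") \
    .saveAsTable("SomeData")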
If I list all the databases in Hive, I get the following result (I have 2 databases, default and sbm):
But if I try to do the same thing in Spark, I get this:
It doesn't show the database SBM.
Are you connected to that Hive metastore? Did you specify the metastore details somewhere (i.e. hive-site.xml in the Spark conf directory)? It seems like you are connected to the local metastore.
I think you need to copy your hive-site.xml to the Spark conf directory.
If you use Ubuntu and have defined the environment variables, use the following command:
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf
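Also, if you create the SparkSession yourself, make sure Hive support is enabled so that Spark reads the metastore described by hive-site.xml rather than spinning up a local one. A minimal sketch in PySpark (the app name is just a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("check-hive-metastore") \
    .enableHiveSupport() \
    .getOrCreate()

# This should now list both databases, default and sbm
spark.sql("SHOW DATABASES").show()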
Does MemSQL support user variables in the load data command, similar to MySQL (see MySQL load NULL values from CSV data for examples)? The MemSQL documentation (https://docs.memsql.com/docs/load-data) doesn't give a clue, and my attempts at using user variables have failed.
No, variables in LOAD DATA are not currently supported in general (as of MemSQL 5.5). This is a feature we are tracking for a future release.
We only support the following syntax to skip the contents of a column in the file using a dummy variable (briefly mentioned in the docs https://docs.memsql.com/docs/load-data):
load data infile 'foo.tsv' into table foo (bar, #, #, baz);
This works:
INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/' SELECT * from table1;
but when we give a command like
INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/sample.csv' SELECT * from table1;
it fails with:
Failed with exception Unable to rename: wasb://incrementalhive-1#crmdbs.blob.core.windows.net/hive/scratch/hive_2015-06-08_10-01-03_930_4881174794406290153-1/-ext-10000 to: wasb:/hiveblob/sample.csv
So, is there any way in which we can insert the data into a single file?
I don't think you can tell Hive to write to a specific file like wasb:///hiveblob/foo.csv directly.
What you can do is:
Tell Hive to merge the output files into one before you run the query.
This way you can have as many reducers as you want and still have a single output file.
Run your query, e.g. INSERT OVERWRITE DIRECTORY ...
Then use dfs -mv within Hive to rename the output file to whatever you want.
This is probably less painful than using a separate hadoop fs -getmerge /your/src/folder /your/dest/folder/yourFileName as suggested by Ramzy.
The way to instruct Hive to merge the files may differ depending on the runtime engine you are using.
For example, if you use Tez as the runtime engine in your Hive queries, you can do this:
-- Set the tez execution engine
-- And instruct to merge the results
set hive.execution.engine=tez;
set hive.merge.tezfiles=true;
-- Your query goes here.
-- The results should end up in wasb:///hiveblob/000000_0 file.
INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/' SELECT * from table1;
-- Rename the output file into whatever you want
dfs -mv 'wasb:///hiveblob/000000_0' 'wasb:///hiveblob/foo.csv'
(The above worked for me with these versions: HDP 2.2, Tez 0.5.2, and Hive 0.14.0)
For the MapReduce engine (which is the default), you can try the following, although I haven't tried it myself:
-- Try this if you use MapReduce engine.
set hive.execution.engine=mr;
set hive.merge.mapredfiles=true;
You can coerce Hive to build a single file by forcing the number of reducers to one. This will copy any fragmented files from one table and combine them at another location in HDFS. Of course, forcing a single reducer loses the benefit of parallelism. If you plan on doing any transformation of the data, I recommend doing that first, and then doing this as a last, separate step.
To produce a single file using Hive, you can try:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.intermediate=false;
set hive.exec.compress.output=false;
set hive.exec.reducers.max=1;
create table if not exists db.table
stored as textfile as
select * from db.othertable;
db.othertable is the table that has multiple fragmented files. db.table will have a single text file containing the combined data.
By default you will have multiple output files, equal to the number of reducers; that is decided by Hive. However, you can configure the number of reducers. Note that performance can take a hit if you reduce the number of reducers, and you may run into longer execution times. Alternatively, once the files are present, you can use getmerge to combine all the files into one file.
hadoop fs -getmerge /your/src/folder /your/dest/folder/yourFileName
The src folder contains all the files to be merged.