In hive how to insert data into a single file - azure

This works:
INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/' SELECT * from table1;
but when we run a command like
INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/sample.csv' SELECT * from table1;
it fails with this exception:
Failed with exception Unable to rename: wasb://incrementalhive-1#crmdbs.blob.core.windows.net/hive/scratch/hive_2015-06-08_10-01-03_930_4881174794406290153-1/-ext-10000 to: wasb:/hiveblob/sample.csv
So, is there any way to insert the data into a single file?

I don't think you can tell hive to write to a specific file like wasb:///hiveblob/foo.csv directly.
What you can do is:
1. Tell hive to merge the output files into one before you run the query. This way you can have as many reducers as you want and still have a single output file.
2. Run your query, e.g. INSERT OVERWRITE DIRECTORY ...
3. Then use dfs -mv within hive to rename the file to whatever you want.
This is probably less painful than running a separate hadoop fs -getmerge /your/src/folder /your/dest/folder/yourFileName, as suggested by Ramzy.
The setting that tells hive to merge the files may differ depending on the execution engine you are using.
For example, if you use tez as the runtime engine in your hive queries, you can do this:
-- Set the tez execution engine
-- And instruct to merge the results
set hive.execution.engine=tez;
set hive.merge.tezfiles=true;
-- Your query goes here.
-- The results should end up in wasb:///hiveblob/000000_0 file.
INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/' SELECT * from table1;
-- Rename the output file into whatever you want
dfs -mv 'wasb:///hiveblob/000000_0' 'wasb:///hiveblob/foo.csv'
(The above worked for me with these versions: HDP 2.2, Tez 0.5.2, and Hive 0.14.0)
For MapReduce engine (which is the default), you can try these, although I haven't tried them myself:
-- Try this if you use MapReduce engine.
set hive.execution.engine=mr;
set hive.merge.mapredfiles=true;

You can coerce hive to build a single file by limiting the number of reducers to one. The query below copies the fragmented files of one table and combines them in another location in HDFS. Of course, forcing one reducer gives up the benefit of parallelism. If you plan on doing any transformation of the data, I recommend doing that first and then doing this as a last, separate phase.
To produce a single file using hive you can try:
set hive.exec.dynamic.partition.mode=nostrict;
set hive.exec.compress.intermediate=false;
set hive.exec.compress.output=false;
set hive.exec.reducers.max=1;
create table if not exists db.table
stored as textfile as
select * from db.othertable;
db.othertable is the table that has multiple fragmented files. db.table will have a single text file containing the combined data.

By default you will get multiple output files, one per reducer. Hive decides the number of reducers, but you can configure it. Performance can take a hit, though: with fewer reducers the query will take longer to run. Alternatively, once the files are present, you can use getmerge to combine all of them into one file:
hadoop fs -getmerge /your/src/folder /your/dest/folder/yourFileName
The src folder contains all the files to be merged.
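To make the merge step concrete, here is a minimal local-filesystem sketch of what a getmerge-style concatenation does. It is not the Hadoop implementation: the function name merge_parts is hypothetical, it works on local paths only, and skipping names that start with "_" or "." (such as a _SUCCESS marker) is an assumption made for this example.

```python
import shutil
from pathlib import Path


def merge_parts(src_dir: Path, dest_file: Path) -> int:
    """Concatenate the part files in src_dir into dest_file, in name order.

    Files whose names start with "_" or "." are skipped (an assumption of
    this sketch), as is dest_file itself if it already lives in src_dir.
    Returns the number of part files that were merged.
    """
    parts = sorted(
        p
        for p in src_dir.iterdir()
        if p.is_file()
        and not p.name.startswith(("_", "."))
        and p.resolve() != dest_file.resolve()
    )
    dest_file.parent.mkdir(parents=True, exist_ok=True)
    with dest_file.open("wb") as out:
        for part in parts:
            # Stream each part so large files are never loaded fully in memory.
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
    return len(parts)
```

For a directory holding 000000_0 and 000001_0, merge_parts(Path("out"), Path("merged/foo.csv")) would write the two files back to back into foo.csv. Like reducing to one reducer, this is a single sequential pass, so it only makes sense as a final step.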

Related

How to reference the most current Physical Sequential (PS) file in JCL

I want to create a job that uses the latest available file as its input file.
The file name format is as below: FILE1.TEST.TYYMMDD
Is there any way to identify the latest file based on the date in the file name via JCL?
P.S. GDG versions are not created in the existing process. Only a PS file is created.
Thank you
I wanted to create a job where I need to consider the latest file available as input file. File [name] format is as below: FILE1.TEST.TYYMMDD is there any way to identify latest file based on date present in file name via JCL.
No.
You indicate that GDGs are not created in the existing process. GDGs would be the best way to accomplish your goal. Absent GDGs, you must write code.
You could accomplish your goal by writing (C, clist, COBOL, PL/I, Rexx) code using the LMDINIT and LMDLIST ISPF services. Then you would execute your code by running ISPF in batch. Many mainframe shops have a cataloged procedure to execute ISPF in batch.
Agree with @cschneid that there is no platform way to handle this. However, I want to point out that GDGs are the platform way of managing PS files for access in a relative form.
Your comment:
GDG versions are not created in existing process . Only PS file is created.
That statement didn't make sense to me. GDGs are not a file type like physical sequential (PS) or partitioned (PO). It's a convention to allow relative reference to files created over time which sounds like what you want. I've only seen the use of GDGs for PS files.
Putting the date in the file name can have its uses, but to z/OS it's only part of the file name and not meta information that it operates on (like the G0000V00 qualifiers in GDGs).
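The selection logic that such a program would need is small. The sketch below is in Python purely for illustration (the answers suggest C, CLIST, COBOL, PL/I or Rexx on the mainframe) and assumes the dataset names have already been collected, for example with LMDLIST. The function names are hypothetical, and the two-digit year is interpreted with Python's %y rule (00 to 68 become 2000 to 2068, 69 to 99 become 1969 to 1999), which is an assumption about the data.

```python
from datetime import date, datetime
from typing import Iterable, Optional


def dataset_date(name: str) -> Optional[date]:
    """Parse the trailing TYYMMDD qualifier of a dataset name, or return None."""
    last = name.rsplit(".", 1)[-1]
    if len(last) != 7 or not last.startswith("T") or not last[1:].isdigit():
        return None
    try:
        return datetime.strptime(last[1:], "%y%m%d").date()
    except ValueError:
        # Digits that do not form a real calendar date, e.g. month 13.
        return None


def latest_dataset(names: Iterable[str]) -> Optional[str]:
    """Return the name carrying the most recent date, or None if none parse."""
    dated = [(dataset_date(n), n) for n in names]
    dated = [(d, n) for d, n in dated if d is not None]
    return max(dated)[1] if dated else None
```

Given FILE1.TEST.T220630, FILE1.TEST.T230101 and FILE1.TEST.T231215, latest_dataset picks FILE1.TEST.T231215. Names that do not end in a valid TYYMMDD qualifier are ignored rather than treated as errors.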

Change spark _temporary directory path

Is it possible to change the _temporary directory where Spark saves its temporary files before writing?
In particular, since I am writing single partitions of a table, I would like the temporary folder to be within the partition folder.
Is it possible?
There is no way to do this with the default FileOutputCommitter because of its implementation: the FileOutputCommitter creates a ${mapred.output.dir}/_temporary subdirectory where the files are written and, after being committed, moved to ${mapred.output.dir}.
In the end, the entire temporary folder is deleted. When two or more Spark jobs have the same output directory, mutual deletion of files is inevitable.
Eventually, I downloaded org.apache.hadoop.mapred.FileOutputCommitter and org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter (you can name your copy YourFileOutputCommitter) and made some changes that allow the _temporary directory to be renamed.
In your driver, you'll have to add the following code:
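import org.apache.hadoop.mapred.JobConf
// Build a JobConf from the Spark context's Hadoop configuration
val conf: JobConf = new JobConf(sc.hadoopConfiguration)
// Use the modified committer instead of the default FileOutputCommitter
conf.setOutputCommitter(classOf[YourFileOutputCommitter])
// update temporary path for committer
YourFileOutputCommitter.tempPath = "_tempJob1"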
Note: it's better to use MultipleTextOutputFormat to rename files, because two jobs that write to the same location can overwrite each other's files.
Update
I've created short post in our tech blog, it has more details
https://www.outbrain.com/techblog/2020/03/how-you-can-set-many-spark-jobs-write-to-the-same-path/
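To see why a per-job temporary directory avoids the mutual deletion described above, here is a small local-filesystem simulation of the stage-then-commit pattern. It is not Hadoop's FileOutputCommitter: the stage and commit functions are hypothetical stand-ins, and only the directory name "_tempJob1" is taken from the answer.

```python
import shutil
from pathlib import Path


def stage(output_dir: Path, temp_name: str, files: dict) -> Path:
    """Write task output under output_dir/temp_name, as a committer stages files."""
    temp_dir = output_dir / temp_name
    temp_dir.mkdir(parents=True, exist_ok=True)
    for name, text in files.items():
        (temp_dir / name).write_text(text)
    return temp_dir


def commit(output_dir: Path, temp_name: str) -> None:
    """Move staged files into output_dir, then delete the whole temp directory."""
    temp_dir = output_dir / temp_name
    for staged in list(temp_dir.iterdir()):
        shutil.move(str(staged), str(output_dir / staged.name))
    # This is the step that hurts when two jobs share one temp directory:
    # the first job to commit removes the other job's staged files too.
    shutil.rmtree(temp_dir)
```

If job 1 stages into "_tempJob1" and job 2 into "_tempJob2" under the same output directory, committing job 1 removes only "_tempJob1", so job 2 can still commit its own files afterwards. With a single shared "_temporary" name, the first commit would take job 2's staged files with it.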

Change temporary path for individual job from spark code

I have multiple jobs that I want to execute in parallel. They append daily data to the same path using dynamic partitioning.
The problem I am facing is the temporary path that Spark creates during job execution. Multiple jobs end up sharing the same temp folder and conflict with each other: one job can delete temp files, and the other job then fails with an error saying that an expected temp file doesn't exist.
Can we change the temporary path for an individual job, or is there any alternative way to avoid the issue?
To change the temp location you can do this:
/opt/spark/bin/spark-shell --conf "spark.local.dir=/local/spark-temp"
spark.local.dir changes where Spark's local scratch files (such as shuffle output and data spilled to disk) are read and written. I would advise creating this location and opening up its permissions from the command line before the first session with this argument is run.

Azure Databricks - Can not create the managed table The associated location already exists

I have the following problem in Azure Databricks. Sometimes when I try to save a DataFrame as a managed table:
SomeData_df.write.mode('overwrite').saveAsTable("SomeData")
I get the following error:
"Can not create the managed table('SomeData'). The associated location('dbfs:/user/hive/warehouse/somedata') already exists.;"
I used to fix this problem by running a %fs rm command to remove that location but now I'm using a cluster that is managed by a different user and I can no longer run rm on that location.
For now the only fix I can think of is using a different table name.
What makes things even more peculiar is the fact that the table does not exist. When I run:
%sql
SELECT * FROM SomeData
I get the error:
Error in SQL statement: AnalysisException: Table or view not found: SomeData;
How can I fix it?
Seems there are a few others with the same issue.
A temporary workaround is to use
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData/", true)
to remove the table before re-creating it.
This generally happens when a cluster is shut down while writing a table. The recommended solution from the Databricks documentation is to set the flag spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation to true.
This flag deletes the _STARTED directory and returns the process to the original state. For example, you can set it in the notebook:
%py
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
All of the other recommended solutions here are either workarounds or do not work. The mode is specified as overwrite, meaning you should not need to delete or remove the db or use legacy options.
Instead, try specifying the fully qualified path in the options when writing the table:
df.write \
.option("path", "hdfs://cluster_name/path/to/my_db") \
.mode("overwrite") \
.saveAsTable("my_db.my_table")
For a more context-free answer, run this in your notebook:
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData", recurse=True)
Per Databricks's documentation, this will work in a Python or Scala notebook, but you'll have to use the magic command %python at the beginning of the cell if you're using an R or SQL notebook.
I have the same issue. I am using
create table if not exists USING delta
If I first delete the files as suggested, it creates the table once, but the second time the problem repeats. It seems that create table if not exists does not recognize the table and tries to create it anyway.
I don't want to delete the table every time; I'm actually trying to use MERGE and keep the table.
Well, this happens because you're trying to write data to the default location (without specifying the 'path' option) with the mode 'overwrite'.
As Mike said, you can set "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" to "true", but this option was removed in Spark 3.0.0.
If you try to set this option in Spark 3.0.0 you will get the following exception:
Caused by: org.apache.spark.sql.AnalysisException: The SQL config 'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation' was removed in the version 3.0.0. It was removed to prevent loosing of users data for non-default value.;
To avoid this problem you can explicitly specify the path where you're going to save with the 'overwrite' mode.

writing data to filesystem from hive queries in hdinsight

I see that it's possible to write query results to the filesystem in Hadoop: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
How do I save a query result on HDInsight to a folder that is accessible from blob storage?
I tried something as below but was not successful.
INSERT OVERWRITE LOCAL DIRECTORY '/example/distinctconsumers' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select consumerid from distinctconsumers;
Thanks
The language manual clearly states:
If LOCAL keyword is used, Hive will write data to the directory on the local file system.
If you remove 'LOCAL' from your query then it will work.
NOTE: the result might not be a single file but a list of files (one from each task)

Resources