Hive Create Table Reading CSV from S3 data spill - apache-spark

I'm trying to create a Hive table from a CSV file at an external location on S3.
CREATE EXTERNAL TABLE coder_bob_schema.my_table (column data type)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://mybucket/path/file.CSV'
The resulting table has data from fields n-x spilling over into field n, which leads me to believe Hive doesn't like the CSV. However, I downloaded the CSV from S3 and it opens and looks fine in Excel. Is there a workaround, such as using a different delimiter?
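If the spill comes from commas inside quoted field values, changing the delimiter in the DDL alone won't fix it, but a quote-aware SerDe usually does. A sketch (column names are placeholders, untested against your data) using Hive's built-in OpenCSVSerde, with LOCATION pointed at the folder rather than the file:
CREATE EXTERNAL TABLE coder_bob_schema.my_table (col1 string, col2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
LOCATION 's3://mybucket/path/'
Note that OpenCSVSerde reads every column as STRING, so numeric columns would need a cast at query time.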

Related

Is it possible to create a view from external data?

I have some csv files in my data lake which are being quite frequently updated through another process. Ideally I would like to be able to query these files through spark-sql, without having to run an equally frequent batch process to load all the new files into a spark table.
Looking at the documentation, I'm unsure as all the examples show views that query existing tables or other views, rather than loose files stored in a data lake.
You can do something like this if your CSV files are in S3 under the location s3://bucket/folder:
spark.sql(
"""
CREATE TABLE test2
(a string, b string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
LOCATION 's3://bucket/folder'
"""
)
You will have to adapt the column names and the field separator, though.
To test it, you can first run:
Seq(("1","a"), ("2","b"), ("3","a"), ("4","b")).toDF("num", "char").repartition(1).write.mode("overwrite").csv("s3://bucket/folder")

External Table in Databricks is showing only future date data

I had a Delta table in Databricks and the data is available in ADLS. The data is partitioned by a date column. From 01-06-2022 onwards the data is available in Parquet format in ADLS, but when I query the table in Databricks I can see data only from the future date onwards each day; older data is not displayed. Every day, data is overwritten to the table path with the partitioned date column.
df.write.format('delta').mode('overwrite').save('{}/{}'.format(DELTALAKE_PATH, table))
Using overwrite mode deletes the past data and adds the new data. This is the reason for your issue.
df.write.format('delta').mode('append').save('{}/{}'.format(DELTALAKE_PATH, table))
Using append mode will append the new data to the existing data. This will keep your existing data, and when you execute a query it will return past records as well.
You need to use append mode in place of overwrite mode.
Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only to queries where existing rows in the Result Table are not expected to change.
Reference - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts

ADF: Dataflow sink activity file format for adls

I wanted to copy data from multiple tables in an Azure SQL database to ADLS Gen2. I created a pipeline that takes table names as dynamic input values. Later I used a Data Flow activity that copies the data to ADLS, with the sink type set to Delta. Now most of my tables are copied to ADLS properly in snappy.parquet format, but a few give an error saying the column names are invalid for the Delta format.
How can we deal with this error and get the data copied from all tables?
Also, just to know: are the files generated in the destination folder in ADLS Parquet by default, or is there an option to change that?
Delta format is Parquet underneath. You cannot use characters like " ,;{}()\n\t=" in column names and have to replace them with _ or another character.
Data flows have easy ways to rename columns in a Derived Column or Select transformation.
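For a single table, another option (a sketch with hypothetical table and column names) is to alias the offending columns already in the source query on the Azure SQL side, so the sink never sees the disallowed characters:
-- hypothetical names; alias any column containing a space, comma, ;{}()=, tab or newline
SELECT [Order Id]     AS Order_Id,
       [Ship;Date]    AS Ship_Date,
       [Total (USD)]  AS Total_USD
FROM dbo.SourceTable;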

Loading ORC Data into SQL DW using PolyBase

I am trying to load the ORC file format via PolyBase, but I am facing the problems below.
Problem 1:
I have a CSV file, and the code below converts the CSV file to ORC format, but it selects data from a permanent table. If I remove "as select * from dbo.test" then the CREATE EXTERNAL TABLE does not work. The permanent table contains 0 records.
create external table test_orc
with
(
location='/TEST/',
DATA_SOURCE=SIMPLE,
FILE_FORMAT=TEST_ORC
)
as select * from dbo.test ---Permanent Table
Problem 2:
If I select the data from test_orc I get an invalid postscript error, so I removed my .csv file from the TEST directory. Is there any way to convert the CSV to an ORC file in a different directory, like TEST2?
Problem 3:
If I select the data from test_orc the count is zero and I am not getting any error.
select count(*) from test_orc -- count is zero
Expectation
I need to load the ORC file in the TEST directory into the dbo.test table.
Please share your thoughts on this issue.
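A sketch of one way to cover the three problems and the expectation (the column list and the TEST_CSV file format name are assumptions, and it assumes the earlier test_orc has been dropped): define a plain external table over the CSV, use CETAS to write it out as ORC under a separate directory such as /TEST2/, then load the permanent table from the ORC-backed external table.
-- 1) external table over the raw CSV (staged in /TEST/ or any folder that holds only the CSV;
--    needs a delimited-text file format, here called TEST_CSV)
create external table test_csv_ext (col1 int, col2 varchar(100))
with
(
location='/TEST/',
DATA_SOURCE=SIMPLE,
FILE_FORMAT=TEST_CSV
)
-- 2) CETAS converts those rows to ORC files under the separate directory /TEST2/
create external table test_orc
with
(
location='/TEST2/',
DATA_SOURCE=SIMPLE,
FILE_FORMAT=TEST_ORC
)
as select * from test_csv_ext
-- 3) load the permanent table from the ORC-backed external table
insert into dbo.test select * from test_orc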

Data parsing using hive query

I am building a pipeline in Azure Data Factory. The input dataset is a CSV file with a column delimiter, and the output dataset is also a CSV file with a column delimiter. The pipeline is designed with an HDInsight activity that runs a Hive query from a file with the .hql extension. The Hive query is as follows:
set hive.exec.dynamic.partition.mode=nonstrict;
DROP TABLE IF EXISTS Table1;
CREATE EXTERNAL TABLE Table1 (
Number string,
Name string,
Address string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/your/folder/location';
SELECT * FROM Table1;
Below is the file format
Number,Name,Address
1,xyz,No 152,Chennai
2,abc,7th street,Chennai
3,wer,Chennai,Tamil Nadu
How do I keep the column header from being parsed as data in the output dataset?
As per my understanding, your question is related to the CSV file: you are putting a CSV file at the table location and it contains a header row. If my understanding is correct, please try the property below in your table DDL. I hope this helps.
tblproperties ("skip.header.line.count"="1");
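For completeness, a sketch of how that property slots into the DDL from the question (same placeholder location):
CREATE EXTERNAL TABLE Table1 (
Number string,
Name string,
Address string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/your/folder/location'
tblproperties ("skip.header.line.count"="1");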
Thanks,
Manu
