I am trying to load an ORC file via PolyBase, but I am facing the problems below.
Problem 1
I have a CSV file, and the code below converts it to ORC format, but it selects data from a permanent table. If I remove "as select * from dbo.test", the CREATE EXTERNAL TABLE statement does not work. The permanent table contains 0 records.
create external table test_orc
with
(
location='/TEST/',
DATA_SOURCE=SIMPLE,
FILE_FORMAT=TEST_ORC
)
as select * from dbo.test ---Permanent Table
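For reference, without the AS SELECT clause, CREATE EXTERNAL TABLE needs an explicit column list and only maps onto files that already exist at the location; it does not write any data. A minimal sketch of that form, with placeholder column names since the schema of dbo.test is not shown:
-- Sketch only: the column names below are placeholders, not the real schema of dbo.test.
create external table test_orc_read
(
    id int,
    name varchar(100)
)
with
(
    location='/TEST/',
    DATA_SOURCE=SIMPLE,
    FILE_FORMAT=TEST_ORC
);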
Problem 2
If I select data from test_orc, I get an "invalid postscript" error, so I removed my .csv file from the TEST directory. Is there any way to convert the CSV to an ORC file in a different directory, such as TEST2?
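One pattern that should work (a sketch only; it assumes a separate external table over the CSV, here called test_csv with a DELIMITEDTEXT file format, neither of which is shown above) is to read the CSV through its own external table and use CETAS to write ORC into a different directory such as /TEST2/:
-- Sketch: test_csv is a hypothetical external table defined over the CSV file.
create external table test_orc2
with
(
    location='/TEST2/',
    DATA_SOURCE=SIMPLE,
    FILE_FORMAT=TEST_ORC
)
as select * from test_csv;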
Problem 3
If I select data from test_orc, the count is zero and I do not get any error.
select count(*) from test_orc -- count is zero
Expectation
I need to load the ORC file in the TEST directory into the dbo.test table.
Please share your thoughts on this issue.
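For the expectation above, if test_orc reads the ORC files in /TEST/ correctly, a plain INSERT ... SELECT is a sketch of the intended load:
-- Sketch: load the external ORC data into the permanent table.
insert into dbo.test
select * from test_orc;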
Related
I have a copy data activity in Azure Data Factory that takes the output of a stored procedure and writes it to a CSV file. I have Money columns (precision 19, scale 4) in the source that are converted into Decimal columns in the CSV sink. I'm getting an error that SqlBigDecimal is not supported, but the mapping looks correct, and it should convert the data from Money to Decimal, not BigDecimal.
I used to have the same problem, but with writing to a Parquet file. That issue somehow resolved itself; I don't know exactly what I did to fix it.
"Failure happened on 'Sink' side. ErrorCode=DataTypeNotSupported,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The data type SqlBigDecimal is not supported.,Source=Microsoft.DataTransfer.Common,'"
I created a simple test to sink a money type column into a CSV.
First, I created a table in Azure SQL:
create table dbo.test(
id int,
salary money
);
insert into dbo.test values (1,3500.1234);
insert into dbo.test values (2,3465.1234);
Query the rows via a stored procedure in the Copy Activity. I didn't set any mapping or import schema.
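The stored procedure itself is not part of the original post; a minimal hypothetical version that simply returns the rows might look like this:
-- Hypothetical stored procedure used as the Copy Activity source.
create procedure dbo.usp_get_test
as
begin
    select id, salary from dbo.test;
end;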
Then sink to an empty CSV file. The activity will create the CSV file.
After running in debug mode, we can see the result in the output file.
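If the SqlBigDecimal error does come back, one workaround worth trying (an assumption on my part, not something this test needed) is to cast the money column to decimal explicitly in the source query so the sink never sees the money type:
-- Sketch: cast money to decimal(19, 4) at the source.
select id, cast(salary as decimal(19, 4)) as salary
from dbo.test;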
I'm trying to create a Hive table from a CSV file at an external location on S3.
CREATE EXTERNAL TABLE coder_bob_schema.my_table (column data type)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://mybucket/path/file.CSV'
The resultant table has data from fields n-x spilling over into field n, which leads me to believe Hive doesn't like the CSV. However, I downloaded the CSV from S3 and it opens and looks fine in Excel. Is there a workaround, like using a different delimiter?
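One workaround, if the extra commas live inside quoted field values, is Hive's OpenCSVSerde, which understands quoting; this is only a sketch with placeholder columns, and note that it exposes every column as a string:
-- Sketch: OpenCSVSerde treats quoted commas as data, not delimiters.
CREATE EXTERNAL TABLE coder_bob_schema.my_table (col1 string, col2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar"     = "\""
)
LOCATION 's3://mybucket/path/';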
I want to copy data from multiple tables in an Azure SQL database to ADLS Gen2. I created a pipeline that takes the table names as dynamic input values, and then used a Data Flow activity to copy the data to ADLS with the sink type set to Delta. Some of my tables are copied to ADLS properly in snappy.parquet format, but a few fail with an error saying the column names are invalid for the Delta format.
How can we deal with this error and get the data copied from all tables?
Also, for my own knowledge, are the files generated in the destination folder in ADLS Parquet by default, or is there an option to change that?
Delta format is Parquet underneath. You cannot use characters like " ,;{}()\n\t=" in column names and have to replace them with _ or another character.
Data Flow has easy ways to rename columns in Derive or Select transforms.
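If renaming in the Data Flow is not convenient, the same cleanup can happen at the source; a sketch with invented column names, aliasing anything that contains the characters Delta rejects:
-- Sketch: the column and table names here are invented for illustration; alias them
-- so the Delta sink only sees names without spaces or special characters.
select
    [Customer Name]   as Customer_Name,
    [Amount (USD)]    as Amount_USD,
    [Order;Reference] as Order_Reference
from dbo.SourceTable;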
I have used the following query to create an external table in SQL Server 2016 with PolyBase.
CREATE EXTERNAL TABLE dbo.SampleExternal (
DateId INT NULL,
CalendarQuarter TINYINT NULL,
FiscalQuarter TINYINT NULL)
WITH (LOCATION='/SampleExternal.parquet',
DATA_SOURCE=AzureStorage,
FILE_FORMAT=ParquetFile);
I inserted the data into the external table from a local table, and the Parquet file was successfully generated in the Azure container. But while reading the Parquet file, the column names are shown as col-0, col-1. Is there any way to add the original column names to the Parquet file as given in the external table?
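For context, the load into the external table (the local table name below is assumed, since it is not given in the question) was a plain INSERT ... SELECT:
-- Sketch: dbo.SampleLocal is a hypothetical local table with the same columns.
INSERT INTO dbo.SampleExternal
SELECT DateId, CalendarQuarter, FiscalQuarter
FROM dbo.SampleLocal;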
Column Names
This seems to be 'as designed' in PolyBase. The consuming application has to map these numbered column names to meaningful column names. If the producing application is different from the consuming application, they should agree on the column mapping.
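To illustrate that mapping on the consumer side (the staging object name and the exact generated names are assumptions here), the rename can be as simple as a query or view with aliases:
-- Sketch: map the generated Parquet column names back to meaningful names.
-- consumer_staging is a hypothetical object exposing the raw Parquet columns.
SELECT
    [col-0] AS DateId,
    [col-1] AS CalendarQuarter,
    [col-2] AS FiscalQuarter
FROM consumer_staging;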
I am building a pipeline in Azure Data Factory. The input dataset is a CSV file with a column delimiter, and the output dataset is also a CSV file with a column delimiter. The pipeline is designed with an HDInsight activity that runs a Hive query from a file with the .hql extension. The Hive query is as follows:
set hive.exec.dynamic.partition.mode=nonstrict;
DROP TABLE IF EXISTS Table1;
CREATE EXTERNAL TABLE Table1 (
Number string,
Name string,
Address string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/your/folder/location';
SELECT * FROM Table1;
Below is the input file:
Number,Name,Address
1,xyz,No 152,Chennai
2,abc,7th street,Chennai
3,wer,Chennai,Tamil Nadu
How do I keep the column header from being parsed as data in the output dataset?
As I understand it, your question is about the CSV file: you are placing a CSV file at the table location and it contains a header row. If my understanding is correct, please try the property below in your table DDL. I hope this helps.
tblproperties ("skip.header.line.count"="1");
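Put together with the DDL from the question, the table definition would look roughly like this (same placeholder location as above):
-- Sketch: same DDL as in the question, with the header-skip property added.
CREATE EXTERNAL TABLE Table1 (
    Number string,
    Name string,
    Address string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/your/folder/location'
TBLPROPERTIES ("skip.header.line.count"="1");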
Thanks,
Manu