Data parsing using Hive query - Azure

I am building a pipeline in Azure Data Factory. The input dataset is a comma-delimited CSV file and the output dataset is also a comma-delimited CSV file. The pipeline is designed with an HDInsight Hive activity that runs a Hive query stored in a file with the .hql extension. The Hive query is as follows:
set hive.exec.dynamic.partition.mode=nonstrict;
DROP TABLE IF EXISTS Table1;
CREATE EXTERNAL TABLE Table1 (
Number string,
Name string,
Address string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/your/folder/location';
SELECT * FROM Table1;
Below is the file format:
Number,Name,Address
1,xyz,No 152,Chennai
2,abc,7th street,Chennai
3,wer,Chennai,Tamil Nadu
How do I handle the column header so that it is not parsed as data in the output dataset?

As per my understanding, your question is about the CSV file: you are placing a CSV file at the table location and it contains a header row. If my understanding is correct, please try the below property in your table DDL. I hope this will help you.
tblproperties ("skip.header.line.count"="1");
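For example, applied to the DDL from your question (a sketch that reuses your column names and location), the property goes at the end of the statement:
CREATE EXTERNAL TABLE Table1 (
Number string,
Name string,
Address string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/your/folder/location'
TBLPROPERTIES ("skip.header.line.count"="1");  -- skip the first (header) line of each file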
Thanks,
Manu

Related

loading a tab delimited text file as a hive table/dataframe in databricks

I am trying to load a tab-delimited text file in Databricks notebooks, but all the column values are getting pushed into one column.
Here is the SQL code I am using:
Create table if not exists database.table
using text
options (path 's3bucketpath.txt', header "true")
I also tried using csv.
The same thing happens if I'm reading into a Spark DataFrame.
I am expecting to see the columns separated out with their headers. Has anyone come across this issue and figured out a solution?
Have you tried adding a sep option to specify that you're using tab-separated values?
Create table if not exists database.table
using csv
options (path 's3bucketpath.txt', header 'true', sep '\t')

Hive Create Table Reading CSV from S3 data spill

I'm trying to create a Hive table from a CSV file at an external location on S3.
CREATE EXTERNAL TABLE coder_bob_schema.my_table (column data type)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://mybucket/path/file.CSV'
The resulting table has data from fields n-x spilling over into field n, which leads me to believe Hive doesn't like the CSV. However, I downloaded the CSV from S3 and it opens and looks okay in Excel. Is there a workaround, like using a different delimiter?
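If the spill is caused by embedded commas inside quoted fields (which Excel hides), one possible workaround is Hive's OpenCSVSerde instead of a plain delimited table. A rough sketch with placeholder column names, and with LOCATION pointing at the folder rather than the file:
CREATE EXTERNAL TABLE coder_bob_schema.my_table (
  col1 string,
  col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/path/';
-- note: OpenCSVSerde reads every column as string, so cast in queries as needed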

Removing extra comma from a column while exporting the data into csv file using Azure Data Factory

I have data in my SQL table as shown in the screenshot below; the values column has an extra comma after the values. It is actually a list of values that can contain more values.
I have to export this data into a pipe-delimited CSV file, which is shown in the second screenshot below.
How can I remove the additional comma from that column while exporting the data into a pipe-delimited CSV file?
I am performing the export using an Azure Data Factory pipeline. Is there any way to avoid the extra comma in the output file, or to remove it while exporting?
Is there any way to make this change at the time of writing the file to an ADLS location through ADF? Are there any changes that have to be made in ADF?
As Joel commented, you can just modify your query to do that while extracting. It might look like this:
select ID, timestamp, replace([values], ',', '') as [values] from [YourTable]
Hope this helped!
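If the goal is to remove only a trailing comma rather than every comma in the list, a variant like this might fit (a sketch using the same column and table names as the answer above):
select ID, timestamp,
       case when right(rtrim([values]), 1) = ','
            then left(rtrim([values]), len(rtrim([values])) - 1)  -- drop the single trailing comma
            else [values]
       end as [values]
from [YourTable]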

Need to insert csv source data to Azure SQL Database

I have source data in CSV. I created a SQL table to insert the CSV data into. My SQL table has a primary key column and a foreign key column in it. I cannot skip these 2 columns while mapping in Data Factory. How do I overcome this and insert the data?
Please refer to the rules for schema mapping in the copy activity; the mapping fails in the following cases:
Source data store query result does not have a column name that is specified in the input dataset "structure" section.
Sink data store (if with pre-defined schema) does not have a column name that is specified in the output dataset "structure" section.
Either fewer columns or more columns in the "structure" of the sink dataset than specified in the mapping.
Duplicate mapping.
So, if your CSV file does not cover all the columns in the SQL database, the copy activity can't work.
You could consider creating a temporary (staging) table in the SQL database that matches your CSV file, then use a stored procedure to fill the actual table, as sketched below. Please refer to the detailed steps in this case to implement your requirement: Azure Data Factory mapping 2 columns in one column
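A minimal sketch of that staging approach (all table, column, and procedure names here are hypothetical; adjust them to your schema):
-- staging table that matches the CSV columns exactly (no PK/FK columns)
CREATE TABLE dbo.StagingCsv (
    Name    nvarchar(100),
    Address nvarchar(200)
);

-- stored procedure that a later pipeline step calls to fill the real table
CREATE PROCEDURE dbo.LoadFromStaging
AS
BEGIN
    INSERT INTO dbo.TargetTable (Name, Address, CountryId)  -- PK assumed to be IDENTITY; FK supplied here
    SELECT s.Name, s.Address, 1                             -- hypothetical default/lookup value for the FK
    FROM dbo.StagingCsv AS s;

    TRUNCATE TABLE dbo.StagingCsv;
END;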

Loading ORC Data into SQL DW using PolyBase

I am trying to load the ORC file format via PolyBase, but I am facing the problems below.
Problem 1:
I have a CSV file, and the code below converts the CSV data to ORC format, but it selects data from a permanent table. If I remove "as select * from dbo.test" then CREATE EXTERNAL TABLE does not work. The permanent table contains 0 records.
create external table test_orc
with
(
location='/TEST/',
DATA_SOURCE=SIMPLE,
FILE_FORMAT=TEST_ORC
)
as select * from dbo.test ---Permanent Table
Problem 2:
If I select the data from test_orc I get an invalid postscript error, so I removed my .csv file from the TEST directory. Is there any way to convert CSV to an ORC file in a different directory, like TEST2?
Problem 3:
If I select the data from test_orc then the count is zero and I do not get any error.
select count(*) from test_orc  -- count is zero
Expectation
I need to load the ORC files in the TEST directory into the dbo.test table.
Please share your thoughts on this issue.
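For reference, a rough sketch of the usual PolyBase pattern for this kind of flow (the column list, file format name, and the ext_test_csv table over the CSV folder are assumptions; ORC and CSV files should live in separate folders):
-- external file format for ORC
CREATE EXTERNAL FILE FORMAT orc_ff WITH (FORMAT_TYPE = ORC);

-- external table over the ORC folder, then load the permanent table from it
CREATE EXTERNAL TABLE ext_test_orc (
    id   int,
    name nvarchar(100)          -- columns assumed; match your ORC schema
)
WITH (
    LOCATION    = '/TEST/',
    DATA_SOURCE = SIMPLE,
    FILE_FORMAT = orc_ff
);

INSERT INTO dbo.test
SELECT * FROM ext_test_orc;

-- Problem 2: convert CSV to ORC into a different directory with CETAS
CREATE EXTERNAL TABLE test_orc2
WITH (
    LOCATION    = '/TEST2/',
    DATA_SOURCE = SIMPLE,
    FILE_FORMAT = orc_ff
)
AS SELECT * FROM ext_test_csv;  -- ext_test_csv: an external table defined over the CSV folder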
