Proper format for copy from with a transaction size in YugabyteDB YSQL - yugabytedb

[Question posted by a user on YugabyteDB Community Slack]
What is the proper format for COPY FROM with a transaction size? I am trying to load 12 million rows and am getting this error:
yugabyte=# copy parking_violations from '/home/yugabyte/load_violations.csv' with (FORMAT delimiter ',', csv, header, ROWS_PER_TRANSACTION 10000);
ERROR: syntax error at or near "','"
LINE 1: ...byte/load_violations.csv' with (FORMAT delimiter ',', csv, ...

The error occurs because FORMAT takes the format name (csv) as its value, and DELIMITER is a separate option, so the two cannot be combined as FORMAT delimiter ','. The correct format is:
COPY parking_violations FROM '/home/yugabyte/load_violations.csv' WITH (FORMAT csv, DELIMITER ',', HEADER, ROWS_PER_TRANSACTION 10000);
See docs for more: https://docs.yugabyte.com/preview/api/ysql/the-sql-language/statements/cmd_copy/#import-a-large-table-using-smaller-transactions

Related

loading a tab delimited text file as a hive table/dataframe in databricks

I am trying to upload a tab-delimited text file in Databricks notebooks, but all the column values are getting pushed into one column value.
Here is the SQL code I am using:
Create table if not exists database.table
using text
options (path 's3bucketpath.txt', header "true")
I also tried using csv.
The same thing happens if I'm reading into a Spark dataframe.
I am expecting to see the columns separated out with their header. Has anyone come across this issue and figured out a solution?
Have you tried to add a sep option to specify that you're using tab-separated values?
Create table if not exists database.table
using csv
options (path 's3bucketpath.txt', header 'true', sep '\t')

How to handle double pipe delimiter in Azure Data Factory 2

I'm working with a double pipe delimited file which I want to ingest in Azure SQL DB via Data Factory (2):
Column1||Column2||Column3
In my input dataset, I specified the double pipe delimiter as columnDelimiter in the format section of typeProperties:
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "||",
"skipLineCount": 2
}
Unfortunately, this is not working. All rows are marked as "incompatible" and no data is ingested into my SQL table. Next, I tried a few things. When not specifying any columnDelimiter, the complete row is ingested into the first column of my SQL table, so no delimiter is considered at all. This is not correct; logically, each data item must be ingested into a separate column. Specifying the || in Unicode also did not fix it. When converting my input file to single-pipe delimited, it works fine. However, in the end state I'm not able to alter the input file, so handling the double-pipe delimiter is essential.
You could either ask the creators of your input file to use a single delimiter (like most people do), or else you have to do some pre-processing. For example, you could do this with a U-SQL step which corrects the file and then import that file into your SQL DB, e.g. some simple U-SQL:
// Multiple column delimiters
@input =
    EXTRACT
        col1 string
    FROM "/input/input91.txt"
    USING Extractors.Text(delimiter : '~', skipFirstNRows : 1);

// Do some other processing here?
@output =
    SELECT col1.Replace("||", "|") AS col2
    FROM @input;

OUTPUT @output
TO "/output/output.txt"
USING Outputters.Text(delimiter : '|', quoting : false);
NB Here I've used a delimiter (tilde "~") which I know won't occur in the input file to effectively import all the columns as one column, and then fix it up.
Is there a particular reason you use two delimiters?
Also, if your file is on blob storage, then you can now BULK INSERT into Azure SQL DB, and having just given this a quick test, BULK INSERT can cope with multiple delimiters, e.g.:
DROP TABLE IF EXISTS #tmp

CREATE TABLE #tmp (
    a VARCHAR(50),
    b VARCHAR(50),
    c VARCHAR(50)
)
GO
BULK INSERT #tmp
FROM 'D:\Data Lake\USQLDataRoot\input\input91.txt'
WITH ( FIELDTERMINATOR = '||', DATA_SOURCE = '<yourDataSource>' );
See here for more info on the steps, which involve creating an external data source and credential. You could then have Data Factory use the Stored Procedure task to execute it, for example.
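For orientation, a rough sketch of that external data source and credential; the names MyBlobCredential and MyBlobStorage, the storage account, container, and SAS token below are placeholders, not values from the original answer:

-- placeholder credential: substitute your own SAS token
CREATE DATABASE SCOPED CREDENTIAL MyBlobCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<SAS token without the leading ?>';

-- placeholder data source pointing at the container that holds the file
CREATE EXTERNAL DATA SOURCE MyBlobStorage
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'https://<storageaccount>.blob.core.windows.net/<container>',
    CREDENTIAL = MyBlobCredential
);

With a blob data source like this, the FROM clause of BULK INSERT references the file's path within the container rather than a local path, and DATA_SOURCE = 'MyBlobStorage'.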
Based on the TextFormat rules, only one character is allowed as the column delimiter. I also tried to use an escape character to work around it, but that is banned as well.
So it seems you need to process your data before the copy activity. Since ADF supports copying data from an HTTP endpoint, I suggest the following steps:
1. Use Azure Functions to read your file as a stream, line by line.
2. Replace all the '||' with '|' and set the result in the response (don't forget the '\n').
3. In ADF, set the column delimiter to '|' and copy the data from the HTTP endpoint.
This solution can process large data. Of course, if you do not want to do such complicated work, you could process your data directly and save it as a temporary file for ADF.
Hope it helps you.

Data parsing using hive query

I am building a pipeline through Azure Data Factory. The input dataset is a CSV file with a column delimiter, and the output dataset is also a CSV file with a column delimiter. The pipeline is designed with an HDInsight activity that runs a Hive query from a file with the .hql extension. The Hive query is as follows:
set hive.exec.dynamic.partition.mode=nonstrict;
DROP TABLE IF EXISTS Table1;
CREATE EXTERNAL TABLE Table1 (
    Number string,
    Name string,
    Address string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/your/folder/location';

SELECT * FROM Table1;
Below is the file format
Number,Name,Address
1,xyz,No 152,Chennai
2,abc,7th street,Chennai
3,wer,Chennai,Tamil Nadu
How do I handle the column header being parsed along with the data in the output dataset?
As per my understanding, your question is related to the CSV file: you are putting a CSV file at the table location and it contains a header. If my understanding is correct, please try the below property in your table DDL. I hope this will help you.
tblproperties ("skip.header.line.count"="1");
Thanks,
Manu
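For reference, the property goes at the end of the CREATE EXTERNAL TABLE statement. A sketch reusing the DDL from the question, with the TBLPROPERTIES line as the only addition:

CREATE EXTERNAL TABLE Table1 (
    Number string,
    Name string,
    Address string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/your/folder/location'
TBLPROPERTIES ("skip.header.line.count"="1");  -- skip the first (header) line of each file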

DataStax Devcenter export CSV format with "|" instead of ","

I'm looking for a way to change the delimiter of CSV files exported by DataStax DevCenter from a "," to a "|". The reason is that there are several array lists in my Cassandra data export that already contain commas, so a "," delimiter makes it challenging to parse the data once I open the export as a CSV file and try to parse it with another application. I'm using DevCenter v1.6.0.
COPY table_name ( column , ... )
TO 'file_name'
WITH option = 'value'
You can use the DELIMITER option.
Example:
COPY table_name ( column , ... )
TO 'file_name'
WITH DELIMITER = '|'
For a list of other options, please refer to the link below.
COPY Command in Cassandra

How can I insert multiple records with a single INSERT statement in Sybase ASE

This feature is available in MySQL as shown in this post, and according to the Sybase documentation it should also be supported; however, Sybase doesn't provide a worked example, so you have to interpret the following:
Syntax 1 Insert a single row, or multiple rows, with the specified expression column values. Multiple rows, if specified, are delimited by additional parentheses
So I interpret "additional parentheses" as meaning that the following code should work:
create table #tmp_codes (
code varchar(12) NULL
)
insert into #tmp_codes (code)
values
('AAA'),
('BBB'),
('CCC')
However it errors with
Incorrect syntax near ',' on line 7
I'm using Sybase ASE 15 and cannot see any reference to inserting multiple rows on this support page for the INSERT statement
Is this feature available in Sybase?
Your first Sybase doc link is ASA, not ASE, documentation.
In ASE you can insert multiple rows only with an insert ... select statement.
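For example, a minimal sketch of that approach against the #tmp_codes table from the question, building the row set with UNION ALL (ASE supports SELECT without a FROM clause):

-- insert multiple constant rows via insert ... select
insert into #tmp_codes (code)
select 'AAA'
union all
select 'BBB'
union all
select 'CCC'

Each SELECT contributes one row, so a single INSERT statement loads all three values.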
This question was asked long ago, but this may be useful for reference.
Create a text file with multiple rows.
Run this command from Sybase ASE:
INPUT INTO TableName
FROM FilePath\FileName FORMAT TEXT
Example:
INPUT INTO TempTable FROM c:\test.txt FORMAT TEXT
