Rename a file column name in Azure Synapse

I have a file with a poor naming convention that I would like to clean up before using it in Azure Synapse. Is it possible to rename the column in the WITH block?
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'path_to_file.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    FIRSTROW = 2)
WITH (
    [ORDER ID] varchar(50)
) AS rows
I could use an alias in the select, but I was hoping to clean it up before that.
SELECT [ORDER ID] as order_id
And I could wrap this in a view. I was just hoping there's a way to rename earlier.

Yes, it is possible to rename columns in the WITH block; the name you provide there will override the column name read from the file (even with HEADER_ROW set to TRUE).
There's a caveat though. You will have to either provide names for all of your columns:
SELECT TOP 10 *
FROM OPENROWSET
(BULK 'path_to_file.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = true)
WITH
(
your_column_name_1 varchar(50),
...
your_column_name_N varchar(50)
)
AS rows
...or pick the ones you want to keep and/or rename, using their ordinal numbers:
SELECT TOP 10 *
FROM OPENROWSET
(BULK 'path_to_file.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = true)
WITH
(
your_column_name_1 varchar(50) 1,
your_column_name_4 varchar(50) 4
)
AS rows

You can also override the names with a subquery / derived table, e.g.
SELECT *
FROM (
SELECT TOP 100 *
FROM OPENROWSET (
BULK 'some path',
FORMAT = 'CSV',
PARSER_VERSION ='2.0',
FIRSTROW = 2
) AS [result]
) x ( col1, col2 )
I think this is more compact than the WITH clause, where, as far as I can tell, you have to specify all columns, all data types and all ordinals. Unfortunately it won't let you put the column list directly after the [result] alias.
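The question also mentions wrapping this in a view. For completeness, a minimal sketch of that approach using the same derived-table trick (the view name clean_orders is illustrative, and the column list must name every column the file returns; order_id and col2 stand in for those here):
CREATE VIEW clean_orders AS
SELECT *
FROM (
    SELECT *
    FROM OPENROWSET (
        BULK 'path_to_file.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIRSTROW = 2
    ) AS [result]
) x ( order_id, col2 );
Downstream queries can then refer to order_id without repeating the OPENROWSET boilerplate.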


Cassandra (Amazon Keyspaces) query error on clustering columns

I am trying to execute a query on clustering columns in Amazon Keyspaces. Since I don't want to use ALLOW FILTERING with my native query, I have created 4-5 clustering columns for better performance.
But while trying to filter on 2 clustering columns with >= and <=, I am getting an error with the message below:
message="Clustering column "start_date" cannot be restricted (preceding column "segment_id" is restricted by a non-EQ relation)"
I have also tried a multi-column query, but I get a not-supported error:
message="MultiColumn relation is not yet supported."
Queries for reference:
select * from table_name where shard_id = 568 and division = '10' and customer_id = 568113 and (segment_id, start_date,end_date)>= (-1, '2022-05-16','2017-03-28') and flag = 1;
or
select * from table_name where shard_id = 568 and division = '10' and customer_id = 568113 and segment_id > -1 and start_date >='2022-05-16';
I am assuming that your table has the following primary key:
CREATE TABLE table_name (
...
PRIMARY KEY(shard_id, division, customer_id, segment_id, start_date, end_date)
)
In any case, your CQL query is invalid because you can only apply an inequality operator to the last clustering column in your query. For example, these are valid queries based on your table schema:
SELECT * FROM table_name
WHERE shard_id = ? AND division = ?
AND customer_id <= ?
SELECT * FROM table_name
WHERE shard_id = ? AND division = ?
AND customer_id = ? AND segment_id > ?

SELECT * FROM table_name
WHERE shard_id = ? AND division = ?
AND customer_id = ? AND segment_id = ? AND start_date >= ?
All preceding columns must be filtered by an equality operator except for the very last clustering column in your query.
If you require a complex predicate for your queries, you will need to index your Cassandra data with tools such as Elasticsearch or Apache Solr. They will allow you to run complex search parameters to retrieve data from your database. Cheers!
ALLOW FILTERING gets a bad rap sometimes. It all depends on how many rows you end up scanning. It's good to understand how many rows per partition will be scanned and work backwards from there. Only the last column can contain inequality statements to bound ranges. Try to order your columns so the earliest ones eliminate the most rows, which reduces the number of rows 'filtered'.
In the example below we use the key columns up to start_date and filter (with ALLOW FILTERING) on end_date, segment_id, and flag:
select * from table_name where shard_id = 568 and division = '10' and customer_id = 568113 and start_date >= '2022-05-16' and end_date > '2017-03-28' and segment_id > -1 and flag = 1 allow filtering;

Insert new rows, continue existing rowset row_number count

I'm attempting to perform some sort of upsert operation in U-SQL, where I pull data every day from a file and compare it with yesterday's data, which is stored in a table in Data Lake Storage.
I have created an ID column in the DL table using ROW_NUMBER(), and it is this "counter" I wish to continue when appending new rows to the old dataset. E.g. the last inserted row in the DL table could look like this:
ID | Column1   | Column2
---+-----------+--------
10 | SomeValue | 1
I want the next rows to have the following ascending ids:
11 | SomeValue | 1
12 | SomeValue | 1
How would I go about making sure that the next X rows continue the ID count incrementally, such that each new row's ID is 1 greater than the last?
You could use ROW_NUMBER and then add it to the max value from the original table (i.e. using CROSS JOIN and MAX). A simple demo of the technique:
DECLARE @outputFile string = @"\output\output.csv";

@originalInput =
    SELECT *
    FROM ( VALUES
        ( 10, "SomeValue 1", 1 )
    ) AS x ( id, column1, column2 );

@newInput =
    SELECT *
    FROM ( VALUES
        ( "SomeValue 2", 2 ),
        ( "SomeValue 3", 3 )
    ) AS x ( column1, column2 );

@output =
    SELECT id, column1, column2
    FROM @originalInput
    UNION ALL
    SELECT (int)(x.id + ROW_NUMBER() OVER()) AS id, column1, column2
    FROM @newInput
    CROSS JOIN ( SELECT MAX(id) AS id FROM @originalInput ) AS x;

OUTPUT @output
TO @outputFile
USING Outputters.Csv(outputHeader:true);
My results (the assignment of 11 and 12 to the two new rows is not guaranteed, since the ROW_NUMBER has no ORDER BY):
id | column1     | column2
---+-------------+--------
10 | SomeValue 1 | 1
11 | SomeValue 2 | 2
12 | SomeValue 3 | 3
You will have to be careful if the original table is empty and add some additional conditions / null checks, but I'll leave that up to you.
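For example, a minimal sketch of one such guard, assuming the MAX aggregate over an empty rowset yields a null that can then be defaulted with C#'s null-coalescing operator:
@maxId =
    SELECT (int?) MAX(id) AS id FROM @originalInput;

@output =
    SELECT id, column1, column2
    FROM @originalInput
    UNION ALL
    SELECT (int)((x.id ?? 0) + ROW_NUMBER() OVER()) AS id, column1, column2
    FROM @newInput
    CROSS JOIN @maxId AS x;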

Azure Data Lake auto-generated column in U-SQL

I want to add an auto-generated column in my U-SQL SELECT statement.
How can we do this?
Something like an identity column in SQL Server.
Regards,
Manish
The closest would be ROW_NUMBER. Here is a simple example:
@output =
    SELECT
        ROW_NUMBER() OVER () AS rn,
        *
    FROM @input;
You cannot use ROW_NUMBER directly with EXTRACT at this time. Simply extract the data first, then add the row number in a subsequent statement, like this:
// Get raw input
@input =
    EXTRACT rawData string
    FROM "/input/yourFile.txt"
    USING Extractors.Tsv();

// Add a row number
@working =
    SELECT
        ROW_NUMBER() OVER () AS rn,
        *
    FROM @input;

Export Postgres queries to the same Excel file

I know I could copy a query to a csv file using:
COPY (
SELECT * FROM my_table
) TO '/tmp/myfile.csv' CSV HEADER DELIMITER E'\t';
But I need to export multiple queries to the same csv file, like:
COPY (
(SELECT * FROM my_table) AS tab1
(SELECT * FROM my_table2) AS tab2
(SELECT * FROM my_table3) AS tab3
(SELECT * FROM my_table4) AS tabN
) TO '/tmp/myfile.csv' CSV HEADER DELIMITER E'\t';
Is there a way of doing that?
You can do a union of all the select statements, as long as you select the same column types from each, like so. If cola and colc are not the same type, cast them to a common type. If one query has fewer fields, select a static value such as 0 or NULL in its place.
COPY (
(SELECT cola, colb FROM my_table1)
UNION ALL
(SELECT colc, cold FROM my_table2)
) TO '/tmp/myfile.csv' CSV HEADER DELIMITER E'\t';
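A minimal sketch combining the casting and padding ideas, reusing the column names from above and pretending my_table2 has no second column (the ::text casts are just for illustration):
COPY (
SELECT cola::text, colb FROM my_table1
UNION ALL
SELECT colc::text, NULL FROM my_table2
) TO '/tmp/myfile.csv' CSV HEADER DELIMITER E'\t';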
Alternatively, you could export to 2 files and combine them. Use your code to export to /tmp/myfile1.csv and /tmp/myfile2.csv, then:
cat /tmp/myfile1.csv /tmp/myfile2.csv >/tmp/combined.csv
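Note that with the HEADER option each file starts with its own header row, so the combined file will contain a stray header line in the middle. A sketch of one way to avoid that, assuming a Unix shell (tail -n +2 skips the second file's header):
cat /tmp/myfile1.csv > /tmp/combined.csv
tail -n +2 /tmp/myfile2.csv >> /tmp/combined.csv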

WHERE variable = ( subquery ) in OpenSQL

I'm trying to retrieve rows from a table where a subquery matches a variable. However, it seems as if the WHERE clause only lets me compare fields of the selected tables against a constant, variable or subquery.
I would expect to write something like this:
DATA(lv_expected_lines) = 5.
SELECT partner contract_account
INTO TABLE lt_bp_ca
FROM table1 AS tab1
WHERE lv_expected_lines = (
SELECT COUNT(*)
FROM table2
WHERE partner = tab1~partner
AND contract_account = tab1~contract_account ).
But obviously this select treats my local variable as a field name and it gives me the error "Unknown column name "lv_expected_lines" until runtime, you cannot specify a field list."
But in standard SQL this is perfectly possible:
SELECT PARTNER, CONTRACT_ACCOUNT
FROM TABLE1 AS TAB1
WHERE 5 = (
SELECT COUNT(*)
FROM TABLE2
WHERE PARTNER = TAB1.PARTNER
AND CONTRACT_ACCOUNT = TAB1.CONTRACT_ACCOUNT );
So how can I replicate this logic in RSQL / Open SQL?
If there's no way I'll probably just write native SQL and be done with it.
The program below might lead you to an Open SQL solution. It uses the SAP demo tables to determine the plane types that are used on a specific number of flights.
REPORT zgertest_sub_query.

DATA: lt_planetypes TYPE STANDARD TABLE OF s_planetpp.

PARAMETERS: p_numf TYPE i DEFAULT 62.

START-OF-SELECTION.
  SELECT planetype
    INTO TABLE lt_planetypes
    FROM sflight
    GROUP BY planetype
    HAVING COUNT( * ) EQ p_numf.

  LOOP AT lt_planetypes INTO DATA(planetype).
    WRITE: / planetype.
  ENDLOOP.
It only works if you don't need to read fields from TAB1. If you do, you will have to gather these with other selects while looping over your results.
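A minimal sketch of that follow-up pattern, staying with the SAP demo tables (APPENDING TABLE collects the matching rows from each loop pass):
DATA: lt_flights TYPE STANDARD TABLE OF sflight.

LOOP AT lt_planetypes INTO DATA(lv_planetype).
  SELECT * FROM sflight
    APPENDING TABLE lt_flights
    WHERE planetype = lv_planetype.
ENDLOOP.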
For those dudes who found this question in 2020, I report that this construction has been supported since ABAP 7.50. No workarounds are needed:
SELECT kunnr, vkorg
FROM vbak AS v
WHERE 5 = ( SELECT COUNT(*)
FROM vbap
WHERE kunnr = v~kunnr
AND vkorg = v~vkorg )
INTO TABLE @DATA(customers).
This selects all customers who made 5 sales orders within some sales organization.
In ABAP there is no way to do the query as in native SQL.
I would advise not to use native SQL; instead, give the SELECT/ENDSELECT statement a try.
DATA: ls_table1 TYPE table1,
      lt_table1 TYPE TABLE OF table1,
      lv_count  TYPE i.

SELECT partner contract_account
  INTO CORRESPONDING FIELDS OF ls_table1
  FROM table1.

  SELECT COUNT( * )
    INTO lv_count
    FROM table2
    WHERE partner = ls_table1-partner
    AND contract_account = ls_table1-contract_account.

  CHECK lv_count EQ 5.
  APPEND ls_table1 TO lt_table1.
ENDSELECT.
Here you append to lt_table1 only those rows where the count in table2 equals 5.
Hope it helps.
