Read JSON data from CSV in Azure Synapse

I have a CSV with several columns, one of which contains JSON. Can I query this JSON column with special handling?
An example is below (the original screenshot of the sample CSV is omitted).
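For illustration, the JSON column is assumed to hold a friends array like the one below; this is a reconstruction based on the query in the answer, not the original screenshot:
id,jsondata
1,"{""friends"":[{""id"":0,""name"":""Valarie Strickland""},{""id"":1,""name"":""Mathews Harrison""},{""id"":2,""name"":""Cecilia Giles""}]}"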
My goal is to run a query with OPENROWSET (see the openrowset documentation) and get output similar to this:
ID | Name
---+-------------------
0  | Valarie Strickland
1  | Mathews Harrison
2  | Cecilia Giles

I tried to reproduce the same in my environment.
My sample data: (screenshot omitted)
To convert the column with nested JSON into table form, I first declared a variable of type nvarchar(max) and assigned the SELECT query's result to it.
DECLARE @json nvarchar(max)
-- Note: assigning a subquery with SET works only if it returns a single
-- value (one row, one column), as this sample file does.
SET @json = (SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://dlsg2p.dfs.core.windows.net/fsn2p/jsoncolumn.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIRSTROW = 3
    ) AS [result])
With the statement below I checked that the value was assigned properly to the variable:
select @json as JSON
Using OPENJSON with CROSS APPLY to convert the JSON to table format:
SELECT b.id AS ID, b.name AS Name
FROM
    OPENJSON(@json)
    WITH
    (
        friends NVARCHAR(MAX) AS JSON -- AS JSON returns the nested array as a raw JSON fragment
    ) AS a
CROSS APPLY
    OPENJSON(a.friends) -- expands each element of the friends array into its own row
    WITH
    (
        id INT,
        name VARCHAR(MAX)
    ) AS b;
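A variant sketch that skips the intermediate variable by applying OPENJSON directly to the OPENROWSET result; the column name jsondoc and the $.friends path are assumptions based on the sample above, not part of the original answer:
SELECT b.id AS ID, b.name AS Name
FROM OPENROWSET(
    BULK 'https://dlsg2p.dfs.core.windows.net/fsn2p/jsoncolumn.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    FIRSTROW = 3
) WITH (jsondoc NVARCHAR(MAX)) AS rows
CROSS APPLY OPENJSON(jsondoc, '$.friends') -- the optional path argument drills into the nested array
WITH (
    id INT,
    name VARCHAR(100)
) AS b;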
(Execution and output screenshots omitted.)

Related

What is the meaning of WITH (jsonContent varchar(MAX)) AS [result] in this code on Azure Synapse Analytics?

I am not able to understand why we are writing with (doc nvarchar(max)) as rows.
JSON file:
{"date_rep":"2020-07-24","day":24,"month":7,"year":2020,"cases":3,"deaths":0,"geo_id":"AF"}
{"date_rep":"2020-07-25","day":25,"month":7,"year":2020,"cases":7,"deaths":0,"geo_id":"AF"}
{"date_rep":"2020-07-26","day":26,"month":7,"year":2020,"cases":4,"deaths":0,"geo_id":"AF"}
{"date_rep":"2020-07-27","day":27,"month":7,"year":2020,"cases":8,"deaths":0,"geo_id":"AF"}
My code to open this JSON file starts from here:
select
JSON_VALUE(doc, '$.date_rep') AS date_reported,
JSON_VALUE(doc, '$.countries_and_territories') AS country,
CAST(JSON_VALUE(doc, '$.deaths') AS INT) as fatal,
JSON_VALUE(doc, '$.cases') as cases,
doc
from openrowset(
bulk 'https://demoaccname.dfs.core.windows.net/demoadlscontainer/afg.json',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b'
) with (doc nvarchar(max)) as rows
The WITH (jsonContent nvarchar(max)) clause defines the schema of the rowset that OPENROWSET returns: a single nvarchar(max) column named jsonContent, as in:
FROM
OPENROWSET(
    BULK 'https://XXXXX',
    FORMAT = 'CSV',
    FIELDQUOTE = '0x0b',
    FIELDTERMINATOR = '0x0b',
    ROWTERMINATOR = '0x0b'
) WITH (jsonContent nvarchar(max)) AS rows
nvarchar(max) in SQL can store any string value. When we load JSON from the source into a SQL table, we don't know the exact length of the JSON, and JSON is a string type.
SQL has no dedicated datatype for JSON, so we use WITH (jsonContent nvarchar(MAX)) AS [result] to read the JSON from the file into an nvarchar column.
JSON documents can be stored as-is in NVARCHAR columns. This is the best way to get quick data load and ingestion, because the loading speed matches that of plain string columns.
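As a minimal, self-contained sketch of that pattern (the sample document is taken from the JSON file above), JSON held in an nvarchar variable or column can be queried directly with the built-in JSON functions:
DECLARE @doc nvarchar(max) =
    N'{"date_rep":"2020-07-24","day":24,"month":7,"year":2020,"cases":3,"deaths":0,"geo_id":"AF"}';

SELECT
    JSON_VALUE(@doc, '$.date_rep') AS date_reported, -- extract a scalar by JSON path
    CAST(JSON_VALUE(@doc, '$.cases') AS INT) AS cases
WHERE ISJSON(@doc) = 1; -- guard: only parse well-formed JSON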

MssqlRow to JSON string without knowing structure and data type at compile time [duplicate]

Using PostgreSQL I can have multiple rows of json objects.
select (select ROW_TO_JSON(_) from (select c.name, c.age) as _) as jsonresult from employee as c
This gives me this result:
{"age":65,"name":"NAME"}
{"age":21,"name":"SURNAME"}
But in SQL Server, when I use the FOR JSON AUTO clause, it gives me an array of JSON objects instead of multiple rows.
select c.name, c.age from customer c FOR JSON AUTO
[{"age":65,"name":"NAME"},{"age":21,"name":"SURNAME"}]
How can I get the same result format in SQL Server?
By constructing separate JSON in each individual row, so each row returns its own object:
SELECT (SELECT [age], [name] FOR JSON PATH, WITHOUT_ARRAY_WRAPPER)
FROM customer
There is an alternative form that doesn't require you to know the table structure (but likely has worse performance because it may generate a large intermediate JSON):
SELECT [value] FROM OPENJSON(
(SELECT * FROM customer FOR JSON PATH)
)
No structure needed, with better performance:
SELECT c.id, jdata.*
FROM customer c
CROSS APPLY
    (SELECT * FROM customer jc WHERE jc.id = c.id FOR JSON PATH, WITHOUT_ARRAY_WRAPPER) jdata (jdata)
Same as Barak Yellin's answer, but lazier:
1 - Create this proc:
CREATE PROC PRC_SELECT_JSON(@TBL VARCHAR(100), @COLS VARCHAR(1000)='D.*') AS BEGIN
EXEC('
SELECT X.O FROM ' + @TBL + ' D
CROSS APPLY (
SELECT ' + @COLS + '
FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
) X (O)
')
END
2 - You can use either all columns or specific columns:
CREATE TABLE #TEST ( X INT, Y VARCHAR(10), Z DATE )
INSERT #TEST VALUES (123, 'TEST1', GETDATE())
INSERT #TEST VALUES (124, 'TEST2', GETDATE())
EXEC PRC_SELECT_JSON #TEST
EXEC PRC_SELECT_JSON #TEST, 'X, Y'
If you're using PHP, add SET NOCOUNT ON; as the first line (why?).

Inserting Timestamp Into Snowflake Using Python 3.8

I have an empty table defined in Snowflake as:
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP
);
And it creates the correct table, which has been checked using the DESC command in SQL. Then, using the Snowflake Python connector, we are trying to execute the following query:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},{ct});'
ctx.cursor().execute(insert_query)
Just before this query the variables are defined. The main challenge is getting the current timestamp written into Snowflake. Here the value of ct is defined as:
import datetime
ct = datetime.datetime.now()
print(ct)
2021-04-30 21:54:41.676406
But when we try to execute this INSERT query we get the following error message:
ProgrammingError: 001003 (42000): SQL compilation error:
syntax error line 1 at position 157 unexpected '21'.
Can I kindly get some help on how to format the datetime value here? Help is appreciated.
In addition to the answer @Lukasz provided, you could also think about defining current_timestamp() as the default for the TIME_PREDICTED column:
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP DEFAULT current_timestamp
);
And then just insert ACCOUNT_ID and PREDICTED_PROBABILITY:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY) VALUES ({accountId}, {risk_score});'
ctx.cursor().execute(insert_query)
It will automatically assign the insert time to TIME_PREDICTED.
Educated guess. When performing the insert with:
insert_query = f'INSERT INTO ...(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED)
VALUES ({accountId}, {risk_score},{ct});'
it is string interpolation: ct is rendered as the unquoted string representation of the datetime, so the statement ends with something like VALUES (..., 2021-04-30 21:54:41.676406), which is not valid SQL syntax; hence the error about an unexpected '21'.
I would suggest using proper variable binding instead:
ctx.cursor().execute("INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES "
"(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) "
"VALUES(:1, :2, :3)",
(accountId,
risk_score,
("TIMESTAMP_LTZ", ct)
)
);
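One detail worth flagging: the Snowflake Python connector defaults to pyformat binding, so the :1, :2, :3 placeholders above require switching to numeric binding first. A minimal sketch, reusing the table from the question (the credentials and the example values 1 and 0.42 are placeholders):
import datetime
import snowflake.connector

# Numeric binding (:1, :2, ...) only works after this module-level setting;
# the connector's default paramstyle is 'pyformat'.
snowflake.connector.paramstyle = 'numeric'

ctx = snowflake.connector.connect(user='...', password='...', account='...')
ct = datetime.datetime.now()
ctx.cursor().execute(
    "INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES "
    "(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) "
    "VALUES(:1, :2, :3)",
    (1, 0.42, ("TIMESTAMP_LTZ", ct))  # the tuple binds ct with an explicit Snowflake type
)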
Avoid SQL Injection Attacks
Avoid binding data using Python’s formatting function because you risk SQL injection. For example:
# Binding data (UNSAFE EXAMPLE)
con.cursor().execute(
"INSERT INTO testtable(col1, col2) "
"VALUES({col1}, '{col2}')".format(
col1=789,
col2='test string3')
)
Instead, store the values in variables, check those values (for example, by looking for suspicious semicolons inside strings), and then bind the parameters using qmark or numeric binding style.
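For contrast with the unsafe snippet above, the same insert with qmark binding might look like this (assuming snowflake.connector.paramstyle = 'qmark' has been set before connecting):
# Binding data (safe): the values travel separately from the SQL text
con.cursor().execute(
    "INSERT INTO testtable(col1, col2) VALUES(?, ?)",
    (789, 'test string3')
)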
You forgot to place quotes before and after {ct}. The code should be:
insert_query = "INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},'{ct}');".format(accountId=accountId,risk_score=risk_score,ct=ct)
ctx.cursor().execute(insert_query)

Why does AWS Athena return the "string" datatype for all table fields on "show create table" or describe table commands?

Why does AWS Athena return the "string" datatype for all of a table's fields on the "show create table" command or when describing tables?
For example, for the table t_mus_albums:
albumid (bigint)
title (string)
artistid (bigint)
When running
show create table t_mus_albums;
I get:
CREATE EXTERNAL TABLE `t_mus_albums`(
`albumid` string COMMENT 'from deserializer',
`title` string COMMENT 'from deserializer',
`artistid` string COMMENT 'from deserializer')
I think you might be doing something wrong, or the data may not have been correctly formatted when the table was generated automatically.
Here are systematic steps to solve your problem.
Assume that your data is in the format below:
ID,Code,City,State
41,5,"Youngstown", OH
42,52,"Yankton", SD
46,35,"Yakima", WA
42,16,"Worcester", MA
43,37,"Wisconsin Dells", WI
36,5,"Winston-Salem", NC
Then your CREATE TABLE statement will look something like this:
CREATE EXTERNAL TABLE IF NOT EXISTS example.tbl_datatype (
`id` int,
`code` int,
`city` string,
`state` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://example-bucket/location/a/'
TBLPROPERTIES ('has_encrypted_data'='false');
Then run the query to describe the table:
SHOW CREATE TABLE tbl_datatype;
It will give you output something like the following:
CREATE EXTERNAL TABLE `tbl_datatype`(
`id` int,
`code` int,
`city` string,
`state` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://example-bucket/location/a/';
Hope it helps!
This is because you use the CSV SerDe and not, e.g., the text SerDe.
The CSV SerDe supports only the string data type, so all columns are of this type.
From https://docs.aws.amazon.com/athena/latest/ug/csv.html
The OpenCSV SerDe [...] Converts all column type values to STRING.
The documentation outlines some conditions under which the table schema could be different than all strings ("For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types when it can discern them"), but apparently this was not effective in your case.
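If you stay on the OpenCSV SerDe, one workaround is to cast the string columns back to their intended types at query time; a minimal sketch against the t_mus_albums table from the question:
-- CAST the STRING columns produced by the OpenCSV SerDe back to BIGINT
SELECT CAST(albumid AS BIGINT) AS albumid,
       title,
       CAST(artistid AS BIGINT) AS artistid
FROM t_mus_albums;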

is it possible to insert json into cassandra without creating table

How can I insert JSON objects into a Cassandra table without creating the table? Can Cassandra parse JSON into a table that has not been created? Or can I create a table with no columns and insert JSON into it?
Thanks.
Since Cassandra 2.2 you can insert JSON directly, but the table still has to be created beforehand.
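A minimal sketch of that INSERT ... JSON form (the users table and its columns are illustrative, not from the question):
CREATE TABLE users (
    id int PRIMARY KEY,
    first_name text,
    last_name text
);

-- Keys in the JSON document map to column names; omitted columns become null.
INSERT INTO users JSON '{"id": 1, "first_name": "Ashraful", "last_name": "Islam"}';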
You need to create the table first; then you can insert data.
You can create a table like the one below:
CREATE TABLE json_data (
id timeuuid PRIMARY KEY,
data text
);
And you can insert the JSON as a string with the query below:
INSERT INTO json_data (id, data) VALUES (now(), '{"first_name": "Ashraful", "last_name": "Islam"}');
