Cassandra COPY TO String value incorrect

I have these string values in Cassandra.
select snapshot from qa; produces:
[{"rowKey":"6bdc3c1d-d21a-43e3-aa84-baeeda601bb3",xxxx,"lastUserAssertionDate":""}]
[{"rowKey":"6bdc3c1d-d21a-43e3-aa84-baeeda601bb3",xxxx,"lastUserAssertionDate":"2021-04-21T03:05:55Z"}]
Now I used COPY qa (snapshot) TO '/data/tmp/ak1simple.csv' WITH DELIMITER='|' AND HEADER=true; to export the data into a CSV file.
The exported data looks like this:
snapshot
[{\"rowKey\":\"6bdc3c1d-d21a-43e3-aa84-baeeda601bb3\", xxxx, \"lastUserAssertionDate\":\"\"}]
"[{\"rowKey\":\"6bdc3c1d-d21a-43e3-aa84-baeeda601bb3\",xxxx ,\"lastUserAssertionDate\":\"2021-04-21T03:05:55Z\"}]"
The second string is wrapped in double quotes but the first is not, although neither of them has surrounding quotes in Cassandra.
Does anyone have an idea why this happens?

Related

Is it possible for CQL to parse a JSON object to insert data?

From what I've seen so far, it seems impossible with Cassandra, but I thought I'd give it a shot:
How can I select the value of a JSON property, parsed from a JSON object string, and use it as part of an update/insert statement in Cassandra?
For example, I'm given the json object:
{
  "id": 123,
  "some_string": "hello there",
  "mytimestamp": "2019-09-02T22:02:24.355Z"
}
And this is the table definition:
CREATE TABLE IF NOT EXISTS myspace.mytable (
id text,
data blob,
PRIMARY KEY (id)
);
The thing to know at this point is that, for a given reason, the data field will be set to the JSON string. In other words, there is no 1:1 mapping between the given JSON and the table columns; the data field holds the whole JSON object as a kind of blob value.
... Is it possible to parse the timestamp value of the given JSON object as part of an insert statement?
Pseudo-code example of what I mean, which obviously doesn't work ($myJson is a placeholder for the JSON object string above):
INSERT INTO myspace.mytable (id, data)
VALUES (123, $myJson)
USING timestamp toTimeStamp($myJson.mytimestamp)
The quick answer is no, it's not possible to do that with CQL.
The norm is to parse the elements of the JSON object within your application to extract the corresponding values to construct the CQL statement.
As a side note, I would discourage using the CQL blob type due to possible performance issues should the blob size exceed 1MB. If it's JSON, consider storing it as the CQL text type instead. Cheers!
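For what it's worth, here is a rough sketch of that client-side approach with the Node.js cassandra-driver (the driver choice, contact point and data center are assumptions; the table and JSON are the ones from the question). The application parses the JSON, and the extracted timestamp is bound as microseconds to USING TIMESTAMP:
const cassandra = require('cassandra-driver');

// Placeholders: adjust contact point and data center for your cluster.
const client = new cassandra.Client({ contactPoints: ['127.0.0.1'], localDataCenter: 'datacenter1' });

// The JSON object from the question, received as a string.
const myJson = '{"id":123,"some_string":"hello there","mytimestamp":"2019-09-02T22:02:24.355Z"}';
const parsed = JSON.parse(myJson);

// USING TIMESTAMP expects microseconds since the epoch, so convert the ISO 8601 string.
const writeTimeMicros = cassandra.types.Long.fromNumber(Date.parse(parsed.mytimestamp)).multiply(1000);

client.execute(
  'INSERT INTO myspace.mytable (id, data) VALUES (?, ?) USING TIMESTAMP ?',
  [String(parsed.id), Buffer.from(myJson, 'utf8'), writeTimeMicros],
  { prepare: true }
);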
Worth mentioning: CQL can do a limited amount of JSON parsing on its own, albeit not as detailed as what you're asking for here (e.g., USING TIMESTAMP).
But something like this works:
> CREATE TABLE myjsontable (
... id TEXT,
... some_string TEXT,
... PRIMARY KEY (id));
> INSERT INTO myjsontable JSON '{"id":"123","some_string":"hello there"}';
> SELECT * FROM myjsontable WHERE id='123';
id | some_string
-----+-------------
123 | hello there
(1 rows)
In your case you'd either have to redesign the table or the JSON payload so that they match. But as Erick and Cédrick have mentioned, the USING timestamp part would have to happen client-side.
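To illustrate the payload-reshaping option, here is a hedged Node.js sketch (cassandra-driver assumed, connection details are placeholders): keep only the keys that match columns of myjsontable, stringify id to match the TEXT column, and bind the reshaped document to INSERT ... JSON.
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({ contactPoints: ['127.0.0.1'], localDataCenter: 'datacenter1' });

// Incoming object from the question; mytimestamp has no matching column, so it is dropped.
const incoming = { id: 123, some_string: 'hello there', mytimestamp: '2019-09-02T22:02:24.355Z' };
const payload = { id: String(incoming.id), some_string: incoming.some_string };

// INSERT ... JSON accepts a bind marker for the JSON document.
client.execute('INSERT INTO myjsontable JSON ?', [JSON.stringify(payload)], { prepare: true });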
What you detailed is doable with Cassandra.
Timestamp
To insert a timestamp in a query, it should be formatted as an ISO 8601 string; sample examples can be found here. In your code, you might have to convert the incoming value to the expected type and format.
Blob:
A blob is meant to store binary data, so it cannot be put ad hoc as a string in a CQL query (you can use the TEXT type instead if you want to store the data base64-encoded).
When you need to insert binary data you also need to provide the proper type. For instance, if you are working with JavaScript you need to provide a Buffer, as described in the documentation. Then, when you execute your query, you externalize your parameters:
const sampleId = 123;
const sampleData = Buffer.from('hello world', 'utf8');
const sampleTimeStamp = new Date();
client.execute(
  'INSERT INTO myspace.mytable (id, data) VALUES (?, ?) USING timestamp toTimeStamp(?)',
  [sampleId, sampleData, sampleTimeStamp]
);

Resolve datatype conflicts with COPY command in Databricks

I have a requirement to load data from a CSV file in ADLS Gen2 into a Delta table, and I am using COPY INTO for this. One of the columns in the target table, 'IS_ACTIVE', is defined as TINYINT. When triggering the code below, it fails with the following error:
Failed to merge fields 'IS_ACTIVE' and 'IS_ACTIVE'. Failed to merge incompatible data types ByteType and StringType
COPY INTO metadata.md_config_master
FROM 'abfss://{container}@{storage_account}.dfs.core.windows.net/table_folder/'
WITH (CREDENTIAL (AZURE_SAS_TOKEN = '<sas_token_string>')
)
FILEFORMAT = CSV
FILES = ('MD_CONFIG_MASTER.csv')
FORMAT_OPTIONS ('mergeSchema'='true', 'header' = 'true', 'inferSchema'='true')
COPY_OPTIONS ('force' = 'true', 'mergeSchema'= 'true')
When I did not use the 'inferSchema'='true' option in FORMAT_OPTIONS, it also failed due to a data type mismatch on an integer column; when I used 'inferSchema'='true', that error disappeared.
But I still have the issue with the TINYINT column conversion.
When I create the target table with all string columns, the command succeeds.
Is there a way to make this code run? I did not define ByteType at all in my target table. Maybe it is treating TINYINT as ByteType; I am not sure.
Note: I can read the ADLS file, create a pandas DataFrame, convert it to a Spark DataFrame and load the data into the target table, but that is not what I am looking for. I want this COPY command to work, hence I am looking for a solution specifically for the COPY command.
Sample Schema:
CREATE TABLE IF NOT EXISTS metadata.MD_CONFIG_MASTER(
CONFIG_ID INT,
CLIENT_NAME STRING,
TARGET_DATABASE STRING,
TARGET_DATABASE_MODULE_NAME STRING,
TARGET_DATABASE_DRIVER_CLASS_NAME STRING,
TARGET_DATABASE_CNX_INFO STRING,
EXECUTION_PLATFORM STRING,
IS_ACTIVE TINYINT,
INSERT_DTS TIMESTAMP
) USING DELTA;
This error occurs when your data has string values in that column.
I reproduced this with sample data by putting a string value in an Age column, and I got the same error when I tried the COPY INTO code.
So, check whether your data has any string values in that column.
When I removed the string value and put an integer value in that column instead, I was able to copy the data, and it was successfully loaded into the Delta table.

Kusto/Azure Data Explorer - How can I partition an external table using a timespan field?

Hoping someone can help.
I am new to Kusto and have to get an external table working that reads data from an Azure Blob storage account, but the one table I have is unique in that the data for the timestamp is split into two separate columns, i.e. LogDate and LogTime (see script below).
My data is stored in the following structure in the Azure Storage account container (container is named "employeedata", for example):
{employeename}/{year}/{month}/{day}/{hour}/{minute}.csv, in a simple CSV format.
I know the CSV is good because if I import it into a normal Kusto table, it works perfectly.
My KQL script for the external table creation looks as follows:
.create-or-alter external table EmpLogs (Employee: string, LogDate: datetime, LogTime:timestamp)
kind=blob
partition by (EmployeeName:string = Employee, yyyy:datetime = startofday(LogDate), MM:datetime = startofday(LogDate), dd:datetime = startofday(LogDate), HH:datetime = todatetime(LogTime), mm:datetime = todatetime(LogTime))
pathformat = (EmployeeName "/" datetime_pattern("yyyy", yyyy) "/" datetime_pattern("MM", MM) "/" datetime_pattern("dd", dd) "/" substring(HH, 0, 2) "/" substring(mm, 3, 2) ".csv")
dataformat=csv
(
h@'************************'
)
with (folder="EmployeeInfo", includeHeaders="All")
I am getting the error below constantly, which is not very helpful (redacted from full error, basically comes down to the fact there is a syntax error somewhere):
Syntax error: Query could not be parsed: {
"error": {
"code": "BadRequest_SyntaxError",
"message": "Request is invalid and cannot be executed.",
"#type": "Kusto.Data.Exceptions.SyntaxException",
"#message": "Syntax error: Query could not be parsed: . Query: '.create-or-alter external table ........
I know the todatetime() function works on timespans; I tested it with another table and it created a date similar to the following: 0001-01-01 20:18:00.0000000.
I have tried using the bin() function on the timestamp/LogTime columns, but I get the same error as above, and I even tried importing the time value as a string and doing some string manipulation on it. No luck, still the same syntax error.
Any help/guidance would be greatly appreciated.
Thank you!
Currently, there's no way to define an external table partition based on more than one column. If your dataset timestamp is split between two columns, LogDate:datetime and LogTime:timespan, then the best you can do is use a virtual column for the partition by time:
.create-or-alter external table EmpLogs(Employee: string, LogDate:datetime, LogTime:timespan)
kind=blob
partition by (EmployeeName:string = Employee, PartitionDate:datetime)
pathformat = (EmployeeName "/" datetime_pattern("yyyy/MM/dd/HH/mm", PartitionDate))
dataformat=csv
(
//h@'************************'
)
with (folder="EmployeeInfo", includeHeaders="All")
Now you can filter by the virtual column and fine-tune using LogTime:
external_table("EmpLogs")
| where Employee in ("John Doe", ...)
| where PartitionDate between(datetime(2020-01-01 10:00:00) .. datetime(2020-01-01 11:00:00))
| where LogTime ...

Inserting a value into a frozen set in Cassandra 3

I am currently working on a Cassandra 3 database in which one of its tables has a column that is defined like this:
column_name map<int, frozen <set<int>>>
When I have to replace the value of a complete set for a given map key x, I just do this:
UPDATE keyspace.table SET column_name[x] = {1,2,3,4,5} WHERE ...
The thing is that I need to insert a value into a set for a given key. I tried this:
UPDATE keyspace.table SET column_name[x] = column_name[x] + {1} WHERE ...
But it returns:
SyntaxException: line 1:41 no viable alternative at input '[' (... SET column_name[x] = [column_name][...)
What am I doing wrong? Does anyone know how to insert data the way I need?
Since the set value in the map is frozen, you can't use UPDATE like this.
A frozen value serializes multiple components into a single value. Non-frozen types allow updates to individual fields, but Cassandra treats the value of a frozen type as a blob: the entire value must be overwritten.
You have to read the full map, get the value for the key, append the new item, and then write the whole set back.
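A hedged Node.js sketch of that read-modify-write cycle with the cassandra-driver (the pk column, the keyspace.table name and the connection details are placeholders standing in for the question's WHERE clause):
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({ contactPoints: ['127.0.0.1'], localDataCenter: 'datacenter1' });

async function addToFrozenSet(pk, mapKey, newItem) {
  // 1. Read the current map value for the row.
  const rs = await client.execute(
    'SELECT column_name FROM keyspace.table WHERE pk = ?',
    [pk],
    { prepare: true }
  );
  const row = rs.first();
  const map = (row && row.column_name) || {};   // the driver returns the map as an object
  const current = map[mapKey] || [];            // the frozen set comes back as an array
  const updated = Array.from(new Set([...current, newItem]));

  // 2. Overwrite the whole frozen set for that map key.
  await client.execute(
    'UPDATE keyspace.table SET column_name[?] = ? WHERE pk = ?',
    [mapKey, updated, pk],
    { prepare: true }
  );
}
Note that this read-modify-write is not atomic, so concurrent updates to the same key can overwrite each other.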

COPY FROM CSV with static fields on Postgres

I'd like to switch an existing system that imports data into a PostgreSQL 9.5 database from CSV files to a more efficient approach.
I'd like to use the COPY statement because of its good performance. The problem is that I need one field populated that is not in the CSV file.
Is there a way to have the COPY statement add a static field to all the rows inserted?
The perfect solution would have looked like this:
COPY data(field1, field2, field3='Account-005')
FROM '/tmp/Account-005.csv'
WITH DELIMITER ',' CSV HEADER;
Do you know a way to have that field populated in every row?
My server is running node.js, so I'm open to any cost-efficient solution that completes the files using node before COPYing them.
Use a temp table to import into. Before inserting the new records into the actual table, this allows you to:
add/remove/update columns
add extra literal data
delete or ignore records (such as duplicates)
-- target table
CREATE TABLE data
( id SERIAL PRIMARY KEY
, batch_name varchar NOT NULL
, remote_key varchar NOT NULL
, payload varchar
, UNIQUE (batch_name, remote_key)
-- or::
-- , UNIQUE (remote_key)
);
-- temp table
CREATE TEMP TABLE temp_data
( remote_key varchar -- PRIMARY KEY
, payload varchar
);
COPY temp_data(remote_key,payload)
FROM '/tmp/Account-005'
;
-- The actual insert
-- (you could also filter out or handle duplicates here)
INSERT INTO data(batch_name, remote_key, payload)
SELECT 'Account-005', t.remote_key, t.payload
FROM temp_data t
;
BTW, it is possible to automate the above: put it into a function (or maybe a prepared statement), using the filename/literal as an argument.
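Since the question mentions a node.js server, here is a hedged sketch of that automation using the pg and pg-copy-streams packages (package choice, connection settings, the HEADER option and a reasonably recent Node for stream/promises are assumptions). It streams the file through COPY ... FROM STDIN, so the CSV does not need to be readable by the database server, and passes the static value as a query parameter:
const fs = require('fs');
const { pipeline } = require('stream/promises');
const { Client } = require('pg');
const { from: copyFrom } = require('pg-copy-streams');

async function importBatch(csvPath, batchName) {
  const client = new Client(); // connection settings come from the usual PG* environment variables
  await client.connect();
  try {
    await client.query('CREATE TEMP TABLE temp_data (remote_key varchar, payload varchar)');

    // Stream the CSV file into the temp table.
    const ingest = client.query(copyFrom(
      'COPY temp_data (remote_key, payload) FROM STDIN WITH (FORMAT csv, HEADER true)'
    ));
    await pipeline(fs.createReadStream(csvPath), ingest);

    // Add the static field while moving the rows into the actual table.
    await client.query(
      'INSERT INTO data (batch_name, remote_key, payload) SELECT $1, remote_key, payload FROM temp_data',
      [batchName]
    );
  } finally {
    await client.end();
  }
}

importBatch('/tmp/Account-005.csv', 'Account-005').catch(console.error);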
Set a default for the column:
alter table data
alter column field3 set default 'Account-005'
Do not mention it in the COPY command:
COPY data(field1, field2) FROM...
