Select and decode blob using python cassandra driver - python-3.x

I am trying to query the traces Cassandra table, which is part of the Jaeger architecture. As you can see, the refs field is a list:
cqlsh:jaeger_v1_dc1> describe traces
CREATE TABLE jaeger_v1_dc1.traces (
    trace_id blob,
    span_id bigint,
    span_hash bigint,
    duration bigint,
    flags int,
    logs list<frozen<log>>,
    operation_name text,
    parent_id bigint,
    process frozen<process>,
    refs list<frozen<span_ref>>,
    start_time bigint,
    tags list<frozen<keyvalue>>,
    PRIMARY KEY (trace_id, span_id, span_hash)
)
From the Python code:
traces = session.execute('SELECT span_id, refs FROM traces')
for t in traces:
    if t.refs is not None:
        parentTrace = t.refs[0].trace_id
My first question: is it possible to directly select the parent trace without iterating through the result? Is there a way I can get the first element in the list, and the fields inside it, straight from the SELECT statement?
From the terminal using cqlsh, I am getting this result: trace_id: 0x00000000000000003917678c73006f57. However, from the Python Cassandra client I get trace_id=b'\x00\x00\x00\x00\x00\x00\x00\x009\x17g\x8cs\x00oW'. Any idea what transformation happened to it? How can I decode it, since I want to use it to query the table again?

To my knowledge there is no easy way, as there is no guarantee that the spans are stored in a specific order. Worth noting, though: if by parentTrace you mean the root span of the trace (the first span), then you can search for spans whose refs is null, because a root span has no parent. Another way to identify a root span is to check whether trace_id == span_id.
trace_id is stored as a binary blob. What you get from the Python client is the raw array of 16 bytes; cqlsh renders each byte as two hexadecimal digits. To get the hex string you see in cqlsh, convert the entire array to a single hex string. The following Python example does this:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

rows = session.execute("select * from jaeger_v1_test.traces")
trace = rows[0]

# Render every byte of the blob as two hex digits and join them into one string.
hexstr = ''.join('{:02x}'.format(x) for x in trace.trace_id)
print("hex=%s, byte_arr=%s, len(byte_arr)=%d" % (hexstr, trace.trace_id, len(trace.trace_id)))

cluster.shutdown()
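Going the other way is just a matter of turning the hex string back into bytes and binding it as a parameter. A minimal sketch, assuming the jaeger_v1_dc1 keyspace from the question and an open session as in the example above (bytes.fromhex reverses the hex formatting):

# Convert the cqlsh-style hex string back into the 16-byte blob value.
hexstr = '00000000000000003917678c73006f57'
trace_id = bytes.fromhex(hexstr)  # b'\x00' * 8 + b'9\x17g\x8cs\x00oW'

# Bind the bytes to query the table again by partition key.
select = session.prepare("SELECT span_id, refs FROM jaeger_v1_dc1.traces WHERE trace_id = ?")
for row in session.execute(select, (trace_id,)):
    print(row.span_id, row.refs)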


Is it possible for CQL to parse a JSON object to insert data?

From what I have looked at so far, it seems impossible with Cassandra, but I thought I'd give it a shot:
How can I select the value of a JSON property, parsed from a JSON object string, and use it as part of an UPDATE / INSERT statement in Cassandra?
For example, I'm given the JSON object:
{
  "id": 123,
  "some_string": "hello there",
  "mytimestamp": "2019-09-02T22:02:24.355Z"
}
And this is the table definition:
CREATE TABLE IF NOT EXISTS myspace.mytable (
    id text,
    data blob,
    PRIMARY KEY (id)
);
The thing to know at this point is that, for a given reason, the data field will be set to the JSON string. In other words, there is no 1:1 mapping between the given JSON and the table columns; the data column holds the whole JSON object as a blob value.
... Is it possible to parse the timestamp value of the given JSON object as part of an INSERT statement?
Pseudo-code example of what I mean, which obviously doesn't work ($myJson is a placeholder for the JSON object string above):
INSERT INTO myspace.mytable (id, data)
VALUES (123, $myJson)
USING timestamp toTimeStamp($myJson.mytimestamp)
The quick answer is no, it's not possible to do that with CQL.
The norm is to parse the elements of the JSON object within your application to extract the corresponding values to construct the CQL statement.
As a side note, I would discourage using the CQL blob type due to possible performance issues should the blob size exceed 1 MB. If it's JSON, consider storing it as the CQL text type instead. Cheers!
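For illustration, here is a minimal client-side sketch of that approach with the Python driver. It assumes the table from the question, a reachable local node, and a Cassandra version that accepts a bind marker in USING TIMESTAMP (the clause takes microseconds since the epoch):

import json
from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()

my_json = '{"id": 123, "some_string": "hello there", "mytimestamp": "2019-09-02T22:02:24.355Z"}'
doc = json.loads(my_json)

# USING TIMESTAMP takes microseconds since the epoch, so convert the ISO 8601 string.
ts = datetime.strptime(doc["mytimestamp"], "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
write_time_us = int(ts.timestamp() * 1_000_000)

insert = session.prepare(
    "INSERT INTO myspace.mytable (id, data) VALUES (?, ?) USING TIMESTAMP ?")
session.execute(insert, (str(doc["id"]), my_json.encode("utf-8"), write_time_us))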
Worth mentioning: CQL can do a limited amount of JSON parsing on its own, albeit not to the level of detail you're asking about here (e.g. USING TIMESTAMP).
But something like this works:
> CREATE TABLE myjsontable (
... id TEXT,
... some_string TEXT,
... PRIMARY KEY (id));
> INSERT INTO myjsontable JSON '{"id":"123","some_string":"hello there"}';
> SELECT * FROM myjsontable WHERE id='123';
 id  | some_string
-----+-------------
 123 | hello there

(1 rows)
In your case you'd either have to redesign the table or the JSON payload so that they match. But as Erick and Cédrick have mentioned, the USING timestamp part would have to happen client-side.
What you detailed is doable with Cassandra.
Timestamp
To insert a timestamp in a query, it should be formatted as an ISO 8601 string. Sample examples can be found here. In your code you might have to convert the incoming value to the expected type and format.
Blob
A blob is meant to store binary data, so it cannot be dropped into a CQL query ad hoc as a string (you can use the TEXT type instead if you are willing to base64-encode the payload).
When you need to insert binary data you also need to provide the proper type. For instance, if you are working with JavaScript you need to provide a Buffer, as described in the documentation. Then, when you execute your query, you externalize your parameters:
const sampleId = '123';                           // id is text in the table
const sampleData = Buffer.from('hello world', 'utf8');
const sampleTimestampMicros = Date.now() * 1000;  // USING TIMESTAMP expects microseconds

client.execute(
  'INSERT INTO myspace.mytable (id, data) VALUES (?, ?) USING TIMESTAMP ?',
  [sampleId, sampleData, sampleTimestampMicros],
  { prepare: true });

Inserting Timestamp Into Snowflake Using Python 3.8

I have an empty table defined in snowflake as;
CREATE OR REPLACE TABLE db1.schema1.table(
    ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
    PREDICTED_PROBABILITY FLOAT,
    TIME_PREDICTED TIMESTAMP
);
And it creates the correct table, which has been checked using the DESC command in SQL. Then, using the Snowflake Python connector, we are trying to execute the following query:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},{ct});'
ctx.cursor().execute(insert_query)
Just before this query the variables are defined. The main challenge is getting the current timestamp written into Snowflake. Here the value of ct is defined as:
import datetime
ct = datetime.datetime.now()
print(ct)
2021-04-30 21:54:41.676406
But when we try to execute this INSERT query we get the following error message:
ProgrammingError: 001003 (42000): SQL compilation error:
syntax error line 1 at position 157 unexpected '21'.
Can I kindly get some help on how to format the datetime value here? Help is appreciated.
In addition to the answer @Lukasz provided, you could also think about defining current_timestamp() as the default for the TIME_PREDICTED column:
CREATE OR REPLACE TABLE db1.schema1.table(
    ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
    PREDICTED_PROBABILITY FLOAT,
    TIME_PREDICTED TIMESTAMP DEFAULT current_timestamp
);
And then just insert ACCOUNT_ID and PREDICTED_PROBABILITY:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY) VALUES ({accountId}, {risk_score});'
ctx.cursor().execute(insert_query)
It will automatically assign the insert time to TIME_PREDICTED.
Educated guess. When performing insert with:
insert_query = f'INSERT INTO ...(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED)
VALUES ({accountId}, {risk_score},{ct});'
This is string interpolation: ct is provided as the string representation of a datetime, which does not match the TIMESTAMP data type, hence the error.
I would suggest using proper variable binding instead:
# Numeric binding (:1, :2, ...) requires snowflake.connector.paramstyle = 'numeric'.
ctx.cursor().execute(
    "INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES "
    "(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) "
    "VALUES (:1, :2, :3)",
    (accountId,
     risk_score,
     ("TIMESTAMP_LTZ", ct)))
Avoid SQL Injection Attacks
Avoid binding data using Python’s formatting function because you risk SQL injection. For example:
# Binding data (UNSAFE EXAMPLE)
con.cursor().execute(
"INSERT INTO testtable(col1, col2) "
"VALUES({col1}, '{col2}')".format(
col1=789,
col2='test string3')
)
Instead, store the values in variables, check those values (for example, by looking for suspicious semicolons inside strings), and then bind the parameters using qmark or numeric binding style.
You forgot to place quotes before and after {ct}. The code should be:
insert_query = "INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},'{ct}');".format(accountId=accountId,risk_score=risk_score,ct=ct)
ctx.cursor().execute(insert_query)
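As a safer variant of the same idea, here is a small sketch using the connector's default pyformat binding (%s placeholders), which avoids hand-quoting ct and the injection risk mentioned above:

# Let the connector bind the values itself; datetime objects are converted for you.
ctx.cursor().execute(
    "INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES "
    "(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) "
    "VALUES (%s, %s, %s)",
    (accountId, risk_score, ct))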

Reading guarantees for full table scan while updating the table?

Given schema:
CREATE TABLE keyspace.table (
    key text,
    ckey text,
    value text,
    PRIMARY KEY (key, ckey)
)
...and Spark pseudocode:
val sc: SparkContext = ...
val connector: CassandraConnector = ...
sc.cassandraTable("keyspace", "table")
  .mapPartitions { partition =>
    connector.withSessionDo { session =>
      partition.foreach { row =>
        val key = row.getString("key")
        val ckey = Random.nextString(42)
        val value = row.getString("value")
        session.execute(
          s"INSERT INTO keyspace.table (key, ckey, value)" +
          s" VALUES ('$key', '$ckey', '$value')")
      }
    }
  }
Is it possible for code like this to read an inserted value within a single application (Spark job) run? A more general version of my question: can a token range scan CQL query read newly inserted values while iterating over the rows?
Yes, it is possible, exactly as Alex wrote, but I don't think it will happen with the code above.
Per the data model, the table is ordered by ckey in ascending order. The interesting part, however, is the page size and how many pages are prefetched: since the default is 1000 rows (spark.cassandra.input.fetch.sizeInRows), a problem could only occur if you used something much bigger than 42 and/or the executor had not paged yet.
Also, I think you are using unnecessary nesting, so the code to achieve what you want can be simplified (after all, cassandraTable gives you an RDD of rows).
(I hope I understand correctly: you want to read per partition (note that a partition in your case is all rows under one partition key, "key") and, for every row in that partition (distinguished by ckey), generate a new one with a new ckey that simply duplicates the value. The use case for such code is a mystery to me, but I hope it makes sense.)
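To illustrate that simplification, here is a hedged pyspark sketch of the same idea (the Scala DataFrame API would look analogous). It assumes the spark-cassandra-connector is configured, and uses uuid() merely as a stand-in for Random.nextString(42):

# Read the table as a DataFrame, derive a fresh ckey per row, append the copies back.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="keyspace", table="table")
      .load())

copies = df.withColumn("ckey", F.expr("uuid()"))  # new random clustering key

(copies.write.format("org.apache.spark.sql.cassandra")
 .options(keyspace="keyspace", table="table")
 .mode("append")
 .save())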

Inserting a value on a frozen set in cassandra 3

I am currently working on a Cassandra 3 database in which one of its tables has a column that is defined like this:
column_name map<int, frozen <set<int>>>
When I have to change the value of a complete set given a map key x I just have to do this:
UPDATE keyspace.table SET column_name[x] = {1,2,3,4,5} WHERE ...
The thing is that I need to insert a value into a set given a key. I tried this:
UPDATE keyspace.table SET column_name[x] = column_name[x] + {1} WHERE ...
But it returns:
SyntaxException: line 1:41 no viable alternative at input '[' (... SET column_name[x] = [column_name][...)
What am I doing wrong? Does anyone know how to insert data the way I need?
Since the values of the map are frozen, you can't use UPDATE like this.
A frozen value serializes multiple components into a single value. Non-frozen types allow updates to individual fields. Cassandra treats the value of a frozen type as a blob. The entire value must be overwritten.
You have to read the full map, get the value for the key, append the new item, and then reinsert it.
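A minimal read-modify-write sketch with the Python driver, assuming the keyspace.table and column_name names from the question; the partition key column (pk) and its value are placeholders for whatever your WHERE clause elided in the question actually uses:

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()

x = 7                  # the map key from the question
pk_value = 'some-key'  # placeholder partition key value

row = session.execute(
    "SELECT column_name FROM keyspace.table WHERE pk = %s", (pk_value,)).one()

# The frozen set comes back as an immutable value; copy it, add the element, rewrite it.
current = dict(row.column_name or {})
updated = set(current.get(x, ())) | {1}

session.execute(
    "UPDATE keyspace.table SET column_name[%s] = %s WHERE pk = %s",
    (x, updated, pk_value))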

How should I insert decimal values via cqlsh (DecimalSerializer.deserialize error when attempting)?

In cqlsh I'm trying to insert a record with:
INSERT INTO data
(order_id,order_ts,transaction_id,transaction_discount,transaction_qty,
transaction_total,
product_category,product_profit,product_upc,product_name,product_price,
product_distributor,
store_id,store_name,store_state,store_region,id)
values ('YBC8RWE18',1368438171000,'LQKLVVI4E', 0, 1, 140.72,
'Office Supplies', 12.42,
'YT8899H3357', 'Casio USB Model FX-9860GII', 140.72, 'QR0', '2YOYWMR28Q',
'BigLots', 'AZ',
'Southwest', 2259a88e-b62d-4625-a86e-b86d77418a34 );
Looks fine, but I'm getting a number exception:
Caused by: java.lang.NumberFormatException: Zero length BigInteger
at java.math.BigInteger.<init>(BigInteger.java:190)
at org.apache.cassandra.serializers.DecimalSerializer.deserialize(DecimalSerializer.java:41)
at org.apache.cassandra.serializers.DecimalSerializer.deserialize(DecimalSerializer.java:26)
at org.apache.cassandra.db.marshal.AbstractType.compose(AbstractType.java:142)
at org.apache.cassandra.db.marshal.DecimalType.compare(DecimalType.java:46)
Looks like DecimalSerializer.deserialize is the real issue here. If I try to surround the decimals in quotes (worth a shot, I thought) I get:
Bad Request: Invalid STRING constant (140.72) for product_price of type decimal
So that didn't help. What do I need to do to insert a decimal? Should I post up the COLUMNFAMILY def?
Here's the DESCRIBE TABLE:
CREATE TABLE data (
    id uuid,
    order_id text,
    order_ts timestamp,
    product_category text,
    product_distributor text,
    product_name text,
    product_price decimal,
    product_profit decimal,
    product_upc text,
    store_id text,
    store_name text,
    store_region text,
    store_state text,
    transaction_discount decimal,
    transaction_id text,
    transaction_qty int,
    transaction_total decimal,
    PRIMARY KEY (id)
)
If I take the quotes off 140.72 I get Request did not complete within rpc_timeout, and the logs show the deserialize error. If I try to insert just a few columns it's fine -- until I try to insert the product_price field.
Haven't got any answers, but did get some help.
The answer to this one isn't very satisfying. None of the errors really clued me in to what was going on, but I thought: well, there are really only a few moving parts here, the table definition and the indexes. So I dropped the table and recreated it without indexes, and that fixed the issue. I recreated the indexes and the issue stayed fixed. So perhaps the indexes were messed up?
I got the same exception even without an index on the table. If you look at the stack trace, in the BigInteger constructor public BigInteger(byte[] val), the exception is thrown when the passed-in byte array is empty. Maybe there is a bug in the Cassandra driver when deserializing decimals? However, the following comment in the JDK source is interesting:
Translates a byte array containing the two's-complement binary representation of a BigInteger into a BigInteger. The input array is assumed to be in big-endian byte-order: the most significant byte is in the zeroth element.
Will the Cassandra driver guarantee that the passed byte array is big-endian? A workaround could be to use another type, such as double, instead.
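If the rows are loaded programmatically rather than via cqlsh, one hedged workaround sketch is to bind a decimal.Decimal through the Python driver with a prepared statement and let the driver handle the serialization (the keyspace name below is assumed):

import uuid
from decimal import Decimal
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')  # keyspace name assumed

# Bind the decimal as a Decimal object instead of writing it as a literal in cqlsh.
insert = session.prepare("INSERT INTO data (id, product_price) VALUES (?, ?)")
session.execute(insert, (uuid.UUID('2259a88e-b62d-4625-a86e-b86d77418a34'),
                         Decimal('140.72')))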
