Problem description:
When data is received into source_table, an update policy in Kusto runs to store the data in end_point_table.
The update function should handle duplicate data and store only new data in end_point_table. That means if the data we get from source_table is the same as what we already have in end_point_table, nothing should be stored.
What I did:
end_point_table already contains data:
.ingest inline into table end_point_table <|
1,2020-01-01T12:00:00Z,property,128
I have a source table called source_table, into which I ingest data like the following:
.ingest inline into table source_table <|
1,2020-01-01T12:00:00Z,128
.ingest inline into table source_table <|
1,2020-01-01T12:00:00Z,property,128
The following function is triggered automatically by the update policy:
let _incoming =(
source_table
| where property == "property"
| project device_id, timestamp, value
| distinct *
);
let _old_data = (
end_point_table
);
_incoming
| join kind = leftouter(
_old_data
| summarize arg_max(timestamp, *) by device_id
) on device_id
| where (
timestamp != timestamp1
or value != value1
)
| project device_id, timestamp, value
Result:
When I query the data after the ingestion, I get three rows instead of one, like this:
1,2020-01-01T12:00:00Z,property,128
1,2020-01-01T12:00:00Z,property,128
1,2020-01-01T12:00:00Z,property,128
The question is:
Is there any solution to avoid ingesting duplicated data into end_point_table, or did I use the update policy wrongly?
Update policies are not the right approach for this.
There are multiple correct approaches to handling deduplication; please read about them here:
https://learn.microsoft.com/en-us/azure/data-explorer/dealing-with-duplicates
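For example, one of the approaches described there is to let duplicates land in end_point_table and expose a deduplicated view through a materialized view. A minimal sketch (the view name is arbitrary, and it assumes a record is identified by device_id, timestamp and value):

.create materialized-view end_point_table_dedup on table end_point_table
{
    end_point_table
    | summarize take_any(*) by device_id, timestamp, value
}

Queries then target end_point_table_dedup instead of end_point_table, and duplicates in the underlying table no longer matter.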
I have a CSV with some columns, one of them with JSON content. Can I query this JSON column with special handling?
My goal is to run a query (openrowset documentation) and get output similar to this.
 ID | Name
----+--------------------
  0 | Valarie Strickland
  1 | Mathews Harrison
  2 | Cecilia Giles
I tried to reproduce the same in my environment.
My sample data:
To convert the column with nested JSON into table form, I first created a variable of type nvarchar(max) and set the SELECT query's result as its value.
DECLARE @json nvarchar(max);

SET @json = (SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://dlsg2p.dfs.core.windows.net/fsn2p/jsoncolumn.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIRSTROW = 3
    ) AS [result]);
The following statement checks that the value was assigned to the variable properly:
SELECT @json AS JSON
Using CROSS APPLY to convert the JSON to table format:
SELECT b.id as ID ,b.name as Name
FROM
OPENJSON(@json)
WITH
(
friends NVARCHAR(MAX) AS JSON
) AS a
CROSS APPLY
OPENJSON(a.friends)
WITH
(
id INT,
name VARCHAR(MAX)
) AS b;
Executing this returns the expected ID and Name rows.
I was trying to run a CQL query that takes user input (via dynamic forms) in the Zeppelin tool:
%cassandra
SELECT ${Select Fields Type=uuid ,uuid | created_by | email_verify| username} FROM
${Select Table=keyspace.table_name}
${WHERE email_verify="true" } ${ORDER BY='updated_date' }LIMIT ${limit = 10};
While running this query I was getting this error:
line 4:0 mismatched input 'true' expecting EOF
(SELECT uuid FROM keyspace.table_name ["true"]...)
You need to move WHERE and ORDER BY out of the dynamic form declaration.
The input field declaration looks like this: ${field_name=default_value}. In your case, instead of WHERE ..., you've got a field named WHERE email_verify.
It should be as follows (not tested):
%cassandra
SELECT ${Select Fields Type=uuid ,uuid | created_by | email_verify| username} FROM
${Select Table=keyspace.table_name}
WHERE ${where_cond=email_verify='true'} ORDER BY ${order_by='updated_date'} LIMIT ${limit = 10};
Update:
Here is a working example for a table with the following structure:
CREATE TABLE test.scala_test2 (
id int,
c int,
t text,
tm timestamp,
PRIMARY KEY (id, c)
) WITH CLUSTERING ORDER BY (c ASC)
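Applied to that table, a dynamic-form query following the same pattern might look like the following (the field names and defaults are illustrative assumptions, not from the original example):

%cassandra
SELECT ${fields=id,id|c|t|tm} FROM test.scala_test2
WHERE ${where_cond=id = 1} ORDER BY ${order_by=c} LIMIT ${limit=10};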
I wonder if Spark SQL supports caching the result of the query defined in a WITH clause.
The Spark SQL query is something like this:
with base_view as
(
select some_columns from some_table
WHERE
expensive_udf(some_column) = true
)
... multiple query join based on this view
While this query works with Spark SQL, I noticed that the UDF was applied to the same data set multiple times.
In this use case, the UDF is very expensive. So I'd like to cache the query result of base_view so the subsequent queries would benefit from the cached result.
P.S. I know you can create and cache a table with the given query and then reference it in the subqueries. In this specific case, though, I can't create any tables or views.
That is not possible. The WITH result cannot be persisted after execution or substituted into a new Spark SQL invocation.
The WITH clause allows you to give a name to a temporary result set so it can be reused several times within a single query. I believe what he's asking for is a materialized view.
This can be done by executing several SQL queries.
// first, cache the result of the expensive subquery
spark.sql("""
  CACHE TABLE base_view AS
  SELECT some_columns
  FROM some_table
  WHERE expensive_udf(some_column) = true
""")

// then use it
spark.sql("""
  ... multiple query join based on this view
""")
Not sure if you are still interested in the solution, but the following is a workaround to accomplish the same:
spark.sql("""
| create temp view my_view
| as
| WITH base_view as
| (
| select some_columns
| from some_table
| WHERE
| expensive_udf(some_column) = true
| )
| SELECT *
| from base_view
""");
spark.sql("""CACHE TABLE my_view""");
Now you can use the my_view temp view to join to other tables, as shown below:
spark.sql("""
| select mv.col1, t2.col2, t3.col3
| from my_view mv
| join tab2 t2
| on mv.col2 = t2.col2
| join tab3 t3
| on mv.col3 = t3.col3
""");
Remember to uncache the view after use:
spark.sql("""UNCACHE TABLE my_view""")
Hope this helps.
I just created a new column for my table
alter table user add (questions set<timeuuid>);
Now the table looks like
user (
google_id text PRIMARY KEY,
date_of_birth timestamp,
display_name text,
joined timestamp,
last_seen timestamp,
points int,
questions set<timeuuid>
)
Then I tried to update all those null values to empty sets, by doing
update user set questions = {} where google_id = ?;
for each google id.
However they are still null.
How can I fill that column with empty sets?
A set, list, or map needs to have at least one element, because an empty set, list, or map is stored as a null set. (source)
Also, this might be helpful if you're using a client (java for instance).
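For instance, with the DataStax Java driver the distinction mostly disappears on the read side: a null collection column is handed back as an empty set rather than null. A minimal Scala sketch (the contact point, keyspace and id below are placeholders, not from the original post):

import com.datastax.driver.core.Cluster

// connect to the cluster and keyspace (placeholders)
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

val row = session.execute(
  "SELECT questions FROM user WHERE google_id = 'some-google-id'").one()

// a null set column comes back as an empty java.util.Set, never null
val questions = row.getSet("questions", classOf[java.util.UUID])
println(questions.isEmpty) // true while the column still shows as null in cqlsh

cluster.close()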
I've learnt that there's not really such a thing as an empty set, or list, etc.
These display as null in cqlsh.
However, you can still add elements to them, e.g.
> select * from id_set;

 set_id                | set_content
-----------------------+---------------------------------
 104649882895086167215 |                            null
 105781005288147046623 |                            null

> update id_set set set_content = set_content + {'apple','orange'} where set_id = '105781005288147046623';
> select * from id_set;

 set_id                | set_content
-----------------------+---------------------------------
 104649882895086167215 |                            null
 105781005288147046623 |            {'apple', 'orange'}
So even though it displays as null you can think of it as already containing the empty set.
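The asymmetry also works in the other direction. As a quick sketch on the same id_set table, removing every element again leaves the column displaying as null, because no empty set is ever stored:

> update id_set set set_content = set_content - {'apple','orange'} where set_id = '105781005288147046623';

After this, set_content for that row shows up as null again in cqlsh.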
I've created a table in the CQL3 console (no single primary key component is unique on its own; together they are):
CREATE TABLE aggregate_logs (
bpid varchar,
jid int,
month int,
year int,
value counter,
PRIMARY KEY (bpid, jid, month, year));
and have then been able to update and query it by using:
UPDATE aggregate_logs SET value = value + 1 WHERE bpid='1' and jid=1 and month=1 and year=2000;
This works as expected. I wanted to do the same update in Hector (in Scala):
val aggregateMutator:Mutator[Composite] = HFactory.createMutator(keyspace, compositeSerializer)
val compKey = new Composite()
compKey.addComponent(bpid, stringSerializer)
compKey.addComponent(new Integer(jid), intSerializer)
compKey.addComponent(new Integer(month), intSerializer)
compKey.addComponent(new Integer(year), intSerializer)
aggregateMutator.incrementCounter(compKey, LogsAggregateFamily, "value", 1)
but I get an error with the message:
...HInvalidRequestException: InvalidRequestException(why:String didn't validate.)
Running the query directly from Hector with:
val query = new me.prettyprint.cassandra.model.CqlQuery(keyspace, compositeSerializer, stringSerializer, new IntegerSerializer())
query.setQuery("UPDATE aggregate_logs SET value = value + 1 WHERE 'bpid'=1 and jid=1 and month=1 and year=2000")
query.execute()
which gives me the error:
InvalidRequestException(why:line 1:59 mismatched input 'and' expecting EOF)
I've not seen any other examples which use a counter under a composite primary key. Is it even possible?
It's definitely possible using CQL directly (both via cqlsh and C++, at least):
cqlsh:goh_master> describe table daily_caps;
CREATE TABLE daily_caps
( caps_type ascii, id ascii, value counter, PRIMARY KEY
(caps_type, id) ) WITH COMPACT STORAGE AND comment='' AND
caching='KEYS_ONLY' AND read_repair_chance=0.100000 AND
gc_grace_seconds=864000 AND replicate_on_write='true' AND
compaction_strategy_class='SizeTieredCompactionStrategy' AND
compression_parameters:sstable_compression='SnappyCompressor';
cqlsh:goh_master> update daily_caps set value=value +1 where caps_type='xp' and id ='myid';
cqlsh:goh_master> select * from daily_caps;
 caps_type | id   | value
-----------+------+-------
 xp        | myid |     1
CQL3 and the Thrift API are not compatible, so creating a column family with CQL3 and accessing it with Hector or another Thrift-based client will not work. For more information see:
https://issues.apache.org/jira/browse/CASSANDRA-4377
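If you want to drive this from Scala, use a CQL-native client instead of Hector. A minimal sketch with the DataStax Java driver (the contact point and keyspace are placeholders; this is an illustration, not part of the original answer):

import com.datastax.driver.core.Cluster

// connect over the native CQL protocol rather than Thrift
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

// the same counter update that works in cqlsh
session.execute(
  "UPDATE aggregate_logs SET value = value + 1 " +
  "WHERE bpid = '1' AND jid = 1 AND month = 1 AND year = 2000")

cluster.close()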