How to indicate a database transaction commit in a sequence diagram

I am not quite sure this question is suited for SF; if not, I'm sorry. My question is: how can I draw a database transaction commit in a sequence diagram?
Regards

You should have an object representing the transaction or the database.
You can then show the commit as a message labeled "commit" from your business object to the transaction/database object.
For example (can't post image because of my reputation):
+-----------------+
| Business Object |
+-----------------+
         |
         |  start transaction     +----------------------+
         +----------------------->| Database Transaction |
         |                        +----------------------+
         |                                    |
         |  do lots of things                 |
         +----------------------------------->|
         |                                    |
         |  commit                            |
         +----------------------------------->|

Related

makeset operation does not preserve ordering?

The following command does not produce a consistent ordering of items:
KubePodInventory
| where ClusterName == "mycluster"
| distinct Computer
| order by Computer asc
| summarize makeset(Computer)
But the documentation (see here) states the following:
Like makelist, makeset also works with ordered data and will generate
the arrays based on the order of the rows that are passed into it.
Is this a bug or am I doing something wrong?
As per the issue @MohitVerma mentioned, makeset() is not meant to support ordering, and they are planning to correct the doc, which currently reads: "Like makelist, makeset also works with ordered data and will generate the arrays based on the order of the rows that are passed into it."
You can use makelist() as a workaround; it does support ordering, as per my testing.
Please check this answer for a similar type of operation.
How to order items in makeset?
The code below worked for me:
requests | summarize makeset(client_City) by client_City | distinct client_City | order by client_City asc
You can follow this thread for the code snippet; I marked the answer there to close this thread.
https://github.com/MicrosoftDocs/azure-docs/issues/24135#issuecomment-460185491
requests | summarize makeset(client_City) by client_City | distinct client_City | order by client_City asc | summarize makelist(client_City)

Retain the last row for a given key in Spark Structured Streaming

Similar to Kafka's log compaction, there are quite a few use cases where it is required to keep only the last update on a given key and use the result, for example, for joining data.
How can this be achieved in Spark Structured Streaming (preferably using PySpark)?
For example, suppose I have the table
key | time | value
----------------------------
A   | 1    | foo
B   | 2    | foobar
A   | 2    | bar
A   | 15   | foobeedoo
Now I would like to retain the last value for each key as state (with watermarking), i.e. to have access to the dataframe
key | time | value
----------------------------
B   | 2    | foobar
A   | 15   | foobeedoo
that I might like to join against another stream.
Preferably this should be done without wasting the one supported aggregation step; I suppose I would need something like a dropDuplicates() function with reverse order.
Please note that this question is explicitly about Structured Streaming and about how to solve the problem without constructs that waste the aggregation step (hence, anything using window functions or a max aggregation is not a good answer). (In case you do not know: chaining aggregations is currently unsupported in Structured Streaming.)
Use flatMapGroupsWithState or mapGroupsWithState: group by key, sort the values by time inside the flatMapGroupsWithState function, and store the last row in the GroupState.
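A minimal sketch of that approach in Scala, assuming a socket source parsed into an Event case class (the source, field names, and parsing are illustrative assumptions, not part of the question):

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(key: String, time: Long, value: String)

// Keep only the most recent Event per key in the group state and emit it downstream.
def keepLatest(key: String, rows: Iterator[Event], state: GroupState[Event]): Iterator[Event] = {
  val candidates = rows.toSeq ++ state.getOption   // rows from this micro-batch plus the stored row
  val newest     = candidates.maxBy(_.time)        // "sort by time, keep the last row"
  state.update(newest)                             // remember it for the next micro-batch
  Iterator(newest)
}

val spark = SparkSession.builder.appName("latest-per-key").getOrCreate()
import spark.implicits._

// Illustrative source: lines such as "A|15|foobeedoo" arriving on a socket.
val events: Dataset[Event] = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999).load()
  .as[String]
  .map { line =>
    val Array(k, t, v) = line.split("\\|").map(_.trim)
    Event(k, t.toLong, v)
  }

val latestPerKey = events
  .groupByKey(_.key)
  .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout())(keepLatest)

latestPerKey.writeStream.outputMode("update").format("console").start().awaitTermination()

The watermark-based state eviction mentioned in the question is not wired up here; with this API it would be handled through GroupStateTimeout (for example event-time timeouts together with withWatermark on the input) rather than through an aggregation.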

Cassandra isolation model

I have a use case for Cassandra where I need to store multiple rows of data belonging to different customers. I'm new to Cassandra, and I need to provide a permissions model where only one customer is accessible at a time from a base permissions role, but all customers are accessible from a 'supervisor' role. Essentially, whenever a query is made, one customer cannot see another customer's data, except when the query is made by a supervisor. We have to enforce security as a design approach.
The data could look like this:
-----------------------------------------
| id | customer name | data column1... |
-----------------------------------------
| 0  | customer1     | 3               |
-----------------------------------------
| 1  | customer2     | 23              |
-----------------------------------------
| 2  | customer3     | 33              |
-----------------------------------------
| 3  | customer3     | 32              |
-----------------------------------------
Is something like this easily doable with Cassandra?
The way you have modeled this is a perfectly good way to do multi-tenancy. This is how UserGrid models multiple tenants, and it is used in several large-scale applications.
A couple of drawbacks to be up-front about:
Doesn't help with a "noisy neighbor" problem and unequal tenants
Application code has to manage the tenant security (see the sketch below)
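To illustrate the second point, here is a minimal sketch of application-level tenant enforcement in Scala, assuming the DataStax Java driver 4.x and a hypothetical customer_data table whose partition key is customer_name (the keyspace, table, and role names are all assumptions):

import com.datastax.oss.driver.api.core.CqlSession
import com.datastax.oss.driver.api.core.cql.ResultSet

sealed trait Role
case object BaseRole extends Role        // may only read its own customer's rows
case object SupervisorRole extends Role  // may read everything

val session: CqlSession = CqlSession.builder().withKeyspace("tenants").build()

// Base users are always forced through the per-customer prepared statement;
// only supervisors may run the unrestricted query.
val byCustomer = session.prepare("SELECT * FROM customer_data WHERE customer_name = ?")

def fetch(role: Role, requestedCustomer: String, callerCustomer: String): ResultSet =
  role match {
    case SupervisorRole =>
      session.execute("SELECT * FROM customer_data")        // full scan, acceptable for an admin view
    case BaseRole if requestedCustomer == callerCustomer =>
      session.execute(byCustomer.bind(requestedCustomer))   // limited to the caller's own tenant
    case BaseRole =>
      throw new SecurityException(s"$callerCustomer may not read data for $requestedCustomer")
  }

Making customer name (rather than the surrogate id) the partition key is what lets the restricted query be served efficiently; Cassandra's built-in GRANT/role permissions work at keyspace and table granularity, so per-customer row filtering like this has to live in the application, as the answer notes.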

Find similar items in a dataset

I have a dataset of 500 mobile devices with 10 attributes, namely
Date|Company|ModelName|Price|HardDisk|RAM|Colour|Display size|Cam1|Cam2
A sample of the dataset is given below:
24/10/2015 | walmart | Samsung Galaxy Note 4 N910H 32GB Unlocked GSM OctaCore Cell Phone-N910H 32GB GOLD | 599.99 | 32 | N/A | cell gold | N/A | 10.2 | 16
25/10/2015 | walmart | Samsung Galaxy Note 5 SM-N920i Gold International Model Unlocked GSM Mobile Phone | 717.95 | 32 | N/A | gold | N/A | 5.7 | 16
26/10/2015 | amazon | T-Mobile AllShare Cast Wireless Hub | 65.15 | N/A | N/A | streaming | N/A | N/A | N/A
I have to find the most similar or unique devices, or remove duplicate mobile devices from the dataset, taking into account the various attributes of the devices.
I have explored many similarity algorithms like Jaccard similarity, cosine similarity, and Levenshtein distance, but they seem to work on attributes with the same datatype.
Please suggest an algorithm or approach that could work on this kind of mixed-datatype dataset, taking into account almost all attributes.
You can compute the hash code of each row.
Then use the difference of the hash codes as a similarity measure.
Obviously, this depends on all the attributes.
It is very good for finding duplicates!
It may not be good for your application, but you did not specify what is good for your application.
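A minimal sketch of the duplicate-finding part of that idea in Scala (the attribute names come from the question; which fields to hash and how to normalize them are assumptions):

import scala.util.hashing.MurmurHash3

case class Device(date: String, company: String, model: String, price: String,
                  hardDisk: String, ram: String, colour: String, display: String,
                  cam1: String, cam2: String)

// Hash the attributes that identify a device (ignoring date and seller here, an assumption),
// after normalizing case and whitespace so trivial formatting differences do not matter.
def rowHash(d: Device): Int = {
  val normalized = Seq(d.model, d.price, d.hardDisk, d.ram,
                       d.colour, d.display, d.cam1, d.cam2).map(_.trim.toLowerCase)
  MurmurHash3.orderedHash(normalized)
}

// Devices that share a hash are duplicate candidates.
def duplicateGroups(devices: Seq[Device]): Map[Int, Seq[Device]] =
  devices.groupBy(rowHash).filter { case (_, group) => group.size > 1 }

Rows that land in the same group should still be compared field by field, since unrelated rows can collide on the same hash value; and as the answer says, this finds exact duplicates rather than ranking merely similar devices.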

Read time increases by 1ms for every 100 rows

I have been stuck on this issue for almost a week now, and I would like your suggestions and help. I have been getting read latency problems even for a simple table. I created a simple table with 4k rows: reading 500 rows takes about 5ms, 1000 rows takes ~10ms, and all 4k rows takes around 50ms. I checked stats, network, iostat, tpstats, and heap, but couldn't get a clue as to what the issue is. Could anyone help me with what more I need to do to resolve this high-priority issue assigned to me? Thank you very much in advance.
Tracing session: b4287090-0ea5-11e5-a9f9-bbcaf44e5ebc
activity | timestamp | source | source_elapsed
-----------------------------------------------------------------------------------------------------------------------------+----------------------------+---------------+----------------
Execute CQL3 query | 2015-06-09 07:47:35.961000 | 10.65.133.202 | 0
Parsing select * from location_eligibility_by_type12; [SharedPool-Worker-1] | 2015-06-09 07:47:35.961000 | 10.65.133.202 | 33
Preparing statement [SharedPool-Worker-1] | 2015-06-09 07:47:35.962000 | 10.65.133.202 | 62
Computing ranges to query [SharedPool-Worker-1] | 2015-06-09 07:47:35.962000 | 10.65.133.202 | 101
Submitting range requests on 1537 ranges with a concurrency of 1537 (1235.85 rows per range expected) [SharedPool-Worker-1] | 2015-06-09 07:47:35.962000 | 10.65.133.202 | 314
Submitted 1 concurrent range requests covering 1537 ranges [SharedPool-Worker-1] | 2015-06-09 07:47:35.968000 | 10.65.133.202 | 6960
Executing seq scan across 1 sstables for [min(-9223372036854775808), min(-9223372036854775808)] [SharedPool-Worker-2] | 2015-06-09 07:47:35.968000 | 10.65.133.202 | 7033
Read 4007 live and 0 tombstoned cells [SharedPool-Worker-2] | 2015-06-09 07:47:36.045000 | 10.65.133.202 | 84055
Scanned 1 rows and matched 1 [SharedPool-Worker-2] | 2015-06-09 07:47:36.046000 | 10.65.133.202 | 84109
Request complete | 2015-06-09 07:47:36.052498 | 10.65.133.202 | 91498
Selecting lots of rows in Cassandra often takes unpredictably long, since the query is routed to more machines.
It's best to avoid such schemas if you need high read performance. A better approach is to store the data in a single row and spread the load between nodes by using a higher replication factor. Wide rows are generally preferable: http://www.slideshare.net/planetcassandra/cassandra-summit-2014-39677149
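To make that concrete, here is a hedged sketch in Scala (assuming the DataStax Java driver 4.x; the keyspace, table, and column names are made up for illustration) of a table keyed so that a read is served from a single partition instead of the cluster-wide range scan visible in the trace above:

import com.datastax.oss.driver.api.core.CqlSession

val session = CqlSession.builder().withKeyspace("demo").build()

// All rows that are read together share one partition key, so the SELECT below
// touches only the replicas owning that partition instead of fanning out over
// hundreds of token ranges the way the traced "select *" does.
session.execute(
  """CREATE TABLE IF NOT EXISTS eligibility_by_type (
    |  type    text,
    |  id      uuid,
    |  payload text,
    |  PRIMARY KEY (type, id)
    |)""".stripMargin)

val byType = session.prepare("SELECT * FROM eligibility_by_type WHERE type = ?")
val rows   = session.execute(byType.bind("type12"))   // single-partition read

Very large partitions bring their own problems, so treat this as the direction the answer points in rather than a universal rule; the replication-factor point is a property of the keyspace definition rather than of any individual query.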
