Interpretation of Transaction Confirmation in Hyperledger Fabric

In the Fabric docs, the following example is explained.
World state: (k1,1,v1), (k2,1,v2), (k3,1,v3), (k4,1,v4), (k5,1,v5)
T1 -> Write(k1, v1'), Write(k2, v2')
T2 -> Read(k1), Write(k3, v3')
T3 -> Write(k2, v2'')
T4 -> Write(k2, v2'''), Read(k2)
T5 -> Write(k6, v6'), Read(k5)
T1 passes validation because it does not perform any read. Further, the tuples of
keys k1 and k2 in the world state are updated to (k1,2,v1'), (k2,2,v2').
T2 fails validation because it reads a key, k1, which was modified by a
preceding transaction, T1.
T3 passes validation because it does not perform a read. Further, the tuple
of the key k2 in the world state is updated to (k2,3,v2'').
T4 fails validation because it reads a key, k2, which was modified by a
preceding transaction, T1.
T5 passes validation because it reads a key, k5, which was not modified by any
of the preceding transactions.
It is mentioned that all transactions, viz. T1 to T5, are validated against the same snapshot of the
world state DB. I am aware that Fabric follows execute-order-validate phases for transaction confirmation.
But I could not understand how T3 passes validation even though the value of k2 was
modified by T1. By the time T3 is validated, the read set acquired by T3 during its execution phase is no longer valid, as the value has been modified by T1. Can anyone explain whether the example is wrong or I am understanding it incorrectly?

There is no read set associated with T3 (its read set is empty). T3 only performs a write to a given ledger key, so there is no inconsistency.
A concrete example of what T3 might be doing is:
Creating a new asset with a given ID (key) without first checking whether the asset already exists; or
Setting the balance of an account to a specific value without checking the current balance.
If T3 updated the key's value (by first reading the key and then writing an updated value), it would fail validation with an MVCC read conflict, which is the case with T4.
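To make the contrast concrete, here is a minimal sketch using the Fabric Java chaincode shim (the method names and the account/balance scenario are illustrative, not from the example):

import org.hyperledger.fabric.shim.ChaincodeStub;

public class BalanceContract {

    // T3-style blind write: nothing is read, so the read set is empty and
    // the transaction cannot fail validation with an MVCC read conflict.
    public void setBalance(ChaincodeStub stub, String account, long value) {
        stub.putStringState(account, Long.toString(value)); // write set: {account}
    }

    // T4-style read-modify-write: getStringState records the key and its
    // committed version in the read set, so if a preceding transaction in
    // the same block already updated the key, this one is invalidated.
    public void addToBalance(ChaincodeStub stub, String account, long delta) {
        long current = Long.parseLong(stub.getStringState(account)); // read set: {account}
        stub.putStringState(account, Long.toString(current + delta));
    }
}

A setBalance ordered after another update to the same key commits fine, like T3; an addToBalance in the same position fails validation, like T4.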

Related

How does Spark Structured Streaming determine an event has arrived late?

I read through the Spark Structured Streaming documentation and I wonder: how does Spark Structured Streaming determine that an event has arrived late? Does it compare the event time with the processing time?
Taking the above picture as an example: does the bold right-arrow line labeled "Time" represent processing time? If so,
1) where does this processing time come from? Since it's streaming, is it assumed that an upstream source provides a processing timestamp, or does Spark add a processing-timestamp field? For example, when reading messages from Kafka we do something like
Dataset<Row> kafkadf = spark.readStream().format("kafka").load();
This dataframe has a timestamp column by default, which I am assuming is the processing time. Correct? If so, does Kafka or Spark add this timestamp?
2) I can see there is a time comparison between the bold right-arrow line and the time in the message. Is that how Spark determines an event is late?
The processing time of an individual job (one RDD of the DStream) is what dictates the processing time in general. It is not when the actual processing of that RDD happens, but when the RDD's job was allocated to be processed.
To understand clearly what the above statement means, create a Spark Streaming application where the batch interval is 60 seconds and make sure each batch takes 2 minutes. Eventually you will see that a job is allocated to be processed at a certain time but has not been picked up, because the previous job has not finished.
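A minimal sketch of that experiment, assuming the classic DStream API this answer refers to (the socket host/port are placeholders; any slow source works):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SchedulingDelayDemo {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("SchedulingDelayDemo").setMaster("local[2]");
        // 60-second batch interval
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));
        jssc.socketTextStream("localhost", 9999) // placeholder source
            .foreachRDD(rdd -> {
                Thread.sleep(120_000); // simulate a batch that takes ~2 minutes
                System.out.println("processed " + rdd.count() + " records");
            });
        jssc.start();
        jssc.awaitTermination(); // the streaming UI shows jobs queuing up with growing scheduling delay
    }
}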
Next:
Out-of-order data can be dealt with in two different ways.
Create a high-water mark.
It is explained on the same Spark user guide page you got your picture from.
It is easy to understand where we have a key-value pair in which the key is the timestamp. Setting .withWatermark("timestamp", "10 minutes") we are essentially saying: if I have received a message for 10 AM, then I will allow messages a little older than that (up to 9:50 AM). Any message older than that gets dropped.
The other way out-of-order data can be handled is within the mapGroupsWithState or mapWithState functions; see the sketch after this answer.
Here you can decide what to do when you get a bunch of values for a key: drop anything before time X, or go even fancier than that (e.g. if it is data from A then allow a 20-minute delay, for the rest allow a 30-minute delay, etc.).
This blog post from Databricks explains it pretty clearly:
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
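As a rough sketch of that second approach (the Event bean, its fields, and the per-key allowances are hypothetical; this assumes a typed streaming Dataset<Event> obtained elsewhere):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.KeyValueGroupedDataset;

public class LateDataHandler {

    // Hypothetical event type: a bean with a key and an event time in epoch millis.
    public static class Event implements java.io.Serializable {
        private String key;
        private long eventTimeMs;
        public String getKey() { return key; }
        public void setKey(String k) { key = k; }
        public long getEventTimeMs() { return eventTimeMs; }
        public void setEventTimeMs(long t) { eventTimeMs = t; }
    }

    // Track the max event time seen per key in state; count only events that
    // are not older than a per-key allowance (20 min for key "A", 30 min otherwise).
    public static Dataset<Long> countNonLate(Dataset<Event> events) {
        KeyValueGroupedDataset<String, Event> byKey =
                events.groupByKey((MapFunction<Event, String>) Event::getKey, Encoders.STRING());
        return byKey.mapGroupsWithState(
                (MapGroupsWithStateFunction<String, Event, Long, Long>) (key, values, state) -> {
                    long maxSeen = state.exists() ? state.get() : Long.MIN_VALUE;
                    long allowedLagMs = key.equals("A") ? 20 * 60_000L : 30 * 60_000L;
                    long kept = 0L;
                    while (values.hasNext()) {
                        Event e = values.next();
                        maxSeen = Math.max(maxSeen, e.getEventTimeMs());
                        if (e.getEventTimeMs() >= maxSeen - allowedLagMs) kept++; // too-late events are dropped
                    }
                    state.update(maxSeen);
                    return kept;
                },
                Encoders.LONG(), Encoders.LONG());
    }
}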
Basically it comes down to the watermark (late data threshold) and the order in which data records arrive. Processing time does not play into it at all. You set a watermark on the column representing your event time. If a record R1 with event time T1 arrives after a record R2 with event time T2 has already been seen, and T2 > T1 + Threshold, then R1 will be discarded.
For example, suppose T1 = 09:00, T2 = 10:00, Threshold = 61 min. If a record R1 with event time T1 arrives before a record R2 with event time T2 then R1 is included in calculations. If R1 arrives after R2 then R1 is still included in calculations because T2 < T1 + Threshold.
Now suppose Threshold = 59 min. If R1 arrives before R2 then R1 is included in calculations. If R1 arrives after R2, then R1 is discarded because we've already seen a record with event time T2 and T2 > T1 + Threshold.
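Putting both answers together, a minimal sketch of a watermarked aggregation over the Kafka source (broker address, topic, and window size are placeholders):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WatermarkDemo {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("WatermarkDemo").getOrCreate();

        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "host:9092") // placeholder broker
                .option("subscribe", "events")                  // placeholder topic
                .load();

        // "timestamp" is the Kafka-provided per-record timestamp column.
        // With a 10-minute watermark, a record whose event time trails the
        // max event time seen so far by more than 10 minutes is dropped,
        // exactly as in the R1/R2 walk-through above.
        Dataset<Row> counts = events
                .withWatermark("timestamp", "10 minutes")
                .groupBy(window(col("timestamp"), "5 minutes"), col("key"))
                .count();

        counts.writeStream().outputMode("update").format("console").start().awaitTermination();
    }
}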

Are the same IDs always given out across the same logic plan?

Below you see a simplified version of what I'm trying to do. I load a dataframe from 150 parquet files (>10 TB) stored in S3, then I give this dataframe an id column with func.monotonically_increasing_id(). Afterwards I save a couple of derivatives of this dataframe. The functions I apply are a little more complicated than what I present here, but I hope this gets the point across.
DF_loaded = spark.read.parquet('/some/path/*/')
DF_with_IDs = DF_loaded.withColumn('id',func.monotonically_increasing_id())
#creating parquet_1
DF_with_IDs.where(col('a').isNotNull()).write.parquet('/path/parquet_1/')
#creating parquet_2
DF_with_IDs.where(col('b').isNotNull()).write.parquet('/path/parquet_2/')
Now I noticed that, after creating parquet_1, Spark loads all the data from S3 again to create parquet_2. I'm worried that the IDs given in parquet_1 do not match those in parquet_2, i.e. that the same row has different IDs in the two parquets, because as far as I understand it the logical plan Spark comes up with looks like this:
#parquet_1
load_data -> give_ids -> make_selection -> write_parquet
#parquet_2
load_data -> give_ids -> make_selection -> write_parquet
So are the same IDs given to the same rows in both parquets?
As long as:
You use a recent version of Spark (SPARK-13473, SPARK-14241).
There is no configuration change between actions (changes in configuration can affect the number of partitions and, as a result, the ids).
monotonically_increasing_id should be stable. Note that this disables predicate pushdown.
rdd.zipWithIndex.toDF should be stable independent of configuration, so it might be preferable.
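A different way to sidestep the recomputation concern entirely is to materialize the ids once before both writes. A sketch in the Java API (mirroring the PySpark snippet in the question; paths and column names are the question's, and DISK_ONLY is assumed to be feasible for data this size):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.monotonically_increasing_id;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class StableIds {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("StableIds").getOrCreate();

        // Persisting after assigning ids means they are computed once and
        // both writes reuse the same values (as long as nothing is evicted).
        Dataset<Row> withIds = spark.read().parquet("/some/path/")
                .withColumn("id", monotonically_increasing_id())
                .persist(StorageLevel.DISK_ONLY());

        withIds.where(col("a").isNotNull()).write().parquet("/path/parquet_1/");
        withIds.where(col("b").isNotNull()).write().parquet("/path/parquet_2/");
        withIds.unpersist();
    }
}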

Internal working of Spark - Communication/Synchronization

I am quite new to Spark but already have programming experience in the BSP model. In the BSP model (e.g. Apache Hama), we have to handle all the communication and synchronization between nodes ourselves, which is good on one side because we have finer control over what we want to achieve, but on the other hand it adds more complexity.
Spark, on the other hand, takes all the control and handles everything on its own (which is great), but I don't understand how it works internally, especially in cases where we have a lot of data and message passing between nodes. Let me give an example:
zb = sc.broadcast(z)                                     // broadcast the current consensus value z to all workers
r_i = x_i.map(x => Math.pow(norm(x - zb.value), 2))      // squared distance of each local x to z (residual terms)
r_i.checkpoint()
u_i = u_i.zip(x_i).map(ux => ux._1 + ux._2 - zb.value)   // dual update: u_i + x_i - z
u_i.checkpoint()
x_i = f.prox(u_i.map(ui => {zb.value - ui}), rho)        // local proximal step on z - u_i
x_i.checkpoint()
x = x_i.reduce(_+_) / f.numSplits.toDouble               // average the local x_i on the driver
u = u_i.reduce(_+_) / f.numSplits.toDouble               // average the duals on the driver
z = g.prox(x+u, f.numSplits*rho)                         // consensus update on the driver
r = Math.sqrt(r_i.reduce(_+_))                           // primal residual norm
This is a method taken from here, which runs in a loop (let's say 200 times). x_i contains our data (let's say 100,000 entries).
In a BSP-style program, if we had to process this map operation, we would partition this data and distribute it across multiple nodes. Each node would process a sub-part of the data (the map operation) and return its result to the master (after barrier synchronization). Since the master node wants to process each individual result returned (centralized master; see the figure below), the result for each entry is sent to the master (the reduce operator in Spark). So the master (alone) receives 100,000 messages after each iteration. It processes this data and sends the new values to the slaves, which again start processing the next iteration.
Now, since Spark takes control from the user and does everything internally, I am unable to understand how Spark collects all the data after the map operations (asynchronous message passing? I heard it has p2p message passing? What about synchronization between map tasks? If it does synchronize, then is it right to say that Spark is actually a BSP model?). Then, in order to apply the reduce function, does it collect all the data on a central machine (if yes, does it receive 100,000 messages on a single machine?), or does it reduce in a distributed fashion (if yes, how is this performed?)?
The following figure shows my reduce function on the master. x_i^(k-1) represents the i-th value calculated (in the previous iteration) against the x_i data entry of my input. x_i^(k) represents the value of x_i calculated in the current iteration. Clearly, this equation needs the results to be collected.
I actually want to compare both styles of distributed programming to understand when to use Spark and when to move to BSP. Further, I looked a lot on the internet; all I found is how map/reduce works, but nothing useful was available on the actual communication/synchronization. Any helpful material would be useful as well.
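For what it's worth, on the last point: Spark's reduce does not ship every element to the driver. Each partition is reduced locally on its executor, and only one partial result per partition is sent back to the driver, which merges them. A tiny illustration (local mode, made-up numbers):

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReduceDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("ReduceDemo").setMaster("local[4]"));
        List<Long> nums = LongStream.rangeClosed(1, 100_000).boxed().collect(Collectors.toList());
        JavaRDD<Long> data = sc.parallelize(nums, 4); // 4 partitions
        // Each of the 4 partitions computes its own partial sum on its
        // executor; the driver merges 4 partial results, not 100,000 values.
        Long sum = data.reduce(Long::sum);
        System.out.println(sum); // 5000050000
        sc.stop();
    }
}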

Cassandra - Write doesn't fail, but values aren't inserted

I have a cluster of 3 Cassandra 2.0 nodes. For my application I wrote a test which tries to write some data into Cassandra and read it back. In general this works fine.
The curiosity is that after I restart my computer, this test fails: after writing, I read the same value I wrote before, and there I get null instead of the value, although there was no exception while writing.
If I manually truncate the used column family, the test passes. After that I can execute the test as often as I want; it passes again and again. Furthermore, it doesn't matter whether there are already values in Cassandra or not; the result is always the same.
If I look at the CLI and the CQL shell, there are two different views.
Does anyone have an idea what is going wrong? The timestamp in the CLI is updated after re-execution, so it seems to be a read problem?
A part of my code:
For inserts I tried
Insert.Options insert = QueryBuilder.insertInto(KEYSPACE_NAME,TABLENAME)
.value(ID, id)
.value(JAHR, zonedDateTime.getYear())
.value(MONAT, zonedDateTime.getMonthValue())
.value(ZEITPUNKT, date)
.value(WERT, entry.getValue())
.using(timestamp(System.nanoTime() / 1000));
and
Insert insert = QueryBuilder.insertInto(KEYSPACE_NAME,TABLENAME)
.value(ID, id)
.value(JAHR, zonedDateTime.getYear())
.value(MONAT, zonedDateTime.getMonthValue())
.value(ZEITPUNKT, date)
.value(WERT, entry.getValue());
My select looks like
Select.Where select = QueryBuilder.select(WERT)
.from(KEYSPACE_NAME,TABLENAME)
.where(eq(ID, id))
.and(eq(JAHR, zonedDateTime.getYear()))
.and(eq(MONAT, zonedDateTime.getMonthValue()))
.and(eq(ZEITPUNKT, Date.from(instant)));
The consistency level is QUORUM (for both) and the replication factor is 3.
I'd say this seems to be a problem with timestamps, since a truncate solves the problem. In Cassandra the last write wins, and this could be caused by the use of System.nanoTime(), since:
This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.
...
The values returned by this method become meaningful only when the difference between two such values, obtained within the same instance of a Java virtual machine, is computed.
http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#nanoTime()
This means that the write that occurred before the restart could have been performed "in the future" compared to the write after the restart. This would not fail the query, but the written value would simply not be visible, because there is a "newer" value available.
Do you have a requirement to use sub-millisecond precision for the insert timestamps? If possible I would recommend using System.currentTimeMillis() instead of nanoTime().
http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#currentTimeMillis()
If you have a requirement to use sub-millisecond precision it would be possible to use System.currentTimeMillis() with some kind of atomic counter that ranged between 0-999 and then use that as a timestamp. This would however break if multiple clients insert the same row at the same time.
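A minimal sketch of that scheme (hypothetical class name; assumes a single client JVM, as noted above). Cassandra's write timestamps are microseconds since the epoch, hence the multiplication by 1000:

import java.util.concurrent.atomic.AtomicLong;

public final class MonotonicMicros {
    private static final AtomicLong counter = new AtomicLong();

    // currentTimeMillis() gives wall-clock time that survives reboots; the
    // counter spreads concurrent writes across the 1000 microsecond slots
    // inside each millisecond.
    public static long next() {
        return System.currentTimeMillis() * 1000 + counter.getAndIncrement() % 1000;
    }
}

This would then replace the nanoTime() call in the insert above: .using(timestamp(MonotonicMicros.next())).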

Azure Service Bus Event Hub: is the "offset" still available/durable when some of the event data has expired?

I wrote some code to test Event Hubs, which were newly released on Azure Service Bus.
As there are very few articles online, and MSDN also does not have rich documentation about the details of Event Hubs, I hope someone can share their experience with my question.
For Event Hubs, we have the following statements:
we use an "offset" to remember where we are when reading the event data from some partition
the event data on the Event Hub expires (automatically?) after some configurable time span
So my question is: can the offset still be available/durable when some of the event data is deleted as a result of expiration?
For example, we have following data on one of partition:
M1 | M2 | M3 | M4 ( oldest --> latest )
After my processing logic runs, let's say that I have processed M1 and M2, so the stored offset would be the start of M2 (when using exclusive mode).
After some time, suppose my service is down during that period and M1 is deleted as a result of expiration, so the partition becomes:
M2 | M3 | M4 | M... ( oldest -> latest )
In this case, when my server restarts, is the offset I stored before still usable to read from M3?
We can also imagine this case at runtime: while my consumer is reading event data from the Event Hub, some of the oldest event data expires. Does the offset remain valid at runtime?
Thanks for sharing any experience with this question.
Based upon how various parts of the documentation are written, I believe you will start from the beginning of the currently retained stream, as desired, if your starting offset is no longer available. EventProcessorHost should follow similar behavior. Since the sequence numbers are 64-bit and monotonically increase without being recycled, I would expect them to be able to serve as an offset within a partition; the offset should have a similar property. So if Event Hubs are designed in a reasonable fashion (i.e. like similar solutions), then offsets within partitions should remain valid despite data expiration. But since I have not yet tested this myself, I will be very unhappy if it is not so, and I'd expect someone from Azure to be able to give authoritative confirmation.
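If that holds, the consumer pattern implied here looks something like the sketch below. All names are hypothetical placeholders, not Azure SDK calls: resume from the stored offset, and fall back to the start of the retained stream if the offset has expired.

// All types and methods below are hypothetical placeholders, not Azure SDK API.
String storedOffset = offsetStore.load(partitionId);            // checkpoint persisted earlier
EventReceiver receiver;
try {
    // exclusive mode: resume with the first event after the stored offset
    receiver = createReceiverAfterOffset(partitionId, storedOffset);
} catch (OffsetExpiredException e) {
    // the checkpointed event has aged out; restart from the oldest retained event
    receiver = createReceiverAtStartOfStream(partitionId);
}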
