How does Apache spark structured streaming 2.3.0 let the sink know that a new row is an update of an existing row? - apache-spark

How does spark structured streaming let the sink know that a new row is an update of an existing row when run in an update mode? Does it look at all the values of all columns of the new row and an existing row for an equality match or does it compute some sort of hash?

Reading the documentation, we see some interesting information about update mode (bold formatting added by me):
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
So, to use update mode there needs to be some kind of aggregation otherwise all data will simply be added to the end of the result table. In turn, to use aggregation the data need to use one or more coulmns as a key. Since a key is needed it is easy to know if a row has been updated or not - simply compare the values with the previous iteration of the table (the key tells you which row to compare with). In aggregations that contains a groupby, the columns being grouped on are the keys.
Simple aggregations that return a single value will not require a key. However, since only a single value is returned it will update if that value is changed. An example here could be taking the sum of a column (without groupby).
The documentation contains a picture that gives a good understanding of this, see the "Model of the Quick Example" from the link above.

Related

Order of rows shown changes on selection of columns from dependent pyspark dataframe

Why does the order of rows displayed differ, when I take a subset of the dataframe columns to display, via show?
Here is the original dataframe:
Here dates are in the given order, as you can see, via show.
Now the order of rows displayed via show changes when I select a subset of predict_df by method of column selection for a new dataframe.
Because of Spark dataframe itself is unordered. It's due to parallel processing principles wich Spark uses. Different records may be located in different files (and on different nodes) and different executors may read the data in different time and in different sequence.
So You have to excplicitly specify order in Spark action using orderBy (or sort) method. E.g.:
df.orderBy('date').show()
In this case result will be ordered by date column and would be more predictible. But, if many records have equal date value then within those date subset records also would be unordered. So in this case, in order to obtain strongly ordered data, we have to perform orderBy on set of columns. And values in all rows of those set of columns must be unique. E.g.:
df.orderBy(col("date").asc, col("other_column").desc)
In general unordered datasets is a normal case for data processing systems. Even "traditional" DBMS like PostgeSQL or MS SQL Server in general return unordered records and we have to explicitly use ORDER BY clause in SELECT statement. And even if sometime we may see the same results of one query it isn't guarenteed by DBMS that by another execution result will be the same also. Especially if data reading is performed on a large amout of data.
The situation occurs because the show is an action that is called twice.
As no .cache is applied the whole cycle starts again from the start. Moreover, I tried this a few times and got the same order and not the same order as the questioner observed. Processing is non-deterministic.
As soon as I used .cache, the same result was always gotten.
This means that there is ordering preserved over a narrow transformation on a dataframe, if caching has been applied, otherwise the 2nd action will invoke processing from the start again - the basics are evident here as well. And may be the bottom line is always do ordering explicitly - if it matters.
Like #Ihor Konovalenko and #mck mentioned, Sprk dataframe is unordered by its nature. Also, looks like your dataframe doesn’t have a reliable key to order, so one solution is using monotonically_increasing_id https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html to create id and that will keep your dataframe always ordered. However if your dataframe is big, be aware this function might take some time to generate id for each row.

How can i update the column to a particular value in a cassandra table?

Hi I am having a cassandra table. my table has around 200 records in it . later i have altered the table to add a new column named budget which is of type boolean . I want to set the default value to be true for that column . what should be the cql looks like.
I am trying the following command but it didnt work
cqlsh:Openmind> update mep_primecastaccount set budget = true ;
SyntaxException: line 1:46 mismatched input ';' expecting K_WHERE
appreciate any help
thank you
Any operation that would require a cluster wide read before write is not supported (as it wont work in the scale that Cassandra is designed for). You must provide a partition and clustering key for an update statement. If theres only 200 records a quick python script or can do this for you. Do a SELECT * FROM mep_primecastaccount and iterate through ResultSet. For each row issue an update. If you have a lot more records you might wanna use spark or hadoop but for a small table like that a quick script can do it.
Chris's answer is correct - there is no efficient or reliable way to modify a column value for each and every row in the database. But for a 200-row table that doesn't change in parallel, it's actually very easy to do.
But there's another way that can work also on a table of billions of rows:
You can handle the notion of a "default value" in your client code. Pre-existing codes will not have a value for "budget" at all: It won't be neither true, nor false, but rather it will be outright missing (a.k.a. "null"). You client code may, when it reads a row with a missing "budget" value, replace it by some default value of its choice - e.g., "true".

Primary key : query & updates

Little problem here with cassandra. Basically my data has a status (INITIALIZED, PERFORMED, ENDED...), and I have different scheduled tasks that will query this data based on the status with an "IN" clause. So one scheduler will work with the data that is INITIALIZED, one with the PERFORMED, some with both, etc...
Once the data is retrieved, it is processed and the status changes accordingly (INITIALIZED -> PERFORMED -> ENDED).
The problem : in order to be able to use the IN clause, the status has to figure among the primary keys of my table. But when I update the status... it creates a new record in my table, since the UPSERT doesn't find any data with the primary keys given...
How do I solve that ?
Instead of including the status column in your primary key columns you can create a secondary index on the column. However, the IN clause is not (yet) supported for secondary index columns. But as you have a very limited number of values to look up you could use equality conditions in your WHERE clause and then merge the results client-side?
Beware that using secondary indexes comes at a cost. Check out "when not to use an index". In your case these points may apply:
On a frequently updated or deleted column. See Problems using an
index on a frequently updated or deleted column below.
To look for a
row in a large partition unless narrowly queried. See Problems using
an index to look for a row in a large partition unless narrowly
queried below.

How to retrieve a very big cassandra table and delete some unuse data from it?

I hava created a cassandra table with 20 million records. Now I want to delete the expired data decided by one none primary key column. But it doesn't support the operation on the column. So I try to retrieve the table and get the data line by line to delete the data.Unfortunately,it is too huge to retrieve. Otherwise,I couldn't delete the whole table, how could I achieve my goal?
Your question is actually, how to get the data from the table in bulks (also called pagination).
You can do that by selecting different slices from your primary key: For example, if your primary key is some sort of ID, select a range of IDs each time, process the results and do whatever you want to do with them, then get the next range, and so on.
Another way, which depends on the driver you're working with, will be to use fetch_size. You can see a Python example here and a Java example here.

Cassandra Query by Date

How do I update a colum based on a greater or less then date in Casandra?
Example:
update asset_by_file_path set received = true where file_path = '/file/path' and time_received = '2015-07-24 02:14:34-0600';
This works fine. But I would like to do it for all columns that match this file path and time_received is greater then 2015-07-24 02:14:34-0600.
time_received is date, clustering column.
file_path is string, partition key
Cassandra's WHERE clause has many limitations and if you have several clustering columns things could not work as you expect, at least there are limitations for >, >=, <, <= etc operators. Here is a quite fresh blog post from Databrix about WHERE clause nuances, it also covers some upcoming features.
I think UPDATE can only modify a single row at a time, so I don't see a way to update multiple rows on the server side in CQL.
A couple possible programmatic approaches:
Do a range query to return all the rows you want to update, and then on the client side, update each row returned. Since they would all be for the same partition, you could issue the updates as batched statements.
If you have Spark available, you could read all the rows you want to update into an RDD using a range query. Then do a transformation on the RDD to set the received value to true, then save the RDD back to Cassandra.

Resources