How to get rows in reverse sorted order using pycassa get_range?

I want the rows returned by get_range in pycassa to be in reverse sorted order, i.e. from finish to start.
I know there is a column_reversed parameter for getting columns in reverse sorted order, but how do I do the same for rows?

Cassandra itself doesn't support getting a range of rows in reverse order. It also doesn't support getting rows in normal sorted order unless you're using an order preserving partitioner, which is almost never recommended. This post is a bit old, but still covers the topic quite well: http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
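For illustration, here is roughly what pycassa allows (a sketch; the keyspace and column family names are placeholders): you can reverse the columns within each row, but the rows themselves still come back in partitioner order.

import pycassa

pool = pycassa.ConnectionPool('my_keyspace')
cf = pycassa.ColumnFamily(pool, 'my_cf')

# column_reversed flips the column order inside each row;
# the row order itself is fixed by the partitioner
for key, columns in cf.get_range(column_reversed=True):
    print(key, columns)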

Related

Order of rows shown changes on selection of columns from dependent pyspark dataframe

Why does the order of displayed rows differ when I take a subset of the dataframe columns to display via show?
Here is the original dataframe:
Here the dates are in the given order, as you can see via show.
Now the order of rows displayed via show changes when I select a subset of predict_df's columns into a new dataframe.
That's because a Spark dataframe is itself unordered, due to the parallel processing principles Spark uses. Different records may be located in different files (and on different nodes), and different executors may read the data at different times and in different sequences.
So you have to explicitly specify an order in a Spark action using the orderBy (or sort) method. E.g.:
df.orderBy('date').show()
In this case the result will be ordered by the date column and will be more predictable. But if many records share an equal date value, the records within each date subset will still be unordered. So in that case, to obtain strongly ordered data, we have to perform orderBy on a set of columns whose combined values are unique across all rows. E.g.:
df.orderBy(col("date").asc(), col("other_column").desc())
In general, unordered datasets are the normal case for data processing systems. Even "traditional" DBMSs like PostgreSQL or MS SQL Server generally return unordered records, and we have to explicitly use an ORDER BY clause in the SELECT statement. Even if we sometimes see the same results from one query, the DBMS doesn't guarantee that another execution will return the same result, especially when reading a large amount of data.
The situation occurs because show is an action that is called twice.
As no .cache is applied, the whole cycle starts again from the beginning. Moreover, I tried this a few times and sometimes got the same order and sometimes not the order the questioner observed. Processing is non-deterministic.
As soon as I used .cache, I always got the same result.
This means that ordering is preserved over a narrow transformation on a dataframe if caching has been applied; otherwise the second action will invoke processing from the start again. The basics are evident here as well, and perhaps the bottom line is: always do ordering explicitly, if it matters.
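A sketch of that experiment, using predict_df from the question; after cache() both actions read the same materialized rows instead of recomputing from the source:

# Cache once, then both actions see the same materialized rows
predict_df.cache()
predict_df.show()
predict_df.select('date').show()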
Like @Ihor Konovalenko and @mck mentioned, a Spark dataframe is unordered by nature. Also, it looks like your dataframe doesn't have a reliable key to order by, so one solution is using monotonically_increasing_id (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html) to create an id, which will keep your dataframe consistently ordered. However, if your dataframe is big, be aware this function might take some time to generate an id for each row.
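A minimal sketch of that approach, again using predict_df from the question:

from pyspark.sql import functions as F

# Tag each row once with an increasing (not necessarily consecutive) id,
# then order by it on every subsequent read
df_with_id = predict_df.withColumn('row_id', F.monotonically_increasing_id())
df_with_id.orderBy('row_id').show()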

How can I update a column to a particular value in a Cassandra table?

Hi, I have a Cassandra table with around 200 records in it. Later I altered the table to add a new column named budget, of type boolean. I want to set the default value to be true for that column. What should the CQL look like?
I tried the following command but it didn't work:
cqlsh:Openmind> update mep_primecastaccount set budget = true ;
SyntaxException: line 1:46 mismatched input ';' expecting K_WHERE
Appreciate any help. Thank you.
Any operation that would require a cluster-wide read before write is not supported (it won't work at the scale Cassandra is designed for). You must provide a partition and clustering key for an update statement. If there are only 200 records, a quick Python script can do this for you: do a SELECT * FROM mep_primecastaccount and iterate through the ResultSet, issuing an update for each row. If you have a lot more records you might want to use Spark or Hadoop, but for a small table like that a quick script can do it.
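A sketch of that script using the DataStax Python driver; the partition key column name (account_id) is a guess, so substitute your table's actual key column(s):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('Openmind')  # keyspace from the cqlsh prompt

update = session.prepare(
    'UPDATE mep_primecastaccount SET budget = true WHERE account_id = ?')

# Read every row's key, then issue one update per row
for row in session.execute('SELECT account_id FROM mep_primecastaccount'):
    session.execute(update, (row.account_id,))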
Chris's answer is correct - there is no efficient or reliable way to modify a column value for each and every row in the database. But for a 200-row table that doesn't change in parallel, it's actually very easy to do.
But there's another way that also works on a table of billions of rows:
You can handle the notion of a "default value" in your client code. Pre-existing rows will not have a value for "budget" at all: it will be neither true nor false, but outright missing (a.k.a. "null"). Your client code can, when it reads a row with a missing "budget" value, replace it with a default value of its choice, e.g. "true".
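In Python client code that default-on-read might look like this (a sketch; row stands for a driver result row):

def effective_budget(row):
    # A missing budget (None) is treated as the default True
    return True if row.budget is None else row.budget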

Is there a way to filter a counter column in Cassandra?

I have been unable to figure out how to proceed with a use case.
I want to keep counts of some items, and query the data such that
counter_value < threshold value
Now in Cassandra, indexes cannot be created on counters, which is a problem. Is there a workaround in data modelling that can accomplish something similar?
thanks
You have partially answered your own question by saying what you want to query. So let's first model the data the way you will query it later.
If you want to query by counter value, the column cannot be a counter type, as a counter fails both conditions needed to query the data:
It cannot be part of an index
It cannot be part of the partition key
Counters are the most efficient way to do fast writes in Cassandra for a counting use case, but unfortunately they cannot be part of a WHERE clause because of the above two restrictions.
So if you want to solve the problem using Cassandra, change the type to a long, and either make it the clustering key or create an index over that column. Either way, this will slow your writes and increase the latency of every counter update, as you will be using the read-before-write anti-pattern.
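A sketch of that alternative with the DataStax Python driver; the keyspace, table, and column names (my_keyspace, item_counts, item_id, count_value) are hypothetical:

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

session.execute("""
    CREATE TABLE IF NOT EXISTS item_counts (
        item_id text PRIMARY KEY,
        count_value bigint
    )""")
session.execute('CREATE INDEX IF NOT EXISTS ON item_counts (count_value)')

# The anti-pattern in action: every increment costs a read first
row = session.execute(
    'SELECT count_value FROM item_counts WHERE item_id = %s',
    ('item42',)).one()
current = row.count_value if row and row.count_value is not None else 0
session.execute(
    'UPDATE item_counts SET count_value = %s WHERE item_id = %s',
    (current + 1, 'item42'))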
I would recommend using the index.
Last but not least, I would consider using a SQL database for this problem.
Depending on what you're trying to return as a result, you might be able to do something with a user defined aggregate function. You can put arbitrary code in the user defined function to filter based on the value of the counter.
See some examples here and here.
Other approaches would be to filter the returned rows on the client side, or to load the data into Spark and filter the rows in Spark.
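For instance, client-side filtering could look like this (a sketch; the keyspace, table, and column names are hypothetical):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

# Fetch the counter rows and filter locally on the client
threshold = 100
rows = session.execute('SELECT item_id, hits FROM counts_by_item')
below = [(r.item_id, r.hits) for r in rows if r.hits < threshold]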

Cassandra Query by Date

How do I update a column based on a greater-than or less-than date comparison in Cassandra?
Example:
update asset_by_file_path set received = true where file_path = '/file/path' and time_received = '2015-07-24 02:14:34-0600';
This works fine. But I would like to do it for all rows that match this file path and where time_received is greater than 2015-07-24 02:14:34-0600.
time_received is a date and the clustering column.
file_path is a string and the partition key.
Cassandra's WHERE clause has many limitations, and if you have several clustering columns things may not work as you expect; at the least there are limitations on the >, >=, <, <= operators. Here is a fairly recent blog post from DataStax about WHERE clause nuances, which also covers some upcoming features.
I think UPDATE can only modify a single row at a time, so I don't see a way to update multiple rows on the server side in CQL.
A couple possible programmatic approaches:
Do a range query to return all the rows you want to update, and then on the client side, update each row returned. Since they would all be in the same partition, you could issue the updates as batched statements (sketched after these approaches).
If you have Spark available, you could read all the rows you want to update into an RDD using a range query. Then do a transformation on the RDD to set the received value to true, then save the RDD back to Cassandra.
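A sketch of the first approach with the DataStax Python driver, reusing the asset_by_file_path schema from the question (the keyspace name is a placeholder):

from datetime import datetime
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

session = Cluster(['127.0.0.1']).connect('my_keyspace')

# 2015-07-24 02:14:34-0600 expressed in UTC
cutoff = datetime(2015, 7, 24, 8, 14, 34)

rows = session.execute(
    'SELECT file_path, time_received FROM asset_by_file_path '
    'WHERE file_path = %s AND time_received > %s',
    ('/file/path', cutoff))

update = session.prepare(
    'UPDATE asset_by_file_path SET received = true '
    'WHERE file_path = ? AND time_received = ?')

# Every update targets the same partition, so one batch is safe here
batch = BatchStatement()
for row in rows:
    batch.add(update, (row.file_path, row.time_received))
session.execute(batch)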

How to arrange data in Cassandra to get data in last in first out format

As we cannot sort data in Cassandra, I want to store data in such a format that when I retrieve it I get it in last-in-first-out order, i.e. if users enter comments, then when I retrieve the data I should get the very latest comment first and the older comments after. I think it's something to do with the comparator.
I have set following when configuring Cassandra:
assume posts comparator as utf8;
assume posts validator as utf8;
assume posts keys as utf8;
Please help - how should I create the column to arrange data in time format so that latest data is stored first?
Columns in a row are always sorted, and you can iterate over the columns in a row in reverse order. Given these two facts, we could model the situation you're describing by storing comments in a column family called "comments", where the row key is the post ID and the columns represent the comments on the corresponding post. The column names are timestamps (either ISO-formatted dates, UNIX timestamps, or time UUIDs) and the values are the comment text bodies.
If you now get the columns for a row and specify that you want them in reverse order, you get exactly what you want. How to specify reverse order depends on your driver, but it's usually just an option to the command that retrieves a row or a column slice.
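With pycassa, for example, that reverse slice could look like this (a sketch; the keyspace and row key are placeholders):

import pycassa

pool = pycassa.ConnectionPool('my_keyspace')
comments = pycassa.ColumnFamily(pool, 'comments')

# column_reversed=True returns the newest comments first
latest = comments.get('post-123', column_count=10, column_reversed=True)
for timestamp, body in latest.items():
    print(timestamp, body)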
Another, more hackish way would be to take the UNIX timestamp of a post, subtract it from a large integer like 2^31, and use that as the column key. That way columns sort in reverse order by default. It's not pretty, and the method above is more elegant.
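For illustration, the inverted column key could be computed like this (a sketch; 2^31 works as long as timestamps stay below it):

import time

# Newer posts get smaller keys, so they sort first under the default comparator
column_key = 2**31 - int(time.time())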
If you worry about using timestamps because there could be collisions where two comments are posted at exactly the same time, use Cassandra's time UUID type.
You need to organize your data such that the comparator is a timestamp. You store your data in natural order and specify reverse order in your slice query.
