Cassandra aggregate function "first_row"

I know that Cassandra allows GROUP BY and can run a UDF on that data.
Is there any default function to get the first row of each aggregated set?
(How) can I stop processing data and return a result from my UDF immediately (e.g. after one or a few rows have been processed)?
Right now I'm using ... COUNT(1) ... as a workaround.

Actually, you don't need any UDF. It works as described out of the box.
Just GROUP BY the fields you need.
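For illustration only (the keyspace, table, and column names below are invented, not taken from the question), a GROUP BY query issued through the Python driver could look like this; for the columns that are not aggregated, Cassandra returns the values from the first row of each group:
from cassandra.cluster import Cluster

# hypothetical schema: readings(sensor_id text, reading_time timestamp, value double,
#                               PRIMARY KEY (sensor_id, reading_time))
session = Cluster(["127.0.0.1"]).connect("demo_ks")

rows = session.execute("""
    SELECT sensor_id, reading_time, value
    FROM readings
    GROUP BY sensor_id
""")
for row in rows:
    # reading_time and value are taken from the first row of each sensor_id group
    print(row.sensor_id, row.reading_time, row.value)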

Related

Order of rows shown changes on selection of columns from dependent pyspark dataframe

Why does the order of rows displayed differ when I take a subset of the dataframe columns to display via show?
Here is the original dataframe:
Here the dates are in the given order, as you can see via show.
Now the order of the rows displayed via show changes when I select a subset of the columns of predict_df into a new dataframe.
That is because a Spark dataframe is itself unordered. This is due to the parallel-processing principles which Spark uses: different records may be located in different files (and on different nodes), and different executors may read the data at different times and in a different sequence.
So you have to explicitly specify the order in a Spark action using the orderBy (or sort) method, e.g.:
df.orderBy('date').show()
In this case the result will be ordered by the date column and will be more predictable. But if many records have an equal date value, then within such a date subset the records will still be unordered. So in that case, in order to obtain strongly ordered data, we have to perform orderBy on a set of columns, and the combination of values in those columns must be unique across rows. E.g.:
from pyspark.sql.functions import col
df.orderBy(col("date").asc(), col("other_column").desc())
In general, unordered datasets are the normal case for data processing systems. Even "traditional" DBMSs like PostgreSQL or MS SQL Server generally return unordered records, and we have to explicitly use an ORDER BY clause in the SELECT statement. And even if we sometimes see the same result for a query, the DBMS does not guarantee that another execution will return the same result, especially when reading a large amount of data.
The situation occurs because show is an action that is called twice.
As no .cache is applied, the whole cycle starts again from the start. Moreover, I tried this a few times and sometimes got the same order and sometimes not the same order as the questioner observed. Processing is non-deterministic.
As soon as I used .cache, the same result was always returned.
This means that ordering is preserved over a narrow transformation on a dataframe if caching has been applied; otherwise the 2nd action will invoke processing from the start again - the basics are evident here as well. Maybe the bottom line is: always do the ordering explicitly - if it matters.
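A minimal sketch of that experiment (the dataframe contents below are made up, not the questioner's data), with caching applied before the two show calls:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

predict_df = spark.createDataFrame(
    [("2021-01-01", 0.1), ("2021-01-02", 0.2), ("2021-01-03", 0.3)],
    ["date", "prediction"])          # stand-in data

predict_df.cache()                   # materialised on the first action, reused afterwards
predict_df.show()                    # first action computes and caches the rows
predict_df.select("date").show()     # second action reads from the cache instead of recomputing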
As @Ihor Konovalenko and @mck mentioned, a Spark dataframe is unordered by nature. Also, it looks like your dataframe doesn't have a reliable key to order by, so one solution is using monotonically_increasing_id https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html to create an id column that will let you keep your dataframe ordered. However, if your dataframe is big, be aware this function might take some time to generate an id for each row.
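As a sketch of that suggestion (the toy dataframe and the row_id column name are assumptions, not from the question), you could attach the id once and order by it whenever you display a subset:
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

predict_df = spark.createDataFrame(
    [("2021-01-01", 0.1), ("2021-01-02", 0.2)],
    ["date", "prediction"])                          # toy stand-in for the questioner's dataframe

with_id = predict_df.withColumn("row_id", monotonically_increasing_id())

with_id.orderBy("row_id").show()                              # full dataframe in a stable order
with_id.select("date", "row_id").orderBy("row_id").show()     # column subset, same order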

How to run spark job sequentially?

I have a use case wherein there is a table with one column which holds a sequence of SQL queries.
I want to run these SQL queries in a Spark program one after the other, not in parallel, because the SQL query on the Nth row depends on the (N-1)th row.
Due to this constraint I want to execute them sequentially, one after the other, rather than in parallel. How can I achieve this?
I think you could use something like this:
import org.apache.spark.sql.functions.col

// read the table of queries, keep the SQL text, and order by the query index
val listOfQueryRows = spark.sqlContext.table("foo_db.table_of_queries")
  .select(col("sql_query"))
  .orderBy(col("query_index"))
  .collectAsList()                 // brings the ordered queries to the driver

// execute each query one after the other
listOfQueryRows.forEach(queryRow => spark.sql(queryRow.getString(0)))
This will select all your queries in the sql_query column, order them by the index given in the query_index column, and collect them into the list listOfQueryRows on the driver. The list is then iterated over sequentially, executing the query for each row.

How does Apache spark structured streaming 2.3.0 let the sink know that a new row is an update of an existing row?

How does Spark Structured Streaming let the sink know that a new row is an update of an existing row when run in update mode? Does it compare all the column values of the new row and an existing row for an equality match, or does it compute some sort of hash?
Reading the documentation, we see some interesting information about update mode:
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
So, to use update mode there needs to be some kind of aggregation, otherwise all data will simply be added to the end of the result table. In turn, to use aggregation the data needs to use one or more columns as a key. Since a key is needed, it is easy to know whether a row has been updated or not - simply compare the values with the previous iteration of the table (the key tells you which row to compare with). In aggregations that contain a groupBy, the columns being grouped on are the keys.
Simple aggregations that return a single value will not require a key. However, since only a single value is returned, it will update if that value changes. An example here could be taking the sum of a column (without a groupBy).
The documentation contains a picture that gives a good understanding of this, see the "Model of the Quick Example" from the link above.
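As a rough pyspark sketch of the groupBy case described above (not from the answer; the rate source, the key expression, and the console sink are assumptions for illustration only), an update-mode query with a grouping key could look like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("update-mode-sketch").getOrCreate()

# the built-in "rate" test source produces (timestamp, value) rows
events = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .withColumn("key", expr("value % 3")))   # hypothetical grouping key

counts = events.groupBy("key").count()             # "key" identifies rows in the result table

# only the groups whose count changed since the last trigger are written out
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()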

spark dataset : how to get count of occurrence of unique values from a column

I am trying the Spark Dataset APIs, reading a CSV file and counting the occurrences of unique values in a particular field. One approach which I think should work is not behaving as expected. Let me know what I am overlooking. I have posted both the working and the buggy approach below.
// get all records from a column
val professionColumn = data.select("profession")
// breakdown by professions in descending order
// ***** DOES NOT WORK ***** //
val breakdownByProfession = professionColumn.groupBy().count().collect()
// ***** WORKS ***** //
val breakdownByProfessiond = data.groupBy("profession").count().sort("count") // WORKS
println ( s"\n\nbreakdown by profession \n")
breakdownByProfession.show()
Also, please let me know which approach is more efficient. My guess would be the first one (the reason I attempted it in the first place).
Also, what is the best way to save the output of such an operation to a text file using the Dataset APIs?
In the first case, since there are no grouping columns specified, the entire dataset is considered as one group -- this behavior holds even though there is only one column present in the dataset. So, you should always pass the list of columns to groupBy().
Now the two options would be: data.select("profession").groupBy("profession").count vs. data.groupBy("profession").count. In most cases, the performance of these two alternatives will be exactly the same since Spark tries to push projections (i.e., column selection) down the operators as much as possible. So, even in the case of data.groupBy("profession").count, Spark first selects the profession column before it does the grouping. You can verify this by looking at the execution plan -- org.apache.spark.sql.Dataset.explain()
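For example, you could compare the plans of the two variants (these lines read the same in Scala and pyspark, reusing the questioner's data dataset):
data.select("profession").groupBy("profession").count().explain()
data.groupBy("profession").count().explain()
In both plans you should see that only the profession column is read from the source.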
In the groupBy transformation you need to provide the column name, as below:
val breakdownByProfession = professionColumn.groupBy("profession").count().collect()

Is there a way to filter a counter column in cassandra?

I have been unable to decipher how to proceed with a use case...
I want to keep count of some items, and query the data such that
counter_value < threshold value
Now, in Cassandra, indexes cannot be made on counters, which is a problem. Is there a modelling workaround which can be done to accomplish something similar?
thanks
You have partially answered your own question by saying what you want to query. So let's first model the data the way you will query it later.
If you want to query by the counter value, it cannot be a counter type, as a counter column fails both of the conditions needed to query by a column:
It cannot be part of an index
It cannot be part of the partition key
Counters are the most efficient way to do fast writes in Cassandra for a counting use case. But unfortunately they cannot be part of a WHERE clause, because of the above two restrictions.
So if you want to solve the problem using Cassandra, change the type to a long, and either make it the clustering key or make an index over that column. In either case this will slow down your writes and increase the latency of every update of the counter value, as you will be using the read-before-write anti-pattern.
I would recommend using the index.
Last but not least, I would consider using a SQL database for this problem.
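As a sketch of the clustering-key alternative (the keyspace, table, and column names below are invented, and the read-before-write cost mentioned above still applies whenever a value changes), the model and the range query could look like this through the Python driver:
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_ks")

# a plain bigint instead of a counter, with the value as the first clustering column
session.execute("""
    CREATE TABLE IF NOT EXISTS item_counts_by_value (
        bucket text,
        count_value bigint,
        item_id text,
        PRIMARY KEY (bucket, count_value, item_id)
    )
""")

# range query within one partition: all items whose count is below the threshold
rows = session.execute(
    "SELECT item_id, count_value FROM item_counts_by_value "
    "WHERE bucket = %s AND count_value < %s",
    ("all", 100))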
Depending on what you're trying to return as a result, you might be able to do something with a user defined aggregate function. You can put arbitrary code in the user defined function to filter based on the value of the counter.
See some examples here and here.
Other approaches would be to filter the returned rows on the client side, or to load the data into Spark and filter the rows in Spark.
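A rough sketch of the Spark route (it assumes the spark-cassandra-connector is available to the Spark session; the keyspace, table, and column names are invented):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

counters = (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="demo_ks", table="item_counters")   # hypothetical names
            .load())

# the filtering on the counter column happens in Spark, not in Cassandra
counters.filter(col("counter_value") < 100).show()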
