Best way to aggregate by key (Spark/CQL) - apache-spark

Given a simple table with the columns
id (partition key), timestamp (clustering column) and value (a long),
what's the best way to get the sum of values for each id? I'd try to select all distinct ids in one query and then use this list of ids to run a query for each id:
SELECT sum(value) FROM mytable WHERE id = ?
Unfortunately I can't figure out how to write the Spark job, and I am not really sure this is the best way. This is how far I got:
sc.cassandraTable("mykeyspace", "mytable")
.select("select distinct id")
.select("select sum(value)")
.where("id=?", ???)
Any hints on how I should proceed would be really appreciated.
Edit: Also, here is a working example of how I currently do the aggregation: https://gist.github.com/Phil-Ba/72a7e762c8ab1ff1f3c9e8cff92cb223#file-cassandrasum-scala
The performance is lackluster though :/

This is called group by.
It can be achieved with SQL:
select sum(value) from mytable group by id
It can also be achieved with function calls in Spark:
import org.apache.spark.sql.functions._
import sqlContext.implicits._ // enables the $"value" column syntax
val df = sqlContext.table("mytable")
df.groupBy("id").agg(sum($"value"))
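To make concrete what the group-by computes, here is a minimal plain-Python sketch (not Spark code). The function name sum_by_key and the sample rows are hypothetical; the row layout (id, timestamp, value) follows the table described in the question.

```python
from collections import defaultdict


def sum_by_key(rows):
    """Return {id: sum of value} for an iterable of (id, timestamp, value) rows."""
    totals = defaultdict(int)
    for row_id, _timestamp, value in rows:
        totals[row_id] += value
    return dict(totals)


# Hypothetical sample rows: (id, timestamp, value)
rows = [("a", 1, 10), ("a", 2, 5), ("b", 1, 7)]
totals = sum_by_key(rows)
print(totals)
```

Spark's groupBy does the same thing conceptually, but distributes the per-key partial sums across the cluster instead of issuing one query per id.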

Related

search within a cassandra column

I'm working with the movielens dataset and I have a column called 'genres' which has entries such as 'Action|War', 'Action|Adventure|Comedy|Sci-Fi'. I wish to count the number of rows that have the text 'Comedy' in them.
SELECT COUNT(*) FROM movielens.data_movies WHERE genres = 'Comedy' ALLOW FILTERING
But this counts only the exact instances of 'Comedy'. It does not count 'Action|Adventure|Comedy|Sci-Fi' which I want it to do. So I tried,
SELECT COUNT(*) FROM movielens.data_movies WHERE genres CONTAINS 'Comedy' ALLOW FILTERING
However, that gives me the error
Cannot use CONTAINS on non-collection column genres
From this it seems that there is no easy way to do what I'm asking. Does anyone know of a simpler solution?
What you can do is create a custom SASI index on genres:
CREATE CUSTOM INDEX ON movielens.data_movies(genres)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS={'mode':'CONTAINS'};
Then this query should work:
SELECT COUNT(*) FROM movielens.data_movies
WHERE genres LIKE '%Comedy%';
However, if you're running this query across millions of rows over multiple nodes, it will likely time out, because Cassandra has to poll multiple partitions and nodes to build the result set. Queries like this don't really work well in Cassandra.
The best way to solve this is to create a table partitioned by genre, like this:
CREATE TABLE movies_by_genre (
id int,
title TEXT,
genre TEXT,
PRIMARY KEY(genre,title,id));
This of course assumes that genres is split out into one row per individual genre. But then this query would work:
SELECT COUNT(*) FROM movies_by_genre
WHERE genre = 'Comedy';
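As a rough sketch of the "split out" step, the following plain-Python snippet turns each pipe-separated genres string into one (genre, title, id) row, matching the column order of the movies_by_genre primary key. The function name rows_by_genre and the sample movies are hypothetical, and actually writing the rows to Cassandra is left out.

```python
def rows_by_genre(movies):
    """Yield one (genre, title, id) tuple per genre of each movie."""
    for movie_id, title, genres in movies:
        for genre in genres.split("|"):
            yield (genre, title, movie_id)


# Hypothetical sample data: (id, title, genres)
movies = [
    (1, "Movie A", "Action|War"),
    (2, "Movie B", "Action|Adventure|Comedy|Sci-Fi"),
]
by_genre = list(rows_by_genre(movies))

# Equivalent of: SELECT COUNT(*) FROM movies_by_genre WHERE genre = 'Comedy';
comedy_count = sum(1 for genre, _title, _id in by_genre if genre == "Comedy")
print(comedy_count)
```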

what is the row id equivalent in pyspark?

In our legacy DWH process, we find duplicates and track those duplicate records based on rowid in a traditional RDBMS.
For example,
select pkey_columns, max(rowid) from table group by pkey_columns
returns, for each key, only the record with the maximum rowid among the duplicates. Even once we have identified the duplicate records, this helps in identifying/tracking the record.
Is there an equivalent in PySpark? How is this handled in DWH to PySpark DWH translation projects?
I would suggest that you use the analytic (window) functions, perhaps a
ROW_NUMBER()
OVER (PARTITION BY pkey_columns
ORDER BY sort_columns)
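To illustrate what ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) produces, here is a minimal plain-Python sketch of the same logic (not PySpark code). The function add_row_number, the column names pkey and loaded_at, and the sample records are all hypothetical.

```python
from collections import defaultdict


def add_row_number(records, key, order_by):
    """Number the records within each key group, ordered by the order_by field."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    numbered = []
    for group in groups.values():
        ordered = sorted(group, key=lambda r: r[order_by])
        for n, rec in enumerate(ordered, start=1):
            numbered.append({**rec, "row_number": n})
    return numbered


# Hypothetical sample records; pkey 1 appears twice
records = [
    {"pkey": 1, "loaded_at": 10},
    {"pkey": 1, "loaded_at": 20},
    {"pkey": 2, "loaded_at": 5},
]
numbered = add_row_number(records, "pkey", "loaded_at")

# Any record with row_number > 1 is a duplicate of an earlier one for the same key
duplicates = [r for r in numbered if r["row_number"] > 1]
print(duplicates)
```

The row number plays the role that rowid played in the legacy process: it gives each record a stable position within its key group, so duplicates can be both detected and tracked.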

How to match alternative values with CONTAINS on a CQL collection?

I have a question about querying a Cassandra collection.
I want to write a query that searches within a collection.
CREATE TABLE rd_db.test1 (
testcol3 frozen<set<text>> PRIMARY KEY,
testcol1 text,
testcol2 int
)
That is the table structure; the table contents are not shown here.
In this situation, I want to write a CQL query that matches alternative values in the set column.
If it were SQL and testcol3 weren't a collection, I would write
select * from rd_db.test1 where testcol3 = 4 or testcol3 = 5
but it is CQL and a collection. I tried
select * from test1 where testcol3 contains '4' OR testcol3 contains '5' ALLOW FILTERING ;
select * from test1 where testcol3 IN ('4','5') ALLOW FILTERING ;
but these two queries didn't work.
Please help.
This won't work for you for multiple reasons:
there is no OR operation in CQL;
you can only do a full match on the value of the partition key (testcol3);
although you can create secondary indexes on fields with a collection type, it's impossible to create an index on the values of a partition key.
You need to change the data model, but you need to know in advance the queries that you'll be executing. From a brief look at your data model, I would suggest unrolling the set field into multiple rows, with the individual values corresponding to individual partitions.
I would also suggest taking the DS201 and DS220 courses on the DataStax Academy site for a better understanding of how Cassandra works and how to model data for it.
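The unrolling idea can be sketched in plain Python as follows. This is only an illustration under assumptions: the functions unroll and lookup_any and the sample rows are hypothetical, and a dictionary stands in for a table partitioned by the individual set element. The "4 or 5" condition is emulated with one lookup per value, merged on the client side.

```python
from collections import defaultdict


def unroll(rows):
    """Map each set element to the rows whose set contains it."""
    by_element = defaultdict(list)
    for members, testcol1, testcol2 in rows:
        for element in members:
            by_element[element].append((frozenset(members), testcol1, testcol2))
    return by_element


def lookup_any(by_element, wanted):
    """Emulate 'contains a OR contains b': one lookup per value, deduplicated."""
    seen = set()
    result = []
    for element in wanted:
        for row in by_element.get(element, []):
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result


# Hypothetical sample rows: (testcol3, testcol1, testcol2)
rows = [({"4", "6"}, "x", 1), ({"5"}, "y", 2), ({"4", "5"}, "z", 3), ({"7"}, "w", 4)]
index = unroll(rows)
matches = lookup_any(index, ["4", "5"])
print(sorted(row[1] for row in matches))
```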

How to use ORDER BY and GROUP BY together in u-sql

I have a U-SQL query that fetches some data from 3 tables, and this query already has a GROUP BY. I want to fetch only the top 10 rows, so I have to use FETCH.
#data= SELECT C.id,C.Name,C.Address,ph.phoneLabel,ph.phone
FROM person AS C
INNER JOIN
phone AS ph
ON ph.id == C.id
GROUP BY id
ORDER BY id ASC
FETCH 100 ROWS;
Please provide me some samples.
Thanks in Advance!
I am not an expert or anything, but a few days ago I executed a query which uses both a GROUP BY and an ORDER BY clause. Here's how it looks:
SELECT distinct savedposters.*, comments.rating, comments.posterid
FROM savedposters
INNER JOIN comments ON savedposters.id=comments.posterid
WHERE savedposters.display=1
GROUP BY comments.posterid
HAVING avg(comments.rating)>=4 and count(comments.rating)>=2
ORDER BY avg(comments.rating) DESC
What is your exact goal? There is no relationship between ORDER BY and GROUP BY. In your query you have GROUP BY but there is no aggregation, so the GROUP BY is not needed; moreover, the query would fail. If you're looking to limit the output to 10 rows, then see the first example at Output Statement (U-SQL).

Dynamics AX Nested Query

Maybe I'm missing something simple, but is there a way to write a nested query in AX? I tried some syntax I thought would work, but with no luck.
The following standard SQL statement would accomplish what I'm trying to do, but I need to do this in AX, not SQL.
SELECT table1.column1A, table1.column1B,
(SELECT Top 1 column2B FROM table2
WHERE table1.column1A = table2.column2A
ORDER BY table2.column1A)
AS lookupResult
FROM table1
My problem is that table1 has a one-to-many relationship with table2, and since AX doesn't have a DISTINCT function that I'm aware of, I receive many copies of each record when using a JOIN statement.
Thanks
Nested queries are not supported in AX.
One way to bypass the missing distinct is to use group by (assuming the max value of column2B is the one you are interested in):
while select column1A, column1B from table1
group column1A, column1B
join maxof(column2B) from table2
where table2.column2A == table1.column1A
{
...
}
Another method would be to use a display method on table1 in the form or report.
display ColumnB column2B()
{
return (select maxof(column2B) from table2
where table2.column2A == this.column1A).column2B;
}
The performance is inferior to the first solution, but it may be acceptable.
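The effect of both approaches, one result row per table1 record carrying the largest matching column2B, can be sketched in plain Python (not X++). The function lookup_max and the sample tables are hypothetical. One difference to note: this sketch yields None when table2 has no match, like the SQL subquery in the question, whereas the join in the first solution would drop such rows.

```python
def lookup_max(table1, table2):
    """Attach to each table1 row the max column2B among matching table2 rows."""
    best = {}
    for column2A, column2B in table2:
        if column2A not in best or column2B > best[column2A]:
            best[column2A] = column2B
    # None marks table1 rows without any matching table2 row
    return [(column1A, column1B, best.get(column1A)) for column1A, column1B in table1]


# Hypothetical sample data: table1 rows are (column1A, column1B),
# table2 rows are (column2A, column2B)
table1 = [(1, "x"), (2, "y"), (3, "z")]
table2 = [(1, 10), (1, 30), (2, 5)]
result = lookup_max(table1, table2)
print(result)
```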
As mentioned in the previous reply, group-by is the closest you can get to a distinct function. If you need a simpler query for some reason, or if you need a table or query object to use as a datasource on a form or report, you may entertain the idea of creating a view in the AOT that contains the group-by. You can then easily join to that view on a query object, form datasource, etc.
AX 2012 supports computed columns in views; you can use the SysComputedColumn class to build the query you want.
