Reading guarantees for full table scan while updating the table? - apache-spark

Given schema:
CREATE TABLE keyspace.table (
key text,
ckey text,
value text
PRIMARY KEY (key, ckey)
)
...and Spark pseudocode:
val sc: SparkContext = ...
val connector: CassandraConnector = ...
sc.cassandraTable("keyspace", "table")
.mapPartitions { partition =>
connector.withSessionDo { session =>
partition.foreach { row =>
val key = row.getString("key")
val ckey = Random.nextString(42)
val value = row.getString("value")
session.execute(s"INSERT INTO keyspace.table (key, ckey, value)" +
" VALUES ($key, $ckey, $value)")
}
}
}
Is it possible for a code like this to read an inserted value within a single application (Spark job) run? More generalized version of my question would be whether a token range scan CQL query can read newly inserted values while iterating over rows.

Yes, it is possible exactly as Alex wrote
but I don't think it's possible with above code
So per data model the table is ordered by ckey in ascending order
The funny part however is the page size and how many pages are prefetched and since this is by default 1000 (spark.cassandra.input.fetch.sizeInRows), then the only problem could occur, if you wouldn't use 42, but something bigger and/or the executor didn't page yet
Also I think you use unnecessary nesting, so the code to achieve what you want might be simplified (after all cassandraTable will give you a data frame).
(I hope I understand that you want to read per partition (note a partition in your case is all rows under one primary key - "key") and for every row (distinguished by ckey) in this partition generate new one (with new ckey that will just duplicate value with new ckey) - use case for such code is a mystery for me, but I hope it has some sense:-))

Related

Cassandra read from large dataset

I need to get a count from a very large dataset in Cassandra, 100 million plus. I am worried about the memory hit cassandra would take if I just ran the following query.
select count(*) from conv_org where org_id = 'TEST_ORG'
I was told I could use cassandra Automatic Paging to do this? Does this seem like a good option?
Would the syntax look something like this?
Statement stmt = new SimpleStatement("select count(*) from conv_org where org_id = 'TEST_ORG'");
stmt.setFetchSize(1000);
ResultSet rs = session.execute(stmt);
I am unsure the above code will work as I do not need a result set back I just need a count.
Here is the data model.
CREATE TABLE ts.conv_org (
org_id text,
create_time timestamp,
test_id text,
org_type int,
PRIMARY KEY (org_id, create_time, conv_id)
)
If org_id isn't your primary key counting in cassandra in general is not a fast operation and can easily lead to a full scan of all sstables in your cluster and therefore be painfully slow.
In Java for example you can do something like this:
ResultSet rs = session.execute(...);
Iterator<Row> iter = rs.iterator();
while (iter.hasNext()) {
if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
rs.fetchMoreResults();
Row row = iter.next()
... process the row ...
}
https://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/ResultSet.html
You could select a small colum and count your self. There is int getAvailableWithoutFetching() and isFullyFetched() that could help you.
In general if you really need a count - maintain it yourself.
On the other hand, if you have really many rows in one partition you can have also some other performance problems.
But that's hard to say without knowing the data model.
Maybe you want to use "counter table" in addition to your dataset.
Pros: get counter fast.
Cons: need to maintained that table.
Reference:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCountersConcept.html

Cassandra SELECT on secondary index doesn't return row

I am dealing with a puzzling behaviour when doing SELECTs on Cassandra 2.2.3. I have 4 nodes in the ring, and I create the following keyspace, table and index.
CREATE KEYSPACE IF NOT EXISTS my_keyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE my_keyspace.my_table (
id text,
some_text text,
code text,
some_set set<int>,
a_float float,
name text,
type int,
a_double double,
another_set set<int>,
another_float float,
yet_another_set set<text>,
PRIMARY KEY (id, some_text, code)
) WITH read_repair_chance = 0.0
AND dclocal_read_repair_chance = 0.1
AND gc_grace_seconds = 864000
AND bloom_filter_fp_chance = 0.01
AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
AND comment = ''
AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
AND default_time_to_live = 0
AND speculative_retry = '99.0PERCENTILE'
AND min_index_interval = 128
AND max_index_interval = 2048;
CREATE INDEX idx_my_table_code ON my_keyspace.my_table (code);
Then I insert some rows on the table. Some of them have empty sets. I perform this query through the default CQL client and get the row I am expecting:
SELECT * FROM my_table WHERE code = 'test';
Then I run some tests which are outside my control. I don't know what they do but I expect they read and possibly insert/update/delete some rows. I'm sure they don't delete or change any of the settings in the index, table or keyspace.
After the tests, I log in again through the default CQL client and run the following queries.
SELECT * FROM my_table WHERE code = 'test';
SELECT * FROM my_table;
SELECT * FROM my_table WHERE id = 'my_id' AND some_text = 'whatever' AND code = 'test';
The first one doesn't return anything.
The second one returns all the rows, including the one with code = 'test'.
The third one returns the expected row that the first query couldn't retrieve.
The only difference that I can see between this row and others is that it is one of the rows which contains some empty sets, as explained earlier. If I query for another of the rows that also contain some empty sets, I get the same behavior.
I would say the problem is related to the secondary index. Somehow, the operations performed during the tests leave the index in an state where it cannot see certain rows.
I'm obviously missing something. Do you have any ideas about what could cause this behavior?
Thanks in advance.
UPDATE:
I worked around the issue, but now I found the same problem somewhere else. Since the issue first happened, I found out more about the operations performed before the error: updates on specific columns that set a TTL for said columns. After some investigation I found some Jira issues which could be related to this problem:
https://issues.apache.org/jira/browse/CASSANDRA-6782
https://issues.apache.org/jira/browse/CASSANDRA-8206
However, those issues seem to have been solved on 2.0 and 2.1, and I'm using 2.2. I think these changes are included in 2.2, but I could be mistaken.
The main problem is the the type of query you are running on Cassandra.
The Cassadra data model is query driven, tables are recomputed to serve the query.
Tables are created by using well defined Primary Key (Partition Key & clustring key). Cassandra is not good for full table scan type of queries.
Now coming to your queries.
SELECT * FROM my_table WHERE code = 'test';
Here the column used is clustring column and it the equality search column it should be part of Partition Key. Clustring key will be present in different partitions so if Read consistency level is one it may give empty result.
SELECT * FROM my_table;
Cassandra is not good for this kind of table scan query. Here it will search all the table and get all the rows (poor querying).
SELECT * FROM my_table
WHERE id = 'my_id' AND some_text = 'whatever' AND code = 'test';
Here you mentioned everything so the correct results were returned.
I opened a Jira issue and the problem was fixed on 2.1.18 and 2.2.10:
https://issues.apache.org/jira/browse/CASSANDRA-13412
I speak just from what I read in the Jira issue. I didn't test the above scenario again after the fix was implemented because by then I had moved to the 3.0 version.
In the end though I ended up removing almost every use of secondary indices in my application, as I learned that they led to bad performance.
The reason is that in most cases they will result in fan-out queries that will contact every node of the cluster, with the corresponding costs.
There are still some cases where they can perform well, e.g. when you query by partition key at the same time, as no other nodes will be involved.
But for anything else, my advice is: consider if you can remove your secondary indices and do lookups in auxiliary tables instead. You'll have the burden of maintaining the tables in sync, but performance should be better.

Spark Cassandra Connector keyBy and shuffling

I am trying to optimize my spark job by avoiding shuffling as much as possible.
I am using cassandraTable to create the RDD.
The column family's column names are dynamic, thus it is defined as follows:
CREATE TABLE "Profile" (
key text,
column1 text,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='ALL' AND
...
This definition results in CassandraRow RDD elements in the following format:
CassandraRow <key, column1, value>
key - the RowKey
column1 - the value of column1 is the name of the dynamic column
value - the value of the dynamic column
So if I have RK='profile1', with columns name='George' and age='34', the resulting RDD will be:
CassandraRow<key=profile1, column1=name, value=George>
CassandraRow<key=profile1, column1=age, value=34>
Then I need to group elements that share the same key together to get a PairRdd:
PairRdd<String, Iterable<CassandraRow>>
Important to say, that all the elements I need to group are in the same Cassandra node (share the same row key), so I expect the connector to keep the locality of the data.
The problem is that using groupBy or groupByKey causes shuffling. I rather group them locally, because all the data is on the same node:
JavaPairRDD<String, Iterable<CassandraRow>> rdd = javaFunctions(context)
.cassandraTable(ks, "Profile")
.groupBy(new Function<ColumnFamilyModel, String>() {
#Override
public String call(ColumnFamilyModel arg0) throws Exception {
return arg0.getKey();
}
})
My questions are:
Does using keyBy on the RDD will cause shuffling, or will it keep the data locally?
Is there a way to group the elements by key without shuffling? I read about mapPartitions, but didn't quite understand the usage of it.
Thanks,
Shai
I think you are looking for spanByKey, a cassandra-connector specific operation that takes advantage of the ordering provided by cassandra to allow grouping of elements without incurring in a shuffle stage.
In your case, it should look like:
sc.cassandraTable("keyspace", "Profile")
.keyBy(row => (row.getString("key")))
.spanByKey
Read more in the docs:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md#grouping-rows-by-partition-key

Having trouble querying by dates using the Java Cassandra Spark SQL Connector

I'm trying to use Spark SQL to query a table by a date range. For example, I'm trying to run an SQL statement like: SELECT * FROM trip WHERE utc_startdate >= '2015-01-01' AND utc_startdate <= '2015-12-31' AND deployment_id = 1 AND device_id = 1. When I run the query no error is being thrown but I'm not receiving any results back when I would expect some. When running the query without the date range I am getting results back.
SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("SparkTest")
.set("spark.executor.memory", "1g")
.set("spark.cassandra.connection.host", "localhost")
.set("spark.cassandra.connection.native.port", "9042")
.set("spark.cassandra.connection.rpc.port", "9160");
JavaSparkContext context = new JavaSparkContext(sparkConf);
JavaCassandraSQLContext sqlContext = new JavaCassandraSQLContext(context);
sqlContext.sqlContext().setKeyspace("mykeyspace");
String sql = "SELECT * FROM trip WHERE utc_startdate >= '2015-01-01' AND utc_startdate < '2015-12-31' AND deployment_id = 1 AND device_id = 1";
JavaSchemaRDD rdd = sqlContext.sql(sql);
List<Row> rows = rdd.collect(); // rows.size() is zero when I would expect it to contain numerous rows.
Schema:
CREATE TABLE trip (
device_id bigint,
deployment_id bigint,
utc_startdate timestamp,
other columns....
PRIMARY KEY ((device_id, deployment_id), utc_startdate)
) WITH CLUSTERING ORDER BY (utc_startdate ASC);
Any help would be greatly appreciated.
What does your table schema (in particular, your PRIMARY KEY definition) look like? Even without seeing it, I am fairly certain that you are seeing this behavior because you are not qualifying your query with a partition key. Using the ALLOW FILTERING directive will filter the rows by date (assuming that is your clustering key), but that is not a good solution for a big cluster or large dataset.
Let's say that you are querying users in a certain geographic region. If you used region as a partition key, you could run this query, and it would work:
SELECT * FROM users
WHERE region='California'
AND date >= '2015-01-01' AND date <= '2015-12-31';
Give Patrick McFadin's article on Getting Started with Timeseries Data a read. That has some good examples that should help you.

non-ordinal access to rows returned by Spark SQL query

In the Spark documentation, it is stated that the result of a Spark SQL query is a SchemaRDD. Each row of this SchemaRDD can in turn be accessed by ordinal. I am wondering if there is any way to access the columns using the field names of the case class on top of which the SQL query was built. I appreciate the fact that the case class is not associated with the result, especially if I have selected individual columns and/or aliased them: however, some way to access fields by name rather than ordinal would be convenient.
A simple way is to use the "language-integrated" select method on the resulting SchemaRDD to select the column(s) you want -- this still gives you a SchemaRDD, and if you select more than one column then you will still need to use ordinals, but you can always select one column at a time. Example:
// setup and some data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Score(name: String, value: Int)
val scores =
sc.textFile("data.txt").map(_.split(",")).map(s => Score(s(0),s(1).trim.toInt))
scores.registerAsTable("scores")
// initial query
val original =
sqlContext.sql("Select value AS myVal, name FROM scores WHERE name = 'foo'")
// now a simple "language-integrated" query -- no registration required
val secondary = original.select('myVal)
secondary.collect().foreach(println)
Now secondary is a SchemaRDD with just one column, and it works despite the alias in the original query.
Edit: but note that you can register the resulting SchemaRDD and query it with straight SQL syntax without needing another case class.
original.registerAsTable("original")
val secondary = sqlContext.sql("select myVal from original")
secondary.collect().foreach(println)
Second edit: When processing an RDD one row at a time, it's possible to access the columns by name by using the matching syntax:
val secondary = original.map {case Row(myVal: Int, _) => myVal}
although this could get cumbersome if the right hand side of the '=>' requires access to a lot of the columns, as they would each need to be matched on the left. (This from a very useful comment in the source code for the Row companion object)

Resources