We have a facility in our web app to delete large quantities of data. We do this by paginating through all records found against u_id.
The keys we have are designed for the other queries in the application - ideally, we would have a primary key on u_id, but that would break all our other queries.
The method below works well most of the time; however, after deleting approximately 6-8 million records, we get:
Dse\Exception\RuntimeException: All connections on all I/O threads are busy
We also sometimes get a slightly different error message:
Dse\Exception\ReadTimeoutException: Operation timed out - received only 0 responses
You'll notice usleep(2500000) in the code below, which pauses the script. This has been our workaround, but it would be good to get this resolved properly, as Cassandra should be able to handle this number of deletes.
$cluster = \Dse::cluster()
    ->withDefaultTimeout(3600)
    ->withContactPoints(env('CA_HOST'))
    ->build();
$session = $cluster->connect(env('CONNECT'));

$options = array('page_size' => 50);
$results = $session->execute("SELECT * FROM datastore WHERE u_id = $u_id;", $options);

$future_deletes = array();

while (true) {
    foreach ($results as $result) {
        $future_deletes[] = $session->executeAsync("DELETE FROM datastore WHERE record_id = '" . $result['record_id'] . "' AND record_version = " . $result['record_version'] . " AND user_id = " . $result['user_id']);
        $future_deletes[] = $session->executeAsync("UPDATE data_count SET u_count = u_count - 1 WHERE u_id = " . $u_id);
    }

    if (!empty($future_deletes)) {
        foreach ($future_deletes as $future_delete) {
            // we will not wait for each result for more than 5 seconds
            $future_delete->get(5);
        }
        //usleep(2500000); // 2.5 seconds
    }

    $future_deletes = array();

    if ($results->isLastPage()) {
        break;
    }
    $results = $results->nextPage();
}

// Disconnect
$session = NULL;
For your reference, here are our tables:
CREATE TABLE datastore (
    id uuid,
    record_id varchar,
    record_version int,
    user_id int,
    u_id int,
    column_1 varchar,
    column_2 varchar,
    column_3 varchar,
    column_4 varchar,
    column_5 varchar,
    PRIMARY KEY ((record_id), record_version, user_id)
);
CREATE INDEX u_id ON datastore (u_id);
CREATE TABLE data_count (u_id int PRIMARY KEY, u_count counter);
We are running a server with 8GB RAM.
The version of the DSE driver is 6.0.1.
Thank you in advance!
You need to control how many "in-flight" requests you have at the same point in time. There is a limit on the number of queries per connection and on the number of connections. Both are controlled by the corresponding functions of the Cluster class (I can't find them quickly in the PHP docs, but they should be similar to the Cluster functions in the C++ driver, because the PHP driver is built on top of the C++ driver).
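A minimal sketch of the throttling idea, shown here with the DataStax Java driver because the PHP driver exposes the same executeAsync/future pattern (in PHP you would get a similar effect by waiting on your $future_deletes array every N requests instead of once per page). The contact point, keyspace, u_id value and the cap of 128 are illustrative assumptions, not recommended values:
import java.util.concurrent.Semaphore;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.MoreExecutors;

public class ThrottledDelete {
    public static void main(String[] args) throws Exception {
        final int maxInFlight = 128;          // hypothetical cap; tune for your cluster
        final Semaphore inFlight = new Semaphore(maxInFlight);
        final int uId = 123;                  // hypothetical u_id being purged

        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {   // hypothetical keyspace

            PreparedStatement delete = session.prepare(
                "DELETE FROM datastore WHERE record_id = ? AND record_version = ? AND user_id = ?");

            // The driver pages through this result set transparently.
            for (Row row : session.execute(
                    "SELECT record_id, record_version, user_id FROM datastore WHERE u_id = ?", uId)) {
                inFlight.acquire();           // blocks once maxInFlight requests are pending
                ResultSetFuture f = session.executeAsync(delete.bind(
                    row.getString("record_id"), row.getInt("record_version"), row.getInt("user_id")));
                // Release the permit when the server responds, whatever the outcome.
                f.addListener(inFlight::release, MoreExecutors.directExecutor());
            }

            // Wait for the remaining in-flight requests to drain before disconnecting.
            inFlight.acquire(maxInFlight);
        }
    }
}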
Related
Cassandra Newbie here. Cassandra v 3.9.
I'm modelling travellers' flight check-in data.
My main query criterion is searching for travellers within a date range (max of a 7-day window).
Here is what I've come up with, given my limited exposure to Cassandra:
CREATE TABLE IF NOT EXISTS travellers_checkin (
    checkinDay text,
    checkinTimestamp bigint,
    travellerName text,
    travellerPassportNo text,
    flightNumber text,
    "from" text,
    "to" text,
    bookingClass text,
    PRIMARY KEY (checkinDay, checkinTimestamp)
) WITH CLUSTERING ORDER BY (checkinTimestamp DESC);
Per day, I'm expecting up to a million records, so each partition will hold about a million rows.
Now my users want a search in which the date window is mandatory (max a one-week window). In this case, should I use an IN clause that spans multiple partitions? Is this the correct way, or should I think about re-modelling the data? Alternatively, I'm wondering whether issuing 7 queries (one per day) and merging the responses would be efficient.
Your data model seems good, but if you could add more fields to the partition key it would scale better. And you should use separate queries with executeAsync.
If you use an IN clause, you are waiting on a single coordinator node to give you a response; it keeps all those queries and their responses in the heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing.
Source : https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Instead of using an IN clause, use a separate query for each day and execute them with executeAsync.
Java example:
PreparedStatement statement = session.prepare(
    "SELECT * FROM travellers_checkin WHERE checkinDay = ? AND checkinTimestamp >= ? AND checkinTimestamp <= ?");
List<ResultSetFuture> futures = new ArrayList<>();
// daysInWindow, windowStartTs and windowEndTs stand in for the caller's (max 7-day) search window.
for (String day : daysInWindow) {
    ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(day, windowStartTs, windowEndTs));
    futures.add(resultSetFuture);
}
for (ResultSetFuture future : futures) {
    ResultSet rows = future.getUninterruptibly();
    // You get the result set of each query; merge them here.
}
SELECT count(*) FROM device_stats
WHERE orgid = 'XYZ'
AND regionid = 'NY'
AND campusid = 'C1'
AND buildingid = 'C1'
AND floorid = '2'
AND year = 2017;
The above CQL query returns the correct result (32032) in cqlsh.
But when I run the same query using the QueryBuilder Java API, I see the count as 0:
BuiltStatement summaryQuery = QueryBuilder.select()
    .countAll()
    .from("device_stats")
    .where(eq("orgid", "XYZ"))
    .and(eq("regionid", "NY"))
    .and(eq("campusid", "C1"))
    .and(eq("buildingid", "C1"))
    .and(eq("floorid", "2"))
    .and(eq("year", "2017"));

try {
    ResultSetFuture tagSummaryResults = session.executeAsync(summaryQuery);
    tagSummaryResults.getUninterruptibly().all().stream().forEach(result -> {
        System.out.println(" totalCount > " + result.getLong(0));
    });
} catch (Exception e) {
    e.printStackTrace();
}
I have only 20 partitions and 32032 rows per partition.
What could be the reason QueryBuilder is not executing the query correctly?
Schema :
CREATE TABLE device_stats (
orgid text,
regionid text,
campusid text,
buildingid text,
floorid text,
year int,
endofwindow timestamp,
categoryid timeuuid,
devicestats map<text,bigint>,
PRIMARY KEY ((orgid, regionid, campusid, buildingid, floorid,year),endofwindow,categoryid)
) WITH CLUSTERING ORDER BY (endofwindow DESC,categoryid ASC);
// Using the keys function to index the map keys
CREATE INDEX ON device_stats (keys(devicestats));
I am using Cassandra 3.10 and com.datastax.cassandra:cassandra-driver-core:3.1.4.
Moving my comment to an answer since that seems to solve the original problem:
Changing .and(eq("year", "2017")) to .and(eq("year", 2017)) solves the issue since year is an int and not a text.
I am trying to run a Spark job (which talks to Cassandra) to read data, do some aggregation, and then write the aggregates back to Cassandra.
I have 2 tables (monthly_active_users (MAU), daily_user_metric_aggregates (DUMA)).
For every record in MAU, there will be one or more records in DUMA.
The algorithm: get every record in MAU, fetch its user_id, then find the records in DUMA for that user (with server-side filters applied, like metric_name in ('ms', 'md')).
If there are one or more DUMA records for the specified where clause, then I increment the count in the appMauAggregate map (app-wise MAU counts).
I tested this algorithm and it works as expected, but I wanted to find out:
1) Is this an optimized algorithm, or is there a better way to do it? I have a sense that something is not right and I am not seeing speedups. It looks like a Cassandra client is being created and shut down for each Spark action (collect), and it takes a long time to process a small dataset.
2) Spark workers are not co-located with Cassandra, meaning the Spark workers run on different nodes (containers) than the C* nodes (we may move the Spark workers onto the C* nodes for data locality).
3) I am seeing a Spark job being created/submitted for every Spark action (collect), which I believe is expected behaviour from Spark. Is there any way to cut down the reads from C* and use joins so that data retrieval is fast?
4) What are the downsides of this algorithm? Can you recommend a better design approach with respect to partitioning strategy, loading C* partitions onto Spark partitions, and executor/driver memory requirements?
5) As long as the algorithm and design approach are fine, I can play around with Spark tuning. I am using 5 workers (each with 16 CPUs and 64 GB RAM).
C* Schema :
MAU:
CREATE TABLE analytics.monthly_active_users (
month text,
app_id uuid,
user_id uuid,
PRIMARY KEY (month, app_id, user_id)
) WITH CLUSTERING ORDER BY (app_id ASC, user_id ASC)
data:
cqlsh:analytics> select * from monthly_active_users limit 2;
month | app_id | user_id
--------+--------------------------------------+--------------------------------------
2015-2 | 108eeeb3-7ff1-492c-9dcd-491b68492bf2 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1
2015-2 | 108eeeb3-7ff1-492c-9dcd-491b68492bf2 | 2c70a31a-031c-4dbf-8dbd-e2ce7bdc2bc7
DUMA:
CREATE TABLE analytics.daily_user_metric_aggregates (
metric_date timestamp,
user_id uuid,
metric_name text,
"count" counter,
PRIMARY KEY (metric_date, user_id, metric_name)
) WITH CLUSTERING ORDER BY (user_id ASC, metric_name ASC)
data:
cqlsh:analytics> select * from daily_user_metric_aggregates where metric_date='2015-02-08' and user_id=199c0a31-8e74-46d9-9b3c-04f67d58b4d1;
metric_date | user_id | metric_name | count
--------------------------+--------------------------------------+-------------------+-------
2015-02-08 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1 | md | 1
2015-02-08 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1 | ms | 1
Spark Job :
import java.net.InetAddress
import java.util.concurrent.atomic.AtomicLong
import java.util.{Date, UUID}
import com.datastax.spark.connector.util.Logging
import org.apache.spark.{SparkConf, SparkContext}
import org.joda.time.{DateTime, DateTimeZone}
import scala.collection.mutable.ListBuffer

object MonthlyActiveUserAggregate extends App with Logging {
  val KeySpace: String = "analytics"
  val MauTable: String = "mau"
  val CassandraHostProperty = "CASSANDRA_HOST"
  val CassandraDefaultHost = "127.0.0.1"
  val CassandraHost = InetAddress.getByName(sys.env.getOrElse(CassandraHostProperty, CassandraDefaultHost))

  val conf = new SparkConf().setAppName(getClass.getSimpleName)
    .set("spark.cassandra.connection.host", CassandraHost.getHostAddress)
  lazy val sc = new SparkContext(conf)

  import com.datastax.spark.connector._

  def now = new DateTime(DateTimeZone.UTC)
  val metricMonth = now.getYear + "-" + now.getMonthOfYear

  private val mauMonthSB: StringBuilder = new StringBuilder
  mauMonthSB.append(now.getYear).append("-")
  if (now.getMonthOfYear < 10) mauMonthSB.append("0")
  mauMonthSB.append(now.getMonthOfYear).append("-")
  if (now.getDayOfMonth < 10) mauMonthSB.append("0")
  mauMonthSB.append(now.getDayOfMonth)
  private val mauMonth: String = mauMonthSB.toString()

  val dates = ListBuffer[String]()
  for (day <- 1 to now.dayOfMonth().getMaximumValue) {
    val metricDate: StringBuilder = new StringBuilder
    metricDate.append(now.getYear).append("-")
    if (now.getMonthOfYear < 10) metricDate.append("0")
    metricDate.append(now.getMonthOfYear).append("-")
    if (day < 10) metricDate.append("0")
    metricDate.append(day)
    dates += metricDate.toString()
  }

  private val metricName: List[String] = List("ms", "md")

  val appMauAggregate = scala.collection.mutable.Map[String, scala.collection.mutable.Map[UUID, AtomicLong]]()

  case class MAURecord(month: String, appId: UUID, userId: UUID) extends Serializable
  case class DUMARecord(metricDate: Date, userId: UUID, metricName: String) extends Serializable
  case class MAUAggregate(month: String, appId: UUID, total: Long) extends Serializable

  private val mau = sc.cassandraTable[MAURecord]("analytics", "monthly_active_users")
    .where("month = ?", metricMonth)
    .collect()

  mau.foreach { monthlyActiveUser =>
    val duma = sc.cassandraTable[DUMARecord]("analytics", "daily_user_metric_aggregates")
      .where("metric_date in ? and user_id = ? and metric_name in ?", dates, monthlyActiveUser.userId, metricName)
      //.map(_.userId).distinct().collect()
      .collect()
    if (duma.length > 0) { // if user has `ms` for the given month
      if (!appMauAggregate.isDefinedAt(mauMonth)) {
        appMauAggregate += (mauMonth -> scala.collection.mutable.Map[UUID, AtomicLong]())
      }
      val monthMap: scala.collection.mutable.Map[UUID, AtomicLong] = appMauAggregate(mauMonth)
      if (!monthMap.isDefinedAt(monthlyActiveUser.appId)) {
        monthMap += (monthlyActiveUser.appId -> new AtomicLong(0))
      }
      monthMap(monthlyActiveUser.appId).incrementAndGet()
    } else {
      println(s"No message_sent in daily_user_metric_aggregates for user: $monthlyActiveUser")
    }
  }

  for ((metricMonth: String, appMauCounts: scala.collection.mutable.Map[UUID, AtomicLong]) <- appMauAggregate) {
    for ((appId: UUID, total: AtomicLong) <- appMauCounts) {
      println(s"month: $metricMonth, app_id: $appId, total: $total")
      val collection = sc.parallelize(Seq(MAUAggregate(metricMonth.substring(0, 7), appId, total.get())))
      collection.saveToCassandra(KeySpace, MauTable, SomeColumns("month", "app_id", "total"))
    }
  }

  sc.stop()
}
Thanks.
Your solution is the least efficient possible. You are performing a join by looking up each key one-by-one, avoiding any possible parallelization.
I've never used the Cassandra connector, but I understand it returns RDDs. So you could do this:
import org.apache.spark.rdd.RDD

val mau: RDD[(UUID, MAURecord)] = sc
  .cassandraTable[MAURecord]("analytics", "monthly_active_users")
  .where("month = ?", metricMonth)
  .map(u => u.userId -> u) // Key by user ID.

val duma: RDD[(UUID, DUMARecord)] = sc
  .cassandraTable[DUMARecord]("analytics", "daily_user_metric_aggregates")
  .where("metric_date in ? and metric_name in ?", dates, metricName)
  .map(a => a.userId -> a) // Key by user ID.

// Count "duma" rows per user. (countByKey is an action that returns a Map on the
// driver, so use reduceByKey to keep the counts distributed as an RDD.)
val dumaCounts: RDD[(UUID, Long)] = duma.mapValues(_ => 1L).reduceByKey(_ + _)

// Join to "mau". This drops "mau" entries that have no count
// and "duma" entries that are not present in "mau".
val joined: RDD[(UUID, (MAURecord, Long))] = mau.join(dumaCounts)

// Get per-application counts.
val appCounts: RDD[(UUID, Long)] = joined
  .map { case (u, (mauRecord, count)) => mauRecord.appId -> 1L }
  .reduceByKey(_ + _)
There is a parameter, spark.cassandra.connection.keep_alive_ms, which controls how long unused connections are kept open. Take a look at the documentation page.
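For example, a minimal sketch of setting it when the configuration is built (shown with Spark's Java API; the .set call is identical in the Scala job above, and the 30-second value is just an illustrative assumption):
import org.apache.spark.SparkConf;

public class ConfExample {
    public static SparkConf build() {
        return new SparkConf()
            .setAppName("MonthlyActiveUserAggregate")
            .set("spark.cassandra.connection.host", "127.0.0.1") // your Cassandra contact point
            // Keep unused Cassandra connections open between actions (value in milliseconds)
            // instead of tearing the connection down after every collect().
            .set("spark.cassandra.connection.keep_alive_ms", "30000");
    }
}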
If you colocate the Spark workers with the Cassandra nodes, the connector will take advantage of this and create partitions appropriately, so that an executor always fetches data from its local node.
I can see some design improvements you can make to the DUMA table: metric_date does not seem to be the best choice for the partition key. Consider making (user_id, metric_name) the partition key, because then you will not have to generate dates for the query; you will just need to put user_id and metric_name in the where clause. Moreover, you can add a month identifier to the primary key; then each partition will include only the information related to what you want to fetch with each query.
Anyway, join functionality in the Spark-Cassandra-Connector is currently being implemented (see this ticket).
I have a table created like this:
CREATE TABLE messages (
stream int,
sequence int,
timestamp bigint,
message blob,
PRIMARY KEY (stream, sequence)
) WITH gc_grace_seconds = 0;
Running the following query in CQLSH works perfectly fine:
select * from messages where stream = 1 and sequence >= 1 and sequence <= 100;
However, when I try to run the same via Java driver I get the following exception:
com.datastax.driver.core.exceptions.InvalidQueryException: String didn't validate.
at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)
at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:271)
at com.datastax.driver.core.ResultSetFuture.getUninterruptibly(ResultSetFuture.java:187)
at com.datastax.driver.core.Session.execute(Session.java:126)
at com.datastax.driver.core.Session.execute(Session.java:100)
I'm using the parameterized query API:
public final String FETCH_CQL = "select stream, sequence, timestamp, message "
+ "from messages where stream = ? and sequence >= ? and sequence <= ?";
session.execute(FETCH_CQL, Integer.parseInt(stream), Integer.parseInt(fromSequence), Integer.parseInt(toSequence));
What gives? Overall setup works, as I have another query working on a different table.
Thanks!
Have you tried adding a ; to the end of your query?
public final String FETCH_CQL = "select stream, sequence, timestamp, message "
    + "from messages where stream = ? and sequence >= ? and sequence <= ?;";
Another option is to use prepared statements:
PreparedStatement FETCH_PS = session.prepare(FETCH_CQL);
BoundStatement boundStatement = FETCH_PS.bind(Integer.parseInt(stream), Integer.parseInt(fromSequence), Integer.parseInt(toSequence));
session.execute(boundStatement);
Define stream and sequence fields as type UUID
I have an Azure table which serves as an event log. I need the most efficient way to read the bottom of the table to retrieve the most recent entries.
What is the most efficient way of doing this?
First of all, I would really advise you to base your partition key on UTC ticks. You can do it in a way that orders all the entities from latest to oldest.
Then, if you want to get, let's say, the 100 latest logs, you just call (assuming query is an IQueryable from your favorite client - we use Lucifure Stash): query.Take(100);
If you want to fetch entities for a certain period, you write: query.Where(x => x.PartitionKey <= value); or something similar.
The "value" variable has to be constructed based on the way you construct the values for the partition key.
Assuming you want to fetch the data for the last 15 minutes, try this pseudo-code:
DateTime toDateTime = DateTime.UtcNow;
DateTime fromDateTime = toDateTime.AddMinutes(-15);
string myPartitionKeyFrom = fromDateTime.ToString("yy-MM");
string myPartitionKeyTo = toDateTime.ToString("yy-MM");
string query = "";
if (myPartitionKeyFrom.Equals(myPartitionKeyTo)) // In case both time periods fall in the same month, we can directly hit that partition.
{
    query += "(PartitionKey eq '" + myPartitionKeyFrom + "') ";
}
else // Otherwise we need a greater-than / less-than range.
{
    query += "(PartitionKey ge '" + myPartitionKeyFrom + "' and PartitionKey le '" + myPartitionKeyTo + "') ";
}
query += "and (RowKey ge '" + fromDateTime.ToString() + "' and RowKey le '" + toDateTime.ToString() + "')";
If you want to fetch the latest 'n' entries, then you need to slightly modify your PartitionKey and RowKey values so that the latest entries are pushed to the top of the table.
For this, you need to compute both keys using DateTime.MaxValue.Subtract(DateTime.UtcNow).Ticks instead of DateTime.UtcNow.
Microsoft provides a SemanticLogging framework that has a specific sink to log to Azure Table.
If you look at the library code, it generates a partition key (in reverse order) based on a DateTime:
static string GeneratePartitionKeyReversed(DateTime dateTime)
{
    dateTime = dateTime.AddMinutes(-1.0);
    return GetTicksReversed(
        new DateTime(dateTime.Year, dateTime.Month, dateTime.Day, dateTime.Hour, dateTime.Minute, 0));
}

static string GetTicksReversed(DateTime dateTime)
{
    return (DateTime.MaxValue - dateTime.ToUniversalTime())
        .Ticks.ToString("d19", (IFormatProvider)CultureInfo.InvariantCulture);
}
So you can implement the same logic in your application to build your partition key.
If you want to retrieve the logs for a specific date range, you can write a query that looks like this:
var minDate = GeneratePartitionKeyReversed(DateTime.UtcNow.AddHours(-2));
var maxDate = GeneratePartitionKeyReversed(DateTime.UtcNow.AddHours(-1));
// Get the cloud table
var cloudTable = GetCloudTable();
// Build the query
IQueryable<DynamicTableEntity> query = cloudTable.CreateQuery<DynamicTableEntity>();
// condition for max date
query = query.Where(a => string.Compare(a.PartitionKey, maxDate,
StringComparison.Ordinal) >= 0);
// condition for min date
query = query.Where(a => string.Compare(a.PartitionKey, minDate,
    StringComparison.Ordinal) <= 0);