How to improve Cassandra 3.0 read performance and throughput using async queries?

I have a table:
CREATE TABLE my_table (
    user_id text,
    ad_id text,
    date timestamp,
    PRIMARY KEY (user_id, ad_id)
);
The user_id and ad_id values I use are no longer than 15 characters.
I query the table like this:
Set<String> users = ... filled somewhere
Session session = ... built somewhere
BoundStatement boundQuery = ... built somewhere
(using the query: "SELECT * FROM my_table WHERE user_id=?")
List<Row> rowAds =
    users.stream()
         .map(user -> session.executeAsync(boundQuery.bind(user)))
         .map(ResultSetFuture::getUninterruptibly)
         .map(ResultSet::all)
         .flatMap(List::stream)
         .collect(toList());
The Set of users has approximately 3000 elements, and each user has approximately 300 ads.
This code is executed in 50 threads on the same machine (with different users), all using the same Session object.
The algorithm takes between 2 and 3 seconds to complete.
The Cassandra cluster has 3 nodes, with a replication factor of 2. Each node has 6 cores and 12 GB of RAM.
The Cassandra nodes are at 60% of their CPU capacity, 33% of RAM, and 66% of RAM including page cache.
The querying machine is at 50% of its CPU capacity and 50% of its RAM.
How do I improve the read time to less than 1 second?
Thanks!
UPDATE:
After some answers (thank you very much), I realized that I wasn't doing the queries in parallel, so I changed the code to:
List<Row> rowAds =
    users.stream()
         .map(user -> session.executeAsync(boundQuery.bind(user)))
         .collect(toList())
         .stream()
         .map(ResultSetFuture::getUninterruptibly)
         .map(ResultSet::all)
         .flatMap(List::stream)
         .collect(toList());
So now the queries are being done in parallel. This gave me times of approximately 300 milliseconds, so a great improvement there!
But my question remains: can it be faster?
Again, thanks!

users.stream()
     .map(user -> session.executeAsync(boundQuery.bind(user)))
     .map(ResultSetFuture::getUninterruptibly)
     .map(ResultSet::all)
     .flatMap(List::stream)
     .collect(toList());
A remark: on the second map() you're calling ResultSetFuture::getUninterruptibly. It's a blocking call, so you don't benefit much from asynchronous execution.
Instead, try to transform the list of futures returned by the driver (hint: ResultSetFuture implements Guava's ListenableFuture interface) into a future of a list.
See: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/util/concurrent/Futures.html#successfulAsList(java.lang.Iterable)
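For illustration, here is a minimal sketch of that idea against the code from the question, assuming the same users, session and boundQuery objects and the DataStax Java driver 3.x with Guava on the classpath (the surrounding setup is an assumption, it is not shown in the original post):
import java.util.List;
import java.util.Objects;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import static java.util.stream.Collectors.toList;

// Fire off all queries first; nothing blocks here.
List<ResultSetFuture> futures =
    users.stream()
         .map(user -> session.executeAsync(boundQuery.bind(user)))
         .collect(toList());

// ResultSetFuture implements ListenableFuture<ResultSet>, so Guava can combine
// the whole list into a single future of a list. successfulAsList substitutes
// null for failed queries; use allAsList instead if you prefer to fail fast.
ListenableFuture<List<ResultSet>> allResults = Futures.successfulAsList(futures);

// Block only once, when all queries have completed (or failed).
List<Row> rowAds =
    Futures.getUnchecked(allResults).stream()
           .filter(Objects::nonNull)
           .map(ResultSet::all)
           .flatMap(List::stream)
           .collect(toList());
To stay fully non-blocking you could instead attach a callback to allResults (for example with Futures.addCallback) and process the rows when it completes.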

Related

Resolving performance issue with skewed partitions in PySpark Window function

I am attempting to calculate some moving averages in Spark, but am running into issues with skewed partitions. Here is the simple calculation I'm trying to perform:
Getting the base data
# Variables
one_min = 60
one_hour = 60*one_min
one_day = 24*one_hour
seven_days = 7*one_day
thirty_days = 30*one_day
# Column variables
target_col = "target"
partition_col = "partition_col"
df_base = (
    spark
    .sql("SELECT * FROM {base}".format(base=base_table))
)
df_product1 = (
    df_base
    .where(F.col("product_id") == F.lit("1"))
    .select(
        F.col(target_col).astype("double").alias(target_col),
        F.unix_timestamp("txn_timestamp").alias("window_time"),
        "transaction_id",
        partition_col
    )
)
df_product1.persist()
Calculating running averages
window_lengths = {
    "1day": one_day,
    "7day": seven_days,
    "30day": thirty_days
}
# Create window specs for each type
part_windows = {
    time: Window.partitionBy(F.col(partition_col))
                .orderBy(F.col("window_time").asc())
                .rangeBetween(-secs, -one_min)
    for (time, secs) in window_lengths.items()
}
cols = [
    # Note: not using `avg` as I will be smoothing this at some point
    (F.sum(target_col).over(win) / F.count("*").over(win)).alias(
        "{time}_avg_target".format(time=time)
    )
    for time, win in part_windows.items()
]
sample_df = (
    df_product1
    .repartition(2000, partition_col)
    .sortWithinPartitions(F.col("window_time").asc())
    .select(
        "*",
        *cols
    )
)
Now, I can collect a limited subset of these data (say just 100 rows), but if I try to run the full query and, for example, aggregate the running averages, Spark gets stuck on some particularly large partitions. The vast majority of the partitions have fewer than 1 million records in them. Only about 50 of them have more than 1M records, and only about 150 have more than 500K.
However, a small handful have more than 2.5M (~10), and 3 of them have more than 5M records. These partitions have run for more than 12 hours and failed to complete. The skew in these partitions is a natural part of the data, representing larger activity for these distinct values of the partitioning column. I have no control over the definition of the values of this partitioning column.
I am using a SparkSession with dynamic allocation enabled, 32G of RAM and 4 cores per executor, and 4 executors minimum. I have attempted to up the executors to 96G with 8 cores per executor and 10 executors minimum, but the job still does not complete.
This seems like a calculation which shouldn't take 13 hours to complete. The df_product1 DataFrame contains just shy of 300M records.
If there is other information that would be helpful in resolving this problem, please comment below.

Deduplicating an ArangoDB document collection

I'm sure there is an easy and fast way to do this, but it's escaping me. I have a large dataset that has some duplicate records, and I want to get rid of the duplicates. (The duplicates are uniquely identified by one property, but the rest of the document should be identical as well.)
I've attempted to create a new collection that only has unique values a few different ways, but they are all quite slow. For example:
FOR doc IN Documents
COLLECT docId = doc.myId, doc2 = doc
INSERT doc2 IN Documents2
or
FOR doc IN Documents
LET existing = (FOR doc2 IN Documents2
FILTER doc.myId == doc2.myId
RETURN doc2)
UPDATE existing WITH doc IN Documents2
or (this gives me a "violated unique constraint" error)
FOR doc IN Documents
UPSERT {myId: doc.myId}
INSERT doc
UPDATE doc IN Documents2
TL;DR
It does not take that long to de-duplicate the records and write them to another collection (less than 60 seconds), at least on my desktop machine (Windows 10, Intel 6700K 4x4.0GHz, 32GB RAM, Evo 850 SSD).
Certain queries require proper indexing, however, or they will run forever. Indexes require some memory, but compared to the memory needed during query execution for grouping the records, it is negligible. If you're short of memory, performance will suffer because the operating system needs to swap data between memory and mass storage. This is especially a problem with spinning disks, not so much with fast flash storage devices.
Preparation
I generated 2.2 million records with 5-20 random attributes and 160 chars of gibberish per attribute. In addition, every record has an attribute myid. 1.87m records have a unique myid, 60k myids exist twice, and 70k three times (200k redundant documents in total). The collection size was reported as 4.83GB:
// 1..2000000: 300s
// 1..130000: 20s
// 1..70000: 10s
FOR i IN 1..2000000
LET randomAttributes = MERGE(
FOR j IN 1..FLOOR(RAND() * 15) + 5
RETURN { [CONCAT("attr", j)]: RANDOM_TOKEN(160) }
)
INSERT MERGE(randomAttributes, {myid: i}) INTO test1
Memory consumption before starting ArangoDB was at 3.4GB, after starting 4.0GB, and around 8.8GB after loading the test1 source collection.
Baseline
Reading from test1 and inserting all documents (2.2m) into test2 took 20s on my system, with a memory peak of ~17.6GB:
FOR doc IN test1
INSERT doc INTO test2
Grouping by myid without writing took approx. 9s for me, with 9GB RAM peak during query:
LET result = (
FOR doc IN test1
COLLECT myid = doc.myid
RETURN 1
)
RETURN LENGTH(result)
Failed grouping
I tried your COLLECT docId = doc.myId, doc2 = doc approach on a dataset with just 3 records and one duplicate myid. It showed that the query does not actually group/remove duplicates. I therefore tried to find alternative queries.
Grouping with INTO
To group duplicate myids together but retain the possibility to access the full documents, COLLECT ... INTO can be used. I simply picked the first document of every group to remove redundant myids. The query took about 40s for writing the 2m records with unique myid attribute to test2. I didn't measure memory consumption accurately, but I saw different memory peaks spanning 14GB to 21GB. Maybe truncating the test collections and re-running the queries increases the required memory because of some stale entries that get in the way somehow (compaction / key generation)?
FOR doc IN test1
COLLECT myid = doc.myid INTO groups
INSERT groups[0].doc INTO test2
Grouping with subquery
The following query showed a more stable memory consumption, peaking at 13.4GB:
FOR doc IN test1
COLLECT myid = doc.myid
LET doc2 = (
FOR doc3 IN test1
FILTER doc3.myid == myid
LIMIT 1
RETURN doc3
)
INSERT doc2[0] INTO test2
Note however that it required a hash index on myid in test1 to achieve a query execution time of ~38s. Otherwise the subquery will cause millions of collection scans and take ages.
Grouping with INTO and KEEP
Instead of storing the whole documents that fell into a group, we can assign just the _id to a variable and KEEP it so that we can look up the document bodies using DOCUMENT():
FOR doc IN test1
LET d = doc._id
COLLECT myid = doc.myid INTO groups KEEP d
INSERT DOCUMENT(groups[0].d) INTO test2
Memory usage: 8.1GB after loading the source collection, 13.5GB peak during the query. It only took 30 seconds for the 2m records!
Grouping with INTO and projection
Instead of KEEP I also tried a projection out of curiosity:
FOR doc IN test1
COLLECT myid = doc.myid INTO groups = doc._id
INSERT DOCUMENT(groups[0]) INTO test2
RAM was at 8.3GB after loading test1, and the peak at 17.8GB (there were actually two heavy spikes during the query execution, both going over 17GB). It took 35s to complete for the 2m records.
Upsert
I tried something with UPSERT, but saw some strange results. It turned out to be an oversight in ArangoDB's upsert implementation. v3.0.2 contains a fix and I get correct results now:
FOR doc IN test1
UPSERT {myid: doc.myid}
INSERT doc
UPDATE {} IN test2
It took 40s to process with a (unique) hash index on myid in test2, with a RAM peak around 13.2GB.
Delete duplicates in-place
I first copied all documents from test1 to test2 (2.2m records), then I tried to REMOVE just the duplicates in test2:
FOR doc IN test2
COLLECT myid = doc.myid INTO keys = doc._key
LET allButFirst = SLICE(keys, 1) // or SHIFT(keys)
FOR k IN allButFirst
REMOVE k IN test2
Memory was at 8.2GB (with only test2 loaded) and went up to 13.5GB during the query. It took roughly 16 seconds to delete the duplicates (200k).
Verification
The following query groups the records by myid and counts how often every id occurs. Run against the target collection test2, the result should be {"1": 2000000}; otherwise there are still duplicates. I double-checked the query results above and everything checked out.
FOR doc IN test2
COLLECT myid = doc.myid WITH COUNT INTO count
COLLECT c = count WITH COUNT INTO cc
RETURN {[c]: cc}
Conclusion
The performance appears to be reasonable with ArangoDB v3.0, although it may degrade if not enough RAM is available. The different queries completed roughly within the same time, but showed different RAM usage characteristics. For certain queries, indexes are necessary to avoid high computational complexity (here: full collection scans; 2,200,000,000,000 reads in the worst case?).
Can you try my presented solutions on your data and check what the performance is on your machine?

Awfully slow execution on a small dataset – where to start debugging?

I do some experimentation on a MacBook (i5, 2.6GHz, 8GB RAM) with a Zeppelin notebook and Spark in standalone mode. spark.executor.memory and spark.driver.memory both get 2g. I have also set spark.serializer org.apache.spark.serializer.KryoSerializer in spark-defaults.conf, but that seems to be ignored by Zeppelin.
ALS model
I have trained an ALS model with ~400k (implicit) ratings and want to get recommendations with val allRecommendations = model.recommendProductsForUsers(1)
Sample set
Next I take a sample to play around with
val sampledRecommendations = allRecommendations.sample(false, 0.05, 1234567).cache
This contains 3600 recommendations.
Remove product recommendations that users own
Next I want to remove all ratings for products that a given user already owns; that list is held in an RDD of the form (user_id, Set[product_ids]): RDD[(Long, scala.collection.mutable.HashSet[Int])]
val productRecommendations = (sampledRecommendations
  // add user portfolio to the list, but convert the key from Long to Int first
  .join(usersProductsFlat.map( up => (up._1.toInt, up._2) ))
  .mapValues(
    // (user, (ratings: Array[Rating], usersOwnedProducts: HashSet[Long]))
    r => (r._1
      .filter( rating => !r._2.contains(rating.product))
      .filter( rating => rating.rating > 0.5)
      .toList
    )
  )
  // In case there is no recommendation (left), remove the entry
  .filter(rating => !rating._2.isEmpty)
).cache
Question 1
Calling this (productRecommendations.count) on the cached sample set generates a stage that includes flatMap at MatrixFactorizationModel.scala:278 with 10,000 tasks, 263.6 MB of input data and 196.0 MB of shuffle write. Shouldn't the tiny, cached RDD be used instead, and what is going wrong here? The execution of the count takes almost 5 minutes!
Question 2
Calling usersProductsFlat.count, which is fully cached according to the "Storage" view in the application UI, takes ~60 seconds each time. It's 23 MB in size – shouldn't that be a lot faster?
Map to readable form
Next I bring this in some readable form replacing IDs with names from a broadcasted lookup Map to put into a DF/table:
val readableRatings = (productRecommendations
  .flatMapValues(x => x)
  .map( r => (r._1, userIdToMailBC.value(r._1), r._2.product.toInt, productIdToNameBC.value(r._2.product), r._2.rating))
).cache
val readableRatingsDF = readableRatings.toDF("user", "email", "product_id", "product", "rating").cache
readableRatingsDF.registerTempTable("recommendations")
Select … with patience
The insane part starts here. Doing a SELECT takes several hours (I could never wait for one to finish):
%sql
SELECT COUNT(user) AS usr_cnt, product, AVG(rating) AS avg_rating
FROM recommendations
GROUP BY product
I don't know where to look to find the bottlenecks; there is obviously some huge kerfuffle going on! Where can I start looking?
Your number of partitions may be too large. I think you should use about 200 when running in local mode rather than 10000. You can set the number of partitions in different ways. I suggest you edit the spark.default.parallelism flag in the Spark configuration file.
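For example, a minimal sketch of that setting in conf/spark-defaults.conf (the value 200 is only the suggestion above and should be tuned to the number of local cores; it can also be passed per submission):
# conf/spark-defaults.conf
spark.default.parallelism    200

# or equivalently, per submission:
# spark-submit --conf spark.default.parallelism=200 ...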

Understanding Cassandra, using C/C++ Driver

I have an application written in C++ that uses the DataStax C++ Driver to communicate with Cassandra.
I run 20 million inserts and then use 50 queries to read those 20 million rows. I have limited my partition key to 50 different possible values so the number of row partitions is a maximum of 50. Also, each query then returns around 300,000 - 400,000 rows.
I am keeping a track of the wall clock time for different parts of this application. The following piece of code that executes the query and gets result takes on an average 3 seconds to complete, which seems reasonable to me.
stopWatch.start();
CassFuture* result_future = cass_session_execute(session, statement);
if (cass_future_error_code(result_future) == CASS_OK) {
    const CassResult* result = cass_future_get_result(result_future);
}
stopWatch.stop();
However, the following piece of code that iterates through the rows takes around 30 seconds on average!
resWatch.start();
CassIterator* rows = cass_iterator_from_result(result);
while (cass_iterator_next(rows)) {
    const CassRow* row = cass_iterator_get_row(rows);
    BAEL_LOG_INFO << "got a row " << BAEL_LOG_END;
}
resWatch.stop();
I realize that the CassIterator could be iterating over some 400,000 rows, but is 30 seconds a reasonable time to achieve this?
Or is there something I'm missing about the way Cassandra functions: do cass_session_execute() and cass_future_get_result() not fetch all the rows relevant to the executed query and return them to the client? Or do they fetch them lazily?

Spark RDD.isEmpty takes a lot of time

I built a Spark cluster.
Workers: 2
Cores: 12
Memory: 32.0 GB total, 20.0 GB used
Each worker gets 1 CPU, 6 cores and 10.0 GB of memory.
My program gets its data from a MongoDB cluster. The Spark and MongoDB clusters are on the same LAN (1000 Mbps).
MongoDB document format:
{name:string, value:double, time:ISODate}
There are about 13 million documents.
I want to get the average value for a specific name over a specific hour, which contains 60 documents.
Here is my key function:
/*
 * rdd = sc.newAPIHadoopRDD(configOriginal, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])
 * Apache-Spark-1.3.1 Scala doc: SparkContext.newAPIHadoopFile[K, V, F <: InputFormat[K, V]](path: String, fClass: Class[F], kClass: Class[K], vClass: Class[V], conf: Configuration = hadoopConfiguration): RDD[(K, V)]
 */
def findValueByNameAndRange(rdd: RDD[(Object, BSONObject)], name: String, time: Date): RDD[BasicBSONObject] = {
  val nameRdd = rdd.map(arg => arg._2).filter(_.get("name").equals(name))
  val timeRangeRdd1 = nameRdd.map(tuple => (tuple, tuple.get("time").asInstanceOf[Date]))
  val timeRangeRdd2 = timeRangeRdd1.map(tuple => (tuple._1, duringTime(tuple._2, time, getHourAgo(time, 1))))
  val timeRangeRdd3 = timeRangeRdd2.filter(_._2).map(_._1)
  val timeRangeRdd4 = timeRangeRdd3.map(x => (x.get("name").toString, x.get("value").toString.toDouble)).reduceByKey(_ + _)
  if (timeRangeRdd4.isEmpty()) {
    return basicBSONRDD(name, time)
  }
  else {
    return timeRangeRdd4.map(tuple => {
      val bson = new BasicBSONObject()
      bson.put("name", tuple._1)
      bson.put("value", tuple._2 / 60)
      bson.put("time", time)
      bson
    })
  }
}
Here is part of the job information.
My program runs very slowly. Is it because of isEmpty and reduceByKey? If so, how can I improve it? If not, why?
UPDATE:
timeRangeRdd3.map(x => (x.get("name").toString, x.get("value").toString.toDouble)).reduceByKey(_ + _)
is on line 34.
I know reduceByKey is a global operation and may cost a lot of time; however, what it costs is beyond my budget. How can I improve it, or is it a defect of Spark? With the same calculation and hardware, it takes just a few seconds if I use multiple Java threads.
First, isEmpty is merely the point at which the RDD stage ends. The maps and filters do not create a need for a shuffle, and the method shown in the UI is always the method that triggers a stage change/shuffle... in this case isEmpty. Why it's running slowly is not as easy to discern from this perspective, especially without seeing the composition of the originating RDD. I can tell you that isEmpty first checks the partition size and then does a take(1), verifying whether data was returned or not (see the sketch below). So, the odds are that there is a bottleneck in the network or something else blocking along the way. It could even be a GC issue... Click into the isEmpty stage and see what more you can discern from there.
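For illustration only, here is roughly what that check amounts to, expressed against the Java RDD API (a sketch of the behaviour described above, not the actual Spark source):
import org.apache.spark.api.java.JavaRDD;

// Sketch: isEmpty short-circuits when there are no partitions, otherwise it
// runs a take(1) job, which forces the upstream lineage (the maps, filters
// and reduceByKey feeding the RDD) to be computed for that stage.
static <T> boolean isEmptySketch(JavaRDD<T> rdd) {
    return rdd.partitions().size() == 0 || rdd.take(1).isEmpty();
}
In other words, the time the UI attributes to isEmpty largely reflects computing the upstream RDD, not the emptiness check itself.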
