I'm sure there is an easy and fast way to do this, but it's escaping me. I have a large dataset that has some duplicate records, and I want to get rid of the duplicates. (The duplicates are identified by a single property, but the rest of the document should be identical as well.)
I've attempted to create a new collection that only has unique values a few different ways, but they are all quite slow. For example:
FOR doc IN Documents
COLLECT docId = doc.myId, doc2 = doc
INSERT doc2 IN Documents2
or
FOR doc IN Documents
LET existing = (FOR doc2 IN Documents2
FILTER doc.myId == doc2.myId
RETURN doc2)
UPDATE existing WITH doc IN Documents2
or (this gives me a "violated unique constraint" error)
FOR doc IN Documents
UPSERT {myId: doc.myId}
INSERT doc
UPDATE doc IN Documents2
TL;DR
It does not take that long to de-duplicate the records and write them to another collection (less than 60 seconds), at least on my desktop machine (Windows 10, Intel 6700K 4x4.0GHz, 32GB RAM, Evo 850 SSD).
Certain queries require proper indexing, however, or they will take forever. Indexes require some memory, but compared to the memory needed during query execution for grouping the records, it is negligible. If you're short of memory, performance will suffer because the operating system needs to swap data between memory and mass storage. This is especially a problem with spinning disks, not so much with fast flash storage devices.
Preparation
I generated 2.2 million records with 5-20 random attributes and 160 chars of gibberish per attribute. In addition, every record has an attribute myid. 1.87m records have a unique id, 60k myids exist twice, and 70k three times. The collection size was reported as 4.83GB:
// 1..2000000: 300s
// 1..130000: 20s
// 1..70000: 10s
FOR i IN 1..2000000
LET randomAttributes = MERGE(
FOR j IN 1..FLOOR(RAND() * 15) + 5
RETURN { [CONCAT("attr", j)]: RANDOM_TOKEN(160) }
)
INSERT MERGE(randomAttributes, {myid: i}) INTO test1
Memory consumption before starting ArangoDB was at 3.4GB, after starting 4.0GB, and around 8.8GB after loading the test1 source collection.
Baseline
Reading from test1 and inserting all documents (2.2m) into test2 took 20s on my system, with a memory peak of ~17.6GB:
FOR doc IN test1
INSERT doc INTO test2
Grouping by myid without writing took approx. 9s for me, with 9GB RAM peak during query:
LET result = (
FOR doc IN test1
COLLECT myid = doc.myid
RETURN 1
)
RETURN LENGTH(result)
Failed grouping
I tried your COLLECT docId = doc.myId, doc2 = doc approach on a dataset with just 3 records and one duplicate myid. It showed that the query does not actually group/remove duplicates. I therefore tried to find alternative queries.
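Why the two-variable COLLECT keeps the duplicates can be illustrated with a small Python sketch (not AQL; the three sample records and their field names are made up for illustration). Grouping by the pair of id and whole document gives one group per document, because the documents still differ in system attributes such as _key, whereas grouping by the id alone collapses them:

```python
# Three hypothetical records; two share myId 1 but have different _key values,
# as documents stored in the same collection always do.
docs = [
    {"_key": "a", "myId": 1, "value": "x"},
    {"_key": "b", "myId": 1, "value": "x"},
    {"_key": "c", "myId": 2, "value": "y"},
]


def group_by(records, key_func):
    """Return a dict mapping each grouping key to the records in that group."""
    groups = {}
    for rec in records:
        groups.setdefault(key_func(rec), []).append(rec)
    return groups


# Analogue of COLLECT docId = doc.myId, doc2 = doc:
# the whole document is part of the grouping key.
by_id_and_doc = group_by(docs, lambda d: (d["myId"], tuple(sorted(d.items()))))

# Analogue of COLLECT myid = doc.myId: only the id is the grouping key.
by_id = group_by(docs, lambda d: d["myId"])

print(len(by_id_and_doc))  # number of groups when the document is part of the key
print(len(by_id))          # number of groups when only the id is the key
```

With the document in the key, nothing is merged; only the id-only grouping reduces the three records to two groups.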
Grouping with INTO
To group duplicate myids together but retain the possibility to access the full documents, COLLECT ... INTO can be used. I simply picked the first document of every group to remove redundant myids. The query took about 40s for writing the 2m records with unique myid attribute to test2. I didn't measure memory consumption accurately, but I saw different memory peaks spanning 14GB to 21GB. Maybe truncating the test collections and re-running the queries increases the required memory because of some stale entries that get in the way somehow (compaction / key generation)?
FOR doc IN test1
COLLECT myid = doc.myid INTO groups
INSERT groups[0].doc INTO test2
Grouping with subquery
The following query showed a more stable memory consumption, peaking at 13.4GB:
FOR doc IN test1
COLLECT myid = doc.myid
LET doc2 = (
FOR doc3 IN test1
FILTER doc3.myid == myid
LIMIT 1
RETURN doc3
)
INSERT doc2[0] INTO test2
Note however that it required a hash index on myid in test1 to achieve a query execution time of ~38s. Otherwise the subquery will cause millions of collection scans and take ages.
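To give a feeling for why the index matters, here is a rough Python sketch (an in-memory stand-in, not ArangoDB; the scaled-down data and the helper names are my own assumptions). It counts how many records a linear scan has to inspect for each lookup, compared with a single dictionary lookup per id when a hash index exists:

```python
def first_match_scan(records, myid):
    """Linear scan: return the first record with this myid and the number of records inspected."""
    inspected = 0
    for rec in records:
        inspected += 1
        if rec["myid"] == myid:
            return rec, inspected
    return None, inspected


def build_hash_index(records):
    """Map each myid to the first record carrying it (a single pass over the data)."""
    index = {}
    for rec in records:
        index.setdefault(rec["myid"], rec)
    return index


# Scaled-down stand-in for test1: ids 0..999, with ids 0..99 stored twice.
records = [{"myid": i} for i in range(1000)] + [{"myid": i} for i in range(100)]

# Without an index: one scan per distinct id.
total_inspected = sum(first_match_scan(records, i)[1] for i in range(1000))

# With an index: one lookup per distinct id.
index = build_hash_index(records)
lookups = sum(1 for i in range(1000) if index[i]["myid"] == i)

print(total_inspected)  # records touched by the 1000 scans
print(lookups)          # dictionary lookups with the index
```

Even with LIMIT 1 stopping each scan at the first hit, the scan work grows roughly quadratically with the number of distinct ids, while the indexed variant stays linear.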
Grouping with INTO and KEEP
Instead of storing the whole documents that fell into a group, we can assign just the _id to a variable and KEEP it so that we can look up the document bodies using DOCUMENT():
FOR doc IN test1
LET d = doc._id
COLLECT myid = doc.myid INTO groups KEEP d
INSERT DOCUMENT(groups[0].d) INTO test2
Memory usage: 8.1GB after loading the source collection, 13.5GB peak during the query. It only took 30 seconds for the 2m records!
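The idea behind KEEP can be sketched in Python as well (a toy in-memory "collection" with made-up ids and payloads, not the ArangoDB API): during grouping only the short _id strings are held per group, and the bulky document body is fetched once per group afterwards.

```python
# Hypothetical in-memory collection: _id -> full document with a bulky payload.
collection = {
    f"test1/{n}": {"_id": f"test1/{n}", "myid": n % 3, "payload": "x" * 160}
    for n in range(6)
}

# Grouping phase: remember only the _id strings per myid, not the documents.
groups = {}
for doc in collection.values():
    groups.setdefault(doc["myid"], []).append(doc["_id"])

# Write phase: look up the body of the first member of each group by its _id,
# which is what DOCUMENT(groups[0].d) does in the query.
deduplicated = [collection[ids[0]] for ids in groups.values()]

print(sorted(d["myid"] for d in deduplicated))
```

The groups hold a few bytes per member instead of whole documents, which would fit the lower memory peak I observed, although I did not verify that this is the only reason.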
Grouping with INTO and projection
Instead of KEEP I also tried a projection out of curiosity:
FOR doc IN test1
COLLECT myid = doc.myid INTO groups = doc._id
INSERT DOCUMENT(groups[0]) INTO test2
RAM was at 8.3GB after loading test1, and the peak at 17.8GB (there were actually two heavy spikes during the query execution, both going over 17GB). It took 35s to complete for the 2m records.
Upsert
I tried something with UPSERT, but saw some strange results. It turned out to be an oversight in ArangoDB's upsert implementation. v3.0.2 contains a fix and I get correct results now:
FOR doc IN test1
UPSERT {myid: doc.myid}
INSERT doc
UPDATE {} IN test2
It took 40s to process with a (unique) hash index on myid in test2, with a RAM peak around 13.2GB.
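In plain Python terms, the UPSERT above behaves roughly like the following sketch (a dictionary keyed by myid stands in for test2 with its unique index; the function name and sample data are hypothetical):

```python
def upsert_all(source, target):
    """Insert each document unless one with the same myid already exists; otherwise apply an empty update."""
    for doc in source:
        existing = target.get(doc["myid"])
        if existing is None:
            target[doc["myid"]] = dict(doc)  # INSERT doc
        else:
            existing.update({})              # UPDATE {}: deliberately changes nothing
    return target


source = [
    {"myid": 1, "attr1": "first"},
    {"myid": 1, "attr1": "first"},
    {"myid": 2, "attr1": "second"},
]
target = upsert_all(source, {})
print(len(target))  # number of documents that ended up in the target
```

The first document seen for each myid wins, and every later duplicate hits the no-op update branch.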
Delete duplicates in-place
I first copied all documents from test1 to test2 (2.2m records), then I tried to REMOVE just the duplicates in test2:
FOR doc IN test2
COLLECT myid = doc.myid INTO keys = doc._key
LET allButFirst = SLICE(keys, 1) // or SHIFT(keys)
FOR k IN allButFirst
REMOVE k IN test2
Memory was at 8.2GB (with only test2 loaded) and went up to 13.5GB during the query. It took roughly 16 seconds to delete the duplicates (200k).
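The "all but the first" logic of the removal query, sketched in Python on a tiny made-up collection (a dictionary keyed by _key; not the ArangoDB API):

```python
# Hypothetical collection keyed by _key.
test2 = {
    "k1": {"myid": 1}, "k2": {"myid": 1}, "k3": {"myid": 1},
    "k4": {"myid": 2},
    "k5": {"myid": 3}, "k6": {"myid": 3},
}

# Collect the keys of every group, like COLLECT ... INTO keys = doc._key.
keys_by_myid = {}
for key, doc in test2.items():
    keys_by_myid.setdefault(doc["myid"], []).append(key)

# Remove every key except the first one of each group, like SLICE(keys, 1).
removed = 0
for keys in keys_by_myid.values():
    for key in keys[1:]:
        del test2[key]
        removed += 1

print(removed, sorted(test2))
```

Only the keys are materialized per group, so the documents themselves never have to be copied anywhere.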
Verification
The following query groups myid together and aggregates how often every id occurs. Run against the target collection test2, the result should be {"1": 2000000}, otherwise there are still duplicates. I double-checked the query results above and everything checked out.
FOR doc IN test2
COLLECT myid = doc.myid WITH COUNT INTO count
COLLECT c = count WITH COUNT INTO cc
RETURN {[c]: cc}
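The same count-of-counts check can be reproduced in Python on a scaled-down version of the test data (the ranges are divided by 1000 compared with the generation query, and a set stands in for a properly de-duplicated target collection):

```python
from collections import Counter

# Scaled-down analogue of the generated data: ids 1..2000, 1..130 and 1..70.
source_ids = list(range(1, 2001)) + list(range(1, 131)) + list(range(1, 71))


def count_of_counts(ids):
    """Map 'number of occurrences' to 'how many ids occur that often'."""
    per_id = Counter(ids)                   # first COLLECT ... WITH COUNT
    return dict(Counter(per_id.values()))   # second COLLECT ... WITH COUNT


before = count_of_counts(source_ids)
after = count_of_counts(set(source_ids))

print(before)
print(after)
```

Before de-duplication the result has entries for 1, 2 and 3 occurrences; afterwards only the entry for 1 remains, which is the pattern to look for in test2.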
Conclusion
The performance appears to be reasonable with ArangoDB v3.0, although it may degrade if not enough RAM is available. The different queries completed roughly within the same time, but showed different RAM usage characteristics. For certain queries, indexes are necessary to avoid high computational complexity (here: full collection scans; 2,200,000,000,000 reads in the worst case?).
Can you try my presented solutions on your data and check what the performance is on your machine?
Related
I am using CassandraPageRequest for fetching data based on page size.
Here is my code:
public CassandraPage<CustomerEntity> getCustomer(int limit, String pagingState)
{
final CassandraPageRequest cassandraPageRequest = createCassandraPageRequest(limit, pagingState);
return getPageOfCustomer(cassandraPageRequest);
}
private CassandraPage<CustomerEntity> getPageOfCustomer(final CassandraPageRequest cassandraPageRequest) {
final Slice<CustomerEntity> recordSlice = CustomerPaginationRepository.findAll(cassandraPageRequest);
return new CassandraPage<>(recordSlice);
}
private CassandraPageRequest createCassandraPageRequest(final Integer limit, final String pagingState) {
final PageRequest pageRequest = PageRequest.of(0, limit);
final PagingState pageState = pagingState != null ? PagingState.fromString(pagingState) : null;
return CassandraPageRequest.of(pageRequest, pageState);
}
This works fine. However, I would like a recommendation for the number of records per page. With a limit of 1000 it works fine. Can the limit safely be set to 10000 or more?
I work at ScyllaDB - Scylla is a Cassandra compatible database.
I ran an experiment a few years back on the effect of page size and row size on cassandra paging.
What I have found is that the total amount of data that needs to be returned, in bytes, is what really matters. If you have very large rows, even 1000 may be too much; if you have small rows, 10000 should be OK.
Other factors that should be considered are:
Amount of tombstones in your data: tombstones have to be read and skipped in a query searching for live data, so having many of them causes Cassandra (and Scylla) more work in search of the next live row.
Type of query: are you doing a range scan over multiple partitions or reading a single partition? With a scan over multiple partitions it may be harder to fill a page (especially if there are a lot of tombstones).
Timeout: with a larger page size, Cassandra has to search for more rows; if the read timeout / range scan timeout values are low, the query may time out.
Please note that Scylla has removed the need for its users to optimize the page size: it caps each page at 1MB of data or at the page size in rows.
You can find the complete slide deck / session by searching for "Planning your queries for maximum performance". It's old but still holds (in Scylla we have more optimizations :) ).
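As a back-of-the-envelope illustration of "bytes matter, not rows", here is a small Python sketch. The function name, the 1 MB budget (borrowed from the Scylla cap mentioned above purely as an example) and the 10000-row ceiling are assumptions for illustration, not recommendations from any driver or database:

```python
def rows_per_page(avg_row_bytes, byte_budget=1_000_000, max_rows=10_000):
    """Largest row count whose estimated payload stays within byte_budget, capped at max_rows."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    return max(1, min(max_rows, byte_budget // avg_row_bytes))


print(rows_per_page(100))     # small rows: limited by the row ceiling
print(rows_per_page(5_000))   # large rows: limited by the byte budget
```

You would have to measure the average row size of your own table; tombstones and timeouts are not covered by this estimate.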
I have many files containing millions of rows in format:
id, created_date, some_value_a, some_value_b, some_value_c
This way of repartitioning was super slow and created over a million small files of ~500 bytes each:
rdd_df = rdd.toDF(["id", "created_time", "a", "b", "c"])
rdd_df.write.partitionBy("id").csv("output")
I would like to achieve output files, where each file contains like 10000 unique IDs and all their rows.
How could I achieve something like this?
You can repartition by adding a Random Salt key.
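import org.apache.spark.sql.functions.{col, max, rand}

val totRows = rdd_df.count
// Row count of the largest id, extracted from the aggregate as a Long
val maxRowsForAnId = rdd_df.groupBy("id").count().agg(max("count")).head.getLong(0)
val numParts1 = totRows / maxRowsForAnId
val totalUniqueIds = rdd_df.select("id").distinct.count
// Partitions needed for roughly 10000 ids each
val numParts2 = totalUniqueIds / 10000
// repartition expects an Int of at least 1
val numPart = math.max(1L, numParts1.min(numParts2)).toInt
rdd_df
  .repartition(numPart, col("id"), rand())
  .write
  .csv("output")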
The main concept is that each partition is written as one file, so you have to bring the required rows into one partition with repartition(numPart,col("id"),rand).
The first few operations only calculate how many partitions we need to get roughly 10000 ids per file.
Calculate assuming 10000 ids per partition
Corner case: a single id has too many rows and doesn't fit in the partition size calculated above.
Hence we also calculate the number of partitions according to the largest row count of a single id
Take the min of the two partition counts
rand is necessary so that multiple ids can be brought into a single partition
NOTE: This will give you larger files, and each file will contain a set of unique ids. However, it involves shuffling, so the operation might actually be slower than the code in your question.
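For comparison, a different technique from the salted repartition above is plain hash bucketing of the id, sketched here in pure Python rather than Spark. The helper names and the sample rows are made up, and zlib.crc32 is just one possible stable hash. Every id maps to exactly one bucket, so all rows of an id end up together:

```python
import zlib


def bucket_for(record_id, num_buckets):
    """Stable bucket number for an id; the same id always maps to the same bucket."""
    return zlib.crc32(str(record_id).encode("utf-8")) % num_buckets


def bucketize(rows, ids_per_bucket=10_000):
    """Group (id, ...) rows into buckets sized for roughly ids_per_bucket distinct ids each."""
    distinct_ids = {row[0] for row in rows}
    num_buckets = max(1, -(-len(distinct_ids) // ids_per_bucket))  # ceiling division
    buckets = {}
    for row in rows:
        buckets.setdefault(bucket_for(row[0], num_buckets), []).append(row)
    return buckets


# 500 hypothetical rows spread over 50 ids: (id, created_date, some_value)
rows = [(i % 50, f"2020-01-{(i % 28) + 1:02d}", i) for i in range(500)]
buckets = bucketize(rows, ids_per_bucket=10)

for number, members in sorted(buckets.items()):
    print(number, len({r[0] for r in members}), len(members))
```

Hashing only balances the buckets approximately, so the number of ids per bucket will vary around the target.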
You would need something like this:
rdd_df.repartition(*number of partitions you want*).write.csv("output", header = True)
or honestly, just let the job decide the number of partitions instead of repartitioning. In theory, that should be faster:
rdd_df.write.csv("output", header = True)
I have a table:
CREATE TABLE my_table (
user_id text,
ad_id text,
date timestamp,
PRIMARY KEY (user_id, ad_id)
);
The lengths of the user_id and ad_id that I use are not longer than 15 characters.
I query the table like this:
Set<String> users = ... filled somewhere
Session session = ... built somewhere
BoundStatement boundQuery = ... built somewhere
(using query: "SELECT * FROM my_table WHERE user_id=?")
List<Row> rowAds =
users.stream()
.map(user -> session.executeAsync(boundQuery.bind(user)))
.map(ResultSetFuture::getUninterruptibly)
.map(ResultSet::all)
.flatMap(List::stream)
.collect(toList());
The Set of users has approximately 3000 elements, and each user has approximately 300 ads.
This code is executed in 50 threads on the same machine (with different users), using the same Session object.
The algorithm takes between 2 and 3 seconds to complete.
The Cassandra cluster has 3 nodes, with a replication factor of 2. Each node has 6 cores and 12 GB of RAM.
The Cassandra nodes are at 60% of their CPU capacity and 33% of RAM (66% of RAM including page cache).
The querying machine is at 50% of its CPU capacity and 50% of RAM.
How do I improve the read time to less than 1 second?
Thanks!
UPDATE:
After some answers (thank you very much), I realized that I wasn't doing the queries in parallel, so I changed the code to:
List<Row> rowAds =
users.stream()
.map(user -> session.executeAsync(boundQuery.bind(user)))
.collect(toList())
.stream()
.map(ResultSetFuture::getUninterruptibly)
.map(ResultSet::all)
.flatMap(List::stream)
.collect(toList());
So now the queries are being done in parallel. This gave me times of approximately 300 milliseconds, so a great improvement there!
But my question continues, can it be faster?
Again, thanks!
users.stream()
.map(user -> session.executeAsync(boundQuery.bind(user)))
.map(ResultSetFuture::getUninterruptibly)
.map(ResultSet::all)
.flatMap(List::stream)
.collect(toList());
A remark: in the second map() you're calling ResultSetFuture::getUninterruptibly. It's a blocking call, so you don't benefit much from asynchronous execution.
Instead, try to transform the list of Futures returned by the driver into a Future of a List (hint: ResultSetFuture implements Guava's ListenableFuture interface).
See: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/util/concurrent/Futures.html#successfulAsList(java.lang.Iterable)
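The difference between the two pipelines can be shown with a small Python analogue using concurrent.futures (this is not the Java driver or Guava; fake_query, submit and wait are hypothetical stand-ins). A lazy pipeline submits and awaits one query at a time, just like the stream that maps executeAsync and getUninterruptibly back to back, while the eager version submits everything before waiting for the first result:

```python
from concurrent.futures import ThreadPoolExecutor

events = []  # order of submit/result calls as seen by the calling thread


def fake_query(user):
    """Stand-in for a database query; returns two rows for the user."""
    return [f"{user}-ad1", f"{user}-ad2"]


def submit(pool, user):
    events.append(("submit", user))
    return pool.submit(fake_query, user)


def wait(future_and_user):
    future, user = future_and_user
    events.append(("result", user))
    return future.result()  # blocks until this query has finished


users = ["u1", "u2", "u3"]
with ThreadPoolExecutor(max_workers=3) as pool:
    # Lazy pipeline: each user is submitted and awaited before the next one starts.
    lazy = map(wait, ((submit(pool, u), u) for u in users))
    lazy_rows = [row for rows in lazy for row in rows]
    lazy_events = list(events)
    events.clear()

    # Eager pipeline: submit everything first, then collect the results.
    futures = [(submit(pool, u), u) for u in users]
    eager_rows = [row for rows in map(wait, futures) for row in rows]
    eager_events = list(events)

print(lazy_events)
print(eager_events)
```

Both variants return the same rows; only the eager one has all queries in flight at the same time.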
I am experimenting on a MacBook (i5, 2.6GHz, 8GB RAM) with a Zeppelin notebook and Spark in standalone mode. spark.executor.memory and spark.driver.memory are both set to 2g. I have also set spark.serializer org.apache.spark.serializer.KryoSerializer in spark-defaults.conf, but that seems to be ignored by Zeppelin.
ALS model
I have trained an ALS model with ~400k (implicit) ratings and want to get recommendations with val allRecommendations = model.recommendProductsForUsers(1)
Sample set
Next I take a sample to play around with
val sampledRecommendations = allRecommendations.sample(false, 0.05, 1234567).cache
This contains 3600 recommendations.
Remove product recommendations that users own
Next I want to remove all ratings for products that a given user already owns. I hold that list in an RDD of the form (user_id, Set[product_ids]): RDD[(Long, scala.collection.mutable.HashSet[Int])]
val productRecommendations = (sampledRecommendations
// add user portfolio to the list, but convert the key from Long to Int first
.join(usersProductsFlat.map( up => (up._1.toInt, up._2) ))
.mapValues(
// (user, (ratings: Array[Rating], usersOwnedProducts: HashSet[Int]))
r => (r._1
.filter( rating => !r._2.contains(rating.product))
.filter( rating => rating.rating > 0.5)
.toList
)
)
// In case there is no recommendation (left), remove the entry
.filter(rating => !rating._2.isEmpty)
).cache
Question 1
Calling this (productRecommendations.count) on the cached sample set generates a stage that includes flatMap at MatrixFactorizationModel.scala:278 with 10,000 tasks, 263.6 MB of input data and 196.0 MB shuffle write. Shouldn't the tiny and cached RDD be used instead and what is going (wr)on(g) here? The execution of the count takes almost 5 minutes!
Question 2
Calling usersProductsFlat.count, which is fully cached according to the "Storage" view in the application UI, takes ~60 seconds each time. It's 23MB in size. Shouldn't that be a lot faster?
Map to readable form
Next I bring this in some readable form replacing IDs with names from a broadcasted lookup Map to put into a DF/table:
val readableRatings = (productRecommendations
.flatMapValues(x=>x)
.map( r => (r._1, userIdToMailBC.value(r._1), r._2.product.toInt, productIdToNameBC.value(r._2.product), r._2.rating))
).cache
val readableRatingsDF = readableRatings.toDF("user","email", "product_id", "product", "rating").cache
readableRatingsDF.registerTempTable("recommendations")
Select … with patience
The insane part starts here. Doing a SELECT takes several hours (I could never wait for one to finish):
%sql
SELECT COUNT(user) AS usr_cnt, product, AVG(rating) AS avg_rating
FROM recommendations
GROUP BY product
I don't know where to look to find the bottlenecks here, there is obviously some huge kerfuffle going on here! Where can I start looking?
Your number of partitions may be too large. I think you should use about 200 when running in local mode rather than 10000. You can set the number of partitions in different ways. I suggest you edit the spark.default.parallelism flag in the Spark configuration file.
I have an application written in C++ that uses the DataStax C++ Driver to communicate with Cassandra.
I run 20 million inserts and then use 50 queries to read those 20 million rows. I have limited my partition key to 50 different possible values so the number of row partitions is a maximum of 50. Also, each query then returns around 300,000 - 400,000 rows.
I am keeping track of the wall clock time for different parts of this application. The following piece of code, which executes the query and gets the result, takes 3 seconds on average to complete, which seems reasonable to me.
stopWatch.start();
CassFuture* result_future = cass_session_execute(session, statement);
// Declared outside the if block so the iteration code further down can use it
const CassResult* result = NULL;
if(cass_future_error_code(result_future) == CASS_OK) {
    result = cass_future_get_result(result_future);
}
stopWatch.stop();
However, the following piece of code that iterates through the rows takes around 30 seconds on average!
resWatch.start();
CassIterator* rows = cass_iterator_from_result(result);
while(cass_iterator_next(rows)) {
const CassRow* row = cass_iterator_get_row(rows);
BAEL_LOG_INFO << "got a row " << BAEL_LOG_END;
}
resWatch.stop();
I realize that the CassIterator could be iterating over some 400,000 rows but is 30 seconds a reasonable time to achieve this?!
Or is there something I'm missing about the way Cassandra functions... does cass_session_execute(), cass_future_get_result() not fetch all the rows relevant to that query executed and return that to the client? Or does it do it in a lazy manner?