I have parquet data partitioned by date & hour, folder structure:
events_v3
-- event_date=2015-01-01
-- event_hour=2015-01-1
-- part10000.parquet.gz
-- event_date=2015-01-02
-- event_hour=5
-- part10000.parquet.gz
I have created a table raw_events via Spark, but when I query it, it scans all the directories for footers, which slows down the initial query even if I am only querying one day's worth of data.
query:
select * from raw_events where event_date='2016-01-01'
Similar problem: http://mail-archives.apache.org/mod_mbox/spark-user/201508.mbox/%3CCAAswR-7Qbd2tdLSsO76zyw9tvs-Njw2YVd36bRfCG3DKZrH0tw#mail.gmail.com%3E (but it's old)
Log:
App > 16/09/15 03:14:03 main INFO HadoopFsRelation: Listing leaf files and directories in parallel under: s3a://bucket/events_v3/
and then it spawns 350 tasks since there are 350 days worth of data.
I have disabled schema merging and have also specified the schema to read, so Spark should be able to go straight to the partition I am querying. Why does it need to list all the leaf files?
Listing the leaf files with 2 executors takes 10 minutes, while the actual query execution takes about 20 seconds.
code sample:
val sparkSession = org.apache.spark.sql.SparkSession.builder.getOrCreate()
val df = sparkSession.read.option("mergeSchema","false").format("parquet").load("s3a://bucket/events_v3")
df.createOrReplaceTempView("temp_events")
sparkSession.sql(
"""
|select verb,count(*) from temp_events where event_date = "2016-01-01" group by verb
""".stripMargin).show()
As soon as Spark is given a directory to read from, it calls listLeafFiles (org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala). This in turn calls fs.listStatus, which makes an API call to get the list of files and directories. For each directory this method is then called again, recursively, until no directories are left. By design this works well on HDFS, but it works badly on S3, since each listing is an RPC call. S3, on the other hand, supports fetching all files by prefix, which is exactly what we need.
So, for example, with the directory structure above and one year's worth of data, with a directory per hour and 10 sub-directories each, we would make 365 * 24 * 10 = 87k API calls. This can be reduced to 138 API calls, given that there are only 137,000 files and each S3 list call returns up to 1,000 keys.
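Before the patch below, note a partial mitigation that needs no code changes: point the reader at only the partitions you need and set the basePath option so the partition columns are still inferred. A minimal sketch, using the bucket layout from the question (untested there):
val df = sparkSession.read
  .option("mergeSchema", "false")
  .option("basePath", "s3a://bucket/events_v3/")   // keep event_date/event_hour as partition columns
  .parquet("s3a://bucket/events_v3/event_date=2016-01-01/")
This only helps when the partitions to read are known up front; the real fix is the prefix listing below.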
Code:
org/apache/hadoop/fs/s3a/S3AFileSystem.java
public FileStatus[] listStatusRecursively(Path f) throws FileNotFoundException,
IOException {
String key = pathToKey(f);
if (LOG.isDebugEnabled()) {
LOG.debug("List status for path: " + f);
}
final List<FileStatus> result = new ArrayList<FileStatus>();
final FileStatus fileStatus = getFileStatus(f);
if (fileStatus.isDirectory()) {
if (!key.isEmpty()) {
key = key + "/";
}
ListObjectsRequest request = new ListObjectsRequest();
request.setBucketName(bucket);
request.setPrefix(key);
request.setMaxKeys(maxKeys);
if (LOG.isDebugEnabled()) {
LOG.debug("listStatus: doing listObjects for directory " + key);
}
ObjectListing objects = s3.listObjects(request);
statistics.incrementReadOps(1);
while (true) {
for (S3ObjectSummary summary : objects.getObjectSummaries()) {
Path keyPath = keyToPath(summary.getKey()).makeQualified(uri, workingDir);
// Skip over keys that are ourselves and old S3N _$folder$ files
if (keyPath.equals(f) || summary.getKey().endsWith(S3N_FOLDER_SUFFIX)) {
if (LOG.isDebugEnabled()) {
LOG.debug("Ignoring: " + keyPath);
}
continue;
}
if (objectRepresentsDirectory(summary.getKey(), summary.getSize())) {
result.add(new S3AFileStatus(true, true, keyPath));
if (LOG.isDebugEnabled()) {
LOG.debug("Adding: fd: " + keyPath);
}
} else {
result.add(new S3AFileStatus(summary.getSize(),
dateToLong(summary.getLastModified()), keyPath,
getDefaultBlockSize(f.makeQualified(uri, workingDir))));
if (LOG.isDebugEnabled()) {
LOG.debug("Adding: fi: " + keyPath);
}
}
}
for (String prefix : objects.getCommonPrefixes()) {
Path keyPath = keyToPath(prefix).makeQualified(uri, workingDir);
if (keyPath.equals(f)) {
continue;
}
result.add(new S3AFileStatus(true, false, keyPath));
if (LOG.isDebugEnabled()) {
LOG.debug("Adding: rd: " + keyPath);
}
}
if (objects.isTruncated()) {
if (LOG.isDebugEnabled()) {
LOG.debug("listStatus: list truncated - getting next batch");
}
objects = s3.listNextBatchOfObjects(objects);
statistics.incrementReadOps(1);
} else {
break;
}
}
} else {
if (LOG.isDebugEnabled()) {
LOG.debug("Adding: rd (not a dir): " + f);
}
result.add(fileStatus);
}
return result.toArray(new FileStatus[result.size()]);
}
org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala
def listLeafFiles(fs: FileSystem, status: FileStatus, filter: PathFilter): Array[FileStatus] = {
logTrace(s"Listing ${status.getPath}")
val name = status.getPath.getName.toLowerCase
if (shouldFilterOut(name)) {
Array.empty[FileStatus]
}
else {
val statuses = {
val stats = if(fs.isInstanceOf[S3AFileSystem]){
logWarning("Using Monkey patched version of list status")
println("Using Monkey patched version of list status")
val a = fs.asInstanceOf[S3AFileSystem].listStatusRecursively(status.getPath)
a
// Array.empty[FileStatus]
}
else{
val (dirs, files) = fs.listStatus(status.getPath).partition(_.isDirectory)
files ++ dirs.flatMap(dir => listLeafFiles(fs, dir, filter))
}
if (filter != null) stats.filter(f => filter.accept(f.getPath)) else stats
}
// statuses do not have any dirs.
statuses.filterNot(status => shouldFilterOut(status.getPath.getName)).map {
case f: LocatedFileStatus => f
// NOTE:
//
// - Although S3/S3A/S3N file system can be quite slow for remote file metadata
// operations, calling `getFileBlockLocations` does no harm here since these file system
// implementations don't actually issue RPC for this method.
//
// - Here we are calling `getFileBlockLocations` in a sequential manner, but it should not
// be a big deal since we always use to `listLeafFilesInParallel` when the number of
// paths exceeds threshold.
case f => createLocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen))
}
}
}
To clarify Gaurav's answer, that code snippet is from Hadoop branch-2, so it probably won't surface until Hadoop 2.9 (see HADOOP-13208); and someone needs to update Spark to use that feature (which won't harm code using HDFS, it just won't show any speedup there).
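For a rough idea of what "using that feature" would look like on the Spark side, here is a sketch against the existing FileSystem API (the method names are Hadoop's; the wiring into Spark is hypothetical):
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
import scala.collection.mutable.ArrayBuffer

def listLeafFilesFlat(fs: FileSystem, root: Path): Array[LocatedFileStatus] = {
  // With HADOOP-13208, S3A's recursive listFiles becomes a flat bulk listing
  // (one listObjects call per 1000 keys) instead of a directory-by-directory walk.
  val it: RemoteIterator[LocatedFileStatus] = fs.listFiles(root, true)
  val buf = ArrayBuffer.empty[LocatedFileStatus]
  while (it.hasNext) buf += it.next()
  buf.toArray
}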
One thing to consider is: what makes a good file layout for Object Stores.
Don't have deep directory trees with only a few files per directory
Do have shallow trees with many files
Consider using the first few characters of a filename for the most rapidly changing value (such as day/hour), rather than the last. Why? Some object stores appear to use the leading characters for their hashing, not the trailing ones; if you give your names more uniqueness up front, they get spread out over more servers, with better bandwidth and less risk of throttling.
If you are using the Hadoop 2.7 libraries, switch to s3a:// over s3n://. It's already faster, and getting better every week, at least in the ASF source tree.
Finally, Apache Hadoop, Apache Spark and related projects are all open source. Contributions are welcome. That's not just the code, it's documentation, testing, and, for this performance stuff, testing against your actual datasets. Even giving us details about what causes problems (and your dataset layouts) is interesting.
Related
I have to create a system that manages a customer list (up to 10 million customers). Each customer has a unique ID consisting of 10 characters: the first 3 are uppercase letters and the last 7 are digits (e.g. LQK0333208, HCK1646129, ...). The system must perform two search operations as fast as possible (exact-match search and partial-match search):
For the exact-match search, users enter a complete customer ID, and the system displays the details of the matching customer, or an error message if there is no matching customer.
For the partial-match search, users enter several (at least 5 and at most 8) starting characters of a customer ID, and the system displays the details of the matching customers, or an error message if there is none. If the number of matching customers is greater than 10, display only 10 of them.
So what is a suitable data structure for this system? Currently I am using an AVL tree to handle the problem:
For the exact-match search, I perform a logarithmic search (descending into the left or right subtree): O(log n).
For the partial-match search, I perform an in-order traversal of the AVL tree and check whether each customer has the requested prefix. This is a linear search: O(n).
But I want the partial-match search to have a better time complexity.
So, any suggestion for a data structure that suits the system's requirements?
EDIT 1: I have tested the program with a trie and a ternary search tree, but for a larger dataset (10 million customers) there is no way I can fit such an in-memory data structure in memory. Any suggestions?
EDIT 2: I have tested the sorted-array data structure and it works well with the dataset of 10 million users. Actually, this was my first approach, before I knew anything about tries or ternary trees. As far as I understand it: first we store all the customers in an array, then sort it with an algorithm like quicksort, then use binary search to look up a key, which is O(log n) per search, quite good! But in the long term, when we need to add extra data to the array (not create a new one, but add to it), for instance just one more customer, inserting the new element takes O(n) in the worst case, as we need to find the position and shift elements.
But for a data structure like a trie or ternary tree, adding a new element only requires walking the tree along the key, which is O(L) with L = 10 here, effectively constant. If we don't mind the space overhead, I think a trie or ternary tree suits this project best.
A suitable data structure for this is a trie. This is a tree of all prefixes, where each node (except the root) represents a character, and each possible path from root to a leaf will be a character sequence that corresponds to a valid ID.
A partial match means that there is a path from the root that ends in an internal node.
If implemented with an efficient child lookup, a match can in this particular use case be found in 10 steps. So if we consider 10 to be a constant, the match can be done in constant time, irrespective of how large (i.e. how wide) the tree is. This assumes that looking up a child by its character can be done in constant time (on average).
As the alphabet at each position is limited in this use case (uppercase letters for the first three characters, digits for the remaining seven), a node has at most 26 child entries, which could be stored in an array of that size whose indexes map to the corresponding characters. This ensures constant time for stepping from a parent node to the relevant child node. Alternatively, a hashing scheme can be used instead of an array with 26 slots.
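For illustration, a sketch of the fixed-size child-array variant just described (written in Scala here; the JavaScript demo below uses a dictionary for the children instead):
final class TrieNode[A] {
  // indexes 0-25 map to 'A'-'Z', 26-35 to '0'-'9'
  private val children = new Array[TrieNode[A]](36)
  var data: Option[A] = None

  private def slot(ch: Char): Int =
    if (ch.isDigit) 26 + (ch - '0') else ch - 'A'

  def add(word: String, value: A): Unit = {
    var node = this
    for (ch <- word) {
      val i = slot(ch)
      if (node.children(i) == null) node.children(i) = new TrieNode[A]
      node = node.children(i)
    }
    node.data = Some(value)
  }

  def find(prefix: String): Iterator[A] = {
    var node = this
    for (ch <- prefix) {
      val child = node.children(slot(ch))
      if (child == null) return Iterator.empty
      node = child
    }
    node.all
  }

  // all data stored in this subtree
  private def all: Iterator[A] =
    data.iterator ++ children.iterator.filter(_ != null).flatMap(_.all)
}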
Here is a demo implementation in JavaScript (using a plain object for the children, i.e. a "dictionary"):
class TrieNode {
constructor(data=null) {
this.children = {}; // Dictionary, <character, TrieNode>
this.data = data; // Non-null when this node represents the end of a valid word
}
addWord(word, data) {
let node = this; // the root of the tree
for (let ch of word) {
if (!(ch in node.children)) {
node.children[ch] = new TrieNode();
}
node = node.children[ch]; // Walk down the tree
}
node.data = data;
}
*getAllData() { // This method returns an iterator over all data in this subtree
if (this.data != null) yield this.data;
// Recursively yield all data in the children's subtrees
for (const child in this.children) yield* this.children[child].getAllData();
}
*find(prefix) { // This method returns an iterator over matches
let node = this;
// Find the node where this prefix ends:
for (let ch of prefix) {
if (!(ch in node.children)) return; // No matches
node = node.children[ch];
}
// Yield all data in this subtree
yield* node.getAllData();
}
}
class Customer {
constructor(id, name) {
this.id = id;
this.name = name;
}
toString() {
return this.name + " (" + this.id + ")";
}
}
// Demo
// Create some Customer data:
const database = [
new Customer('LQK0333208', 'Hanna'),
new Customer('LQK0333311', 'Bert'),
new Customer('LQK0339999', 'Joline'),
new Customer('HCK1646129', 'Sarah'),
new Customer('HCK1646130', 'Pete'),
new Customer('HCK1700012', 'Cristine')
];
// Build a trie for the database of customers
const trie = new TrieNode(); // The root node of the trie.
for (const customer of database) {
trie.addWord(customer.id, customer);
}
// Make a few queries
console.log("query: LQK0333");
for (const customer of trie.find("LQK0333")) console.log("found: " + customer);
console.log("query: HCK16461");
for (const customer of trie.find("HCK16461")) console.log("found: " + customer);
console.log("query: LQK0339999");
for (const customer of trie.find("LQK0339999")) console.log("found: " + customer);
console.log("query: LQK09 should not yield results");
for (const customer of trie.find("LQK09")) console.log("found: " + customer);
Sorted Array
Another approach is to store the Customer records in a sorted array. JavaScript has no such data structure, but splice is surprisingly fast in JavaScript, so you could just maintain a sorted order by inserting new entries in their sorted position. Binary search can be used to locate the index where to find or insert an entry:
class SortedArray {
constructor(keyField) {
this.arr = [];
this.keyField = keyField;
}
addObject(obj) {
const i = this.indexOf(obj[this.keyField]);
if (this.arr[i]?.[this.keyField] === obj[this.keyField]) throw "Duplicate not added";
this.arr.splice(i, 0, obj);
}
*find(prefix) { // This method returns an iterator over matches
for (let i = this.indexOf(prefix); i < this.arr.length; i++) {
const obj = this.arr[i];
if (!obj[this.keyField].startsWith(prefix)) return;
yield obj;
}
}
indexOf(key) {
let low = 0, high = this.arr.length;
while (low < high) {
const mid = (low + high) >> 1;
if (key === this.arr[mid][this.keyField]) return mid;
if (key > this.arr[mid][this.keyField]) {
low = mid + 1;
} else {
high = mid;
}
}
return low;
}
}
class Customer {
constructor(id, name) {
this.id = id;
this.name = name;
}
toString() {
return this.name + " (" + this.id + ")";
}
}
const database = [
new Customer('LQK0333208', 'Hanna'),
new Customer('LQK0333311', 'Bert'),
new Customer('LQK0339999', 'Joline'),
new Customer('HCK1646129', 'Sarah'),
new Customer('HCK1646130', 'Pete'),
new Customer('HCK1700012', 'Cristine')
];
const arr = new SortedArray("id");
for (const customer of database) {
arr.addObject(customer);
}
console.log("query: LQK0333");
for (const customer of arr.find("LQK0333")) console.log("found: " + customer);
console.log("query: HCK16461");
for (const customer of arr.find("HCK16461")) console.log("found: " + customer);
console.log("query: LQK0339999");
for (const customer of arr.find("LQK0339999")) console.log("found: " + customer);
console.log("query: LQK09 should not yield results");
for (const customer of arr.find("LQK09")) console.log("found: " + customer);
I am creating a plugin that uses code from BCFier to select elements from an external server version of the file and highlight them in a Revit view, but the elements are clearly not being found in Revit: all elements appear and none are highlighted. The specific pieces of code I am using are:
private void SelectElements(Viewpoint v)
{
var elementsToSelect = new List<ElementId>();
var elementsToHide = new List<ElementId>();
var elementsToShow = new List<ElementId>();
var visibleElems = new FilteredElementCollector(OpenPlugin.doc, OpenPlugin.doc.ActiveView.Id)
.WhereElementIsNotElementType()
.WhereElementIsViewIndependent()
.ToElementIds()
.Where(e => OpenPlugin.doc.GetElement(e).CanBeHidden(OpenPlugin.doc.ActiveView)); //might affect performance, but it's necessary
bool canSetVisibility = (v.Components.Visibility != null &&
v.Components.Visibility.DefaultVisibility &&
v.Components.Visibility.Exceptions.Any());
bool canSetSelection = (v.Components.Selection != null && v.Components.Selection.Any());
//loop elements
foreach (var e in visibleElems)
{
//string guid = ExportUtils.GetExportId(OpenPlugin.doc, e).ToString();
var guid = IfcGuid.ToIfcGuid(ExportUtils.GetExportId(OpenPlugin.doc, e));
Trace.WriteLine(guid.ToString());
if (canSetVisibility)
{
if (v.Components.Visibility.DefaultVisibility)
{
if (v.Components.Visibility.Exceptions.Any(x => x.IfcGuid == guid))
elementsToHide.Add(e);
}
else
{
if (v.Components.Visibility.Exceptions.Any(x => x.IfcGuid == guid))
elementsToShow.Add(e);
}
}
if (canSetSelection)
{
if (v.Components.Selection.Any(x => x.IfcGuid == guid))
elementsToSelect.Add(e);
}
}
try
{
OpenPlugin.HandlerSelect.elementsToSelect = elementsToSelect;
OpenPlugin.HandlerSelect.elementsToHide = elementsToHide;
OpenPlugin.HandlerSelect.elementsToShow = elementsToShow;
OpenPlugin.selectEvent.Raise();
} catch (System.Exception ex)
{
TaskDialog.Show("Exception", ex.Message);
}
}
This is the section that should filter the lists, and it does run, producing IDs that look like this:
3GB5RcUGnAzQe9amE4i4IN
3GB5RcUGnAzQe9amE4i4Ib
3GB5RcUGnAzQe9amE4i4J6
3GB5RcUGnAzQe9amE4i4JH
3GB5RcUGnAzQe9amE4i4Ji
3GB5RcUGnAzQe9amE4i4J$
3GB5RcUGnAzQe9amE4i4GD
3GB5RcUGnAzQe9amE4i4Gy
3GB5RcUGnAzQe9amE4i4HM
3GB5RcUGnAzQe9amE4i4HX
3GB5RcUGnAzQe9amE4i4Hf
068MKId$X7hf9uMEB2S_no
The trouble is that, comparing these to the list of IDs in the IFC file we imported from, these IDs do not appear in the IFC file, and looking at it in Revit I found that none of the GUIDs in Revit appear in the generated list either. Almost all the objects also share the same leading part of the ID, and I'm not experienced enough to know how likely that is.
So my question is: is something in this code the issue?
The IFC GUID is based on the Revit UniqueId but not identical. Please read about the Element Identifiers in RVT, IFC, NW and Forge to learn how they are connected.
I'm running an EMR cluster with S3 client-side encryption enabled, using a custom key provider. But now I need to write data to multiple S3 destinations using different encryption schemas:
CSE custom key provider
CSE-KMS
Is it possible to configure EMR to use both encryption types by defining some kind of mapping between S3 bucket and encryption type?
Alternatively, since I use Spark Structured Streaming to process and write data to S3, I'm wondering if it's possible to disable encryption on EMRFS but then enable CSE for each stream separately?
The idea is to support any file-system scheme and configure each one individually. For example:
# custom encryption key provider
fs.s3x.cse.enabled = true
fs.s3x.cse.materialsDescription.enabled = true
fs.s3x.cse.encryptionMaterialsProvider = my.company.fs.encryption.CustomKeyProvider
#no encryption
fs.s3u.cse.enabled = false
#AWS KMS
fs.s3k.cse.enabled = true
fs.s3k.cse.encryptionMaterialsProvider = com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider
fs.s3k.cse.kms.keyId = some-kms-id
And then to use it in spark like this:
StreamingQuery writeStream = session
.readStream()
.schema(RecordSchema.fromClass(TestRecord.class))
.option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
.option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
.csv("s3x://aws-s3-bucket/input")
.as(Encoders.bean(TestRecord.class))
.writeStream()
.outputMode(OutputMode.Append())
.format("parquet")
.option("path", “s3k://aws-s3-bucket/output”)
.option("checkpointLocation", “s3u://aws-s3-bucket/checkpointing”)
.start();
To handle this I've implemented a custom Hadoop file system (extending org.apache.hadoop.fs.FileSystem) that delegates calls to the real file system but with a modified configuration.
// Create delegate FS
this.config.set("fs.s3n.impl", “com.amazon.ws.emr.hadoop.fs.EmrFileSystem”);
this.config.set("fs.s3n.impl.disable.cache", Boolean.toString(true));
this.delegatingFs = FileSystem.get(s3nURI(originalUri, SCHEME_S3N), substituteS3Config(conf));
The configuration passed to the delegating file system should take all the original settings and replace the custom scheme prefix (fs.s3x., fs.s3u., fs.s3k.) with fs.s3., which is the prefix EMRFS reads:
private Configuration substituteS3Config(final Configuration conf) {
if (conf == null) return null;
final String fsSchemaPrefix = "fs." + getScheme() + ".";
final String fsS3SchemaPrefix = "fs.s3.";
final String fsSchemaImpl = "fs." + getScheme() + ".impl";
Configuration substitutedConfig = new Configuration(conf);
for (Map.Entry<String, String> configEntry : conf) {
String propName = configEntry.getKey();
if (!fsSchemaImpl.equals(propName)
&& propName.startsWith(fsSchemaPrefix)) {
final String newPropName = propName.replace(fsSchemaPrefix, fsS3SchemaPrefix);
LOG.info("Substituting property '{}' with '{}'", propName, newPropName);
substitutedConfig.set(newPropName, configEntry.getValue());
}
}
return substitutedConfig;
}
Besides that, make sure the delegating FS receives URIs and paths with the scheme it supports and returns paths with the custom scheme:
#Override
public FileStatus getFileStatus(final Path f) throws IOException {
FileStatus status = this.delegatingFs.getFileStatus(s3Path(f));
if (status != null) {
status.setPath(customS3Path(status.getPath()));
}
return status;
}
private Path s3Path(final Path p) {
if (p.toUri() != null && getScheme().equals(p.toUri().getScheme())) {
return new Path(s3nURI(p.toUri(), SCHEME_S3N));
}
return p;
}
private Path customS3Path(final Path p) {
if (p.toUri() != null && !getScheme().equals(p.toUri().getScheme())) {
return new Path(s3nURI(p.toUri(), getScheme()));
}
return p;
}
private URI s3nURI(final URI originalUri, final String newScheme) {
try {
return new URI(
newScheme,
originalUri.getUserInfo(),
originalUri.getHost(),
originalUri.getPort(),
originalUri.getPath(),
originalUri.getQuery(),
originalUri.getFragment());
} catch (URISyntaxException e) {
LOG.warn("Unable to convert URI {} to {} scheme", originalUri, newScheme);
}
return originalUri;
}
The final step is to register the custom file system with Hadoop (via the spark-defaults classification):
spark.hadoop.fs.s3x.impl = my.company.fs.DynamicS3FileSystem
spark.hadoop.fs.s3u.impl = my.company.fs.DynamicS3FileSystem
spark.hadoop.fs.s3k.impl = my.company.fs.DynamicS3FileSystem
When you use EMRFS, you can specify per-bucket configs in the format:
fs.s3.bucket.<bucket name>.<some.configuration>
So, for example, to turn off CSE except for a bucket s3://foobar, you can set:
"Classification": "emrfs-site",
"Properties": {
"fs.s3.cse.enabled": "false",
"fs.s3.bucket.foobar.cse.enabled": "true",
[your other configs as usual]
}
Please note that it must be fs.s3 and not fs.{arbitrary-scheme} like fs.s3n.
I can't speak for Amazon EMR, but on hadoop's s3a connector, you can set the encryption policy on a bucket-by-bucket basis. However, S3A doesn't support client side encryption on account of it breaking fundamental assumptions about file lengths (the amount of data you can read MUST == the length in a directory listing/getFileStatus call).
I expect Amazon to do something similar. You may be able to create a custom Hadoop Configuration object with the different settings and use that to retrieve the filesystem instance used to save things. Tricky in Spark, though.
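A hedged sketch of that per-bucket idea with S3A (these are Hadoop's per-bucket option names for server-side encryption, available in recent Hadoop releases; the bucket name and key ARN are placeholders):
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
// Only writes to "kms-bucket" get SSE-KMS; other buckets keep the defaults.
conf.set("fs.s3a.bucket.kms-bucket.server-side-encryption-algorithm", "SSE-KMS")
conf.set("fs.s3a.bucket.kms-bucket.server-side-encryption.key", "arn:aws:kms:region:account:key/some-kms-id")

// Fetch a FileSystem bound to this configuration and write through it.
val fs = FileSystem.get(new URI("s3a://kms-bucket/"), conf)
fs.mkdirs(new Path("s3a://kms-bucket/output/"))
Remember this is server-side, not client-side, encryption; as noted above, S3A does not do CSE.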
What's the query or some other quick way to delete all the documents matching the where condition in a collection?
I want something like DELETE * FROM c WHERE c.DocumentType = 'EULA' but, apparently, it doesn't work.
Note: I'm not looking for any C# implementation for this.
This is a bit old, but I just had the same requirement and found a concrete example of what Gaurav Mantri wrote about.
The stored procedure script is here:
https://social.msdn.microsoft.com/Forums/azure/en-US/ec9aa862-0516-47af-badd-dad8a4789dd8/delete-multiple-docdb-documents-within-the-azure-portal?forum=AzureDocumentDB
Go to the Azure portal, grab the script from above and make a new stored procedure in the database->collection you need to delete from.
Then right at the bottom of the stored procedure pane, underneath the script textarea is a place to put in the parameter. In my case I just want to delete all so I used:
SELECT c._self FROM c
I guess yours would be:
SELECT c._self FROM c WHERE c.DocumentType = 'EULA'
Then hit 'Save and Execute'. Voila, some documents get deleted. After I got it working in the Azure portal I switched over to Azure DocumentDB Studio and got a better view of what was happening, i.e. I could see I was throttled to deleting 18 at a time (returned in the results). For some reason I couldn't see this in the Azure portal.
Anyway, pretty handy even if limited to a certain amount of deletes per execution. Executing the sp is also throttled so you can't just mash the keyboard. I think I would just delete and recreate the Collection unless I had a manageable number of documents to delete (thinking <500).
Props to Mimi Gentz at Microsoft for sharing the script in the link above.
HTH
I want something like DELETE * FROM c WHERE c.DocumentType = 'EULA'
but, apparently, it doesn't work.
Deleting documents this way is not supported. You would need to first select the documents using a SELECT query and then delete them separately. If you want, you can write the code for fetching & deleting in a stored procedure and then execute that stored procedure.
I wrote a script to list all the documents and delete all the documents; it can be modified to delete only selected documents as well.
var docdb = require("documentdb");
var async = require("async");
var config = {
host: "https://xxxx.documents.azure.com:443/",
auth: {
masterKey: "xxxx"
}
};
var client = new docdb.DocumentClient(config.host, config.auth);
var messagesLink = docdb.UriFactory.createDocumentCollectionUri("xxxx", "xxxx");
var listAll = function(callback) {
var spec = {
query: "SELECT * FROM c",
parameters: []
};
client.queryDocuments(messagesLink, spec).toArray((err, results) => {
callback(err, results);
});
};
var deleteAll = function() {
listAll((err, results) => {
if (err) {
console.log(err);
} else {
async.forEach(results, (message, next) => {
client.deleteDocument(message._self, err => {
if (err) {
console.log(err);
next(err);
} else {
next();
}
});
});
}
});
};
var task = process.argv[2];
switch (task) {
case "listAll":
listAll((err, results) => {
if (err) {
console.error(err);
} else {
console.log(results);
}
});
break;
case "deleteAll":
deleteAll();
break;
default:
console.log("Commands:");
console.log("listAll deleteAll");
break;
}
And if you want to do it in C#/Dotnet Core, this project may help: https://github.com/lokijota/CosmosDbDeleteDocumentsByQuery. It's a simple Visual Studio project where you specify a SELECT query, and all the matches will be a) backed up to file; b) deleted, based on a set of flags.
Create a stored procedure in the collection and execute it by passing a SELECT query with the condition for the documents to delete. The major reason to use this stored procedure is the continuation token, which reduces RUs to a huge extent and costs less.
Here is a Python script which can be used to delete data from a partitioned Cosmos collection. It deletes documents ID by ID based on the result-set data.
Identify the data that needs to be deleted before the step below:
res_list = "select id from id_del"
res_id = [{id:x["id"]}
for x in sqlContext.sql(res_list).rdd.collect()]
config = {
"Endpoint" : "Use EndPoint"
"Masterkey" : "UseKey",
"WritingBatchSize" : "5000",
'DOCUMENTDB_DATABASE': 'Database',
'DOCUMENTDB_COLLECTION': 'collection-core'
};
for row in res_id:
# Initialize the Python DocumentDB client
client = document_client.DocumentClient(config['Endpoint'], {'masterKey': config['Masterkey']})
# use a SQL based query to get documents
## Looping thru partition to delete
query = { 'query': "SELECT c.id FROM c where c.id = "+ "'" +row[id]+"'" }
print(query)
options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 1000
result_iterable = client.QueryDocuments('dbs/Database/colls/collection-core', query, options)
results = list(result_iterable)
print('DOCS TO BE DELETED : ' + str(len(results)))
if len(results) > 0 :
for i in range(0,len(results)):
# print(results[i]['id'])
docID = results[i]['id']
print("docID :" + docID)
options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 1000
options['partitionKey'] = docID
client.DeleteDocument('dbs/Database/colls/collection-core/docs/'+docID,options=options)
print ('deleted Partition:' + docID)
GParsPool.withPool(numberPool) {
connection.withBatch(10000) { stmt ->
inputFile.eachParallel { data ->
//GParsPool.withPool() {
stmt.addBatch("DELETE FROM user WHERE number = ${data.toLong()} ")
println "IN"
//}
}
println "OUT"
Long startTimee = System.currentTimeMillis()
stmt.executeBatch()
println "deleted Batch"
Long endTime = System.currentTimeMillis()
println "Time taken for each batch: " + ((endTime - startTimee) / 1000)
}
}
The above code is used to delete data from the database. I first read the data from a file, then match each file entry against the database data and perform the delete query. But I have 5,533,179 records and it takes too much time. Even though I used GPars, I get the same performance as without it. I set numberPool = 5 but the issue remains, and increasing numberPool makes no difference either.
Why don't you use the SQL IN operator? That way you can process the data much faster.
UPDATE:
Off the top of my head:
GParsPool.withPool(numberPool) {
    // one buffer of ids per worker thread, created on demand
    Map buffPerThread = [:].withDefault{ [] }
    inputFile.eachParallel { data ->
        def buff = buffPerThread[ Thread.currentThread().id ]
        buff << data.toLong()
        if( 1000 == buff.size() ){
            sql.execute 'DELETE FROM user WHERE number in (?)', [ buff ]
            buff.clear()
        }
    }
    // don't forget the ids still sitting in the per-thread buffers
    buffPerThread.each { threadId, buff ->
        if( buff ) sql.execute 'DELETE FROM user WHERE number in (?)', [ buff ]
    }
}
I wouldn't use conn.withBatch here, as the IN statement gives the desired batching already.
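For comparison, a sketch of the same chunked IN-clause idea in plain JDBC (written in Scala here; the user table and number column come from the question, the connection and chunk size are assumptions):
import java.sql.Connection

def deleteInChunks(conn: Connection, numbers: Seq[Long], chunkSize: Int = 1000): Unit = {
  numbers.grouped(chunkSize).foreach { chunk =>
    // one placeholder per value: a single "?" cannot be bound to a whole list in plain JDBC
    val placeholders = Seq.fill(chunk.size)("?").mkString(", ")
    val stmt = conn.prepareStatement(s"DELETE FROM user WHERE number IN ($placeholders)")
    try {
      chunk.zipWithIndex.foreach { case (n, i) => stmt.setLong(i + 1, n) }
      stmt.executeUpdate()
    } finally stmt.close()
  }
}
Each statement deletes up to chunkSize rows in one round trip, which is where the speedup over row-by-row deletes comes from.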