ArangoDB not copying all records

I have a collection called TRANSACTION and I've been trying to copy it to a new collection; however, only 27,000,000 of the 27,763,392 documents will copy. I have a unique persistent index on a field called SERIAL and a non-unique index on POSTING_DATE. Here is a sample of the records:
{
"SERIAL": 8,
"STATUS": "P",
"CATEGORY": "A",
"PERSON_SERIAL": "NULL",
"USER_SERIAL": 101,
"OVERRIDE_USER_SERIAL": "NULL",
"OVERRIDE_CATEGORY": "S",
"DEVICE_SERIAL": 104,
"BRANCH_SERIAL": 1,
"BATCH_SERIAL": "NULL",
"SECURITY_SEVERITY": 0,
"POSTING_DATE": "2019-04-15",
"POSTING_TIME": "2019-04-15 11:04:28.0"
}
The reason I want to copy to a new collection is that I cannot create indexes on the existing collection: when I try, the system resets (I assume it crashes) and restarts. There are no errors in the log file (/var/log/arangodb3/arangod.log). Here are the log entries from when the indexing fails:
2021-12-17T15:30:17Z [32354] WARNING [66770] {engines} dropping failed index '378294016'
2021-12-17T15:30:17Z [32354] WARNING [66770] {engines} dropping failed index '379036163'
I tried to check for duplicates:
FOR d IN TRANSACTION
COLLECT serial = d.SERIAL WITH COUNT INTO count
FILTER count > 1
RETURN { "serial": serial, "count": count }
but the query never returns; I've waited for over two days (52 hours).
Here is my copy AQL command:
FOR d in TRANSACTION
INSERT d INTO TRAN_1 OPTIONS { ignoreErrors: true }
Server OS: Ubuntu 18.04.6 LTS
Disk: 78 GB with 24 GB free
Memory: Total=16424396 Used=12331048 Free=323568 Available=3779640
ArangoDB Enterprise 3.7.16
Any ideas on how to work with large data sets?

Related

Count and data in single query in Azure Cosmos DB

I want to return the count and the data in a single Cosmos SQL query.
Something like
Select *, count() from c
Or, if possible, I want to get the count in a JSON document.
[
{
"Count" : 1111
},
{
"Name": "Jon",
"Age" : 30
}
]
You're going to have to issue two separate queries - one to get the total number of documents matching your query, and a second to get a page of documents.
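For illustration, the two round trips could look something like this (a sketch only; the WHERE clause and page size are placeholders, not taken from the question):
SELECT VALUE COUNT(1) FROM c WHERE c.Age >= 18
SELECT TOP 20 * FROM c WHERE c.Age >= 18
The first query returns just the total as a number, and the second returns a page of matching documents; both should share the same filter.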

Timeout for db.collection.distinct()?

I have a database with a collection of about 90k documents. Each document is as follows:
{
'my_field_name': "a", # Or "b" or "c" ...
'p1': Array[30],
'p2': Array[10000]
}
There are about 9 unique values for the field. When there were ~30k documents in the collection, this worked:
>>> db.collection.distinct("my_field_name")
["a", "b", "c"]
However, now with 90k documents, db.collection.distinct() returns an empty list.
>>> db.collection.distinct("my_field_name")
[]
Is there a maxTimeMS setting for db.collection.distinct? If so, how could I set it to a higher value? If not, what else could I investigate?
One thing you can do to immediately speed up your query's execution time is to index the field on which you are running the 'distinct' operation (if the field is not already indexed).
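For example, in the mongo shell, using the field name from the question:
db.collection.createIndex({ my_field_name: 1 })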
That being said, if you want to set a maxTimeMS, one workaround is to rewrite your query as an aggregation and set the operation timeout on the returned cursor, e.g.:
db.collection.aggregate([
{ $group: { _id: '$my_field_name' } },
]).maxTimeMS(10000);
However, unlike distinct, the above query returns a cursor rather than an array.
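If you need the same plain array of values that distinct returns, one option (a small sketch, not the only approach) is to map the grouped _id values out of the cursor:
db.collection.aggregate([
{ $group: { _id: '$my_field_name' } },
]).map(function (doc) { return doc._id; });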

Couchbase add subdocument unique array values

I have a Couchbase document like this:
{
"last": 123,
"data": [
[0, 1.1],
[1, 2.3]
]
}
I currently have code that upserts the document to change the last property and add values to the data array; however, I cannot find a way to insert unique values only. I'd like to avoid fetching the whole document and doing the filtering in JavaScript. Is there any way to do this in Couchbase?
arrayAddUnique will fail because there are floats in the subarrays, per the Couchbase docs.
.mutateIn(`document`)
.upsert("last", 234)
.arrayAppend("data", newDataArray)
.execute( ... )

Why does an Azure Cosmos query sorting by a timestamp (string) cost so much more than sorting by _ts (built-in)?

This query costs 265 RU/s:
SELECT top 1 * FROM c
WHERE c.CollectPackageId = 'd0613cbb-492b-4464-b66b-3634b5571826'
ORDER BY c.StartFetchDateTimeUtc DESC
StartFetchDateTimeUtc is a string property, serialized by using the Cosmos API
This query costs 5 RU/s:
SELECT top 1 * FROM c
WHERE c.CollectPackageId = 'd0613cbb-492b-4464-b66b-3634b5571826'
ORDER BY c._ts DESC
_ts is a built-in field, a Unix-based numeric timestamp.
Example result (only including this field and _ts):
"StartFetchDateTimeUtc": "2017-08-08T03:35:04.1654152Z",
"_ts": 1502163306
The index is in place and follows the suggestions and tutorials on how to configure a sortable string/timestamp. It looks like:
{
"path": "/StartFetchDateTimeUtc/?",
"indexes": [
{
"kind": "Range",
"dataType": "String",
"precision": -1
}
]
}
According to this article, the variables "Item size, Item property count, Data consistency, Indexed properties, Document indexing, Query patterns, Script usage" will affect the RU charge.
So it is very strange that sorting on a different property costs a different number of RUs.
I also created a test demo on my side (with your index and the same document properties). I inserted 1,000 records into the database, and the two different queries cost the same RU. I suggest you start a new collection and test again.
The result is like this (result screenshots omitted): the charge for "Order by StartFetchDateTimeUtc" and for "Order by _ts" was the same.

How to implement index update functionality in Elasticsearch using Spark?

I am new to Elasticsearch. I have a huge amount of data to index.
I am using Apache Spark to index the data from a Hive table into Elasticsearch.
As part of this functionality, I wrote a simple Spark script:
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // provides saveJsonToEs on RDD[String]

object PushToES {
  def main(args: Array[String]) {
    val Array(inputQuery, index, host) = args
    val sparkConf = new SparkConf().setMaster("local[1]").setAppName("PushToES")
    sparkConf.set("....", host)   // Elasticsearch host setting (config key redacted)
    sparkConf.set("....", "9200") // Elasticsearch port setting (config key redacted)
    val sc = new SparkContext(sparkConf)
    val ht = new org.apache.spark.sql.hive.HiveContext(sc)
    val ps = ht.sql(inputQuery)   // run the Hive query passed on the command line
    ps.toJSON.saveJsonToEs(index) // write each row as a JSON document to the given index
  }
}
After that I generate a jar and submit the job using spark-submit:
spark-submit --jars ~/*.jar --master local[*] --class com.PushToES *.jar "select * from gtest where day=20170711" gest3 localhost
Then I execute the command below to get the count:
curl -XGET 'localhost:9200/test/test_test/_count?pretty'
The first time, it shows the count correctly:
{
"count" : 10,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
If I execute the same curl command a second time, the result is:
{
"count" : 20,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
If I execute the same command a third time, I get:
{
"count" : 30,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
But I do not understand why it adds the new count to the existing value in the index every time.
Please let me know how I can resolve this, i.e. no matter how many times I run the job, I should get the same value (the correct count, i.e. 10).
I am expecting the result below, because the correct count is 10 (a count(*) query on the Hive table returns 10 every time):
{
"count" : 10,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
Thanks in advance.
If you want to "replace" the data each time you run the job, and not "append" to it, then you have to configure the Spark Elasticsearch properties for that scenario.
The first thing you need to do is to have an ID in your documents, and tell Elasticsearch which "column" (if you come from a dataframe) or key (in JSON terms) is your ID.
This is documented here: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
For cases where the id (or other metadata fields like ttl or timestamp) of the document needs to be specified, one can do so by setting the appropriate mapping namely es.mapping.id. Following the previous example, to indicate to Elasticsearch to use the field id as the document id, update the RDD configuration (it is also possible to set the property on the SparkConf though due to its global effect it is discouraged):
EsSpark.saveToEs(rdd, "spark/docs", Map("es.mapping.id" -> "id"))
A second configuration key is available to control what kind of operation elasticsearch-hadoop performs when writing data, but the default is correct for your use case:
es.write.operation (default index)
The write operation elasticsearch-hadoop should perform - can be any of:
index (default)
new data is added while existing data (based on its id) is replaced (reindexed).
create
adds new data - if the data already exists (based on its id), an exception is thrown.
update
updates existing data (based on its id). If no data is found, an exception is thrown.
upsert
known as merge or insert if the data does not exist, updates if the data exists (based on its id).
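Applied to the script above, a minimal sketch would look like this (the column name id is an assumption for illustration; use whatever column uniquely identifies a row in your Hive table):
import org.elasticsearch.spark._
// pass es.mapping.id so every document gets a stable _id instead of an auto-generated one
val ps = ht.sql(inputQuery)
ps.toJSON.saveJsonToEs(index, Map("es.mapping.id" -> "id")) // "id" is a placeholder column name
With the default es.write.operation of index, re-running the job then re-indexes the same documents by _id instead of appending new ones, so the count stays at 10.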
