How to backup/dump the structure of graphs in ArangoDB

Is there a way to dump the graph structure of an ArangoDB database? Unfortunately, arangodump just dumps the data of the edge and document collections.

According to the documentation, in order to dump the structural information of all collections (including system collections), you run the following:
arangodump --dump-data false --include-system-collections true --output-directory "dump"
If you do not want the system collections to be included, don't provide the argument (it defaults to false) or provide a false value.
How the structure and data of collections are dumped is described below (from the documentation):
Structural information for a collection will be saved in files with name pattern .structure.json. Each structure file will contain a JSON object with these attributes:
parameters: contains the collection properties
indexes: contains the collection indexes
Document data for a collection will be saved in files with name pattern .data.json. Each line in a data file is a document insertion/update or deletion marker, along with some metadata.
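As a rough illustration of reading those structure files back, here is a minimal Python sketch; the "dump" directory matches the command above, and the exact attributes inside "parameters" depend on the ArangoDB version:

import glob
import json

# List each collection's properties and index count from the dump directory.
for path in sorted(glob.glob("dump/*.structure.json")):
    with open(path, encoding="utf-8") as f:
        structure = json.load(f)
    parameters = structure.get("parameters", {})   # collection properties
    indexes = structure.get("indexes", [])         # collection indexes
    print(path, parameters.get("name"), "indexes:", len(indexes))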

For testing I often want to extract a subgraph with a known structure. I use that to test my queries against. The method is not pretty but it might address your question. I blogged about it here.

Although Raf's answer is accepted, --dump-data false will only give structure files for all the collections; the data won't be there. Including --include-system-collections true would give the _graphs system collection's structure, which won't have information pertaining to the creation/structure of individual graphs.
To get the graph creation data as well, the right command is as follows:
arangodump --server.database <DB_NAME> --include-system-collections true --output-directory <YOUR_DIRECTORY>
We would be interested in the file named _graphs_<long_id>.data.json, which has the data format below.
{
  "type": 2300,
  "data": {
    "_id": "_graphs/social",
    "_key": "social",
    "_rev": "_WaJxhIO--_",
    "edgeDefinitions": [
      {
        "collection": "relation",
        "from": ["female", "male"],
        "to": ["female", "male"]
      }
    ],
    "numberOfShards": 1,
    "orphanCollections": [],
    "replicationFactor": 1
  }
}
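As a hedged sketch (the "dump" directory name is an assumption, and the marker type 2300 is taken from the example above), the graph definitions can be pulled back out of that file with a few lines of Python:

import glob
import json

# Recover graph definitions from the dumped _graphs system collection.
# "dump" is a placeholder for the --output-directory used with arangodump.
for path in glob.glob("dump/_graphs_*.data.json"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            marker = json.loads(line)
            if marker.get("type") != 2300:
                continue  # skip markers that are not document data
            graph = marker["data"]
            print(graph["_key"], graph.get("edgeDefinitions"))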
Hope this helps other users who were looking for my requirement!

Currently ArangoDB manages graphs via documents in the system collection _graphs.
One document equals one graph. It contains the graph name, the involved vertex collections, and the edge definitions that configure the directions of the edge collections.
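If the database is still reachable, the same definitions can also be listed live through ArangoDB's HTTP graph (Gharial) API. A minimal sketch, with the host, database name and credentials as placeholders:

import requests

# List the named graphs of one database via the Gharial endpoint.
# Host, <DB_NAME> and the credentials below are placeholders.
url = "http://localhost:8529/_db/<DB_NAME>/_api/gharial"
resp = requests.get(url, auth=("root", "<PASSWORD>"))
resp.raise_for_status()
for graph in resp.json().get("graphs", []):
    print(graph["_key"], graph.get("edgeDefinitions"))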

Related

Cannot identify the correct COSMOS DB SQL SELECT syntax to check if coordinates (Point) are within a Polygon

I'm developing an app that uses Cosmos DB SQL. The intention is to identify whether a potential construction site is within various restricted zones, such as national parks and sites of special scientific interest. This information is very useful in obtaining all the appropriate planning permissions.
I have created a container named 'geodata' containing 15 documents that I imported using data from a GeoJSON file provided by the UK National Parks. I have confirmed that all the polygons are valid using an ST_ISVALIDDETAILED SQL statement. I have also checked that the coordinates are anti-clockwise. A few documents contain MultiPolygons. The Geospatial Configuration of the container is 'Geography'.
I am using the Azure Cosmos Data Explorer to identify the correct format of a SELECT statement to identify if given coordinates (Point) are within any of the polygons within the 15 documents.
SELECT c.properties.npark18nm
FROM c
WHERE ST_WITHIN({"type": "Point", "coordinates":[-3.139638969259495,54.595188276959284]}, c.geometry)
The embedded coordinates are within a National Park, in this case, the Lake District in the UK (it also happens to be my favourite coffee haunt).
'c.geometry' is the JSON field within the documents.
"type": "Feature",
"properties": {
"objectid": 3,
"npark18cd": "E26000004",
"npark18nm": "Northumberland National Park",
"npark18nmw": " ",
"bng_e": 385044,
"bng_n": 600169,
"long": -2.2370801,
"lat": 55.29539871,
"st_areashape": 1050982397.6985701,
"st_lengthshape": 339810.592994494
},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
-2.182235310191206,
55.586659699934806
],
[
-2.183754259805564,
55.58706479201416
], ......
Link to the full document: https://www.dropbox.com/s/yul6ft2rweod75s/lakedistrictnationlpark.json?dl=0
I have not been able to format the SELECT query to return the name of the park successfully.
Can you help me? Is what I want to achieve possible?
Any guidance would be appreciated.
You haven't mentioned what error you are getting, and you have misspelt "c.geometry".
This should work:
SELECT c.properties.npark18nm
FROM c
WHERE ST_WITHIN({"type": "Point", "coordinates": [-3.139638969259495,54.595188276959284]}, c.geometry)
When running the query with your sample document, I was able to get the correct response (see image).
So this particular document is fine, and the query in your question works too. Can you recheck your query in the explorer? Also, are you referring to the incorrect database/collection by any chance?
A full screenshot of the Cosmos Data Explorer showing the databases/collections, your query, and the response would also help.
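For reference, a hedged sketch of running the same query through the azure-cosmos Python SDK; the account endpoint, key and database name are placeholders:

from azure.cosmos import CosmosClient

# Account endpoint, key and database name are placeholders.
client = CosmosClient("https://<ACCOUNT>.documents.azure.com:443/", credential="<KEY>")
container = client.get_database_client("<DATABASE>").get_container_client("geodata")

query = """
SELECT c.properties.npark18nm
FROM c
WHERE ST_WITHIN({"type": "Point", "coordinates": [-3.139638969259495, 54.595188276959284]}, c.geometry)
"""

# Cross-partition query, since the point could fall in any document.
for item in container.query_items(query=query, enable_cross_partition_query=True):
    print(item["npark18nm"])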
I have fixed this problem, not by altering the SQL statement, but by deleting the container the data was held in, recreating it, and reloading the data.
The SQL statement now produces the expected results.

Querying mongoDB using pymongo (completely new to mongo/pymongo)

If this question seems too trivial, please let me know in the comments and I will do further research on how to solve it.
I have a collection called products where I store details of a particular product from different retailers. The schema of a document looks like this -
{
  "_id": "uuid of a product",
  "created_at": "timestamp",
  "offers": [
    {
      "retailer_id": 123,
      "product_url": "url-of-a-product.com",
      "price": "1"
    },
    {
      "retailer_id": 456,
      "product_url": "url-of-a-product.com",
      "price": "1"
    }
  ]
}
The _id of a product is system generated. Consider a product like 'iPhone X': this will be a single document with URLs and prices from multiple retailers like Amazon, eBay, etc.
Now if a new URL comes into the system, I need to check whether this URL already exists in our database. The obvious way to do this is to iterate over every offer of every product document and see if the product_url field matches the input URL. That would require loading all the documents into memory and iterating through the offers of every product one by one. This leads to my questions:
Is there a simpler method to achieve this, using pymongo?
Or does my database schema need to be changed, since this basic check_if_product_url_exists() is too complex?
MongoDB supports searching within arrays using dot notation.
So your query would be:
db.collection.find({'offers.product_url': 'url-of-a-product.com'})
The same syntax works in the MongoDB shell and in pymongo.
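A minimal pymongo sketch of that lookup (the connection string and database/collection names are assumptions), with an index on the nested field so the check does not scan every document:

from pymongo import MongoClient

# Connection string and database/collection names are placeholders.
client = MongoClient("mongodb://localhost:27017")
products = client["mydb"]["products"]

# An index on the nested field lets MongoDB answer the lookup directly.
products.create_index("offers.product_url")

def check_if_product_url_exists(url):
    # Dot notation matches elements inside the "offers" array.
    return products.find_one({"offers.product_url": url}, {"_id": 1}) is not None

print(check_if_product_url_exists("url-of-a-product.com"))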

Document Schema Performance

I am trying to determine the best document schema for a CouchDB (2.3.1) project. In researching this I am finding some conflicting information and no relevant guidelines for the latest version of CouchDB and similar scenarios. If this data does not lend itself to CouchDB, or if a different method other than what is detailed below is preferred, I would like to better understand why.
My scenario is to track the manufacturing details of widgets:
100,000-300,000 widget types must be tracked
Each widget type is manufactured between 200-1,800 times a day
Widget type manufacturing may burst to ~10,000 in a day
Each widget creation and its associated details must be recorded and updated
Widget creation is stored for 30 days
Query widget details by widget type and creationStartTime/creationEndTime
I am not concerned with revisions, and can just update and use the same _rev if this may increase performance
Method 1:
{
  "_id": "*",
  "_rev": "*",
  "widgetTypeId": "1831",
  "creation": [
    {
      "creationId": "da17faef-3591-4579-b5f6-ff0a719a6da7",
      "creationStartTime": 1556471139,
      "creationEndTime": 1556471173,
      "color": "#ffffff",
      "styleId": "92811",
      "creatorId": "82812"
    },
    {
      "creationId": "893fede7-3874-44ed-b290-7001b4901bc9",
      "creationStartTime": 1556471481,
      "creationEndTime": 1556471497,
      "color": "#cccccc",
      "styleId": "75343",
      "creatorId": "3211"
    }
  ]
}
Using method one would limit my document creation to 100,000-300,000 documents. However, these documents would be very tall and frequently updated.
Method 2:
{
  "_id": "*",
  "_rev": "*",
  "widgetTypeId": "1831",
  "creationId": "da17faef-3591-4579-b5f6-ff0a719a6da7",
  "creationStartTime": 1556471139,
  "creationEndTime": 1556471173,
  "color": "#ffffff",
  "styleId": "92811",
  "creatorId": "82812"
},
{
  "_id": "*",
  "_rev": "*",
  "widgetTypeId": "1831",
  "creationId": "893fede7-3874-44ed-b290-7001b4901bc9",
  "creationStartTime": 1556471481,
  "creationEndTime": 1556471497,
  "color": "#cccccc",
  "styleId": "75343",
  "creatorId": "3211"
}
Method 2 creates a tall database
It's a common problem to be faced with. In general terms, many small, immutable documents will likely be more performant than a few huge, mutable documents. The reasons for this include:
There is no support for partial updates (patch) in CouchDB, so if you need to insert data into an array in a big document, you have to fetch all of the data, unpack the JSON, insert the data, repack the JSON, and send the whole thing back to CouchDB over the wire.
Larger documents also incur more internal overhead, especially when it comes to indexing.
It's best to let the data that change as a unit make up a document. Ever-growing lists in documents are a bad idea.
It seems to me that your second alternative is a perfect fit for what you want to achieve: a set of small documents that can be made immutable. Then make a set of views so you can query on time ranges and widget type.
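As a rough sketch of such a view over CouchDB's HTTP API (server URL, credentials, database and view names are assumptions), keyed on [widgetTypeId, creationStartTime] so one widget type can be queried over a time range with startkey/endkey:

import json
import requests

# Server URL, credentials and database name are placeholders.
BASE = "http://admin:password@localhost:5984/widgets"

# Design document whose map function keys each creation document
# by widget type and start time (fits the Method 2 layout).
design = {
    "views": {
        "by_type_and_start": {
            "map": "function (doc) { emit([doc.widgetTypeId, doc.creationStartTime], null); }"
        }
    }
}
requests.put(BASE + "/_design/creations", json=design)

# All creations of widget type "1831" within a time window.
params = {
    "startkey": json.dumps(["1831", 1556471000]),
    "endkey": json.dumps(["1831", 1556472000]),
    "include_docs": "true",
}
resp = requests.get(BASE + "/_design/creations/_view/by_type_and_start", params=params)
for row in resp.json()["rows"]:
    print(row["doc"]["creationId"])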

U-SQL: How to skip files from analysis based on content

I have a lot of files each containing a set of json objects like this:
{ "Id": "1", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
{ "Id": "2", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
{ "Id": "3", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
etc.
Each file contains information about a specific type of session. In this case they are sessions from a Web App, but they could also be sessions of a Desktop App, in which case the value for Origin is "DesktopClient" instead of "WebClient".
For analysis purposes, say I am only interested in DesktopClient sessions.
All files representing a session are stored in Azure Blob Storage like this:
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef77.json
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef78.json
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef79.json
Is it possible to skip files whose first line already makes it clear that it is not a DesktopClient session file, like in my example? I think it would save a lot of query resources if files that I know do not contain the right session type could be skipped, since they can be quite big.
At the moment my query reads the data like this:
@RawExtract = EXTRACT [RawString] string
FROM @"wasb://plancare-events-blobs@centrallogging/2017/07/20/{*}.json"
USING Extractors.Text(delimiter:'\b', quoting : false);
@ParsedJSONLines = SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple([RawString]) AS JSONLine
FROM @RawExtract;
...
Or should I create my own version of Extractors.Text, and if so, how should I do that?
To answer some questions that popped up in the comments to the question first:
At this point we do not provide access to the Blob Store metadata. That means that you need to express any metadata either as part of the data in the file or as part of the file name (or path).
Depending on the cost of extraction and the sizes of the files, you can extract all the rows and then filter out the rows whose beginning does not fit your criteria. That will extract all rows from all files, but does not need a custom extractor.
Alternatively, write a custom extractor that checks for only the files that are appropriate (that may be useful if the first approach does not give you the performance you need and you can determine the conditions efficiently inside the extractor). Several example extractors can be found at http://usql.io in the examples directory (including an example JSON extractor).
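Purely to illustrate the kind of first-line check such a filter or custom extractor would apply (this is plain Python, not U-SQL, and the file name is a placeholder):

import json

def is_desktop_session_file(path):
    # Decide from the first line alone whether a file holds DesktopClient sessions.
    with open(path, encoding="utf-8") as f:
        first_line = f.readline()
    try:
        return json.loads(first_line).get("Session", {}).get("Origin") == "DesktopClient"
    except json.JSONDecodeError:
        return False

print(is_desktop_session_file("00399076-2b88-4dbc-ba56-c7afeeb9ef77.json"))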

How to find an object which is at nth nested level in mongoDB? (single collection, single document)

I am trying to find an nth-level nested object using its '_id', within a single document.
Any suggestions, references, or code samples would be appreciated.
For example, the document looks like this:
{
  "_id": "xxxxx",
  "name": "One",
  "pocket": [
    {
      "_id": "xxx123",
      "name": "NestedOne",
      "pocket": []
    },
    {
      "_id": "xxx1234",
      "name": "NestedTwo",
      "pocket": [
        {
          "_id": "xxx123456",
          "name": "NestedTwoNested",
          "pocket": [
            {
              "_id": "xxx123666",
              "name": "NestedNestedOne",
              "pocket": []
            }
          ]
        }
      ]
    }
  ]
}
The pockets can hold more pockets, and the nesting is dynamic.
Here, I would like to search for a "pocket" by "_id", say "xxx123456", but without using a static reference.
Thanks again.
I highly recommend you change your document structure to something easier to manage/search, as this will only become more of a pain to work with.
Why not use multiple collections, like explained in this answer?
So an easy way to think about this for your situation, which I hope is easier for you to reason about than dropping some schema code...
Store all of your things as children in the same document. Give them unique _ids.
Store all of the contents of their pockets as collections. The collections simply hold all the ids that would normally be inside the pocket (instead of referencing the pockets themselves).
That way, a lot of your work can happen outside of DB calls. You just batch pull out the items you need when you need them, instead of searching nested documents!
However, if you can work with the entire document:
Looks like you want to do a recursive search a certain number of levels deep. I'll give you a general idea with some pseudocode, in hopes that you'll be able to figure the rest out.
Say your function will be:
function SearchNDeep(obj, n, id){
/**
You want to go down 1 level, to pocket
see if pocket has any things in it. If so:
Check all of the things...for further pockets...
Once you've checked one level of things, increment the counter.
When the counter reaches the right level, you'd want to then see if the object you're checking has a `'_id'` of `id`.
**/
}
That's the general idea. There is a cleaner, recursive way to do this where you call SearchNDeep while passing a number for how deep you are, base case being no more levels to go, or the object is found.
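For illustration only, here is a minimal recursive sketch in Python rather than in the shell, assuming the whole document has already been fetched (e.g. with find_one); it walks the nested pocket arrays depth-first until it finds the matching _id:

def find_pocket(obj, target_id):
    # Depth-first search for the nested object whose _id equals target_id.
    if obj.get("_id") == target_id:
        return obj
    for child in obj.get("pocket", []):
        found = find_pocket(child, target_id)
        if found is not None:
            return found
    return None

# Example with the document from the question:
# doc = collection.find_one({"_id": "xxxxx"})
# print(find_pocket(doc, "xxx123456"))  # -> the "NestedTwoNested" pocket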
Remember to return false or undefined if you don't find it, and the right object if you do! Good luck!
