Background
I have a rocksdb collection that contains three fields: _id, author, subreddit.
Problem
I would like to create an ArangoDB graph that connects these two existing columns. But the examples and the drivers seem to accept only collections as edge definitions.
Issue
The ArangoDB documentation is lacking information on how I can create a graph using edges and nodes pulled from the same collection.
EDIT:
Solution
This was fixed with a code change in this ArangoDB issue ticket.
Here's one way to do it using jq, a JSON-oriented command-line tool.
First, an outline of the steps:
1) Use arangoexport to export your author/subredit collection to a file, say, exported.json;
2) Run the jq script, nodes_and_edges.jq, shown below;
3) Use arangoimp to import the JSON produced in (2) into ArangoDB.
There are several ways the graph can be stored in ArangoDB, so ultimately you might wish to tweak nodes_and_edges.jq accordingly (e.g. to generate the nodes first, and then the edges).
INDEX
If your jq does not have INDEX defined, then use this:
def INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row|idx_expr|
      if type != "string" then tojson
      else .
      end] |= $row);
def INDEX(idx_expr): INDEX(.[]; idx_expr);
nodes_and_edges.jq
# This module is for generating JSON suitable for importing into ArangoDB.
### Generic Functions
# assign_keys/2
# Adds a "_key" of the form prefix + number to each object in the input array.
# It is defined before nodes/2 because jq requires a function to be defined
# before it is used.
def assign_keys(prefix; start):
  . as $in
  | reduce range(0;length) as $i ([];
      . + [$in[$i] + {"_key": "\(prefix)\(start+$i)"}]);
# nodes/2
# $name must be the name of the ArangoDB collection of nodes corresponding to $key.
# The scheme for generating key names can be altered by changing the first
# argument of assign_keys, e.g. to "" if no prefix is wanted.
def nodes($key; $name):
  map( {($key): .[$key]} ) | assign_keys($name[0:1] + "_"; 1);
# nodes_and_edges facilitates the normalization of an implicit graph
# in an ArangoDB "document" collection of objects having $from and $to keys.
# The input should be an array of JSON objects, as produced
# by arangoexport for a single collection.
# If $nodesq is truthy, then the JSON for both the nodes and edges is emitted,
# otherwise only the JSON for the edges is emitted.
#
# The first four arguments should be strings.
#
# $from and $to should be the key names in . to be used for the from-to edges;
# $name1 and $name2 should be the names of the corresponding collections of nodes.
def nodes_and_edges($from; $to; $name1; $name2; $nodesq ):
  def dict($s): INDEX(.[$s]) | map_values(._key);
  # $k is the field name under which each node's value is emitted,
  # so that the $to nodes are not labelled with the $from field name.
  def objects($k): to_entries[] | {($k): .key, "_key": .value};
  (nodes($from; $name1) | dict($from)) as $fdict
  | (nodes($to; $name2) | dict($to) ) as $tdict
  | (if $nodesq then ($fdict | objects($from)), ($tdict | objects($to))
     else empty end),
    (.[] | {_from: "\($name1)/\($fdict[.[$from]])",
            _to: "\($name2)/\($tdict[.[$to]])"} ) ;
### Problem-Specific Functions
# If you wish to generate the collections separately,
# then these will come in handy:
def authors: nodes("author"; "authors");
def subredits: nodes("subredit"; "subredits");
def nodes_and_edges:
nodes_and_edges("author"; "subredit"; "authors"; "subredits"; true);
nodes_and_edges
Invocation
jq -cf nodes_and_edges.jq exported.json
This invocation will produce JSON Lines (JSONL) output: one set of objects for "authors", one for "subredits", and one for the edge collection.
Example
exported.json
[
{"_id":"test/115159","_key":"115159","_rev":"_V8JSdTS---","author": "A", "subredit": "S1"},
{"_id":"test/145120","_key":"145120","_rev":"_V8ONdZa---","author": "B", "subredit": "S2"},
{"_id":"test/114474","_key":"114474","_rev":"_V8JZJJS---","author": "C", "subredit": "S3"}
]
Output
{"author":"A","_key":"a_1"}
{"author":"B","_key":"a_2"}
{"author":"C","_key":"a_3"}
{"subredit":"S1","_key":"s_1"}
{"subredit":"S2","_key":"s_2"}
{"subredit":"S3","_key":"s_3"}
{"_from":"authors/a_1","_to":"subredits/s_1"}
{"_from":"authors/a_2","_to":"subredits/s_2"}
{"_from":"authors/a_3","_to":"subredits/s_3"}
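For readers who do not use jq, the same normalization can be sketched in Python. This is only an illustration of the idea, not a port of the script: the function name nodes_and_edges and its parameters are made up here, and unlike the jq version it assigns one key per distinct value, so repeated authors or subredits share a node.

```python
def nodes_and_edges(rows, from_key, to_key, from_coll, to_coll):
    """Split rows into two vertex lists and one edge list (ArangoDB-style)."""

    def assign_keys(values, prefix):
        # One key per distinct value, numbered from 1 in first-seen order.
        keys = {}
        for value in values:
            if value not in keys:
                keys[value] = f"{prefix}{len(keys) + 1}"
        return keys

    # Key prefix is the first letter of the collection name plus "_".
    from_keys = assign_keys((r[from_key] for r in rows), from_coll[0] + "_")
    to_keys = assign_keys((r[to_key] for r in rows), to_coll[0] + "_")

    from_nodes = [{from_key: v, "_key": k} for v, k in from_keys.items()]
    to_nodes = [{to_key: v, "_key": k} for v, k in to_keys.items()]
    # One edge per input row, pointing at the generated node keys.
    edges = [
        {
            "_from": f"{from_coll}/{from_keys[r[from_key]]}",
            "_to": f"{to_coll}/{to_keys[r[to_key]]}",
        }
        for r in rows
    ]
    return from_nodes, to_nodes, edges


# Stand-in data with a repeated author and a repeated subredit.
rows = [
    {"author": "A", "subredit": "S1"},
    {"author": "B", "subredit": "S2"},
    {"author": "A", "subredit": "S2"},
]
authors, subredits, edges = nodes_and_edges(
    rows, "author", "subredit", "authors", "subredits"
)
```

Each of the three lists could then be written out as JSON Lines and imported into its own collection.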
Please note that the following queries take a while to complete on this huge dataset; however, they should complete successfully after some hours.
We start arangoimp to import our base dataset:
arangoimp --create-collection true --collection RawSubReddits --type jsonl ./RC_2017-01
We use arangosh to create the collections in which our final data is going to live:
db._create("authors")
db._createEdgeCollection("authorsToSubreddits")
We fill the authors collection, simply ignoring any subsequently occurring duplicate authors.
We calculate the _key of each author with the MD5 function,
so that it obeys the restrictions on characters allowed in _key, and so that we can recompute it later by calling MD5() again on the author field:
db._query(`
FOR item IN RawSubReddits
INSERT {
_key: MD5(item.author),
author: item.author
} INTO authors
OPTIONS { ignoreErrors: true }`);
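The reason a hash works as a key can be sketched in Python: the hex digest is deterministic and contains only the characters 0-9 and a-f, whatever the author name looks like. The helper names author_key and author_id are invented for this sketch, and it is an assumption (not checked here) that hashlib produces the same digest as AQL's MD5() for the same UTF-8 string.

```python
import hashlib
import re


def author_key(author: str) -> str:
    """Derive a document key from an author name via its MD5 hex digest."""
    return hashlib.md5(author.encode("utf-8")).hexdigest()


def author_id(author: str) -> str:
    """Build the document id that an edge's _from would reference."""
    return "authors/" + author_key(author)


# Even a name containing characters that are awkward in a _key
# maps to 32 lowercase hex characters.
key = author_key("some/author name")
is_safe = re.fullmatch(r"[0-9a-f]{32}", key) is not None

# The same input always yields the same key, so it can be recomputed
# later instead of being looked up.
same_again = author_key("some/author name") == key
```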
After we have filled the second vertex collection (we will keep the imported collection as the first vertex collection), we have to calculate the edges.
Since each author can have created several subreds, there will most probably be several edges originating from each author. As previously mentioned,
we can use the MD5() function again to reference the author previously created:
db._query(`
FOR onesubred IN RawSubReddits
INSERT {
_from: CONCAT('authors/', MD5(onesubred.author)),
_to: CONCAT('RawSubReddits/', onesubred._key)
} INTO authorsToSubreddits`);
After the edge collection is filled (which may again take a while; we're talking about 40 million edges here, right?), we create the graph description:
db._graphs.save({
"_key": "reddits",
"orphanCollections" : [ ],
"edgeDefinitions" : [
{
"collection": "authorsToSubreddits",
"from": ["authors"],
"to": ["RawSubReddits"]
}
]
})
We can now use the UI to browse the graph, or explore it with AQL queries. Let's pick the first author from that list, which is more or less random:
db._query(`for author IN authors LIMIT 1 RETURN author`).toArray()
[
{
"_key" : "1cec812d4e44b95e5a11f3cbb15f7980",
"_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980",
"_rev" : "_W_Eu-----_",
"author" : "punchyourbuns"
}
]
We have identified an author, and now run a graph query for that author:
db._query(`FOR vertex, edge, path IN 0..1
OUTBOUND 'authors/1cec812d4e44b95e5a11f3cbb15f7980'
GRAPH 'reddits'
RETURN path`).toArray()
One of the resulting paths looks like this:
{
"edges" : [
{
"_key" : "128327199",
"_id" : "authorsToSubreddits/128327199",
"_from" : "authors/1cec812d4e44b95e5a11f3cbb15f7980",
"_to" : "RawSubReddits/38026350",
"_rev" : "_W_LOxgm--F"
}
],
"vertices" : [
{
"_key" : "1cec812d4e44b95e5a11f3cbb15f7980",
"_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980",
"_rev" : "_W_HAL-y--_",
"author" : "punchyourbuns"
},
{
"_key" : "38026350",
"_id" : "RawSubReddits/38026350",
"_rev" : "_W-JS0na--b",
"distinguished" : null,
"created_utc" : 1484537478,
"id" : "dchfe6e",
"edited" : false,
"parent_id" : "t1_dch51v3",
"body" : "I don't understand tension at all.\n\nMine is set to auto.\n\nI'll replace the needle and rethread. Thanks!",
"stickied" : false,
"gilded" : 0,
"subreddit" : "sewing",
"author" : "punchyourbuns",
"score" : 3,
"link_id" : "t3_5o66d0",
"author_flair_text" : null,
"author_flair_css_class" : null,
"controversiality" : 0,
"retrieved_on" : 1486085797,
"subreddit_id" : "t5_2sczp"
}
]
}
For a graph you need an edge collection for the edges and vertex collections for the nodes. You can't create a graph using only one collection.
Maybe this topic in the documentation is helpful for you.
Here's an AQL solution, which however presupposes that all the referenced collections already exist, and that UPSERT is not necessary.
FOR v IN testcollection
LET a = v.author
LET s = v.subredit
FILTER a
FILTER s
LET fid = (INSERT {author: a} INTO authors RETURN NEW._id)[0]
LET tid = (INSERT {subredit: s} INTO subredits RETURN NEW._id)[0]
INSERT {_from: fid, _to: tid} INTO author_of
RETURN [fid, tid]
Related
So I've been looking around on the pytransitions GitHub and SO, and it seems that after 0.8 the way you could use macro-states (or a super state with substates in it) has changed. I would like to know if it's still possible to create such a machine with pytransitions (the blue square is supposed to be a macro-state that has 2 states in it, one of them, the green one, being another macro-state):
Or do I have to follow the workflow suggested here : https://github.com/pytransitions/transitions/issues/332 ?
Thx a lot for any info !
I would like to know if it's still possible to create such a machine with pytransition.
The way HSMs are created and managed has changed in 0.8 but you can of course use (deeply) nested states. For a state to have substates, you need to pass the states (or children) parameter with the state definitions/objects you'd like to nest. Furthermore, you can pass transitions for that particular scope. I am using HierarchicalGraphMachine since this allows me to create a graph right away.
from transitions.extensions.factory import HierarchicalGraphMachine

states = [
    # create a state named A
    {"name": "A",
     # with the following children
     "states":
        # a state named '1' which will be accessible as 'A_1'
        ["1", {
            # and a state '2' with its own children ...
            "name": "2",
            # ... 'a' and 'b'
            "states": ["a", "b"],
            "transitions": [["go", "a", "b"], ["go", "b", "a"]],
            # when '2' is entered, 'a' should be entered automatically.
            "initial": "a"
        }],
     # we could also pass [["go", "A_1", "A_2"]] to the machine constructor
     "transitions": [["go", "1", "2"]],
     "initial": "1"
     }]
m = HierarchicalGraphMachine(states=states, initial="A")
m.go()
m.get_graph().draw("foo.png", prog="dot")  # [1]
Output of 1:
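To make the naming of nested states concrete, here is a small library-free sketch that walks a nested definition like the one above and lists the fully qualified state names. The helper flatten_states is made up for this illustration and is not part of transitions; it assumes the default "_" separator mentioned in the comments above.

```python
def flatten_states(states, prefix="", separator="_"):
    """List fully qualified names for a nested state definition."""
    names = []
    for state in states:
        # A bare string is shorthand for a state without children.
        if isinstance(state, str):
            state = {"name": state}
        full_name = prefix + state["name"]
        names.append(full_name)
        # Children may be given under "states" (or "children").
        children = state.get("states") or state.get("children") or []
        names.extend(flatten_states(children, full_name + separator, separator))
    return names


states = [
    {
        "name": "A",
        "states": ["1", {"name": "2", "states": ["a", "b"], "initial": "a"}],
        "initial": "1",
    }
]
names = flatten_states(states)
# names == ['A', 'A_1', 'A_2', 'A_2_a', 'A_2_b']
```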
We are using the MSSQL module with Node.js.
I am running the following query:
SELECT AVG((RAT_VALUE * 1.0)) FROM RAT WHERE RAT_PER_ID_FROM IS NOT NULL AND RAT_PER_ID_ABOUT = 139 AND RAT_USE = 'Y' AND RAT_ABOUT_ROLE = 'RS' AND RAT_DATE_INSERTED >= '10/1/2018' AND RAT_DATE_INSERTED < '10/1/2019'
If I run this against the database directly, it returns:
4.45
The output from MSSQL is:
4
The exact resultset returned is:
results { recordsets: [ [ [Object] ] ],
recordset: [ { '': 4 } ],
output: {},
rowsAffected: [ 1 ] }
In other words, MSSQL is always returning the value 4, instead of 4.45.
The column type of RAT_VALUE is INT in the database, but I've tried changing it to DECIMAL(5, 2) without any luck.
I've tried explicitly returning a DECIMAL from the query like:
SELECT CAST(AVG((RAT_VALUE * 1.0)) AS DECIMAL(5, 2)) ...
But no luck there either.
It seems MSSQL is simply clipping and dropping the decimal part of any number, even numbers of Decimal types.
I even set the value as 4.75 in the database and returned it directly and it still returns 4.
Any ideas out there?
I have the following python code that is to replace low-precision temperatures in a list of JSON trees, ec2_tcs['zones'] with higher precision temps from a generator, ec1_api.temperatures().
if CONF_HIGH_PRECISION:
    try:
        from evohomeclient import EvohomeClient as EvohomeClientVer1
        ec1_api = EvohomeClientVer1(client.username, client.password)
        for temp in ec1_api.temperatures(force_refresh=True):
            for zone in ec2_tcs['zones']:
                if str(temp['id']) == str(zone['zoneId']):
                    if zone['temperatureStatus']['isAvailable']:
                        zone['temperatureStatus']['temperature'] \
                            = temp['temp']
                    break
    # TypeError: usually occurs in client library if problems with vendor's website
    except TypeError:
        _LOGGER.warning(
            "Failed to obtain higher-precision temperatures"
        )
The JSON data looks like this (an array of JSON data, 1 per 'zone'):
[
    {
        'zoneId': '3432521',
        'name': 'Main Room',
        'temperatureStatus': {'temperature': 21.5, 'isAvailable': True},
        'setpointStatus': {'targetHeatTemperature': 5.0, 'setpointMode': 'FollowSchedule'},
        'activeFaults': [],
    }, {
        ...
        ...
    }
]
and each result from the generator like this:
{'thermostat': 'EMEA_ZONE', 'id': 3432521, 'name': 'Main Room', 'temp': 21.55, 'setpoint': 5.0}
I know Python must have a better way of doing this, but I can't seem to make it fly. Any suggestions would be gratefully received.
I could 'massage' the generator, but there are good reasons why the JSON tree's schema should remain unchanged.
The primary goal is to reduce a number of nested code blocks with a very fancy one-liner!
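One possible way to flatten the nesting, sketched with stand-in data rather than the real client: build a lookup of temperatures keyed by zone id first, then make a single pass over the zones. The function name apply_high_precision is invented here; the zone list is updated in place and its schema is left unchanged.

```python
def apply_high_precision(zones, temperatures):
    """Overwrite zone temperatures in place with higher-precision readings."""
    # Index the readings by id (as a string, since zoneId is a string).
    temps_by_id = {str(t["id"]): t["temp"] for t in temperatures}
    for zone in zones:
        status = zone["temperatureStatus"]
        zone_id = str(zone["zoneId"])
        # Only touch zones that report a temperature and have a reading.
        if status["isAvailable"] and zone_id in temps_by_id:
            status["temperature"] = temps_by_id[zone_id]


# Stand-in data shaped like the examples above.
zones = [
    {"zoneId": "3432521", "name": "Main Room",
     "temperatureStatus": {"temperature": 21.5, "isAvailable": True}},
    {"zoneId": "3432522", "name": "Spare Room",
     "temperatureStatus": {"isAvailable": False}},
]
readings = (r for r in [
    {"thermostat": "EMEA_ZONE", "id": 3432521, "name": "Main Room",
     "temp": 21.55, "setpoint": 5.0},
    {"thermostat": "EMEA_ZONE", "id": 3432522, "name": "Spare Room",
     "temp": 18.02, "setpoint": 5.0},
])
apply_high_precision(zones, readings)
```

Because the readings are consumed once into a dictionary, this also works when they come from a generator, as in the original code.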
I need to test a framework that can observe the state of some JSON HTTP resource (I'm simplifying a bit here) and send information about its changes to a message queue, so that a client of a service based on this framework can reconstruct the actual state without polling the HTTP resource.
It's easy to formulate properties for such a framework. Let's say we have a list of triples State, Diff, Timestamp
gen_states = [(gs1, Nothing, t1), (gs2, Just d1-2, t2), (gs3, Just d2-3, t3), (gs4, Just d3-4, t4)]
and after mirroring all this state to the http resource (used as test double) we gathered [rs1, rd1-2, rd2-3] where r stands for received.
apply [rd1-2, rd2-3] rs1 == gs4, i.e. the final states should be the same.
Also, say the polling interval was longer than the time between changes t3 - t2; then we can lose the diff d2-3, but the state still has to be consistent with the state at the previous poll, gs2 for example. So we can miss some changes, but the received state should be consistent with one of the previous states that existed no later than one polling interval before.
The question is how to create a generator that generates random diffs for a JSON resource, given that the resource is always an array of objects that all have an id key.
For example initial state could look like that
[
{"id": "1", "some": {"complex": "value"}},
{"id": "2", "other": {"simple": "value"}}
]
And the next state
[
{"id": "1", "some": {"complex": "value"}},
{"id": "3", "other": "simple_value"}
]
Which should make diff like
type Id = String
data Diff = Diff {removed :: [Id], added :: [(Id, JsonValue)]}
addedObj = ("3", [aesonQQ| {"id": "3", "other": "simple_value"} |])
Diff ["2"] [addedObj]
I've tried to derive Arbitrary for aeson Object, but got this
<interactive>:15:1: warning: [-Wmissing-methods]
• No explicit implementation for
‘arbitrary’
• In the instance declaration for
‘Arbitrary
(unordered-containers-0.2.8.0:Data.HashMap.Base.HashMap
Data.Text.Internal.Text Value)’
But even if I accomplished that, how would I specify that added should have a new unique id?
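The uniqueness requirement can be illustrated independently of QuickCheck. The sketch below is in Python rather than Haskell and every name in it is made up: fresh ids are drawn from a shared counter, so an added object can never reuse an id seen earlier. It assumes the initial ids do not start with "gen-".

```python
import itertools
import random


def random_diff(state, rng, fresh_ids):
    """Remove a random subset of objects and add up to three new ones."""
    ids = [obj["id"] for obj in state]
    removed = rng.sample(ids, rng.randint(0, len(ids)))
    added = []
    for _ in range(rng.randint(0, 3)):
        new_id = next(fresh_ids)  # never handed out before
        value = rng.choice(["simple_value", "other_value"])
        added.append((new_id, {"id": new_id, "other": value}))
    return {"removed": removed, "added": added}


def apply_diff(diff, state):
    """Apply a diff: drop the removed ids, then append the added objects."""
    removed = set(diff["removed"])
    kept = [obj for obj in state if obj["id"] not in removed]
    return kept + [obj for _, obj in diff["added"]]


def generate_states(initial, steps, seed=0):
    """Produce [(state, diff_or_None), ...], like gen_states without timestamps."""
    rng = random.Random(seed)
    fresh_ids = (f"gen-{n}" for n in itertools.count(1))
    history = [(initial, None)]
    state = initial
    for _ in range(steps):
        diff = random_diff(state, rng, fresh_ids)
        state = apply_diff(diff, state)
        history.append((state, diff))
    return history


initial = [
    {"id": "1", "some": {"complex": "value"}},
    {"id": "2", "other": {"simple": "value"}},
]
history = generate_states(initial, steps=25, seed=1)
```

A QuickCheck generator could follow the same shape by threading a counter (or the set of ids used so far) through the generation of successive diffs.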
Does arangodb provide a utility to list clusters for a given edge definition?
E.g. Given the graph:
Tyrion ----sibling---> Cercei ---sibling---> Jamie
Bran ---sibling--> Arya ---sibling--> Jon
I'd want something like the following:
my_graph._getClusters({edge: "sibling"}) -> [ [Tyrion, Cercei, Jamie], [Bran, Arya, Jon] ]
Provided you have a graph named sibling, the following query will find all paths in the graph that are connected by edges of type sibling and that have a (path) length of 3. This should match the example data you provided:
LET options = {
  followEdges: [
    { type: 'sibling' }
  ]
}
FOR i IN GRAPH_TRAVERSAL('sibling', { }, "outbound", options)
  FILTER LENGTH(i) == 3
  RETURN i[*].vertex._key
Omitting or adjusting the FILTER will also find longer or shorter paths in the graph.
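If the goal is the clusters themselves rather than paths of a fixed length, one option is to fetch the sibling edges and group the vertices on the client side. The following is a plain-Python sketch using union-find, not an ArangoDB utility; the function name clusters and the (from, to) edge-pair format are assumptions made for this example.

```python
def clusters(edges):
    """Group vertices into connected components, ignoring edge direction."""
    parent = {}

    def find(vertex):
        # Register unseen vertices as their own root, then walk to the root,
        # shortening the path along the way.
        parent.setdefault(vertex, vertex)
        while parent[vertex] != vertex:
            parent[vertex] = parent[parent[vertex]]
            vertex = parent[vertex]
        return vertex

    for source, target in edges:
        root_source, root_target = find(source), find(target)
        if root_source != root_target:
            parent[root_target] = root_source

    groups = {}
    for vertex in list(parent):
        groups.setdefault(find(vertex), []).append(vertex)
    return list(groups.values())


# The sibling edges from the question, as (from, to) pairs.
sibling_edges = [
    ("Tyrion", "Cercei"),
    ("Cercei", "Jamie"),
    ("Bran", "Arya"),
    ("Arya", "Jon"),
]
result = clusters(sibling_edges)
```

For the example graph this yields two groups, one per family.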