I'm struggling to get a betweenness centrality computation done on a graph in AWS Neptune using Gremlin. I am using the Gremlin recipe:
g.V().as("v").
repeat(both().simplePath().as("v")).emit().
filter(project("x","y","z").by(select(first, "v")).
by(select(last, "v")).
by(select(all, "v").count(local)).as("triple").
coalesce(select("x","y").as("a").
select("triples").unfold().as("t").
select("x","y").where(eq("a")).
select("t"),
store("triples")).
select("z").as("length").
select("triple").select("z").where(eq("length"))).
select(all, "v").unfold().
groupCount().next()
This works fine on the toy graph, also from the recipe:
g.addV().property(id,'A').as('a').
addV().property(id,'B').as('b').
addV().property(id,'C').as('c').
addV().property(id,'D').as('d').
addV().property(id,'E').as('e').
addV().property(id,'F').as('f').
addE('next').from('a').to('b').
addE('next').from('b').to('c').
addE('next').from('b').to('d').
addE('next').from('c').to('e').
addE('next').from('d').to('e').
addE('next').from('e').to('f').iterate()
However, when using the MovieLens 100k dataset, as pre-loaded in Neptune, it does not finish in a reasonable time. Presumably this is because the graph is too large and betweenness centrality is too computationally expensive.
The graph is loaded correctly in Neptune and I can do standard traversals.
So my question: for larger graphs, can Gremlin be used, or do we need to rely on NetworkX and other packages, or is there a solution in SPARQL? Which is the most efficient method to compute centrality metrics in AWS Neptune? Thanks a lot!
Gremlin is probably not the best solution for this. You can try the networkx or igraph Python libraries for the graph algorithms:
https://aws.amazon.com/about-aws/whats-new/2022/06/amazon-neptune-simplifies-graph-analytics-machine-learning-workflows-python-integration/
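As a rough illustration of what such libraries compute, here is a minimal pure-Python sketch of Brandes' algorithm for betweenness centrality on an unweighted, undirected graph, run on the toy graph from the question. The function name `betweenness` and the adjacency-dict layout are assumptions made for this example and are not Neptune or NetworkX APIs. In practice, `networkx.betweenness_centrality` provides this, with an optional `k` parameter that approximates the result from a sample of source nodes on larger graphs.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes) for an undirected,
    unweighted graph given as {node: [neighbours]}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma) to each node.
        order = []                      # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}    # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:                 # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:      # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was visited from both endpoints, so halve the totals.
    return {v: c / 2 for v, c in bc.items()}


# Toy graph from the question, with edges treated as undirected (as both() does).
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

scores = betweenness(adj)
print(scores)
```

The sketch runs one breadth-first search per vertex, so its cost grows roughly with (number of vertices) x (number of edges). Note that the scores are the standard unnormalized betweenness values and are not on the same scale as the counts returned by the Gremlin recipe's `groupCount()`.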
Related
Currently I am working on big-graph analysis beyond MapReduce/Spark.
The graph is too big to analyze, so I want to remove some central nodes, splitting the big graph into several parts.
I tried NetworkX in Python, but it failed because the graph is too big.
Are there any solutions for doing this beyond Spark or MapReduce?
I'm a newbie to the GIS world, but after a few weeks of hard searching and trying I feel stuck.
I am building an app to analyse landcover to assess terrain roughness. All the logic will come later but now I try to design the architecture.
A sample raster image is under the link:
Africa landcover 20m resolution - small sample
a visualization of the problem
Goal:
- read raster file (BigTIFF/GeoTIFF, ...around 6GB) from a cloud storage, like AWS S3
- use a javascript library to process the file, like node-gdal, geotiff.js, etc.
- apply a vector polygon to the raster and count the different pixels within the polygon. This will be a "circle" with a radius of 3km, for instance. Also make some histograms, to see which pixels are dominant within a quadrant or 1/8 of the area.
- do some maths in JS with the data (about this I have no concerns)
- visualize the raster image on a map including the vector polygon
- push the code to production for multi users.
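The pixel-counting goal above can be sketched without any GIS library. The following is a minimal illustration in Python (chosen only for the sketch, not as a recommendation over Node.js). It assumes the landcover classes have already been read into a small in-memory grid; the function name `count_classes_in_circle`, the sample grid, and the tiny radius are all made up for the example. With a real raster, the grid would presumably be a window read around the point of interest and not the whole 6GB file.

```python
from collections import Counter


def count_classes_in_circle(grid, center_row, center_col, radius_m, pixel_size_m=20.0):
    """Count landcover class values whose pixel centre lies within
    radius_m of the pixel at (center_row, center_col)."""
    radius_px = radius_m / pixel_size_m
    counts = Counter()
    # Restrict the scan to the bounding box of the circle, clipped to the grid.
    row_min = max(0, int(center_row - radius_px))
    row_max = min(len(grid) - 1, int(center_row + radius_px))
    for r in range(row_min, row_max + 1):
        row = grid[r]
        col_min = max(0, int(center_col - radius_px))
        col_max = min(len(row) - 1, int(center_col + radius_px))
        for c in range(col_min, col_max + 1):
            # Keep only pixels inside the circle (distance measured in pixels).
            if (r - center_row) ** 2 + (c - center_col) ** 2 <= radius_px ** 2:
                counts[row[c]] += 1
    return counts


# Hypothetical 5x5 landcover grid at 20 m resolution (class codes are invented).
grid = [
    [1, 1, 1, 2, 2],
    [1, 1, 1, 2, 2],
    [1, 1, 3, 2, 2],
    [3, 3, 3, 2, 2],
    [3, 3, 3, 3, 3],
]

# A 20 m radius covers the centre pixel and its four direct neighbours.
histogram = count_classes_in_circle(grid, center_row=2, center_col=2, radius_m=20.0)
print(dict(histogram))
```

The same loop could be extended to per-quadrant histograms by keying the counter on the sign of `r - center_row` and `c - center_col` as well as the class value.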
My skill set and limited experience, as well as my preferences for solving the challenge, are with the following:
Node + Express (Javascript)
Node-gdal library
PostgreSQL
Heroku for production
Mapbox for visualizing the maps
So far I have run into the following issues while looking for the right architecture or working code:
node-gdal library can read only from the file system
geotiff.js can do the job, but it has less documentation and I cannot see how to handle the specific tasks later
PostGIS should be very powerful but is a bit cumbersome to set up. Furthermore, I am not sure it is worth loading a single TIFF raster into it
Rasterio in Python can do very nice jobs and I managed it in a Jupyter Notebook, but I have no experience with Flask or similar frameworks. I prefer Node.js
Turf.js can do a lot but more for vectors, I could not find modules for raster analysis
Thank you for your suggestions.
I got a Facebook list of user IDs from the following page:
Stanford Facebook-Data
If you look at the facebook_combined data, you can see that it is a list of user-connections (edges). So for instance user 0 has something to do with user 1,2,3 and so on.
Now my work is to find clusters in the dataset.
In the first step I used Node.js to read the file and save the data in an array like this:
array=[[0,1],[0,2], ...]
In the second step I used a k-means plugin for Node.js to cluster the data:
Cluster-Plugin
But I don't know if the result is right, because now I get clusters of edges and not clusters of users.
UPDATE:
I am trying out a Markov clustering implementation for Node.js. The Markov plugin, however, needs an adjacency matrix to build clusters. I implemented an algorithm in Java to save the matrix in a file.
Maybe you have another suggestion for how I could get clusters out of edges.
K-means assumes your input data is in an R^d vector space.
In fact, it requires the data to be this way, because it computes means as cluster centers, hence the name k-means.
So if you want to use k-means, then you need
One row per datapoint (not an edge list)
A fixed dimensionality data space where the mean is a useful center (usually, you should have continuous attributes, on binary data the mean does not make too much sense) and where least-squares is a meaningful optimization criterion (again, on binary data, least-squares does not have a strong theoretical support)
On your Facebook data, you could try some embedding, but I'd have doubts about the trustworthiness.
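To make the "one row per datapoint" requirement concrete, here is a minimal Python sketch that turns an edge list into an adjacency matrix with one row per user. The function name `edges_to_adjacency_matrix` is hypothetical, the sample edges beyond `[0,1]` and `[0,2]` are invented, and the connections are assumed to be undirected. The rows are binary, so the caveats above about means and least-squares on binary data still apply. The matrix is also the kind of input the Markov plugin mentioned in the question asks for.

```python
def edges_to_adjacency_matrix(edges):
    """Convert an undirected edge list into (sorted node ids, 0/1 matrix)."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        # Undirected: mark the connection in both rows.
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1
    return nodes, matrix


# Small made-up edge list in the same format as the question's array.
edges = [[0, 1], [0, 2], [1, 2], [3, 4]]
nodes, matrix = edges_to_adjacency_matrix(edges)

for node, row in zip(nodes, matrix):
    print(node, row)
```

A dense matrix like this needs memory proportional to the square of the number of users, so for large graphs a sparse representation would likely be preferable.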
I need to use the GraphX Pregel API to run computations on an undirected graph. Would it ignore the directionality of the graph if I set
activeDirection = EdgeDirection.Either?
Looks like activeDirection is only used to determine whether or not to run "sendMsg" in the next iteration according to this source
Further, this post seems to suggest undirected graphs are not supported.
Finally, my experiments confirm what these guys are saying...
I want to cluster 1.5 million chemical compounds. This means having a 1.5 million x 1.5 million distance matrix...
I think I can generate such a big table using pyTables, but once I have such a table, how will I cluster it?
I guess I can't just pass a pyTables object to one of the scikit-learn clustering methods...
Are there any Python-based frameworks that would take my huge table and do something useful (like clustering) with it? Perhaps in a distributed manner?
Maybe you should look at algorithms that don't need a full distance matrix.
I know that it is popular to formulate algorithms as matrix operations, because tools such as R are rather fast at matrix operation (and slow on other things). But there is a whole ton of methods that don't require O(n^2) memory...
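As one hedged illustration of a method that never materializes the full matrix, here is a minimal sketch of single-pass "leader" clustering: each item is compared only against the current cluster representatives, so memory grows with the number of clusters and not with n^2. Representing compounds as sets of fingerprint bit indices, using Tanimoto similarity, and the threshold value are all assumptions for the example; the sample fingerprints are invented.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_cluster(fingerprints, threshold):
    """Assign each item to the first leader it is similar enough to,
    or make it a new leader. Returns one cluster label per item."""
    leaders = []   # indices of the items acting as cluster representatives
    labels = []
    for i, fp in enumerate(fingerprints):
        for cluster_id, leader_idx in enumerate(leaders):
            if tanimoto(fp, fingerprints[leader_idx]) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            leaders.append(i)
            labels.append(len(leaders) - 1)
    return labels


# Made-up fingerprints (sets of on-bit positions) for five compounds.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {1, 2, 3, 4, 5}]
labels = leader_cluster(fps, threshold=0.5)
print(labels)
```

The result depends on the input order and on the threshold, so this is a rough first pass and not a substitute for a carefully chosen clustering method.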
I think the main problem is memory: 1.5 million x 1.5 million x 10 B (size of one element) = 22.5 TB, which is more than 20 TB.
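The estimate can be checked with a few lines of Python. The 10-byte element size is the figure assumed in this answer; the second number, for storing only one triangle of the matrix, is an added observation that relies on the distance matrix being symmetric.

```python
n = 1_500_000             # number of compounds
bytes_per_element = 10    # element size assumed in the estimate

# Full n x n matrix.
full_matrix_bytes = n * n * bytes_per_element
print(full_matrix_bytes / 1e12, "TB for the full matrix")

# Only the entries above the diagonal, since d(i, j) == d(j, i).
condensed_bytes = n * (n - 1) // 2 * bytes_per_element
print(condensed_bytes / 1e12, "TB for one triangle")
```

Even the halved figure is still in the terabyte range, which is why the storage format alone does not solve the problem.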
You can use big-data tools like pyTables or Hadoop (http://en.wikipedia.org/wiki/Apache_Hadoop) together with a MapReduce algorithm.
Here are some guides: http://strata.oreilly.com/2013/03/python-data-tools-just-keep-getting-better.html
Or use Google App Engine Datastore with MapReduce (https://developers.google.com/appengine/docs/python/dataprocessing/), but at the moment it is not a production version.