Which RDF files to load to recreate the main DBpedia SPARQL endpoint locally?

I have spun up a local instance of DBpedia using the Docker images at https://github.com/dbpedia/virtuoso-sparql-endpoint-quickstart and loaded https://databus.dbpedia.org/dbpedia/collections/latest-core, hoping to reproduce the main DBpedia SPARQL endpoint locally.
I understood from the documentation that this is the collection loaded into the main endpoint.
However, the total number of triples differs (808587892 locally versus 1104129087 on the main endpoint), and I can't find a single dbo:wikiPageWikiLink relation locally, while there are 240388379 on the main endpoint.
I'd really appreciate some pointers on how to debug this, or information on which files to load into Virtuoso so that dbo:wikiPageWikiLink relations are available. Thank you.
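For reference, this is roughly how I compared the two endpoints (a sketch using SPARQLWrapper; the local URL is an assumption based on the quickstart's default setup and may need adjusting):

```python
# Sketch: count all triples and dbo:wikiPageWikiLink triples on both endpoints.
# The local URL is an assumption (adjust to your docker-compose port mapping).
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINTS = {
    "local": "http://localhost:8890/sparql",
    "main": "https://dbpedia.org/sparql",
}

QUERIES = {
    "total_triples": "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }",
    "wikiPageWikiLink": (
        "SELECT (COUNT(*) AS ?n) WHERE "
        "{ ?s <http://dbpedia.org/ontology/wikiPageWikiLink> ?o }"
    ),
}

for endpoint_name, url in ENDPOINTS.items():
    sparql = SPARQLWrapper(url)
    sparql.setReturnFormat(JSON)
    for query_name, query in QUERIES.items():
        sparql.setQuery(query)
        bindings = sparql.query().convert()["results"]["bindings"]
        print(endpoint_name, query_name, bindings[0]["n"]["value"])
```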

I have just received an answer to this question here: https://github.com/dbpedia/databus-maven-plugin/issues/150

Related

Basic question on downloading data for a Kubeflow pipeline

I'm a newbie to Kubeflow and have just started exploring. I've set up a microk8s cluster and Charmed Kubeflow, and I have executed a few examples trying to understand the different components. Now I'm trying to set up a pipeline from scratch for a classification problem. The problem I'm facing is handling the download of data.
Could anyone please point me to an example where data (preferably images) is downloaded from an external source?
All the examples that I can find are based on toy datasets from sklearn, MNIST, etc. I'm rather looking for an example using real-world (or nearly real-world) data, for example:
https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip
Thanks in advance for any direction.
I tried exploring multiple Kubeflow examples, blogs, etc. to find an example that uses real data rather than a toy dataset, but couldn't find one.
I've found some Jupyter notebook examples that use !wget to download data in the notebook kernel, but I couldn't figure out how that can be converted into a Kubeflow op step. I presumed func_to_container_op wouldn't work for such a scenario. As a next step I'm going to try using specs.AppDef from torchx to do the download. As I'm a total newbie, I wanted to make sure I'm heading in the right direction.
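A plain-Python download step can in fact be wrapped as a component; a minimal sketch, assuming the KFP v1 SDK (the URL is the cats-and-dogs archive from the question, and the base image is an assumption):

```python
# Sketch: a lightweight KFP component that downloads an archive over HTTP
# and exposes it as an output artifact. Assumes the KFP v1 SDK.
import kfp.dsl as dsl
from kfp.components import OutputPath, func_to_container_op


def download_dataset(url: str, archive_path: OutputPath()):
    """Fetch a file over HTTP and write it to the component's output path."""
    import urllib.request
    urllib.request.urlretrieve(url, archive_path)


download_op = func_to_container_op(download_dataset, base_image="python:3.9")


@dsl.pipeline(name="download-example")
def pipeline(
    url: str = "https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip",
):
    download_task = download_op(url=url)
    # Downstream steps would take the downloaded archive via an InputPath parameter.
```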
I was able to download using wget for direct links, and I was also able to configure k8s secrets and patch the service account with an imagePullSecret so that the downloads run from newly created containers.

Deploying PySpark as a service

I have PySpark code that parallelizes a heavy computation. It takes two files and several parameters, performs the computation, and generates output that is currently stored in a CSV file (down the road it would be ideal to store it in a Postgres database).
Now I want to consume this functionality as a service from a system built in Django, from which users will set the parameters of the Spark job, select the two files, and then query the results of the process.
I can think of several ways to cover this, but I don't know which one is the most convenient in terms of simplicity of implementation:
Use the Spark REST API: this would allow a request to be made from Django to the Spark cluster. The problem is that I found no official documentation; everything I find comes from blogs whose parameters and experiences correspond to one particular situation or solution. None of them mention, for example, how I could send the files to be consumed by Spark through the API, or how to retrieve the result.
Develop a simple API on the Spark master to receive all the parameters and execute the spark-submit command at the OS level (see the sketch below). The awkward part of this solution is that everything must be handled at the OS level: the parameter files and the final result must be saved on disk and be accessible to the Django server, which then reads them to store the information in the DB.
Integrate the Django app on the master server, writing the PySpark code inside it, so that Spark connects to the master and runs the code that manipulates the RDDs. This scheme does not convince me because it sacrifices the independence between Spark and the Django application, which is already huge.
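To make the second option concrete, here is a minimal sketch of what I have in mind: a small HTTP wrapper on the master that accepts the two files plus parameters and shells out to spark-submit (Flask, the endpoint path, and the job script location are placeholders, not something I have running):

```python
# Sketch of option 2: a tiny HTTP service on the Spark master that accepts the
# two input files and parameters, then shells out to spark-submit.
# Flask, the route, and the job script path are placeholders.
import subprocess
import tempfile
from pathlib import Path

from flask import Flask, jsonify, request

app = Flask(__name__)
JOB_SCRIPT = "/opt/jobs/heavy_computation.py"  # placeholder path to the PySpark job


@app.route("/jobs", methods=["POST"])
def submit_job():
    # Save the two uploaded files to a scratch directory the job can read.
    workdir = Path(tempfile.mkdtemp(prefix="spark-job-"))
    for name in ("file_a", "file_b"):
        request.files[name].save(workdir / name)

    # Extra parameters arrive as ordinary form fields.
    params = request.form.get("params", "")

    # Launch the job; a real service would run this asynchronously and let the
    # caller poll for completion instead of blocking the request.
    result = subprocess.run(
        ["spark-submit", JOB_SCRIPT,
         str(workdir / "file_a"), str(workdir / "file_b"), params],
        capture_output=True, text=True,
    )
    return jsonify({"returncode": result.returncode,
                    "stdout": result.stdout[-2000:],
                    "stderr": result.stderr[-2000:]})
```

Django would then POST the files to this endpoint, and the Spark job itself could write its results straight to Postgres, so nothing would need to be shared on disk between the two servers.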
If someone could enlighten me about this, I would appreciate it; maybe due to lack of experience I am overlooking a cleaner, more robust, or more idiomatic solution.
Thanks in advance

How do I use Node.js to read server file within a Django application?

I have a Django application that I use to visualize data. The data itself is in an Amazon S3 bucket, and the plotting is done using bokeh. One of the apps, however, uses d3 instead. While I managed to configure the bokeh apps to work properly, I don't know how to do the same for the d3 app. It has to read data either directly from S3 or locally (if I download it within the Django views before rendering the web page) and plot it, but whatever I try, I get a 404 Not Found error.
I'm learning as I go, so I don't know what I'm doing wrong. Going through SO, I found this question, which gives an example of downloading a file from S3 using Node.js, but I am already running a Django server, so I don't know whether that works for me. I should also mention that the files to be read are quite large (several megabytes). So, to summarize, my question is:
Is using Node.js my only solution here, and can it be done without having both Node.js and Django running at the same time? I'm worried that this might be too complex for me to set up. Or, better yet, what would be a recommended solution in my case? I am almost done with the whole project but, unfortunately, I've gotten stuck pretty badly here.
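For what it's worth, one common pattern that avoids Node.js entirely is to have the Django view generate a presigned S3 URL and let d3 fetch the object directly from S3. A rough sketch, assuming boto3 is available; the bucket, key, and template names are placeholders:

```python
# Rough sketch: a Django view that hands the template a short-lived presigned
# S3 URL, so d3 can fetch the (large) file directly from S3 without Node.js.
# Bucket name, object key, and template name are placeholders; the bucket's
# CORS policy must allow the site's origin for the browser fetch to succeed.
import boto3
from django.shortcuts import render

s3 = boto3.client("s3")


def d3_plot(request):
    data_url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-data-bucket", "Key": "plots/large_dataset.csv"},
        ExpiresIn=3600,  # valid for one hour
    )
    # In the template, d3 would load it with: d3.csv(dataUrl).then(draw);
    return render(request, "viz/d3_plot.html", {"data_url": data_url})
```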
Thank you to anyone willing to help or offer advice.

Is there a way to log stats/artifacts from an AWS Glue job using MLflow?

Could you please let me know if any such feature is available in the current version of MLflow?
I think the general answer here is that you can log arbitrary data and artifacts from your experiment to your MLflow tracking server using mlflow_log_artifact() or mlflow_set_tag(), depending on how you want to do it. If there's an API to get data from Glue and you can fetch it during your MLflow run, then you can log it: write a CSV or save a .png to disk and log that, or store the value in a variable and use it when you set the tag.
This applies to Glue or any other API you get a response from. One of the key benefits of MLflow is that it is such a general framework, so you can track whatever matters to that particular experiment.
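A minimal sketch of that idea in Python (the names above are the R API; mlflow.log_artifact() and mlflow.set_tag() are the Python equivalents), assuming boto3 access to the Glue API and a reachable tracking server; the job name and run id are placeholders:

```python
# Sketch: fetch a Glue job run via boto3 and log its stats to MLflow.
# Job name, run id, and file names are placeholders.
import json

import boto3
import mlflow

glue = boto3.client("glue")

with mlflow.start_run(run_name="glue-etl-run"):
    run = glue.get_job_run(JobName="my-glue-job", RunId="jr_0123456789abcdef")["JobRun"]

    # Tags, params, and metrics for quick filtering in the MLflow UI.
    mlflow.set_tag("glue_job_state", run["JobRunState"])
    mlflow.log_param("glue_job_name", "my-glue-job")
    mlflow.log_metric("execution_time_seconds", run.get("ExecutionTime", 0))

    # Dump the full JobRun response and attach it as an artifact.
    with open("glue_job_run.json", "w") as f:
        json.dump(run, f, default=str)
    mlflow.log_artifact("glue_job_run.json")
```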
Hope this helps!

How to set up a local DBpedia server and use it

I am wondering if there is a tutorial on how I could get a local copy of the DBpedia knowledge base up and running.
I understand that I'll have to download from the DBpedia website all the .nt files that I would like to include, but I don't really get how to utilize them.
From other questions I found out that there are interfaces for different languages for using this, but I don't know how to find the right one for Node.js or Java. I found a lot of libraries that work with RDF data and the SPARQL query language, but I don't get how to connect them with the .nt files.
Can anyone give a short introduction on where to start setting up the knowledge base for, e.g., Node.js?
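The general pattern, sketched here in Python with rdflib purely for illustration (Node.js has comparable RDF/SPARQL client libraries and Java has Apache Jena), is to parse an .nt file into a store and run SPARQL queries over it; for the full DBpedia dumps you would instead load the files into a triple store such as Virtuoso and query its HTTP SPARQL endpoint. The file name below is a placeholder:

```python
# Sketch of the general pattern: parse a small N-Triples dump into an in-memory
# RDF graph and query it with SPARQL. Full DBpedia dumps belong in a triple
# store such as Virtuoso instead. The file name is a placeholder.
from rdflib import Graph

g = Graph()
g.parse("labels_lang=en.nt", format="nt")  # placeholder .nt dump

query = """
    SELECT ?s ?label WHERE {
        ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
    } LIMIT 5
"""
for subject, label in g.query(query):
    print(subject, label)
```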
