Datomic - select entities with the highest value of some attribute

Let's say I have a list of movies and I can get them all this way:
[:find ?year ?title ?rating
 :where
 [?Movie :movie/year ?year]
 [?Movie :movie/title ?title]
 [?Movie :movie/rating ?rating]]
How can I further restrict this to get only the movie with the highest rating? I feel like I want something like...
[:find ?year ?title ?rating
 :where
 [?Movie :movie/year ?year]
 [?Movie :movie/title ?title]
 [?Movie :movie/rating ?rating]
 [(= ?rating (max ?rating))]]
But obviously that won't do what I want =) Tips?

It's OK to do this sort of thing with two queries in Datomic. Because peers cache database values locally, there is less pressure to get as much as possible done in a single query.
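[:find (max ?year) .
 :where [?m :movie/year ?year]]
Followed by:
[:find ?maxyear ?title ?rating
 :in $ ?maxyear
 :where [?m :movie/year ?maxyear]
        [?m :movie/title ?title]
        [?m :movie/rating ?rating]]
This example selects by the latest year; for the highest rating, the same pattern applies with :movie/rating in place of :movie/year. The chained queries in Clojure could look like: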
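(let [db (d/db conn)]
  ;; ->> passes the first query's result (the max year) as the last
  ;; argument of the second query, where it binds ?maxyear
  (->> (d/q '[:find (max ?year) .
              :where [?m :movie/year ?year]]
            db)
       (d/q '[:find ?maxyear ?title ?rating
              :in $ ?maxyear
              :where [?m :movie/year ?maxyear]
                     [?m :movie/title ?title]
                     [?m :movie/rating ?rating]]
            db)))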
Notice that we get the db value from the connection only once, in the let binding, so both queries run against the same database value. It is also possible to use collection-returning functions to get the result you've mentioned with the right constraints, but this may not behave how you'd expect, as with this example that gets the latest Beatles release:
(d/q '[:find (max ?tuple)
       :where [?e :artist/name "The Beatles"]
              [?a :release/artists ?e]
              [?a :release/year ?y]
              [?a :release/name ?n]
              [(vector ?y ?n) ?tuple]]
     (d/db conn))
This relies on the implicit ordering of the tuple, which compares the first element (the year) first. But how does it behave when several releases tie for the max year? This returns:
[[[2011 "Love"]]]
But without max, we can see the one pulled by max was not unique:
[[2011 "1"]] [[2011 "Love"]]
If you know how many to expect, you could set, e.g.:
:find (max 2 ?tuple)
But this is taking us down a road best avoided. For most instances it makes sense to prefer the simple, more robust case of combining queries.
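To see why the tuple aggregate returns a single row while the two-step approach keeps ties, here is a minimal Python sketch of the same logic over plain tuples. It is not Datomic code: the two 2011 rows come from the output above, and the 2009 row is a made-up placeholder added only so there is something to filter out.

```python
# Plain-Python model of the two strategies; these are not Datomic API calls.
releases = [
    (2011, "1"),
    (2011, "Love"),
    (2009, "Placeholder Release"),  # hypothetical row, for illustration only
]

# Strategy 1: max over (year, name) tuples. Tuples compare element by
# element, so a tie on year is broken by name and only one row survives.
single = max(releases)

# Strategy 2: find the max year first, then select every row with that year.
max_year = max(year for year, _ in releases)
all_latest = [row for row in releases if row[0] == max_year]

print(single)      # a single row, even though two releases share the max year
print(all_latest)  # every row that shares the max year
```

The second strategy is the two-query approach in miniature: the tie is preserved because the filter runs on the year alone.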

Related

iterating through a list to look up data, and construct a string

Elisp newbie, looking for help with this.
I have this variable:
(setq bibtex-completion-additional-search-fields '(tags keywords))
I then have a function which, if this variable is set, needs to iterate through those field names, look each one up in a data record, and concatenate the resulting values into a string, which it returns.
Here's what the data looks like:
("2009-03-01 Zukin, Sharon and Trujillo, Valerie and Frase, Peter and Jackson, Danielle and Recuber, Tim and Walker, Abraham gentrification New Retail Capital and Neighborhood Change: Boutiques and Gentrification in New York City article zukin_new_2009"
("date" . "2009-03-01")
("author" . "Zukin, Sharon and Trujillo, Valerie and Frase, Peter and Jackson, Danielle and Recuber, Tim and Walker, Abraham")
("tags" . "gentrification, retail")
("title" . "New {{Retail Capital}} and {{Neighborhood Change}}: {{Boutiques}} and {{Gentrification}} in {{New York City}}")
("=type=" . "article")
("=key=" . "zukin_new_2009"))
This is what I have for the function ATM, which I know is wrong. But I can't wrap my head around how to do this in elisp (I have more experience with Python and Ruby).
(defun bibtex-completion--get-extra-search-data (candidate)
  "Return extended search metadata as string."
  (if bibtex-completion-additional-search-fields
      ;; if the data is present, pull its value(s), join into a single string
      ;; TODO FIX ME, this is wrong
      (format "%s" (cl-loop
                    for field in bibtex-completion-additional-search-fields
                    collect
                    (cdr (assoc field (cdr candidate)))))))
So with the example data above, the function should return the string "gentrification, retail". And if that record also had a keywords field with "foo", the return string would be "gentrification, retail, foo" (or could just be space-separated; not sure it matters).
First, the keys in your data structure are strings, not symbols, so looking up the symbols tags and keywords with assoc finds nothing. You could change your lookup fields to strings:
(setq bibtex-completion-additional-search-fields '("tags" "keywords"))
That said, using symbols as the cars in the candidate data structure is probably better (efficiency-wise, I believe).
The canonical elisp for joining a list into a string is (mapconcat #'identity ...):
(mapconcat
 #'identity
 ;; drop the nil entries produced by fields missing from the record
 (delq nil
       (cl-loop for field in bibtex-completion-additional-search-fields
                collect (cdr (assoc field (cdr candidate)))))
 ", ")

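Since the question mentions more experience with Python, here is a rough Python rendering of the same lookup-and-join logic, as a sketch only. The names SEARCH_FIELDS and extra_search_data are made up for this illustration, and the candidate record is abbreviated from the example data above.

```python
# Python rendering of the lookup-and-join logic; all names are illustrative.
SEARCH_FIELDS = ["tags", "keywords"]

def extra_search_data(candidate, fields=SEARCH_FIELDS):
    """Join the values of the given fields found in a candidate record."""
    # candidate[0] is the display string; the rest are (key, value) pairs,
    # which corresponds to (cdr candidate) in the elisp version.
    record = dict(candidate[1:])
    # Skip fields the record lacks, much like (delq nil ...) does.
    values = [record[field] for field in fields if field in record]
    return ", ".join(values)

candidate = [
    "display string (abbreviated here)",
    ("date", "2009-03-01"),
    ("tags", "gentrification, retail"),
    ("=type=", "article"),
    ("=key=", "zukin_new_2009"),
]
print(extra_search_data(candidate))
```

The string keys here play the same role as the string cars in the alist: the lookup only succeeds when the field names are strings too.
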
SPARQL CONSTRUCT trying to BIND yes/no values from conditional sub query

Continued on from another question here...
I have a(n excerpt from a) construct query below that is successfully pulling records as desired.
CONSTRUCT {
?publication fb:type ?type;
fb:publicationLabel ?publicationLabel;
fb:publicationType ?publicationTypeLabel;
fb:publicationLink ?publicationLink;
}
WHERE {
?publication a bibo:Document .
?publication rdfs:label ?publicationLabel .
?publication vitro:mostSpecificType ?publicationType .
?publicationType rdfs:label ?publicationTypeLabel .
?publication obo:ARG_2000028 ?vcard .
?vcard vcard:hasURL ?urllink .
?urllink vcard:url ?publicationLink
}
The above query (trimmed down a bit) currently works fine. I’m now trying to add the following variable: fb:linkInternalExists
To this variable, I want to bind the output of a conditional subquery that looks for a value (we’ll say “internal.url” for this example) within all the possible ?publicationLink values for a specific ?publication.
So the RDF output with the desired addition could return something like the following:
<rdf:Description rdf:about="https://abcd.fgh/individual/publication12345">
  <fb:publicationLabel>example record 1</fb:publicationLabel>
  <fb:publicationType>journal</fb:publicationType>
  <fb:publicationLink>http://external.url/bcde</fb:publicationLink>
  <fb:publicationLink>http://external.url/abcd</fb:publicationLink>
  <fb:linkInternalExists>No</fb:linkInternalExists>
</rdf:Description>
<rdf:Description rdf:about="https://abcd.fgh/individual/publication23456">
  <fb:publicationLabel>example record 2</fb:publicationLabel>
  <fb:publicationType>conference paper</fb:publicationType>
  <fb:publicationLink>http://external.url/2345</fb:publicationLink>
  <fb:publicationLink>http://external.url/1234</fb:publicationLink>
  <fb:publicationLink>http://internal.url/1234</fb:publicationLink>
  <fb:linkInternalExists>Yes</fb:linkInternalExists>
</rdf:Description>
My attempts at adding the required subquery to the above and binding its output to fb:linkInternalExists have been unsuccessful. So my question is: what would the modified query look like?
Regards
You don't actually need a subquery for this. All you need is an OPTIONAL pattern combined with a BIND expression.
The optional pattern should specifically look to find an internal link, like so:
OPTIONAL {
  ?vcard vcard:hasURL ?internal .
  ?internal vcard:url ?internalLink .
  FILTER(CONTAINS(STR(?internalLink), "internal.url"))
}
or more concisely:
OPTIONAL {
  ?vcard vcard:hasURL/vcard:url ?internalLink .
  FILTER(CONTAINS(STR(?internalLink), "internal.url"))
}
This clause will bind a value to ?internalLink if such a link exists, and leave it unbound otherwise. To then convert that to the output form you want, you can add the following conditional BIND-clause:
BIND (IF(BOUND(?internalLink), "Yes", "No") as ?internalLinkExists)
And then of course finally add the following to your CONSTRUCT-clause:
?publication fb:linkInternalExists ?internalLinkExists .
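As a rough model of what the OPTIONAL plus IF(BOUND(...)) combination computes, the following Python sketch applies the same decision to the two example records from the question. It is plain Python, not SPARQL, and the function name link_internal_exists is made up for this illustration.

```python
# Models the OPTIONAL + IF(BOUND(...)) logic in plain Python.
links_by_publication = {
    "publication12345": [
        "http://external.url/bcde",
        "http://external.url/abcd",
    ],
    "publication23456": [
        "http://external.url/2345",
        "http://external.url/1234",
        "http://internal.url/1234",
    ],
}

def link_internal_exists(links, needle="internal.url"):
    # The OPTIONAL block binds ?internalLink only if some link matches;
    # IF(BOUND(...)) then maps bound -> "Yes" and unbound -> "No".
    internal = next((link for link in links if needle in link), None)
    return "Yes" if internal is not None else "No"

flags = {pub: link_internal_exists(links)
         for pub, links in links_by_publication.items()}
print(flags)
```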
Upon trying Jeen Broekstra's approach, the query timed out, but it led me to trying other ways to isolate for the internalLink.
I tried the following instead, pulling both the publicationLink and the internalLink variables from distinct UNIONs.
{
?publication a bibo:Document.
?publication obo:ARG_2000028 ?vcard.
?vcard vcard:hasURL ?urllink.
?urllink vcard:url ?publicationLink .
}
UNION {
?publication a bibo:Document .
?publication obo:ARG_2000028 ?vcard .
?vcard vcard:hasURL/vcard:url ?internalLink .
FILTER(CONTAINS(STR(?internalLink), "internal.url"))
}
BIND (IF(BOUND(?internalLink), "Yes", "No") as ?internalLinkExists)
This successfully returned values for ?internalLink, and then the BIND added the Yes/No variable. Job done!

Query for best match to a string with SPARQL?

I have a list of movie titles and want to look these up in DBpedia for meta information like "director". But I have trouble identifying the correct movie with SPARQL, because the titles sometimes don't match exactly.
How can I get the best match for a movie title from DBpedia using SPARQL?
Some problematic examples:
My List: "Die Hard: with a Vengeance" vs. DBpedia: "Die Hard with a Vengeance"
My List: "Hachi" vs. DBpedia: "Hachi: A Dog's Tale"
My current approach is to query the DBpedia endpoint for all movies and then filter by checking for single tokens (without punctuation), order by title, and return the first result. E.g.:
SELECT ?resource ?title ?director WHERE {
?resource foaf:name ?title .
?resource rdf:type schema:Movie .
?resource dbo:director ?director .
FILTER (
contains(lcase(str(?title)), "die") &&
contains(lcase(str(?title)),"hard")
)
}
ORDER BY (?title)
LIMIT 1
This approach is very slow and also sometimes fails, e.g.:
SELECT ?resource ?title ?director WHERE {
?resource foaf:name ?title .
?resource rdf:type schema:Movie .
?resource dbo:director ?director .
FILTER (
contains(lcase(str(?title)), "hachi")
)
}
ORDER BY (?title)
LIMIT 10
where the correct result is in second place:
resource                                            title                       director
http://dbpedia.org/resource/Chachi_420              "Chachi 420"@en             http://dbpedia.org/resource/Kamal_Haasan
http://dbpedia.org/resource/Hachi:_A_Dog's_Tale     "Hachi: A Dog's Tale"@en    http://dbpedia.org/resource/Lasse_Hallström
http://dbpedia.org/resource/Hachiko_Monogatari      "Hachikō Monogatari"@en     http://dbpedia.org/resource/Seijirō_Kōyama
http://dbpedia.org/resource/Thachiledathu_Chundan   "Thachiledathu Chundan"@en  http://dbpedia.org/resource/Shajoon_Kariyal
Any ideas how to solve this problem? Or even better: How to query for best matches to a string with SPARQL in general?
Thanks!
I adapted the regex-approach mentioned in the comments and came up with a solution that works pretty well, better than anything I could get with bif:contains:
SELECT ?resource ?title ?match (strlen(str(?title)) AS ?lenTitle) (strlen(str(?match)) AS ?lenMatch)
WHERE {
?resource foaf:name ?title .
?resource rdf:type schema:Movie .
?resource dbo:director ?director .
bind( replace(LCASE(CONCAT('x',?title)), "^x(die)*(?:.*?(hard))*(?:.*?(with))*.*$", "$1$2$3") as ?match )
}
ORDER BY DESC(?lenMatch) ASC(?lenTitle)
LIMIT 5
It's not perfect, so I'm still open to suggestions.
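For comparison, the same ranking idea (most matched text first, shortest title as tie-breaker) can be sketched client-side in Python. This is a simplified stand-in that matches whole tokens rather than reproducing the regex above; the function names are made up, and the candidate list mixes the titles from the question with "Die Hard", which is added here only as an illustrative competitor.

```python
import re

def tokens(text):
    """Lowercase and split on anything that is not an ASCII letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_titles(query, titles):
    """Sort by total length of matched query tokens (desc), then title length (asc)."""
    query_tokens = set(tokens(query))

    def sort_key(title):
        title_tokens = set(tokens(title))
        matched = sum(len(t) for t in query_tokens & title_tokens)
        return (-matched, len(title))

    return sorted(titles, key=sort_key)

titles = [
    "Chachi 420",
    "Hachi: A Dog's Tale",
    "Hachikō Monogatari",
    "Thachiledathu Chundan",
    "Die Hard",  # illustrative extra candidate, not from the question
    "Die Hard with a Vengeance",
]
print(rank_titles("Hachi", titles)[0])
print(rank_titles("Die Hard: with a Vengeance", titles)[0])
```

Because it compares whole tokens, "hachi" no longer matches inside "Chachi" or "Thachiledathu", which is the failure shown in the contains-based query. The trade-off is that it needs the candidate titles on the client, so it assumes a first, broader query has already narrowed them down.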

Summer in Greece with SPARQL

I want to pose a query that, for every region of Greece, counts the best bathing waters (i.e. the number of waters that show perfect quality). So the (ordered) result should be something like:
Crete "2048"^^<http://www.w3.org/2001/XMLSchema#integer> # Crete has 2048 perfect bathing waters
Santorini "1024"^^<http://www.w3.org/2001/XMLSchema#integer>
..
The problem for me is how to get the bathing waters related to a region. After that, I need to work out how to collect the separate counts. I know how to order. Let's assume that ?concie_0 determines the quality; if it is > 40, then the water is of perfect quality. Here is what I have so far:
SELECT ?municipality ?bw
WHERE {
?regional_unit geo:έχει_επίσημο_όνομα "ΠΕΡΙΦΕΡΕΙΑΚΗ ΕΝΟΤΗΤΑ ΗΡΑΚΛΕΙΟΥ" .
?municipality geo:ανήκει_σε ?regional_unit .
?municipality geo:έχει_γεωμετρία ?geometry .
?bw geos:hasGeometry ?bw_geo .
?bw_geo geos:asWKT ?bw_geo_wkt .
FILTER(strdf:within(?geometry, ?bw_geo_wkt)) .
?bw unt:has_concie_0 ?concie_0 .
FILTER(?concie_0 > 40)
}
LIMIT 15
which gives:
municipality bw
http://geo.linkedopendata.gr/gag/id/9302 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/340
http://geo.linkedopendata.gr/gag/id/9302 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/456
http://geo.linkedopendata.gr/gag/id/9302 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/972
http://geo.linkedopendata.gr/gag/id/9302 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/1041
http://geo.linkedopendata.gr/gag/id/9302 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/1365
http://geo.linkedopendata.gr/gag/id/9302 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/1849
http://geo.linkedopendata.gr/gag/id/9306 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/340
http://geo.linkedopendata.gr/gag/id/9306 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/456
...
I think that this groups the bathing waters with every regional unit. However, I do not know how to proceed, do you?
All you need to do is change your SELECT clause to include a COUNT, add a GROUP BY clause that groups per municipality, and finally an ORDER BY clause that ensures the highest scores come first. Like this:
SELECT ?municipality (COUNT(?bw) as ?bwCount)
WHERE {
....
}
GROUP BY ?municipality
ORDER BY DESC(?bwCount)
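To make the effect of GROUP BY, COUNT and ORDER BY DESC concrete, here is a small Python sketch that performs the same aggregation over the rows shown in the question. The identifiers are shortened from the full URIs, and because the excerpt is cut off by LIMIT 15 and the trailing "...", the counts below describe only that excerpt, not the real dataset.

```python
from collections import Counter

# (municipality, bathing water) rows, shortened from the excerpt above.
rows = [
    ("gag/id/9302", "id/340"),
    ("gag/id/9302", "id/456"),
    ("gag/id/9302", "id/972"),
    ("gag/id/9302", "id/1041"),
    ("gag/id/9302", "id/1365"),
    ("gag/id/9302", "id/1849"),
    ("gag/id/9306", "id/340"),
    ("gag/id/9306", "id/456"),
]

# GROUP BY ?municipality with COUNT(?bw): one counter entry per municipality.
bw_count = Counter(municipality for municipality, _ in rows)

# ORDER BY DESC(?bwCount): highest counts first.
ranked = sorted(bw_count.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Note that COUNT(?bw) counts result rows, so a bathing water that appears twice for the same municipality would be counted twice; COUNT(DISTINCT ?bw) avoids that in SPARQL.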

SPARQL how to deal with different cased queries?

I am still a bit new to SPARQL. I have set up a DBpedia endpoint for our company. I have no idea what the end user will be querying and, since DBpedia is case sensitive, I pass both title case & uppercase versions for subjects vs something like a person; e.g. "Computer_programming" vs "Alicia_Keys". Rather than pass in 2 separate queries, what is the most efficient way to achieve this? I've tried the IN operator (from this question) but I seem to be failing somewhere.
select ?label ?abstract where {
IN (<http://dbpedia.org/resource/alicia_keys>, <http://dbpedia.org/resource/Alicia_Keys>) rdfs:label ?label;
dbpedia-owl:abstract ?abstract.
}
LIMIT 1
since DBpedia is case sensitive I pass both title case & uppercase
versions for subjects vs something like a person; e.g.
"Computer_programming" vs "Alicia_Keys". Rather than pass in 2 separate
queries what is the most efficient way to achieve this?
URIs should be viewed as opaque. While DBpedia generally has some nice structure, so that you can get lucky by concatenating http://dbpedia.org/resource and some string with _ replacing spaces, that's really not a very robust way to do something. A better idea is to note that the string you're getting is probably the same as a label of some resource, modulo variations in case. Given that, the best idea would be to look for something with the same label, modulo case. E.g.,
select ?resource where {
values ?input { "AliCIA KeYS" }
?resource rdfs:label ?label .
filter ( ucase(str(?label)) = ucase(?input) )
}
That's actually going to be pretty slow, though, because you'll have to find every resource and do some string processing on its label. It's an OK approach in principle, though.
What can be done to make it better? Well, if you know what kind of thing you're looking for, that will help a lot. E.g., you could restrict the query to Persons:
select distinct ?resource where {
values ?input { "AliCIA KeYS" }
?resource rdf:type dbpedia-owl:Person ;
rdfs:label ?label .
filter ( ucase(str(?label)) = ucase(?input) )
}
That's an improvement, but it's still not all that fast. It still, at least conceptually, has to touch each Person and examine their name. Some SPARQL endpoints support text indexing, and that's probably what you need if you want to do this efficiently.
The best option, of course, would be to simply ask your users for a little bit more information, and to normalize the data in advance. If your user provides "AliCIA KEyS", then you can normalize it to "Alicia Keys"@en, and then do something like:
select distinct ?resource where {
  values ?input { "Alicia Keys"@en }
  ?resource rdfs:label ?input .
}
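The "normalize in advance" idea can be sketched in Python as a case-insensitive index that is built once, so each lookup is a plain dictionary hit. This is only an illustration under assumptions: the LABELS table is a tiny hypothetical stand-in for labels you would export from the endpoint, and lookup is a made-up helper name.

```python
# Hypothetical label -> resource table standing in for the endpoint's data.
LABELS = {
    "Alicia Keys": "http://dbpedia.org/resource/Alicia_Keys",
    "Computer programming": "http://dbpedia.org/resource/Computer_programming",
}

# Build the case-insensitive index once, up front.
INDEX = {label.casefold(): resource for label, resource in LABELS.items()}

def lookup(user_input):
    """Return the resource whose label matches the input, ignoring case."""
    # Collapse runs of whitespace, then fold case before the lookup.
    normalized = " ".join(user_input.split()).casefold()
    return INDEX.get(normalized)

print(lookup("AliCIA KEyS"))
print(lookup("computer PROGRAMMING"))
```

The resource found this way can then be used directly in the query, which avoids both guessing at URI casing and filtering over every label at query time.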

Resources