SPARQL: how to deal with differently cased queries? - nlp

I am still a bit new to SPARQL. I have set up a DBpedia endpoint for our company. I have no idea what the end user will be querying and, since DBpedia is case sensitive, I pass both capitalization variants depending on whether the subject is a topic or something like a person; e.g. "Computer_programming" vs "Alicia_Keys". Rather than pass in 2 separate queries, what is the most efficient way to achieve this? I've tried the IN operator (from this question) but I seem to be failing somewhere.
select ?label ?abstract where {
  IN (<http://dbpedia.org/resource/alicia_keys>, <http://dbpedia.org/resource/Alicia_Keys>) rdfs:label ?label ;
     dbpedia-owl:abstract ?abstract .
}
LIMIT 1

since DBpedia is case sensitive, I pass both capitalization variants
depending on whether the subject is a topic or something like a person; e.g.
"Computer_programming" vs "Alicia_Keys". Rather than pass in 2 separate
queries, what is the most efficient way to achieve this?
URIs should be viewed as opaque. While DBpedia generally has enough structure that you can get lucky by concatenating http://dbpedia.org/resource/ and a string with _ replacing spaces, that's really not a very robust way to do things. A better idea is to note that the string you're getting is probably the same as the label of some resource, modulo variations in case. Given that, the best approach would be to look for something with the same label, modulo case. E.g.,
select ?resource where {
  values ?input { "AliCIA KeYS" }
  ?resource rdfs:label ?label .
  filter ( ucase(str(?label)) = ucase(?input) )
}
That's actually going to be pretty slow, though, because you'll have to find every resource and do some string processing on its label. It's an OK approach in principle, though.
What can be done to make it better? Well, if you know what kind of thing you're looking for, that will help a lot. E.g., you could restrict the query to Persons:
select distinct ?resource where {
  values ?input { "AliCIA KeYS" }
  ?resource rdf:type dbpedia-owl:Person ;
            rdfs:label ?label .
  filter ( ucase(str(?label)) = ucase(?input) )
}
That's an improvement, but it's still not all that fast. It still, at least conceptually, has to touch each Person and examine their name. Some SPARQL endpoints support text indexing, and that's probably what you need if you want to do this efficiently.
The best option, of course, would be to simply ask your users for a little bit more information, and to normalize the data in advance. If your user provides "AliCIA KEyS", then you can normalize it to "Alicia Keys"@en, and then do something like:
select distinct ?resource where {
  values ?input { "Alicia Keys"@en }
  ?resource rdfs:label ?input .
}
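For completeness, here's a minimal sketch of that last approach from Python, assuming the public DBpedia endpoint and the SPARQLWrapper library; the normalize() helper is hypothetical and only illustrates the kind of cleanup you'd apply to the user's input:
from SPARQLWrapper import SPARQLWrapper, JSON

def normalize(user_input):
    # Hypothetical normalization: collapse case so "AliCIA KEyS" -> "Alicia Keys".
    # Real input may need more rules (punctuation, aliases, etc.).
    return " ".join(word.capitalize() for word in user_input.split())

def lookup(user_input):
    label = normalize(user_input).replace('"', '\\"')
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        select distinct ?resource where {{
          ?resource rdfs:label "{label}"@en .
        }}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["resource"]["value"] for b in results["results"]["bindings"]]

print(lookup("AliCIA KEyS"))  # expected: ['http://dbpedia.org/resource/Alicia_Keys']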

Related

Irrelevant results returned from view search in arangodb

We have a collection AbstractEvent with a field 'text', which contains 1~30 Chinese characters, and we want to perform a LIKE match with %keyword% with high performance (less than 0.3 seconds for more than 2 million records).
After a bunch of effort, we decided to use a VIEW and the identity analyzer to do this:
FOR i IN AbstractEventView
SEARCH ANALYZER(i.text LIKE '%keyword%', 'identity')
LIMIT 10
RETURN i.text
And here is the definition of view AbstractEventView
{
  "name": "AbstractEventView",
  "type": "arangosearch",
  "links": {
    "AbstractEvent": {
      "analyzers": [
        "identity"
      ],
      "fields": {
        "text": {}
      }
    }
  }
}
However, records returned contain irrelevant ones.
The following is an example:
FOR i IN AbstractEventView
SEARCH ANALYZER(i.text LIKE '%速%', 'identity')
LIMIT 10
RETURN i.text
and the result is
[
"全球经济增速虽军官下滑",
"油食用消费出现明显下滑",
"本次国家经济快速下行",
"这场所迅速爆发的情况",
"经济减速风景空间资本大规模流出",
"苜蓿草众人食品物资价格不稳定",
"荤菜价格快速走低",
"情况快速升级",
"情况快速进展",
"四季功劳增速断崖式回落后"
]
"油食用消费出现明显下滑" and "苜蓿草众人食品物资价格不稳定" are irrelevant (neither contains '速').
We've been struggling with this for days; can anyone help me out? Thanks.
PS:
Why don't we use a FULL-TEXT index?
A full-text index indexes fields by tokenized text, so we cannot match '货币超发' when the keyword is '货', because '货币' is recognized as one word.
Why don't we use FILTER with the LIKE operator directly?
Filtering without an index costs about 1 second, which is not acceptable.

ArangoDB AQL: Find Gaps In Sequential Data

I've been given data to build an application that has sequential data in the form of part numbers of products: "000000", "000001", "000002", "000010", "000011" .... The previous application was an old MS Access database that didn't have any gap filling features in the part number generator, hence the gap between "000002" and "000010" (Yes, they are also strings, but I can work with that...).
We could continue to increment based on the last value and ignore the gaps, however, in an attempt to use all numbers available to us with our naming scheme, we'd like to be able to fill the gaps. Our naming scheme describes the "product family" with the first two digits such that: [00]0000 would be a different family from [02]0000.
I can find the starting and ending values using something like:
let query = `
  LET first = (
    MIN(
      FOR part in part_search
        SEARCH STARTS_WITH(part.PartNumber, #family)
        RETURN part.PartNumber
    )
  )
  LET last = (
    MAX(
      FOR part in part_search
        SEARCH STARTS_WITH(part.PartNumber, #family)
        RETURN part.PartNumber
    )
  )
  RETURN { first, last }
`
The above example returns: {first: "000000", last: "000915"}
Using ArangoDB and AQL, how could I go about finding these gaps? I've found some SQL examples but I feel the features of AQL are a bit more limiting.
Thanks in advance!
To start with, I think your best bet for getting min/max values is using aggregates:
FOR part in part_search
SEARCH STARTS_WITH(part.PartNumber, #family)
COLLECT x = 1
AGGREGATE first = MIN(part.PartNumber), last = MAX(part.PartNumber)
RETURN {
first: first,
last: last
}
But that won't really help when trying to find gaps. And you're right - SQL has several logical constructs that could help (like using variables and cursor iteration), but even that would be a pattern I would discourage.
The better path might be to do a "brute force" approach - compare a table containing your existing numbers with a table of all numbers, using a native method like JOIN to find the difference. Here's how you might do that in AQL:
LET allNumbers = 0..9999
LET existingParts = (
FOR part in part_search
SEARCH STARTS_WITH(part.PartNumber, #family)
LET childId = RIGHT(part.PartNumber, 4)
RETURN TO_NUMBER(childId)
)
RETURN MINUS(allNumbers, existingParts)
The x..y construct creates a sequence (an array of numbers), which we use as the full set of possible numbers. Then, we want to return only the "non-family" part of the ID (I'm calling it "child"), which needs to be numeric to compare with the previous set. Then, we use MINUS to remove elements of existingParts from the allNumbers list.
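For intuition, the MINUS step above is just a set difference; a rough Python analogy (not part of the AQL itself):
existing = [0, 1, 2, 10, 11]                      # child numbers already in use
gaps = sorted(set(range(10000)) - set(existing))  # every 4-digit number not yet taken
print(gaps[:5])                                   # [3, 4, 5, 6, 7]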
One thing to note: that query would return only the "child" portion of the part number, so you would have to join it back to the family number later. Alternatively, you could skip the string-splitting and get "fancy" with your list creation:
LET allNumbers = TO_NUMBER(CONCAT(#family, '0000'))..TO_NUMBER(CONCAT(#family, '9999'))
LET existingParts = (
FOR part in part_search
SEARCH STARTS_WITH(part.PartNumber, #family)
RETURN TO_NUMBER(part.PartNumber)
)
RETURN MINUS(allNumbers, existingParts)
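And to round it out, here's a hedged sketch of running that last query from Python with the python-arango driver and picking the next free number. It assumes the #family placeholder above becomes an AQL bind parameter (@family); the connection details, database name, and example family value are placeholders:
from arango import ArangoClient

# Assumed connection details; adjust to your deployment.
db = ArangoClient(hosts="http://localhost:8529").db("parts", username="root", password="")

AQL = """
LET allNumbers = TO_NUMBER(CONCAT(@family, '0000'))..TO_NUMBER(CONCAT(@family, '9999'))
LET existingParts = (
  FOR part IN part_search
    SEARCH STARTS_WITH(part.PartNumber, @family)
    RETURN TO_NUMBER(part.PartNumber)
)
RETURN MINUS(allNumbers, existingParts)
"""

gaps = next(db.aql.execute(AQL, bind_vars={"family": "00"}))  # single RETURN -> one array of free numbers
next_part = f"{min(gaps):06d}" if gaps else None              # MINUS gives no order guarantee, so take the minimum
print(next_part)                                              # e.g. "000003"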

Filter the list generated by a Gremlin traversal and Groovy

I'm doing the following traversal:
g.V().has('Transfer','eventName','Airdrop').as('t1').
outE('sent_to').
inV().dedup().as('a2').
inE('sent_from').
outV().as('t2').
where('t1',eq('t2')).by('address').
outE('sent_to').
inV().as('a3').
select('a3','a2').
by('accountId').toList().groupBy { it.a3 }.collectEntries { [(it.key): [a2 : it.value.a2]]};
So as you can see, I'm basically doing a traversal, and at the end I'm using Groovy with collectEntries to aggregate the results the way I need them, aggregated by a3 in this case. The results look like this:
==>0xfe43502662ce2adf86d9d49f25a27d65c70a709d={a2=[0x99feb505a8ed9976cf19e757a9536117e6cdc5ba, 0x22019ad32ea3adabae68003bdefd099d7e5e3886]}
(This is GOOD, because the number of values in a2 is at least 2)
==>0x129e0131ea3cc16fe5252d7280bd1258f629f20f={a2=[0xf7958fad496d15cf9fd9e54c0012504f4fdb96ff]}
(This is NOT GOOD, I want to return in my list only those combinations where there are at least 2 values for a2)
I have tried using filters and an additional where step in the traversal itself, but I haven't been able to make it work. I'm not sure if this is something for which I should skip using Groovy in my last line. Any help or orientation would be very much appreciated.
I don't think you need to drop into Groovy to get the answer you want. It would be preferable to do this all in Gremlin, especially since you intend to filter results, which could yield some performance benefit. Gremlin has its own group() step as well as methods for filtering the resulting Map:
g.V().has('Transfer','eventName','Airdrop').as('t1').
out('sent_to').
dedup().as('a2').
in('sent_from').as('t2').
where('t1',eq('t2')).by('address').
out('sent_to').as('a3').
select('a3','a2').
by('accountId').
group().
by('a3').
by('a2').
unfold().
where(select(values).limit(local,2).count(local).is(gte(2)))
The idea is to build your Map with group() and then deconstruct it into entries with unfold(). You then filter each entry with where() by selecting the values of the entry, which is a List of "a2" values, and counting the items locally in that List. I use limit(local,2) to avoid unnecessary iteration beyond 2 since the filter is gte(2).
The easiest way to do this is with findAll { }.
.groupBy { it.a3 }
.findAll { it.value.a2.size() > 1 }
.collectEntries { [(it.key): [a2: it.value.a2]] }
If some a2 are null, then value.a2 also evaluates to null and filters out those results without the need for explicit null checks.

Pass in SQL query the table name as parameter

A possible solution for this question is here:
https://stackoverflow.com/a/6223961/12343395
It will probably work, but with a lot of workarounds.
But I have stored my table names in string format and want to call them as needed.
I am using Pandas read_sql_query. So in params, I am passing the table name and a few parameters for the WHERE section.
The WHERE section is fine, since the parameters are originally strings. But in the FROM section,
I really want the schema.table as a non-string.
Here is a snippet.
SELECT "rainfall(mm)","tmin(C)","tmax(C)","TimeStamp"
FROM crop_tables[choose_crop][0]
WHERE "District_Name" = %s AND "Season" = %s
ORDER BY "TimeStamp" ASC
where crop_tables[choose_crop][0] is 'sagita_historic.soyabean_daily_analyses' in this case.
But FROM will throw an error since it doesn't accept strings. So in essence, I wish to strip the quotes from 'sagita_historic.soyabean_daily_analyses' so it is treated as a non-string.
Is it possible to do so?
Thank you.
Not sure I fully understand but maybe this will do?
SELECT "rainfall(mm)","tmin(C)","tmax(C)","TimeStamp"
FROM f"{crop_tables[choose_crop][0]}"
WHERE "District_Name" = %s AND "Season" = %s
ORDER BY "TimeStamp" ASC

MYSQL: Using GROUP BY with string literals

I have the following table with these columns:
shortName, fullName, ChangelistCount
Is there a way to group them by a string literal within their fullName? The fullname represents file directories, so I would like to display results for certain parent folders instead of the individual files.
I tried something along the lines of:
GROUP BY fullName like "%/testFolder/%" AND fullName like "%/testFolder2/%"
However it only really groups by the first match....
Thanks!
Perhaps you want something like:
GROUP BY IF(fullName LIKE '%/testfolder/%', 1, IF(fullName LIKE '%/testfolder2/%', 2, 3))
The key idea to understand is that an expression like fullName LIKE foo AND fullName LIKE bar will necessarily evaluate to either TRUE or FALSE, so you can only get two total groups out of it.
Using an IF expression to return one of several different values will let you get more groups.
Keep in mind that this will not be particularly fast. If you have a very large dataset, you should explore other ways of storing the data that will not require LIKE comparisons to do the grouping.
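For intuition, the IF expression is doing the same job as a small bucketing function; here's a rough Python analogy (not MySQL, just an illustration using the folder names from the question):
def bucket(full_name):
    # Mirrors IF(fullName LIKE '%/testFolder/%', 1, IF(fullName LIKE '%/testFolder2/%', 2, 3))
    if "/testFolder/" in full_name:
        return 1
    if "/testFolder2/" in full_name:
        return 2
    return 3

# A boolean expression can only ever yield two values (TRUE/FALSE), hence at most two groups;
# bucket() can yield 1, 2 or 3, so GROUP BY can produce up to three groups.
print(bucket("/root/testFolder2/file.txt"))  # 2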
You'd have to use a subquery to derive the column values you'd like to ultimately group on:
FROM (SELECT SUBSTR(fullname, ?) AS derived_column
      FROM YOUR_TABLE) x
GROUP BY x.derived_column
Either use WHEN/THEN conditions or have another temporary table containing all the matches you wish to find and group on. Here is a sample from my database, where I wanted to group all users based on their city, which was inside the address field.
SELECT ut.* , c.city, ua.*
FROM `user_tracking` AS ut
LEFT JOIN cities AS c ON ut.place_name LIKE CONCAT( "%", c.city, "%" )
LEFT JOIN users_auth AS ua ON ua.id = ut.user_id
