Summer in Greece with SPARQL - geometry

I want to pose a query that, for every region of Greece, counts the best bathing waters (i.e. the number of waters that show perfect quality). So the (ordered) result should look something like this:
Crete "2048"^^<http://www.w3.org/2001/XMLSchema#integer> # Crete has 2048 perfect bathing waters
Santorini "1024"^^<http://www.w3.org/2001/XMLSchema#integer>
..
The problem for me is how to get the bathing waters related to a region. Then I need to work out how to collect the different counts; I know how to order. Let's assume that ?concie_0 determines the quality: if it is > 40, the water is of perfect quality. Here is what I have so far:
SELECT ?municipality ?bw
WHERE {
?regional_unit geo:έχει_επίσημο_όνομα "ΠΕΡΙΦΕΡΕΙΑΚΗ ΕΝΟΤΗΤΑ ΗΡΑΚΛΕΙΟΥ" .
?municipality geo:ανήκει_σε ?regional_unit .
?municipality geo:έχει_γεωμετρία ?geometry .
?bw geos:hasGeometry ?bw_geo .
?bw_geo geos:asWKT ?bw_geo_wkt .
FILTER(strdf:within(?geometry, ?bw_geo_wkt)) .
?bw unt:has_concie_0 ?concie_0 .
FILTER(?concie_0 > 40)
}
LIMIT 15
which gives:
municipality bw
http://geo.linkedopendata.gr/gag/id/9302 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/340
http://geo.linkedopendata.gr/gag/id/9302 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/456
http://geo.linkedopendata.gr/gag/id/9302 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/972
http://geo.linkedopendata.gr/gag/id/9302 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/1041
http://geo.linkedopendata.gr/gag/id/9302 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/1365
http://geo.linkedopendata.gr/gag/id/9302 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/1849
http://geo.linkedopendata.gr/gag/id/9306 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/340
http://geo.linkedopendata.gr/gag/id/9306 http://data.linkedeodata.eu/poiothta_ydatwn_kolymvhshs_2012/id/456
...
I think this pairs the bathing waters with every municipality of the regional unit. However, I do not know how to proceed from here. Do you?

All you need to do is change your SELECT clause to include a COUNT, add a GROUP BY clause that groups per municipality, and finally add an ORDER BY clause that ensures the highest counts come first. Like this:
SELECT ?municipality (COUNT(?bw) as ?bwCount)
WHERE {
....
}
GROUP BY ?municipality
ORDER BY DESC(?bwCount)
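For completeness, here is roughly what that looks like when combined with the WHERE clause from your query (same prefixes as in your setup, LIMIT dropped):
SELECT ?municipality (COUNT(?bw) AS ?bwCount)
WHERE {
  ?regional_unit geo:έχει_επίσημο_όνομα "ΠΕΡΙΦΕΡΕΙΑΚΗ ΕΝΟΤΗΤΑ ΗΡΑΚΛΕΙΟΥ" .
  ?municipality geo:ανήκει_σε ?regional_unit .
  ?municipality geo:έχει_γεωμετρία ?geometry .
  ?bw geos:hasGeometry ?bw_geo .
  ?bw_geo geos:asWKT ?bw_geo_wkt .
  FILTER(strdf:within(?geometry, ?bw_geo_wkt))
  ?bw unt:has_concie_0 ?concie_0 .
  FILTER(?concie_0 > 40)
}
GROUP BY ?municipality
ORDER BY DESC(?bwCount)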

Related

Kusto Query Language: set column name of summarize by evaluated expression

Me again, asking another Kusto-related question (I really wish there were a thorough video tutorial on this somewhere).
I have a summarize statement that produces two columns for the y-axis and one for the x-axis.
Now I want to relabel the y-axis columns to show a string that I also got from the database and have already put into a variable with let.
This basically looks like this:
let android_col = strcat("Android: ", toscalar(customEvents
| where application_Version contains secondLatestVersionAndroid));
let iOS_col = strcat("iOS: ", toscalar(customEvents
| where application_Version contains secondLatestVersionIOS));
... some Kusto magic ...
| summarize
Android = 100 - (round((countif(hasUnhandledErrorAndroid == 1 ) * 100.0 ) / countif(isAndroid == 1), 2)),
iOS = 100 - (round((countif(hasUnhandledErroriOS == 1) * 100.0 ) / countif(isIOS == 1), 2))
by Time
|render timechart with (ytitle="crashfree users in %", xtitle="date", legend=visible )
Now I want the summarize to display not Android and iOS, but the values of android_col and iOS_col.
Is that possible?
Best regards
Maverick
Generally, it's suggested to have predefined column names, otherwise various features don't work. For example, IntelliSense won't know the names of the columns, as they would be determined at run time only. Also, if you create a function that returns a dynamic schema, you won't be able to run this function from other clusters.
However, if you do want to change column names, you can do so using various plugins, for example bag_unpack, pivot, and others; see the sketch below.
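To illustrate the bag_unpack idea, here is a rough, untested sketch. It assumes the isAndroid/isIOS/hasUnhandledError* columns and the Time binning from the question's "... some Kusto magic ..." step, and replaces the toscalar-based lets with plain example strings:
let android_col = "Android: 2.3.1";  // in practice built with strcat()/toscalar() as in the question
let iOS_col = "iOS: 5.0.2";
customEvents
| summarize
    Android = 100 - (round((countif(hasUnhandledErrorAndroid == 1) * 100.0) / countif(isAndroid == 1), 2)),
    iOS = 100 - (round((countif(hasUnhandledErroriOS == 1) * 100.0) / countif(isIOS == 1), 2))
    by Time
// pack the two values into a property bag whose keys are the run-time labels,
// then let bag_unpack turn those keys into column names
| project Time, packed = pack(android_col, Android, iOS_col, iOS)
| evaluate bag_unpack(packed)
| render timechart with (ytitle="crashfree users in %", xtitle="date", legend=visible)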
As for courses on Kusto, there are actually several excellent courses on Pluralsight (all are free):
How to start with Microsoft Azure Data Explorer
Basic KQL
Azure Data Explorer – Advanced KQL
The usage of the "toscalar" in this query looks wrong, it seems to me that you should use the "extend" operator with the same logic to create the additional columns.

How do I use Apache Spark and lxml to parse, filter, and aggregate data?

I've created a generic XML parser with lxml, using etree.fromstring(x). Now I have to parse XML like the following:
<row AcceptedAnswerId="88156" AnswerCount="6" Body="<p>I\'ve just played a game with my kids that basically boils down to: whoever rolls every number at least once on a 6-sided dice wins.</p>
<p>I won, eventually, and the others finished 1-2 turns later. Now I\'m wondering: what is the expectation of the length of the game?</p>
<p>I know that the expectation of the number of rolls till you hit a specific number is
$\\sum_{n=1}^\\infty n\\frac{1}{6}(\\frac{5}{6})^{n-1}=6$.</p>
<p>However, I have two questions:</p>
<ol>
<li>How many times to you have to roll a six-sided dice until you get every number at least once? </li>
<li>Among four independent trials (i.e. with four players), what is the expectation of the <em>maximum</em> number of rolls needed? [note: it\'s maximum, not minimum, because at their age, it\'s more about finishing than about getting there first for my kids]</li>
</ol>
<p>I can simulate the result, but I wonder how I would go about calculating it analytically.</p>
<hr>
<p>Here\'s a Monte Carlo simulation in Matlab</p>
<pre><code>mx=zeros(1000000,1);
for i=1:1000000,
%# assume it\'s never going to take us &gt;100 rolls
r=randi(6,100,1);
%# since R2013a, unique returns the first occurrence
%# for earlier versions, take the minimum of x
%# and subtract it from the total array length
[~,x]=unique(r);
mx(i,1)=max(x);
end
%# make sure we haven\'t violated an assumption
assert(~any(mx==100))
%# find the expected value for the coupon collector problem
expectationForOneRun = mean(mx)
%# find the expected number of rolls as a maximum of four independent players
maxExpectationForFourRuns = mean( max( reshape( mx, 4, []), [], 1) )
expectationForOneRun =
14.7014 (SEM 0.006)
maxExpectationForFourRuns =
21.4815 (SEM 0.01)
</code></pre>
" CommentCount="5" CreationDate="2013-01-24T02:04:12.570" FavoriteCount="9" Id="48396" LastActivityDate="2014-02-27T16:38:07.013" LastEditDate="2013-01-26T13:53:53.183" LastEditorUserId="198" OwnerUserId="198" PostTypeId="1" Score="23" Tags="<probability><dice>" Title="How often do you have to roll a 6-sided dice to obtain every number at least once?" ViewCount="5585" />',
' <row AnswerCount="1" Body="<p>Suppose there are $6$ people in a population. During $2$ weeks $3$ people get the flu. Cases of the flu last $2$ days. Also people will get the flu only once during this period. What is the incidence density of the flu?</p>
<p>Would it be $\\frac{3}{84 \\text{person days}}$ since each person is observed for $14$ days?</p>
" CommentCount="4" CreationDate="2013-01-24T02:23:13.497" Id="48397" LastActivityDate="2013-04-24T16:58:18.773" OwnerUserId="20010" PostTypeId="1" Score="1" Tags="<epidemiology>" Title="Incidence density" ViewCount="288" />',
We'll suppose that my goal is to pull out the CommentCount values, and aggregate them. As I'm doing this through PySpark, this is, of course, only a very small sample of the data.
I've attempted to use my parser together with .filter() and reduceByKey, but haven't had much success. Presumably the lxml parser mentioned above returns a dictionary, though I haven't been able to confirm that's the case.
Can anyone explain the best way to aggregate the CommentCount values in the XML above?
Note: Databricks cannot be installed on my system, so any solution must not require it.
One way you can try is the Spark SQL xpath-related builtin functions, but only if the XMLs are all valid (or can easily be converted into valid XML) and each record sits on its own line.
# read file in line mode, we get one column with column_name = 'value'
df = spark.read.text('....')
For example, with the current sample XMLs, we can trim the leading and trailing commas, single quotes and spaces, and take the XPath row//@CommentCount, which selects the CommentCount attribute under the row tag. This yields an array column of matched attribute values:
df.selectExpr('''xpath(trim(both ",' " from value), "row//@CommentCount") as CommentCount''').show()
+------------+
|CommentCount|
+------------+
| [5]|
| [4]|
+------------+
You can then take the sum on the first element of each array:
df.selectExpr('''
sum(xpath(trim(both ",' " from value), "row//@CommentCount")[0]) as sum_CommentCount
''').show()
+----------------+
|sum_CommentCount|
+----------------+
| 9.0|
+----------------+
The problem with this method is that it's very fragile: any invalid XML will fail the whole process, and I haven't found a fix for this as of now.
Another way is to use the API function regexp_extract, which can be practical since the text you want to retrieve is simple (i.e. no embedded tags or quotes, etc.).
from pyspark.sql.functions import regexp_extract
df.select(regexp_extract('value', r'\bCommentCount="(\d+)"', 1).astype('int').alias('CommentCount')).show()
+------------+
|CommentCount|
+------------+
| 5|
| 4|
+------------+
You can then take the sum of this integer column, as in the sketch below. Just my 2 cents.
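A minimal sketch of that final aggregation, reusing the df read in line mode above (the sum_CommentCount alias is just for illustration):
from pyspark.sql.functions import regexp_extract, sum as sum_

df.select(regexp_extract('value', r'\bCommentCount="(\d+)"', 1).astype('int').alias('CommentCount')) \
  .agg(sum_('CommentCount').alias('sum_CommentCount')).show()
+----------------+
|sum_CommentCount|
+----------------+
|               9|
+----------------+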

FuzzyWuzzy for very similar records in Python

I have a dataset in which I want to find the closest string match. For that purpose I'm using FuzzyWuzzy in this way:
sol=process.extract(t,dev2,scorer=fuzz.token_sort_ratio)
where t is the string and dev2 is the list to compare against. My problem is that sometimes the dataset has very similar records and the options provided by FuzzyWuzzy seem to be lacking; I've tested token_sort, token_set, partial_token_sort and partial_token_set, ratio, partial_ratio, and WRatio.
For example, the string Italy - Serie A gives me the following 2 closest matches.
Token_sort_ratio: (92, 'Italy - Serie D');(86, 'Italian - Serie A')
The one I want is obviously the second one, but character by character the first one is closer, and it is a different league.
This happens with teams as well: if, say, I have the string Buchtholz, I obtain Buchtholz II before I get TSV Buchtholz.
My main idea now would be to weight the presence or absence of certain characters more heavily, such as single capital letters at the end of the string, so that a difference or absence there is weighted as less close; likewise for () and other special characters.
I don't know if there is a way to take this into account, or whether you have a better approach to get the string that really matches.
Similarity matching often requires knowledge of the data being analysed, i.e. it is not just a blind single round of matching. I recommend that you pass your results through more rounds of matching, starting with inclusive/optimistic approaches (like token_set_ratio) with low cut-off scores and working toward more exclusive/pessimistic approaches with higher cut-off scores until you have a clear winner. If you know more about the text you're analysing, you can even modify the strings as you progress.
In a case I worked on, I did similarity matches of goods movement descriptions. In the descriptions the numbers sequences were more important than the text. e.g. when looking for a match for "SLURRY VALVE 250MM RAGMAX 2000" the 250 and 2000 part of the string are important, otherwise I get a "SLURRY VALVE 50MM RAGMAX 2000" as the best match instead of "VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON" which is a better result.
I put the similarity match process through two steps: (1) get a set of similar matches using an optimistic scorer (token_set_ratio); (2) extract the number sequences of those results and pass them through another round of matching with a stricter scorer (token_sort_ratio). Doing this gave me the better result in the example above.
Below are some blocks of code that could be of assistance.
Here's a function to get the numbers from a description. (In your case you might use this to exclude numbers from your string instead?)
def get_numbers_from_string(description):
    numbers = ''.join((ch if ch in '0123456789.-' else ' ') for ch in description)
    numbers = ' '.join([nr for nr in numbers.split()])
    return numbers
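For example, with the goods-movement descriptions from the example above:
>>> get_numbers_from_string("SLURRY VALVE 250MM RAGMAX 2000")
'250 2000'
>>> get_numbers_from_string("VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON")
'250 250 2000'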
and here is a portion of the code I used to put the description match through two rounds:
try:
    # get close match from goods move that has material numbers
    df_material = pd.DataFrame(process.extract(description,
                                               corpus_material,
                                               scorer=fuzz.token_set_ratio),
                               columns=['Similar Text', 'Score']
                               )
    if df_material['Score'][df_material['Score'] >= cut_off_accuracy_materials].count() >= 1:
        similar_text = df_material['Similar Text'].iloc[0]
        score = df_material['Score'].iloc[0]
        if nr_description_numbers > 4:
            # if there are multiple matches found, then get best number combination match
            df_material = df_material[df_material['Score'] >= cut_off_accuracy_materials]
            new_corpus = list(df_material['Similar Text'])
            new_corpus = np.vectorize(get_numbers_from_string)(new_corpus)
            df_material['numbers'] = new_corpus
            df_numbers = pd.DataFrame(process.extract(description_numbers,
                                                      new_corpus,
                                                      scorer=fuzz.token_sort_ratio),
                                      columns=['numbers', 'Score']
                                      )
            similar_text = df_material['Similar Text'][df_material['numbers'] == df_numbers['numbers'].iloc[0]].iloc[0]
            nr_score = df_numbers['Score'].iloc[0]
hope it helps, and good luck

SPARQL DBpedia - Retrieve category information in any language by using labels

I have a problem, which I will explain with the following example:
I want to retrieve all information on a category in any language. I must use the ?category label together with a language tag such as en, as these are the inputs to my program.
The query looks like this, but when I change the language I don't receive any information on the category. I know the problem lies in dcterms:subject, because ?category returns http://dbpedia.org/resource/Category:Countries_in_Europe (see the first example below).
For example, to search for a category label in German you have to use http://de.dbpedia.org/resource/Kategorie:Staat_in_Europa (see the second example below).
prefix dcterms: <http://purl.org/dc/terms/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?objectLabel WHERE {
?subject dcterms:subject ?category ; rdfs:label ?objectLabel .
?category rdfs:label "Countries in Europe"@en .
FILTER (LANG(?objectLabel)='en')
}
An equivalent query in a different language that, as an example, doesn't work:
prefix dcterms: <http://purl.org/dc/terms/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?objectLabel WHERE {
?subject dcterms:subject ?category ; rdfs:label ?objectLabel .
?category rdfs:label "Staat in Europa"@de .
FILTER (LANG(?objectLabel)='de')
}
Is there a similar or different way / method to solve the problem? Thanks in advance for any help.

SPARQL how to deal with different cased queries?

I am still a bit new to SPARQL. I have set up a DBpedia endpoint for our company. I have no idea what the end user will be querying, and since DBpedia is case sensitive I pass both title-case and uppercase versions for subjects versus something like a person, e.g. "Computer_programming" vs "Alicia_Keys". Rather than pass in 2 separate queries, what is the most efficient way to achieve this? I've tried the IN operator (from this question) but I seem to be failing somewhere.
select ?label ?abstract where {
IN (<http://dbpedia.org/resource/alicia_keys>, <http://dbpedia.org/resource/Alicia_Keys>) rdfs:label ?label;
dbpedia-owl:abstract ?abstract.
}
LIMIT 1"""
since DBpedia is case sensitive I pass both title-case and uppercase
versions for subjects versus something like a person, e.g.
"Computer_programming" vs "Alicia_Keys". Rather than pass in 2 separate
queries, what is the most efficient way to achieve this?
URIs should be viewed as opaque. While DBpedia generally has some nice structure, so that you can get lucky by concatenating http://dbpedia.org/resource/ and some string with _ replacing spaces, that's really not a very robust way to do things. A better idea is to note that the string you're getting is probably the same as the label of some resource, modulo variations in case. Given that, the best idea would be to look for something with the same label, modulo case. E.g.,
select ?resource where {
values ?input { "AliCIA KeYS" }
?resource rdfs:label ?label .
filter ( ucase(str(?label)) = ucase(?input) )
}
That's actually going to be pretty slow, though, because you'll have to find every resource and do some string processing on its label. It's an OK approach in principle, though.
What can be done to make it better? Well, if you know what kind of thing you're looking for, that will help a lot. E.g., you could restrict the query to Persons:
select distinct ?resource where {
values ?input { "AliCIA KeYS" }
?resource rdf:type dbpedia-owl:Person ;
rdfs:label ?label .
filter ( ucase(str(?label)) = ucase(?input) )
}
That's an improvement, but it's still not all that fast. It still, at least conceptually, has to touch each Person and examine their name. Some SPARQL endpoints support text indexing, and that's probably what you need if you want to do this efficiently.
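For example, if your endpoint is Virtuoso-backed (as DBpedia's public endpoint is), its non-standard bif:contains full-text predicate can narrow the candidates before the exact comparison. A rough sketch, with the caveat that the search phrase is hard-coded here rather than derived from ?input:
select distinct ?resource where {
  values ?input { "AliCIA KeYS" }
  ?resource rdf:type dbpedia-owl:Person ;
            rdfs:label ?label .
  ?label bif:contains "'Alicia Keys'" .
  filter ( ucase(str(?label)) = ucase(?input) )
}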
The best option, of course, would be to simply ask your users for a little bit more information, and to normalize the data in advance. If your user provides "AliCIA KEyS", then you can normalize it to "Alicia Keys"@en and then do something like:
select distinct ?resource where {
values ?input { "Alicia Keys"@en }
?resource rdfs:label ?input .
}
