SPARQL Filter in-Text multiple terms - text

I would like to filter speeches for certain terms from a dictionary. Ideally, the resulting table would contain the speeches from the dataset that include one or more of the defined terms.
I have tried two versions so far.
One where I immediately try to match the respective terms:
SELECT ?name ?gender ?partyname ?countryname ?date ?speechnr
WHERE {
?speech tpf:match (lpv:text 'domestic abuse OR domestic violence OR intimate partner violence' ?text).
?speech lpv:spokenAs ?function.
?function lpv:institution ?party.
?party rdf:type lpv:NationalParty.
?party rdfs:label ?partyname.
?speech lpv:docno ?speechnr.
?speech dcterms:date ?date.
?speech lpv:speaker ?speaker.
?speaker lpv:name ?name.
?speaker lpv:gender ?gender.
?speaker lpv:countryOfRepresentation ?country.
?country rdfs:label ?countryname.
FILTER ( ?date >= "1999-07-20"^^xsd:date && ?date <= "2004-07-19"^^xsd:date )
} ORDER BY ?date ?speechnr LIMIT 10
and another where I try to filter for the terms in the end:
SELECT ?name ?gender ?partyname ?countryname ?date ?speechnr ?text
WHERE {
?speech lpv:spokenText ?text.
?speech lpv:spokenAs ?function.
?function lpv:institution ?party.
?party rdf:type lpv:NationalParty.
?party rdfs:label ?partyname.
?speech lpv:docno ?speechnr.
?speech dcterms:date ?date.
?speech lpv:speaker ?speaker.
?speaker lpv:name ?name.
?speaker lpv:gender ?gender.
?speaker lpv:countryOfRepresentation ?country.
?country rdfs:label ?countryname.
FILTER ( ?date >= "1999-07-20"^^xsd:date && ?date <= "2004-07-19"^^xsd:date )
FILTER(langMatches(lang(?text), "en"))
FILTER(?text = 'domestic abuse' || 'domestic violence' || 'intimate partner violence')
} ORDER BY ?date ?speechnr LIMIT 10
I also tried the following filter but don't know how I can include an OR there:
FILTER(CONTAINS(?text, 'domestic abuse')).
The problems:
If I use OR, I only get speeches in which one of the given terms is contained.
Using the logical || did not even return speeches containing the looked-for terms, for some reason.
Additionally, for terms like 'domestic violence', I only want matches where the words are adjacent (e.g., not just 'violence' on its own).
Sorry for the long text; I would really appreciate your help.
The website, if needed: https://linkedpolitics.project.cwi.nl/web/html/home.html
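For what it's worth, a minimal sketch of one way to express the OR, assuming the endpoint supports the SPARQL 1.1 built-ins CONTAINS and LCASE; chaining CONTAINS with || keeps each multi-word phrase intact, so 'domestic violence' only matches where the words are adjacent:
FILTER (
  CONTAINS(LCASE(?text), "domestic abuse") ||
  CONTAINS(LCASE(?text), "domestic violence") ||
  CONTAINS(LCASE(?text), "intimate partner violence")
)
This would replace the FILTER(?text = ...) line; the = operator tests whole-string equality, not containment, which is part of why that version misbehaved.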

Related

Unique nested dictionary from a 'for' loop in Python 3

I have a host that executes commands via 'subprocess' and returns a list of output parameters. The problem is that the output cannot be cleanly converted into a dictionary, whether via YAML or JSON. After the list is received, a regexp matches the relevant information and performs grouping. I am interested in building a unique dictionary, where overlapping keys are put into a nested dictionary.
Here is the code and the example of output list:
from re import compile, match

# Output can differ from request to request; the "keys" from
# list_of_values can duplicate or appear more than two times.
# The values mapped to the keys can differ too.
list_of_values = [
    "paramId: '11'", "valueId*: '11'",
    "elementId: '010_541'", 'mappingType: Both',
    "startRng: ''", "finishRng: ''",
    'DbType: sql', "activeSt: 'false'",
    'profile: TestPr1', "specificHost: ''",
    'hostGroup: tstGroup10', 'balance: all',
    "paramId: '194'", "valueId*: '194'",
    "elementId: '010_541'", 'mappingType: Both',
    "startRng: '1020304050'", "finishRng: '1020304050'",
    'DbType: sql', "activeSt: 'true'",
    'profile: TestPr1', "specificHost: ''",
    'hostGroup: tstGroup10', 'balance: all']

re_compile_valueId = compile(
    "valueId\*:\s.(?P<valueId>\d{1,5})"
    "|elementId:\s.(?P<elementId>\d{3}_\d{2,3})"
    "|startRng:\s.(?P<startRng>\d{1,10})"
    "|finishRng:\s.(?P<finishRng>\d{1,10})"
    "|DbType:\s(?P<DbType>nosql|sql)"
    "|activeSt:\s.(?P<activeSt>true|false)"
    "|profile:\s(?P<profile>[A-z0-9]+)"
    "|hostGroup:\s(?P<hostGroup>[A-z0-9]+)"
    "|balance:\s(?P<balance>none|all|priority group)"
)
iterator_loop = 0
uniq_dict = dict()
next_dict = dict()
for element in list_of_values:
    match_result = match(re_compile_valueId, element)
    if match_result:
        temp_dict = match_result.groupdict()
        for key, value in temp_dict.items():
            if value:
                if key == 'valueId':
                    uniq_dict['valueId' + str(iterator_loop)] = ''
                    iterator_loop += 1
                    next_dict.update({key: value})
                else:
                    next_dict.update({key: value})
                    uniq_dict['valueId' + str(iterator_loop - 1)] = next_dict
print(uniq_dict)
This code responds with:
{
    'valueId0':
    {
        'valueId': '194',
        'elementId': '010_541',
        'DbType': 'sql',
        'activeSt': 'true',
        'profile': 'TestPr1',
        'hostGroup': 'tstGroup10',
        'balance': 'all',
        'startRng': '1020304050',
        'finishRng': '1020304050'
    },
    'valueId1':
    {
        'valueId': '194',
        'elementId': '010_541',
        'DbType': 'sql',
        'activeSt': 'true',
        'profile': 'TestPr1',
        'hostGroup': 'tstGroup10',
        'balance': 'all',
        'startRng': '1020304050',
        'finishRng': '1020304050'
    }
}
And I was expecting something like:
{
    'valueId0':
    {
        'valueId': '11',
        'elementId': '010_541',
        'DbType': 'sql',
        'activeSt': 'false',
        'profile': 'TestPr1',
        'hostGroup': 'tstGroup10',
        'balance': 'all',
        'startRng': '',
        'finishRng': ''
    },
    'valueId1':
    {
        'valueId': '194',
        'elementId': '010_541',
        'DbType': 'sql',
        'activeSt': 'true',
        'profile': 'TestPr1',
        'hostGroup': 'tstGroup10',
        'balance': 'all',
        'startRng': '1020304050',
        'finishRng': '1020304050'
    }
}
I also have another version below, which runs and assigns values as expected. But its structure defeats the idea of having this all looped, because each resulting dictionary key has its own order number appended. Example below; the list_of_values and re_compile_valueId from the previous example can be reused.
for element in list_of_values:
    match_result = match(re_compile_valueId, element)
    if match_result:
        temp_dict = match_result.groupdict()
        for key, value in temp_dict.items():
            if value:
                if key == 'balance':
                    key = key + str(iterator_loop)
                    uniq_dict.update({key: value})
                    iterator_loop += 1
                else:
                    key = key + str(iterator_loop)
                    uniq_dict.update({key: value})
print(uniq_dict)
The output will look like:
{
    'valueId1': '11', 'elementId1': '010_541',
    'DbType1': 'sql', 'activeSt1': 'false',
    'profile1': 'TestPr1', 'hostGroup1': 'tstGroup10',
    'balance1': 'all', 'valueId2': '194',
    'elementId2': '010_541', 'startRng2': '1020304050',
    'finishRng2': '1020304050', 'DbType2': 'sql',
    'activeSt2': 'true', 'profile2': 'TestPr1',
    'hostGroup2': 'tstGroup10', 'balance2': 'all'
}
Would appreciate any help! Thanks!
It turned out that some documentation reading needed to be performed :D
A copy() of next_dict under the else statement needs to be applied. Thanks to this thread:
Why does updating one dictionary object affect other?
Many thanks to the answer's author @thefourtheye (https://stackoverflow.com/users/1903116/thefourtheye)
The final code:
for element in list_of_values:
    match_result = match(re_compile_valueId, element)
    if match_result:
        temp_dict = match_result.groupdict()
        for key, value in temp_dict.items():
            if value:
                if key == 'valueId':
                    uniq_dict['valueId' + str(iterator_loop)] = ''
                    iterator_loop += 1
                    next_dict.update({key: value})
                else:
                    next_dict.update({key: value})
                    uniq_dict['valueId' + str(iterator_loop - 1)] = next_dict.copy()
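The underlying issue is aliasing: assigning next_dict stores a reference to the one shared object, so every entry in uniq_dict kept reflecting later updates. A minimal illustration of the difference, independent of the code above:
d = dict()
inner = {'a': 1}
d['first'] = inner           # stores a reference, not a snapshot
inner['a'] = 2
print(d['first']['a'])       # 2 -- the 'first' entry changed too
d['second'] = inner.copy()   # shallow copy: a snapshot of the current contents
inner['a'] = 3
print(d['second']['a'])      # still 2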
Thanks to everyone for the involvement.

What changed in the SPARQL code for this parliamentary term?

I have been successfully retrieving data from the following OpenLink dataset: http://linkedpolitics.ops.few.vu.nl/web/html/home.html
for the 5th, 6th and 7th parliamentary terms of the EP, which I then clean in STATA.
However, the coding seems to differ for the 8th term, because I get far fewer speeches when I use the lpv:translatedText property that I have used before. I can't help but think that a LOT more should come up in the timeframe I am specifying than what the SPARQL endpoint returns. Can anyone help me figure out what I am doing wrong?
Here is the code I used for National parties (here with the dates for anything after the 7th term):
SELECT DISTINCT ?name ?countryname ?birth ?gender ?partyname ?start ?end ?date ?speechnr ?parlterm ?dictionary
WHERE {
?speech lpv:translatedText ?text.
?speech dcterms:date ?date.
?speech lpv:docno ?speechnr.
?speech lpv:speaker ?speaker.
?speaker lpv:name ?name.
?speaker lpv:dateOfBirth ?birth.
?speaker lpv:gender ?gender.
?speaker lpv:politicalFunction ?function.
?function lpv:institution ?party.
?party rdf:type lpv:NationalParty.
?party rdfs:label ?partyname.
?function lpv:beginning ?start.
?function lpv:end ?end.
?speaker lpv:countryOfRepresentation ?country.
?country rdfs:label ?countryname.
BIND("8" as ?parlterm)
BIND("representation" as ?dictionary)
FILTER ( ?date > "2014-07-01"^^xsd:date )
FILTER(langMatches(lang(?text), "en"))
FILTER(CONTAINS(?text, 'female representation') || CONTAINS(?text, 'women’s representation') || CONTAINS(?text, 'equal representation') || CONTAINS(?text, 'gender representation') || CONTAINS(?text, 'women in science') || CONTAINS(?text, 'women in business') || CONTAINS(?text, 'women’s leadership'))
} ORDER BY ?date ?speechnr
and here is the code I used for the FEMM committee (again anything after 7th parliamentary term):
SELECT DISTINCT ?name ?countryname ?birth ?gender ?start_com ?end_com ?date ?speechnr ?parlterm ?dictionary ?FEMM
WHERE {
?speech lpv:translatedText ?text.
?speech dcterms:date ?date.
?speech lpv:docno ?speechnr.
?speech lpv:speaker ?speaker.
?speaker lpv:name ?name.
?speaker lpv:dateOfBirth ?birth.
?speaker lpv:gender ?gender.
?speaker lpv:politicalFunction ?function.
?function lpv:institution ?institution.
?institution rdfs:label ?committee.
FILTER(CONTAINS(?committee, "Committee on Women's Rights and Gender Equality"))
BIND("Yes" as ?FEMM).
?function lpv:beginning ?start_com.
?function lpv:end ?end_com.
?speaker lpv:countryOfRepresentation ?country.
?country rdfs:label ?countryname.
BIND("8" as ?parlterm)
BIND("representation" as ?dictionary)
FILTER ( ?date > "2014-07-01"^^xsd:date )
FILTER(langMatches(lang(?text), "en"))
FILTER(CONTAINS(?text, 'female representation') || CONTAINS(?text, 'women’s representation') || CONTAINS(?text, 'equal representation') || CONTAINS(?text, 'gender representation') || CONTAINS(?text, 'women in science') || CONTAINS(?text, 'women in business') || CONTAINS(?text, 'women’s leadership'))
} ORDER BY ?date ?speechnr
Thank you.
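As a diagnostic sketch (not from the thread, and reusing the same prefixes as the queries above), one way to check whether the 8th-term data simply carries fewer English translations is to count speeches per language tag after the cut-off date:
SELECT ?lang (COUNT(?speech) AS ?n)
WHERE {
  ?speech lpv:translatedText ?text .
  ?speech dcterms:date ?date .
  FILTER ( ?date > "2014-07-01"^^xsd:date )
  BIND(LANG(?text) AS ?lang)
}
GROUP BY ?lang
ORDER BY DESC(?n)
If the en count is small here, the restriction is in the data rather than in the query.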

Convert HQL to SparkSQL

I'm trying to convert HQL to Spark.
I have the following query (Works in Hue with Hive editor):
select reflect('java.util.UUID', 'randomUUID') as id,
tt.employee,
cast( from_unixtime(unix_timestamp (date_format(current_date(),'dd/MM/yyyy HH:mm:ss'), 'dd/MM/yyyy HH:mm:ss')) as timestamp) as insert_date,
collect_set(tt.employee_detail) as employee_details,
collect_set( tt.emp_indication ) as employees_indications,
named_struct ('employee_info', collect_set(tt.emp_info),
'employee_mod_info', collect_set(tt.emp_mod_info),
'employee_comments', collect_set(tt.emp_comment) )
as emp_mod_details
from (
select views_ctr.employee,
if ( views_ctr.employee_details.so is not null, views_ctr.employee_details, null ) employee_detail,
if ( views_ctr.employee_info.so is not null, views_ctr.employee_info, null ) emp_info,
if ( views_ctr.employee_comments.so is not null, views_ctr.employee_comments, null ) emp_comment,
if ( views_ctr.employee_mod_info.so is not null, views_ctr.employee_mod_info, null ) emp_mod_info,
if ( views_ctr.emp_indications.so is not null, views_ctr.emp_indications, null ) employees_indication
from
( select * from views_sta where emp_partition=0 and employee is not null ) views_ctr
) tt
group by employee
distribute by employee
First, What I'm trying is to write it in spark.sql as follow:
sparkSession.sql("select reflect('java.util.UUID', 'randomUUID') as id, tt.employee, cast( from_unixtime(unix_timestamp (date_format(current_date(),'dd/MM/yyyy HH:mm:ss'), 'dd/MM/yyyy HH:mm:ss')) as timestamp) as insert_date, collect_set(tt.employee_detail) as employee_details, collect_set( tt.emp_indication ) as employees_indications, named_struct ('employee_info', collect_set(tt.emp_info), 'employee_mod_info', collect_set(tt.emp_mod_info), 'employee_comments', collect_set(tt.emp_comment) ) as emp_mod_details from ( select views_ctr.employee, if ( views_ctr.employee_details.so is not null, views_ctr.employee_details, null ) employee_detail, if ( views_ctr.employee_info.so is not null, views_ctr.employee_info, null ) emp_info, if ( views_ctr.employee_comments.so is not null, views_ctr.employee_comments, null ) emp_comment, if ( views_ctr.employee_mod_info.so is not null, views_ctr.employee_mod_info, null ) emp_mod_info, if ( views_ctr.emp_indications.so is not null, views_ctr.emp_indications, null ) employees_indication from ( select * from views_sta where emp_partition=0 and employee is not null ) views_ctr ) tt group by employee distribute by employee")
But I got the following exception:
Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: Task not serializable:
java.io.NotSerializableException:
org.apache.spark.unsafe.types.UTF8String$IntWrapper
- object not serializable (class: org.apache.spark.unsafe.types.UTF8String$IntWrapper, value:
org.apache.spark.unsafe.types.UTF8String$IntWrapper@30cfd641)
If I run my query without the collect_set function, it works. Could it be failing because of the struct column types in my table?
How can I write my HQL query in Spark / fix my exception?
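Not from the original thread, but a minimal way to narrow this down (a sketch, assuming the same views_sta table is registered in the session) is to test collect_set over a single struct column in isolation:
sparkSession.sql("""
  select employee, collect_set(employee_details) as details
  from views_sta
  where emp_partition = 0 and employee is not null
  group by employee
""").show(5)
If this already throws the same NotSerializableException, the trigger is collect_set over struct columns rather than the rest of the query.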

Find every path in any direction with specified labels and hops

I have the following graph:
Vertices and edges have been added like this:
def graph = ConfiguredGraphFactory.open('Baptiste')
def g = graph.traversal()
graph.addVertex(label, 'Group', 'text', 'BNP Paribas')
graph.addVertex(label, 'Group', 'text', 'BNP PARIBAS')
graph.addVertex(label, 'Company', 'text', 'JP Morgan Chase')
graph.addVertex(label, 'Location', 'text', 'France')
graph.addVertex(label, 'Location', 'text', 'United States')
graph.addVertex(label, 'Location', 'text', 'Europe')
def v1 = g.V().has('text', 'JP Morgan Chase').next(); def v2 = g.V().has('text', 'BNP Paribas').next(); v1.addEdge('partOf', v2)
def v1 = g.V().has('text', 'JP Morgan Chase').next(); def v2 = g.V().has('text', 'United States').next(); v1.addEdge('doesBusinessIn', v2)
def v1 = g.V().has('text', 'BNP Paribas').next(); def v2 = g.V().has('text', 'United States').next(); v1.addEdge('doesBusinessIn', v2)
def v1 = g.V().has('text', 'BNP Paribas').next(); def v2 = g.V().has('text', 'France').next(); v1.addEdge('partOf', v2)
def v1 = g.V().has('text', 'BNP PARIBAS').next(); def v2 = g.V().has('text', 'Europe').next(); v1.addEdge('partOf', v2)
And I need a query that returns every possible path given specific vertex labels, edge labels and a number of possible hops.
Let's say I need paths with a maximum of 2 hops and all the labels in this example. I tried this query:
def graph=ConfiguredGraphFactory.open('TestGraph');
def g = graph.traversal();
g.V().has(label, within('Location', 'Company', 'Group'))
.repeat(bothE().has(label, within('doesBusinessIn', 'partOf')).bothV().has(label, within('Location', 'Company', 'Group')).simplePath())
.emit().times(2).path();
This query returns 20 paths (it is supposed to return 10). So it returns each path in both possible directions. Is there a way to specify that I only need one direction? I tried adding dedup() to my query, but it returns 7 paths instead of 10, so that doesn't work.
Also, whenever I try to find paths with 4 hops, it doesn't return the "cyclic" paths, such as France -> BNP Paribas -> United States -> JP Morgan Chase -> BNP Paribas. Any idea what to add to my query to allow returning those kinds of paths?
EDIT:
Thanks for your solution @DanielKuppitz. It seems to be exactly what I'm looking for.
I use JanusGraph, built on top of Apache TinkerPop.
I tried the first query:
g.V().hasLabel('Location', 'Company', 'Group').
repeat(bothE('doesBusinessIn', 'partOf').otherV().simplePath()).
emit().times(2).
path().
dedup().
by(unfold().order().by(id).fold())
And it threw the following error:
Error: org.janusgraph.graphdb.relations.RelationIdentifier cannot be cast to java.lang.Comparable
So I moved the dedup() step into the repeat loop, like so:
g.V().hasLabel('Location', 'Company', 'Group').
repeat(bothE('doesBusinessIn', 'partOf').otherV().simplePath().dedup().by(unfold().order().by(id).fold())).
emit().times(2).
path()
And it only returned 6 paths:
[
    [
        "JP Morgan Chase",
        "doesBusinessIn",
        "United States"
    ],
    [
        "JP Morgan Chase",
        "partOf",
        "BNP Paribas"
    ],
    [
        "JP Morgan Chase",
        "partOf",
        "BNP Paribas",
        "partOf",
        "France"
    ],
    [
        "Europe",
        "partOf",
        "BNP PARIBAS"
    ],
    [
        "BNP PARIBAS",
        "partOf",
        "Europe"
    ],
    [
        "United States",
        "doesBusinessIn",
        "JP Morgan Chase"
    ]
]
I'm not sure what's going on here... Any ideas?
Is there a way to specify that I need only 1 direction?
You kinda need a bidirected traversal, so you'll have to filter out duplicated paths at the end ("duplicated" in this case means that 2 paths contain the same elements). In order to do that, you can dedup() paths by a deterministic order of elements; the easiest way to do that is to order the elements by their id.
g.V().hasLabel('Location', 'Company', 'Group').
repeat(bothE('doesBusinessIn', 'partOf').otherV().simplePath()).
emit().times(2).
path().
dedup().
by(unfold().order().by(id).fold())
Any idea what to add in my query to allow returning those kinds of paths (cyclic)?
Your query explicitly prevents cyclic paths through the simplePath() step, so it's not quite clear in which scenarios you want to allow them. I assume that you're okay with a cyclic path if the cycle is created by only the first and last element in the path. In this case, the query would look more like this:
g.V().hasLabel('Location', 'Company', 'Group').as('a').
repeat(bothE('doesBusinessIn', 'partOf').otherV()).
emit().
until(loops().is(4).or().cyclicPath()).
filter(simplePath().or().where(eq('a'))).
path().
dedup().
by(unfold().order().by(id).fold())
Below is the output of the 2 queries (ignore the extra map() step, it's just there to improve the output's readability).
gremlin> g.V().hasLabel('Location', 'Company', 'Group').
......1> repeat(bothE('doesBusinessIn', 'partOf').otherV().simplePath()).
......2> emit().times(2).
......3> path().
......4> dedup().
......5> by(unfold().order().by(id).fold()).
......6> map(unfold().coalesce(values('text'), label()).fold())
==>[BNP Paribas,doesBusinessIn,United States]
==>[BNP Paribas,partOf,France]
==>[BNP Paribas,partOf,JP Morgan Chase]
==>[BNP Paribas,doesBusinessIn,United States,doesBusinessIn,JP Morgan Chase]
==>[BNP Paribas,partOf,JP Morgan Chase,doesBusinessIn,United States]
==>[BNP PARIBAS,partOf,Europe]
==>[JP Morgan Chase,doesBusinessIn,United States]
==>[JP Morgan Chase,partOf,BNP Paribas,doesBusinessIn,United States]
==>[JP Morgan Chase,partOf,BNP Paribas,partOf,France]
==>[France,partOf,BNP Paribas,doesBusinessIn,United States]
gremlin> g.V().hasLabel('Location', 'Company', 'Group').as('a').
......1> repeat(bothE('doesBusinessIn', 'partOf').otherV()).
......2> emit().
......3> until(loops().is(4).or().cyclicPath()).
......4> filter(simplePath().or().where(eq('a'))).
......5> path().
......6> dedup().
......7> by(unfold().order().by(id).fold()).
......8> map(unfold().coalesce(values('text'), label()).fold())
==>[BNP Paribas,doesBusinessIn,United States]
==>[BNP Paribas,partOf,France]
==>[BNP Paribas,partOf,JP Morgan Chase]
==>[BNP Paribas,doesBusinessIn,United States,doesBusinessIn,JP Morgan Chase]
==>[BNP Paribas,doesBusinessIn,United States,doesBusinessIn,BNP Paribas]
==>[BNP Paribas,partOf,France,partOf,BNP Paribas]
==>[BNP Paribas,partOf,JP Morgan Chase,doesBusinessIn,United States]
==>[BNP Paribas,partOf,JP Morgan Chase,partOf,BNP Paribas]
==>[BNP Paribas,doesBusinessIn,United States,doesBusinessIn,JP Morgan Chase,partOf,BNP Paribas]
==>[BNP PARIBAS,partOf,Europe]
==>[BNP PARIBAS,partOf,Europe,partOf,BNP PARIBAS]
==>[JP Morgan Chase,doesBusinessIn,United States]
==>[JP Morgan Chase,doesBusinessIn,United States,doesBusinessIn,JP Morgan Chase]
==>[JP Morgan Chase,partOf,BNP Paribas,doesBusinessIn,United States]
==>[JP Morgan Chase,partOf,BNP Paribas,partOf,France]
==>[JP Morgan Chase,partOf,BNP Paribas,partOf,JP Morgan Chase]
==>[JP Morgan Chase,doesBusinessIn,United States,doesBusinessIn,BNP Paribas,partOf,France]
==>[JP Morgan Chase,doesBusinessIn,United States,doesBusinessIn,BNP Paribas,partOf,JP Morgan Chase]
==>[France,partOf,BNP Paribas,doesBusinessIn,United States]
==>[France,partOf,BNP Paribas,partOf,France]
==>[France,partOf,BNP Paribas,partOf,JP Morgan Chase,doesBusinessIn,United States]
==>[United States,doesBusinessIn,JP Morgan Chase,doesBusinessIn,United States]
==>[United States,doesBusinessIn,BNP Paribas,doesBusinessIn,United States]
==>[United States,doesBusinessIn,JP Morgan Chase,partOf,BNP Paribas,doesBusinessIn,United States]
==>[Europe,partOf,BNP PARIBAS,partOf,Europe]
UPDATE (based on latest comments)
Since JanusGraph has non-comparable edge identifiers, you'll need a unique comparable property on all edges. This can be as simple as a random UUID.
This is how I updated your sample graph:
g.addV('Group').property('text', 'BNP Paribas').as('a').
addV('Group').property('text', 'BNP PARIBAS').as('b').
addV('Company').property('text', 'JP Morgan Chase').as('c').
addV('Location').property('text', 'France').as('d').
addV('Location').property('text', 'United States').as('e').
addV('Location').property('text', 'Europe').as('f').
addE('partOf').from('c').to('a').
property('uuid', UUID.randomUUID().toString()).
addE('doesBusinessIn').from('c').to('e').
property('uuid', UUID.randomUUID().toString()).
addE('doesBusinessIn').from('a').to('e').
property('uuid', UUID.randomUUID().toString()).
addE('partOf').from('a').to('d').
property('uuid', UUID.randomUUID().toString()).
addE('partOf').from('b').to('f').
property('uuid', UUID.randomUUID().toString()).
iterate()
Now that we have properties that can uniquely identify an edge, we also need unique properties (of the same data type) on all vertices. Luckily the existing text properties seem to be good enough for that (otherwise it would be the same story as with the edges - just add a random UUID). The updated queries now look like this:
g.V().hasLabel('Location', 'Company', 'Group').
repeat(bothE('doesBusinessIn', 'partOf').otherV().simplePath()).
emit().times(2).
path().
dedup().
by(unfold().values('text','uuid').order().fold())
g.V().hasLabel('Location', 'Company', 'Group').as('a').
repeat(bothE('doesBusinessIn', 'partOf').otherV()).
emit().
until(loops().is(4).or().cyclicPath()).
filter(simplePath().or().where(eq('a'))).
path().
dedup().
by(unfold().values('text','uuid').order().fold())
The results are, of course, the same as above.

Cassandra indexes vs materialized view

I have the following Cassandra table structure:
CREATE TABLE ringostat.hits (
    hitId uuid,
    clientId VARCHAR,
    session MAP<VARCHAR, TEXT>,
    traffic MAP<VARCHAR, TEXT>,
    PRIMARY KEY (hitId, clientId)
);

INSERT INTO ringostat.hits (hitId, clientId, session, traffic)
VALUES ('550e8400-e29b-41d4-a716-446655440000', 'clientId', {'id': '1', 'number': '1', 'startTime': '1460023732', 'endTime': '1460023762'}, {'referralPath': '/example_path_for_example', 'campaign': '(not set)', 'source': 'www.google.com', 'medium': 'referal', 'keyword': '(not set)', 'adContent': '(not set)', 'campaignId': '', 'gclid': '', 'yclid': ''});

INSERT INTO ringostat.hits (hitId, clientId, session, traffic)
VALUES ('650e8400-e29b-41d4-a716-446655440000', 'clientId', {'id': '1', 'number': '1', 'startTime': '1460023732', 'endTime': '1460023762'}, {'referralPath': '/example_path_for_example', 'campaign': '(not set)', 'source': 'www.google.com', 'medium': 'cpc', 'keyword': '(not set)', 'adContent': '(not set)', 'campaignId': '', 'gclid': '', 'yclid': ''});

INSERT INTO ringostat.hits (hitId, clientId, session, traffic)
VALUES ('750e8400-e29b-41d4-a716-446655440000', 'clientId', {'id': '1', 'number': '1', 'startTime': '1460023732', 'endTime': '1460023762'}, {'referralPath': '/example_path_for_example', 'campaign': '(not set)', 'source': 'www.google.com', 'medium': 'referal', 'keyword': '(not set)', 'adContent': '(not set)', 'campaignId': '', 'gclid': '', 'yclid': ''});
I want to select all rows where source='www.google.com' AND medium='referal'.
SELECT * FROM hits WHERE traffic['source'] = 'www.google.com' AND traffic['medium'] = 'referal' ALLOW FILTERING;
Without adding ALLOW FILTERING I get the error: No supported secondary index found for the non primary key columns restrictions.
That's why I see three options:
Create an index on the traffic column.
Create a materialized view.
Create another table and set an INDEX for the traffic column.
Which is the best option? Also, I have many fields of MAP type on which I will need to filter. What issues could arise if I add an INDEX on every field?
Thank You.
From When to use an index:
Do not use an index in these situations:
On high-cardinality columns because you then query a huge volume of records for a small number of results. [...] Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense.
In tables that use a counter column
On a frequently updated or deleted column.
To look for a row in a large partition unless narrowly queried.
If your planned usage meets one or more of these criteria, it is probably better to use a materialized view.
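For the map columns specifically, option 1 can be sketched like this (assuming Cassandra 2.1+, which supports indexing map entries; the index name here is made up):
CREATE INDEX hits_traffic_entries ON ringostat.hits (ENTRIES(traffic));

-- The entries index serves one equality restriction on the map;
-- the second map restriction still requires ALLOW FILTERING.
SELECT * FROM ringostat.hits
WHERE traffic['source'] = 'www.google.com'
  AND traffic['medium'] = 'referal'
ALLOW FILTERING;
Whether this performs acceptably depends on the cardinality caveats quoted above, which is why a dedicated table or materialized view may still be the better fit for frequent queries.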