Lucene: how to search EAV or 1:m?

I'm using Zend Lucene, but don't think the question is specific to that library.
Say I want to provide fulltext search for a database of books. Assume following models:
Model 1:
TABLE: book
- book_id
- name
TABLE: book_author
- book_author_id
- book_id
- author_id
TABLE: author
- author_id
- name
(a book can have 0 or more authors)
Model 2:
TABLE: book
- book_id
- name
TABLE: book_eav
- book_eav_id
- book_id
- attribute (e.g. "author")
- value (e.g. "Tom Clancy")
(a book can have 0 or more authors + information about publisher, number of pages, etc.)
What do I need to do in order to insert all the authors associated with a particular book in a document to be indexed? Do I put all the authors in one field in the document? Would I use some sort of delimiter to group author information? I'm looking for general strategies with this kind of data.

Put all the authors in one field in the document with a delimiter.
So the document schema will be:
book_id
name
author: |author 1|author 2|...|author n|
other_attribute_1: |val 1|val 2|
other_attribute_2: |val 1|val 2|
With this schema you can search by author with different boosts with a query like:
(author:"|Tom Clancy|")^10 OR
(author:"Tom Clancy")^5 OR
(author:Tom Clancy)^1
This query will show exact matches first, then phrase matches, and finally other matches.
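The delimiter strategy above can be sketched in Python (the question uses Zend Lucene in PHP, so the function names here are illustrative helpers, not part of any library):

```python
def build_author_field(authors):
    """Join all authors for one book into a single delimited field value."""
    if not authors:
        return ""
    return "|" + "|".join(authors) + "|"

def build_author_query(author, field="author"):
    """Build the boosted query: exact delimited match first, then phrase, then terms."""
    return (
        f'({field}:"|{author}|")^10 OR '
        f'({field}:"{author}")^5 OR '
        f'({field}:{author})^1'
    )

field_value = build_author_field(["Tom Clancy", "Mark Greaney"])
# field_value == "|Tom Clancy|Mark Greaney|"
query = build_author_query("Tom Clancy")
```

The same helpers apply unchanged to the EAV model: each distinct attribute becomes its own delimited field.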

Related

cassandra: search for a record where a field (type set) is null

I need to run this query in Cassandra:
select * from classes where students = null allow filtering;
students is a set,
but it looks like sets do not allow the = operator.
To test this out, I followed the DataStax docs on Indexing a Collection.
> CREATE TABLE cyclist_career_teams ( id UUID PRIMARY KEY, lastname text, teams set<text> );
> CREATE INDEX team_idx ON cyclist_career_teams ( teams );
With the table created and a secondary index on the teams set, I inserted some test data and then queried it back:
> SELECT lastname,teams FROM cyclist_career_teams ;
lastname | teams
-----------------+---------------------------------------------------------------------------------------------------------
Vos | {'Neiderland bloeit', 'Rabobank Womens Team', 'Rabobonk-Liv Giant', 'Rabobonk-Liv Womens Cycling Team'}
Van Der Breggen | {'Rabobonk-Liv Womens Cycling Team', 'Sengers Ladies Cycling Team', 'Team Flexpoint'}
Brand | {'AA Drink - Leontien.nl', 'Rabobonk-Liv Giant', 'Rabobonk-Liv Womens Cycling Team'}
Armistead | null
Note that for Lizzie Armistead, I intentionally omitted a value for the teams column. While CQL does not allow the equals "=" relation on set types, it does allow CONTAINS. However, attempting to use that with null yields a different error:
> SELECT lastname,teams FROM cyclist_career_teams WHERE teams CONTAINS null;
[Invalid query] message="Unsupported null value for column teams"
The reason for this behavior is related to how Cassandra has some special treatment for null values and the "null" keyword. Essentially, writing a null creates a tombstone, which is Cassandra's structure signifying a delete.
Even if Cassandra's treatment of null was not a factor, you'd still be faced with the problem that a value of "null" is not unique and your query would have to poll each node in the cluster. Such use cases are well-known anti-patterns. Unfortunately, Cassandra is just not good at querying data (or filtering on a key value) which does not exist.
One thing you could try, would be to use a string literal to indicate an empty value, like this:
> INSERT INTO cyclist_career_teams (id,lastname,teams) VALUES (uuid(),'Armistead',{'empty'});
> SELECT lastname,teams FROM cyclist_career_teams WHERE teams CONTAINS 'empty';
lastname | teams
-----------+-----------
Armistead | {'empty'}
(1 rows)
To be honest though, because of the aforementioned anti-pattern, I can't recommend this approach in good faith. But with some added application logic at creation time, an "empty" string literal could work for you.
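The "added application logic" could be as small as normalizing the set before binding it to the INSERT. A minimal sketch (the sentinel value and function name are assumptions for illustration; this is plain Python, independent of any Cassandra driver):

```python
EMPTY_SENTINEL = "empty"  # assumed marker; pick something that cannot clash with real data

def normalize_teams(teams):
    """Replace a missing/empty set with the sentinel so the row stays
    queryable via CONTAINS instead of an (unsupported) null comparison."""
    return set(teams) if teams else {EMPTY_SENTINEL}

# Values as they would be bound to the INSERT statement:
print(normalize_teams(None))                # {'empty'}
print(normalize_teams({"Team Flexpoint"}))  # {'Team Flexpoint'}
```

Reads would then filter with `teams CONTAINS 'empty'`, as shown in the answer, though the anti-pattern caveat about a low-cardinality value still applies.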

How to add multiple items into a column SQLite3?

I don't want to use different python packages like pickle.
I also don't want to use multiple databases.
So, how do I add a list or a tuple into a column of a database?
I had a theory of adding a string like '(val1, val2, val3)' and then using exec to put it into a variable, but that is far-fetched and there is definitely a better, more efficient way of doing this.
EDIT:
I'll add some more information on what I'm looking for.
I want to get (and add) lists with this type of info:
{'pet':'name','type':'breed/species_of_pet', 'img':img_url, 'hunger':'100'}
I want this dict to be in the pets column.
Each pet can have many owners (many-to-many relationship)
Say you want a users table where each user can have pets. You'd first make a pets table.
create table pets (
id integer primary key,
name text not null,
hunger int not null default 0
);
Then it depends on whether a pet has only one owner (known as a one-to-many relationship) or many owners (known as a many-to-many relationship).
If a pet has one owner, then add a column with the user ID to the pets table. This is a foreign key.
create table pets (
id integer primary key,
-- When a user is deleted, their pet's user_id will be set to null.
user_id integer references users(id) on delete set null,
name text not null,
hunger int not null default 0
);
To get all the pets of one user...
select pets.*
from pets
where user_id = ?
To get the name of a pet's owner, we do a join, matching each row of pets with its owner's row using pets.user_id and users.id.
select users.name
from users
join pets on pets.user_id = users.id
where pets.id = ?
If each pet can have many owners, a many-to-many relationship, we don't put the user_id into pets. Instead we need an extra table: a join table.
create table pet_owners (
-- When a user or pet is deleted, delete the rows relating them.
pet_id integer not null references pets(id) on delete cascade,
user_id integer not null references users(id) on delete cascade
);
We declare that a user owns a pet by inserting into this table.
-- Pet 5 is owned by users 23 and 42.
insert into pet_owners (pet_id, user_id) values (5, 23), (5, 42);
To find a user's pets and their name, we query pet_owners and join with pets to get the name.
select pets.*
from pet_owners
join pets on pet_owners.pet_id = pets.id
where user_id = ?
This might seem weird and awkward, and it is, but it's why SQL databases are so powerful and fast. It's done to avoid having to do any parsing or interpretation of what's in the database. This allows the database to efficiently query data using indexes rather than having to sift through all the data. This makes even very large databases efficient.
When you query select pets.* from pets where user_id = ?, because foreign keys are indexed, SQLite does not search the entire pets table. It uses the index on user_id to jump straight to the matching records. This means the database will perform the same with 10 or 10 million pets.
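The whole many-to-many setup above can be exercised end to end with Python's built-in sqlite3 module (the schema and IDs follow the answer's examples; the user names are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

# The tables from the answer: users, pets, and the join table relating them.
conn.executescript("""
create table users (id integer primary key, name text not null);
create table pets (
    id integer primary key,
    name text not null,
    hunger int not null default 0
);
create table pet_owners (
    pet_id integer not null references pets(id) on delete cascade,
    user_id integer not null references users(id) on delete cascade
);
""")

conn.execute("insert into users (id, name) values (23, 'Alice'), (42, 'Bob')")
conn.execute("insert into pets (id, name) values (5, 'Rex')")
# Pet 5 is owned by users 23 and 42.
conn.execute("insert into pet_owners (pet_id, user_id) values (5, 23), (5, 42)")

# A user's pets, via the join table.
rows = conn.execute("""
    select pets.name
    from pet_owners
    join pets on pet_owners.pet_id = pets.id
    where user_id = ?
""", (23,)).fetchall()
print(rows)  # [('Rex',)]
```

Deleting user 23 would cascade through pet_owners automatically, which is exactly the bookkeeping you would otherwise have to reimplement by hand on top of a JSON column.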
There is nothing stopping you from storing JSON or other array-like text in SQLite; it's just that it's much harder to query when you do so. SQLite does have facilities for manipulating JSON, but in general I would probably lean toward @Schwern's solution.

Excel - Help Needed with Formulas

I'm looking to try do the following;
I want to have say 3 columns.
Transaction | Category | Amount
I want to be able to enter a certain name in Transaction, say for argument's sake "Tesco", have a result returned in the Category column, say "Groceries", and then enter a specific amount myself in the Amount column.
Thing is, I will need quite a lot (effectively unlimited) of different transactions, all in predetermined categories, so that each time I type in a transaction it will automatically display the category for me.
All help much appreciated.
I know a simple IF statement won't suffice. I can get it to work for a single case using a simple IF statement, but as each transaction is different I don't know how to take it further.
Thanks.
Colin
Use a lookup table. Let's say it's on a sheet called "Categories" and it looks like this:
| A | B
1 | Name | Category
2 | Tesco | Groceries
3 | Shell | Fuel
Then, in the table you describe, use =VLOOKUP(A2, Categories!$A$2:$B$3, 2, FALSE) in your "Category" field, assuming it's in B2.
I do this a fair bit using Data Validation and tables.
In this case I would have two tables containing my pick lists on a lookup sheet.
Transaction Table : [Name] = "loTrans" - with just the list of transactions sorted
Category Table : [Name] = "loCategory" - two columns in table, sorted by Both columns - Trans and Category
Header1 : Transactions
Header2 : Category
The Details Table:
the transaction field will have a simple data validation, using a
named range "trans", that selects from the table loTrans.
the category field will also use data validation, using a named
range, but the source of the named range ("selCat") will be a little more
complex. It will be something like:
=OFFSET(loCategory[Trans],MATCH(Enter_Details!A3,loCategory[Trans],0)-1,1,COUNTIF(loCategory[Trans],Enter_Details!A3),1)
As you enter details and select different transactions, the data validation will be limited to the categories of your selected transaction.
An example file

Sub entities in solr

I'm using Solr to index an entity which has an indefinite number of related entities.
Table 1
id name
1 | aa
2 | bb
3 | cc
Table 2
id field1 field2
1 | works in | New York
1 | likes to go to | Paris
As you see, each row represents an entity related to the entity with id 1, and which value corresponds to which matters.
How do I achieve this with Solr's data import handler?
I used SubEntity in data-config.xml and multiValued=true for field1 and field2, but the indexed document looks like
id 1
field1:[works in, likes to go to]
field2:[New York, Paris]
and the relationships between columns were completely lost. If someone searches for "works in Paris", they can also get entity 1. What should I do to maintain the relationships? Thanks a lot.
Schema Definition in schema.xml
id(type string)
name(type string)
worksIn (type string, multiValued=true) - your choice whether multi-value is required or not
likesToGo (type string, multiValued=true) - multi-value makes sense here, as a person most likely has more places to go; anyway, that's your requirement
Sample docs after indexing
1,aa, worksIn[Newyork, New Jersey], likesToGo[Paris, Moon]
2,bb, worksIn[Dallas], likesToGo[NewYork, Sun]
Querying
For "works in Paris", query is "worksIn:Paris".
You get doc with id 1
For "likes to go to sun", query is "likesToGo:sun".
You get doc with id 2
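The schema change above amounts to flattening the child rows into one field per relationship before indexing, instead of parallel field1/field2 lists. A sketch of that transformation in application code (the mapping from field1 text to Solr field names is an assumption for illustration):

```python
# Map the free-text relation in field1 to a concrete Solr field name.
RELATION_TO_FIELD = {
    "works in": "worksIn",
    "likes to go to": "likesToGo",
}

def flatten(parent, child_rows):
    """Fold Table 2 rows into multi-valued fields on the parent doc,
    keeping each (field1, field2) pair together instead of splitting
    them into unrelated field1/field2 lists."""
    doc = dict(parent)
    for relation, value in child_rows:
        field = RELATION_TO_FIELD[relation]
        doc.setdefault(field, []).append(value)
    return doc

doc = flatten({"id": 1, "name": "aa"},
              [("works in", "New York"), ("likes to go to", "Paris")])
# doc == {"id": 1, "name": "aa", "worksIn": ["New York"], "likesToGo": ["Paris"]}
```

With documents shaped like this, "works in Paris" becomes the query worksIn:Paris and no longer matches entity 1.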

Cassandra - How to denormalize two joined tables?

I know Cassandra doesn't support joins, so to use Cassandra we need to denormalize tables. I would like to know how.
Suppose I have two tables
<dl>
<dt>Publisher</dt>
<dd>Id : <i>Primary Key</i></dd>
<dd>Name</dd>
<dd>TimeStamp</dd>
<dd>Address</dd>
<dd>PhoneNo</dd>
<dt>Book</dt>
<dd>Id : <i>Primary Key</i></dd>
<dd>Name</dd>
<dd>ISBN</dd>
<dd>Year</dd>
<dd>PublisherId : <i>Foreign Key - References Publisher table's Id</i></dd>
<dd>Cost</dd>
</dl>
Please let me know how can I denormalize these tables in order to achieve the following operations efficiently
1. Search for all Books published by a particular publisher.
2. Search for all Publishers who published books in a given year.
3. Search for all Publishers who have not published books in a given year.
4. Search for all Publishers who have not published books till now.
I saw a few articles regarding Cassandra, but wasn't able to work out the denormalization for the above operations. Please help me.
Designing a whole schema is a rather big task for one question, but in general terms denormalization means you will repeat the same data in multiple tables so that you can read a single row to get all the data you need for each type of query.
So you would create a table for each type of query, something along these lines:
Create a table partitioned by publisher id and with book id as a clustering column.
Create a table partitioned by year and with publisher id as a clustering column.
Create a table with a list of all publishers. In an application you could then read this list and programmatically subtract the rows present in the desired year read from the table 2.
I'm not sure what "published till now" means. When you insert a new book, you could check if the publisher is present in table 3. If not, then it's a new publisher.
So within each row of the data, you would repeat all the data you wanted to get back with the query (i.e. the union of all the columns in your example tables). When you insert a new book, you would insert it into all of your tables.
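The client-side subtraction suggested for scenario 3 is just a set difference over two reads. A minimal sketch (the function name and sample publishers are illustrative; the two inputs correspond to reading table 3 and table 2 respectively):

```python
def publishers_not_in_year(all_publishers, publishers_in_year):
    """Scenario 3: read the full publisher list (table 3) and the
    publishers with books in the desired year (table 2), then
    subtract client-side."""
    return set(all_publishers) - set(publishers_in_year)

all_pubs = {"Crown Publishing", "The Berkley Publishing Group", "Tor Books"}
pubs_2015 = {"Crown Publishing"}
print(sorted(publishers_not_in_year(all_pubs, pubs_2015)))
# ['The Berkley Publishing Group', 'Tor Books']
```

This keeps each Cassandra read a single-partition (or full-table-of-publishers) query and moves the "not in" logic, which CQL cannot express efficiently, into the application.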
This sounds like it could get huge, so I'll take the first one and walk through how I would approach it. You don't have to do it this way, it's just one way to go about it. Note that you may have to create query tables for each of your 4 scenarios above. This table will solve for the first scenario only.
First of all, I'll create a type for publisher address.
CREATE TYPE address (
street text,
city text,
state text,
postalCode text
);
Next I'll create a table called booksByPublisher. I'll use my address type for publisherAddress. And I'll build my PRIMARY KEY with publisherid as the partition key, clustering on bookYear and isbn.
As you want to be able to query all books by a particular publisher, it makes sense to designate that as the partition key. It may prove helpful to have your results sorted by year, or at the very least be able to look at a specific year for a specific publisher, so I have bookYear as the first clustering key. And of course, to create a unique CQL row for each book within a publisher, I'll add isbn for uniqueness.
CREATE TABLE booksByPublisher (
publisherid UUID,
publisherName text,
publisherAddress frozen<address>,
publisherPhoneNo text,
bookName text,
isbn text,
bookYear bigint,
bookCost bigint,
bookAuthor text,
PRIMARY KEY (publisherid, bookYear, isbn)
);
INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing',{ street: '1745 Broadway', city: 'New York', state:'NY', postalcode: '10019'},'212-782-9000','Ready Player One','978-0307887443',2005,812,'Ernest Cline');
INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing',{ street: '1745 Broadway', city: 'New York', state:'NY', postalcode: '10019'},'212-782-9000','Armada','978-0804137256',2015,1560,'Ernest Cline');
INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (uuid(),'The Berkley Publishing Group',{ street: '375 Hudson Street', city: 'New York', state:'NY', postalcode: '10014'},'212-333-2354','Rainbow Six','978-0425170342',1999,867,'Tom Clancy');
Now I can query all books (out of my 3 rows) published by Crown Publishing (publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b) like this:
aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher
WHERE publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b;
publisherid | bookyear | isbn | bookauthor | bookcost | bookname | publisheraddress | publishername | publisherphoneno
--------------------------------------+----------+----------------+--------------+----------+------------------+-------------------------------------------------------------------------------+------------------+------------------
b7b99ee9-f495-444b-b849-6cea82683d0b | 2005 | 978-0307887443 | Ernest Cline | 812 | Ready Player One | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing | 212-782-9000
b7b99ee9-f495-444b-b849-6cea82683d0b | 2015 | 978-0804137256 | Ernest Cline | 1560 | Armada | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing | 212-782-9000
(2 rows)
If I want, I can also query for all books by Crown Publishing during 2015:
aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher
WHERE publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b AND bookyear=2015;
publisherid | bookyear | isbn | bookauthor | bookcost | bookname | publisheraddress | publishername | publisherphoneno
--------------------------------------+----------+----------------+--------------+----------+----------+-------------------------------------------------------------------------------+------------------+------------------
b7b99ee9-f495-444b-b849-6cea82683d0b | 2015 | 978-0804137256 | Ernest Cline | 1560 | Armada | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing | 212-782-9000
(1 rows)
But I cannot query by just bookyear:
aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher WHERE bookyear=2015;
InvalidRequest: code=2200 [Invalid query] message="Cannot execute this query as it might
involve data filtering and thus may have unpredictable performance. If you want to execute
this query despite the performance unpredictability, use ALLOW FILTERING"
And don't listen to the error message and add ALLOW FILTERING. That might work fine for a table with 3 rows (or even 300). But it won't work for a table with 3 million rows (you'll get a timeout). Cassandra works best when you query by a complete partition key. As publisherid is our partition key, this query will perform just fine. But if you need to query by bookYear, then you should create a table which uses bookYear as its partitioning key.
