H2 database collation: what to choose? - search

After a lot of reading and experimentation, it seems like I want PRIMARY strength for searching, but TERTIARY or IDENTICAL for ordering. Main question: Is that possible to achieve with H2 (or any other DB)?
Secondary question: Am I the only one here or would any of you also like the above combination? Some confirmation would be helpful for my sanity.
Background:
It seems like the collation can only be set at the very beginning when creating the database. So I want to make sure to pick the right one. I am mainly thinking of these use cases (for now):
A search field where the user can start typing to filter a table: Here PRIMARY seems the most appropriate, in order to avoid missing any results (user is used to Google...). Although, it would be nice to be able to give the user the option to enable secondary or tertiary collation to do more precise searching.
Ordering: When the user clicks a table column to order the contents, TERTIARY/IDENTICAL ordering seems appropriate. That's what I am used to from everyday experience.
I read the offical H2 docs here: http://www.h2database.com/html/commands.html#set_collation.
and here: http://www.h2database.com/html/datatypes.html#varchar_ignorecase_type
Some more related info:
Collation STRENGTH and local language relation
The test sql (from https://groups.google.com/forum/?fromgroups=#!topic/h2-database/lBksrrcuGdY):
drop all objects;
set collation english STRENGTH PRIMARY;
create table test(name varchar);
insert into test values ('À'), ('Ä'), ('Â'), ('A'), ('à'), ('ä'), ('â'), ('a'), ('àa'), ('äa'), ('âa'), ('aa'), ('B'), ('b');
select * from test where name like 'a' order by name;
select * from test order by name;

If you want to have two behaviours for a single data you have to:
split data over two columns,
or uses two operator sets.
For your purpose, it is common to store "canonical" representation of a raw data in order to search on canonical form and then sort/display raw data. May be you should use some "text search engine" such as Apache Lucene.
For pure H2 solutions, you can use H2 alias with Computed columns or with query criteria. First solution allows indexing to speed up your queries.

Almost 8 years later, my own recommendation based on some hard learnings:
Use no collation at all (default for H2 databases).
Rationale: Using a collation will produce some really unexpected results and bugs.
Pitfall: UNIQUE constraints
By far the most common unique constraints i saw in daily business was to enforce unique (firstname, lastname). Typically, case should be ignored (prevent both 'thomas müller' and 'Thomas Müller'), but not umlauts (allow both 'Thomas Müller' and 'Thomas Muller').
It might be tempting to use a collation strength SECONDARY setting to achieve this (case-insensitive but umlaut-sensitive). Don't. Use VARCHAR_IGNORECASE columns instead.
{
// NOT recommended: using SECONDARY collation
Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
s.execute("SET COLLATION ENGLISH STRENGTH SECONDARY");
s.execute("CREATE TABLE test ( name VARCHAR )");
s.execute("ALTER TABLE test ADD CONSTRAINT unique_name UNIQUE(name)");
s.execute("INSERT INTO test (name) VALUES ('Müller')");
s.execute("INSERT INTO test (name) VALUES ('Muller')");
// s.execute("INSERT INTO test (name) VALUES ('muller')" /* will fail */);
}
{
// recommended: no collation, using VARCHAR_IGNORECASE instead of VARCHAR column
Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
s.execute("CREATE TABLE test ( name VARCHAR_IGNORECASE )");
s.execute("ALTER TABLE test ADD CONSTRAINT unique_name UNIQUE(name)");
s.execute("INSERT INTO test (name) VALUES ('Müller')");
s.execute("INSERT INTO test (name) VALUES ('Muller')");
// s.execute("INSERT INTO test (name) VALUES ('muller')" /* will fail */);
}
Pitfall: Searching / WHERE clauses
Recommendation: The default behavior without collation is just fine, and behaves as expected. For more fuzzy searching use your own code search or a library like Lucene.
SECONDARY collation strength will match even if case is different. You will not expect that behavior when using SELECT WHERE name = '...', because you will forget all about your collation setting.
{
Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
s.execute("SET COLLATION ENGLISH STRENGTH SECONDARY");
s.execute("CREATE TABLE test ( name VARCHAR )");
s.execute("INSERT INTO test (name) VALUES ('Thomas Müller')");
ResultSet rs = s.executeQuery("SELECT count(*) FROM test WHERE name = 'Thomas müller'" /* different case */);
rs.next();
/* prints 1 (!) */ System.out.println(rs.getLong(1));
}
PRIMARY collation strength will match even if SPACES are different. Would you believe the English primary collation ignores spaces? Check out this nugget: https://stackoverflow.com/a/16567963/1124509
{
Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
s.execute("SET COLLATION ENGLISH STRENGTH PRIMARY");
s.execute("CREATE TABLE test ( name VARCHAR )");
s.execute("INSERT INTO test (name) VALUES ('Thomas Müller')");
ResultSet rs = s.executeQuery("SELECT count(*) FROM test WHERE name = 'ThomasMüller'" /* no space! */);
rs.next();
/* prints 1 (!) */ System.out.println(rs.getLong(1));
}
Sorting / ORDER BY clauses
The default ordering without collation is not really useful in real-world scenarios, as it will sort according to strict string comparison. Solve this by first loading the data from the database, and then order / sort it with code.
Personally, I mostly use an English primary strength collator with the spaces problem fixed. Works fine even for non-English text columns.
But you might also need to use a custom comparator to satisfy more difficult requirements like natural or intuitive sort orders, e.g. sort like windows explorer, or semantic versioning.
{
Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
s.execute("CREATE TABLE test ( name VARCHAR )");
s.execute("INSERT INTO test (name) VALUES ('é6')");
s.execute("INSERT INTO test (name) VALUES ('e5')");
s.execute("INSERT INTO test (name) VALUES ('E4')");
s.execute("INSERT INTO test (name) VALUES ('ä3')");
s.execute("INSERT INTO test (name) VALUES ('a2')");
s.execute("INSERT INTO test (name) VALUES ('A1')");
ResultSet rs = s.executeQuery("SELECT name FROM test ORDER BY name");
List<String> names = new ArrayList<>();
while(rs.next()) {
names.add(rs.getString(1));
}
// not very useful strict String.compareTo() result: [A1, E4, a2, e5, ä3, é6]
System.out.print(names);
String rules = ((RuleBasedCollator) Collator.getInstance(new Locale("en", "US"))).getRules();
Collator collator = new RuleBasedCollator(rules.replaceAll("<'\u005f'", "<' '<'\u005f'"));
collator.setStrength(Collator.PRIMARY);
names.sort((a, b) -> collator.compare(a, b));
// as humans usually expect it in a name list / table: [A1, a2, ä3, E4, e5, é6]
System.out.print(names);
}
How to check if your H2 database is using a collation?
Look at the SETTINGS table. If no collation is set, there will be no entry in the table.

Related

Pass column name as argument - Postgres and Node JS

I have a query (Update statement) wrapped in a function and will need to perform the same statement on multiple columns during the course of my script
async function update_percentage_value(value, id){
(async () => {
const client = await pool.connect();
try {
const res = await client.query('UPDATE fixtures SET column_1_percentage = ($1) WHERE id = ($2) RETURNING *', [value, id]);
} finally {
client.release();
}
})().catch(e => console.log(e.stack))
}
I then call this function
update_percentage_value(50, 2);
I have many columns to update at various points of my script, each one needs to be done at the time. I would like to be able to just call the one function, passing the column name, value and id.
My table looks like below
CREATE TABLE fixtures (
ID SERIAL PRIMARY KEY,
home_team VARCHAR,
away_team VARCHAR,
column_1_percentage INTEGER,
column_2_percentage INTEGER,
column_3_percentage INTEGER,
column_4_percentage INTEGER
);
Is it at all possible to do this?
I'm going to post the solution that was advised by Sehrope Sarkuni via the node-postgres GitHub repo. This helped me a lot and works for what I require:
No column names are identifiers and they can't be specified as parameters. They have to be included in the text of the SQL command.
It is possible but you have to build the SQL text with the column names. If you're going to dynamically build SQL you should make sure to escape the components using something like pg-format or use an ORM that handles this type of thing.
So something like:
const format = require('pg-format');
async function updateFixtures(id, column, value) {
const sql = format('UPDATE fixtures SET %I = $1 WHERE id = $2', column);
await pool.query(sql, [value, id]);
}
Also if you're doing multiple updates to the same row back-to-back then you're likely better off with a single UPDATE statement that modifies all the columns rather than separate statements as they'd be both slower and generate more WAL on the server.
To get the column names of the table, you can query the information_schema.columns table which stores the details of column structure of your table, this would help you in framing a dynamic query for updating a specific column based on a specific result.
You can get the column names of the table with the help of following query:
select column_name from information_schema.columns where table_name='fixtures' and table_schema='public';
The above query would give you the list of columns in the table.
Now to update each one for a specific purpose, You can store the result set of column name to a variable and pass that variable to the function to perform the required action.

Azure CosmosDB: how to ORDER BY id?

Using a vanilla CosmosDB collection (all default), adding documents like this:
{
"id": "3",
"name": "Hannah"
}
I would like to retrieve records ordered by id, like this:
SELECT c.id FROM c
ORDER BY c.id
This give me the error Order-by item requires a range index to be defined on the corresponding index path.
I expect this is because /id is hash indexed and not range indexed. I've tried to change the Indexing Policy in various ways, but any change I make which would touch / or /id gets wiped when I save.
How can I retrieve documents ordered by ID?
The best way to do this is to store a duplicate property e.g. id2 that has the same value of id, and is indexed using a range index, then use that for sorting, i.e. query for SELECT * FROM c ORDER BY c.id2.
PS: The reason this is not supported is because id is part of a composite index (which is on partition key and row key; id is the row key part) The Cosmos DB team is working on a change that will allow sorting by id.
EDIT: new collections now support ORDER BY c.id as of 7/12/19
I found this page CosmosDB Indexing Policies , which has the below Note that may be helpful:
Azure Cosmos DB returns an error when a query uses ORDER BY but
doesn't have a Range index against the queried path with the maximum
precision.
Some other information from elsewhere in the document:
Range supports efficient equality queries, range queries (using >, <,
>=, <=, !=), and ORDER BY queries. ORDER By queries by default also require maximum index precision (-1). The data type can be String or
Number.
Some guidance on types of queries assisted by Range queries:
Range Range over /prop/? (or /) can be used to serve the following
queries efficiently:
SELECT FROM collection c WHERE c.prop = "value"
SELECT FROM collection c WHERE c.prop > 5
SELECT FROM collection c ORDER BY c.prop
And a code example from the docs also:
var rangeDefault = new DocumentCollection { Id = "rangeCollection" };
// Override the default policy for strings to Range indexing and "max" (-1) precision
rangeDefault.IndexingPolicy = new IndexingPolicy(new RangeIndex(DataType.String) { Precision = -1 });
await client.CreateDocumentCollectionAsync(UriFactory.CreateDatabaseUri("db"), rangeDefault);
Hope this helps,
J

cassandra : name provided was not in the list of valid column labels error

i'm using cassandra 1.2.8. i have a column family like below:
CREATE TABLE word_probability (
word text,
category text,
probability double,
PRIMARY KEY (word,category)
);
when i use query like this:
String query = "SELECT * FROM word_probability WHERE word='%s' AND category='%s';";
it works well but for some words i get this message:
name provided was not in the list of valid column labels error
every thing is ok and i don't know why i get this error :(
You're not doing anything wrong except mixing up cql with sql. Cql doesn't support % wildcards.

Cannot link MS Access query with subquery

I have created a query with a subquery in Access, and cannot link it in Excel 2003: when I use the menu Data -> Import External Data -> Import Data... and select the mdb file, the query is not present in the list. If I use the menu Data -> Import External Data -> New Database Query..., I can see my query in the list, but at the end of the import wizard I get this error:
Too few parameters. Expected 2.
My guess is that the query syntax is causing the problem, in fact the query contains a subquery. So, I'll try to describe the query goal and the resulting syntax.
Table Positions
ID (Autonumber, Primary Key)
position (double)
currency_id (long) (references Currency.ID)
portfolio (long)
Table Currency
ID (Autonumber, Primary Key)
code (text)
Query Goal
Join the 2 tables
Filter by portfolio = 1
Filter by currency.code in ("A", "B")
Group by currency and calculate the sum of the positions for each currency group an call the result: sumOfPositions
Calculate abs(sumOfPositions) on each currency group
Calculate the sum of the previous results as a single result
Query
The query without the final sum can be created using the Design View. The resulting SQL is:
SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")));
in order to calculate the final SUM I did the following (in the SQL View):
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
So, the question is: is there a better way for structuring the query in order to make the export work?
I can't see too much wrong with it, but I would take out some of the junk Access puts in and scale down the query to this, hopefully this should run ok:
SELECT Sum(Abs(A.SumOfPosition)) As SumAbs
FROM (SELECT C.code, Sum(P.position) AS SumOfposition
FROM Currency As C INNER JOIN Positions As P ON C.ID = P.currency_id
WHERE P.portfolio=1
GROUP BY C.code
HAVING C.code In ("A","B")) As A
It might be worth trying to declare your parameters in the MS Access query definition and define their datatypes. This is especially important when you are trying to use the query outside of MS Access itself, since it can't auto-detect the parameter types. This approach is sometimes hit or miss, but worth a shot.
PARAMETERS [[Positions].[portfolio]] Long, [[Currency].[code]] Text ( 255 );
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
I have solved my problems thanks to the fact that the outer query is doing a trivial sum. When choosing New Database Query... in Excel, at the end of the process, after pressing Finish, an Import Data form pops up, asking
Where do you want to put the data?
you can click on Create a PivotTable report... . If you define the PivotTable properly, Excel will display only the outer sum.

Subsonic 3 Simple Query inner join sql syntax

I want to perform a simple join on two tables (BusinessUnit and UserBusinessUnit), so I can get a list of all BusinessUnits allocated to a given user.
The first attempt works, but there's no override of Select which allows me to restrict the columns returned (I get all columns from both tables):
var db = new KensDB();
SqlQuery query = db.Select
.From<BusinessUnit>()
.InnerJoin<UserBusinessUnit>( BusinessUnitTable.IdColumn, UserBusinessUnitTable.BusinessUnitIdColumn )
.Where( BusinessUnitTable.RecordStatusColumn ).IsEqualTo( 1 )
.And( UserBusinessUnitTable.UserIdColumn ).IsEqualTo( userId );
The second attept allows the column name restriction, but the generated sql contains pluralised table names (?)
SqlQuery query = new Select( new string[] { BusinessUnitTable.IdColumn, BusinessUnitTable.NameColumn } )
.From<BusinessUnit>()
.InnerJoin<UserBusinessUnit>( BusinessUnitTable.IdColumn, UserBusinessUnitTable.BusinessUnitIdColumn )
.Where( BusinessUnitTable.RecordStatusColumn ).IsEqualTo( 1 )
.And( UserBusinessUnitTable.UserIdColumn ).IsEqualTo( userId );
Produces...
SELECT [BusinessUnits].[Id], [BusinessUnits].[Name]
FROM [BusinessUnits]
INNER JOIN [UserBusinessUnits]
ON [BusinessUnits].[Id] = [UserBusinessUnits].[BusinessUnitId]
WHERE [BusinessUnits].[RecordStatus] = #0
AND [UserBusinessUnits].[UserId] = #1
So, two questions:
- How do I restrict the columns returned in method 1?
- Why does method 2 pluralise the column names in the generated SQL (and can I get round this?)
I'm using 3.0.0.3...
So far my experience with 3.0.0.3 suggests that this is not possible yet with the query tool, although it is with version 2.
I think the preferred method (so far) with version 3 is to use a linq query with something like:
var busUnits = from b in BusinessUnit.All()
join u in UserBusinessUnit.All() on b.Id equals u.BusinessUnitId
select b;
I ran into the pluralized table names myself, but it was because I'd only re-run one template after making schema changes.
Once I re-ran all the templates, the plural table names went away.
Try re-running all 4 templates and see if that solves it for you.

Resources