Collation type in column declaration - Slick

The following column declaration ends up as a varchar(254) column with the default latin1_* collation in a MySQL database.
def note = column[String]("NOTE")
I want to explicitly provide the collation. Is there a way to do that?

Inserting a value into a frozen set in Cassandra 3

I am currently working on a Cassandra 3 database in which one of the tables has a column defined like this:
column_name map<int, frozen <set<int>>>
When I need to replace the complete set for a given map key x, I just do this:
UPDATE keyspace.table SET column_name[x] = {1,2,3,4,5} WHERE ...
The thing is that I need to insert a single value into the set for a given key. I tried this:
UPDATE keyspace.table SET column_name[x] = column_name[x] + {1} WHERE ...
But it returns:
SyntaxException: line 1:41 no viable alternative at input '[' (... SET column_name[x] = [column_name][...)
What am I doing wrong? Does anyone know how to insert data the way I need?
Since the map's values are frozen, you can't update them like this.
A frozen value serializes multiple components into a single value. Non-frozen types allow updates to individual fields. Cassandra treats the value of a frozen type as a blob. The entire value must be overwritten.
You have to read the full map, get the value for the key, append the new item, and then write the whole set back.
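A minimal sketch of that read-modify-write cycle, reusing the placeholder names from the question (the merged set {1, 2, 3, 4, 5} stands for whatever your application builds after appending the new element):
SELECT column_name FROM keyspace.table WHERE ...;
-- merge the new element into the returned set in application code,
-- then overwrite the complete frozen set for key x:
UPDATE keyspace.table SET column_name[x] = {1, 2, 3, 4, 5} WHERE ...;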

How to configure Presto searches to be case-insensitive?

In my case, Presto connects to a MySQL database which has been configured to be case-insensitive. But any search through Presto seems to be case-sensitive.
Questions:
1) Is there a way to configure Presto searches to be case-insensitive? If not, can something be changed in the Presto-MySQL connector to make the searches case-insensitive?
2) If underlying DB is case-insensitive, shouldn't Presto searches also be case-insensitive? (I presume that Presto only generates the query plan and the actual execution happens on the underlying database)
Example: Consider the below table on MySQL.
name
____
adam
Alan
select * from table where name like '%a%'
// returns adam, Alan on MySQL
// returns only adam on Presto
select * from table where name = 'Adam'
// returns adam on MySQL
// returns NIL on Presto
You have to explicitly ask for a case-insensitive comparison by normalizing the compared values to lower or upper case, like this:
select * from table where lower(name) like '%a%';
select * from table where lower(name) = lower('Adam');
You can also use regexp_like() and prefix the regexp with (?i) for case insensitivity:
select * from table_name
where regexp_like(column_name, '(?i)fOO');  -- matches when the column contains foo in any case
or
select * from table_name
where regexp_like(column_name, '(?i)^Foo'); -- matches when the column starts with foo in any case

COPY FROM CSV with static fields on Postgres

I'd like to replace an existing system that imports CSV files into a PostgreSQL 9.5 database with a more efficient one.
I'd like to use the COPY statement because of its good performance. The problem is that I need to have one field populated that is not in the CSV file.
Is there a way to have the COPY statement add a static field to all the inserted rows?
The perfect solution would look like this:
COPY data(field1, field2, field3='Account-005')
FROM '/tmp/Account-005.csv'
WITH DELIMITER ',' CSV HEADER;
Do you know a way to have that field populated in every row?
My server is running node.js, so I'm also open to any cost-efficient solution that preprocesses the files with node before COPYing them.
Use a temp table to import into. This allows you to:
add/remove/update columns
add extra literal data
delete or ignore records (such as duplicates)
before inserting the new records into the actual table.
-- target table
CREATE TABLE data
( id SERIAL PRIMARY KEY
, batch_name varchar NOT NULL
, remote_key varchar NOT NULL
, payload varchar
, UNIQUE (batch_name, remote_key)
-- or::
-- , UNIQUE (remote_key)
);
-- temp table
CREATE TEMP TABLE temp_data
( remote_key varchar -- PRIMARY KEY
, payload varchar
);
COPY temp_data(remote_key,payload)
FROM '/tmp/Account-005'
;
-- The actual insert
-- (you could also filter out or handle duplicates here)
INSERT INTO data(batch_name, remote_key, payload)
SELECT 'Account-005', t.remote_key, t.payload
FROM temp_data t
;
BTW, it is possible to automate the above: put it into a function (or maybe a prepared statement) that takes the filename/literal as an argument.
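A minimal PL/pgSQL sketch of that idea, assuming the data table above (the function name import_batch is made up, and COPY from a server-side file still requires the usual file-read privileges):
CREATE OR REPLACE FUNCTION import_batch(p_batch_name text, p_filename text)
RETURNS void AS $$
BEGIN
  -- staging table lives only for the duration of the transaction
  CREATE TEMP TABLE temp_data
  ( remote_key varchar
  , payload varchar
  ) ON COMMIT DROP;
  -- COPY does not accept a parameter as its filename, so build the statement dynamically
  EXECUTE format('COPY temp_data(remote_key, payload) FROM %L WITH DELIMITER '','' CSV HEADER', p_filename);
  INSERT INTO data(batch_name, remote_key, payload)
  SELECT p_batch_name, t.remote_key, t.payload
  FROM temp_data t;
END;
$$ LANGUAGE plpgsql;
-- usage:
-- SELECT import_batch('Account-005', '/tmp/Account-005.csv');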
Set a default for the column:
alter table data
alter column field3 set default 'Account-005'
Do not mention it in the COPY command:
COPY data(field1, field2) FROM...

Non-ordinal access to rows returned by a Spark SQL query

In the Spark documentation, it is stated that the result of a Spark SQL query is a SchemaRDD. Each row of this SchemaRDD can in turn be accessed by ordinal. I am wondering if there is any way to access the columns using the field names of the case class on top of which the SQL query was built. I appreciate the fact that the case class is not associated with the result, especially if I have selected individual columns and/or aliased them: however, some way to access fields by name rather than ordinal would be convenient.
A simple way is to use the "language-integrated" select method on the resulting SchemaRDD to select the column(s) you want -- this still gives you a SchemaRDD, and if you select more than one column then you will still need to use ordinals, but you can always select one column at a time. Example:
// setup and some data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Score(name: String, value: Int)
val scores =
sc.textFile("data.txt").map(_.split(",")).map(s => Score(s(0),s(1).trim.toInt))
scores.registerAsTable("scores")
// initial query
val original =
sqlContext.sql("Select value AS myVal, name FROM scores WHERE name = 'foo'")
// now a simple "language-integrated" query -- no registration required
val secondary = original.select('myVal)
secondary.collect().foreach(println)
Now secondary is a SchemaRDD with just one column, and it works despite the alias in the original query.
Edit: but note that you can register the resulting SchemaRDD and query it with straight SQL syntax without needing another case class.
original.registerAsTable("original")
val secondary = sqlContext.sql("select myVal from original")
secondary.collect().foreach(println)
Second edit: When processing an RDD one row at a time, it's possible to access the columns by name using pattern matching:
val secondary = original.map {case Row(myVal: Int, _) => myVal}
although this could get cumbersome if the right-hand side of the '=>' requires access to a lot of the columns, as they would each need to be matched on the left. (This comes from a very useful comment in the source code for the Row companion object.)

H2 database collation: what to choose?

After a lot of reading and experimentation, it seems like I want PRIMARY strength for searching, but TERTIARY or IDENTICAL for ordering. Main question: Is that possible to achieve with H2 (or any other DB)?
Secondary question: Am I the only one here or would any of you also like the above combination? Some confirmation would be helpful for my sanity.
Background:
It seems like the collation can only be set at the very beginning when creating the database. So I want to make sure to pick the right one. I am mainly thinking of these use cases (for now):
A search field where the user can start typing to filter a table: Here PRIMARY seems the most appropriate, in order to avoid missing any results (user is used to Google...). Although, it would be nice to be able to give the user the option to enable secondary or tertiary collation to do more precise searching.
Ordering: When the user clicks a table column to order the contents, TERTIARY/IDENTICAL ordering seems appropriate. That's what I am used to from everyday experience.
I read the official H2 docs here: http://www.h2database.com/html/commands.html#set_collation.
and here: http://www.h2database.com/html/datatypes.html#varchar_ignorecase_type
Some more related info:
Collation STRENGTH and local language relation
The test sql (from https://groups.google.com/forum/?fromgroups=#!topic/h2-database/lBksrrcuGdY):
drop all objects;
set collation english STRENGTH PRIMARY;
create table test(name varchar);
insert into test values ('À'), ('Ä'), ('Â'), ('A'), ('à'), ('ä'), ('â'), ('a'), ('àa'), ('äa'), ('âa'), ('aa'), ('B'), ('b');
select * from test where name like 'a' order by name;
select * from test order by name;
If you want two behaviours for the same data, you have to:
split the data over two columns,
or use two operator sets.
For your purpose, it is common to store a "canonical" representation of the raw data so you can search on the canonical form and then sort/display the raw data. Maybe you should use a text search engine such as Apache Lucene.
For a pure H2 solution, you can use an H2 alias with computed columns or with query criteria. The first option allows indexing to speed up your queries.
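A rough sketch of the computed-column variant (all names are made up; older H2 versions accept "AS <expression>" for computed columns while newer ones use GENERATED ALWAYS AS, and LOWER() is only a stand-in for whatever canonicalization or alias function you actually need):
CREATE TABLE person
( name       VARCHAR
, name_canon VARCHAR AS LOWER(name)  -- canonical form, used only for searching
);
CREATE INDEX idx_person_name_canon ON person(name_canon);
-- search on the canonical column, sort/display the raw column
SELECT name FROM person
WHERE name_canon LIKE LOWER('%muller%')
ORDER BY name;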
Almost 8 years later, my own recommendation based on some hard learnings:
Use no collation at all (default for H2 databases).
Rationale: Using a collation will produce some really unexpected results and bugs.
Pitfall: UNIQUE constraints
By far the most common unique constraint I saw in daily business was to enforce unique (firstname, lastname). Typically, case should be ignored (prevent both 'thomas müller' and 'Thomas Müller'), but not umlauts (allow both 'Thomas Müller' and 'Thomas Muller').
It might be tempting to use a collation strength SECONDARY setting to achieve this (case-insensitive but umlaut-sensitive). Don't. Use VARCHAR_IGNORECASE columns instead.
{
// NOT recommended: using SECONDARY collation
Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
s.execute("SET COLLATION ENGLISH STRENGTH SECONDARY");
s.execute("CREATE TABLE test ( name VARCHAR )");
s.execute("ALTER TABLE test ADD CONSTRAINT unique_name UNIQUE(name)");
s.execute("INSERT INTO test (name) VALUES ('Müller')");
s.execute("INSERT INTO test (name) VALUES ('Muller')");
// s.execute("INSERT INTO test (name) VALUES ('muller')" /* will fail */);
}
{
// recommended: no collation, using VARCHAR_IGNORECASE instead of VARCHAR column
Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
s.execute("CREATE TABLE test ( name VARCHAR_IGNORECASE )");
s.execute("ALTER TABLE test ADD CONSTRAINT unique_name UNIQUE(name)");
s.execute("INSERT INTO test (name) VALUES ('Müller')");
s.execute("INSERT INTO test (name) VALUES ('Muller')");
// s.execute("INSERT INTO test (name) VALUES ('muller')" /* will fail */);
}
Pitfall: Searching / WHERE clauses
Recommendation: The default behavior without a collation is just fine and behaves as expected. For fuzzier searching, use your own search code or a library like Lucene.
SECONDARY collation strength will match even if case is different. You will not expect that behavior when using SELECT WHERE name = '...', because you will forget all about your collation setting.
{
Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
s.execute("SET COLLATION ENGLISH STRENGTH SECONDARY");
s.execute("CREATE TABLE test ( name VARCHAR )");
s.execute("INSERT INTO test (name) VALUES ('Thomas Müller')");
ResultSet rs = s.executeQuery("SELECT count(*) FROM test WHERE name = 'Thomas müller'" /* different case */);
rs.next();
/* prints 1 (!) */ System.out.println(rs.getLong(1));
}
PRIMARY collation strength will match even if SPACES are different. Would you believe the English primary collation ignores spaces? Check out this nugget: https://stackoverflow.com/a/16567963/1124509
{
Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
s.execute("SET COLLATION ENGLISH STRENGTH PRIMARY");
s.execute("CREATE TABLE test ( name VARCHAR )");
s.execute("INSERT INTO test (name) VALUES ('Thomas Müller')");
ResultSet rs = s.executeQuery("SELECT count(*) FROM test WHERE name = 'ThomasMüller'" /* no space! */);
rs.next();
/* prints 1 (!) */ System.out.println(rs.getLong(1));
}
Sorting / ORDER BY clauses
The default ordering without a collation is not really useful in real-world scenarios, as it sorts by strict string comparison. Solve this by loading the data from the database first and then ordering/sorting it in code.
Personally, I mostly use an English primary strength collator with the spaces problem fixed. Works fine even for non-English text columns.
But you might also need to use a custom comparator to satisfy more difficult requirements like natural or intuitive sort orders, e.g. sort like windows explorer, or semantic versioning.
{
Statement s = DriverManager.getConnection("jdbc:h2:mem:", "test", "test").createStatement();
s.execute("CREATE TABLE test ( name VARCHAR )");
s.execute("INSERT INTO test (name) VALUES ('é6')");
s.execute("INSERT INTO test (name) VALUES ('e5')");
s.execute("INSERT INTO test (name) VALUES ('E4')");
s.execute("INSERT INTO test (name) VALUES ('ä3')");
s.execute("INSERT INTO test (name) VALUES ('a2')");
s.execute("INSERT INTO test (name) VALUES ('A1')");
ResultSet rs = s.executeQuery("SELECT name FROM test ORDER BY name");
List<String> names = new ArrayList<>();
while(rs.next()) {
names.add(rs.getString(1));
}
// not very useful strict String.compareTo() result: [A1, E4, a2, e5, ä3, é6]
System.out.print(names);
String rules = ((RuleBasedCollator) Collator.getInstance(new Locale("en", "US"))).getRules();
Collator collator = new RuleBasedCollator(rules.replaceAll("<'\u005f'", "<' '<'\u005f'"));
collator.setStrength(Collator.PRIMARY);
names.sort((a, b) -> collator.compare(a, b));
// as humans usually expect it in a name list / table: [A1, a2, ä3, E4, e5, é6]
System.out.print(names);
}
How to check if your H2 database is using a collation?
Look at the SETTINGS table. If no collation is set, there will be no entry in the table.
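For example, with a query like this (column names differ between H2 versions: NAME/VALUE in 1.4.x, SETTING_NAME/SETTING_VALUE in 2.x):
SELECT * FROM INFORMATION_SCHEMA.SETTINGS WHERE NAME = 'COLLATION';
-- no row returned means no collation has been set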
