Redshift: Truncate VARCHAR value automatically on INSERT or maybe use max length?

When performing an INSERT, Redshift does not allow you to insert a string value that is longer/wider than the target field in the table. Observe:
CREATE TEMPORARY TABLE test (col VARCHAR(5));
-- result: 'Table test created'
INSERT INTO test VALUES('abcdefghijkl');
-- result: '[Amazon](500310) Invalid operation: value too long for type character varying(5);'
One workaround for this is to cast the value:
INSERT INTO test VALUES('abcdefghijkl'::VARCHAR(5));
-- result: 'INSERT INTO test successful, 1 row affected'
The annoying part is that all of my code will now need these casts on every INSERT for each VARCHAR field, or the application code will have to truncate the string before constructing the query. Either way, the column's width specification has to go into the application code.
Is there any better way of doing this with Redshift? It would be great if there were an option to have the server truncate the string and perform the insert (and maybe raise a warning), the way MySQL does.
One thing I could do is just declare these particular fields as a very large VARCHAR, perhaps even 65535 (the maximum).
create table analytics.testShort (a varchar(3));
create table analytics.testLong (a varchar(4096));
create table analytics.testSuperLong (a varchar(65535));
insert into analytics.testShort values('abc');
insert into analytics.testLong values('abc');
insert into analytics.testSuperLong values('abc');
-- Redshift reports the size for each table is the same, 4 mb
The one disadvantage of this approach I have found is that it will cause bad performance if this column is used in a group by/join/etc:
https://discourse.looker.com/t/troubleshooting-redshift-performance-extensive-guide/326
(search for VARCHAR)
I am wondering though if there is no harm otherwise if you plan to never use this field in group by, join, and the like.
Some things to note in my scenario: Yes, I really don't care about the extra characters that may be lost with truncation, and no, I don't have a way to enforce the length of the source text. I am capturing messages and URLs from external sources, which generally fall within a certain range of lengths, but sometimes there are longer ones. It doesn't matter in our application whether they get truncated in storage.

The only way to automatically truncate the strings to match the column width is to use the COPY command with the TRUNCATECOLUMNS option:
Truncates data in columns to the appropriate number of characters so
that it fits the column specification. Applies only to columns with a
VARCHAR or CHAR data type, and rows 4 MB or less in size.
Otherwise, you will have to take care of the length of your strings using one of these two methods:
Explicitly CAST your values to the VARCHAR you want:
INSERT INTO test VALUES(CAST('abcdefghijkl' AS VARCHAR(5)));
Use the LEFT and RIGHT string functions to truncate your strings:
INSERT INTO test VALUES(LEFT('abcdefghijkl', 5));
Note: CAST should be your first option because it handles multi-byte characters properly. LEFT truncates based on the number of characters, not bytes, so if the string contains a multi-byte character the result might still exceed the limit of your column.
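If the truncation ends up in application code anyway, the byte-versus-character distinction from the note above can be sketched in Python. This is only an illustration: the helper name truncate_utf8 is hypothetical, and it assumes the column width is counted in UTF-8 bytes, as the note implies.

```python
def truncate_utf8(value: str, max_bytes: int) -> str:
    """Cut a string so its UTF-8 encoding fits in max_bytes (hypothetical helper)."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # Slicing bytes may cut a multi-byte character in half;
    # errors="ignore" drops that incomplete trailing sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8("abcdefghijkl", 5))  # abcde
print(truncate_utf8("héllo", 5))         # héll (5 characters would need 6 bytes)
```

Truncating by characters instead (value[:5]) would keep "héllo", which needs 6 bytes and would still be rejected by a 5-byte column.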

Related

Use node.js to read the rare Chinese word from SQL Server

I encountered a problem when trying to read data from an ERP database. One value is '鉫承工程', and it is stored in SQL Server as ' 承工程' (the data type is varchar). The ERP system displays the correct text, '鉫承工程'.
However, when I use Node.js to read the data, I get garbled text like '�r承工程'. How can I get the correct text without changing the data in the database?
Have a read through the Constants (Transact-SQL) documentation with respect to Character string constants:
Character string constants are enclosed in single quotation marks and include alphanumeric characters (a-z, A-Z, and 0-9) and special characters, such as exclamation point (!), at sign (@), and number sign (#). Character string constants are assigned the default collation of the current database. If the COLLATE clause is used, the conversion to the database default code page still happens before the conversion to the collation specified by the COLLATE clause. Character strings typed by users are evaluated through the code page of the computer and are translated to the database default code page if it is required.
The key statement here is:
Character string constants are assigned the default collation of the current database.
The implication here is that, when a column's collation is different than the database's default collation, using regular character string constants to insert values containing international characters can cause loss of information (i.e.: corrupted characters) as the literal is first interpreted using the database's default collation and is then converted to the column's collation before storage.
When this happens character corruption is usually evident with unconvertible characters getting replaced by the question mark (?) character.
Consider the following examples:
use master;
-- Typical default collation on systems using en-US...
create database Z_Demo1 collate SQL_Latin1_General_CP1_CI_AS;
create database Z_Demo2 collate Chinese_PRC_CI_AS;
go
use Z_Demo1;
go
create table dbo.Demo (
ID int not null identity(1,1),
[鉫承工程] varchar(50) collate Chinese_PRC_CI_AS
);
insert dbo.Demo ([鉫承工程]) values ('承工程');
insert dbo.Demo ([鉫承工程]) values (N'承工程');
select * from dbo.Demo;
go
use Z_Demo2;
go
create table dbo.Demo (
ID int not null identity(1,1),
[鉫承工程] varchar(50) collate Chinese_PRC_CI_AS
);
insert dbo.Demo ([鉫承工程]) values ('承工程');
insert dbo.Demo ([鉫承工程]) values (N'承工程');
select * from dbo.Demo;
go
use master;
go
drop database if exists Z_Demo1;
drop database if exists Z_Demo2;
go
Which outputs the following result sets:
Z_Demo1:
ID  鉫承工程
1   ???
2   承工程
Z_Demo2:
ID  鉫承工程
1   承工程
2   承工程
In the first demo, the plain string literal was stored as ??? while the N-prefixed literal was stored correctly, because the column's collation, Chinese_PRC_CI_AS, differs from the database's default collation, SQL_Latin1_General_CP1_CI_AS.
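The same kind of loss can be reproduced outside SQL Server. The following Python sketch is only an analogy: it assumes the Latin1 collation corresponds to code page 1252 and the Chinese_PRC collation to code page 936 (GBK), and it uses Python's codecs rather than SQL Server's own conversion routines.

```python
text = "承工程"

# Code page 1252 has no Chinese characters, so each one becomes '?'.
via_latin = text.encode("cp1252", errors="replace").decode("cp1252")

# GBK can represent all three characters, so the round trip is lossless.
via_gbk = text.encode("gbk").decode("gbk")

print(via_latin)  # ???
print(via_gbk)    # 承工程
```

Once the characters have been replaced by question marks, no later conversion can bring them back, which is why the literal has to avoid the Latin code page in the first place.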
You can avoid this issue of differing collations by using National Language string constants, which are prefixed with an uppercase N character, or, from application code, by specifying the parameter type as NVARCHAR(...length...) instead of VARCHAR(...length...). Using the above tables from Node.js with the mssql module, for example, you would specify the data type as sql.NVarChar(50), e.g.: request.input("鉫承工程", sql.NVarChar(50), "承工程").

My database won't accept strings with letters

I'm using MariaDB and have the table set up with a VARCHAR(30) column. When I insert a string containing only digits, like "192", and then select it, I'm able to print out 192. When I insert a string like "a48", it just seems to be ignored. I've tried inserting a string consisting of a single letter, "a", and I still get nothing. In the MariaDB documentation for VARCHAR(M) I found this:
"If a unique index consists of a column where trailing pad characters are stripped or ignored, inserts into that column where values differ only by the number of trailing pad characters will result in a duplicate-key error"
I'm not sure if that could have anything to do with it? I am using letters just to make it easier to parse the data on my client side program. If I don't find a solution I will probably just pad it on the server after selecting.
Does anybody have any suggestions on what's going on here, or things I could try to find the problem?
Assuming that melon is the column to receive the string, then you should put single quotes around the $melon variable in the query, like this:
query("REPLACE INTO state (id, melon, image) VALUES (1, '$melon', $image)");
String values should be surrounded by single quotes; numeric values don't need to be.
Because the target column is a varchar(30) the value should always be surrounded by single quotes. MariaDB works out what you mean when you supply a numeric value, but it doesn't understand an alphanumeric value without single quotes. Both will work if you use single quotes, as shown.
To avoid SQL injection, it is better to use prepared statements, as described at https://www.w3schools.com/php/php_mysql_prepared_statements.asp.
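As a rough sketch of what parameter binding looks like, here is the same REPLACE statement issued through placeholders. It uses Python's built-in sqlite3 module purely as a stand-in for MariaDB, and the table definition and the save_state helper are assumptions made for the illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the MariaDB table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state (id INTEGER PRIMARY KEY, melon VARCHAR(30), image INTEGER)"
)


def save_state(conn, melon, image):
    # The driver binds the values, so no manual quoting is needed
    # and the string cannot change the structure of the statement.
    conn.execute(
        "REPLACE INTO state (id, melon, image) VALUES (1, ?, ?)",
        (melon, image),
    )


save_state(conn, "192", 7)
save_state(conn, "a48", 7)
stored = conn.execute("SELECT melon FROM state WHERE id = 1").fetchone()[0]
print(stored)  # a48
```

With bound parameters, "192" and "a48" are handled identically because both are passed as strings rather than spliced into the SQL text.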

FINDREP a short string with longer without overwriting next column

So I have a set of data such as this:
mxyzd1    0000015000
mxyzd2    0000016000
xyzmd5823 0000017000
I need to use dfsort to get this data:
123xyzd1  0000015000
123xyzd2  0000016000
xyz123d5820000017000
So what I mean is: replace all character 'm' by '123' without overwriting the second column, so truncate data before you get to the second column (which starts at pos 11).
So far I've been able to replace the data, but I can't prevent the rest of the record from getting shifted. This is my code so far:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
OUTREC FINDREP=(IN=C'm',OUT=C'123',STARTPOS=1,ENDPOS=10,
MAXLEN=20,OVERRUN=TRUNC,SHIFT=YES)
DATAEND
*
The problem you are facing is that all data on a record will be shifted to the right if the FINDREP change increases the length, and to the left if the FINDREP change decreases the length. Any change in the length of the changed data affects the entire record. You have discovered this yourself.
To put that another way, FINDREP does not know about fields (a better term than columns); it only knows about records. Even when it is looking at only a portion of the record, changes in length affect the rest of the record.
There is no way to write just a FINDREP to avoid this.
OPTION COPY
INREC IFTHEN=(WHEN=INIT,
OVERLAY=(21:1,10)),
IFTHEN=(WHEN=INIT,
FINDREP=(IN=C'm',
OUT=C'123',
STARTPOS=21)),
IFTHEN=(WHEN=INIT,
BUILD=(21,10,
11,10))
This will put the data from 1,10 into a temporary extension to the record. It will do the FINDREP on the temporary extension only. Then it will take the first 10 bytes of the extension and put them into position one for a length of 10.
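The three steps can be mimicked in a few lines of Python to check the expected output. This is only a simulation of the logic, not DFSORT itself; the function name findrep_field and the fixed 20-byte record layout are assumptions taken from the sample data.

```python
def findrep_field(record, find="m", repl="123", width=10):
    # Step 1 (OVERLAY=(21:1,10)): copy the first field to a scratch area.
    scratch = record[:width]
    # Step 2 (FINDREP with STARTPOS=21): replace only inside the scratch copy,
    # so any growth in length never touches the second field.
    scratch = scratch.replace(find, repl)
    # Step 3 (BUILD=(21,10,11,10)): keep 10 bytes of the scratch copy,
    # followed by the untouched second field from positions 11-20.
    return scratch[:width].ljust(width) + record[width:2 * width]


records = [
    "mxyzd1".ljust(10) + "0000015000",
    "mxyzd2".ljust(10) + "0000016000",
    "xyzmd5823".ljust(10) + "0000017000",
]
for rec in records:
    print(findrep_field(rec))
```

The last record shows the truncation: "xyz123d5823" does not fit in 10 bytes, so the trailing "3" is dropped and the second field stays at position 11.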
Just make one small change in your sort card - SHIFT=NO

Quick SQL question

Working on postgres SQL.
I have a table with a column that contains values of the following format:
Set1/Set2/Set3/...
Seti can be a set of values for each i. They are delimited by '/'.
I would like to show distinct entries of the form Set1/Set2; that is, I would like to trim or truncate the rest of the string in those entries.
That is, I want all distinct options for:
Set1/Set2
A regular expression would work great: I want a substring of the pattern: .*/.*/
to be displayed without the rest of it.
I got as far as:
select distinct column_name from table_name
but I have no idea how to make the trimming itself.
Tried looking in w3schools and other sites as well as searching SQL trim / SQL truncate in google but didn't find what I'm looking for.
Thanks in advance.
mu is too short's answer is fine if the lengths of the strings between the forward slashes are always consistent. Otherwise you'll want to use a regex with the substring function.
For example:
=> select substring('Set1/Set2/Set3/' from '^[^/]+/[^/]+');
substring
-----------
Set1/Set2
(1 row)
=> select substring('Set123/Set24/Set3/' from '^[^/]+/[^/]+');
substring
--------------
Set123/Set24
(1 row)
So your query on the table would become:
select distinct substring(column_name from '^[^/]+/[^/]+') from table_name;
The relevant docs are http://www.postgresql.org/docs/8.4/static/functions-string.html
and http://www.postgresql.org/docs/8.4/static/functions-matching.html.
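To experiment with the pattern outside the database, the same regular expression can be tried in Python. This is a small sketch only; the sample rows and the helper name first_two_sets are made up for the illustration.

```python
import re

# Same pattern as in the substring() call: two slash-separated parts
# anchored at the start of the string.
PREFIX = re.compile(r"^[^/]+/[^/]+")


def first_two_sets(value):
    match = PREFIX.match(value)
    return match.group(0) if match else None


rows = ["Set1/Set2/Set3/", "Set1/Set2/Other/", "Set123/Set24/Set3/"]
# A set plays the role of SELECT DISTINCT here.
distinct = sorted({first_two_sets(r) for r in rows})
print(distinct)  # ['Set1/Set2', 'Set123/Set24']
```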
Why do you store multiple values in a single record? The preferred solution would be multiple values in multiple records, your problem would not exist anymore.
Another option would be the usage of an array of values, using the TEXT[] array-datatype instead of TEXT. You can index an array field using the GIN-index.
SUBSTRING() (like mu_is_too_short showed you) can solve the current problem, using an array and the array functions is another option:
SELECT array_to_string(
(string_to_array('Set1/Set2/Set3/', '/'))[1:2], '/' );
This makes it rather flexible, there is no need for a fixed length of the values. The separator in the array functions will do the job. The [1:2] will pick the first 2 slices of the array, using [1:3] would pick slices 1 to 3. This makes it easy to change.
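The split, slice and join idea carries over directly to application code. As a minimal sketch in Python (the helper name first_n_sets is hypothetical):

```python
def first_n_sets(value, n=2, sep="/"):
    # Split on the separator, keep the first n parts, and join them again,
    # mirroring string_to_array, the [1:n] slice and array_to_string.
    return sep.join(value.split(sep)[:n])


print(first_n_sets("Set1/Set2/Set3/"))       # Set1/Set2
print(first_n_sets("Set1/Set2/Set3/", n=3))  # Set1/Set2/Set3
```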
If they really are that regular you could use substring; for example:
=> select substring('Set1/Set2/Set3/' from 1 for 9);
substring
-----------
Set1/Set2
(1 row)
There is also a version of substring that understands POSIX regular expressions if you need a little more flexibility.
The PostgreSQL online documentation is quite good BTW:
http://www.postgresql.org/docs/current/static/index.html
and it even has a usable index and sensible navigation.
If you want to use .*/.* then you'd want something like substring(s from '[^/]+/[^/]+'), for example:
=> select substring('where/is/pancakes/house?' from '[^/]+/[^/]+');
substring
-----------
where/is
(1 row)

PLSQL to modify VARCHAR2 column data

I am working on an app that involves evaluating modifications made to vehicles, and does some number crunching from figures stored in an Oracle 10g database. Unfortunately, I only have text data in the database, yet I need to work with numbers and not text. I would like to know if anyone could help me understand how to perform string operations on VARCHAR2 column data in an Oracle 10g database with PL/SQL:
For example: I need to take a VARCHAR2 column named TOP_SPEED in a table named CARS, parse the text data in its column to break it up into two new values, and insert these new values into two new NUMBER type columns in the CARS table, TOP_SPEED_KMH and TOP_SPEED_MPH.
The data in the TOP_SPEED column is as such: eg. "153 km/h (94.62 mph)"
I want to save the value of 153.00 into the TOP_SPEED_KMH column, and the 94.62 value into TOP_SPEED_MPH column.
I think what I have to do in a query/script is this:
select the text data in TOP_SPEED into a local text variable
modify the local text variable and save the new values into two number variables
write back the two number variables to the corresponding TOP_SPEED_KMH and TOP_SPEED_MPH columns
Could someone please confirm that I am on the right track? I would also really appreciate any example code if anyone has the time.
Cheers
I think it's a better idea to just have the top_speed_kmh column, and get rid of the mph one. Since the conversion factor between kilometers and miles never changes, you can simply multiply by 0.621371 to convert to miles per hour. So you can do the same update statement as N West suggested without the mph column:
UPDATE CARS SET TOP_SPEED_KMH = TO_NUMBER(SUBSTR(TOP_SPEED, 1, INSTR(UPPER(TOP_SPEED), 'KM/H') - 1));
And whenever you need the mph speed, just do
Select top_speed_kmh*0.621371 as top_speed_mph from cars;
For the parsing bit, you would probably use either REGEXP_SUBSTR or INSTR with SUBSTR
Then use TO_NUMBER to convert to number
You can either create a PL/SQL function for each parsing, returning the number value, and run an UPDATE query on the fields, or you could create a PL/SQL procedure with a cursor looping over all the data that is to be updated.
Here are some links for some of the built-ins:
http://psoug.org/reference/substr_instr.html
http://download.oracle.com/docs/cd/B14117_01/server.101/b10759/functions116.htm
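To show the parse-then-convert idea end to end, here is a short sketch in Python. It is not Oracle code: Python's re module and float stand in for REGEXP_SUBSTR and TO_NUMBER, the function name parse_top_speed is hypothetical, and the pattern assumes every value looks like "153 km/h (94.62 mph)".

```python
import re

# One group for the km/h figure, one for the mph figure inside parentheses.
SPEED = re.compile(
    r"^\s*([0-9]+(?:\.[0-9]+)?)\s*km/h\s*\(([0-9]+(?:\.[0-9]+)?)\s*mph\)",
    re.IGNORECASE,
)


def parse_top_speed(text):
    match = SPEED.match(text)
    if match is None:
        # Surface unexpected formats instead of silently storing bad numbers.
        raise ValueError(f"unexpected TOP_SPEED format: {text!r}")
    return float(match.group(1)), float(match.group(2))


kmh, mph = parse_top_speed("153 km/h (94.62 mph)")
print(kmh, mph)  # 153.0 94.62
```

Raising on a mismatch is a deliberate choice in this sketch: rows that do not follow the expected format are better reviewed by hand than converted to a wrong number.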
You probably don't even need to do this with PL/SQL at all.
As long as the data in the column is consistent, "99.99 km/h (99.99 mph)", you could do this directly with SQL:
UPDATE CARS
SET TOP_SPEED_KMH = TO_NUMBER(SUBSTR(TOP_SPEED, 1, INSTR(UPPER(TOP_SPEED), 'KM/H') - 1)),
TOP_SPEED_MPH = <similar substr/instr combination to pull the 99.99 mph out of code>;
Set-operations are typically much faster than procedural operations.
I am working on an app that involves evaluating modifications made to vehicles, and does some number crunching from figures stored in an Oracle 10g database. Unfortunately, I only have text data in the database, yet I need to work with numbers and not text
Sounds like you should have some number columns to store these parsed out values. Instead of always calling some parsing routine (be it regexp or substr or a custom function), pass through all the data in the table(s) ONCE and populate the new number fields. You should also modify the ETL process to populate the new number fields moving forward.
If you need numbers and can parse them out, do it once (hopefully in a staging area or off hours at least) and then have the numbers you want. Now you're free to do arithmetic and everything else you'd expect from real numbers ;)
with s as
(select '153 km/h (94.62 mph)' ts from dual)
select
ts,
to_number(substr(ts, 1, instr(ts, ' ') -1)) speed_km,
to_number(substr(regexp_substr(ts, '\([0-9.]+'), 2)) speed_mph
from s
Thanks everyone, it was nice to be able to use everyone's input to get the answer below:
UPDATE CARS
SET
CAR_TOP_SPEED_KPH =
to_number(substr(CAR_TOP_SPEED, 1, instr(UPPER(CAR_TOP_SPEED), ' KM/H') -1)),
CAR_TOP_SPEED_MPH =
to_number(substr(regexp_substr(CAR_TOP_SPEED, '\([0-9.]+'), 2));
