What's the difference between the text data type and the character varying (varchar) data types?
According to the documentation
If character varying is used without length specifier, the type accepts strings of any size. The latter is a PostgreSQL extension.
and
In addition, PostgreSQL provides the text type, which stores strings of any length. Although the type text is not in the SQL standard, several other SQL database management systems have it as well.
So what's the difference?
There is no difference, under the hood it's all varlena (variable length array).
Check this article from Depesz: http://www.depesz.com/index.php/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/
A couple of highlights:
To sum it all up:
char(n) – takes too much space when dealing with values shorter than n (pads them to n), and can lead to subtle errors because of adding trailing
spaces, plus it is problematic to change the limit
varchar(n) – it's problematic to change the limit in live environment (requires exclusive lock while altering table)
varchar – just like text
text – for me a winner – over (n) data types because it lacks their problems, and over varchar – because it has distinct name
The article does detailed testing to show that the performance of inserts and selects for all 4 data types are similar. It also takes a detailed look at alternate ways on constraining the length when needed. Function based constraints or domains provide the advantage of instant increase of the length constraint, and on the basis that decreasing a string length constraint is rare, depesz concludes that one of them is usually the best choice for a length limit.
As "Character Types" in the documentation points out, varchar(n), char(n), and text are all stored the same way. The only difference is extra cycles are needed to check the length, if one is given, and the extra space and time required if padding is needed for char(n).
However, when you only need to store a single character, there is a slight performance advantage to using the special type "char" (keep the double-quotes — they're part of the type name). You get faster access to the field, and there is no overhead to store the length.
I just made a table of 1,000,000 random "char" chosen from the lower-case alphabet. A query to get a frequency distribution (select count(*), field ... group by field) takes about 650 milliseconds, vs about 760 on the same data using a text field.
(this answer is a Wiki, you can edit - please correct and improve!)
UPDATING BENCHMARKS FOR 2016 (pg9.5+)
And using "pure SQL" benchmarks (without any external script)
use any string_generator with UTF8
main benchmarks:
2.1. INSERT
2.2. SELECT comparing and counting
CREATE FUNCTION string_generator(int DEFAULT 20,int DEFAULT 10) RETURNS text AS $f$
SELECT array_to_string( array_agg(
substring(md5(random()::text),1,$1)||chr( 9824 + (random()*10)::int )
), ' ' ) as s
FROM generate_series(1, $2) i(x);
$f$ LANGUAGE SQL IMMUTABLE;
Prepare specific test (examples)
DROP TABLE IF EXISTS test;
-- CREATE TABLE test ( f varchar(500));
-- CREATE TABLE test ( f text);
CREATE TABLE test ( f text CHECK(char_length(f)<=500) );
Perform a basic test:
INSERT INTO test
SELECT string_generator(20+(random()*(i%11))::int)
FROM generate_series(1, 99000) t(i);
And other tests,
CREATE INDEX q on test (f);
SELECT count(*) FROM (
SELECT substring(f,1,1) || f FROM test WHERE f<'a0' ORDER BY 1 LIMIT 80000
) t;
... And use EXPLAIN ANALYZE.
UPDATED AGAIN 2018 (pg10)
little edit to add 2018's results and reinforce recommendations.
Results in 2016 and 2018
My results, after average, in many machines and many tests: all the same (statistically less than standard deviation).
Recommendation
Use text datatype, avoid old varchar(x) because sometimes it is not a standard, e.g. in CREATE FUNCTION clauses varchar(x)≠varchar(y).
express limits (with same varchar performance!) by with CHECK clause in the CREATE TABLE e.g. CHECK(char_length(x)<=10). With a negligible loss of performance in INSERT/UPDATE you can also to control ranges and string structure e.g. CHECK(char_length(x)>5 AND char_length(x)<=20 AND x LIKE 'Hello%')
On PostgreSQL manual
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
I usually use text
References: http://www.postgresql.org/docs/current/static/datatype-character.html
In my opinion, varchar(n) has it's own advantages. Yes, they all use the same underlying type and all that. But, it should be pointed out that indexes in PostgreSQL has its size limit of 2712 bytes per row.
TL;DR:
If you use text type without a constraint and have indexes on these columns, it is very possible that you hit this limit for some of your columns and get error when you try to insert data but with using varchar(n), you can prevent it.
Some more details: The problem here is that PostgreSQL doesn't give any exceptions when creating indexes for text type or varchar(n) where n is greater than 2712. However, it will give error when a record with compressed size of greater than 2712 is tried to be inserted. It means that you can insert 100.000 character of string which is composed by repetitive characters easily because it will be compressed far below 2712 but you may not be able to insert some string with 4000 characters because the compressed size is greater than 2712 bytes. Using varchar(n) where n is not too much greater than 2712, you're safe from these errors.
text and varchar have different implicit type conversions. The biggest impact that I've noticed is handling of trailing spaces. For example ...
select ' '::char = ' '::varchar, ' '::char = ' '::text, ' '::varchar = ' '::text
returns true, false, true and not true, true, true as you might expect.
Somewhat OT: If you're using Rails, the standard formatting of webpages may be different. For data entry forms text boxes are scrollable, but character varying (Rails string) boxes are one-line. Show views are as long as needed.
A good explanation from http://www.sqlines.com/postgresql/datatypes/text:
The only difference between TEXT and VARCHAR(n) is that you can limit
the maximum length of a VARCHAR column, for example, VARCHAR(255) does
not allow inserting a string more than 255 characters long.
Both TEXT and VARCHAR have the upper limit at 1 Gb, and there is no
performance difference among them (according to the PostgreSQL
documentation).
The difference is between tradition and modern.
Traditionally you were required to specify the width of each table column. If you specify too much width, expensive storage space is wasted, but if you specify too little width, some data will not fit. Then you would resize the column, and had to change a lot of connected software, fix introduced bugs, which is all very cumbersome.
Modern systems allow for unlimited string storage with dynamic storage allocation, so the incidental large string would be stored just fine without much waste of storage of small data items.
While a lot of programming languages have adopted a data type of 'string' with unlimited size, like C#, javascript, java, etc, a database like Oracle did not.
Now that PostgreSQL supports 'text', a lot of programmers are still used to VARCHAR(N), and reason like: yes, text is the same as VARCHAR, except that with VARCHAR you MAY add a limit N, so VARCHAR is more flexible.
You might as well reason, why should we bother using VARCHAR without N, now that we can simplify our life with TEXT?
In my recent years with Oracle, I have used CHAR(N) or VARCHAR(N) on very few occasions. Because Oracle does (did?) not have an unlimited string type, I used for most string columns VARCHAR(2000), where 2000 was at some time the maximum for VARCHAR, and in all practical purposes not much different from 'infinite'.
Now that I am working with PostgreSQL, I see TEXT as real progress. No more emphasis on the VAR feature of the CHAR type. No more emphasis on let's use VARCHAR without N. Besides, typing TEXT saves 3 keystrokes compared to VARCHAR.
Younger colleagues would now grow up without even knowing that in the old days there were no unlimited strings. Just like that in most projects they don't have to know about assembly programming.
I wasted way too much time because of using varchar instead of text for PostgreSQL arrays.
PostgreSQL Array operators do not work with string columns. Refer these links for more details: (https://github.com/rails/rails/issues/13127) and (http://adamsanderson.github.io/railsconf_2013/?full#10).
If you only use TEXT type you can run into issues when using AWS Database Migration Service:
Large objects (LOBs) are used but target LOB columns are not nullable
Due to their unknown and sometimes large size, large objects (LOBs) require more processing
and resources than standard objects. To help with tuning migrations of systems that contain
LOBs, AWS DMS offers the following options
If you are only sticking to PostgreSQL for everything probably you're fine. But if you are going to interact with your db via ODBC or external tools like DMS you should consider using not using TEXT for everything.
character varying(n), varchar(n) - (Both the same). value will be truncated to n characters without raising an error.
character(n), char(n) - (Both the same). fixed-length and will pad with blanks till the end of the length.
text - Unlimited length.
Example:
Table test:
a character(7)
b varchar(7)
insert "ok " to a
insert "ok " to b
We get the results:
a | (a)char_length | b | (b)char_length
----------+----------------+-------+----------------
"ok "| 7 | "ok" | 2
Related
I have an input file in this format: (length 20, 10 chars and 10 numerics)
jname1 0000500006
bname1 0000100002
wname1 0000400007
yname1 0000000006
jname1 0000100001
mname1 0000500012
mname2 0000700013
In my jcl I have defined my sysin data as such:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,FD)
DATAEND
*
It works fine as long as I don't add the sum fields so I'm wondering if I'm using the wrong format for my numerics seeing as I know they start at field 11 and have a length of 10 the format is the only thing that could be wrong.
As you might have already realised the point of this JCL is to just list the values but grouped by the first letter of the name (so for the example data and JCL I have given it would group the numeric for mname1 and mname2 together but leave the other records untouched).
I'm kind of new at this so I was wonder what I need for the format if my numerics are like that in the input file.
If new to DFSORT, get hold of the DFSORT Getting Started guide for your version of DFSORT (http://www-01.ibm.com/support/docview.wss?uid=isg3T7000080).
This takes your through all the basic operations with many examples.
The DFSORT Application Programming Guide describes everything you need to know, in detail. Again with examples. Appendix C of that document contains all the data-types available (note, when you tried to use FD, FD is not valid data-type, so probably a typo). There are Tables throughout the document listing what data-types are available where, if there is a particular limit.
For advanced techniques, consult the DFSORT Smart Tricks publication here: http://www-01.ibm.com/support/docview.wss?uid=isg3T7000094
You need to understand a bit more the way data is stored on a Mainframe as well.
Decimals (which can be "packed-decimal" or "zoned-decimal") do not contain a decimal-point. The decimal-point is implied. In high-level languages you tell the compiler where the decimal-point is (in a fixed position) and the compiler does the alignments for you. In Assembler, you do everything yourself.
Decimals are 100% accurate, as there are machine-instructions which act directly on packed-decimal data giving packed-decimal results.
A field which actually contains a decimal-point, cannot be directly used in arithmetic.
An unsigned field is treated as positive when used in any arithmetic.
The SUM statement supports a limited number of numeric definitions, and you have chosen the correct one. It does not matter that your data is unsigned.
If the format of the output from SUM is not what you want, look at OPTION ZDPRINT (or NOZDPRINT).
If you want further formatting, you can use OUTREC or OUTFIL.
As an option to using SUM, you can use OUTFIL reporting functions (especially, although not limited to, if you want a report). You can use SECTIONS and TRAILER3 with TOT/TOTAL.
Something to watch for with SUM (which is not a problem with the reporting features) is if any given one (or more) of your SUMmed fields exceed the field size. To continue to use SUM if that happens, you need to extend the field in INREC and then get SUM to use the new, sufficient, size.
After some trial and error I finally found it, appearantly the format I needed to use was the ZD format (zoned decimal, signed), so my sysin becomes this:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,ZD)
DATAEND
*
even though my records don't contain any decimals and they are unsigned, I don't really get it so if someone knows why it's like that please go ahead and explain it to me.
For now the way I'm going to remember it is this: Z = symbol for real (meaning integers so no decimals)
As I am currently playing with huge number of strings (have a look at another question: VBA memory size of Arrays and Arraylist) I used a scripting dictionary just for the feature of the keyed access that it has.
Everything was looking fine except that it was some how slow in loading the strings and that it uses a lot of memory. For an example of 100,000 strings of 128 characters in length, the Task manager showed at the end of the sub approximately 295 MB and when setting Dictionary=Nothing a poor 12 MB was remaining in Excel. Even considering internal Unicode conversion of strings 128 * 2 * 100,000 gives 25.6 MB ! Can someone explain this big difference ?
Here is all the info I could find on the Scripting.Dictionary:
According to Eric Lippert, who wrote the Scripting.Dictionary, "the actual implementation of the generic dictionary is an extensible-hashing-with-chaining algorithm that re-hashes when the table gets too full." (It is clear from the context that he is referring to the Scripting.Dictionary) Wikipedia's article on Hash Tables is a pretty good introduction to the concepts involved. (Here is a search of Eric's blog for the Scripting.Dictionary, he occasionally mentions it)
Basically, you can think of a Hash Table as a large array in memory. Instead of storing your strings directly by an index, you must provide a key (usually a string). The key gets "hashed", that is, a consistent set of algorithmic steps is applied to the key to crunch it down into a number between 0 and current max index in the Hash Table. That number is used as the index to store your string into the hash table. Since the same set of steps is applied each time the key is hashed, it results in the same index each time, meaning if you are looking up a string by its key, there is no need to search through the array as your normally would.
The hash function (which is what converts a key to an index into the table) is designed to be as random as possible, but every once in a while two keys can crunch down to the same index - this is called a collision. This is handled by "chaining" the strings together in a linked list (or possibly a more searchable structure). So suppose you tried to look a string up in the Hash Table with a key. The key is hashed, and you get an index. Looking in the array at that index, it could be an empty slot if no string with that key was ever added, or it could be a linked list that contains one or more strings whose keys mapped to that index in the array.
The entire reason for going through the details above is to point out that a Hash Table must be larger than the number of things it will store to make it efficient (with some exceptions, see Perfect Hash Function). So much of the overhead you would see in a Hash Table are the empty parts of the array that have to be there to make the hash table efficient.
Additionally, resizing the Hash Table is an expensive operation because the all the existing strings have to be rehashed to new locations, so when the load factor of the Hash Table exceeds the predefined threshold and it gets resized, it might get doubled in size to avoid having to do so again soon.
The implementation of the structure that holds the chain of strings at each array position can also have a large impact on the overhead.
If I find anything else out, I'll add it here...
It's not uncommon to have a record like this:
TAddress = record
Address: string[50];
City : string[20];
State : string[2];
ZIP : string[5];
end;
Where it was nice to have hard-coded string sizes to ensure that the size of the string wouldn't exceed the database field size allotted for the data.
Given, however, that the ShortString type has been deprecated, what are Delphi developers doing to "solve" this problem? Declaring the record fields as string gets the job done, but doesn't protect the data from exceeding the proper length.
What is the best solution here?
If I had to keep data from exceeding the right length, I'd let the database code handle it as much as possible. Put size limits on the fields, and display data to the user in data-bound controls. A TDBEdit bound to a string field will enforce the length limit correctly. Set it up so the record gets populated directly from the dataset, and it will always have the right length.
Then all you need to worry about is data coming into the record from some outside source that is not part of your UI. For that, use the same process. Have the import code insert the data into the dataset, and let its length constraints do the validation for you. If it raises an exception, reject the import. If not, then you've got a valid dataset row that you can use to populate a record from.
The short string types in your question don't really protect the strings from exceeding the proper length. When you assign a longer value to these short strings, the value is silently truncated.
I'm not sure what database access method you are using but I rather imagine that it will do the same thing. Namely truncate any over-length strings to the maximum length. In which case there is nothing to do.
If your database access method throws an error when you give it an over long string then you would need to truncate before passing the value to the database.
If you have to truncate explicitly, then there are lots of places where you might choose to do so. My philosophy would be to truncate at the last possible moment. That's the point at which you are subject to the limit. Truncating anywhere else seems wrong. It means that a database limitation is spreading to parts of the code that are not obviously related to the database.
Of course, all this is based on the assumption that you want to carry on silently truncating. If you want to do provide user feedback in the event of truncation, then you will need to decide just where are the right points to action that feedback.
From my understanding, my answer should be "do not mix layers".
I suspect that the string length is specified at the database layer level (a column width), or at the business application layer (e.g. to validate a card number).
From the "pure Delphi code" point of view, you should not know that your string variable has a maximum length, unless you reach the persistence layer or even the business layer.
Using attributes could be an idea. But it may "pollute" the source code for the very same reason that it is mixing layers.
So what I recommend is to use a dedicated Data Modeling, in which you specify your data expectations. Then, at the Delphi var level, you just define a plain string. This is exactly how our mORMot framework implements data filtering and validation: at Model level, with some dedicated classes - convenient, extendable and clean.
If you're just porting from Delphi 7 to XE3, leave it be. Also, although "ShortString" may be deprecated, I'll eat my hat if they ever remove it completely, because there are a lot of bits of code that will never be able to be rebuilt without it. ShortString + Records is still the only practical way to specify a byte-oriented file-of-record data storage. Delphi will NEVER remove ShortString nor change its behaviour, it would be devastating to existing delphi code. So if you really must define records and limit their length, and you really don't want those records to support Unicode, then there is zero reason to stop using or stop writing ShortString code. That being said, I detest short-strings, and File-of-record, wish they would go away, and am glad they are marked deprecated.
That being said, I agree with mason and David entirely; I would say, Length checking, and validation are presentation/validation concerns, and Delphi's strong typing is NOT the right place or the right way to deal with them. If you need to put validation constraints on your classes, write helper classes that implement constraint-storage (EmployeeName is a string field and EmployeeName has the following length limit). In Edit controls for example, this is already a property. It seems to me that mapping DB Fields to visual fields, using the new Binding system would be much preferable to trying to express constraints statically in the code.
User input validation and storage are different and length limits should be set in your GUI controls not in your data structures.
You could for example use Array of UnicodeChar, if you wanted to have a Unicode wide but length limited string. You could even write your own LimitedString class using the new class helper methods in Delphi. But such approaches are not a maintainable and stable design.
If your SQL database has a field declared with VARCHAR(100) type, and you want to limit your user's input to 100 characters, you should do so at the GUI layer and forget about imposing truncation (data corruption, in fact) silently behind the scenes.
I had this problem - severely - upgrading from Delphi 6 to 2009 for what one program was/is doing it was imperative to be able to treat the old ASCII strings as individual ASCII characters.
The program outputs ASCII files (NO ANSI even) and has concepts such as over-punch on the last numeric digit to indicate negative. So the file format goes back a bit one could say!
After the first build in 2009 (10 year old code, well, you do don't you!) after sorting unit names etc there were literally hundreds of reported errors/ illegal assignments and data loss / conversion warnings...
No matter how good Delphi's back-room manipulation/magic with strings and chars I did not trust it enough. In the end to make sure everything was back as it was I re-declared them all as array of byte and then changed the code accordingly.
You haven't specified the delphi version, here's what works for me in delphi 2010:
Version1:
TTestRecordProp = record
private
FField20: string;
...
FFieldN: string
procedure SetField20(const Value: string);
public
property Field20: string read FField20 write SetField20;
...
property FieldN: string ...
end;
...
procedure TTestRecordProp.SetField20(const Value: string);
begin
if Length(Value) > 20 then
/// maybe raise an exception?
FField20 := Copy(FField20, 1, 20)
else
FField20 := Value;
end;
Version2:
TTestRecordEnsureLengths = record
Field20: string;
procedure EnsureLengths;
end;
...
procedure TTestRecordEnsureLengths.EnsureLengths;
begin
// for each string field, test it's length and truncate or raise exception
if Length(Field20) > 20 then
Field20 := Copy(Field20, 1, 20); // or raise exception...
end;
// You have to call .EnsureLength before push data to db...
Personally, I'd recommend replacing records with objects, then you can do more tricks.
I created a table as follows:
create table employers (
id integer primary key,
name string,
surname string,
description string
);
I noticed here that string is not between data types, should I replace it with something else? Is it alias for some other datatype? What is the maximum number of characters that can this string contain?
I used string and it does not make any problems. I'm only interested how is it possible, that I could use string. Is it some alias? What are the possibilities of this type?
thank you
You want to use TEXT which we use whatever it needs to in order to hold your data.
CREATE TABLE Employers (
id INTEGER PRIMARY KEY,
name TEXT,
surname TEXT,
description TEXT
);
See http://www.sqlite.org/datatype3.html and note:
SQLite uses a more general dynamic
type system. In SQLite, the datatype
of a value is associated with the
value itself, not with its container.
The dynamic type system of SQLite is
backwards compatible with the more
common static type systems of other
database engines in the sense that SQL
statement that work on statically
typed databases should work the same
way in SQLite. However, the dynamic
typing in SQLite allows it to do
things which are not possible in
traditional rigidly typed databases.
That noted:
SQLite does not impose any length
restrictions (other than the large
global SQLITE_MAX_LENGTH limit) on the
length of strings, BLOBs or numeric
values.
Finally:
Maximum length of a string or BLOB
The maximum number of bytes in a
string or BLOB in SQLite is defined by
the preprocessor macro
SQLITE_MAX_LENGTH. The default value
of this macro is 1 billion (1 thousand
million or 1,000,000,000). You can
raise or lower this value at
compile-time using a command-line
option like this:
-DSQLITE_MAX_LENGTH=123456789 The current implementation will only
support a string or BLOB length up to
231-1 or 2147483647. And some built-in
functions such as hex() might fail
well before that point. In
security-sensitive applications it is
best not to try to increase the
maximum string and blob length. In
fact, you might do well to lower the
maximum string and blob length to
something more in the range of a few
million if that is possible.
During part of SQLite's INSERT and
SELECT processing, the complete
content of each row in the database is
encoded as a single BLOB. So the
SQLITE_MAX_LENGTH parameter also
determines the maximum number of bytes
in a row.
The maximum string or BLOB length can
be lowered at run-time using the
sqlite3_limit(db,SQLITE_LIMIT_LENGTH,size)
interface.
Declare your field as "TEXT" and I believe you can specify a length restriction.
so you can declare name TEXT(255) as a 255 character TEXT value.
I would like to know the best way to sort a long list of strings wrt the time and space efficiency. I prefer time efficiency over space efficiency.
The strings can be numeric, alpha, alphanumeric etc. I am not interested in the sort behavior like alphanumeric sort v/s alphabetic sort just the sort itself.
Some ways below that I can think of.
Using code ex: .Net framework's Arrays.Sort() function. I think the way this works is that the hashcodes for the strings are calculated and the string is inserted at the proper position using a binary search.
Using the database (ex: MS-sql). I have not done this. I do not know how efficient this would be though.
Using a prefix tree data structure like a trie. Sorting requires traversing all the trieNodes of the trie tree using DFS (depth first search) - O(|V| + |E|) time. (Searching takes O(l) time where l is the length of the string to compare).
Any other ways or data structures?
You say that you have a database, and presumably the strings are stored in the database. Then you should get the database to do the work for you. It may be able to take advantage of an index and therefore not need to actually sort the list, but just read it from the index in sorted order.
If there is no index the database might still be able to help you. If you only fetch the first k rows for some small constant number k, for example 100. When you use ORDER BY with a LIMIT clause it allows SQL Server to use a special optimization called TOP N SORT which runs in linear time instead of O(n log(n)) time.
If your strings are not in the database already then you should use the features provided by .NET instead. I think it is unlikely you will be able to write custom code that will be much faster than the default sort.
I found this paper that uses trie data structure to efficiently sort large sets of strings. I have not looked into it in detail though.
Radix sort could also be good option if strings are not very long e.g. list of names
Let us suppose you have a large list of strings and that the length of the List is N.
Using a comparison based sorting algorithm like MergeSort, HeapSort or Quicksort will give you an
where n is the size of the list and d is the maximum length for all strings in the list.
We can try to use Radix sort in this case. Let b be the base and let d be the length of the maximum string then we can show that the running time using radix sort is .
Furthermore, if the strings are say the lower case English Alphabets the running time is
Source: MIT Opencourse Algorithms lecture by prof. Eric Demaine.