We have Cassandra 3.4.4. In the system.log we have a lot of messages like this:
INFO [CompactionExecutor:2] 2020-09-16 13:42:52,916 PerSSTableIndexWriter.java:211 - Rejecting value (size 1.938KiB, maximum 1.000KiB) for column payload (analyzed false) at /opt/cassandra/data/cfd/request-c8399780225a11eaac403f5be58182da/md-609-big SSTable.
What is the significance of these messages? These entries appear several hundred times per second, and the log rotates every minute.
The symptoms you described tell me that you have added a SASI index on the payload column of the cfd.request table that you didn't have before.
Those messages are logged because, as Cassandra goes through the data trying to index it, it finds that the payload column has too much data in it. The maximum term size for SASI is 1024 bytes, but in the example you posted the term size was 1.9KB.
If the column only contains ASCII characters, the maximum term length is 1024 characters, since each ASCII character is 1 byte. If the column has extended Unicode such as Chinese or Japanese characters, the maximum term length is shorter, since each of those takes up 3 bytes.
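As a quick way to see which values would get rejected, you can check the UTF-8 byte length of a value before relying on the index; this is just an illustrative sketch with made-up sample strings:

# Sketch: check whether a value fits under the SASI 1 KB term limit.
SASI_MAX_TERM_BYTES = 1024

def fits_sasi_term(value: str) -> bool:
    """Return True if the UTF-8 encoding of the value is within the limit."""
    return len(value.encode("utf-8")) <= SASI_MAX_TERM_BYTES

print(fits_sasi_term("a" * 1024))   # True  - 1024 ASCII chars == 1024 bytes
print(fits_sasi_term("漢" * 400))   # False - 400 CJK chars == 1200 bytes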
You don't have a SASI analyzer configured on the index (analyzed false), so the whole column value is treated as a single term. If you use the standard SASI analyzer, the column value will get tokenised into multiple shorter terms, and you won't see those indexing failures logged.
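For reference, here is a rough sketch (run via the Python driver) of what an analyzed SASI index definition might look like; the index name, contact point, and option values are my assumptions rather than recommended settings:

from cassandra.cluster import Cluster

# Sketch only: index name, contact point and options are assumed values.
session = Cluster(["127.0.0.1"]).connect("cfd")
session.execute("""
    CREATE CUSTOM INDEX payload_sasi_idx ON request (payload)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = {
        'mode': 'CONTAINS',
        'analyzed': 'true',
        'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer'
    }
""")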
If you're interested in the detailed fix steps, see https://community.datastax.com/questions/8370/. Cheers!
Related
I am trying to write a pandas data frame to Redshift. I wrote this code:
df.to_sql(sheetname, con=conn, if_exists='replace', chunksize=250, index=False)
conn.autocommit = True
I am getting the error below:
DataError: (psycopg2.errors.StringDataRightTruncation) value too long for type character varying(256)
I have 300+ columns and 1080 rows.
The error message looks to be from Redshift and indicates that one of the values you are attempting to insert into the table is too large for the definition of that column.
"value too long for type character varying(256)"
The column is defined as VARCHAR(256) but the value being inserted is larger than this. You should be able to inspect the length of the strings you are inserting (in your Python code) to find the offending value(s). Alternatively, you can look at the statement history in Redshift to find the offending command.
One thing to note is that Redshift uses UTF-8 to encode data, and some characters need a multi-byte encoding. The defined length of the column is in bytes, not characters, so a string of 250 characters can take more than 256 bytes to represent if it contains more than a handful of multi-byte characters. If you are on the hairy edge of not fitting in the defined column length, you may want to check your string lengths in bytes after UTF-8 encoding.
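For example, a sketch like this would flag the offending cells before the to_sql() call (the 256-byte limit mirrors the error above; scanning every column is just for illustration):

import pandas as pd

# Sketch: list the string cells whose UTF-8 byte length exceeds VARCHAR(256).
LIMIT_BYTES = 256

def find_too_long(df: pd.DataFrame, limit: int = LIMIT_BYTES):
    offenders = []
    for col in df.columns:
        for idx, val in df[col].items():
            if isinstance(val, str) and len(val.encode("utf-8")) > limit:
                offenders.append((col, idx, len(val.encode("utf-8"))))
    return offenders

# e.g. print(find_too_long(df)) before calling df.to_sql(...)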
I have a question regarding STRING field definitions.
Am I better off fully qualifying my STRING fields or allowing them to be variable length?
For example I am working with a data file which contains multiple string data elements which can be up to 1000 characters in length.
When I define the ECL fields as STRING1000 the strings are padded and difficult to view in ECL Watch.
If I define the ECL fields simply as STRING, the string fields are adjusted to the length of the field value and much easier to read in ECL Watch.
With regard to my question, does either option affect the size of my dataset in memory or on disk?
What is the best practice I should follow?
The standard answer to this question is:
IF you know the string is always going to contain n characters (like a US state code or zip code field), OR the string will always contain 1 to n characters where n is a small number and the average length of the actual data approaches the maximum (like most street address fields), THEN you should define that field as a STRINGn. ELSE, IF n is a large number and the average length of the data is small compared to the maximum, THEN a variable-length STRING would be best.
Both options affect the storage and memory size:
Fixed-length fields are always stored at their defined length.
Variable-length STRING fields are stored with a leading 4-byte integer indicating the actual number of characters in that instance (like a Pascal string).
Therefore, if you define a string field that always contains 2 characters as a STRING2, it occupies two bytes of storage; define it as a STRING and it will occupy six.
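To make the trade-off concrete, here is a quick back-of-the-envelope sketch (in Python rather than ECL, with made-up value lengths) comparing the two layouts for a STRING1000 field:

# Sketch: compare storage of STRING1000 (fixed, padded) vs STRING
# (4-byte length prefix + actual data) for hypothetical value lengths.
MAX_LEN = 1000
value_lengths = [12, 40, 7, 980, 25, 3, 160]   # made-up actual lengths

fixed_cost = len(value_lengths) * MAX_LEN           # every value padded to 1000 bytes
variable_cost = sum(4 + n for n in value_lengths)   # prefix + actual characters

print(f"STRING1000: {fixed_cost} bytes")    # 7000
print(f"STRING    : {variable_cost} bytes") # 1255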
When performing an INSERT, Redshift does not allow you to insert a string value that is longer/wider than the target field in the table. Observe:
CREATE TEMPORARY TABLE test (col VARCHAR(5));
-- result: 'Table test created'
INSERT INTO test VALUES('abcdefghijkl');
-- result: '[Amazon](500310) Invalid operation: value too long for type character varying(5);'
One workaround for this is to cast the value:
INSERT INTO test VALUES('abcdefghijkl'::VARCHAR(5));
-- result: 'INSERT INTO test successful, 1 row affected'
The annoying part is that all of my code will now need these cast statements on every INSERT for each VARCHAR field, or the application code will have to truncate the string before constructing the query; either way, the column's width specification leaks into the application code, which is annoying.
Is there any better way of doing this with Redshift? It would be great if there were some option to have the server just truncate the string and proceed (perhaps raising a warning), the way MySQL does.
One thing I could do is just declare these particular fields as a very large VARCHAR, perhaps even 65535 (the maximum).
create table analytics.testShort (a varchar(3));
create table analytics.testLong (a varchar(4096));
create table analytics.testSuperLong (a varchar(65535));
insert into analytics.testShort values('abc');
insert into analytics.testLong values('abc');
insert into analytics.testSuperLong values('abc');
-- Redshift reports the size for each table is the same, 4 MB
The one disadvantage of this approach I have found is that it will cause bad performance if this column is used in a group by/join/etc:
https://discourse.looker.com/t/troubleshooting-redshift-performance-extensive-guide/326
(search for VARCHAR)
I am wondering, though, whether there is any harm otherwise if you plan to never use this field in a GROUP BY, JOIN, or the like.
Some things to note in my scenario: yes, I really don't care about the extra characters that may be lost with truncation, and no, I don't have a way to enforce the length of the source text. I am capturing messages and URLs from external sources which generally fall into a certain range of lengths, but sometimes there are longer ones. It doesn't matter in our application if they get truncated in storage.
The only way to automatically truncate the strings to match the column width is to use the COPY command with the TRUNCATECOLUMNS option:
Truncates data in columns to the appropriate number of characters so
that it fits the column specification. Applies only to columns with a
VARCHAR or CHAR data type, and rows 4 MB or less in size.
Otherwise, you will have to take care of the length of your strings using one of these two methods:
Explicitly CAST your values to the VARCHAR you want:
INSERT INTO test VALUES(CAST('abcdefghijkl' AS VARCHAR(5)));
Use the LEFT and RIGHT string functions to truncate your strings:
INSERT INTO test VALUES(LEFT('abcdefghijkl', 5));
Note: CAST should be your first option because it handles multi-byte characters properly. LEFT truncates based on the number of characters, not bytes, and if you have multi-byte characters in your string, you might end up exceeding the limit of your column.
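If you would rather truncate in the application code before building the query, here is a byte-aware sketch in Python, so a multi-byte character is never split and the byte limit of the column is respected (the 5-byte limit mirrors the VARCHAR(5) example above):

# Sketch: truncate a string to at most `limit` bytes of UTF-8 without
# splitting a multi-byte character in the middle.
def truncate_utf8(value: str, limit: int) -> str:
    encoded = value.encode("utf-8")[:limit]
    # errors="ignore" drops any trailing partial character
    return encoded.decode("utf-8", errors="ignore")

print(truncate_utf8("abcdefghijkl", 5))                        # abcde
print(len(truncate_utf8("héllo wörld", 5).encode("utf-8")))    # 5 (never more)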
Let's first quote:
Combined size of all of the properties in an entity cannot exceed 1MB.
(for a row/entity) from MSDN.
My question is: since everything is transferred as XML data, what is that 1MB measured in? 1MB of ASCII chars, 1MB of UTF-8 chars, or something else?
Sample:
Row1: PartitionKey="A", RowKey="A", Data="A"
Row2: PartitionKey="A", RowKey="A", Data="A" (this is a UTF-8 Unicode A)
Are Row1 and Row2 the same size (in length), or is Row2.Length = Row1.Length + 1?
Single columns such as "Data" in your example are limited to 64 KB of binary data, and single rows are limited to 1 MB of data. Strings are encoded into binary in the UTF-8 format, so the limit is whatever the byte size ends up being for your string.
If you want a column to store more than 64 KB of data, you can use a technique such as the FAT entity provided by Lokad (https://github.com/Lokad/lokad-cloud-storage/blob/master/Source/Lokad.Cloud.Storage/Azure/FatEntity.cs). The technique is pretty simple: you encode your string to binary and then split the binary across multiple columns. When you want to read the string from the table, you just re-join the columns and convert the binary back to a string.
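A minimal Python sketch of that split/re-join idea (not the actual Lokad FatEntity API, which is C#); the 64 KB chunk size and the P0, P1, ... property names are assumptions for illustration:

# Sketch: split a large string across several 64 KB properties and rejoin it.
CHUNK_BYTES = 64 * 1024   # per-property limit for binary data

def split_into_properties(value: str) -> dict:
    data = value.encode("utf-8")
    return {f"P{i}": data[off:off + CHUNK_BYTES]
            for i, off in enumerate(range(0, len(data), CHUNK_BYTES))}

def join_properties(props: dict) -> str:
    ordered = (props[k] for k in sorted(props, key=lambda k: int(k[1:])))
    return b"".join(ordered).decode("utf-8")

props = split_into_properties("x" * 200_000)   # ~200 KB -> 4 properties
assert join_properties(props) == "x" * 200_000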
Azure table row size calculation is quite involved and includes both the size of the property name and its value plus some overhead.
http://blogs.msdn.com/b/avkashchauhan/archive/2011/11/30/how-the-size-of-an-entity-is-caclulated-in-windows-azure-table-storage.aspx
Edit: removed an earlier statement that said the size calculation was slightly inaccurate; it is quite accurate.
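As a rough illustration of the kind of calculation that post describes (the constants here are my own approximations, not authoritative values: a few bytes of fixed overhead, 2 bytes per key character, and for each property an 8-byte overhead plus 2 bytes per name character plus the value size, with strings counted at 2 bytes per character plus 4 bytes):

# Rough sketch of an entity-size estimate; constants are approximations
# based on my reading of the linked post, not authoritative values.
def estimate_entity_bytes(partition_key: str, row_key: str, props: dict) -> int:
    size = 4 + 2 * len(partition_key + row_key)   # fixed overhead + keys
    for name, value in props.items():
        size += 8 + 2 * len(name)                 # per-property overhead + name
        if isinstance(value, str):
            size += 2 * len(value) + 4            # strings: 2 bytes/char + 4 bytes
        elif isinstance(value, bytes):
            size += len(value)
        elif isinstance(value, int):
            size += 8
    return size

print(estimate_entity_bytes("A", "A", {"Data": "A"}))  # tiny either way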
Anybody know what the limit is for comments on an Excel cell (2003)? I'm programmatically filling this in and want to make sure that I don't exceed the limit.
This page lists all the Excel specifications and limits.
Not sure about comments, but it seems cell data is limited to 32,767 characters. Also not sure of the character encoding or whether that matters.
I looked up the BIFF specification for NOTE records (which are actually cell comments) and there is no limit per se, only 2048 per NOTE record; but you can have as many of these as you like, since past the first one they are marked as continuation records. With this in mind, it seems limitless.
However, to be safe, I'm cutting mine at 15,000 (as we should never need more than 1,000 for what we are doing).
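For what it's worth, here is a sketch of that capping approach in Python using openpyxl (which writes .xlsx rather than the 2003 .xls format in the question, so it only illustrates the idea); the 15,000-character cap mirrors the one above:

from openpyxl import Workbook
from openpyxl.comments import Comment

# Sketch: cap comment text before attaching it, mirroring the 15,000-character
# cut-off mentioned above.
MAX_COMMENT_CHARS = 15_000

def add_capped_comment(cell, text, author="generator"):
    cell.comment = Comment(text[:MAX_COMMENT_CHARS], author)

wb = Workbook()
ws = wb.active
add_capped_comment(ws["A1"], "some very long generated text " * 1000)
wb.save("comments.xlsx")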
I just gave it a try by inserting a large comment programmatically, and I ended up with 22,877 chars. However, I'm not sure whether character encoding matters.
According to "Excel specifications and limits", the max number of chars is 32,767 (only 1,024 display in a cell).
I suspect that's characters on the assumption that no character ever takes more than 2 bytes...