About the Azure Table Storage 1 MB row limit: how does it count UTF-8 characters?

Let's first quote MSDN:
Combined size of all of the properties in an entity cannot exceed 1MB.
(that is, per row/entity)
My question is: since everything is sent as XML data, what is that 1 MB measured in? 1 MB of ASCII characters, 1 MB of UTF-8 characters, or something else?
Sample:
Row1: PartitionKey="A", RowKey="A", Data="A"
Row2: PartitionKey="A", RowKey="A", Data="A" (this "A" being a multi-byte UTF-8 character)
Are Row1 and Row2 the same size (in length), or is Row2.Length = Row1.Length + 1?

Single columns such as "Data" in your example are limited to 64 KB of binary data, and single rows are limited to 1 MB of data. Strings are encoded into binary in UTF-8 format, so the limit is whatever the byte size of your string turns out to be. If you want a column to store more than 64 KB of data, you can use a technique such as the FAT Entity provided by Lokad (https://github.com/Lokad/lokad-cloud-storage/blob/master/Source/Lokad.Cloud.Storage/Azure/FatEntity.cs). The technique is pretty simple: you encode your string to binary, then split the binary across multiple columns. When you want to read the string back from the table, you re-join the columns and convert the binary back to a string.
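For illustration, a minimal Python sketch of that split/re-join idea (the names here are illustrative, not the Lokad FatEntity API):

# Encode to UTF-8 bytes, split into <=64 KB chunks (one per column),
# re-join and decode on read. Splitting may cut a multi-byte character
# in half, but that's fine because we re-join the raw bytes before decoding.
CHUNK = 64 * 1024  # per-column limit in bytes

def split_value(text):
    data = text.encode("utf-8")
    return {f"P{i}": data[off:off + CHUNK]
            for i, off in enumerate(range(0, len(data), CHUNK))}

def join_value(columns):
    ordered = sorted(columns.items(), key=lambda kv: int(kv[0][1:]))
    return b"".join(chunk for _, chunk in ordered).decode("utf-8")

big = "x" * 200_000
assert join_value(split_value(big)) == big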

Azure table row size calculation is quite involved; it includes the size of each property name and its value, plus some overhead.
http://blogs.msdn.com/b/avkashchauhan/archive/2011/11/30/how-the-size-of-an-entity-is-caclulated-in-windows-azure-table-storage.aspx
Edit: removed an earlier statement claiming the size calculation was slightly inaccurate. It is quite accurate.
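For a rough feel, here is a back-of-the-envelope estimator in Python following the formula described in the linked post (string properties only; names and string values counted at 2 bytes per character; an approximation, not official accounting):

# Per the linked post: 4 bytes of overhead, plus 2 bytes per character of
# PartitionKey + RowKey, plus, for each property, 8 bytes of overhead +
# 2 bytes per character of the property name + the value size
# (for strings: 4 bytes + 2 bytes per character).
def estimate_entity_size(partition_key, row_key, string_props):
    size = 4 + 2 * len(partition_key + row_key)
    for name, value in string_props.items():
        size += 8 + 2 * len(name) + 4 + 2 * len(value)
    return size  # bytes; must stay under 1 MB per entity

print(estimate_entity_size("A", "A", {"Data": "A"}))  # 30 bytes for this tiny entity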

Related

What is a QR code's single-line symbol capacity?

There is a QR code with error-correction level M and size 41x41. I found that its capacity is 154 alphanumeric characters. How is that capacity distributed between the lines? What is the maximum string size for one row of the code? I didn't find exact info on this.
There are several sentences in the data string to be decoded, and they have different lengths. Some of them are large.

DataError: (psycopg2.errors.StringDataRightTruncation) value too long for type character varying(256)

I am trying to write a Python data frame to Redshift. I wrote this code:
df.to_sql(sheetname, con=conn, if_exists='replace', chunksize=250, index=False)
conn.autocommit = True
I am getting the error below:
DataError: (psycopg2.errors.StringDataRightTruncation) value too long for type character varying(256)
I have 300+ columns and 1080 rows.
The error message looks to be from Redshift and indicates that one of the values you are attempting to insert into the table is too large for the definition of that column.
"value too long for type character varying(256)"
The column is defined as VARCHAR(256), but the value being inserted is larger than that. You should be able to inspect the length of the strings you are inserting (in your Python code) to find the offending value(s). Alternatively, you can look at the statement history in Redshift to find the offending command.
One thing to note is that Redshift uses UTF-8 to encode data, and some characters need a multi-byte encoding, taking more than one byte to represent. The defined length of the column is in bytes, not characters, so a string of 250 characters can take more than 256 bytes to represent if it contains more than a handful of multi-byte characters. If you are on the hairy edge of not fitting in the defined column length, you may want to check your string lengths in bytes using a UTF-8 encoding.
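For example, a quick way to hunt for the offending values before loading (a sketch; the frame below is a stand-in for the df in the question):

import pandas as pd

df = pd.DataFrame({"notes": ["ok", "é" * 200]})  # 200 chars but 400 UTF-8 bytes

LIMIT = 256  # VARCHAR(256) is 256 *bytes* in Redshift
for col in df.select_dtypes(include="object").columns:
    # measure each string cell's UTF-8 byte length, not its character count
    byte_len = df[col].astype(str).str.encode("utf-8").str.len()
    if (byte_len > LIMIT).any():
        print(col, "max bytes:", byte_len.max(),
              "rows:", df.index[byte_len > LIMIT].tolist())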

Fixed vs variable length STRING field definitions in ECL

I have a question regarding STRING field definitions.
Am I better off fully specifying the length of my STRING fields, or allowing them to be variable length?
For example, I am working with a data file that contains multiple string data elements, each of which can be up to 1000 characters in length.
When I define the ECL fields as STRING1000, the strings are padded and difficult to view in ECL Watch.
If I define the ECL fields simply as STRING, the string fields are adjusted to the length of the field value and are much easier to read in ECL Watch.
Back to my question: does either option affect the size of my dataset in memory or on disk?
What is the best practice I should follow?
The standard answer to this question is:
IF you know the string will always contain exactly n characters (like a US state code or zip code field), OR the string will always contain 1 to n characters where n is a small number and the average length of the actual data approaches the maximum (like most street address fields), THEN you should define that field as a STRINGn. ELSE, if n is a large number and the average length of the data is small compared to the maximum, THEN a variable-length STRING would be best.
Both options affect the storage and memory size:
Fixed-length fields are always stored at their defined length.
Variable-length STRING fields are stored with a leading 4-byte integer indicating the actual number of characters following that instance (like a Pascal string).
Therefore, if you define a string field that always contains 2 characters as a STRING2, it occupies two bytes of storage; define it as a STRING and it will occupy six.
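To make the arithmetic concrete, a quick sketch in Python (it just restates the storage rules above, not measurements from an actual ECL dataset):

# STRINGn always costs n bytes; a variable-length STRING costs a
# 4-byte length prefix plus the data itself.
def stringn_size(n):
    return n

def string_size(value):
    return 4 + len(value)

print(stringn_size(2), string_size("CA"))          # 2 vs 6: fixed wins
print(stringn_size(1000), string_size("x" * 40))   # 1000 vs 44: variable wins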

PerSSTableIndexWriter.java:211 - Rejecting value for column payload (analyzed false)

We have Cassandra 3.4.4.
In the system.log we have a lot of messages like this:
INFO [CompactionExecutor:2] 2020-09-16 13:42:52,916 PerSSTableIndexWriter.java:211 - Rejecting value (size 1.938KiB, maximum 1.000KiB) for column payload (analyzed false) at /opt/cassandra/data/cfd/request-c8399780225a11eaac403f5be58182da/md-609-big SSTable.
What is the significance of these messages?
These entries appear several hundred times per second, and the log rotates every minute.
The symptoms you described tell me that you have added a SASI index on the payload column of the cfd.request table where one didn't exist before.
Those messages appear because, as Cassandra goes through the data trying to index it, it finds that the payload column has too much data in it. The maximum term size for SASI is 1024 bytes, but in the example you posted the term size was 1.9 KiB.
If the column only contains ASCII characters, the maximum term length is 1024 characters, since each ASCII character is 1 byte. If the column contains extended Unicode such as Chinese or Japanese characters, the maximum term length is shorter, since each of those takes up 3 bytes.
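A quick way to sanity-check those byte counts (a sketch in Python for brevity):

ascii_term = "a" * 1024
cjk_term = "\u65e5" * 1024  # a Japanese character, 3 bytes each in UTF-8

# UTF-8 byte length is what counts against SASI's 1024-byte term limit
print(len(ascii_term.encode("utf-8")))  # 1024 -> just fits
print(len(cjk_term.encode("utf-8")))    # 3072 -> rejected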
You don't have a SASI analyzer configured on the index (analyzed false), so the whole column value is taken as a single term. If you use the standard SASI analyzer, the column value will get tokenised, breaking it up into multiple shorter terms, and you won't see those indexing failures logged.
If you're interested in the detailed fix steps, see https://community.datastax.com/questions/8370/. Cheers!

Convert a string into a fixed-length number and convert it back

I have more than 100 cpp files. I need to assign a unique ID to each of them, and I also need to know which file it is based on its ID. I found that the maximum file name length is 64 characters, and the ID can be at most 8 bytes long. Is there any algorithm that can assign a unique ID to each source file in VS2013 in C++ and also let the user know which file it is based on the ID?
Just store a mapping between filename and an integer.
-----Yes, this way is very simple, but every time people create new source files the mapping needs to be re-coded, so I won't use this approach.
HERE IS THE ORIGINAL QUESTION SO THAT THE COMMENTS BELOW MAKE SENSE
Now I have a bunch of strings, like "AAA" or "ABBCCHH". The maximum string contains 64 characters. I need an algorithm that can convert a string into a number (not necessarily an integer; double/float is also acceptable), but the length of the number must be fixed. For example, if "A" converts to 12312, 5 digits, then "ABBHGGH" should also have 5 digits after conversion. The numbers must also be convertible back to the original strings. Is there any algorithm that can do that? The converted number cannot exceed 8 bytes; that's why I cannot just use ASCII or similarly simple encodings. I don't know which algorithm can do that.
To generate unique IDs for an arbitrary set of filenames (the actual question here), you could use a cryptographic hash (SHA-1, -256, -384, -512). This results in a fixed-length hexadecimal output that is unique for all practical purposes. If you can't allow the characters a-f in the output, you can convert the hexadecimal value to decimal.
This process is not reversible, but you can maintain a map (lookup table) of the input values to the IDs.
If you want a simpler solution, just hexadecimal-encode the filenames. This is reversible. (You can add the hex-to-decimal conversion here as well, if necessary.)
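As a sketch of that hash-plus-lookup-table approach (Python for brevity; the same idea ports directly to C++):

import hashlib

# 8-byte ID from a truncated SHA-256. Truncation is not mathematically
# collision-free, but collisions are astronomically unlikely for ~100 names.
def file_id(name):
    return int.from_bytes(hashlib.sha256(name.encode("utf-8")).digest()[:8], "big")

# Hashing is one-way, so keep a reverse map to recover the filename from the ID.
id_to_name = {file_id(n): n for n in ["main.cpp", "parser.cpp", "lexer.cpp"]}
print(id_to_name)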
