DataError: (psycopg2.errors.StringDataRightTruncation) value too long for type character varying(256)

I am trying to write a pandas data frame to Redshift. I wrote this code:
df.to_sql(sheetname, con=conn, if_exists='replace', chunksize=250, index=False)
conn.autocommit = True
I am getting the error below:
DataError: (psycopg2.errors.StringDataRightTruncation) value too long for type character varying(256)
I have 300+ columns and 1080 rows.

The error message looks to be from Redshift and indicates that one of the values you are attempting to insert into the table is too large for the definition of that column.
"value too long for type character varying(256)"
The column is defined as VARCHAR(256), but the value being inserted is larger than that. You should be able to inspect the length of the strings you are inserting (in your Python code) to find the offending value(s). Alternatively, you can look at the statement history in Redshift to find the offending command.
One thing to note is that Redshift encodes data as UTF-8, which needs more than one byte to represent some characters. The defined length of the column is in bytes, not characters, so a string of 250 characters can take more than 256 bytes to store if it contains more than a handful of multi-byte characters. If you are on the hairy edge of fitting in the defined column length, check your string lengths in bytes after UTF-8 encoding, not in characters.
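For example, here is a minimal sketch for locating the offending values in pandas before calling to_sql (df is the DataFrame from the question; 256 matches the VARCHAR(256) in the error message, and the UTF-8 byte length is what Redshift actually checks):
# Print each string column and the row labels whose UTF-8 byte length exceeds 256
for col in df.select_dtypes(include="object").columns:
    byte_len = df[col].astype(str).str.encode("utf-8").str.len()
    if (byte_len > 256).any():
        print(col, df.index[byte_len > 256].tolist())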

Related

reading data from excel into SAS and converting character to numeric

I am reading data from an Excel file in SAS and inserting the values into an Oracle table. The Oracle table has a numeric column. If the Excel file has numbers, it works fine, but if the column is left blank in the Excel file, it is read as a character value and the insertion into Oracle fails.
Is it possible to convert the column to numeric if it is blank, but read it as-is if it has a number?
Thanks!
Let's assume that SAS is reading this column as a character and you cannot convert it directly within the file. This happens sometimes: maybe you don't have authorization to do it, or maybe it's just not working like you're expecting. SAS can go from character to numeric and numeric to character with two functions: input() and put().
Going from Character to Numeric: input()
input() is for changing character data into numbers.
This is great for reading in dates, currency, comma-separated numbers, etc. If you need your data as a number, use this function. Its syntax is:
num_var = input(char_var, informat.);
In your case, let's say we always expect numbers to be here even if it's missing. We'll use the 8. informat for our variable of interest, my_var.
data want;
    set have_excel(rename=(my_var = my_var_char));
    my_var = input(my_var_char, 8.);
    drop my_var_char;
run;
Note that we need to create a new variable. We rename the variable of interest to something else, then create a new version of it as a number. In SAS, as in many other languages and database systems, once a variable is declared as character or numeric, its type cannot change.
Going from Numeric to Character: put()
put() converts a number to a character value, or a character value to another character value.
This is great for converting SAS dates to characters, adding custom formats, converting a character to another character, etc. The syntax is:
char_var = put(num_var, format.);
OR:
char_var = put(char_var, format.);
Note the previous use case: with put(), you can convert characters to other characters. This is very handy for standardizing values or even merging data using a format.
For example: let's convert a number to a comma-separated character number.
data want;
    char_number = put(1234, comma.);
run;
Output:
char_number
1,234
The case statement below worked for me.
case
    when missing(input(cats(COLUMN_VALUE), best8.)) THEN input(cats(COLUMN_VALUE), best8.)
    when not missing(input(cats(COLUMN_VALUE), best8.)) THEN input(cats(COLUMN_VALUE), best8.)
end as COLUMN_VALUE

What is the difference between these two tab-delimited .txt files that is causing .split("\t") to properly separate values from one but not the other?

I have two Japanese word frequency reports that were compiled from different sources. Each line contains a word and its number of occurrences, delimited by tabs. I also have a Python script that is supposed to split each line into those two values using .split("\t"). The latter value is then converted into an integer, which is where the error comes from:
ValueError: invalid literal for int() with base 10: '\ufeff29785713'
This is only occurring for data from the second file.
Upon testing to see if converting the number to a float would work (or change the error), the result was this:
ValueError: could not convert string to float: '\ufeff29785713'
Is this because the tabs or numerals in the second file are perhaps not technically the same characters, so the lines do not delimit properly, leaving unwanted characters in the latter value (or perhaps not splitting at all)? Both files are UTF-8 encoded.
Shorter version of first file (working)
Shorter version of second file
Honestly, not a Python dev at all, but given that your second array element contains a rogue character (U+FEFF, a byte order mark), you could try removing it after you split and before you convert to a number:
x[1] = x[1].replace('\ufeff', '')
x being the name of the list you split your line into. The replace operation will have no effect on the first file, because U+FEFF is not present there.
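Alternatively, a minimal sketch of the whole parsing loop, assuming the stray character is a byte order mark at the start of the file (the filename is hypothetical; word and count order is taken from the question). Opening with the utf-8-sig codec strips a leading BOM during decoding, so no replace is needed:
counts = {}
with open("second_file.txt", encoding="utf-8-sig") as f:  # utf-8-sig removes a leading U+FEFF
    for line in f:
        word, count = line.rstrip("\n").split("\t")
        counts[word] = int(count)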

Convert string into fixed length numbers and convert it back

I have more than 100 cpp files. I need to assign a unique ID to each of them, and I also need to know which file it is based on its ID. The maximum file name length is 64 characters, and the ID can be at most 8 bytes long. Is there any algorithm that can assign a unique ID to each source file in VS2013 in C++ and also let the user know which file it is based on the ID?
Just store a mapping between filename and an integer.
-----Yes, this way is very simple, but every time someone creates new source files the mapping needs to be re-coded, so I won't use this approach.
HERE IS THE ORIGINAL QUESTION SO THAT THE COMMENTS BELOW MAKE SENSE
Now I have a bunch of strings, like "AAA" or "ABBCCHH". A string contains at most 64 characters. I need an algorithm that can convert a string into a number (not necessarily an integer; a double is also acceptable), but the length of the numbers must be fixed. For example, if "A" is converted into 12312 (5 digits), "ABBHGGH" should also have 5 digits after conversion. These numbers must also convert back to the original strings. The converted number cannot be over 8 bytes, which is why I cannot just use a simple ASCII encoding. Is there any algorithm that can do that?
To generate unique IDs for an arbitrary set of filenames (the actual question here), you could use a cryptographic hash (SHA-1, -256, -384, -512). This will produce an effectively unique, fixed-length hexadecimal output. If you can't allow the characters a-f in the output, you can convert the hexadecimal value to decimal.
This process is not reversible, but you can maintain a map (lookup table) of the input values to the IDs.
If you want a simpler solution, just hexadecimal encode the filenames. This is reversible. (You can add the hex -> decimal conversion here if necessary as well).
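A minimal Python sketch of both approaches (the file names are hypothetical; truncating SHA-256 to 8 bytes is my assumption to meet the 8-byte limit, and it is safe in practice for ~100 names but still one-way, hence the lookup table):
import hashlib

def hashed_id(filename):
    # One-way: derive a fixed 8-byte ID from the name
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

files = ["main.cpp", "parser.cpp"]                 # hypothetical names
id_to_name = {hashed_id(f): f for f in files}      # lookup table to reverse the hash

# Reversible alternative: hex-encode the name (no table needed, but the
# ID is up to 128 hex digits for a 64-character name, well over 8 bytes)
hex_id = "main.cpp".encode("utf-8").hex()
original = bytes.fromhex(hex_id).decode("utf-8")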

Why strings cannot be indexed by integer values

I learned that Swift strings cannot be indexed by integer values. I remember the rule and I follow it, but I've never fully understood the mechanics behind it.
The explanation from the official documentation is as follows:
"Different characters can require different amounts of memory to store, so in order to determine which Character is at a particular position, you must iterate over each Unicode scalar from the start or end of that String. For this reason, Swift strings cannot be indexed by integer values"
I've read it several times and I still don't quite get the point. Can someone explain a bit more why Swift strings cannot be indexed by integer values?
Many Thanks
A string is stored in memory as an array of bytes.
A given character can require 1 to 4 bytes for the basic code point, plus any number of combining diacritical marks.
For example, é requires 2 bytes.
Now, take the strings efgh and éfgh and try to access the second character (f): in the first string that character sits in the byte array at index 1, but in the second string it sits at index 2.
To know that, you must inspect the first character. In general, to access any character by its index, you have to walk through all the previous characters to find out how many bytes each one takes.
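To illustrate with Python (the question is about Swift, but UTF-8 behaves the same in any language):
s1, s2 = "efgh", "éfgh"
print(len(s1.encode("utf-8")))         # 4 bytes
print(len(s2.encode("utf-8")))         # 5 bytes: é alone takes 2
print(s1.encode("utf-8").index(b"f"))  # byte index 1
print(s2.encode("utf-8").index(b"f"))  # byte index 2, same character position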

Redshift: Truncate VARCHAR value automatically on INSERT or maybe use max length?

When performing an INSERT, Redshift does not allow you to insert a string value that is longer/wider than the target field in the table. Observe:
CREATE TEMPORARY TABLE test (col VARCHAR(5));
-- result: 'Table test created'
INSERT INTO test VALUES('abcdefghijkl');
-- result: '[Amazon](500310) Invalid operation: value too long for type character varying(5);'
One workaround for this is to cast the value:
INSERT INTO test VALUES('abcdefghijkl'::VARCHAR(5));
-- result: 'INSERT INTO test successful, 1 row affected'
The annoying part is that all of my code will now need these cast statements on every INSERT for each VARCHAR field, or the application code will have to truncate the string before constructing the query; either way, the column's width specification leaks into the application code, which is annoying.
Is there any better way of doing this with Redshift? It would be great if there was some option to just have the server truncate the string and perform (and maybe raise a warning) the way it does with MySQL.
One thing I could do is just declare these particular fields as a very large VARCHAR, perhaps even 65535 (the maximum).
create table analytics.testShort (a varchar(3));
create table analytics.testLong (a varchar(4096));
create table analytics.testSuperLong (a varchar(65535));
insert into analytics.testShort values('abc');
insert into analytics.testLong values('abc');
insert into analytics.testSuperLong values('abc');
-- Redshift reports the size of each table as the same: 4 MB
The one disadvantage of this approach I have found is that it will cause bad performance if this column is used in a group by/join/etc:
https://discourse.looker.com/t/troubleshooting-redshift-performance-extensive-guide/326
(search for VARCHAR)
I am wondering, though, whether there is any harm otherwise, if you never plan to use this field in a GROUP BY, JOIN, or the like.
Some things to note in my scenario: Yes, I really don't care about the extra characters that may be lost with truncation, and no, I don't have a way to enforce the length of the source text. I am capturing messages and URLs from external sources which generally fall into certain range in length of characters, but sometimes there are longer ones. It doesn't matter in our application if they get truncated or not in storage.
The only way to automatically truncate the strings to match the column width is to use the COPY command with the TRUNCATECOLUMNS option:
Truncates data in columns to the appropriate number of characters so
that it fits the column specification. Applies only to columns with a
VARCHAR or CHAR data type, and rows 4 MB or less in size.
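For instance, a sketch of running such a COPY from Python with psycopg2 (the host, table, bucket, and IAM role below are placeholders, not real values):
import psycopg2

conn = psycopg2.connect(host="example.redshift.amazonaws.com", port=5439,
                        dbname="analytics", user="admin", password="secret")
with conn.cursor() as cur:
    cur.execute("""
        COPY test
        FROM 's3://my-bucket/data.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        CSV
        TRUNCATECOLUMNS;
    """)
conn.commit()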
Otherwise, you will have to take care of the length of your strings using one of these two methods:
Explicitly CAST your values to the VARCHAR you want:
INSERT INTO test VALUES(CAST('abcdefghijkl' AS VARCHAR(5)));
Use the LEFT and RIGHT string functions to truncate your strings:
INSERT INTO test VALUES(LEFT('abcdefghijkl', 5));
Note: CAST should be your first option because it handles multi-byte characters properly. LEFT truncates based on the number of characters, not bytes, so if your string contains multi-byte characters you might still exceed the limit of your column.
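If you truncate client-side before an INSERT instead, the same byte-versus-character caveat applies. A hypothetical helper (my sketch, not part of any library) that trims a string to a byte budget without splitting a multi-byte character:
def truncate_utf8(s, max_bytes):
    # Cut at the byte limit, then drop any trailing partial character
    return s.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")

print(truncate_utf8("abcdefghijkl", 5))  # abcde
print(truncate_utf8("ééééé", 5))         # éé -- each é is 2 bytes, so only 2 whole characters fit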
