I have a struct array in which I store chapter titles and other info after an OCR pass; at the end there is a GUI for manual adjustment in case the OCR fails.
Everything works fine, but after an edit, or if I reopen the GUI later from the command window, the strings are messed up. Short strings have a blank space after every character, for example:
'first chapter' becomes 'f i r s t c h a p t e r'
while longer strings skip a row every character
'THIS IS AN EXAMPLE' becomes 'TE HX IA SM IP SL AE N'
I thought it was due to multiline strings, but converting them to a single line didn't change the results.
Any ideas?
Thanks
I am trying to read very large text documents with 15 columns into R, and every so often I get an R error message (and the read-in fails) saying that there are not 15 columns: occasional rows have the letter "B" at the end as an extra element. As I can't read the document into R, I can't correct it in R. Is there any algorithmic way of finding and deleting these superfluous characters? Thanks
I need to replace the first instance of a single text character in an Excel row.
current: B01 TEST TEST TEST A W B 0 A
expected result where first "A" that is on its own is replaced with "|": B01 TEST TEST TEST | W B 0 A
The issue is, each row has a character that is segmented on its own, but they are all different (some A, some W, some R, etc). Which function can I use to look for the first instance of a single text character surrounded by spaces?
In Office 365 you could use =AGGREGATE(15,6,FIND(" "&CHAR(SEQUENCE(26,,65))&" ",A19),1)
Older version: =AGGREGATE(15,6,FIND(" "&{"A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"}&" ",A19),1)
Edit: better version suggested by Mayukh
=AGGREGATE(15,6,FIND(" "&CHAR(ROW($65:$90))&" ",A19),1)
This uses Office 365 LET and SEQUENCE and is not dependent on it being a capital character. It will replace the first character whether alpha, numeric or special that has a space on either side of it with the |:
=LET(rng,A1,
sq,SEQUENCE(LEN(rng)-3),
md,MID(rng,sq,3),
lft,LEFT(md),
rt,RIGHT(md),
REPLACE(rng,MIN(IF((lft=" ")*(rt=" "),sq+1)),1,"|"))
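For readers who want to check the logic outside Excel, the same idea can be sketched in Python. This is only an illustration and not part of either answer: the function name `replace_first_lone_char` is made up here, and unlike the LET formula it requires the lone character to be a non-space character.

```python
import re

def replace_first_lone_char(text, token="|"):
    # Hypothetical helper: replace the first single non-space character
    # that has a space on both sides. count=1 stops after the first hit.
    return re.sub(r"(?<= )\S(?= )", token, text, count=1)

print(replace_first_lone_char("B01 TEST TEST TEST A W B 0 A"))
# B01 TEST TEST TEST | W B 0 A
```

The lookbehind and lookahead check the surrounding spaces without consuming them, so only the lone character itself is replaced.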
I have a large corpus of text data that I'm pre-processing with OpenRefine for document classification with MALLET.
Some of the cells are long (>150,000 characters) and I'm trying to split them into <1,000 word/token segments.
I'm able to split long cells into 6,000 character chunks using the "Split multi-valued cells" by field length, which roughly translates to 1,000 word/token chunks, but it splits words across rows, so I'm losing some of my data.
Is there a function I could use to split long cells by the first whitespace (" ") after every 6,000th character, or even better, split every 1,000 words?
Here is my simple solution:
Go to Edit cells -> Transform and enter
value.replace(/((\s+\S+?){999})\s+/,"$1###")
This will replace every 1000th whitespace (consecutive whitespaces are counted as one and replaced if they appear at the split border) with ### (you can choose any token you like, as long as it doesn't appear in the original text).
Then go to Edit cells -> Split multi-valued cells and split using the token ### as separator.
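To see what the pattern does on a small input, here is a rough Python sketch of the same regular expression with the repeat count made adjustable. The helper name `mark_every_nth_word` and the sample sentence are assumptions for illustration; inside OpenRefine you would still use the GREL expression above.

```python
import re

def mark_every_nth_word(text, n, token="###"):
    # Same pattern as the GREL expression, with a repeat count of n - 1
    # (999 in the answer above, giving 1000-word segments).
    pattern = r"((\s+\S+?){%d})\s+" % (n - 1)
    # Keep the matched words and swap the trailing whitespace for the token.
    return re.sub(pattern, lambda m: m.group(1) + token, text)

marked = mark_every_nth_word("one two three four five six seven", 3)
print(marked.split("###"))
# ['one two three', 'four five six', 'seven']
```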
The simplest way is probably to split your text by spaces, to insert a very rare character (or group of characters) after each group of 1000 elements, to reconcatenate, then to use "Split multivalued cells" with your weird character(s).
You can do that in GREL, but it will be much clearer by choosing "Python/Jython" as script language.
So: Edit cells -> Transform -> Python/Jython:
my_list = value.split(' ')
n = 1000
i = n
while i < len(my_list):
    my_list.insert(i, '|||')
    i += (n+1)
return " ".join(my_list)
(For an explanation of this script, see here)
Here is a more compact version:
text = value.split(' ')
n = 1000
return "|||".join([' '.join(text[i:i+n]) for i in range(0,len(text),n)])
You can then split using ||| as separator.
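If you want to try the chunking logic outside OpenRefine first, a standalone version might look like the sketch below. The function name `chunk_words` and the tiny sample are hypothetical; in OpenRefine the input comes from `value` and the script ends with a bare `return`, as in the snippets above.

```python
def chunk_words(text, n=1000, sep="|||"):
    # Split on single spaces, regroup into runs of n words,
    # and join the runs with the separator token.
    words = text.split(" ")
    chunks = [" ".join(words[i:i + n]) for i in range(0, len(words), n)]
    return sep.join(chunks)

print(chunk_words("a b c d e f g", n=3))
# a b c|||d e f|||g
```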
If you prefer to split by characters instead of words, it looks like you can do that in two lines with textwrap:
import textwrap
return "|||".join(textwrap.wrap(value, 6000))
Note: this is SSIS, not SQL Server
I am pulling data from a file and some columns have names like this:
1;&count chocula
13;&roger ramjet
123;&mary smith
45678;&john adams
How do I remove the ampersand and everything to the left of it?
I am using the fx transformation for the character.
I thought about finding the character position of the ampersand and then deleting everything from the start to that position, but SSIS does not seem to have a function for that. The ampersand can be at any position; I cannot guarantee that it is at a particular position.
Thanks
The RIGHT() function retrieves the last X characters of a string.
RIGHT("13;&roger ramjet",12) = roger ramjet
Above, X equals 12. Of course, twelve won't work for every string. Instead we can calculate X by subtracting the position of the ampersand from the string length.
LEN([MyColumn]) = 16
FINDSTRING([MyColumn],"&",1) = 4
Or put another way...
RIGHT([MyColumn], LEN([MyColumn]) - FINDSTRING([MyColumn],"&",1)) = roger ramjet
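The arithmetic can be checked against the sample values from the question with a short Python sketch. This only mirrors the expression for illustration and is not SSIS syntax; the function name `text_after_ampersand` is made up.

```python
def text_after_ampersand(value):
    # 1-based position of the first "&", like FINDSTRING(value, "&", 1).
    # str.find is 0-based and returns -1 when absent, so pos is 0 if no "&".
    pos = value.find("&") + 1
    # Number of characters to keep: LEN(value) - position of "&".
    keep = len(value) - pos
    # Equivalent of RIGHT(value, keep): the last `keep` characters.
    return value[len(value) - keep:]

for name in ["1;&count chocula", "13;&roger ramjet", "123;&mary smith", "45678;&john adams"]:
    print(text_after_ampersand(name))
```

In this sketch, a value without an ampersand comes back unchanged, because the position is 0 and the whole length is kept.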
Question:
Is it safe to get the first n characters of a text in RPG using the MOVEL operation, which takes a text of length x and stores it in a variable with capacity n?
Or is SUBST the only safe way to get the first n characters?
The background: one of my colleagues gets the first 3 characters of a 30-character database field by using MOVEL into a variable only 3 characters long (truncating the rest). Strangely, the receiving variable sometimes shows a minus character ('-') and sometimes doesn't. So I assume MOVEL is not a safe way to do this. I am thinking of strings in C, which are always terminated by '\0', where you need to use the strcpy function to copy safely rather than assigning with the = operator.
Is anybody who knows RPG familiar with this issue?
MOVEL should work. RPG allows several character data types. Generally speaking, someone using MOVEL will not be dealing with null terminated strings because MOVEL is an old technique and null terminated strings are a newer data type. You can read up on the MOVEx operations and the string operations in the RPG manual. To get a better answer, please post your code, including the definitions of the variables involved.
EDIT: Example of how MOVEL handles signs.
dcl-s long char(20) inz('CORPORATION');
dcl-s short char(3) inz('COR');
dcl-s numb packed(3: 0);
// 369
c movel long numb
dsply numb;
// -369
c movel short numb
dsply numb;
*inlr = *on;
With signed numeric fields in RPG the sign is held in the zone of the last byte of the field. So 123 is X'F1F2F3' but -123 is X'F1F2D3'. If you look at those fields as character strings they will have 123 and 12L in them.
In your program you are transferring something like "123 AAAAAL" to a 3 digit numeric field so you get X'F1F2F3' but because the final character is X'D3' that changes the result to have a zone of D i.e. X'F1F2D3'
Your anomaly depends on what the 30th character contains. If it is } or any capital letter J to R then you get a negative result. [It doesn't matter whether the first 3 characters are numbers or letters because it is only the second half of the byte, the digit, that matters in your example.]
The IBM manuals say:
If factor 2 is character and the result field is numeric, a minus zone is moved into the rightmost position of the result field if the zone from the rightmost position of factor 2 is a hexadecimal D (minus zone). However, if the zone from the rightmost position of factor 2 is not a hexadecimal D, a positive zone is moved into the rightmost position of the result field. Other result field positions contain only numeric characters.
Don
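The zone rule quoted above can be modelled roughly in Python, using the `cp037` codec as a stand-in EBCDIC code page. This is a simplified sketch of the quoted rule only, not a full emulation of MOVEL: the function name and parameters are hypothetical, it assumes the source field is at least as long as the numeric result, and it does not handle bytes whose digit half is above 9.

```python
def movel_char_to_numeric(source, source_len, digits):
    # Pad to the declared field length with blanks, then encode as EBCDIC.
    raw = source.ljust(source_len).encode("cp037")
    # The digit half (low nibble) of the leftmost bytes becomes the number.
    value = 0
    for byte in raw[:digits]:
        value = value * 10 + (byte & 0x0F)
    # The zone half (high nibble) of the rightmost source byte sets the sign:
    # hexadecimal D means minus, anything else means plus.
    negative = (raw[-1] >> 4) == 0xD
    return -value if negative else value

print(movel_char_to_numeric("CORPORATION", 20, 3))  # 369
print(movel_char_to_numeric("COR", 3, 3))           # -369
print(movel_char_to_numeric("123 AAAAAL", 10, 3))   # -123
```

The first call is positive because the 20-character field ends in a blank (X'40'), while 'R' (X'D9') and 'L' (X'D3') both carry a D zone.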