Is there a way of algorithmically editing lines in a (word) text document? - text

I am trying to read very large text documents with 15 columns into R, and every so often I get an R error message (and the read-in fails) saying that there are not 15 columns: occasional rows have the letter "B" at the end as an extra element. As I can't read the document into R, I can't correct it in R. Is there any algorithmic way of finding and deleting these superfluous characters? Thanks
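One possible approach is to clean the file outside R first, reading it line by line and dropping the trailing "B" only from rows that have exactly one field too many. The sketch below is in Python and rests on assumptions the question does not confirm: the file is plain text, the columns are tab-separated, and the file names are hypothetical. The helper names `strip_extra_field` and `clean_stream` are also made up for illustration.

```python
EXPECTED_COLUMNS = 15  # column count stated in the question


def strip_extra_field(line, sep="\t", extra="B", expected=EXPECTED_COLUMNS):
    """Drop a trailing `extra` field if the row has exactly one field too many.

    Assumption: fields are separated by `sep` (tab here); adjust to the real file.
    """
    fields = line.rstrip("\r\n").split(sep)
    if len(fields) == expected + 1 and fields[-1] == extra:
        fields = fields[:-1]
    return sep.join(fields)


def clean_stream(src, dst, **kwargs):
    """Copy src to dst line by line, so the whole file never sits in memory."""
    for line in src:
        dst.write(strip_extra_field(line, **kwargs) + "\n")


# Usage with real files (both paths are hypothetical):
# with open("big_input.txt") as src, open("big_clean.txt", "w") as dst:
#     clean_stream(src, dst)
```

Rows that already have 15 fields are left untouched, even if their last field happens to be "B", so the cleaned file should then be readable in R in the usual way.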

Related

How to programmatically print a long text string in 3 columns over multiple lines in excel vba

I have a long string of text. I want to print this in 3 columns over 2 pages in excel, using vba. e.g.:
A|B|C
-----
D|E|F
This is easy to do in MS Word: I just split the page into 3 columns and print the text, and it automatically flows into the next column/page once the previous one is full. But I need to do something similar in Excel.
Ideas I've explored:
Having a textbox that stretches over 2 pages - splits into 3 columns, but the arrangement is:
a|c|e
b|d|f
I can't seem to find a way to put a page break within the text box.
Have 6 separate text boxes and split the text up - can't find a way of determining when one box is full so that the other can be started. Can't even determine how many lines the text takes up as characters have varying widths.
Have 6 separate large cells and split the text up - same issues as above.
Does anyone know of a way this can be overcome? I just want to replicate the behaviour of ms word.
Edit: here's the code for a textbox with columns that I can't get to page break:
Dim tb_1
Set tb_1 = jumbledwords_sheet.Shapes.AddTextbox(msoTextOrientationHorizontal, 10, 7470, 488, 1300)
tb_1.TextFrame2.Column.Spacing = 10
tb_1.TextFrame2.Column.Number = 3
tb_1.TextFrame2.TextRange.Font.Size = 9
tb_1.TextFrame2.TextRange.Characters.Text = long_string
The structure of long_string isn't important, and I can break it up etc. to fit the solution.

Extract last word in string in R - error faced

First, I wish to extract the first word and last word of the Description column (this column contains at least 3 words) into newly created columns firstword and lastword. However, the word() function is not applied to all the rows. As a result, many rows have an empty lastword, though these rows actually have a last word (as you can see from the Description column). This is shown in the first two lines of code.
Second, I am also trying to get the third line of code to replace lastword with firstword if lastword is empty. However, it isn't working.
Is there a way to rectify this?
c1$lastword = word(c1$Description,start=-1) #extract last word
c1$firstword = word(c1$Description,start=1) #extract first word
c1$lastword=ifelse(c1$lastword == " ", c1$firstword, c1$lastword)
I realise that there is white space at the beginning of some of the rows of the Description variable, which isn't shown when viewed in R.
Removing the whitespace using stri_trim() solved the issue.
c1$Description = stri_trim(c1$Description, "left") #remove whitespace

Openrefine: Split multi-valued cells by token/word count?

I have a large corpus of text data that I'm pre-processing for document classification with MALLET using openrefine.
Some of the cells are long (>150,000 characters) and I'm trying to split them into <1,000 word/token segments.
I'm able to split long cells into 6,000 character chunks using the "Split multi-valued cells" by field length, which roughly translates to 1,000 word/token chunks, but it splits words across rows, so I'm losing some of my data.
Is there a function I could use to split long cells by the first whitespace (" ") after every 6,000th character, or even better, split every 1,000 words?
Here is my simple solution:
Go to Edit cells -> Transform and enter
value.replace(/((\s+\S+?){999})\s+/,"$1###")
This will replace every 1000th whitespace (consecutive whitespaces are counted as one and replaced if they appear at the split border) with ### (you can choose any token you like, as long as it doesn't appear in the original text).
Then go to Edit cells -> Split multi-valued cells and split using the token ### as separator.
The simplest way is probably to split your text by spaces, to insert a very rare character (or group of characters) after each group of 1000 elements, to reconcatenate, then to use "Split multivalued cells" with your weird character(s).
You can do that in GREL, but it will be much clearer by choosing "Python/Jython" as script language.
So: Edit cells -> Transform -> Python/Jython:
my_list = value.split(' ')
n = 1000
i = n
while i < len(my_list):
    my_list.insert(i, '|||')
    i += (n+1)
return " ".join(my_list)
(For an explanation of this script, see here)
Here is a more compact version:
text = value.split(' ')
n = 1000
return "|||".join([' '.join(text[i:i+n]) for i in range(0,len(text),n)])
You can then split using ||| as separator.
If you prefer to split by characters instead of words, it looks like you can do that in two lines with textwrap:
import textwrap
return "|||".join(textwrap.wrap(value, 6000))
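To try the word-based splitting outside OpenRefine, the compact version above can be wrapped in an ordinary Python function. This is only a sketch: `split_every_n_words` is a hypothetical name, and OpenRefine's `value` becomes a plain argument.

```python
def split_every_n_words(value, n=1000, token="|||"):
    """Join chunks of at most n space-separated words with `token`."""
    words = value.split(" ")
    chunks = [" ".join(words[i:i + n]) for i in range(0, len(words), n)]
    return token.join(chunks)


# 2500 dummy words should yield chunks of 1000, 1000 and 500 words.
sample = " ".join("w%d" % i for i in range(2500))
pieces = split_every_n_words(sample).split("|||")
print([len(p.split(" ")) for p in pieces])  # [1000, 1000, 500]
```

Because the text is split on single spaces and rejoined with single spaces, no words are cut in half and rejoining the pieces gives back the original text.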

Deleting duplicate data in one of three column in Excel

I tried to use some functions that I found whilst searching to solve my problem, slightly modified, to remove duplicate data in a field.
Rather than a count of 4, I would like a count of 2 from column J. The formulas below are my attempts for 4 different sections of the attached document; each time I thought the next one would give me the result that I wanted.
H     I              J
P13C  Body Exterior  4943
P13C  Body Exterior  4943
P13C  Body Exterior  5122
P13C  Body Exterior  5122
=IFERROR(INDEX($K$7:$K$142,MATCH(0,COUNTIFS($H$7:$H$142,B14,$K$7:$K$142,$E$14),0)),"")
as does this
=IFERROR(INDEX($J$7:$J$142,MATCH(,IF(H$7:H$142="P13C",COUNTIF(I7:I142,$J$7:$J$142)),)),"")
and this
=IFERROR(INDEX($K$7:$K$142,MATCH(0,COUNTIF($H$7:$H$142,$K$7:$K$142),0)),"")
This gives me a 0
=IF($J$7:$J$142>1,IF($K$7:$K$142="20",SUM(IF(FREQUENCY($H$7:$H$142,$H$7:$H$142)>1,1))))
This gives me a #DIV/0! error
=SUMPRODUCT(((H7:H142="P13C")*(I7:I142="Body Exterior"))/(COUNTIFS(J7:J142, J7:J142, H7:H142, "P13C", I7:I142,"Body Exterior")+((H7:H142<>"P13C")+(I7:I142="Body Exterior"))))
There are duplicates in $J$7:$J$142, but I only want the one count.
Sort column J from smallest to largest.
Concatenate all three values using '&' in the next column, K ---> H&I&J
Then use an IF and COUNTIF formula in the next column, L.
Column L:
=IF(K1=K2,COUNTIF(K1,K1:K142),"")
This is an easy method to get a count of 1 for column J.
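For reference, the result the question asks for (each distinct value in column J counted once, for rows where H is "P13C" and I is "Body Exterior") can be sketched in Python. This is not an Excel formula, only an illustration of the counting logic on the four sample rows; `distinct_count` is a hypothetical helper name.

```python
# The four sample rows from the question, as (H, I, J) tuples.
rows = [
    ("P13C", "Body Exterior", 4943),
    ("P13C", "Body Exterior", 4943),
    ("P13C", "Body Exterior", 5122),
    ("P13C", "Body Exterior", 5122),
]


def distinct_count(rows, h, i):
    """Count distinct J values among rows whose H and I columns match."""
    return len({j for (row_h, row_i, j) in rows if row_h == h and row_i == i})


print(distinct_count(rows, "P13C", "Body Exterior"))  # 2
```

Collecting the J values into a set is what removes the duplicates, so 4943 and 5122 are each counted once.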

MATLAB avoid multiline strings

I have a struct array in which I store chapter titles and other info after OCR; at the end there's a GUI for manual adjustment in case the OCR fails.
All works fine, but after an edit, or if I reopen the GUI later from the command window, the strings are messed up: short strings have a blank space after every character, for example:
'first chapter' becomes 'f i r s t c h a p t e r'
while longer strings skip a row every character
'THIS IS AN EXAMPLE' becomes 'TE HX IA SM IP SL AE N'
I thought it was due to multiline strings, but converting to a single line didn't change the results.
Any ideas?
Thanks
