An elderly family member recorded a memoir over the past few years using Windows Notepad, so each file (one per year) is plain text. I am tasked with normalizing the documents as much as possible for a later print run. The problem I'm struggling with is how to handle each chapter title. A single text file can contain multiple chapter entries. Some chapter titles are very simple to get, for example:
Chapter 1
text
text
text.
chapter two
text
text
But she wasn't always so neat. Some of her documents contain lines like
" chapter
three
"
with leading and trailing spaces and even a CarriageReturn/LineFeed between.
I cannot get the syntax to manage the "chapter three" situation. Here's what I have done so far:
$charstr = ' chapter
three
text here
more text
'
#remove leading spaces
$charstr2 = $charstr.trim()
#find and replace chapter to all caps and start on a new line
$charstr2.Replace("chapter ", "`nCHAPTER ")
I'd sure appreciate some assistance with how to normalize that multi-line text string into a format like "CHAPTER three" (ideally, I will UPPER() the chapter number as well, like "CHAPTER THREE").
I've tried using \s, as in
$charstr2 = $charstr.trim() -replace '\s+',
but I'm obviously doing something wrong.
Thanks!
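As an illustration of the \s idea from the question, here is a minimal sketch in Python rather than PowerShell. It assumes every title is the word "chapter" followed by a single number or word, possibly on the next line, and that nothing else shares the title's lines; the function name normalize_chapters is hypothetical.

```python
import re

# "chapter", then any whitespace (including a line break), then one word or number.
# Assumption: the title occupies its own line(s), with nothing else on them.
CHAPTER_RE = re.compile(r'^[ \t]*chapter\s+(\w+)[ \t]*$', re.IGNORECASE | re.MULTILINE)

def normalize_chapters(text: str) -> str:
    # Notepad saves CR/LF line endings; work on LF only to keep the pattern simple.
    text = text.replace('\r\n', '\n')
    return CHAPTER_RE.sub(lambda m: 'CHAPTER ' + m.group(1).upper(), text)

sample = " chapter \r\nthree \r\ntext here\r\nmore text\r\n"
print(normalize_chapters(sample))
```

Because \s matches line breaks as well as spaces, the same pattern covers both the neat "chapter two" form and the split "chapter / three" form.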
I would like to select a specific block of text in Sublime Text. I looked for tutorials on how to do it, but I can't find what I want.
As you can see in the screenshot, starting from a word, for example "hello", I'd like to select the sentence that contains the word and also the 2 sentences underneath.
Is it possible to do this?
The area in red represents the selection.
I think you are using the word 'sentence' when what you mean is a 'line' and what you want to do is to construct a RegEx to select the line a specific word is on and also the contents of the next 2 lines.
If that is right then the following regex does what you want and seems like what is shown in the screenshot that you posted.
(hello)(.*\n){2}(.*)
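To see what that pattern selects, here is a small check in Python. It assumes Python's re module and Sublime Text's regex engine agree on this pattern (in both, "." does not cross a line break by default), and the sample text is made up. Note that the match starts at the word itself, not at the start of its line; the second pattern is a hypothetical variant that takes the whole first line.

```python
import re

text = "first line\nsay hello there\nsecond\nthird\nfourth\n"

# Pattern from the answer: from "hello" to the end of the line two lines below.
from_word = re.search(r'(hello)(.*\n){2}(.*)', text).group(0)

# Hypothetical variant: also include the start of the line containing "hello".
whole_lines = re.search(r'^.*hello.*\n.*\n.*', text, re.MULTILINE).group(0)

print(repr(from_word))
print(repr(whole_lines))
```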
I am new to Excel VBA. I want to read a text file that contains text like this:
John Smith Engineer Chicago
Bob Alice Doctor New York
Jane Smith Teacher St. Louis
So I want to convert this into a 2D array, so that print(3,3) returns 'Teacher'.
I am able to read the entire file contents into one string, but I am having difficulty converting it to a 2D array like the one above. Please advise on how to proceed. Thanks
Unless the text file has some specific structure to it, you're going to struggle a bit. Things that might make it easier are:
Does the text file contain line breaks at the end of each line?
Are all the names in [FirstName][LastName] format as per your example, or might some have more or fewer words?
Does the Occupation always come directly after the name?
Is there a (very) limited number of Occupations?
As mentioned by NautMeg, you have to make some assumptions about the data based on the provided template.
However, we can assume that:
A space is the delimiter
The final column is City, which can contain a space
There are 4 columns:
First Name
Last Name
Profession
City/Location
Using this information:
While Not EOF(my_file)
    Line Input #my_file, text_line
    ' text_line now holds the current line
    i = i + 1
    ' i is the line number
Wend
is how we retrieve each line.
Split ( Expression, [Delimiter], [Limit], [Compare] )
This will give you each item in the list. Items with an index < 3 (0-based) are unique columns of data, and you can handle them however you want.
For index >= 3, Join these together into one string.
Join( SourceArray, [Delimiter] )
You'll likely want to make the delimiter in this case a simple space, since the split function will remove the space.
That will allow you to parse the data as is.
However, for future reference if you can control the export of the text file, you should try exporting as a CSV file.
Good luck
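The split-then-join approach above can be sketched as follows. This is Python rather than VBA, purely to illustrate the logic, and it relies on the same assumptions as the answer: space-delimited fields, three single-word columns, and a city that may contain spaces. The function name parse_people is hypothetical.

```python
def parse_people(lines):
    """Turn 'First Last Profession City...' lines into a list of 4-column rows."""
    table = []
    for line in lines:
        parts = line.split()          # split on whitespace
        if len(parts) < 4:            # skip blank or malformed lines
            continue
        city = " ".join(parts[3:])    # re-join everything from index 3 onward
        table.append(parts[:3] + [city])
    return table

rows = parse_people([
    "John Smith Engineer Chicago",
    "Bob Alice Doctor New York",
    "Jane Smith Teacher St. Louis",
])
print(rows[2][2])   # 0-based here, so this is the question's print(3,3)
```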
I am translating a game, and the game's text box only supports 50 characters max per line. Is there a way to use a formula to split the entire sentence every 50 characters or whole word (49, 48, 47, etc)?
I am currently working with this formula.
=JOIN(CHAR(10),SPLIT(REGEXREPLACE(A1, "(.{50})", "/$1"),"/"))
The problem with this formula is that it splits at exactly 50 characters (one time), and it will split in the middle of a word.
So again, my goal is to have it not split on the 50th character IF the 50th character is in the middle of the word, and for the rule to apply for the rest of the lines too because it only applies on the first line.
Please take a look at this test google sheet to get an example of what I am talking about.
If it's impossible to do it on Google Sheets, I don't mind moving to Excel provided I get a functioning code.
For the record, I did ask in Google's product forums 2 days ago, and still haven't received an answer.
=REGEXREPLACE(A1, "(.{1,50})\b", "$1" & CHAR(10))
{50} matches exactly 50 times, but what you need is 50 or fewer.
\b is a word boundary, which matches between a word character and a non-word character.
= REGEXEXTRACT(A1,"(?ism)^"&REPT("([\w\d'\(\),. ]{0,49}\s)", ROUNDUP(LEN(A1)/50,0))&"([\w\d'\(\),. ]{0,49})$")
Tested with various expressions and works as intended. Note that only these characters [a-zA-Z0-9_'(),.] are allowed, which means that - and other characters not mentioned will not work. If you need them, add them inside the REPT expression and in the final part of the regex. Otherwise, this will work.
You are pretty close. I'm not an expert in Sheets, so not sure if this is the best way, but your Regex is wrong for what you want.
Also, you need to be certain that you don't use a split character that might appear in the phrase itself. However, using CHAR(10) for the replace character allows you to insert LF without going through the JOIN SPLIT sequence.
Replace any line feeds, carriage returns and spaces with a single space.
Match strings that start with a non-space character followed by up to 49 more characters, which are followed by a space or the end of the string.
Replace each match with the capture group followed by CHAR(10) (this also deletes the space that followed).
There will be extra CHAR(10) at the end which you can strip off.
EDIT Regex changed slightly due to a difference in behavior between Google's RE and what I am used to (probably has to do with how a non-backtracking regex works). The problem showed up on your example:
=regexreplace(REGEXREPLACE(REGEXREPLACE(A1 & " ","[\r\n\s]+"," "),"(\S.{0,49})\s","$1" & char(10)),"\n+\z","")
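For readers who want to check the steps outside Sheets, the same three steps can be sketched in Python. This is only an illustration: the function name wrap_50 and the sample sentence are made up, and Python's re engine is not the one Sheets uses, although this pattern should behave the same way in both.

```python
import re

def wrap_50(text: str, width: int = 50) -> str:
    # Step 1: collapse line feeds, carriage returns and spaces into single spaces,
    # and append one space so the last chunk also ends in whitespace.
    collapsed = re.sub(r'\s+', ' ', text.strip()) + ' '
    # Step 2 and 3: a non-space character plus up to width-1 more characters,
    # followed by whitespace; keep the chunk and turn the whitespace into a line feed.
    wrapped = re.sub(r'(\S.{0,%d})\s' % (width - 1), r'\1\n', collapsed)
    # Strip the extra line feed left at the end.
    return wrapped.rstrip('\n')

sample = ("The quick brown fox jumps over the lazy dog and keeps running "
          "through the forest until it finally reaches the river bank.")
lines = wrap_50(sample).split('\n')
print('\n'.join(lines))
```

A single word longer than the limit is left unsplit, so its line can exceed 50 characters.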
Basically a line looks like this: 'number number text text text', with spaces dividing the items. The numbers are OK, because readln() just splits them after the space, but it reads the 3 texts as one. How can I read them into separate strings?
If anybody faces this problem, here's a really easy solution I just found: read the whole thing into a string. Then find the first space with pos(' ', stringsname), copy the text after it with copy(..., spacepos+1, 200), and delete(..., spacepos, 200) from the first string so that only the first item remains, and voilà.
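The same find, copy and delete loop can be sketched in Python (not Pascal). It assumes the whole line is read into one string first, and the function name split_fields is hypothetical.

```python
def split_fields(line: str) -> list:
    """Peel off one space-separated field at a time, like pos/copy/delete."""
    fields = []
    rest = line.strip()
    while rest:
        space_pos = rest.find(' ')                # like pos(' ', s)
        if space_pos == -1:                       # no space left: last field
            fields.append(rest)
            break
        fields.append(rest[:space_pos])           # text before the space
        rest = rest[space_pos + 1:].lstrip()      # keep only what follows it
    return fields

print(split_fields("12 7 alpha beta gamma"))
```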
I'm using the MCONCAT formula (with success & help from others) to create a single string of multiple attachment names to associate them with a single record # (I am converting data from a legacy system to another by way of flat files and a data loader).
An example: | Contract 1 | filename.pdf, filename2.doc |
However, when the first load was run, records that had a comma in the file name errored out because the data loader treats the comma as the break between files. After some research, we decided to use '#' as the delimiter between multiple files in a cell. Now I am stuck trying to substitute the comma delimiters in my MCONCAT formula with '#' and have had no success so far.
Here is the code as I am using it now:
=SUBSTITUTE(MCONCAT(IF($A$2:$A$11133=$D2,", "&$B$2:$B$11133,"")),", ","",1)
Is this possible to do? If so, how? And maybe (if it's not asking too much) a short explanation, so I can fully understand.
An example of the hopeful solution: | Contract 1 | filename.pdf # filename2.doc |
Depending on the complexity of the filenames that contain commas, you might be able to do what you want simply by using the Find & Select / Replace feature of Excel.
Please use a copy of your workbook if you try anything suggested.
If your separator is always [list item][comma][space][list item] and none of your [list item(s)] contain [comma][space], then selecting the column containing the list and using a "find what" term of ", " (note the space!) and a "replace with" term of "# " should fix your problem. Using [at][space] instead of [space][at][space] is probably better.
A VBA solution is possible, but it would probably be more effort than it's worth. You might need to write lots of rules telling it how to split and join stuff, and end up with it still not being perfect.
While doing it manually might not be a fun idea, you could use something like "text to columns" to split your list then look over the results and fix the errors then re-join using your new delimiter.
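If the original record and file name columns are still available as separate values, building the list with the new delimiter from the start avoids the comma problem entirely. Here is a minimal sketch of that idea in Python (not Excel), with made-up sample rows and a hypothetical function name join_attachments:

```python
from collections import defaultdict

def join_attachments(rows, delimiter=" # "):
    """Group file names by record and join each group with the delimiter."""
    grouped = defaultdict(list)
    for record, filename in rows:
        grouped[record].append(filename)
    return {record: delimiter.join(names) for record, names in grouped.items()}

result = join_attachments([
    ("Contract 1", "filename.pdf"),
    ("Contract 1", "filename2.doc"),
    ("Contract 2", "report, final.pdf"),   # a comma inside a name is left alone
])
print(result["Contract 1"])
```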