Delete words with specific string in a text document in Notepad++ - text

I have a file with contents like
00001 abcd
00020 abcdefgh
0030 acgefty
00040 jhjhjdsadj2
00050 sadjjjah589
00500 blessing
I need to delete all numbers at beginning, i.e., I need a result file
abcd
abcdefgh
acgefty
jhjhjdsadj2
sadjjjah589
blessing
Can someone help? The actual file is about 600 lines long, so deleting the leading numbers by hand is impractical. I don't mind running a small C program.

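You can do this with a regular-expression replace in Notepad++:
Ctrl+H
Find what: ^\d+\s*
Replace with: (leave empty)
Search mode: Regular expression
Replace All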
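Since the asker is open to running a small program, here is a minimal Python sketch of the same substitution. The function name strip_leading_numbers is made up for illustration, and the pattern uses [ \t]* rather than \s* as a cautious assumption, so that a line containing only digits cannot swallow its own newline.

```python
import re

# Digits at the start of each line, plus any spaces or tabs that follow them.
# [ \t]* is used instead of \s* so the match never runs past the end of a line.
LEADING_NUMBER = re.compile(r"^\d+[ \t]*", re.MULTILINE)


def strip_leading_numbers(text: str) -> str:
    """Remove the leading number (and following blanks) from every line."""
    return LEADING_NUMBER.sub("", text)


if __name__ == "__main__":
    sample = "00001 abcd\n00020 abcdefgh\n0030 acgefty\n"
    print(strip_leading_numbers(sample), end="")
```

To process a real file, read its contents into a string, pass it through the function, and write the result to a new file.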


How to yank words between dots on multiple lines in VIM?

I am trying to find the best way to copy every occurrence of the second word between the dots, and paste them all somewhere else, using Vim without plugins. For example:
1 somename.xyz.something
2 so.someday.zzzz
3 text.example.fese.efsse
The result after I paste it somewhere else:
5 xyz
6 someday
7 example
I would copy the lines in question (e.g. yip or through visual mode) and paste them to the desired location, so you'd get:
somename.xyz.something
so.someday.zzzz
text.example.fese.efsse
somename.xyz.something
so.someday.zzzz
text.example.fese.efsse
And then delete the unwanted parts. For example by selecting them in visual mode and running :'<,'>normal 0df.f.D, resulting in:
somename.xyz.something
so.someday.zzzz
text.example.fese.efsse
xyz
someday
example
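Outside Vim, the same extraction can be sketched in a few lines of Python. This is only an illustration of the idea (take the second dot-separated field of each line); the function name second_fields is hypothetical, and lines without a dot are assumed to be skipped.

```python
def second_fields(lines):
    """Return the second dot-separated field of each line that has one."""
    result = []
    for line in lines:
        parts = line.strip().split(".")
        if len(parts) >= 2:  # skip lines that contain no dot at all
            result.append(parts[1])
    return result


if __name__ == "__main__":
    text = ["somename.xyz.something", "so.someday.zzzz", "text.example.fese.efsse"]
    for word in second_fields(text):
        print(word)
```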

remove a character from the end of a text string only if it is there in excel

I have a list of file names in Excel that I need to match against another list. Some of the file names contain extra characters that need to be removed first. I have a formula that removes special characters and spaces from the file names:
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE($E8,"_",""),"-",""),".","")," ","")
But some of the file names contain an extra 1 at the end that I need to remove. For example:
2AALNOR120114
DCA CDE 12-01-14
OPASDOCS120114
TWASCE1202141
TWASCE1203141
STCSRA120120141
STCSRA120220141
If anyone could give me a formula that strips out the above special characters and the trailing 1, that would be great.
Bonus credit if you can also strip out the 20 from the STC files, to output STCSRA120114 instead of STCSRA12012014.
Edit: For clarification, the final result would ideally look like this:
2AALNOR120114
DCACDE120114
OPASDOCS120114
TWASCE120214
TWASCE120314
STCSRA120114
STCSRA120214
Thanks,
Ben
Maybe:
=LEFT(A1,LEN(A1)-IF(RIGHT(A1,1)="1",1,0))
(Replace the first two instances of A1 above with a suitable version of your SUBSTITUTE formula, and the last with E8).
With substitution:
=LEFT(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(E8,"_",""),"-",""),".","")," ",""),LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(E8,"_",""),"-",""),".","")," ",""))-IF(RIGHT(E8,1)="1",1,0))
A slightly shorter version of the A1 formula (the comparison evaluates to TRUE or FALSE, which Excel coerces to 1 or 0 in the subtraction):
=LEFT(A1,LEN(A1)-(RIGHT(A1)="1"))
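To check the expected results outside Excel, the same logic can be sketched in Python. This is an illustrative equivalent, not the answerer's formula: the name clean_name is hypothetical, and the sketch tests for the trailing 1 on the cleaned string rather than on the raw cell. It does not attempt the bonus step of removing the 20 from the STC names.

```python
def clean_name(name: str) -> str:
    """Strip '_', '-', '.' and spaces, then drop one trailing '1' if present."""
    for ch in "_-. ":
        name = name.replace(ch, "")
    if name.endswith("1"):
        name = name[:-1]
    return name


if __name__ == "__main__":
    for raw in ["2AALNOR120114", "DCA CDE 12-01-14", "TWASCE1202141", "STCSRA120120141"]:
        print(raw, "->", clean_name(raw))
```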

data file in horizontal format containing hidden characters

I have been provided a data file in a format I have never seen. The data do not appear to be in columns, but rather in one long row. I can open the file in Notepad and see the data. So, the data do not appear to be encrypted.
When I open the data file in Notepad, the row of data wraps back to the left side of the Notepad window, I guess when the data reach the maximum number of characters that Notepad allows in a single row, and then the data continue in a new row.
There might be 10,000 rows of data when I open the file in Notepad. The data in one of these rows are not aligned with the data in the row above it or below it.
Here are some example data:
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 1304 3 0 0
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 0205 0 3 0
40001 1 5 GGGG 2998 HURG SU111111 95 1.0 F1 4 0805 0 2 0
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 1205 0 2 0
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 1505 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999 1.0 F3 4 2003 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999 1.0 F3 4 2303 2 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999 1.0 F3 4 2703 3 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999
Notice that when I paste the example data here, representing one row in Notepad, the columns are 'magically' aligned.
I have found that I can open the data file in Excel and the data are also aligned. I do need to manually assign column boundaries in Excel however. And Excel does not allow me to assign a column boundary beyond more-or-less Character Space 123.
Below is SAS code to read the data file, although this SAS code does not work correctly. Rather I guess this SAS code skips some of the data rows. Notice that the variable TT covers character spaces 125-207, but that there are only 120 characters in most rows. There are more than 120 characters in some rows. This difference in the number of characters among rows I suspect is the reason SAS cannot read this data file correctly.
option linesize = 210 ;
option pagesize = 30 ;
FILENAME myinput 'C:/Users/markm/simple SAS programs/mydata.new' ;
DATA mydata ;
INFILE myinput ;
INPUT
AA 2-9
BB 12-17
CC 18-22
DD $ 24-27
EE 30-33
FF $ 35-38
GG $ 40-47
HH 53-56
II 59-64
JJ $ 66-68
KK $ 70-71
LL 72-78
MM 79-85
NN $ 87-90
OO 91-95
PP 97-104
QQ 105-110
RR 112-120
SS $ 122-123
TT $ 125-207 ;
If I move the cursor to the right one character at a time over the first row of data using the right-arrow key I have to press the right-arrow key twice to move beyond character space 120 in Notepad.
All of this is telling me there are hidden characters in the data file used to identify the end of a line of data.
I opened the data file in Vim hoping to see these hidden characters, but did not see anything. Vim did align the columns correctly when I opened the file. So, Vim must be seeing these hidden end-of-line characters.
How can I see these end-of-line characters myself? I suspect there is an option in Vim to reveal the hidden characters.
How can I determine the application that created this data file?
How can I modify the above SAS code to read this data file correctly?
First off, double check your LRECL. You're missing basically half of your data, which makes me think you're reading in two lines for each line. You show 207 as your maximum line size, which should be under the default 256 LRECL, but seeing a number about 1/2 of the correct number makes me think you've made a mistake there.
Next, figure out whether you are seeing roughly every other line, or the first 44k lines and then a sudden stop. If the latter, you have a DOS EOF character (1A) in the data, and you need to set the IGNOREDOSEOF option. If the former, then you have either an obvious LRECL problem as above, a non-obvious LRECL problem caused by Unicode characters taking up multiple bytes (try LRECL=32767 and see if that fixes it; this would also cause your data to look funny at some point in each line), or a weird (and inconsistent) line terminator problem.
Then, assuming there is a problem with EOL characters (or EOF?), the way you approach this is to see exactly what is in your datafile.
Read in a dummy character, and then put the _infile_ line with hex. format. For example:
data test;
infile "d:\temp\utf8.txt" lrecl=256 RECFM=f;
input @1 x $1. @;
r = repeat('1234567890',8); *make this appropriate for your LS option in your log;
put r;
put _infile_;
put _infile_ hex512.;
stop; *we want to see just one line here;
run;
In my own test I was reading 20-character lines and using hex40., since the hex width needs to be exactly double the line length (the code above pairs lrecl=256 with hex512.). You can leave the length off (hex.), but you'll get some really long lines with tons of blanks if you do that. In your case, with lrecl=207, you should use hex414. in theory (but you might want to make your lrecl 256 and use hex512. just in case). Since we're using RECFM=F, the idea is to have an LRECL longer than your real line length, so you can see a whole line in one run of this. (If one line doesn't tell you enough, use firstobs= to navigate to a later line, recognizing that if your LRECL is not exactly right for the data, you won't be skipping to the start of a true line, but skipping 256-byte chunks.)
That will give you two strings, one the 'visible' string, which may be helpful for seeing what SAS thinks is at what spot, one the hex codes behind the visible string. The hex codes are 2 values per character (as one byte = 2 hex values), assuming you're in an ASCII environment (not a DBCS or Unicode environment). See this page for a list of ASCII codes.
Hex codes to look for:
1A = DOS EOF character.
0A = LF
0D = CR
If this is a Windows/DOS document, you should see CRLF consecutively at the ends of lines, i.e., 0D0A in a row, somewhere around 207. If this is a Unix document, you will see just 0A there. If this is a classic Mac OS document, you will see just 0D (CR). Why would anyone want to be consistent.
You probably will see something, since you're getting some number of lines. (If there was no line terminator, SAS would just give up after the first line.) You are more likely to have one of the following problems:
This is a DBCS file, so all characters really take up more than one byte. If you see a lot of 00 or 40 or 20 between characters (like, every single character has one), you have a DBCS (double-byte character set) file. This is what, say, a Chinese or Japanese copy of Windows would likely produce. They use two bytes for every character in order to represent the full set of characters in their languages, and even when storing English documents they still use the full set, adding a filler byte so the text keeps a reasonable ASCII appearance in incompatible programs (or programs not set up properly, like SAS would be in this case).
This is a UTF-8 file, where characters may (but need not) take multiple bytes. In this case you will probably see some 'junk' in the data when viewing it this way, and every so often a character that takes up two or three spaces, often entirely full of 'junk' characters. UTF-8 uses between 1 and 4 bytes per character, but looks 'normal' for ASCII characters (i.e., it keeps ASCII unchanged in the 00-7F range and adds multi-byte sequences beyond it).
My gut is that you have a DBCS file, given you're skipping every other line roughly (though not exactly - and you are skipping MORE than that - which makes this a bit odd to me).
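If SAS is not handy, the byte-level check described above (looking for 0D, 0A, 1A and filler 00 bytes) can also be sketched in Python. This is a rough illustration under assumptions: the function name count_terminators and the sample bytes are made up, and for a real file you would read it in binary mode, e.g. open(path, "rb").read().

```python
from collections import Counter


def count_terminators(data: bytes) -> Counter:
    """Count CRLF pairs, lone CR, lone LF, DOS EOF (1A) and NUL (00) bytes."""
    counts = Counter()
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b == 0x0D and i + 1 < n and data[i + 1] == 0x0A:
            counts["CRLF"] += 1
            i += 2  # consume both bytes of the pair
            continue
        if b == 0x0D:
            counts["CR"] += 1
        elif b == 0x0A:
            counts["LF"] += 1
        elif b == 0x1A:
            counts["DOS_EOF"] += 1
        elif b == 0x00:
            counts["NUL"] += 1
        i += 1
    return counts


if __name__ == "__main__":
    sample = b"40001 1 5 GGGG\r\n40002 2 8 GGGG\r\n"
    print(count_terminators(sample))
    print(sample[:20].hex(" "))  # hex view of the first 20 bytes
```

A file with only CRLF counts is a plain Windows text file; many NUL bytes would point towards a two-byte encoding, and a nonzero DOS_EOF count would explain a sudden stop.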
Here is how to see the hidden end-of-line characters in gVim 7.4:
Open gVim 7.4
Open the data file in gVim 7.4
Press the Escape key a few times to make sure you are in normal mode. Note that pressing the Escape key produces no visible change in the gVim 7.4 window.
Type :set list at the bottom of the gVim 7.4 window
Press the enter key
Once I did the above I saw a blue $ at the end of every line, which I assume is an end-of-line hidden character.
Maybe if I am able to remove these blue $ symbols and save the result under a new name SAS might be able to read that new data file. If I figure this out I will post an update.
EDIT
I tried to modify the instructions posted here by John Black to remove the $, but so far have had no luck: Read csv file with hidden or invisible character ^M
I typed :%s/$//g which replaced the blue $ with yellow $. Then I saved the file under a new name and opened the new file with gVim. But when I typed :set list the blue $ were still present in the new file.

Add lines using contents of a list

If I had a sentence - "I know a person named Ted who likes //^$" - or basically a sentence with a lot of characters I didn't feel like escaping, and I wanted to insert copies of that sentence with different names (e.g. John, Mary, Bob)...
Can a for loop do this by copying the sentence, pasting it as the next line, and then subbing out the name? How do I tell it where to paste?
I could also paste the list of names in first and then sub the sentence in around the names, e.g. :s/^/I know a person named /, but I find that if there is a lot of text with a lot of characters to escape, I'll probably make an error somewhere and waste time having to scrutinize the expression.
So then, is there an easier way to grab the contents from the sentence and put it into a substitute command?
You can do this with a macro in vim.
Check here for an explanation of vim macros: http://vim.wikia.com/wiki/Macros.
It's a lot easier than using regex stuff.
Like Chiel92 suggests, a macro is the easiest way. Suppose you have a text file that looks like this:
I know a person named XXX who likes //^$
John
Mary
Bob
Personally I would:
Go to line 1 and copy the line into a named register: :1<enter>"iyy
Go to line 3 and record a macro that cuts the name on the line, pastes the contents of the i register, and then replaces XXX with the name that was on the line:
Go to line 3: :3<enter>
Start recording a macro to register m: qm
Delete the name into a different register: "od$
Paste in the template and remove the now-empty name line: "ipkdd
Replace XXX with the name (^R is Ctrl-R): :s/XXX/^Ro/<enter>
Go to the next line: j
Finish recording: q
For each name line you can now replay the macro: @m, or @@ to repeat the last one
Note: when I make macros and have problems replaying them, I always find it helpful to look at the contents of the recording register. In insert mode you can type ^R^Rm to insert all the commands you recorded.
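For comparison, here is a minimal Python sketch of the same idea: treat the sentence as a template with an XXX placeholder and fill it once per name. Plain string replacement is used, so nothing in the sentence needs escaping. The function name fill_template is hypothetical.

```python
def fill_template(template, names, placeholder="XXX"):
    """Return one copy of the template per name, with the placeholder replaced."""
    # str.replace does literal substitution, so characters like /, ^ and $ are safe.
    return [template.replace(placeholder, name) for name in names]


if __name__ == "__main__":
    template = "I know a person named XXX who likes //^$"
    for line in fill_template(template, ["John", "Mary", "Bob"]):
        print(line)
```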

Text Manipulation - Add / Remove Spaces

I have a txt file containing multiple rows of identical size:
(Examples)
0123456 789 AND abcdefg hij
For all rows in the file I want to add a space after the 4th character shifting the following characters to the right by 1 character. I also want to remove the space from the 8th character (which would be 9th after the initial space is added).
I have cygwin installed so sed is an option.
I also have php and visual studio 2010 installed.
Any help on this would be greatly appreciated.
sed 's/^\(....\)\(...\) /\1 \2/'
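The sed expression captures the first four characters and the next three, drops the space that follows them, and writes the two groups back with a space in between. A rough Python equivalent, using a hypothetical function name shift_space, might look like this:

```python
import re

# Group 1: first four characters; group 2: next three; then the space to remove.
PATTERN = re.compile(r"^(.{4})(.{3}) ")


def shift_space(line: str) -> str:
    """Insert a space after the 4th character and drop the one after the 7th."""
    return PATTERN.sub(r"\1 \2", line, count=1)


if __name__ == "__main__":
    for row in ["0123456 789", "abcdefg hij"]:
        print(shift_space(row))
```

Lines that do not have a space in the eighth position are left unchanged, which matches how the sed command behaves.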
I ended up just using Cygwin -> VIM.
Open input file in Vim Editor.
Go to first line, first character using ":1"
Start recording using "qa" (Where a is the name of your macro)
Move to the 4th character of line.
Enter into edit mode by pressing "insert" or "i"
Type your space character.
Press Esc.
Move to the first non-blank character of the line by pressing "^".
Move to next line's first character.
Press q to quit from recording mode.
Now play whatever you have recorded for any number of times you want.
If you want to play it once, type @a
If you want to repeat it 10 times, type 10@a
(where a is the macro name you defined earlier)
Deleting a space follows the same steps, except you don't need to go into edit mode: just go to the space you want to remove, hit x, and carry on with the rest of the instructions.
