I'm having a data issue with embedded ^A characters, which I can fully reproduce with this small file:
Observe that I have embedded ^A characters. I put them there using vi with the ^V technique.
Now, notice I also put a line break after the "p,q," string on the third line. That was done with the Enter key, but it just puts in a ^A, we can see here:
[ ~/hack ] cat t.csv
a,b,c,d,e
f,g,,i,j
k,l,,n,o
p,q,
,s,t
u,v,w,x,y
[ ~/hack ] xxd < t.csv > u.csv
[ ~/hack ] cat u.csv
0000000: 612c 622c 632c 642c 650a 662c 672c 012c a,b,c,d,e.f,g,.,
0000010: 692c 6a0a 6b2c 6c2c 012c 6e2c 6f0a 702c i,j.k,l,.,n,o.p,
0000020: 712c 0a2c 732c 740a 752c 762c 772c 782c q,.,s,t.u,v,w,x,
0000030: 790a y.
[ ~/hack ]
Note that for the "cat" listing, the double comma has the ^A in it, it just doesn't print to the screen with cat.
But notice also, the normal end-of-line is also a ^A. This is where it gets tricky...how does Linux differentiate between a ^A that is an embedded character, and one that is the end of line?
Note in the hex dump, after the "e", is an 0a, as expected. But there is an 0a between the two commas between 'l' and 'n' too. Yet my manually broken line between 'q' and 's' shows an actual line break--but it's just a 0a like any other!!!
My ultimate need is I need to programmatically find all broken lines like the p,q,.,s,t one, and get rid of those line breaks. But sed can't see that as a line break. That is, if I replace ^A, it would see the ones on the 'f' and 'k' lines, but it can't find the ones on the 'p' line.
So, 1) As a matter of conceptual understanding, can someone explain how on Earth Linux knows the difference between the 0a character that is embedded and one that is an end of line, and 2) What is the piece of code that would find the artificial line breaks and mend the line?
Thanks!
^A is not 0a. ^A (control-A) is ASCII character 1 (01), while the newline/linefeed character (0a, ASCII 10) is ^J (control-J).
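For the second part of the question (mending the broken line), here is a minimal Python sketch. It assumes every record should have as many comma-separated fields as the first line (five in t.csv) and that no field contains a comma or a quoted newline. `mend_lines` is a hypothetical helper name, not something from the question.

```python
def mend_lines(data: bytes, sep: bytes = b",") -> list:
    """Join physical lines until each record has the expected field count."""
    lines = data.rstrip(b"\n").split(b"\n")   # split on 0x0a only; 0x01 is left alone
    expected = lines[0].count(sep) + 1         # assumption: the first line is intact
    mended, pending = [], b""
    for line in lines:
        pending += line                        # glue a short line to what follows it
        if pending.count(sep) + 1 >= expected:
            mended.append(pending)
            pending = b""
    if pending:
        mended.append(pending)                 # keep an incomplete trailing record
    return mended


# The bytes from the xxd dump: 0x01 is ^A, 0x0a is the newline.
sample = b"a,b,c,d,e\nf,g,\x01,i,j\nk,l,\x01,n,o\np,q,\n,s,t\nu,v,w,x,y\n"
for record in mend_lines(sample):
    print(record)
```

The ^A bytes stay inside their fields because only 0x0a is treated as a line break; the split "p,q," and ",s,t" lines come back joined as one five-field record.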
Related
I have a file which is as following
!J INCé0001438823
#1 A LIFESAFER HOLDINGS, INC.é0001509607
#1 ARIZONA DISCOUNT PROPERTIES LLCé0001457512
#1 PAINTBALL CORPé0001433777
$ LLCé0001427189
$AVY, INC.é0001655250
& S MEDIA GROUP LLCé0001447162
I just want to keep the last 10 characters of each line so that it becomes as following:-
0001438823
0001509607
0001457512
0001433777
0001427189
0001655250
0001447162
:%s/.*\(.\{10\}\)/\1
: ex-command
% entire file
s/ substitute
.* anything (greedy)
. followed by any character
\{10\} exactly 10 of them
\( \) put them in a match group
/ replace with
\1 said match group
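The same pattern can be tried outside Vim. This is a small Python sketch of the regex above, assuming every line has at least ten characters; `keep_last_ten` is a name made up for the illustration.

```python
import re

# Greedy ".*" followed by a captured group of exactly ten characters,
# mirroring the Vim pattern .*\(.\{10\}\)
LAST_TEN = re.compile(r".*(.{10})")


def keep_last_ten(line: str) -> str:
    """Replace the whole line with its last ten characters."""
    return LAST_TEN.sub(r"\1", line)


for line in ["!J INCé0001438823", "$AVY, INC.é0001655250"]:
    print(keep_last_ten(line))
```

Because `.*` is greedy, it swallows as much as it can while still leaving ten characters for the group, so the group always ends up holding the tail of the line.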
I would treat this as a shell script problem. Enter the following in vim:
:%! rev|cut -c1-10|rev
The :%! pipes the entire buffer through the shell filter that follows it and replaces the buffer with the filter's output.
For a single line you could use:
$9hd0
$ go to end of line
9h go 9 characters left
d0 delete to beginning of line
Assuming the é character appears only once in a line, and only before your target ten digits, then this would seem to work:
:% s/^.*é//
: command
% all lines
s/ / substitute (i.e., search-and-replace) the stuff between / and /
^ search from beginning of line,
. including any character (wildcard),
* any number of the preceding character,
é finding "é";
// replace with the stuff between / and / (i.e., nothing)
Note that you can type the é character by using ctrl-k e' (control-k, then e, then apostrophe, without spaces). On my system at least, this works in insert mode and when typing the "substitute" command. (To see the list of characters you can invoke with the ctrl-k "digraph" feature, use :dig or :digraph.)
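For comparison, the "delete everything up to é" idea can be sketched in Python without a regex, under the same assumption that é only appears before the target digits. `after_marker` is a hypothetical helper name.

```python
def after_marker(line: str, marker: str = "é") -> str:
    """Return the text after the last marker, or the line unchanged if absent."""
    head, found, tail = line.rpartition(marker)
    return tail if found else line


print(after_marker("#1 PAINTBALL CORPé0001433777"))
```

Like the greedy `^.*é`, `rpartition` cuts at the last é on the line, and a line without é is left as it is.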
I'd like to replace double quotes " characters which come in pairs. Let me explain what I mean.
"Some sentence"
Here double quotes should be replaced because they come in pair.
"Some sentence
Here should not be replaced - there is no matching pair for the first quote character.
I'd like to replace first quote character with „.
❯ echo „ |hexdump -C
00000000 e2 80 9e 0a
And the second quote character with ”
❯ echo ” |hexdump -C
00000000 e2 80 9d 0a
Summing it up, the following:
Hi, "how
are you"
Should become the following after the replacement is made.
Hi, „how
are you”
I've come up with the following code, but it fails to work:
'sed -r s/(\")(.+)(\")/\1\xe2\x80\x9e\3\xe2\x80\x9d/g'
" hi " gives "„"”.
EDIT
As requested in the comments, here comes a sample from a file to be modified. Important note: the file is structured - perhaps it may help. The file is always a srt file, i.e. movie subtitle format.
104
00:10:25,332 --> 00:10:27,876
Kobieta mówi do drugiej:
"Widzisz to, co ja?"
105
00:10:28,001 --> 00:10:30,904
A tamta: "No to co?
Każdy wygląda tak samo."
Your expression doesn't work because you have three capturing groups: The three sets of (). You are putting the 1st (the first quote) and the 3rd (the last quote) in the output and ignoring the 2nd, which is the part you want to keep.
There's no reason to capture the quotes, since you don't want to inject them into the output. Only the bit in the middle needs to be captured.
There is also a flaw: (.*) will itself match a string containing a quote. So /"(.*)"/ would match the entire sequence "one"two", with the capture, (.*), matching one"two. Use [^"]* instead to match a sequence of non-quote characters.
Fixing this, and treating the entire text file as one line with -z, which only works if there are no nul characters in the text file, it appears this works:
sed -zE 's/"([^"]+)"/„\1”/g'
sed -rn ':a;s/"([^"]*)"/„\1”/g;/"/!{p;b;};$p;N;ba'
It substitutes all "xx" with „xx”. If the result contains no more " it is printed and we restart with next line. Else we concatenate the next line and we restart. The $p is just here to print the last lines if they contain a dangling ".
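The whole-file approach is easy to experiment with in Python. This is a sketch only: it assumes the file fits in memory, and `pair_quotes` is a hypothetical name. Like the sed -z version, it pairs quotes strictly left to right, so one stray quote early in the file shifts every pairing after it.

```python
import re

# A quote, a run of non-quote characters (newlines included), a closing quote.
PAIRED = re.compile(r'"([^"]*)"')


def pair_quotes(text: str) -> str:
    """Replace each matched pair of straight quotes with „ and ”."""
    return PAIRED.sub(r"„\1”", text)


print(pair_quotes('A tamta: "No to co?\nKażdy wygląda tak samo."'))
```

A quote with no partner is left untouched because the pattern needs both ends to match.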
vim: insert original line number in g/pattern/move $
I'm debugging event ordering in a log and would like to compare two sets of events by their line numbers in the log. I usually use g/pattern/move $ to gather the interesting lines, but I cannot find a way to insert their original line numbers. Please help.
I tried:
g/pattern/move $; printf("%d",line("."))
but it does not work.
Can't help thinking of something very straightforward, for example:
g/pattern/call append(line('$'), line('.') . ' ' . getline('.'))
A slightly different approach: I have the following mapping in my _vimrc
nnoremap <F3> :redir! @f<cr>:silent g//<cr>:redir! END<cr>:enew!<cr>:put! f<cr>:let @f=@/<cr>:g/^$/d<cr>:let @/=@f<cr>gg
It opens a new buffer with all your search matches, including the line numbers where the matches occurred.
I have figured out a way to first insert the line number on the lines that match the pattern and then move those lines to the end of the file:
:%s,\v^\ze.*pattern,\=line('.') . ' ' ,g | g/pattern/m$
We have two commands:
:%s,\v^\ze.*pattern,\=line('.') . ' ' ,g
, ....................... we are using comma as delimiter
\v ...................... very magic substitution
^ ....................... Regular expression for beginning of line
\ze ..................... ends the match here, so nothing after it is replaced
\=line('.') ............. gets the line number
. ' ' .................. concatenates one space after the number
The second command is separated with |
g/pattern/m$
m$ ....................... moves each matching line to the end of file
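The net effect of the two commands can be sketched in Python for checking the result outside Vim. `number_and_move` is a hypothetical helper, and it uses Python's regex syntax, which differs from Vim's in places.

```python
import re


def number_and_move(lines, pattern):
    """Prefix matching lines with their original 1-based line number
    and move them, in order, after all the non-matching lines."""
    regex = re.compile(pattern)
    kept, moved = [], []
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            moved.append(f"{number} {line}")
        else:
            kept.append(line)
    return kept + moved


log = ["start", "evt A", "noise", "evt B"]
for line in number_and_move(log, r"evt"):
    print(line)
```

The numbers are taken before anything moves, which is why the Vim version has to run the substitution first and the :g move second.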
I use this to detect space in a string in Lua:
if string.byte(" ")==32 then blah blah
What is the return number (instead of 32) for enter key or new line in Lua?
These numbers are the ASCII codes of the characters, so newline is 10. Here's a chart for future reference (but only up to 127, as extended ASCII is not supported).
You can also print a list with the following code:
for i=1,127 do
print(i .. " = " .. string.char(i))
end
However, control characters (such as newline) have no visible glyph, so they are hard to identify in that printed list.
You can check them with the \n and \r characters.
> =string.byte '\r'
13
> =string.byte '\n'
10
I don't know the number, but you could find it by running print(string.byte("\n"))
Here you can see an output of "cat tcl.log":
Discovered serial numbers for slot 1 olt 1:
sernoID Vendor Serial Number sernoID Vendor Serial Number
5 ZNTS 032902A6
And that's how it looks in VIM:
^MDiscovered serial numbers for slot 1 olt 1:
^MsernoID Vendor Serial Number sernoID Vendor Serial Number
^M<SPACE> for next page, <CR> for next line, A for all, Q to quit^H ^H^H ^H^...
5 ZNTS 032902A6
I don't mind the ^M and ^H characters, I know how to get rid of them. The problem is that for some reason my C++ program (unlike cat) is seeing the line starting with "< SPACE >". What can I do about it? I'm using the fstream library to read the log file and I want it to ignore the line I mentioned. I tried to do something like this:
std::ofstream logFinal("logFinal");
std::ifstream log("tcl.log");
std::string temp;
while (std::getline(log, temp)){
if (temp.find("SPACE") != std::string::npos){
temp = "";
}
logFinal << temp << std::endl;
}
But for some reason it doesn't find any "SPACE" in the temp variable. It looks like the "< SPACE >" is some kind of special character that I've never heard of.
You're obtaining that log file from/via some sort of program that does paging. (It could be buried inside things; these things happen.) That paging program prints a message like this at the end of a page:
<SPACE> for next page, <CR> for next line, A for all, Q to quit
The <SPACE> is just part of some message with human-readable text; it's seven very ordinary characters. HOWEVER, the ^H that follow it are more interesting, as they're really backspace characters; it's where the preceding characters are deleted again to make way for the next line of real output.
The easiest fix (assuming you're on, or have easy access to, a Unix/Linux system) is to feed that log file through col -b (the col program with the -b option, which does backspace elimination). Check out this little cut-n-paste from a shell session:
bash$ echo -e 'abc\b\b\bdef'
def
bash$ echo -e 'abc\b\b\bdef' | od -c
0000000 a b c \b \b \b d e f \n
0000012
bash$ echo -e 'abc\b\b\bdef' | col -b | od -c
0000000 d e f \n
0000004
(The \b should be the same as ^H in your log file.)
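If running col is not an option, the same clean-up can be done in code. Below is a minimal Python sketch of what a terminal does with backspaces on a single line. It is not col's actual implementation, `apply_backspaces` is a made-up name, and treating \r as "return to column 0" is an assumption about what you want for the ^M characters.

```python
def apply_backspaces(line: str) -> str:
    """Replay a line the way a terminal would: \b steps the cursor back,
    \r returns to column 0, and any other character overwrites at the cursor."""
    cells = []   # characters currently "on screen"
    col = 0      # cursor position
    for ch in line:
        if ch == "\b":
            col = max(col - 1, 0)
        elif ch == "\r":
            col = 0
        else:
            if col < len(cells):
                cells[col] = ch      # overwrite what was there
            else:
                cells.append(ch)
            col += 1
    return "".join(cells)


print(apply_backspaces("abc\b\b\bdef"))
```

A pager prompt that was erased with backspace, space, backspace sequences comes back as blanks, so it can be dropped by checking whether the cleaned line is empty after stripping whitespace.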