I'd like to replace double-quote characters (") that come in pairs. Let me explain what I mean.
"Some sentence"
Here the double quotes should be replaced because they come as a pair.
"Some sentence
Here the quote should not be replaced: there is no matching quote for the first quote character.
I'd like to replace first quote character with „.
❯ echo „ |hexdump -C
00000000 e2 80 9e 0a
And the second quote character with ”
❯ echo ” |hexdump -C
00000000 e2 80 9d 0a
Summing it up, the following:
Hi, "how
are you"
should become the following after the replacement is made:
Hi, „how
are you”
I've come up with the following code, but it fails to work:
sed -r 's/(\")(.+)(\")/\1\xe2\x80\x9e\3\xe2\x80\x9d/g'
" hi " gives "„"”.
EDIT
As requested in the comments, here is a sample from a file to be modified. Important note: the file is structured, which may help. The file is always an SRT file, i.e. the movie subtitle format.
104
00:10:25,332 --> 00:10:27,876
Kobieta mówi do drugiej:
"Widzisz to, co ja?"
105
00:10:28,001 --> 00:10:30,904
A tamta: "No to co?
Każdy wygląda tak samo."
Your expression doesn't work because you have three capturing groups: The three sets of (). You are putting the 1st (the first quote) and the 3rd (the last quote) in the output and ignoring the 2nd, which is the part you want to keep.
There's no reason to capture the quotes, since you don't want to inject them into the output. Only the bit in the middle needs to be captured.
There is also a second flaw: the (.+) (or (.*)) will itself match a string containing a quote. So /"(.*)"/ would match the entire sequence "one"two", with the capture, (.*), matching one"two. Use [^"]* to match a sequence of non-quote characters.
Fixing this, and treating the entire text file as one line with -z, which only works if there are no nul characters in the text file, it appears this works:
sed -zE 's/"([^"]+)"/„\1”/g'
sed -rn ':a;s/"([^"]*)"/„\1”/g;/"/!{p;b;};$p;N;ba'
It replaces every "xx" with „xx”. If the result contains no more ", the pattern space is printed and we restart with the next line. Otherwise we append the next line (N) and try again. The $p is only there to print the final lines if they still contain a dangling ".
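To see the whole-file approach on the question's sample, here is a minimal sketch. It assumes GNU sed (for -z and -E) and input without NUL bytes; the function name fix_quotes is made up for illustration.

```shell
# Hypothetical helper: pair up straight double quotes across the whole input.
# -z makes GNU sed read the input as one record, so a pair may span lines.
fix_quotes() {
  sed -zE 's/"([^"]+)"/„\1”/g'
}

# Paired quotes spanning two lines are replaced.
printf 'Hi, "how\nare you"\n' | fix_quotes

# A single unpaired quote is left alone.
printf '"Some sentence\n' | fix_quotes
```

One caveat with this sketch: because the whole file is a single record, a stray unpaired quote will pair up with the next quote that appears anywhere later in the file, even several subtitles further down.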
I have a file that looks like this:
!J INCé0001438823
#1 A LIFESAFER HOLDINGS, INC.é0001509607
#1 ARIZONA DISCOUNT PROPERTIES LLCé0001457512
#1 PAINTBALL CORPé0001433777
$ LLCé0001427189
$AVY, INC.é0001655250
& S MEDIA GROUP LLCé0001447162
I just want to keep the last 10 characters of each line so that it becomes the following:
0001438823
0001509607
0001457512
0001433777
0001427189
0001655250
:%s/.*\(.\{10\}\)/\1
: ex command
% entire file
s/ substitute
.* anything (greedy)
. followed by any character
\{10\} exactly 10 of them
\( \) put them in a match group
/ replace with
\1 said match group
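Outside Vim, the same substitution can be sketched with sed. This assumes the last 10 characters of each line are plain ASCII, as in the sample; the function name last10 is made up for illustration.

```shell
# Hypothetical helper: keep only the last 10 characters of every line.
# LC_ALL=C makes sed work byte by byte, so an "é" in any encoding
# cannot stop ".*" from matching the start of the line.
last10() {
  LC_ALL=C sed -E 's/^.*(.{10})$/\1/'
}

printf '!J INCé0001438823\n$AVY, INC.é0001655250\n' | last10
```

Lines shorter than 10 bytes do not match the pattern and pass through unchanged.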
I would treat this as a shell script problem. Enter the following in vim:
:%! rev|cut -c1-10|rev
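The :%! pipes the entire buffer through the shell filter that follows and replaces the buffer with its output. Here rev reverses each line, cut -c1-10 keeps the first 10 characters, and the second rev restores the original order.
For a single line you could use: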
$9hd0
$ go to end of line
9h go 9 characters left
d0 delete to beginning of line
Assuming the é character appears only once in a line, and only before your target ten digits, then this would seem to work:
:% s/^.*é//
: command
% all lines
s/ / substitute (i.e., search-and-replace) the stuff between / and /
^ search from beginning of line,
. including any character (wildcard),
* any number of the preceding character,
é finding "é";
// replace with the stuff between / and / (i.e., nothing)
Note that you can type the é character by using ctrl-k e' (control-k, then e, then apostrophe, without spaces). On my system at least, this works in insert mode and when typing the "substitute" command. (To see the list of characters you can invoke with the ctrl-k "digraph" feature, use :dig or :digraph.)
I have a large (4 GB) Windows .csv text file (each line ends in "\r\n") in a Linux environment. It was supposed to be a delimited file (delimiter = '|', text qualifier = '"'), with each field separated by a pipe and enclosed in double quotes. Any narrative text field with embedded double quotes was supposed to have each double quote escaped with a second double quote (i.e. "the quick "brown" fox" was supposed to be represented as "the quick ""brown"" fox"). Unfortunately, the embedded double quotes were not escaped. Furthermore, the text fields may include embedded line breaks (i.e. Windows CRLF, \r\n), which need to be retained.
Sample lines might look as follows:
"1234567890123456"|"2016-07-30"|"2016-08-01"|"123"|"456"|"789"|"text narrative field starts\r\n
with text lines that may have embedded double quotes "For example"\r\n
and may include measurements such as 1/2" x 2" with \r\n
the text continuing and includes embedded line breaks \r\n
which will finally be terminated with a double quote"\r\n
"9876543210654321"|"2017-01-31"|"2018-08-01"|"123"|"456"|"789"|"text narrative field"\r\n
"2345678901234567"|"...."\r\n
with the objective to have the output appear as follows:
~1234567890123456~|~2016-07-30~|~2016-08-01~|~123~|~456~|~789~|~text narrative field starts\r\n
with text lines that may have embedded double quotes ""For example""\r\n
and may include measurements such as 1/2"" x 2"" with \r\n
the text continuing and includes embedded line breaks \r\n
which will finally be terminated with a double quote~\r\n
~9876543210654321~|~2017-01-31~|~2018-08-01~|~123~|~456~|~789~|~text narrative field~\r\n
~2345678901234567~|~....~\r\n
The solution I was attempting to implement was to:
1. SUCCESSFUL: change all the "|" sequences to ~|~
2. SUCCESSFUL: change the double quote (") at the start of the first line and at the end of the last line to a tilde (~)
3. change the ending and starting double quotes to tildes wherever a line ends in a double quote followed by a CRLF (\r\n) (e.g. ..."\r\n) and the next line begins with a double quote, followed by a 16-digit number and a tilde (e.g. "1234567890123456~...), i.e. it is the start of a new record
4. convert all remaining double quote characters to two successive double quotes (change " to "")
5. then reverse the first 3 steps above, changing all ~ back to double quotes.
I started by using sed to replace all strings with double quote, followed by a pipe, followed by a double quote (i.e. "|") with a tilde, pipe, tilde (i.e. ~|~). I then manually replaced the first and last doublequote in the file with a tilde.
This is where I ran into issues as I tried to count the number of occurrences where a line ends with a doublequote(") and the start of the next line begins with a doublequote followed by a 16 digit number and a "~" which will tell me the actual number of csv records in the file (minus one) as opposed to the number of lines. I attempted to do this using grep: grep '"\r\n"\d{16}~' | wc -l but that didn't work
I then need to replace those double quotes wherein a double quote ends a record and the succeeding record begins with a double quote followed by a 16 digit number and a "~" leaving everything else intact.
I tried to use sed: sed 's/"\r\n"(\d{16}~)/~\r\n~\1' windows_file.txt but it is not working as hoped.
I would welcome any recommendations as to how to accomplish the above.
The script below does what you expect using awk, except for the very last line of the file, since awk cannot tell where that final record ends.
That could be fixed by counting the lines in the file first, but doing so would be impractical for a file this big.
Looking at the data structure, records are separated by "\r\n" (quote, CRLF, quote) and fields by "|" (quote, pipe, quote), so let's use those as awk's separators.
gawk 'BEGIN{
RS="\"\r\n\"" # input record separator RS, 2 double quotes with a DOS line ending in the middle
FS="\"\\|\"" # input field separator FS, 2 double quotes with a pipe in the middle
ORS="~\r\n~" # your record separator
OFS="~|~" # your field separator
} {
$1=$1 # trick awk into believing something has changed
if (NR == 1){ # first record, replace first character
print "~" substr($0,2)
}else{
print $0
}
} ' test.txt
Result (assuming lines end with \r\n):
~1234567890123456~|~2016-07-30~|~2016-08-01~|~123~|~456~|~789~|~text narrative field starts
with text lines that may have embedded double quotes "For example"
and may include measurements such as 1/2" x 2" with
the text continuing and includes embedded line breaks
which will finally be terminated with a double quote~
~9876543210654321~|~2017-01-31~|~2018-08-01~|~123~|~456~|~789~|~text narrative field~
~10654321~|~2018-09-31~|~2018-08-01~|~123~|~456~|~789~|~asdasdasdasdad asasda"
~
~
PS: this will break if a field contains a line that starts with " while the preceding line within the same field ends with "\r\n, since that sequence matches the proposed RS. For example:
"10654321"|"2018-09-31"|"2018-08-01"|"123"|"456"|"789"|"asdasdasdasdad asasda"\r\n
"some more"\r\n
"22222"|".... (another record)
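Building on the same RS/FS idea, here is a sketch that skips the tilde round trip and writes the escaped output directly: it splits on the same separators, doubles every quote left inside a field, and puts the enclosing quotes back. It assumes an awk that treats a multi-character RS as a regular expression (gawk and mawk do), and it shares the limitation described in the PS above. The function name requote_csv is made up for illustration.

```shell
# Hypothetical helper: escape embedded double quotes in a "|"-delimited,
# quote-enclosed file with CRLF record endings. Reads stdin, writes stdout.
requote_csv() {
  awk 'BEGIN {
    RS  = "\"\r\n\""    # record separator: quote, CRLF, quote
    FS  = "\"\\|\""     # field separator: quote, pipe, quote
    OFS = "\"|\""       # rebuild records with the same separator
  }
  {
    if (NR == 1) sub(/^"/, "", $1)     # opening quote of the file
    sub(/"\r\n$/, "", $NF)             # closing quote + CRLF of the last record
    for (i = 1; i <= NF; i++)
      gsub(/"/, "\"\"", $i)            # double every embedded quote
    printf "\"%s\"\r\n", $0            # restore the enclosing quotes and CRLF
  }'
}

printf '"1"|"a"|"say "hi"\r\nand 2" wide"\r\n"2"|"b"|"c"\r\n' | requote_csv
```

Because the record separator is consumed by awk, the sketch strips the leftover quote at the very start and very end of the input itself, which also takes care of the last-record problem as long as the file ends with "\r\n.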
The following prints the entire content of the line after "B. "
perl -ne'print if /B[.] (.*)/s' $string > file
How can I match/print the line only if there is no other character before the "B. "? In other words, if there is a character before the "B. " (e.g. "TAB. "), skip the line and do not print it.
The correct "B." is always on a new line, the only correct line to match appears as follows:
B. some text here
A regex with a leading caret (^) matches only at the start of the line. The pattern /^B[.] (.*)/s should get you the result you're looking for.
Put ^ in front of the B; it anchors the match to the start of the line. So your regex should be /^B\. (.*)/. The s flag is then not needed.
I am new to sed and am lost with this problem. There is one line in a text file:
start_year = 1952,
I need to replace the start year number with the year number in an array, say time_c=(1999 01 01). So I tested the following command:
txt=" start_year = 1952,"
echo "$txt" | sed -r 's/([:blank:]*start_year[:blank:]*=) (.*)/\1 [:blank:]${time_c[0]}/g'
But this only gives back the original line stripped of white spaces:
start_year = 1952,
It seems that sed only recognizes the pattern if I remove the equal sign, because if I do
echo "$txt" | sed -r 's/([:blank:]*start_year[:blank:]*) (.*)/\1 = [:blank:]${time_c[0]}/g'
Then the result is:
start_year = [:blank:]${time_c[0]}
Now sed seems to have recognized correctly. But then, why is it not interpreting the [:blank:] and my variable?
Another minor thing is that I would like to keep the original whitespaces in the replaced text, if possible. It does not matter for my program, but it makes the replaced line align nicer with the others.
Could someone help? Thanks a lot!
Try this:
$ time_c=(1999 01 01); echo " start_year = 1952," |
sed -r "s/(\s*start_year\s+=\s*)(.*)(,)/\1${time_c[0]}\3/"
start_year = 1999,
Shells don't do substitutions of any sort inside single quotes.
Also, the character classes are written [[:space:]], with doubled brackets. You can do things like [^[:space:][:digit:]] too (not a space or digit). And the replacement text most probably shouldn't contain [:space:] at all.
This works:
time_c=(1999 01 01)
txt=" start_year = 1952,"
echo "$txt" | sed -r 's/([[:blank:]]*start_year[[:blank:]]*=) (.*)/\1 '"${time_c[0]}"'/g'
Note how the expansion of the shell variable is done outside of single quotes and inside of double quotes — that's generally a good idea (outside single quotes is crucial; inside double quotes is a good idea).
It produces:
start_year = 1999
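If the original indentation and the trailing comma should survive, one option is to capture everything up to the number and replace only the digits. This is a sketch under that assumption; the function name set_start_year is made up, and the new year is passed as an argument (for example "${time_c[0]}" in bash).

```shell
# Hypothetical helper: replace only the digits after "start_year =",
# keeping the leading blanks, the spacing around "=" and the trailing comma.
set_start_year() {
  # Double quotes let the shell expand $1; \1 still reaches sed unchanged.
  sed -E "s/^([[:blank:]]*start_year[[:blank:]]*=[[:blank:]]*)[0-9]+/\1$1/"
}

printf '   start_year = 1952,\n' | set_start_year 1999
```

sed back-references are a single digit, so \1 followed directly by 1999 is read as group 1 and then the literal text 1999.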
I'm having a data issue with embedded ^A characters, which I can fully reproduce with this small file:
Observe that I have embedded ^A characters. I put them there using vi with the ^V technique.
Now, notice I also put a line break after the "p,q," string on the third line. That was done with the Enter key, but it just puts in a ^A, we can see here:
[ ~/hack ] cat t.csv
a,b,c,d,e
f,g,,i,j
k,l,,n,o
p,q,
,s,t
u,v,w,x,y
[ ~/hack ] xxd < t.csv > u.csv
[ ~/hack ] cat u.csv
0000000: 612c 622c 632c 642c 650a 662c 672c 012c a,b,c,d,e.f,g,.,
0000010: 692c 6a0a 6b2c 6c2c 012c 6e2c 6f0a 702c i,j.k,l,.,n,o.p,
0000020: 712c 0a2c 732c 740a 752c 762c 772c 782c q,.,s,t.u,v,w,x,
0000030: 790a y.
[ ~/hack ]
Note that for the "cat" listing, the double comma has the ^A in it, it just doesn't print to the screen with cat.
But notice also, the normal end-of-line is also a ^A. This is where it gets tricky...how does Linux differentiate between a ^A that is an embedded character, and one that is the end of line?
Note in the hex dump, after the "e", is an 0a, as expected. But there is an 0a between the two commas between 'l' and 'n' too. Yet my manually broken line between 'q' and 's' shows an actual line break--but it's just a 0a like any other!!!
My ultimate need is I need to programmatically find all broken lines like the p,q,.,s,t one, and get rid of those line breaks. But sed can't see that as a line break. That is, if I replace ^A, it would see the ones on the 'f' and 'k' lines, but it can't find the ones on the 'p' line.
So, 1) As a matter of conceptual understanding, can someone explain how on Earth Linux knows the difference between the 0a character that is embedded and one that is an end of line, and 2) What is the piece of code that would find the artificial line breaks and mend the line?
Thanks!
^A is not 0a. ^A (control-A) is ASCII character 1 (01), while the newline/linefeed character (0a, ASCII 10) is ^J (control-J).
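Since the broken line is split by an ordinary newline (0a) rather than by ^A (01), one way to mend it is to count fields instead of searching for a special byte. The sketch below assumes every complete record has a fixed number of comma-separated fields (5 in the sample) and that fields never contain a comma themselves; the function name mend_lines is made up for illustration.

```shell
# Hypothetical helper: join physical lines until a record has enough fields.
# Usage: mend_lines FIELD_COUNT < input
mend_lines() {
  awk -v want="$1" '
    {
      buf = buf $0                       # glue this line onto any pending fragment
      if (split(buf, parts, ",") >= want) {
        print buf                        # record is complete
        buf = ""
      }
    }
    END { if (buf != "") print buf }     # flush an incomplete trailing record
  '
}

printf 'a,b,c,d,e\nf,g,\001,i,j\np,q,\n,s,t\nu,v,w,x,y\n' | mend_lines 5
```

The embedded ^A bytes pass through untouched; only the stray line break between "p,q," and ",s,t" is removed.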