Removing quotes from a text file in python - text

I have to remove quotes from a relatively large text file. I have looked at the questions that may have matched this one, yet I am still not able to remove the quotes from specific lines from the text file.
This is what a short segment from the file looks like:
1 - "C1".
E #1 - "C1".
2 - "C2".
1
2
E #2 - "C2".
I would like 1 - "C1". to be replaced by c1, and E #1 - "C1". to be replaced by E c1. I tried doing the replacement in Python, but I get an error because of the double quotes.
I tried:
input=open('file.text', 'r')
output=open(newfile.txt','w')
clean==input.read().replace("1 - "C1".","c1").replace("E #1 - "C1"."," E c1")
output.write(clean)
I have also tried with sed:
sed 's\"//g'<'newfile.txt'>'outfile.txt'.
But yet another syntax error.

If you're sure you want literal replacement, you just need to escape the quotes or use single quotes: replace('1 - "C1".', "c1") and so on (and to correct a couple of syntax-breaking typos). This, however, will only work on the first line, so there's no point in reading the whole file then. To do a smarter job, you could use re.sub:
import re

with open('file.text') as infile, open('newfile.txt', 'w') as outfile:
    for line in infile:
        # 1 - "C1". becomes c1; the backreference requires both numbers to match
        line = re.sub(r'^(?P<num>\d+) - "C(?P=num)"\.$',
                      lambda m: 'c' + m.group('num'), line)
        # E #1 - "C1". becomes E c1
        line = re.sub(r'^E #(?P<num>\d+) - "C(?P=num)"\.$',
                      lambda m: 'E c' + m.group('num'), line)
        outfile.write(line)
In the sed command you seem to try to simply remove the quotes:
sed 's/"//g' newfile.txt > outfile.txt
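As a minimal sketch of the quoting point above: a Python string that contains double quotes can be delimited with single quotes, or the inner quotes can be escaped with a backslash. The helper name strip_double_quotes is hypothetical and only mirrors what the sed command does.

```python
# Two equivalent ways to write a literal that contains double quotes.
single_quoted = '1 - "C1".'
escaped = "1 - \"C1\"."


def strip_double_quotes(text: str) -> str:
    """Remove every double quote, like sed 's/"//g' (hypothetical helper)."""
    return text.replace('"', '')


def literal_replace(text: str) -> str:
    """Literal replacement of the two exact strings from the question."""
    return text.replace('E #1 - "C1".', 'E c1').replace('1 - "C1".', 'c1')
```

Note that literal_replace only handles the exact strings for number 1; lines such as 2 - "C2". are left alone, which is why the regex version is more useful for the whole file.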

Related

How to append lines that match two patterns to the previous line in a file?

I have a csv file where what's supposed to be a single line, is split across several. I need help to find a way to join the lines that are split. Also, the number of fields (separated by ,) is not fixed.
A correct line has the following pattern:
X,X,X,"()",Y,H where X can be any number of fields. However, the trailing part of the line, "()",Y,H, is fixed. Y and H are both one word.
The issue is that this line can appear as (or any variant of this):
X,X,
X, "()"
,Y,H
What I need is a way (awk, sed) of appending the lines that don't have 24 or more commas and do not end with ",Y,H, to the previous line.
Please bear in mind that it's a large file, although I have 256 GB of RAM.
Example
Correct lines
a, b, c, "()", h, k
a, b, c, d, "()", h, k
Same lines in the file
First line
a, b, c,
"()", h, k
Second line
a, b, c, d, "()"
, h
, k
So far I've tried this (not working):
awk '/"[:space:]*,[:space:]*[:alpha:]+[:space:]*,[:space:]*[:alpha:]+$/{print}' check.csv
to try to find the lines ending with ", X, Y where X and Y are words.
Also, as the minimum number of correct fields is 24, I've used:
awk 'NF<24{print}' check.csv
to filter out lines with less than 24 fields.
My idea is to detect lines that match both regular expressions and append them to the previous line.
Thank you!
This might work for you (GNU sed):
sed '/"()", *[^,]\+, *[^,]\+$/b;:a;N;s/\n//;/"()", *[^,]\+, *[^,]\+$/!ba;P;D' file
Do not process a correct line, just bail out.
Otherwise append the next line, remove the introduced newline and try and match again.
Repeat until a match, then print/delete the first line and repeat.
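The accumulate-and-test loop described above can also be sketched in Python. This is only an illustration under assumptions: the names join_split_records and COMPLETE are made up for this sketch, and a record is taken to be complete once the buffer ends with "()" followed by exactly two one-word fields.

```python
import re
from typing import Iterable, Iterator

# Assumed definition of a complete record: it ends with "()", word, word.
COMPLETE = re.compile(r'"\(\)"\s*,\s*[^,\s]+\s*,\s*[^,\s]+\s*$')


def join_split_records(lines: Iterable[str]) -> Iterator[str]:
    """Glue physical lines together until the buffer looks like a full record."""
    buffer = ""
    for raw in lines:
        buffer += raw.rstrip("\n")      # drop the newline, as s/\n// does
        if COMPLETE.search(buffer):
            yield buffer
            buffer = ""
    if buffer:                          # trailing incomplete record, kept as-is
        yield buffer
```

Because it is a generator, it can be fed an open file object and processes the input one line at a time, so file size should not be a problem.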
perl -lanF, -e 'push @L, grep length, @F; if ($L[-3] eq q/"()"/) { print join ",", @L; @L=() }' file
use -l -n -e to loop over input lines w/o printing, append linebreaks to output
use -a -F, to create @F array by splitting input on commas
push @L, grep length, @F - push nonempty fields onto @L
if ($L[-3] eq q/"()"/) - if the 3rd to last accumulated field is the magic marker:
print join ",", @L - print all of @L joined with commas
@L=() - reset @L

Replace all double quotes only in Nth Column

I have a file like this
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq"rst"|uv||xyz
abc|def||ghi|jklm|"nopq"r"st"|uv||xyz
The 6th column may be double quoted. I want to replace all the occurrences of double quotes in this field with a backslash followed by a double quote (\").
I wish my output to look like
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
I have tried combinations of the commands below, but each attempt falls short:
sed -i 's/\"/\\\"/2' file.txt (this replaces only 2nd occurrence)
sed -i 's/\"/\\\"/2g' file.txt (this replaces only 2nd occurrence and all rest also)
My file will have millions of rows, so I probably need a sed or awk command.
Please help.
You may use this awk solution in any version of awk:
awk 'BEGIN {FS=OFS="|"} {
  c1 = substr($6, 1, 1)
  c2 = substr($6, length($6), 1)
  s = substr($6, 2, length($6)-2)
  gsub(/"/, "\\\"", s)
  $6 = c1 s c2
} 1' file
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
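For comparison, the same per-field logic can be sketched in Python. The function name escape_inner_quotes and its parameters are hypothetical, and the sketch assumes that quotes count as enclosing delimiters only when the field both starts and ends with a double quote; any other quote in the field is escaped.

```python
def escape_inner_quotes(line: str, column: int = 6, sep: str = "|") -> str:
    """Escape double quotes inside one field, keeping enclosing quotes intact."""
    fields = line.split(sep)
    if len(fields) < column:
        return line                       # line is too short, leave it alone
    field = fields[column - 1]
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        # Keep the outer pair, escape only what sits between them.
        field = '"' + field[1:-1].replace('"', '\\"') + '"'
    else:
        field = field.replace('"', '\\"')
    fields[column - 1] = field
    return sep.join(fields)
```

For a file with millions of rows this will likely be slower than awk or sed, so it is meant as a readable reference for the rule rather than a replacement for the one-liners.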
If this isn't all you need then edit your question to provide more truly representative sample input/output including cases that this doesn't work for:
$ sed 's/"/\\"/g; s/|\\"/|"/g; s/\\"|/"|/g' file
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
The above will work in any sed.
This might work for you (GNU sed):
sed -E 's/[^|]*/\n&\n/6 # isolate the 6th field
h # make a copy
s/"/\\"/g # replace " by \"
s/\\(")\n|\n\\(")/\1\n\2/g # repair start and end "s
H # append amended line to copy
g # get copies to current line
s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file # swap fields
Surround the 6th field by newlines and make a copy in the hold space.
Replace all "'s by \"'s and remove the \'s at the start and end of the field if the field begins and ends in "'s
Append the amended line to the copy and replace the current line by the doubled line.
Using pattern matching replace copied line 6th field by the amended one.

sed - replace paragraph that begins with A and ends with B with strings [duplicate]

I have a bunch of text files in which many paragraphs begin with printf("<style type=\"text/css\">\n"); and end with printf("</style>\n");
For example,
A.txt
...
...
printf("<style type=\"text/css\">\n");
...
...
...
printf("</style>\n"); // This line may start with several spaces!
...
...
I want this part replaced with some function call.
How can I do this with sed?
Would you try the following:
sed '
:a ;# define a label "a"
/printf("<style type=\\"text\/css\\">\\n");/ { ;# if the line matches the string, enter the {block}
N ;# read the next line
/printf("<\/style>\\n");/! ba ;# if the line does not include the ending pattern, go back to the label "a"
s/.*/replacement function call/ ;# replace the text
}
' A.txt
To replace a block of lines starting with A and ending with B with the text R, use sed '/A/,/B/cR' (this one-line form of the c command is a GNU sed extension). Just make sure that you correctly escape the special symbols /, \ and ; in your strings. For readability I used variables:
start='printf("<style type=\\"text\/css\\">\\n")\;'
end='printf("<\/style>\\n")\;'
replacement='somefunction()\;'
sed "/^ *$start/,/^ *$end/c$replacement" yourFile
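If escaping the markers for sed becomes error-prone, the range replacement can be sketched in Python with plain string comparison instead of regular expressions. The function name replace_blocks is hypothetical. The sketch assumes the lines come without trailing newlines (for example from str.splitlines()) and, as a design choice of this sketch, leaves a block unchanged when its end marker is never found.

```python
from typing import Iterable, List


def replace_blocks(lines: Iterable[str], start: str, end: str,
                   replacement: str) -> List[str]:
    """Replace every block from a start-marker line to an end-marker line."""
    out: List[str] = []
    pending: List[str] = []          # lines of a block whose end is not seen yet
    for line in lines:
        text = line.strip()          # tolerate leading spaces on marker lines
        if pending:
            pending.append(line)
            if text.startswith(end):
                out.append(replacement)   # the whole block collapses to one line
                pending = []
        elif text.startswith(start):
            pending.append(line)
        else:
            out.append(line)
    out.extend(pending)              # unterminated block: keep it unchanged
    return out
```

Raw string literals such as r'printf("</style>\n");' let the markers be written exactly as they appear in the source files, without an extra layer of escaping.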

Using sed to delete specific lines after LAST occurrence of pattern

I have a file that looks like:
this: name
this: age
Remove these lines and space above.
Remove here too and space below
Keep everything below here.
I don't want to hardcode 2, as the number of lines containing "this" can change. How can I delete the 4 lines after the last occurrence of the string? I am trying sed -e '/this: /{n;N;N;N;N;d}', but it deletes after the first occurrence of the string.
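Could you please try the following? It reads the file twice.
awk '
FNR==NR{                   # first pass: remember the last line that contains "this"
  if($0~/this/){
    line=FNR
  }
  next
}
FNR<=line || FNR>(line+4)  # second pass: skip the 4 lines after that line
' Input_file Input_file
With the sample shown, the output is: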
this: name
this: age
Keep everything below here.
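The two-pass idea (first locate the last match, then skip a fixed number of lines after it) can be sketched in Python as follows. The function name drop_after_last_match is hypothetical, and the sketch assumes the whole file fits in memory as a list of lines.

```python
import re
from typing import List


def drop_after_last_match(lines: List[str], pattern: str, count: int) -> List[str]:
    """Remove `count` lines that follow the last line matching `pattern`."""
    last = -1
    for index, line in enumerate(lines):     # pass 1: find the last match
        if re.search(pattern, line):
            last = index
    if last < 0:
        return list(lines)                   # no match: nothing to remove
    # pass 2: keep everything up to the match, then skip `count` lines
    return lines[:last + 1] + lines[last + 1 + count:]
```

With the sample input (including its blank lines) and count=4, only the two this: lines and the final line remain.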
You can also use this minor change to make your original sed command work.
sed '/^this:/ { :k ; n ; // b k ; N ; N ; N ; d }' input_file
It uses a loop which prints the current line and reads the next one (n) while it keeps matching the regex (the empty regex // recalls the latest one evaluated, i.e. /^this:/, and the command b k goes back to the label k on a match). Then you can append the next 3 lines and delete the whole pattern space as you did.
Another possibility, more concise, using GNU sed could be this.
sed '/^this:/ b ; /^/,$ { //,+3 d }' input_file
This one prints any line beginning with this: (b without label goes directly to the next line cycle after the default print action).
On the first line not matching this:, two nested ranges are triggered. The outer range is "one-shot": it is triggered right away by /^/, which matches any line, and then stays triggered up to the last line ($). The inner range is a "toggle" range. It is also triggered right away because // recalls /^/ on this line (and only on this line, hence the one-shot outer range), and it then stays triggered for 3 additional lines (the end address +3 is a GNU extension). After that, /^/ is no longer evaluated, so the inner range cannot trigger again, because // then recalls /^this:/ (which is short-circuited early).
This might work for you (GNU sed):
sed -E ':a;/this/n;//ba;$!N;$!ba;s/^([^\n]*\n?){4}//;/./!d' file
If the pattern space (PS) contains this, print the PS and fetch the next line.
If the following line contains this repeat.
If the current line is not the last line, append the next line and repeat.
Otherwise, remove the first four lines of the PS and print the remainder.
Unless the PS is empty in which case delete the PS entirely.
N.B. This only reads the file once. Also, the OP asks "How can I delete 4 lines after the last occurrence of the string", but the example would seem to expect 5 lines to be deleted.

How can I swap two lines using sed?

Does anyone know how to replace line a with line b and line b with line a in a text file using the sed editor?
I can see how to replace a line in the pattern space with a line that is in the hold space (i.e., /^Paco/x or /^Paco/g), but what if I want to take the line starting with Paco and replace it with the line starting with Vinh, and also take the line starting with Vinh and replace it with the line starting with Paco?
Let's assume for starters that there is one line with Paco and one line with Vinh, and that the line Paco occurs before the line Vinh. Then we can move to the general case.
#!/bin/sed -f
/^Paco/ {
:notdone
N
s/^\(Paco[^\n]*\)\(\n\([^\n]*\n\)*\)\(Vinh[^\n]*\)$/\4\2\1/
t
bnotdone
}
After matching /^Paco/ we read into the pattern buffer until s// succeeds (or EOF: the pattern buffer will be printed unchanged). Then we start over searching for /^Paco/.
cat input | tr '\n' 'ç' | sed 's/\(ç__firstline__\)\(ç__secondline__\)/\2\1/g' | tr 'ç' '\n' > output
Replace __firstline__ and __secondline__ with your desired regexps. Be sure to substitute any instances of . in your regexp with [^ç]. If your text actually has ç in it, substitute with something else that your text doesn't have.
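Try this awk script (saved here as test.sh):
s1="$1"
s2="$2"
awk -v s1="$s1" -v s2="$s2" '
{ a[++d]=$0 }                    # buffer every line
$0~s1{ h=$0; ind=d }             # remember the first line and its slot
$0~s2{
  a[ind]=$0                      # put the second line where the first one was
  for(i=1;i<d;i++){ print a[i] }
  print h                        # print the first line in place of the second
  delete a; d=0
}
END{ for(i=1;i<=d;i++){ print a[i] } }' file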
output
$ cat file
1
2
3
4
5
$ bash test.sh 2 3
1
3
2
4
5
$ bash test.sh 1 4
4
2
3
1
5
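The same swap can be sketched in Python, which avoids the buffer bookkeeping. The function name swap_lines is hypothetical, and the sketch swaps only the first line matching each pattern, regardless of which of the two comes first in the file.

```python
import re
from typing import List, Optional


def _first_match(lines: List[str], pattern: str) -> Optional[int]:
    """Index of the first line matching `pattern`, or None."""
    for index, line in enumerate(lines):
        if re.search(pattern, line):
            return index
    return None


def swap_lines(lines: List[str], first_pat: str, second_pat: str) -> List[str]:
    """Swap the first line matching first_pat with the first matching second_pat."""
    result = list(lines)
    i = _first_match(result, first_pat)
    j = _first_match(result, second_pat)
    if i is not None and j is not None:
        result[i], result[j] = result[j], result[i]
    return result                    # unchanged if either pattern is missing
```

If either pattern has no match, the input is returned unchanged.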
Use sed only for simple substitutions (or not at all). For anything more complicated, use a programming language.
A simple example from the GNU sed texinfo doc:
Note that on implementations other than GNU `sed' this script might
easily overflow internal buffers.
#!/usr/bin/sed -nf
# reverse all lines of input, i.e. first line became last, ...
# from the second line, the buffer (which contains all previous lines)
# is *appended* to current line, so, the order will be reversed
1! G
# on the last line we're done -- print everything
$ p
# store everything on the buffer again
h
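In a general-purpose language the same reversal needs no hold-space bookkeeping. A minimal Python sketch (the function name reverse_lines is hypothetical; like the sed script, it holds the whole input in memory):

```python
from typing import Iterable, List


def reverse_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines in reverse order, last line first."""
    return list(lines)[::-1]
```

It accepts any iterable of lines, including an open file object.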
