I have 10,000 pages of text, basically subtitle text. On every line I want to delete the first 2 numbers and the last 5 numbers without touching or changing the text.
Here is an example:
18: 00:03:13:05 00:03:16:17 03:12 Moi, j'aurais mis ça à la même hauteur que ça.
19: 00:03:18:02 00:03:21:05 03:03 Dans un premier temps, je termine.
20: 00:03:23:15 00:03:26:07 02:17 Ah, toujours le travail !
The bold numbers (the leading index, e.g. 18:, and the duration just before the text, e.g. 03:12) should be deleted.
There are a number of ways you can approach this.
Split the string into substrings and then concatenate the ones you want.
Regular Expression to pull backrefs out of the text.
Since your data seems pretty normalized, and is essentially space delimited, you could tokenize the string based on a space, and then put the 2nd, 3rd, and "rest" back together, throwing away the 1st and 4th token.
You don't say what tools or languages you want to use, but in Java you might use public String[] split(String regex, int limit)
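The tokenizing approach described above can be sketched in Python. This is only an illustration under the assumption that every line has the layout shown in the question (index, start, end, duration, text, separated by single spaces); the function name is made up for the example.

```python
def strip_index_and_duration(line: str) -> str:
    """Drop the 1st and 4th space-separated tokens and keep the rest."""
    # maxsplit=4 keeps the subtitle text (which itself contains spaces) as one token
    parts = line.split(" ", 4)
    if len(parts) < 5:
        return line  # not in the assumed layout, so leave the line untouched
    _index, start, end, _duration, text = parts
    return " ".join((start, end, text))


sample = "18: 00:03:13:05 00:03:16:17 03:12 Moi, j'aurais mis ça à la même hauteur que ça."
print(strip_index_and_duration(sample))
```

Limiting the number of splits plays the same role as the limit argument of Java's split: the text is never broken apart, so it can be put back unchanged.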
With Vim, something like :%s/^\d\+: // should remove the first part. A pattern like :%s/ \d\d:\d\d / / should then remove the second part.
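For comparison, both substitutions can be folded into a single regular expression with a backreference. The sketch below uses Python's re module and assumes the line layout from the question; the names LINE_RE and strip_with_regex are hypothetical.

```python
import re

# Leading "NN: " index, the two timecodes (captured and kept),
# then the "MM:SS " duration, which is dropped.
LINE_RE = re.compile(r"^\d+: (\S+ \S+) \d\d:\d\d ")


def strip_with_regex(line: str) -> str:
    """Return the line without the index and duration; other lines pass through."""
    return LINE_RE.sub(r"\1 ", line, count=1)


print(strip_with_regex("19: 00:03:18:02 00:03:21:05 03:03 Dans un premier temps, je termine."))
```

Lines that do not match the pattern are returned unchanged, so a stray blank or malformed line is not damaged.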
Assuming all lines in the file have the same layout and the file is fixed width, on a system with GNU cut (the --complement option is a GNU extension, so the cut shipped with macOS or FreeBSD may not accept it), you can try
cut -b 1-4,28-33 --complement INPUTFILENAME > OUTPUTFILENAME
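If not, you should be able to do this in two steps, as follows:
cut -d : -f 1 --complement INPUTFILENAME > TEMPFILENAME
cut -b 1,25-30 --complement TEMPFILENAME > OUTPUTFILENAME
The first step deletes the leading number field regardless of its length, leaving a single leading space on each line. The second step then works on lines whose remaining columns are aligned: it removes that leading space (byte 1) and the duration column (bytes 25-30). The intermediate file matters, because redirecting the output of cut into the file it is reading would truncate that file before it is read.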
I'd like to replace double quotes " characters which come in pairs. Let me explain what I mean.
"Some sentence"
Here double quotes should be replaced because they come in pair.
"Some sentence
Here should not be replaced - there is no matching pair for the first quote character.
I'd like to replace first quote character with „.
❯ echo „ |hexdump -C
00000000 e2 80 9e 0a
And the second quote character with ”
❯ echo ” |hexdump -C
00000000 e2 80 9d 0a
Summing it up, the following:
Hi, "how
are you"
Should look like this after the replacement is made:
Hi, „how
are you”
I've come up with the following code, but it fails to work:
'sed -r s/(\")(.+)(\")/\1\xe2\x80\x9e\3\xe2\x80\x9d/g'
" hi " gives "„"”.
EDIT
As requested in the comments, here comes a sample from a file to be modified. Important note: the file is structured - perhaps it may help. The file is always a srt file, i.e. movie subtitle format.
104
00:10:25,332 --> 00:10:27,876
Kobieta mówi do drugiej:
"Widzisz to, co ja?"
105
00:10:28,001 --> 00:10:30,904
A tamta: "No to co?
Każdy wygląda tak samo."
Your expression doesn't work because you have three capturing groups: The three sets of (). You are putting the 1st (the first quote) and the 3rd (the last quote) in the output and ignoring the 2nd, which is the part you want to keep.
There's no reason to capture the quotes, since you don't want to inject them into the output. Only the bit in the middle needs to be captured.
There is also a flaw: the (.+) will itself match a string containing a quote. So /"(.+)"/ would match the entire sequence "one"two", with the capture, (.+), matching one"two. Use [^"]* to match a sequence of non-quote characters.
Fixing this, and treating the entire text file as one line with -z, which only works if there are no nul characters in the text file, it appears this works:
sed -zE 's/"([^"]+)"/„\1”/g'
sed -rn ':a;s/"([^"]*)"/„\1”/g;/"/!{p;b;};$p;N;ba'
It substitutes all "xx" with „xx”. If the result contains no more " it is printed and we restart with next line. Else we concatenate the next line and we restart. The $p is just here to print the last lines if they contain a dangling ".
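The same idea (capture only the text between two quotes, and use a negated character class so that a match cannot swallow another quote) can be sketched in Python. The whole text is processed as one string, which corresponds to the sed -z variant; the names are invented for the example.

```python
import re

# A pair of straight quotes with no other quote in between.
# [^"] also matches newlines, so a pair may span several lines.
PAIR_RE = re.compile(r'"([^"]*)"')


def polish_quotes(text: str) -> str:
    """Replace paired straight quotes with the typographic pair „ and ”."""
    return PAIR_RE.sub("„\\1”", text)


print(polish_quotes('Hi, "how\nare you"'))
```

As with sed -z, an unmatched quote early in the file would pair up with the next quote that follows it, even many lines later, so the result is only as good as the quoting in the input.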
I have a multi-FASTA file named fasta1.fasta that contains sequences and their IDs. I want to trim each sequence header so that it contains only the accession ID. I used the command line grep '>' fasta1.fasta | cut -d " " -f 1 to cut the part I want from the header, but the output I get is only the accession IDs, without the sequences. My sequences look like this:
>tr|Q8IBQ5|Q8IBQ5_PLAF7 40S ribosomal protein S10, putative OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_$
MDKQTLPHHKYSYIPKQNKKLIYEYLFKEGVIVVEKDAKIPRHPHLNVPNLHIMMTLKSL
KSRNYVEEKYNWKHQYFILNNEGIEYLREFLHLPPSIFPATLSKKTVNRAPKMDEDISRD
VRQPMGRGRAFDRRPFE
>tr|Q8IEB1|Q8IEB1_PLAF7 TBC domain protein, putative OS=Plasmodium falciparum (isolate 3D7) OX=36329 GN=PF3D7_132020$
MEYKLEFLSYLLIFKKKNERISKFDEQIKTCINIFEKSIINESDLKYLFERNILDMNPGV
RSMCWKLALKHLSLDSNKWNTELIEKKKLYEEYIKSFVINPYYSCVDNKKKEFVKETEKE
PKGKNMKDEYIEYNLDRNKTYYHKDDSLLKLQNDNNTKQMDYLEDEKYSSMDDECSEDNW
The output that I get is:
>tr|Q8IBQ5|Q8IBQ5_PLAF7
>tr|Q8IEB1|Q8IEB1_PLAF7
While the output desired is:
>tr|Q8IBQ5|Q8IBQ5_PLAF7
MDKQTLPHHKYSYIPKQNKKLIYEYLFKEGVIVVEKDAKIPRHPHLNVPNLHIMMTLKSL
KSRNYVEEKYNWKHQYFILNNEGIEYLREFLHLPPSIFPATLSKKTVNRAPKMDEDISRD
VRQPMGRGRAFDRRPFE
>tr|Q8IEB1|Q8IEB1_PLAF7
MEYKLEFLSYLLIFKKKNERISKFDEQIKTCINIFEKSIINESDLKYLFERNILDMNPGV
RSMCWKLALKHLSLDSNKWNTELIEKKKLYEEYIKSFVINPYYSCVDNKKKEFVKETEKE
PKGKNMKDEYIEYNLDRNKTYYHKDDSLLKLQNDNNTKQMDYLEDEKYSSMDDECSEDNW
Any help will be appreciated. Thank you.
Variant 1:
sed '/^>/s/ .*//'
Variant 2:
perl -pe 's/ .*// if /^>/'
That is, in all lines that start with >, remove everything after and including the first space.
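A rough Python equivalent of the two variants, in case the trimming has to happen inside a larger script. It assumes headers start with > and that the ID ends at the first space; the function name is hypothetical.

```python
def trim_fasta_headers(lines):
    """Yield every line; header lines (starting with '>') are cut at the first space."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            line = line.split(" ", 1)[0]
        yield line


records = [
    ">tr|Q8IBQ5|Q8IBQ5_PLAF7 40S ribosomal protein S10, putative\n",
    "MDKQTLPHHKYSYIPKQNKKLIYEYLFKEGVIVVEKDAKIPRHPHLNVPNLHIMMTLKSL\n",
    "VRQPMGRGRAFDRRPFE\n",
]
for out_line in trim_fasta_headers(records):
    print(out_line)
```

Sequence lines pass through unchanged, which is what the grep-based attempt in the question was missing.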
I'm trying to replace every nth occurrence of a string in a text file.
background:
I have a huge bibtex file (called in.bib) containing hundreds of entries beginning with "#". But every entry has a different number of lines. I want to write a string (e.g. "#") right before every (let's say) 6th occurrence of "#" so, in a second step, I can use csplit to split the huge file at "#" into files containing 5 entries each.
The problem is to find and replace every fifth "#".
Since I need it repeatedly, the suggested answer in "Printing with sed or awk a line following a matching pattern" won't do the job. Again, I am not looking for just one matching place but for many of them.
What I have so far:
awk '/^#/ && v++%5 {sub(/^#/, "\n#\n#")} {print > "out.bib"}' in.bib
replaces the 2nd through 5th occurrence (and no more).
(By the way, I found and adapted this solution here: "Sed replace every nth occurrence". Initially, it was meant to replace every second occurrence, which it does.)
And, second:
awk -v p="#" -v n="5" '$0~p{i++}i==n{sub(/^#/, "\n#\n#")}{print > "out.bib"}' in.bib
replaces exactly the 5th occurrence and nothing else.
(adapted from the solution here: "Display only the n'th match of grep")
What I need (and not able to write) is imho a loop. Would a for loop do the job? Something like:
for (i = 1; i <= 200; i * 5)
<find "#"> and <replace with "\n#\n#">
then print
The material I have looks like this:
#article{karamanic_jedno_2007,
title = {Jedno Kosova, Dva Srbije},
journal = {Ulaznica: Journal for Culture, Art and Social Issues},
author = {Karamanic, Slobodan},
year = {2007}
}
#inproceedings{blome_eigene_2008,
title = {Das Eigene, das Andere und ihre Vermischung. Zur Rolle von Sexualität und Reproduktion im Rassendiskurs des 19. Jahrhunderts},
comment = {Rest of lines snippet off here for usability -- as in following entries. All original entries may have a different amount of lines.}
}
#book{doring_inter-agency_2008,
title = {Inter-agency coordination in United Nations peacebuilding}
}
#book{reckwitz_subjekt_2008,
address = {Bielefeld},
title = {Subjekt}
}
What I want is every sixth entry looking like this:
#
#book{reckwitz_subjekt_2008,
address = {Bielefeld},
title = {Subjekt}
}
Thanks for your help.
Your code is almost right; I modified it.
To replace every nth occurrence, you need a modulo expression.
Written with explicit parentheses for clarity, the condition is ((i % n) == 0).
awk -v p="#" -v n="5" ' $0~p { i++ } ((i%n)==0) { sub(/^#/, "\n#\n#") }{ print }' in.bib > out.bib
You can do the splitting in awk easily in one step.
awk -v RS='#' 'NR==1{next} (NR-1)%5==1{c++} {printf "#%s", $0 > (FILENAME "." c)}' file
will create file.1, file.2, etc with 5 records each, where the record is defined by the delimiter #.
Instead of doing this in multiple steps with multiple tools, just do something like:
awk '/#/ && (++v%5)==1{out="out"++c} {print > out}' file
Untested since you didn't provide any sample input/output.
If you don't have GNU awk and your input file is huge you'll need to add a close(out) right before the out=... to avoid having too many files open simultaneously.
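The counting logic behind these one-step answers can be sketched in Python as well. This is an illustration only: it assumes an entry starts on a line beginning with the marker "#" as in the question, and chunk_entries is a made-up name. Writing each chunk to its own file is left to the caller.

```python
def chunk_entries(lines, marker="#", per_file=5):
    """Group lines into chunks that hold at most `per_file` entries each."""
    chunks, current, count = [], [], 0
    for line in lines:
        if line.startswith(marker):
            if count == per_file:
                # The current chunk is full, so this entry opens a new one
                chunks.append(current)
                current, count = [], 0
            count += 1
        current.append(line)
    if current:
        chunks.append(current)
    return chunks


# Seven tiny entries of two lines each, only to exercise the grouping
demo = [text for i in range(7) for text in (f"#book{{key{i},", "}")]
for number, chunk in enumerate(chunk_entries(demo), start=1):
    print(number, len(chunk))
```

Because each chunk is complete before the next one starts, only one output file ever needs to be open at a time.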
I'm trying to compute some news article popularity based on twitter data. However, while retrieving the tweets I forgot to escape the characters ending up with an unusable file.
Here is a line from the file:
1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80$,$000$,$ up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews
The '$,$' pattern occurs not only as a field delimiter but also in the tweet, from where I want to remove it.
A correct line would be:
1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80000 up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews
I tried to use cut and sed but I'm not getting the results I want. What would be a good strategy to solve this?
If we can assume that there are never extra separators in the time, id, retweets, username, and link fields, then you could take the middle part and remove all $,$ from it, for example like this:
perl -ne 'chomp; @a=split(/\$,\$/); $_ = join("", @a[4..($#a-1)]); print join("\$,\$", @a[0..3], $_, $a[$#a]), "\n"' < data.txt
What this does:
splits the line using $,$ as delimiter
takes the middle part = fields[4] .. fields[N-1]
joins again by $,$ the first 4 fields, the fixed middle part, and the last field (the link)
This works with your example, but I don't know what other corner cases you might have.
A good way to validate the result is to check that every line splits into exactly 6 fields on $,$. You can do that by piping the result to this:
... | perl -ne 'print scalar split(/\$,\$/), "\n"' | sort -u
(should output a single line, with "6")
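The split, merge and rejoin steps can also be written out in Python. This sketch makes the same assumption as the one-liner (the first four fields and the final link never contain the separator); repair_line and its parameters are names chosen for the example.

```python
SEP = "$,$"


def repair_line(line: str, leading: int = 4, trailing: int = 1) -> str:
    """Remove stray separators from the tweet field in the middle of the line."""
    fields = line.split(SEP)
    if len(fields) <= leading + trailing:
        return line  # too few fields to contain a tweet; leave untouched
    head = fields[:leading]
    tail = fields[len(fields) - trailing:]
    # Everything between head and tail belongs to the tweet text
    tweet = "".join(fields[leading:len(fields) - trailing])
    return SEP.join(head + [tweet] + tail)


broken = "1369283975$,$337427565662830592$,$0$,$username$,$tops $80$,$000$,$ up 75 pct$,$http://example.invalid/link"
fixed = repair_line(broken)
print(fixed)
print(len(fixed.split(SEP)))
```

The second print is the validation step: every repaired line should report the same field count.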
I have this file as:
The number is %d0The number is %d1The number is %d2The number is %d3The number is %d4The number is %d5The number is %d6The...
The number is %d67The number is %d68The number is %d69The number is %d70The number is %d71The number is %d72The....
The number is %d117The number is %d118The number is %d119The number is %d120The number is %d121The number is %d122
I want to pad it like:
The number is %d0 The number is %d1 The number is %d2 The number is %d3 The number is %d4 The number is %d5 The number is %d6
The number is %d63 The number is %d64 The number is %d65 The number is %d66 The number is %d67 The number is %d68 The number is %d69
d118The number is %d119The number is %d120The number is %d121The number is %d122The number is %d123The number is %d124The
Please tell me how to do it through shell script
I am working on Linux
Edit:
This single command pipeline should do what you want:
sed 's/\(d[0-9]\+\)/\1   /g;s/\(d[0-9 ]\{3\}\) */\1/g' test2.txt >test3.txt
#                     ^ three spaces here
Explanation:
For each sequence of digits following a "d", add three spaces after it. (I'll use "X" to represent spaces.)
d1 becomes d1XXX
d10 becomes d10XXX
d100 becomes d100XXX
Now (the part after the semicolon), capture every "d" and the next three characters, which must be digits or spaces, and output them but not any spaces beyond.
d1XXX becomes d1XX
d10XXX becomes d10X
d100XXX becomes d100
If you want to wrap the lines as you seem to show in your sample data, then do this instead:
sed 's/\(d[0-9]\+\)/\1   /g;s/\(d[0-9 ]\{3\}\) */\1/g' test2.txt | fold -w 133 >test3.txt
You may need to adjust the argument of the fold command to make it come out right.
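The padding step amounts to left-justifying each "d" plus its digits to a fixed width. A minimal Python sketch of that idea, assuming a width of four characters as in the explanation above (pad_numbers is a hypothetical name):

```python
import re


def pad_numbers(text: str, width: int = 4) -> str:
    """Pad every 'd<digits>' token with trailing spaces up to `width` characters."""
    return re.sub(r"d\d+", lambda m: m.group(0).ljust(width), text)


print(pad_numbers("The number is %d1The number is %d10The number is %d100The"))
```

Tokens that are already as wide as the target, such as d100, are left as they are.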
There's no need for if, grep, loops, etc.
Original answer:
First of all, you really need to say which shell you're using, but since you have elif and fi, I'm assuming it's Bourne-derived.
Based on that assumption, your script makes no sense.
The parentheses for the if and elif are unnecessary. In this context, they create a subshell which serves no purpose.
The sed commands in the if and elif say "if the pattern is found, copy hold space (it's empty, by the way) to pattern space and output it, and output all other lines."
The first sed command will always be true so the elif will never be executed. sed always returns true unless there's an error.
This may be what you intended:
if grep -Eqs 'd[0-9]([^0-9]|$)' test2.txt; then
    sed 's/\(d[0-9]\)\([^0-9]\|$\)/\1 \2/g' test2.txt >test3.txt
elif grep -Eqs 'd[0-9][0-9]([^0-9]|$)' test2.txt; then
    sed 's/\(d[0-9][0-9]\)\([^0-9]\|$\)/\1 \2/g' test2.txt >test3.txt
else
    cat test2.txt >test3.txt
fi
But I wonder if all that could be replaced by something like this one-liner:
sed 's/\(d[0-9][0-9]\?\)\([^0-9]\|$\)/\1 \2/g' test2.txt >test3.txt
Since I don't know what test2.txt looks like, part of this is only guessing.