How do I search and replace in vi to eliminate a random number of random characters preceding a known string? - search

I have text files which look like this:
0 298047498 /directory1/app/20170417/file1.blob 0 f191
e 6569844 /directory1/app/20170417/file2.blob 0 f191
344 /directory1/app/20170417/file3.blob 0
8946 /directory1/app/20170417/file4.blob 0
196496 /directory1/app/20170417/file5.blob 0
9 182340752 /directory1/app/20170417/file6.blob 0 f191
68802 /directory1/app/20170417/file7.blob 0
I want to remove everything prior to the first / and everything after the file extension.
Results should look like this:
/directory1/app/20170417/file1.blob
/directory1/app/20170417/file2.blob
/directory1/app/20170417/file3.blob
Is there a way to do this using vi search and replace?

This type of question may be better placed here: https://vi.stackexchange.com/
But for now:
You can, for example, use a simple Vim macro: record all the keystrokes needed to edit one line, then repeat the macro as many times as you need.
Here are the keystrokes for one line:
dt/WD
d = delete..
t = ..till the first "/"
W = [shift]+[w] jumps to the next Word (after the "file-location-string")
D = [shift]+[d] deletes till the end of the current line
To record this as a macro, wrap the keystrokes from above like this:
qmdt/WD[home][down]q
qm = start recording a macro into register "m"
... keystrokes from above
[home][down] = the [home] key followed by the [arrow down] key, to move to the start of the next line (for convenience)
q = end the macro recording
Now execute the macro with:
@m
And if you added the [down] key, you can do something like:
7@m
which runs the macro 7 times, once for each of your 7 lines.
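For comparison outside the editor, the same per-line edit can be sketched as a regular-expression substitution in Python. This is only an illustration of the pattern (drop everything before the first /, keep the run of non-space characters that starts there, drop the rest); the names PATH_RE and strip_to_path are made up for this example and it assumes the paths contain no spaces.

```python
import re

# Everything before the first "/" is skipped, the path is the run of
# non-space characters starting at that "/", and the rest is discarded.
PATH_RE = re.compile(r'^[^/]*(/\S+).*$')


def strip_to_path(line):
    """Return only the path part of a line; lines without '/' are returned unchanged."""
    return PATH_RE.sub(r'\1', line)


sample = [
    "0 298047498 /directory1/app/20170417/file1.blob 0 f191",
    "e 6569844 /directory1/app/20170417/file2.blob 0 f191",
    "344 /directory1/app/20170417/file3.blob 0",
]
for line in sample:
    print(strip_to_path(line))
```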

Related

running for loop until arbitrary index (python 3.x)

So I have these strings that I split by spaces (' ') and I just rolled them into a single list I called 'keyLabelRun'
so it looks like this:
keyLabelRun[0-12]:
0 OS=Dengue
1 virus
2 3
3 PE=4
4 SV=1
5 Split=0
6
7 OS=Bacillus
8 subtilis
9 XF-1
10 GN=opuBA
11 PE=4
12 SV=1
I only want the elements that contain "OS=" and the elements that follow them; anything else, whether it be "SV=" or "PE=" etc., I want to skip until I get to the next "OS=".
The number of elements to the next "OS=" is arbitrary so that's where I'm having the problem.
This is what I'm currently trying:
OSarr = []
for i in range(len(keyLabelrun)):
    if keyLabelrun[i].count('OS='):
        OSarr.append(keyLabelrun[i])
        if keyLabelrun[i+1].count('=') != 1:
            continue
But the elements where "OS=" is not included is what is tripping me up I think.
Also at the end I'm going to join them all back together in their own elements but I feel like I will be able to handle that after this.
In my attempt, I am trying to append all elements I'm looking for in order to an new list 'OSarr'
If anyone can lend a hand, it would be much appreciated.
Thank you.
This list of strings came from a dataset that is a text file in the form:
>tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) OS=Dengue virus 3 PE=4 SV=1 Split=0
MNNQRKKTGKPSINMLKRVRNRVSTGSQLAKRFSKGLLNGQGPMKLVMAFIAFLRFLAIPPTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINKRKKTSLCLMMILPAALAFHLTSRDGEPRMIVGKNERGKSLLFKTASGINMCTLIAMDLGEMCDDTVTYKCPHITEVEPEDIDCWCNLTSTWVTYGTCNQAGEHRRDKRSVALAPHVGMGLDTRTQTWMSAEGAWRQVEKVETWALRHPGFTILALFLAHYIGTSLTQKVVIFILLMLVTPSMTMRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLCIEGKITNITTDSRCPTQGEATLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGLECSPRTGLDFNEMILLTMKNKAWMVHRQWFFDLPLPWTSGATTETPTWNRKELLVTFKNAHAKKQEVVVLGSQEGAMHTALTGATEIQNSGGTSIFAGHLKCRLKMDKLELKGMSYAMCTNTFVLKKEVSETQHGTILIKVEYKGEDVPCKIPFSTEDGQGKAHNGRLITANPVVTKKEEPVNIEAEPPFGESNIVIGIGDNALKINWYKKGSSIGKMFEATARGARRMAILGDTAWDFGSVGGVLNSLGKMVHQIFGSAYTALFSGVSWVMKIGIGVLLTWIGLNSKNTSMSFSCIAIGIITLYLGAVVQADMGCVINWKGKELKCGSGIFVTNEVHTWTEQYKFQADSPKRLATAIAGAWENGVCGIRSTTRMENLLWKQIANELNYILWENNIKLTVVVGDIIGVLEQGKRTLTPQPMELKYSWKTWGKAKIVTAETQNSSFIIDGPNTPECPSVSRAWNVWEVEDYGFGVFTTNIWLKLREVYTQLCDHRLMSAAVKDERAVHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTLWSNGVLESDMIIPKSLAGPISQHNHRPGYHTQTAGPWHLGKLELDFNYCEGTTVVITENCGTRGPSLRTTTVSGKLIHEWCCRSCTLPPLRYMGEDGCWYGMEIRPISEKEENMVKSLVSAGSGKVDNFTMGVLCLAILFEEVMRGKFGKKHMIAGVFFTFVLLLSGQITWRDMAHTLIMIGSNASDRMGMGVTYLALIATFKIQPFLALGFFLRKLTSRENLLLGVGLAMATTLQLPEDIEQMANGIALGLMALKLITQFETYQLWTALISLTCSNTIFTLTVAWRTATLILAGVSLLPVCQSSSMRKTDWLPMAVAAMGVPPLPLFIFGLKDTLKRRSWPLNEGVMAVGLVSILASSLLRNDVPMAGPLVAGGLLIACYVITGTSADLTVEKAADITWEEEAEQTGVSHNLMITVDDDGTMRIKDDETENILTVLLKTALLIVSGIFPYSIPATLLVWHTWQKQTQRSGVLWDVPSPPETQKAELEEGVYRIKQQGIFGKTQVGVGVQKEGVFHTMWHVTRGAVLTYNGKRLEPNWASVKKDLISYGGGWRLSAQWQKGEEVQVIAVEPGKNPKNFQTMPGTFQTTTGEIGAIALDFKPGTSGSPIINREGKVVGLYGNGVVTKNGGYVSGIAQTNAEPDGPTPELEEEMFKKRNLTIMDLHPGSGKTRKYLPAIVREAIKRRLRTLILAPTRVVAAEMEEALKGLPIRYQTTATKSEHTGREIVDLMCHATFTMRLLSPVRVPNYNLIIMDEAHFTDPASIAARGYISTRVGMGEAAAIFMTATPPGTADAFPQSNAPIQDEERDIPERSWNSGNEWITDFAGKTVWFVPSIKAGNDIANCLRKNGKKVIQLSRKTFDTEYQKTKLNDWDFVV
>tr|M4KW32|M4KW32_BACIU Choline ABC transporter (ATP-binding protein) OS=Bacillus subtilis XF-1 GN=opuBA PE=4 SV=1 Split=0
MLTLENVSKTYKGGKKAVNNVNLKIAKGEFICFIGPSGCGKTTTMKMINRLIEPSAGKIFIDGENIMDQDPVELRRKIGYVIQQIGLFPHMTIQQNISLVPKLLKWPEQQRKERARELLKLVDMGPEYVDRYPHELSGGQQQRIGVLRALAAEPPLILMDEPFGALDPITRDSLQEEFKKLQKTLHKTIVFVTHDMDEAIKLADRIVILKAGEIVQVGTPDDILRNPADEFVEEFIGKERLIQSSSPDVERVDQIMNTQPVTITADKTLSEAIQLMRQERVDSLLVVDDEHVLQGYVDVEIIDQCRKKANLIGEVLHEDIYTVLGGTLLRDTVRKILKRGVKYVPVVDEDRRLIGIVTRASLVDIVYDSLWGEEKQLAALS
>sp|Q8AWH3|SX17A_XENTR Transcription factor Sox-17-alpha OS=Xenopus tropicalis GN=sox17a PE=2 SV=1 Split=0
MSSPDGGYASDDQNQGKCSVPIMMTGLGQCQWAEPMNSLGEGKLKSDAGSANSRGKAEARIRRPMNAFMVWAKDERKRLAQQNPDLHNAELSKMLGKSWKALTLAEKRPFVEEAERLRVQHMQDHPNYKYRPRRRKQVKRMKRADTGFMHMAEPPESAVLGTDGRMCLESFSLGYHEQTYPHSQLPQGSHYREPQAMAPHYDGYSLPTPESSPLDLAEADPVFFTSPPQDECQMMPYSYNASYTHQQNSGASMLVRQMPQAEQMGQGSPVQGMMGCQSSPQMYYGQMYLPGSARHHQLPQAGQNSPPPEAQQMGRADHIQQVDMLAEVDRTEFEQYLSYVAKSDLGMHYHGQESVVPTADNGPISSVLSDASTAVYYCNYPSA
I got it! :D
OSarr = []
G = 0
for i in range(len(keyLabelrun)):
    OSarr.append(keyLabelrun[G])
    G += 1
    if keyLabelrun[G].count('='):
        while keyLabelrun[G].count('OS=') != 1:
            G+=1
Maybe next time everyone, thank you!
Due to the syntax, you have to keep track of which part (OS, PE, etc) you're currently parsing. Here's a function to extract the species name from the FASTA header:
def extract_species(description):
    species_parts = []
    is_os = False
    for word in description.split():
        if word[:3] == 'OS=':
            is_os = True
            species_parts.append(word[3:])
        elif '=' in word:
            is_os = False
        elif is_os:
            species_parts.append(word)
    return ' '.join(species_parts)
You can call it when processing your input file, e.g.:
from Bio import SeqIO
for record in SeqIO.parse('input.fa', 'fasta'):
    species = extract_species(record.description)
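If Biopython is not available, the same field can also be pulled out with a regular expression instead of the word-by-word state tracking above. The following is a sketch of that alternative, not the method from the answer: it assumes every field after the species name has the form KEY=value, and the names OS_RE, species_from_header and species_list are invented for this example.

```python
import re

# Capture everything after "OS=" up to (but not including) the next
# " KEY=" field, or up to the end of the line if no further field follows.
OS_RE = re.compile(r'OS=(.*?)(?=\s+\w+=|$)')


def species_from_header(header):
    """Return the species name from one FASTA header line, or '' if there is no OS= field."""
    match = OS_RE.search(header)
    return match.group(1) if match else ''


def species_list(lines):
    """Collect the species of every header line (lines starting with '>')."""
    return [species_from_header(line) for line in lines if line.startswith('>')]


headers = [
    ">tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) OS=Dengue virus 3 PE=4 SV=1 Split=0",
    ">tr|M4KW32|M4KW32_BACIU Choline ABC transporter (ATP-binding protein) OS=Bacillus subtilis XF-1 GN=opuBA PE=4 SV=1 Split=0",
    ">sp|Q8AWH3|SX17A_XENTR Transcription factor Sox-17-alpha OS=Xenopus tropicalis GN=sox17a PE=2 SV=1 Split=0",
]
for name in species_list(headers):
    print(name)
```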

parsing dates from strings

I have a list of strings in python like this
['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
I want to parse only the date and time (for example, 2016-08-05 15:10:00) from these strings.
So far I used a for loop like the one below but it's very time consuming, is there a better way to do this?
for files in glob.glob("AM_B0_*.flac.h5"):
    if files[11]=='_':
        year=files[12:16]
        month=files[17:19]
        day= files[20:22]
        hour=files[23:25]
        minute=files[25:27]
        second=files[27:29]
        tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year),int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
    else:
        year=files[11:15]
        month=files[16:18]
        day= files[19:21]
        hour=files[22:24]
        minute=files[24:26]
        second=files[26:28]
        tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year), int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
Try this (based on the 2nd last '-', no need of if-else case):
import pandas as pd

filesall = ['AM_B0_D0.0_2016-04-01T010000.flac.h5',
            'AM_B0_D3.7_2016-04-13T215000.flac.h5',
            'AM_B0_D10.3_2017-03-17T110000.flac.h5',
            'AM_B0_D0.7_2016-10-21T104000.flac.h5',
            'AM_B0_D4.4_2016-08-05T151000.flac.h5',
            'AM_B0_D0.0_2016-04-01T010000.flac.h5',
            'AM_B0_D3.7_2016-04-13T215000.flac.h5',
            'AM_B0_D10.3_2017-03-17T110000.flac.h5',
            'AM_B0_D0.7_2016-10-21T104000.flac.h5',
            'AM_B0_D4.4_2016-08-05T151000.flac.h5']

def find_second_last(text, pattern):
    # index of the second-to-last occurrence of pattern in text
    return text.rfind(pattern, 0, text.rfind(pattern))

for files in filesall:
    start = find_second_last(files,'-') - 4 # from yyyy- part
    timepart = (files[start:start+17]).replace("T"," ")
    #insert 2 ':'s
    timepart = timepart[:13] + ':' + timepart[13:15] + ':' +timepart[15:]
    # print(timepart)
    tindex=pd.date_range(start= timepart, periods=60, freq='10S')
Instead of hard-coding files[11], find the last or 2nd-last index of _ and slice from there with your existing code; then you don't have to write the same code twice. Or use a regex to parse the string.
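The regex suggestion can be sketched as follows. This is a minimal illustration, assuming every file name contains exactly one timestamp of the form YYYY-MM-DDTHHMMSS; the names STAMP_RE and parse_stamp are invented for this example. The returned datetime should be usable as the start argument of pd.date_range in place of the hand-built string.

```python
import re
from datetime import datetime

# Matches e.g. "2016-08-05T151000" wherever it sits in the file name,
# so the varying width of the "D10.3" part no longer matters.
STAMP_RE = re.compile(r'(\d{4}-\d{2}-\d{2}T\d{6})')


def parse_stamp(filename):
    """Return the timestamp embedded in filename as a datetime object."""
    match = STAMP_RE.search(filename)
    if match is None:
        raise ValueError('no timestamp found in %r' % filename)
    return datetime.strptime(match.group(1), '%Y-%m-%dT%H%M%S')


for name in ['AM_B0_D0.0_2016-04-01T010000.flac.h5',
             'AM_B0_D10.3_2017-03-17T110000.flac.h5',
             'AM_B0_D4.4_2016-08-05T151000.flac.h5']:
    print(parse_stamp(name))
```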

Notepad++/Vim Incremental Insert

I have a file that I'm writing where I need to insert incrementing numbers in two different places (Context & UConfig). I've tried using the column editor in Notepad++, but it also adds numbers in the middle of the other rows.
forw mte Context 0
forw mte Config 1 0 0
forw mte CellSelection 10000 [10]
forw mte UConfig 1([100000000000000 2] [] [] []) [] [0]
forw mte ConfigCapability [1] [100] [100]
forw mte ConfigSelection 20000
I need to copy this selection 1000 times with Context going from 0-999 and UConfig going from 100000000000000-100000000000999.
It should be an easy task for vim:
type:
qq?Conte<CR>6YGgpn<C-A>3jf0<C-A>q
then do 998@q
The first step records a macro into register q; <CR> and <C-A> mean pressing Enter and Ctrl-A.
The second step replays the macro 998 times.
"What does the first line do exactly?" Try it yourself: type the keystrokes from the first ? to the last <C-A> one at a time and you will see how the text is modified.
As a small demo, to produce the blocks for 0 - 4, press 3@q after recording.
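If a script is acceptable instead of an editor macro, the same file can be generated directly. The sketch below is an illustration only: TEMPLATE is copied from the block in the question, while UCONFIG_BASE and make_blocks are names made up for this example.

```python
# One block of the file, with placeholders for the two incrementing numbers.
TEMPLATE = """forw mte Context {ctx}
forw mte Config 1 0 0
forw mte CellSelection 10000 [10]
forw mte UConfig 1([{ucfg} 2] [] [] []) [] [0]
forw mte ConfigCapability [1] [100] [100]
forw mte ConfigSelection 20000
"""

UCONFIG_BASE = 100000000000000  # first UConfig value from the question


def make_blocks(count):
    """Return count copies of the block, with Context and UConfig incremented together."""
    return "".join(
        TEMPLATE.format(ctx=i, ucfg=UCONFIG_BASE + i) for i in range(count)
    )


# Show the first two blocks; make_blocks(1000) yields the full 0-999 range.
print(make_blocks(2), end="")
```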

entering text in a file at specific locations by identifying the number being integer or real in linux

I have an input like below
46742 1 48276 48343 48199 48198
46744 1 48343 48344 48200 48199
46746 1 48344 48332 48201 48200
48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02
48282 3.67231688e+01 -2.97771595e+00 1.50419488e+02
48285 3.58558188e+01 -1.98122787e+00 1.50894850e+02
Each segment whose 2nd entry is an integer, like 1, runs for thousands of lines; then a segment starts whose 2nd entry is a real number, like 3.58077402e+01.
Before anything begins I have to insert text, like this:
*Revolved
*Gripped
*Crippled
46742 1 48276 48343 48199 48198
46744 1 48343 48344 48200 48199
46746 1 48344 48332 48201 48200
*Cracked
*Crippled
48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02
48282 3.67231688e+01 -2.97771595e+00 1.50419488e+02
48285 3.58558188e+01 -1.98122787e+00 1.50894850e+02
So I need to insert specific text at those locations. Note that the file is space-delimited, not tab-delimited, and that the text starting with * has to be at the very left of the line, without leading spaces. The format of the rest of the file should be kept too.
Any suggestions with sed or awk would be highly appreciated!
The text at the beginning could be entered directly, since that is the start of the file, so it is not the main problem. The problematic part is the second bunch of lines: identifying where the second entry turns into a real number.
An awk with fixed strings:
awk 'BEGIN{print "*Revolved\n*Gripped\n*Crippled"}
match($2,"\+")&&!pr{print "*Cracked\n*Crippled";pr=1}1' yourfile
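match($2,"\+")&&!pr : true when a + character is found in field $2 (as in the exponent of a real number such as 3.58077402e+01) and the pr flag is still unset, so the second header block is printed only once. The trailing 1 prints every input line unchanged. Note that a real value without a +, such as 3.5e-01, would not trigger the insert.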
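The same idea can be sketched in Python with a stricter test: instead of looking for a + sign, check whether the second field parses as an integer. This is an assumption-laden illustration rather than a translation of the awk command; is_integer_token and insert_markers are hypothetical names, and the marker lines are the ones from the question.

```python
def is_integer_token(token):
    """True if token parses as an integer (so '1' passes, '3.58e+01' does not)."""
    try:
        int(token)
        return True
    except ValueError:
        return False


def insert_markers(lines):
    """Prepend the first marker block and add the second one where field 2 stops being an integer."""
    out = ["*Revolved", "*Gripped", "*Crippled"]
    switched = False
    for line in lines:
        fields = line.split()
        if not switched and len(fields) > 1 and not is_integer_token(fields[1]):
            out.extend(["*Cracked", "*Crippled"])
            switched = True
        out.append(line)  # data lines are passed through untouched
    return out


data = [
    "46742 1 48276 48343 48199 48198",
    "46744 1 48343 48344 48200 48199",
    "48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02",
    "48282 3.67231688e+01 -2.97771595e+00 1.50419488e+02",
]
print("\n".join(insert_markers(data)))
```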

awk-insert row with specific text within specific position

I have a file where the first few rows start with a # mark, followed by the classical netlist, which can also contain rows beginning with a # mark. I need to insert one row with the text protect between the first block of rows beginning with # and the first row of the netlist. At the end of the file I need to insert a row with the word unprotect. It would be good to save the modified text to a new file with a specific name, because the original file is protected.
Sample file:
// Generated for: spectre
// Design library name: Kovi
// Design cell name: T_Line
// Design view name: schematic
simulator lang=spectre
global 0
parameters frequency=3.8G Zo=250
// Library name: Kovi
// Cell name: T_Line
// View name: schematic
T8 (7 0 6 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T7 (net034 0 net062 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T5 (net021 0 4 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T4 (net019 0 2 0) tline z0=Zo f=3.8G nl=0.5 vel=1
How about sed
sed -e '/^#/,/^#/!iprotect'$'\n''$aunprotect'$'\n' input_file > new_file
Inserts 'protect' on a line by itself after the first block of commented lines, then adds 'unprotect' at the end.
Note: Because I use $'\n' in place of a literal newline, bash is assumed as the shell.
Since you tagged the post with awk:
awk 'BEGIN{ protected=""} { if($0 !~ /#/ && !protected){ protected="1"; print "protect";} print $0}END{print "unprotect";}' input_file > output_file
As soon as a row without a # is detected, it outputs a line with protect. At the end it outputs a line with unprotect. Note that /#/ is unanchored, so it matches a # anywhere in the row, including one preceded by whitespace.
Test file
#
#
#
#Preceded by a tab
begin protect
#
before unprotect
Result
#
#
#
#Preceded by tab
protect
begin protect
#
before unprotect
unprotect
Edit:
Removed the [:space:]* as it seems that is already handled by default.
Support //
If you wanted to support both # and // in the same script, the regex portion would change to /#|\//. The special character / has to be escaped by using \.
This would check for at least one /.
Adding a quantifier {2} will match // exactly: /#|\/{2}/
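For comparison, the same logic can be sketched in Python, treating a row as a comment only when its first non-whitespace characters are # or //. This is a minimal illustration under that assumption; add_protect and its comment_prefixes parameter are names invented for this example.

```python
def add_protect(lines, comment_prefixes=("#", "//")):
    """Insert 'protect' before the first non-comment row and append 'unprotect' at the end."""
    out = []
    inserted = False
    for line in lines:
        # A row counts as a comment if it starts with one of the prefixes
        # after leading whitespace is ignored; blank rows count as non-comments.
        if not inserted and not line.lstrip().startswith(comment_prefixes):
            out.append("protect")
            inserted = True
        out.append(line)
    out.append("unprotect")
    return out


test_rows = ["#", "#", "#", "\t#Preceded by a tab", "begin protect", "#", "before unprotect"]
print("\n".join(add_protect(test_rows)))
```

Reading the rows from the original file and writing the result to a new file keeps the protected original untouched.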

Resources