I want to replace some wrongly assigned elements from the dataframe in python based on the previous elements - string

Station_ID
ID809086
t
ID809088
ID809089
.
.
ID809098
t
ID809100
.
.
.
Basically I want to replace all 't' with previous ID +1, such as t = ID809087 and ID809099 in above scenario. Thanks in advance
Basically I want to replace all 't' with previous ID +1, such as t = ID809087 and ID809099 in above scenario. Thanks in advance

Try:
tmp = 'ID' + df['Station_ID'].str.extract(r'^ID(\d+)')[0].astype(float).interpolate().astype(int).astype(str)
df['Station_ID'] = np.where(df['Station_ID'].eq('t'), tmp, df['Station_ID'])
print(df)
Prints:
Station_ID
0 ID809086
1 ID809087
2 ID809088
3 ID809089
4 ID809098
5 ID809099
6 ID809100

Related

Add suffix to a specific row in pandas dataframe

Im trying to add suffix to % Paid row in the dataframe, but im stuck with only adding suffix to the column names.
is there a way i can add suffix to a specific row values,
Any suggestions are highly appreciated.
d={
("Payments","Jan","NOS"):[],
("Payments","Feb","NOS"):[],
("Payments","Mar","NOS"):[],
}
d = pd.DataFrame(d)
d.loc["Total",("Payments","Jan","NOS")] = 9991
d.loc["Total",("Payments","Feb","NOS")] = 3638
d.loc["Total",("Payments","Mar","NOS")] = 5433
d.loc["Paid",("Payments","Jan","NOS")] = 139
d.loc["Paid",("Payments","Feb","NOS")] = 123
d.loc["Paid",("Payments","Mar","NOS")] = 20
d.loc["% Paid",("Payments","Jan","NOS")] = round((d.loc["Paid",("Payments","Jan","NOS")] / d.loc["Total",("Payments","Jan","NOS")])*100)
d.loc["% Paid",("Payments","Feb","NOS")] = round((d.loc["Paid",("Payments","Feb","NOS")] / d.loc["Total",("Payments","Feb","NOS")])*100)
d.loc["% Paid",("Payments","Mar","NOS")] = round((d.loc["Paid",("Payments","Mar","NOS")] / d.loc["Total",("Payments","Mar","NOS")])*100)
without suffix
I tried this way, it works but.. im looking for adding suffix for an entire row..
d.loc["% Paid",("Payments","Jan","NOS")] = str(round((d.loc["Paid",("Payments","Jan","NOS")] / d.loc["Total",("Payments","Jan","NOS")])*100)) + '%'
d.loc["% Paid",("Payments","Feb","NOS")] = str(round((d.loc["Paid",("Payments","Feb","NOS")] / d.loc["Total",("Payments","Feb","NOS")])*100)) + '%
d.loc["% Paid",("Payments","Mar","NOS")] = str(round((d.loc["Paid",("Payments","Mar","NOS")] / d.loc["Total",("Payments","Mar","NOS")])*100)) + '%'
with suffix
Select row separately by first index value, round and convert to integers, last to strings and add %:
d.loc["% Paid"] = d.loc["% Paid"].round().astype(int).astype(str).add(' %')
print (d)
Payments
Jan Feb Mar
NOS NOS NOS
Total 9991.0 3638.0 5433.0
Paid 139.0 123.0 20.0
% Paid 1 % 3 % 0 %

Store iteritems-Result in dataframe

Goal:
Add a column ('Team_url') to my nfl teams dataframe (df_teams) with each teams website-url.
Problem:
If I print the url, it works just fine. If I try to store it to df_teams['Team_url'], it only stores the last result of the iteritems.
Data:
df_teams['Team_web']
0 Arizona-Cardinals
1 Chicago-Bears
2 Green-Bay-Packers
3 New-York-Giants
4 Detroit-Lions
.
.
.
31 Houston-Texans
Code:
for i, j in df_teams['Team_web'].iteritems():
url_1 = "https://www.nfl.com/teams/{0}/roster".format(j)
df_teams['Team_url'] = url_1
Print:
print(url_1):
https://www.nfl. com/teams/Arizona-Cardinals/roster
https://www.nfl. com/teams/Chicago-Bears/roster
.
.
.
https://www.nfl.com/teams/Houston-Texans/roster
print(df_teams['Team_url'])
0 https://www.nfl.com/teams/Houston-Texans/roster
1 https://www.nfl.com/teams/Houston-Texans/roster
2 https://www.nfl.com/teams/Houston-Texans/roster
Questions:
How can I store what is printed for the url_1 in the dataframe column?
df['Team_url'] = 'https://www.nfl.com/teams/' + df['Team_web'] + '/roster'

running for loop until arbitrary index (python 3.x)

So I have these strings that I split by spaces (' ') and I just rolled them into a single list I called 'keyLabelRun'
so it looks like this:
keyLabelRun[0-12]:
0 OS=Dengue
1 virus
2 3
3 PE=4
4 SV=1
5 Split=0
6
7 OS=Bacillus
8 subtilis
9 XF-1
10 GN=opuBA
11 PE=4
12 SV=1
I only want the elements that include and are after "OS=", anything else, whether it be "SV=" or "PE=" etc. I want to skip over those elements until I get to the next "OS="
The number of elements to the next "OS=" is arbitrary so that's where I'm having the problem.
This is what I'm currently trying:
OSarr = []
for i in range(len(keyLabelrun)):
if keyLabelrun[i].count('OS='):
OSarr.append(keyLabelrun[i])
if keyLabelrun[i+1].count('=') != 1:
continue
But the elements where "OS=" is not included is what is tripping me up I think.
Also at the end I'm going to join them all back together in their own elements but I feel like I will be able to handle that after this.
In my attempt, I am trying to append all elements I'm looking for in order to an new list 'OSarr'
If anyone can lend a hand, it would be much appreciated.
Thank you.
These list of strings came from a dataset that is a text file in the form:
>tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) OS=Dengue virus 3 PE=4 SV=1 Split=0
MNNQRKKTGKPSINMLKRVRNRVSTGSQLAKRFSKGLLNGQGPMKLVMAFIAFLRFLAIPPTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINKRKKTSLCLMMILPAALAFHLTSRDGEPRMIVGKNERGKSLLFKTASGINMCTLIAMDLGEMCDDTVTYKCPHITEVEPEDIDCWCNLTSTWVTYGTCNQAGEHRRDKRSVALAPHVGMGLDTRTQTWMSAEGAWRQVEKVETWALRHPGFTILALFLAHYIGTSLTQKVVIFILLMLVTPSMTMRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLCIEGKITNITTDSRCPTQGEATLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGLECSPRTGLDFNEMILLTMKNKAWMVHRQWFFDLPLPWTSGATTETPTWNRKELLVTFKNAHAKKQEVVVLGSQEGAMHTALTGATEIQNSGGTSIFAGHLKCRLKMDKLELKGMSYAMCTNTFVLKKEVSETQHGTILIKVEYKGEDVPCKIPFSTEDGQGKAHNGRLITANPVVTKKEEPVNIEAEPPFGESNIVIGIGDNALKINWYKKGSSIGKMFEATARGARRMAILGDTAWDFGSVGGVLNSLGKMVHQIFGSAYTALFSGVSWVMKIGIGVLLTWIGLNSKNTSMSFSCIAIGIITLYLGAVVQADMGCVINWKGKELKCGSGIFVTNEVHTWTEQYKFQADSPKRLATAIAGAWENGVCGIRSTTRMENLLWKQIANELNYILWENNIKLTVVVGDIIGVLEQGKRTLTPQPMELKYSWKTWGKAKIVTAETQNSSFIIDGPNTPECPSVSRAWNVWEVEDYGFGVFTTNIWLKLREVYTQLCDHRLMSAAVKDERAVHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTLWSNGVLESDMIIPKSLAGPISQHNHRPGYHTQTAGPWHLGKLELDFNYCEGTTVVITENCGTRGPSLRTTTVSGKLIHEWCCRSCTLPPLRYMGEDGCWYGMEIRPISEKEENMVKSLVSAGSGKVDNFTMGVLCLAILFEEVMRGKFGKKHMIAGVFFTFVLLLSGQITWRDMAHTLIMIGSNASDRMGMGVTYLALIATFKIQPFLALGFFLRKLTSRENLLLGVGLAMATTLQLPEDIEQMANGIALGLMALKLITQFETYQLWTALISLTCSNTIFTLTVAWRTATLILAGVSLLPVCQSSSMRKTDWLPMAVAAMGVPPLPLFIFGLKDTLKRRSWPLNEGVMAVGLVSILASSLLRNDVPMAGPLVAGGLLIACYVITGTSADLTVEKAADITWEEEAEQTGVSHNLMITVDDDGTMRIKDDETENILTVLLKTALLIVSGIFPYSIPATLLVWHTWQKQTQRSGVLWDVPSPPETQKAELEEGVYRIKQQGIFGKTQVGVGVQKEGVFHTMWHVTRGAVLTYNGKRLEPNWASVKKDLISYGGGWRLSAQWQKGEEVQVIAVEPGKNPKNFQTMPGTFQTTTGEIGAIALDFKPGTSGSPIINREGKVVGLYGNGVVTKNGGYVSGIAQTNAEPDGPTPELEEEMFKKRNLTIMDLHPGSGKTRKYLPAIVREAIKRRLRTLILAPTRVVAAEMEEALKGLPIRYQTTATKSEHTGREIVDLMCHATFTMRLLSPVRVPNYNLIIMDEAHFTDPASIAARGYISTRVGMGEAAAIFMTATPPGTADAFPQSNAPIQDEERDIPERSWNSGNEWITDFAGKTVWFVPSIKAGNDIANCLRKNGKKVIQLSRKTFDTEYQKTKLNDWDFVV
>tr|M4KW32|M4KW32_BACIU Choline ABC transporter (ATP-binding protein) OS=Bacillus subtilis XF-1 GN=opuBA PE=4 SV=1 Split=0
MLTLENVSKTYKGGKKAVNNVNLKIAKGEFICFIGPSGCGKTTTMKMINRLIEPSAGKIFIDGENIMDQDPVELRRKIGYVIQQIGLFPHMTIQQNISLVPKLLKWPEQQRKERARELLKLVDMGPEYVDRYPHELSGGQQQRIGVLRALAAEPPLILMDEPFGALDPITRDSLQEEFKKLQKTLHKTIVFVTHDMDEAIKLADRIVILKAGEIVQVGTPDDILRNPADEFVEEFIGKERLIQSSSPDVERVDQIMNTQPVTITADKTLSEAIQLMRQERVDSLLVVDDEHVLQGYVDVEIIDQCRKKANLIGEVLHEDIYTVLGGTLLRDTVRKILKRGVKYVPVVDEDRRLIGIVTRASLVDIVYDSLWGEEKQLAALS
>sp|Q8AWH3|SX17A_XENTR Transcription factor Sox-17-alpha OS=Xenopus tropicalis GN=sox17a PE=2 SV=1 Split=0
MSSPDGGYASDDQNQGKCSVPIMMTGLGQCQWAEPMNSLGEGKLKSDAGSANSRGKAEARIRRPMNAFMVWAKDERKRLAQQNPDLHNAELSKMLGKSWKALTLAEKRPFVEEAERLRVQHMQDHPNYKYRPRRRKQVKRMKRADTGFMHMAEPPESAVLGTDGRMCLESFSLGYHEQTYPHSQLPQGSHYREPQAMAPHYDGYSLPTPESSPLDLAEADPVFFTSPPQDECQMMPYSYNASYTHQQNSGASMLVRQMPQAEQMGQGSPVQGMMGCQSSPQMYYGQMYLPGSARHHQLPQAGQNSPPPEAQQMGRADHIQQVDMLAEVDRTEFEQYLSYVAKSDLGMHYHGQESVVPTADNGPISSVLSDASTAVYYCNYPSA
I got it! :D
OSarr = []
G = 0
for i in range(len(keyLabelrun)):
OSarr.append(keyLabelrun[G])
G += 1
if keyLabelrun[G].count('='):
while keyLabelrun[G].count('OS=') != 1:
G+=1
Maybe next time everyone, thank you!
Due to the syntax, you have to keep track of which part (OS, PE, etc) you're currently parsing. Here's a function to extract the species name from the FASTA header:
def extract_species(description):
species_parts = []
is_os = False
for word in description.split():
if word[:3] == 'OS=':
is_os = True
species_parts.append(word[3:])
elif '=' in word:
is_os = False
elif is_os:
species_parts.append(word)
return ' '.join(species_parts)
You can call it when processing your input file, e.g.:
from Bio import SeqIO
for record in SeqIO.parse('input.fa', 'fasta'):
species = extract_species(record.description)

Creating Frequency Matrix (counting number of items in array) in J

>text
┌───────────┬──────────┬───────────┬──────────┬──────────┬─────────┬──────────┬─────────────┬─────────────┬──────────┬───────────────┬──────────┬──────────┬────────────┬─────────────────┬──────────┬──────────┬──────────────┬─────────────┬─────────────┬────...
│speak │conceal │terribl │option │write │book │come │tuesdai │matter │act │conceal │catastroph│integr │depart │justic │put │wai │choic │realli │bad │opti...
├───────────┼──────────┼───────────┼──────────┼──────────┼─────────┼──────────┼─────────────┼─────────────┼──────────┼───────────────┼──────────┼──────────┼────────────┼─────────────────┼──────────┼──────────┼──────────────┼─────────────┼─────────────┼────...
│trump │logu │talk │entir │time │talk │entir │time │discov │someth │frequent │doe │logu │thi │direct │logu │direct │logu │differ │direct │cons...
├───────────┼──────────┼───────────┼──────────┼──────────┼─────────┼──────────┼─────────────┼─────────────┼──────────┼───────────────┼──────────┼──────────┼────────────┼─────────────────┼──────────┼──────────┼──────────────┼─────────────┼─────────────┼────...
│cohen │lawyer │object │taint │team │anoth │unusu │move │lawyer │trump │file │emerg │motion │court │sundai │night │sai │presid │object │extraordinari│meas...
├───────────┼──────────┼───────────┼──────────┼──────────┼─────────┼──────────┼─────────────┼─────────────┼──────────┼───────────────┼──────────┼──────────┼────────────┼─────────────────┼──────────┼──────────┼──────────────┼─────────────┼─────────────┼────...
│photo │presid │trump │fire │jame │comei │director │mai │did │mean │end │comei │time │public │memoir │higher │loyalti │releas │comei │featur │wide...
├───────────┼──────────┼───────────┼──────────┼──────────┼─────────┼──────────┼─────────────┼─────────────┼──────────┼───────────────┼──────────┼──────────┼────────────┼─────────────────┼──────────┼──────────┼──────────────┼─────────────┼─────────────┼────...
│british │deleg │organ │wrote │twitter │russia │syria │allow │access │douma │unfett │access │essenti │russia │syria │cooper │western │diplomat │confirm │syria │russ...
├───────────┼──────────┼───────────┼──────────┼──────────┼─────────┼
cleaned_text
┌─────┬───────┬───────┬──────┬─────┬────┬────┬───────┬──────┬───┬───────┬──────────┬──────┬──────┬──────┬───┬───┬─────┬──────┬───┬──────┬──────────┬──────┬────┬────┬────┬────────┬─────┬─────┬───────┬───────┬───────┬───────┬───┬─────┬───────┬────┬───────┬──...
│speak│conceal│terribl│option│write│book│come│tuesdai│matter│act│conceal│catastroph│integr│depart│justic│put│wai│choic│realli│bad│option│catastroph│option│hard│call│tell│congress│thing│chang│clinton│fervent│support│disagre│sai│least│philipp│rein│longtim│tr...
└─────┴───────┴───────┴──────┴─────┴────┴────┴───────┴──────┴───┴───────┴──────────┴──────┴──────┴──────┴───┴───┴─────┴──────┴───┴──────┴──────────┴──────┴────┴────┴────┴────────┴─────┴─────┴───────┴───────┴───────┴───────┴───┴─────┴───────┴────┴───────┴──...
each row of "text" is a news article, and I am trying to figure out the number of each vocab from cleaned_text in each article so that I can create a frequency matrix like this:
art1 art2 art3 ...
mai 4 5 4
sai 1 0 0
...
I am looking e. and E. verbs to count the number of each vocab in each article, but I am having a hard time to use them in this case.
Can anyone help me on this issue??? Thank you!
I would use a slightly different approach. To keep things simple, I will use the example of p
p
┌─────┬─────┬─────┬─────┬─────┐
│pants│shirt│shirt│hat │pants│
├─────┼─────┼─────┼─────┼─────┤
│shoes│shoes│socks│pants│shirt│
├─────┼─────┼─────┼─────┼─────┤
│shirt│hat │pants│shoes│shoes│
├─────┼─────┼─────┼─────┼─────┤
│socks│pants│shirt│shirt│hat │
├─────┼─────┼─────┼─────┼─────┤
│pants│shoes│shoes│socks│pants│
├─────┼─────┼─────┼─────┼─────┤
│shirt│shirt│hat │pants│shoes│
└─────┴─────┴─────┴─────┴─────┘
To get a count of each article of clothing I need to compare each row to the whole vocabulary. I get the whole vocabulary by ravelling (,) p and getting the nub (~.) This ensures that every possible word in p is accounted for.
~.#:,p
┌─────┬─────┬───┬─────┬─────┐
│pants│shirt│hat│shoes│socks│
└─────┴─────┴───┴─────┴─────┘
Now I will transpose (|:) p so that the I can compare each row to the nub using =/ and finish off with totalling up the sum across each item. +/#:
+/#:(|: =/ ~.#,)p
2 2 1 0 0
1 1 0 2 1
1 1 1 2 0
1 2 1 0 1
2 0 0 2 1
1 2 1 1 0
Reading these numbers against the nub I see the first row has 2-pants 2-shirts 1-hat 0-shoes and 0-socks and by inspection this is correct. The second row has 1-pant 1-shirt 0-hats 2-shoes and 1-sock and so on...
Hope this helps.

data.frame slicing

I hope this question is not too simple for this board.
I have created a data.frame df:
CAS Name CID
89 13010-47-4 Lomustine 3950
90 130209-82-4 Latanoprost 5311221,5282380,46705340,3890
91 130636-43-0 Nifekalant 268083
92 130929-57-6 Entacapone 5281081
and a vector vec
[1] 5282380 18471829 45923789 44308022 44266812 24883465 24867475 24867460
I would like to extract the rows of df which contains any number of vec. I tried to solve this problem by this code:
df$GC[(df$CID %in% vec)] = 1
df[df$GC==1,]
But the problem with this solution is, that I only get the rows, which contain only one number in the CID column. Rows which contain several values in CID like line 90 do not appear.
Is there an elegant solution for this problem?
Thanks in advance
Given your comment on EDi's answer (which I like) I thought I'd make a suggestion.
Squeezing comma separated values into a single column of a data frame is awkward and (in my experience) just leads to frustration. I often find it simpler to keep it in a separate data structure, a list:
dat <- read.table(text = " CAS Name CID
13010-47-4 Lomustine 3950
130209-82-4 Latanoprost 5311221,5282380,46705340,3890
130636-43-0 Nifekalant 268083
130929-57-6 Entacapone 5281081",sep = "",header = TRUE)
cid <- sapply(dat$CID,strsplit,",",USE.NAMES = FALSE)
In this form, things are often easier to work with:
ID <- c(5282380, 18471829, 45923789, 44308022, 44266812, 24883465, 24867475, 24867460, 3950)
dat[sapply(cid,function(x) {any(x %in% as.character(ID))}),]
CAS Name CID
1 13010-47-4 Lomustine 3950
2 130209-82-4 Latanoprost 5311221,5282380,46705340,3890
You can always use rownames in dat and the names of the list to keep each item straight, if you're worried about orderings changing.
(Also note that my anonymous function is assuming that ID will be found eventually by R's scoping rules; you can alter the function to pass in ID explicitly if you like.)
One way is to use grep():
> txt <- " CAS Name CID
+ 13010-47-4 Lomustine 3950
+ 130209-82-4 Latanoprost 5311221,5282380,46705340,3890
+ 130636-43-0 Nifekalant 268083
+ 130929-57-6 Entacapone 5281081
+ "
> con <- textConnection(txt)
> df <- read.table(con, header = TRUE)
> close(con)
> ID <- c(5282380, 18471829, 45923789, 44308022, 44266812, 24883465, 24867475, 24867460, 3950)
> grep(paste("\\b", ID, "\\b", sep="", collapse = "|"), dat$CID)
[1] 1 2

Resources