I have a list of strings [abc1, abc2, abc3, xyz3, xyz4]
Out of the elements with the same string preceding the number, I need to keep just the string with the highest number in my output list. So out of abc1, abc2 and abc3, the string abc3 should be selected. Out of xyz3 and xyz4, xyz4 should be kept.
So the final list should contain [abc3, xyz4].
I've been thinking about how to solve this problem for the past 2 days, and after unsuccessfully trying out some approaches I am still in the dark about how it can be done. I would greatly appreciate any help on this.
This function does what you need.
Step 1: each item is split into two parts, the string prefix and the number. Note that the code takes only the last character as the number, so it assumes single-digit suffixes.
Step 2: if the prefix already exists in the dictionary, its stored value is compared to the current item's number. If the stored value is smaller, it is replaced by the current number.
Otherwise, the number is saved in the dictionary under that prefix.
Finally, the dictionary is turned back into a list.
def split(items):
    biggest = dict()
    for i in items:
        string = i[:-1]
        number = int(i[-1])
        if string in biggest:
            if biggest[string] < number:
                biggest[string] = number
        else:
            biggest[string] = number
    return [k + str(v) for k, v in biggest.items()]
x = ['abc1', 'abc2', 'abc3', 'xyz3', 'xyz4']
print(split(x))
output :
['abc3', 'xyz4']
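If the trailing numbers can have more than one digit (for example abc10), slicing off only the last character no longer works. The following is a minimal sketch of one way to handle that case, assuming every item is a prefix followed by a run of digits. The name keep_highest and the choice to skip items without a trailing number are illustrative assumptions, not part of the original answer.

```python
import re

def keep_highest(items):
    """Keep, for each prefix, the item with the largest trailing number."""
    best = {}  # prefix -> (number, original item)
    for item in items:
        # Lazy prefix followed by all trailing digits, e.g. 'abc10' -> ('abc', '10').
        match = re.fullmatch(r"(.*?)(\d+)", item)
        if match is None:
            continue  # assumption: items without a trailing number are skipped
        prefix, number = match.group(1), int(match.group(2))
        if prefix not in best or number > best[prefix][0]:
            best[prefix] = (number, item)
    return [item for _, item in best.values()]

print(keep_highest(['abc1', 'abc2', 'abc3', 'xyz3', 'xyz4']))  # ['abc3', 'xyz4']
print(keep_highest(['abc9', 'abc10', 'xyz3']))                 # ['abc10', 'xyz3']
```

Comparing the numbers as integers rather than as strings is what makes abc10 beat abc9.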
I am a beginner in python and have encountered the following problem: I have a long list of strings (I took 3 now for the example):
ENSEMBL_IDs = ['ENSG00000040608',
'ENSG00000070371',
'ENSG00000070413']
which are partial matches of the data in column 0 of my DataFrame genes_df (first 3 entries shown):
genes_list = (['ENSG00000040608.28', 'RTN4R'],
['ENSG00000070371.91', 'CLTCL1'],
['ENSG00000070413.17', 'DGCR2'])
genes_df = pd.DataFrame(genes_list)
The task I want to perform is conceptually not that difficult: I want to compare each element of ENSEMBL_IDs to genes_df.iloc[:,0] (which are partial matches: each element of ENSEMBL_IDs is contained within column 0 of genes_df, as outlined above). If the element of ENSEMBL_IDs matches the element in genes_df.iloc[:,0] (which it does, apart from the extra numbers after the period ".XX"), I want to return the "corresponding" value that is stored in column 1 of the genes_df DataFrame: the actual gene name, 'RTN4R' as an example.
I want to store these in a list. So, in the end, I would be left with a list like follows:
`genenames = ['RTN4R', 'CLTCL1', 'DGCR2']`
Some info that might be helpful: all of the entries in ENSEMBL_IDs are unique, and all of them are for sure contained in column 0 of genes_df.
I think I am looking for something along the lines of:
`genenames = []
for i in ENSEMBL_IDs:
    if i in genes_df.iloc[:,0]:
        genenames.append(# corresponding value in genes_df.iloc[:,1])`
I am sorry if the question has been asked before; I kept looking and was not able to find a solution that was applicable to my problem.
Thank you for your help!
Thanks also for the edit, English is not my first language, so the improvements were insightful.
You can get rid of the part after the dot (with str.extract or str.replace) before matching the values with isin:
m = genes_df[0].str.extract('([^.]+)', expand=False).isin(ENSEMBL_IDs)
# or
m = genes_df[0].str.replace(r'\..*$', '', regex=True).isin(ENSEMBL_IDs)
out = genes_df.loc[m, 1].tolist()
Or use a regex with str.match:
pattern = '|'.join(ENSEMBL_IDs)
m = genes_df[0].str.match(pattern)
out = genes_df.loc[m, 1].tolist()
Output: ['RTN4R', 'CLTCL1', 'DGCR2']
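Note that the isin and str.match masks return the gene names in the row order of genes_df, not in the order of ENSEMBL_IDs. If the order of ENSEMBL_IDs matters, one option is a dictionary lookup. The sketch below works on the plain genes_list from the question rather than on the DataFrame, and the name name_by_id is only illustrative.

```python
ENSEMBL_IDs = ['ENSG00000040608', 'ENSG00000070371', 'ENSG00000070413']
genes_list = (['ENSG00000040608.28', 'RTN4R'],
              ['ENSG00000070371.91', 'CLTCL1'],
              ['ENSG00000070413.17', 'DGCR2'])

# Map each ID, with its ".XX" version suffix removed, to the gene name.
name_by_id = {versioned_id.split('.')[0]: name
              for versioned_id, name in genes_list}

# Look the names up in the order given by ENSEMBL_IDs.
genenames = [name_by_id[i] for i in ENSEMBL_IDs]
print(genenames)  # ['RTN4R', 'CLTCL1', 'DGCR2']
```

This relies on the statement in the question that every entry of ENSEMBL_IDs is present in the data; a missing ID would raise a KeyError.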
I have the following string below, and I want to get all the values before the equal sign and in a list.
asaa=tcp:192.168.40.1:1119 dsae=tcp:192.168.40.2:1115 dem=tcp:192.168.40.3:1117 ape=tcp:192.168.40.4:1116
Result should be:
asaa
dsae
dem
ape
Any help would be appreciated. I've been trying a couple of different things, but I can't get it into a list, nor can I get the rest of the values.
s = 'asaa=tcp:192.168.40.1:1119 dsae=tcp:192.168.40.2:1115 dem=tcp:192.168.40.3:1117 ape=tcp:192.168.40.4:1116'
parts = s.split()
result = [part.split('=')[0] for part in parts]
print(result)
# ['asaa', 'dsae', 'dem', 'ape']
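Since the question also mentions getting at the rest of the values, here is a small sketch that keeps both sides of each pair in a dictionary. It assumes every space-separated part contains at least one equals sign; the name pairs is illustrative.

```python
s = 'asaa=tcp:192.168.40.1:1119 dsae=tcp:192.168.40.2:1115 dem=tcp:192.168.40.3:1117 ape=tcp:192.168.40.4:1116'

# Split on whitespace, then split each part once at the first '='.
pairs = dict(part.split('=', 1) for part in s.split())

keys = list(pairs)
print(keys)           # ['asaa', 'dsae', 'dem', 'ape']
print(pairs['asaa'])  # tcp:192.168.40.1:1119
```

Passing 1 as the second argument to split keeps any later equals signs inside the value.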
I'm trying to find the number of words in this table:
Download Table here: http://www.mediafire.com/file/m81vtdo6bdd7bw8/Table_RandomInfoMiddle.mat/file
Words are indicated by the "Type" criterion being "letters". The key thing to notice is that not everything in the table is a word, and that the entry "<missing>" registers as a word. In other words, I need to determine the number of words by counting only "letters" entries, except where the entry is "<missing>".
Here is my attempt (Yet unsuccessful - Notice the two mentions of "Problem area"):
for col=1:size(Table_RandomInfoMiddle,2)
    column_name = sprintf('Words count for column %d',col);
    MiddleWordsType_table.(column_name) = nnz(ismember(Table_RandomInfoMiddle(:,col).Variables,{'letters'}));
    MiddleWordsExclusionType_table.(column_name) = nnz(ismember(Table_RandomInfoMiddle(:,col).Variables,{'<missing>'})); %Problem area
end
%Call data from table
MiddleWordsType = table2array(MiddleWordsType_table);
MiddleWordsExclusionType = table2array(MiddleWordsExclusionType_table); %Problem area
%Take out zeros where "Type" was
MiddleWordsTotal_Nr = MiddleWordsType(MiddleWordsType~=0);
MiddleWordsExclusionTotal_Nr = MiddleWordsExclusionType(MiddleWordsExclusionType~=0);
%Final answer
FinalMiddleWordsTotal_Nr = MiddleWordsTotal_Nr-MiddleWordsExclusionTotal_Nr;
Any help will be appreciated. Thank you!
You can get the unique values from column 1 when column 2 satisfies some condition using
MiddleWordsType = numel( unique( ...
Table_RandomInfoMiddle{ismember(Table_RandomInfoMiddle{:,2}, 'letters'), 1} ) );
<missing> is a keyword in a categorical array, not literally the string "<missing>". That's why it appears blue and italicised in the workspace. If you want to check specifically for missing values, you can use this instead of ismember:
ismissing( Table_RandomInfoMiddle{:,1} )
I previously posted this question as "using “.between” for string values not working in python", but I was not clear enough and could not edit it, so I am reposting it here with more detail.
I have a DataFrame. In [0,61] I have a string. In [0,69] I have a string. I want to slice all the data in cells [0,62:68] between these two, merge them, and paste the result into [1,61]. Subsequently, [0,62:68] will be blank, but that is not important.
However, I have several hundred documents, and I want to write a script that runs on all of them. The strings in [0,61] and [0,69] are always present in all the documents, but at different locations in that column. So I tried using:
For_Paste = df[0][df[0].between('DESCRIPTION OF WORK / STATEMENT OF WORK', 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION', inclusive = False)]
But the output I get is: Series([], Name: 0, dtype: object)
I was expecting a list or array with the desired data that I could merge and paste. Thanks.
If you want to select the rows between two index labels (say idx_start and idx_end, excluding these two rows) on column col of the dataframe df, keep in mind that .loc slicing includes both endpoints. With the default integer index, you will want to use
df.loc[idx_start + 1 : idx_end - 1, col]
To find the first index matching a string s, use
idx = df.index[df[col] == s][0]
So for your case, to return a Series of the rows between these two indices, try the following:
start_string = 'DESCRIPTION OF WORK / STATEMENT OF WORK'
end_string = 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION'
idx_start = df.index[df[0] == start_string][0]
idx_end = df.index[df[0] == end_string][0]
For_Paste = df.loc[idx_start + 1 : idx_end - 1, 0]  # .loc includes both endpoints, so stop one row before idx_end
I'm making a game of hangman. I use a list to keep track of the word that you are guessing for, and a list of blanks that you fill in. But I can't figure out what to do if for example someone's word was apple, and I guessed p.
My immediate thought was to just find out whether a letter is in the word twice, then figure out where it is, and when they guess that letter, put it in both the first and second spot where that letter is. But I can't find:
How to test if two strings are duplicates in a list, and
If I were to use list.index to test where the duplicate letters are, how do I find both positions instead of just one.
Create a string for your word
Create a string for user input
Cut your string into letters and keep it on a list/array
Get input
Cut input into letters and keep it on another array
Create a string = "--------" as displayed message
Using a for loop check every position in both array lists and compare them
If yourArray[i] == inputArray[i]
Then set displayedString[i] = inputArray[i], display the message, and get another input
If it doesn't match, leave the "-" signs
Display the "---a--b" string
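The steps above can be sketched in Python for the single-letter case from the question (the word apple and the guess p). This is a minimal sketch, not a full game: the function name reveal is hypothetical, and it compares every letter of the word against one guessed letter instead of comparing two arrays position by position.

```python
def reveal(word, display, guess):
    """Return a new display list with every occurrence of guess filled in."""
    return [letter if letter == guess else shown
            for letter, shown in zip(word, display)]

word = list("apple")
display = ["-"] * len(word)  # one blank per letter

display = reveal(word, display, "p")
print("".join(display))  # -pp--
```

Because every position is checked, a repeated letter is filled in everywhere it occurs, and no separate duplicate test is needed.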
One way to do it would be to go through the list one by one and check if something comes up twice.
def isDuplicate(myList):
    a = []
    index = 0
    for item in myList:
        if type(item) == str:
            if item in a:
                return index
            else:
                a.append(item)
        index += 1
    return False
This function goes through the list and adds what it has seen so far into another list. Each time it also checks if the item it is looking at is already in that list, meaning it has already been seen before. If it gets through the whole list without any duplicates, it returns False.
It also keeps track of the index it is on, so it can return that index if it does find a duplicate.
Alternatively, if you want to find multiple occurrences of a given string, you would use the same structure with some modifications.
def isDuplicate(myList, query):
    index = 0
    foundIndexes = []
    for item in myList:
        if item == query:
            foundIndexes.append(index)
        index += 1
    return foundIndexes
This would return a list of the indexes of all instances of query in myList.
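The same idea can be written more compactly with enumerate, which yields each index together with its item. This is only an equivalent sketch; the name find_all is illustrative.

```python
def find_all(my_list, query):
    """Return the indexes of every occurrence of query in my_list."""
    return [i for i, item in enumerate(my_list) if item == query]

print(find_all(list("apple"), "p"))  # [1, 2]
print(find_all(list("apple"), "z"))  # []
```

Unlike list.index, which stops at the first match, this collects every matching position.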