Python all possible combinations/permutation of X items and of length X+1 - python-3.x

I've been searching everywhere but can't find a thing for my issue.
Let's say I've got three numbers : ['1','2','3'].
I want, using itertool or not, all possible combinations/permutations with a length of 4 and I only want combinations containing these 3 numbers (I don't want '1111' or '1221' and so).
The wanted result would be like that :
1 2 3 1
1 1 2 3
2 2 3 1

from itertools import combinations_with_replacement as irep
res = [' '.join(x) for x in irep('123',4) if {'1','2','3'}.issubset(x)]
# output
# ['1123', '1223', '1233']
OR
from itertools import product
res= [' '.join(x) for x in product('123',repeat=4) if {'1','2','3'}.issubset(x)]
# output
# ['1123', '1132', '1213', '1223', '1231', '1232', '1233',
# '1312', '1321', '1322', '1323', '1332', '2113', '2123',
# '2131', '2132', '2133', '2213', '2231', '2311', '2312',
# '2313', '2321', '2331', '3112', '3121', '3122', '3123',
# '3132', '3211', '3212', '3213', '3221', '3231', '3312', '3321']

import itertools
elements = ['1', '2', '3']
permutations = [''.join(combination) for combination in
itertools.product(elements, repeat = 4) if
all(elem in combination for elem in elements)]
Does this produce what you are looking for?
The code produces the following output:
['1123', '1132', '1213', '1223', '1231', '1232', '1233', '1312', '1321', '1322', '1323', '1332', '2113', '2123', '2131', '2132', '2133', '2213', '2231', '2311', '2312', '2313', '2321', '2331', '3112', '3121', '3122', '3123', '3132', '3211', '3212', '3213', '3221', '3231', '3312', '3321']

Related

running for loop until arbitrary index (python 3.x)

So I have these strings that I split by spaces (' ') and I just rolled them into a single list I called 'keyLabelRun'
so it looks like this:
keyLabelRun[0-12]:
0 OS=Dengue
1 virus
2 3
3 PE=4
4 SV=1
5 Split=0
6
7 OS=Bacillus
8 subtilis
9 XF-1
10 GN=opuBA
11 PE=4
12 SV=1
I only want the elements that include and are after "OS=", anything else, whether it be "SV=" or "PE=" etc. I want to skip over those elements until I get to the next "OS="
The number of elements to the next "OS=" is arbitrary so that's where I'm having the problem.
This is what I'm currently trying:
OSarr = []
for i in range(len(keyLabelrun)):
if keyLabelrun[i].count('OS='):
OSarr.append(keyLabelrun[i])
if keyLabelrun[i+1].count('=') != 1:
continue
But the elements where "OS=" is not included is what is tripping me up I think.
Also at the end I'm going to join them all back together in their own elements but I feel like I will be able to handle that after this.
In my attempt, I am trying to append all elements I'm looking for in order to an new list 'OSarr'
If anyone can lend a hand, it would be much appreciated.
Thank you.
These list of strings came from a dataset that is a text file in the form:
>tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) OS=Dengue virus 3 PE=4 SV=1 Split=0
MNNQRKKTGKPSINMLKRVRNRVSTGSQLAKRFSKGLLNGQGPMKLVMAFIAFLRFLAIPPTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINKRKKTSLCLMMILPAALAFHLTSRDGEPRMIVGKNERGKSLLFKTASGINMCTLIAMDLGEMCDDTVTYKCPHITEVEPEDIDCWCNLTSTWVTYGTCNQAGEHRRDKRSVALAPHVGMGLDTRTQTWMSAEGAWRQVEKVETWALRHPGFTILALFLAHYIGTSLTQKVVIFILLMLVTPSMTMRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLCIEGKITNITTDSRCPTQGEATLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGLECSPRTGLDFNEMILLTMKNKAWMVHRQWFFDLPLPWTSGATTETPTWNRKELLVTFKNAHAKKQEVVVLGSQEGAMHTALTGATEIQNSGGTSIFAGHLKCRLKMDKLELKGMSYAMCTNTFVLKKEVSETQHGTILIKVEYKGEDVPCKIPFSTEDGQGKAHNGRLITANPVVTKKEEPVNIEAEPPFGESNIVIGIGDNALKINWYKKGSSIGKMFEATARGARRMAILGDTAWDFGSVGGVLNSLGKMVHQIFGSAYTALFSGVSWVMKIGIGVLLTWIGLNSKNTSMSFSCIAIGIITLYLGAVVQADMGCVINWKGKELKCGSGIFVTNEVHTWTEQYKFQADSPKRLATAIAGAWENGVCGIRSTTRMENLLWKQIANELNYILWENNIKLTVVVGDIIGVLEQGKRTLTPQPMELKYSWKTWGKAKIVTAETQNSSFIIDGPNTPECPSVSRAWNVWEVEDYGFGVFTTNIWLKLREVYTQLCDHRLMSAAVKDERAVHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTLWSNGVLESDMIIPKSLAGPISQHNHRPGYHTQTAGPWHLGKLELDFNYCEGTTVVITENCGTRGPSLRTTTVSGKLIHEWCCRSCTLPPLRYMGEDGCWYGMEIRPISEKEENMVKSLVSAGSGKVDNFTMGVLCLAILFEEVMRGKFGKKHMIAGVFFTFVLLLSGQITWRDMAHTLIMIGSNASDRMGMGVTYLALIATFKIQPFLALGFFLRKLTSRENLLLGVGLAMATTLQLPEDIEQMANGIALGLMALKLITQFETYQLWTALISLTCSNTIFTLTVAWRTATLILAGVSLLPVCQSSSMRKTDWLPMAVAAMGVPPLPLFIFGLKDTLKRRSWPLNEGVMAVGLVSILASSLLRNDVPMAGPLVAGGLLIACYVITGTSADLTVEKAADITWEEEAEQTGVSHNLMITVDDDGTMRIKDDETENILTVLLKTALLIVSGIFPYSIPATLLVWHTWQKQTQRSGVLWDVPSPPETQKAELEEGVYRIKQQGIFGKTQVGVGVQKEGVFHTMWHVTRGAVLTYNGKRLEPNWASVKKDLISYGGGWRLSAQWQKGEEVQVIAVEPGKNPKNFQTMPGTFQTTTGEIGAIALDFKPGTSGSPIINREGKVVGLYGNGVVTKNGGYVSGIAQTNAEPDGPTPELEEEMFKKRNLTIMDLHPGSGKTRKYLPAIVREAIKRRLRTLILAPTRVVAAEMEEALKGLPIRYQTTATKSEHTGREIVDLMCHATFTMRLLSPVRVPNYNLIIMDEAHFTDPASIAARGYISTRVGMGEAAAIFMTATPPGTADAFPQSNAPIQDEERDIPERSWNSGNEWITDFAGKTVWFVPSIKAGNDIANCLRKNGKKVIQLSRKTFDTEYQKTKLNDWDFVV
>tr|M4KW32|M4KW32_BACIU Choline ABC transporter (ATP-binding protein) OS=Bacillus subtilis XF-1 GN=opuBA PE=4 SV=1 Split=0
MLTLENVSKTYKGGKKAVNNVNLKIAKGEFICFIGPSGCGKTTTMKMINRLIEPSAGKIFIDGENIMDQDPVELRRKIGYVIQQIGLFPHMTIQQNISLVPKLLKWPEQQRKERARELLKLVDMGPEYVDRYPHELSGGQQQRIGVLRALAAEPPLILMDEPFGALDPITRDSLQEEFKKLQKTLHKTIVFVTHDMDEAIKLADRIVILKAGEIVQVGTPDDILRNPADEFVEEFIGKERLIQSSSPDVERVDQIMNTQPVTITADKTLSEAIQLMRQERVDSLLVVDDEHVLQGYVDVEIIDQCRKKANLIGEVLHEDIYTVLGGTLLRDTVRKILKRGVKYVPVVDEDRRLIGIVTRASLVDIVYDSLWGEEKQLAALS
>sp|Q8AWH3|SX17A_XENTR Transcription factor Sox-17-alpha OS=Xenopus tropicalis GN=sox17a PE=2 SV=1 Split=0
MSSPDGGYASDDQNQGKCSVPIMMTGLGQCQWAEPMNSLGEGKLKSDAGSANSRGKAEARIRRPMNAFMVWAKDERKRLAQQNPDLHNAELSKMLGKSWKALTLAEKRPFVEEAERLRVQHMQDHPNYKYRPRRRKQVKRMKRADTGFMHMAEPPESAVLGTDGRMCLESFSLGYHEQTYPHSQLPQGSHYREPQAMAPHYDGYSLPTPESSPLDLAEADPVFFTSPPQDECQMMPYSYNASYTHQQNSGASMLVRQMPQAEQMGQGSPVQGMMGCQSSPQMYYGQMYLPGSARHHQLPQAGQNSPPPEAQQMGRADHIQQVDMLAEVDRTEFEQYLSYVAKSDLGMHYHGQESVVPTADNGPISSVLSDASTAVYYCNYPSA
I got it! :D
OSarr = []
G = 0
for i in range(len(keyLabelrun)):
OSarr.append(keyLabelrun[G])
G += 1
if keyLabelrun[G].count('='):
while keyLabelrun[G].count('OS=') != 1:
G+=1
Maybe next time everyone, thank you!
Due to the syntax, you have to keep track of which part (OS, PE, etc) you're currently parsing. Here's a function to extract the species name from the FASTA header:
def extract_species(description):
species_parts = []
is_os = False
for word in description.split():
if word[:3] == 'OS=':
is_os = True
species_parts.append(word[3:])
elif '=' in word:
is_os = False
elif is_os:
species_parts.append(word)
return ' '.join(species_parts)
You can call it when processing your input file, e.g.:
from Bio import SeqIO
for record in SeqIO.parse('input.fa', 'fasta'):
species = extract_species(record.description)

How to find the shortest distance between two line segments capturing the sign values with python

I have a pandas dataframe of the form:
benchmark_x benchmark_y ref_point_x ref_point_y
0 525039.140 175445.518 525039.145 175445.539
1 525039.022 175445.542 525039.032 175445.568
2 525038.944 175445.558 525038.954 175445.588
3 525038.855 175445.576 525038.859 175445.576
4 525038.797 175445.587 525038.794 175445.559
5 525038.689 175445.609 525038.679 175445.551
6 525038.551 175445.637 525038.544 175445.577
7 525038.473 175445.653 525038.459 175445.594
8 525038.385 175445.670 525038.374 175445.610
9 525038.306 175445.686 525038.289 175445.626
I am trying to find the shortest distance from the line to the benchmark such that if the line is above the benchmark the distance is positive and if it is below the benchmark the distance is negative. See image below:
I used the KDTree from scipy like so:
from scipy.spatial import KDTree
tree=KDTree(df[["benchmark_x", "benchmark_y"]])
test = df.apply(lambda row: tree.query(row[["ref_point_x", "ref_point_y"]]), axis=1)
test=test.apply(pd.Series, index=["distance", "index"])
This seems to work except that it fails to capture the negative values as a result that the line is below the benchmark.
# recreating your example
columns = "benchmark_x benchmark_y ref_point_x ref_point_y".split(" ")
data = """525039.140 175445.518 525039.145 175445.539
525039.022 175445.542 525039.032 175445.568
525038.944 175445.558 525038.954 175445.588
525038.855 175445.576 525038.859 175445.576
525038.797 175445.587 525038.794 175445.559
525038.689 175445.609 525038.679 175445.551
525038.551 175445.637 525038.544 175445.577
525038.473 175445.653 525038.459 175445.594
525038.385 175445.670 525038.374 175445.610
525038.306 175445.686 525038.289 175445.626"""
data = [float(x) for x in data.replace("\n"," ").split(" ") if len(x)>0]
arr = np.array(data).reshape(-1,4)
df = pd.DataFrame(arr, columns=columns)
# adding your two new columns to the df
from scipy.spatial import KDTree
tree=KDTree(df[["benchmark_x", "benchmark_y"]])
df["distance"], df["index"] = tree.query(df[["ref_point_x", "ref_point_y"]])
Now to compare if one line is above the other or not, we have to evaluate y at the same x position. Therefore we need to interpolate the y points for the x positions of the other line.
df = df.sort_values("ref_point_x") # sorting is required for interpolation
xy_refpoint = df[["ref_point_x", "ref_point_y"]].values
df["ref_point_y_at_benchmark_x"] = np.interp(df["benchmark_x"], xy_refpoint[:,0], xy_refpoint[:,1])
And finally your criterium can be evaluated and applied:
df["distance"] = np.where(df["ref_point_y_at_benchmark_x"] < df["benchmark_y"], -df["distance"], df["distance"])
# or change the < to <,>,<=,>= as you wish

Keras Prediction result (getting score,use of argmax)

I am trying to use the elmo model for text classification for my own dataset. The training is completed and the number of classes is 4(used keras model and elmo embedding).In the prediction, I got a numpy array. I am attaching the sample code and the result below...
import tensorflow as tf
import keras.backend as K
new_text_pr = np.array(data, dtype=object)[:, np.newaxis]
with tf.Session() as session:
K.set_session(session)
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer())
model_elmo = build_model(classes)
model_elmo.load_weights(model+"/"+elmo_model)
import time
t = time.time()
predicted = model_elmo.predict(new_text_pr)
print("time: ", time.time() - t)
print(predicted)
# print(predicted[0][0])
print("result:",np.argmax(predicted[0]))
return np.argmax(predicted[0])
when I print the predicts variable I got this.
time: 1.561854362487793
[[0.17483692 0.21439584 0.24001297 0.3707543 ]
[0.15607062 0.24448264 0.4398888 0.15955798]
[0.06494818 0.3439018 0.42254424 0.16860574]
[0.08343349 0.37218323 0.32528472 0.2190985 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.0365712 0.4194748 0.3321385 0.21181548]
[0.05350104 0.18225929 0.56712115 0.19711846]
[0.08343349 0.37218323 0.32528472 0.2190985 ]
[0.09541835 0.19085276 0.41069734 0.30303153]
[0.03930932 0.40526104 0.45785302 0.09757669]
[0.06377257 0.33980298 0.32396355 0.27246094]
[0.09784496 0.2292052 0.44426462 0.22868524]
[0.06089798 0.31685832 0.47317514 0.14906852]
[0.03956613 0.46605557 0.3502095 0.14416872]
[0.10513227 0.26166025 0.36598155 0.26722598]
[0.15165758 0.22900137 0.50939053 0.10995051]
[0.06377257 0.33980298 0.32396355 0.27246094]
[0.11404029 0.21311268 0.46880838 0.2040386 ]
[0.07556026 0.20502563 0.52019936 0.19921473]
[0.11096822 0.23295449 0.36192006 0.29415724]
[0.05018891 0.16656907 0.60114646 0.18209551]
[0.08880813 0.2893545 0.44374797 0.1780894 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.09596984 0.18282187 0.5053091 0.2158991 ]
[0.09428936 0.13995855 0.62395805 0.14179407]
[0.10513227 0.26166025 0.36598155 0.26722598]
[0.08244281 0.15743142 0.5462735 0.21385226]
[0.07199708 0.2446867 0.44568574 0.23763043]
[0.1339082 0.27288827 0.43478844 0.15841508]
[0.07354636 0.24499843 0.44873005 0.23272514]
[0.08880813 0.2893545 0.44374797 0.1780894 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.08924995 0.36547357 0.40014726 0.14512917]
[0.05132649 0.28190497 0.5224545 0.14431408]
[0.06377257 0.33980292 0.32396355 0.27246094]
[0.04849219 0.36724472 0.39698333 0.1872797 ]
[0.07206573 0.31368822 0.4667826 0.14746341]
[0.05948553 0.28048623 0.41831577 0.2417125 ]
[0.07582933 0.18771031 0.54879296 0.18766735]
[0.03858965 0.20433436 0.5596278 0.19744818]
[0.07443814 0.20681688 0.3933627 0.32538226]
[0.0639974 0.23687115 0.5357675 0.16336392]
[0.11005415 0.22901568 0.4279426 0.23298755]
[0.12625505 0.22987585 0.31619486 0.32767424]
[0.08893713 0.14554602 0.45740074 0.30811617]
[0.07906891 0.18683094 0.5214609 0.21263924]
[0.06316617 0.30398315 0.4475617 0.185289 ]
[0.07060979 0.17987429 0.4829593 0.26655656]
[0.0720717 0.27058697 0.41439256 0.24294883]
[0.06377257 0.33980292 0.32396355 0.27246094]
[0.04745338 0.25831962 0.46751252 0.22671448]
[0.06624557 0.20708969 0.54820716 0.17845756]]
result:3
Anyone have any idea about what is the use of taking the 0th index value only. Considering this as a list of lists 0th index means first list and the argmax returns index the maximum value from the list. Then what is the use of other values in the lists?. Why isn't it considered?. Also is it possible to get the score from this? I hope the question is clear. Is it the correct way or is it wrong?
I have found the issue. just posting it others who met the same problem.
Answer: When predicting with Elmo model, it expects a list of strings. In code, the prediction data were split and the model predicted for each word. That's why I got this huge array. I have used a temporary fix. The data is appended to a list then an empty string is also appended with the list. The model will predict the both list values but I took only the first predicted data. This is not the correct way but I have done this as a quick fix and hoping to find a fix in the future
To find the predicted class for each test example, you need to use axis=1. So, in your case the predicted classes will be:
>>> predicted_classes = predicted.argmax(axis=1)
>>> predicted_classes
[3 2 2 1 2 1 2 1 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
2 2 2 2 2 2 3 2 2 2 2 2 1 2 2]
Which means that the first test example belongs to the third class, and the second test example belongs to the second class and so on.
The previous part answers your question (I think), now let's see what the np.argmax(predicted) does. Using np.argmax() alone without specifying the axis will flatten your predicted matrix and get the argument of the maximum number.
Let's see this simple example to know what I mean:
>>> x = np.matrix(np.arange(12).reshape((3,4)))
>>> x
matrix([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> x.argmax()
11
11 is the index of the 11 which is the biggest number in the whole matrix.

Removing a num from a string in a list of numbers

An example like:
print("1.", get_list_nums_without_9([589775, 677017, 34439, 48731548, 782295632, 181967909]))
print("2.", get_list_nums_without_9([6162, 29657355, 5485406, 422862350, 74452, 480506, 2881]))
print("3.", get_list_nums_without_9([292069010, 73980, 8980155, 921545108, 75841309, 6899644]))
print("4.", get_list_nums_without_9([]))
nums = [292069010, 73980, 8980155, 21545108, 7584130, 688644, 644908219, 44281, 3259, 8527361, 2816279, 985462264, 904259, 3869, 609436333, 36915, 83705, 405576, 4333000, 79386997]
print("5.", get_list_nums_without_9(nums))
I'm trying to get number list without 9. If the list is empty or if all of the numbers in the list contain the digit 9, the function should return an empty list. I tried the function below, it doesn't work.
def get_list_nums_without_9(a_list):
j=0
for i in a_list:
a_list[j]=i.rstrip(9)
j+=1
return a_list
expected:
1. [677017, 48731548]
2. [6162, 5485406, 422862350, 74452, 480506, 2881]
3. []
4. []
5. [21545108, 7584130, 688644, 44281, 8527361, 83705, 405576, 4333000]
your lists contain integers. To remove the ones containing 9 the best way is to test if 9 belongs to the number as string and rebuild the output using a list comprehension with a conditional.
(Besides, rstrip removes the trailing chars of a string. Not suitable at all for your problem)
def get_list_nums_without_9(a_list):
return [x for x in a_list if "9" not in str(x)]
testing with your data:
>>> numbers = [
... [589775, 677017, 34439, 48731548, 782295632, 181967909],
... [6162, 29657355, 5485406, 422862350, 74452, 480506, 2881],
... [292069010, 73980, 8980155, 921545108, 75841309, 6899644],
... [],
... [292069010, 73980, 8980155, 21545108, 7584130, 688644, 644908219,
... 44281, 3259, 8527361, 2816279, 985462264, 904259, 3869, 609436333,
... 36915, 83705, 405576, 4333000, 79386997]
... ]
>>> for i, l in enumerate(numbers, 1):
... print("{}. {}".format(i, get_list_nums_without_9(l)))
1. [677017, 48731548]
2. [6162, 5485406, 422862350, 74452, 480506, 2881]
3. []
4. []
5. [21545108, 7584130, 688644, 44281, 8527361, 83705, 405576, 4333000]
You don't want to replace anything in the list, you want to generate a new list based on what you detect in the old list:
def get_list_nums_without_9(numbers):
sans_9 = []
for number in numbers:
if '9' not in str(number):
sans_9.append(number)
return sans_9
This will get you an empty list if there are no valid numbers.

python - cannot make corr work

I'm struggling with getting a simple correlation done. I've tried all that was suggested under similar questions.
Here are the relevant parts of the code, the various attempts I've made and their results.
import numpy as np
import pandas as pd
try01 = data[['ESA Index_close_px', 'CCMP Index_close_px' ]].corr(method='pearson')
print (try01)
Out:
Empty DataFrame
Columns: []
Index: []
try04 = data['ESA Index_close_px'][5:50].corr(data['CCMP Index_close_px'][5:50])
print (try04)
Out:
**AttributeError: 'float' object has no attribute 'sqrt'**
using numpy
try05 = np.corrcoef(data['ESA Index_close_px'],data['CCMP Index_close_px'])
print (try05)
Out:
AttributeError: 'float' object has no attribute 'sqrt'
converting the columns to lists
ESA_Index_close_px_list = list()
start_value = 1
end_value = len (data['ESA Index_close_px']) +1
for items in data['ESA Index_close_px']:
ESA_Index_close_px_list.append(items)
start_value = start_value+1
if start_value == end_value:
break
else:
continue
CCMP_Index_close_px_list = list()
start_value = 1
end_value = len (data['CCMP Index_close_px']) +1
for items in data['CCMP Index_close_px']:
CCMP_Index_close_px_list.append(items)
start_value = start_value+1
if start_value == end_value:
break
else:
continue
try06 = np.corrcoef(['ESA_Index_close_px_list','CCMP_Index_close_px_list'])
print (try06)
Out:
****TypeError: cannot perform reduce with flexible type****
Also tried .astype but not made any difference.
data['ESA Index_close_px'].astype(float)
data['CCMP Index_close_px'].astype(float)
Using Python 3.5, pandas 0.18.1 and numpy 1.11.1
Would really appreciate any suggestion.
**edit1:*
Data is coming from an excel spreadsheet
data = pd.read_excel('C:\\Users\\Ako\\Desktop\\ako_files\\for_corr_‌​tool.xlsx') prior to the correlation attempts, there are only column renames and
data = data.drop(data.index[0])
to get rid of a line
regarding the types:
print (type (data['ESA Index_close_px']))
print (type (data['ESA Index_close_px'][1]))
Out:
**edit2*
parts of the data:
print (data['ESA Index_close_px'][1:10])
print (data['CCMP Index_close_px'][1:10])
Out:
2 2137
3 2138
4 2132
5 2123
6 2127
7 2126.25
8 2131.5
9 2134.5
10 2159
Name: ESA Index_close_px, dtype: object
2 5241.83
3 5246.41
4 5243.84
5 5199.82
6 5214.16
7 5213.33
8 5239.02
9 5246.79
10 5328.67
Name: CCMP Index_close_px, dtype: object
Well, I've encountered the same problem today.
try use .astype('float64') to help make the type correct.
data['ESA Index_close_px'][5:50].astype('float64').corr(data['CCMP Index_close_px'][5:50].astype('float64'))
This works well for me. Hope it can help you as well.
You can try as following:
Top15['Citable docs per capita']=(Top15['Citable docs per capita']*100000)
Top15['Citable docs per capita'].astype('int').corr(Top15['Energy Supply per Capita'].astype('int'))
It worked for me.

Resources