Sliding window over a string using python - python-3.x

I am working on a dataset as part of my course practice and am stuck on one step. I have done this in R, but I want to do the same in Python. I am fairly new to Python, so I need some help.
The data set has a column named 'Seq' with 5000+ records and another column named 'MainSeq' that contains each Seq value as a substring. I need to check that Seq is present in MainSeq at the given start position, and then print each letter of Seq together with the 7 letters before and after it. For example:
The column 'MainSeq' has the value 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'.
The column 'Seq' has the value JKLMNO.
The position of J is 10 and the position of O is 15 (counting from 1).
I need to create a new column that holds, for each letter from J to O, the 7 letters before it and the 7 letters after it, for a total length of 15:
CDEFGHI**J**KLMNOPQ
DEFGHIJ**K**LMNOPQR
EFGHIJK**L**MNOPQRS
FGHIJKL**M**NOPQRST
GHIJKLM**N**OPQRSTU
HIJKLMN**O**PQRSTUV
I know how to apply the logic to a single sequence, but since I have 5000+ Seq records, I need a way to apply it to all of them:
seq = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
i = seq.index('J')
j = seq.index('O')
value = 7
for mid in range(i, 1+j):
    print(seq[mid-value:mid+value+1])

I'm not sure this will do exactly what you want, since you haven't supplied much data to test with, but it might work or at least give you a start.
import pandas as pd
df = pd.DataFrame({'MainSeq':['ABCDEFGHIJKLMNOPQRSTUVWXYZ','ABCDEFGHIJKLMNOPQRSTUVWXYZ'], 'Seq':'JKLMNO'})
def get_sequences(seq, letters, value):
    # seq.index(letter) finds the first occurrence of each letter in seq
    sequences = [seq[seq.index(letter)-value:seq.index(letter)+value+1] for letter in letters]
    return sequences
df['new_seq'] = df.apply(lambda row : get_sequences(row['MainSeq'], row['Seq'], 7), axis = 1)
df = df.explode('new_seq')
print(df)
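One limitation of the snippet above is that seq.index(letter) returns the first occurrence of each letter, so a letter that repeats in MainSeq could yield the wrong window. A minimal sketch that locates the whole Seq once and then slides over its positions follows. The function name windows and the parameter flank are illustrative, and clamping the left edge at 0 is an assumption about how windows near the start of the string should behave.

```python
def windows(main_seq, sub_seq, flank=7):
    """Return one window per letter of sub_seq, centred on that letter."""
    start = main_seq.find(sub_seq)  # 0-based position of the first match, -1 if absent
    if start == -1:
        return []
    result = []
    for pos in range(start, start + len(sub_seq)):
        left = max(0, pos - flank)  # clamp so a negative index does not wrap around
        result.append(main_seq[left:pos + flank + 1])
    return result


for window in windows('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'JKLMNO'):
    print(window)
```

It can be applied per row in the same way as get_sequences, for example with df.apply(lambda row: windows(row['MainSeq'], row['Seq']), axis=1).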

Related

Data Extraction in Dataframe

I have dataframe like:
Names     Subsets   Subnames     SubNumber   Numbers
AE,AI,AK  OP,OP,DO  ABC,ABC,ABC  A-890,A891  9OP-A,98-OPB,8IC,87AC,58AP,7PL
AO,AI     DO,AP     KLM,ABC      P890        L97, 52PL
IK,LJ,MI  OP,OP,DO  IJK,IJK,OPQ  90AKI       87AU, 90OP,89JN
From a dataframe like this, for rows whose Subsets value is OP,OP,DO, I need the Numbers that correspond to the Names ending with I.
For example, the first row has the subset OP,OP,DO, and the name at index 1 (AI) ends with I, so the output I need is every number at index 1: 98-OPB,58AP. ("Every index 1" means that Names has three elements, so after index 2 in Numbers the index starts again from 0.)
Names     Subsets   Subnames     SubNumber   Numbers                         Output
AE,AI,AK  OP,OP,DO  ABC,ABC,ABC  A-890,A891  9OP-A,98-OPB,8IC,87AC,58AP,7PL  98-OPB,58AP
AO,AI     DO,AP     KLM,ABC      P890        L97, 52PL
IK,LJ,MI  OP,OP,DO  IJK,IJK,OPQ  90AKI       87AU, 90OP,89JN                 89JN
In the third row, MI is at index 2, so the number at index 2 is needed: 89JN.
Indexing starts from zero.
This is essentially a for loop, because you are dealing with the object dtype. You might be able to make some minor improvements, but I don't really see how to make any big gains off the top of my head -- this is some pretty messy "extraction" logic:
def extract(row):
    names = row.Names.split(",")
    numbers = row.Numbers.split(",")
    idxs = {i for i, name in enumerate(names) if name[-1] == "I"}
    return ",".join(num for i, num in enumerate(numbers) if i % len(names) in idxs)
Output:
>>> df["Output"] = df[df["Subsets"] == "OP,OP,DO"].apply(extract, axis=1)
>>> df
      Names   Subsets     Subnames   SubNumber                         Numbers       Output
0  AE,AI,AK  OP,OP,DO  ABC,ABC,ABC  A-890,A891  9OP-A,98-OPB,8IC,87AC,58AP,7PL  98-OPB,58AP
1     AO,AI     DO,AP      KLM,ABC        P890                        L97,52PL          NaN
2  IK,LJ,MI  OP,OP,DO  IJK,IJK,OPQ       90AKI                  87AU,90OP,89JN         89JN
If you don't want the NaN:
df["Output"] = df["Output"].fillna("")
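For reference, the selection rule can be checked on plain strings without a DataFrame. This is only a sketch: the helper name pick_numbers is hypothetical, and stripping whitespace around each item is an assumption made because some Numbers values in the question contain a space after the comma.

```python
def pick_numbers(names, numbers, suffix="I"):
    """Keep the numbers whose cyclic position matches a name ending in suffix."""
    name_list = [n.strip() for n in names.split(",")]
    number_list = [n.strip() for n in numbers.split(",")]
    # positions (0-based) of the names that end with the wanted letter
    wanted = {i for i, name in enumerate(name_list) if name.endswith(suffix)}
    # numbers repeat in blocks of len(name_list), hence the modulo
    return ",".join(
        num for i, num in enumerate(number_list) if i % len(name_list) in wanted
    )


first = pick_numbers("AE,AI,AK", "9OP-A,98-OPB,8IC,87AC,58AP,7PL")
third = pick_numbers("IK,LJ,MI", "87AU, 90OP,89JN")
print(first)
print(third)
```

The same function could be used row by row through df.apply with axis=1, passing row["Names"] and row["Numbers"].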

Looking for a specific combination algorithm to solve a problem

Let's say I have a purchase total and a CSV file full of purchases, where some of them make up that total and some don't. Is there a way to search the CSV to find the combination or combinations of purchases that make up that total? Say the purchase total is $155 and my CSV file has the purchases [$5.00, $40.00, $7.25, $100.00, $10.00]. Is there an algorithm that will tell me which combinations of purchases add up to the total?
Edit: I am still having trouble with the solution you provided. When I load this spreadsheet with pandas and feed it into your code snippet, it shows only one solution that adds up to $110.04 when there are three, as if it stops early without finding the remaining solutions. The output in my terminal is [57.25, 15.87, 13.67, 23.25]. The output should be [10.24, 37.49, 58.21, 4.1], [64.8, 45.24] and [57.25, 15.87, 13.67, 23.25].
from collections import namedtuple
import pandas
df = pandas.read_csv('purchases.csv',parse_dates=["Date"])
values = df["Purchase"].to_list()
S = 110.04
Candidate = namedtuple('Candidate', ['sum', 'lastIndex', 'path'])
tuples = [Candidate(0, -1, [])]
while len(tuples):
    next = []
    for (sum, i, path) in tuples:
        # you may range from i + 1 if you don't want repetitions of the same purchase
        for j in range(i+1, len(values)):
            v = values[j]
            # you may check for strict equality if no purchase is free (0$)
            if v + sum <= S:
                next.append(Candidate(sum = v + sum, lastIndex = j, path = path + [v]))
                if v + sum == S :
                    print(path + [v])
    tuples = next
A DP solution:
Let S be your goal sum.
Build all 1-combinations. Keep those whose sum is less than or equal to S. Whenever one equals S, output it.
Build all 2-combinations by reusing the previous ones.
Repeat.
from collections import namedtuple
values = [57.25,15.87,13.67,23.25,64.8,45.24,10.24,37.49,58.21,4.1]
S = 110.04
Candidate = namedtuple('Candidate', ['sum', 'lastIndex', 'path'])
tuples = [Candidate(0, -1, [])]
while len(tuples):
    next = []
    for (sum, i, path) in tuples:
        # ranging from i + 1 avoids reusing the same purchase
        for j in range(i + 1, len(values)):
            v = values[j]
            # you may check for strict equality if no purchase is free (0$)
            # the small margin stops float rounding from pruning a sum that equals S
            if v + sum <= S + 1e-2:
                next.append(Candidate(sum = v + sum, lastIndex = j, path = path + [v]))
                if abs(v + sum - S) <= 1e-2 :
                    print(path + [v])
    tuples = next
More detail about the tuple structure:
What we want to do is augment a tuple with a new value.
Assume we start with a tuple holding a single value, say the tuple associated with 40:
its sum is trivially 40
the last index added is 1 (the index of 40 itself)
the used values are [40], since it is the sole value
To generate the next tuples, we iterate from just after the last index (1) to the end of the array.
So the candidates are 7.25, 100.00 and 10.00.
The new tuple associated with 7.25 is:
sum: 40 + 7.25
last index: 2 (7.25 has index 2 in the array)
used values: the values of the old tuple plus 7.25, so [40, 7.25]
The purpose of the last index is to avoid considering both [7.25, 40] and [40, 7.25], which are the same combination.
So to generate tuples from an old one, only consider values occurring after it in the array.
At every step we thus have tuples of the same size; each holds the values taken, the sum they amount to, and the last index, which determines the next values to consider when extending it to a bigger size.
edit: to handle floats, replace the equality test v + sum == S with abs(v + sum - S) <= 1e-2, so that a solution counts as reached when the sum is very close to the target (the distance is set arbitrarily to 0.01 here).
edit2: the same code is at https://repl.it/repls/DrearyWindingHypertalk, which does give:
[64.8, 45.24]
[57.25, 15.87, 13.67, 23.25]
[10.24, 37.49, 58.21, 4.1]
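Another way to sidestep the floating-point comparison is to convert every amount to integer cents and compare exactly. The following is a brute-force sketch that uses itertools.combinations instead of the level-by-level search above. The name find_combinations is hypothetical, and the sketch assumes amounts with at most two decimal places and a list small enough to enumerate, since it examines every subset.

```python
from itertools import combinations


def find_combinations(values, target):
    """Return every subset of values whose sum equals target, compared in cents."""
    cents = [round(v * 100) for v in values]  # assumption: two decimal places at most
    target_cents = round(target * 100)
    found = []
    for size in range(1, len(values) + 1):
        for idxs in combinations(range(len(values)), size):
            if sum(cents[i] for i in idxs) == target_cents:
                found.append([values[i] for i in idxs])
    return found


purchases = [57.25, 15.87, 13.67, 23.25, 64.8, 45.24, 10.24, 37.49, 58.21, 4.1]
for combo in find_combinations(purchases, 110.04):
    print(combo)
```

Integer arithmetic makes the equality test exact, at the cost of visiting all 2^n subsets; for long purchase lists the pruned search above scales better.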

How can I optimise my code and make it readable?

The task is:
The user enters a number. You take one digit from the left and one from the right and sum them. Then you take the rest of the number and sum every digit in it. That gives two answers, which you sort from largest to smallest and join into a single number. I solved it, but I don't like how it looks: the task is pretty simple, but my code looks like trash. Maybe I should use more built-in functions and libraries. If so, could you please suggest some? Thank you.
a = int(input())
b = [int(i) for i in str(a)]
closesum = 0
d = []
e = ""
farsum = b[0] + b[-1]
print(farsum)
b.pop(0)
b.pop(-1)
print(b)
for i in b:
    closesum += i
print(closesum)
d.append(int(closesum))
d.append(int(farsum))
print(d)
for i in sorted(d, reverse = True):
    e += str(i)
print(int(e))
input()
You can use reduce
from functools import reduce
a = [0,1,2,3,4,5,6,7,8,9]
print(reduce(lambda x, y: x + y, a))
# 45
You can also pass in a shortened list instead of popping elements: b[1:-1]
The first two lines can become:
str_input = input() # input will always read strings
num_list = [int(i) for i in str_input]
The for loop at the end is unnecessary, and there is no need to sort only two elements. A simple if..else condition is enough to print what you want.
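Putting these suggestions together, the whole task fits in one small function. This is a sketch rather than a definitive rewrite: the name combine_sums is made up, it uses the built-in sum on a slice in place of reduce, and it assumes the input is a string of at least two digits.

```python
def combine_sums(number_text):
    """Join the outer-digit sum and the inner-digit sum, larger one first."""
    digits = [int(ch) for ch in number_text]
    far_sum = digits[0] + digits[-1]   # first digit plus last digit
    close_sum = sum(digits[1:-1])      # everything in between (0 if nothing is left)
    # compare as numbers, then concatenate the decimal representations
    if far_sum >= close_sum:
        return int(f"{far_sum}{close_sum}")
    return int(f"{close_sum}{far_sum}")


print(combine_sums("12345"))
```

Comparing the two sums as integers matters when they have different lengths: for "4555" the sums are 9 and 10, and the larger one has to come first.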
You don't need a loop to sum a slice of a list, and you can use join to concatenate a list of strings without looping. Sort the two sums as numbers first and convert them to strings afterwards with map(str, ...): sorting the strings instead would compare them character by character, so "9" would be placed ahead of "10".
farsum = b[0] + b[-1]
closesum = sum(b[1:-1])
"".join(map(str, sorted((farsum, closesum), reverse=True)))

Splitting the output obtained by Counter in Python and pushing it to Excel

I am using Counter to count every word in the descriptions of 20000 products and see how many times each word repeats; for example, 'pipette' repeats 1282 times. To do this I have split column A into the columns P, Q, R, S, T, U and V:
df["P"] = df["A"].str.split(n=10).str[0]
df["Q"] = df["A"].str.split(n=10).str[1]
df["R"] = df["A"].str.split(n=10).str[2]
df["S"] = df["A"].str.split(n=10).str[3]
df["T"] = df["A"].str.split(n=10).str[4]
df["U"] = df["A"].str.split(n=10).str[5]
df["V"] = df["A"].str.split(n=10).str[6]
This shows the split products.
I am then counting each of the columns individually and adding the counts to get the total number of words.
d = Counter(df['P'])
e = Counter(df['Q'])
f = Counter(df['R'])
g = Counter(df['S'])
h = Counter(df['T'])
i = Counter(df['U'])
j = Counter(df['V'])
m = d+e+f+g+h+i+j
print(m)
This is the image of the output I obtained using Counter.
Now I want to transfer the output into an Excel sheet with the keys in one column and the values in another.
Am I using the right method to do so? If yes, how do I push them into different columns?
Note: the length of each key is different.
I also want to convert all the items of column 'A' to lower case so that the counter does not count the same item twice. How should I go about it?
I've been learning Python for just a couple of months, but I'll give it a shot. I'm sure there are better ways to do this. Maybe we can both learn something from this question. Let me know how it turns out. Good luck.
import pandas as pd
num = len(m.keys())
# m is the combined Counter from the question
df = pd.DataFrame(columns=['Key', 'Value'])
for i,j,k in zip(range(num), m.keys(), m.values()):
    df.loc[i] = [j, k]
df.to_csv('Your_Project.csv')
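A shorter route, sketched under the assumption that the descriptions are available as a list of strings (the sample descriptions below are made up): lowercase and split each description, count all the words with a single Counter, and write the key/value pairs with the standard csv module. This also covers the lower-case part of the question.

```python
import csv
import io
from collections import Counter

# hypothetical sample data standing in for column 'A'
descriptions = ["Pipette tip 10ul", "pipette stand", "Glass Beaker"]

# lowercase first so that 'Pipette' and 'pipette' are counted together
counts = Counter(
    word
    for text in descriptions
    for word in text.lower().split()
)

# write the keys to one column and the values to another
buffer = io.StringIO()  # stands in for a file opened with open(..., "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["Key", "Value"])
writer.writerows(counts.most_common())  # most frequent words first
csv_text = buffer.getvalue()
print(csv_text)
```

With pandas, the same pairs could be loaded with pd.DataFrame(counts.most_common(), columns=['Key', 'Value']) and written with to_excel, which needs an Excel writer package such as openpyxl to be installed.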

Choosing minimum numbers from a given list to give a sum N (repetition allowed)

How do I find the minimum number of elements from a list that sum to a given number N?
For example, if list = [1,3,7,4] and N = 14, the function should return 2, as 7+7 = 14.
Again, if N = 11, the function should return 2, as 7+4 = 11. I think I have figured out the algorithm, but I am unable to implement it in code.
Please use Python, as that is the only language I understand (at present).
Sorry!
Since you mention dynamic programming in your question and say that you have figured out the algorithm, I will just include an implementation of the basic tabular method in Python without too much theory.
The idea is to use a table to compute every value we need without repeating the same computations many times.
For every target value from 1 up to the requested one, the basic formula tries each number in the list and keeps the smallest count that reaches that target.
It should work, but you can of course optimise it, for example by sorting the list and/or finding common divisors in order to build a smaller table and terminate faster.
Here is the code:
import sys
# num_list : list of numbers
# value: value for which we want to get the minimum number of addends
def min_sum(num_list, value):
    list_len = len(num_list)
    # We will use the typical dynamic programming table construct:
    # the index into the list is the sum value we want,
    # and the stored value is the
    # minimum number of items to sum
    # Base case value = 0, first element of the list is zero
    value_table = [0]
    # Initialize all table values to MAX
    # for range we use value+1 because python range doesn't include the end
    # number
    for i in range(1, value+1):
        value_table.append(sys.maxsize)
    # try every combination that is smaller than <value>
    for i in range(1, value+1):
        for j in range(0, list_len):
            if num_list[j] <= i:
                tmp = value_table[i-num_list[j]]
                if tmp != sys.maxsize and tmp + 1 < value_table[i]:
                    value_table[i] = tmp + 1
    # sys.maxsize is returned when no combination reaches <value>
    return value_table[value]
## TEST ##
num_list = [1,3,16,5,3]
value = 22
print("Min Sum: ", min_sum(num_list, value)) # Outputs 3
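The same table can also record which number was used at each step, so that the chosen elements can be read back and not just counted. This is a sketch under the assumption that the list holds positive integers; the names min_addends and choice are illustrative and do not come from the answer above.

```python
def min_addends(num_list, target):
    """Return a shortest list of numbers from num_list summing to target, or None."""
    INF = float("inf")
    best = [0] + [INF] * target       # best[t]: fewest addends that sum to t
    choice = [None] * (target + 1)    # choice[t]: last number used to reach t
    for total in range(1, target + 1):
        for num in num_list:
            if num <= total and best[total - num] + 1 < best[total]:
                best[total] = best[total - num] + 1
                choice[total] = num
    if best[target] == INF:
        return None                   # target cannot be reached
    picked = []
    while target > 0:                 # walk the recorded choices back to zero
        picked.append(choice[target])
        target -= choice[target]
    return picked


print(min_addends([1, 3, 7, 4], 14))
print(min_addends([1, 3, 7, 4], 11))
```

The length of the returned list is the count asked for in the question; when several shortest combinations exist, which one is returned depends on the order of the list.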
It would be helpful if you included your algorithm in pseudocode; it will look very much like Python :-)
Another aspect: your first operation is a multiplication of one item from the list (7) by one from outside the list (2), whereas the second operation is 7+4, with both values in the list.
Is there a limitation on which operation or which items to use (from within or outside the list)?

Resources