Data Extraction in Dataframe - python-3.x

I have dataframe like:
Names Subsets Subnames SubNumber Numbers
AE,AI,AK OP,OP,DO ABC,ABC,ABC A-890,A891 9OP-A,98-OPB,8IC,87AC,58AP,7PL
AO,AI DO,AP KLM,ABC P890 L97, 52PL
IK,LJ,MI OP,OP,DO IJK,IJK,OPQ 90AKI 87AU, 90OP,89JN
From a dataframe like this,
For subsets with OP,OP,DO I need Numbers for Names ending with I.
for eg. As First row has subset OP,OP,DO and the name with first index has letter I at end. So Numbers with every first index 98-OPB,58AP is the output I need. (Every first index means, there are three elements in Name. So after the second index in numbers, again zeroth index starts)
Names Subsets Subnames SubNumber Numbers Output
AE,AI,AK OP,OP,DO ABC,ABC,ABC A-890,A891 9OP-A,98-OPB,8IC,87AC,58AP,7PL 98-OPB,58AP
AO,AI DO,AP KLM,ABC P890 L97, 52PL
IK,LJ,MI OP,OP,DO IJK,IJK,OPQ 90AKI 87AU, 90OP,89JN 89JN
In the third row, MI is second index, hence number with second index are needed 89JN here.
Indexing starting from zero.

This is essentially a for loop, because you are dealing with the object dtype. You might be able to make some minor improvements, but I don't really see how to make any big gains off the top of my head -- this is some pretty messy "extraction" logic:
def extract(row):
names = row.Names.split(",")
numbers = row.Numbers.split(",")
idxs = {i for i, name in enumerate(names) if name[-1] == "I"}
return ",".join(num for i, num in enumerate(numbers) if i % len(names) in idxs)
Output:
>>> df["Output"] = df[df["Subsets"] == "OP,OP,DO"].apply(extract, axis=1)
>>> df
Names Subsets Subnames SubNumber Numbers Output
0 AE,AI,AK OP,OP,DO ABC,ABC,ABC A-890,A891 9OP-A,98-OPB,8IC,87AC,58AP,7PL 98-OPB,58AP
1 AO,AI DO,AP KLM,ABC P890 L97,52PL NaN
2 IK,LJ,MI OP,OP,DO IJK,IJK,OPQ 90AKI 87AU,90OP,89JN 89JN
If you don't want the NaN:
df["Output"] = df["Output"].fillna("")

Related

I need the code to stop after break and it should not print max(b)

Rahul was learning about numbers in list. He came across one word ground of a number.
A ground of a number is defined as the number which is just smaller or equal to the number given to you.Hence he started solving some assignments related to it. He got struck in some questions. Your task is to help him.
O(n) time complexity
O(n) Auxilary space
Input Description:
First line contains two numbers ‘n’ denoting number of integers and ‘k’ whose ground is to be check. Next line contains n space separated numbers.
Output Description:
Print the index of val.Print -1 if equal or near exqual number
Sample Input :
7 3
1 2 3 4 5 6 7
Sample Output :
2
`
n,k = 7,3
a= [1,2,3,4,5,6,7]
b=[]
for i in range(n):
if k==a[i]:
print(i)
break
elif a[i]<k:
b.append(i)
print(max(b))
`
I've found a solution, you can pour in if you've any different views
n,k = 7,12
a= [1,2,3,4,5,6,7]
b=[]
for i in range(n):
if k==a[i]:
print(i)
break
elif a[i]<k:
b.append(i)
else:
print(max(b))
From what I understand, these are the conditions to your question,
If can find number, print the number and break
If cannot find number, get the last index IF it's less than value k
Firstly, it's unsafe to manually input the length of iteration for your list, do it like this:
k = 3
a= [1,7,2,2,5,1,7]
finalindex = 0
for i, val in enumerate(a):
if val==k:
finalindex = i #+1 to index because loop will end prematurely
break
elif val>=k:
continue
finalindex = i #skip this if value not more or equal to k
print(finalindex) #this will either be the index of value found OR if not found,
#it will be the latest index which value lesser than k
Basically, you don't need to print twice. because it's mutually exclusive. Either you find the value, print the index or if you don't find it you print the latest index that is lesser than K. You don't need max() because the index only increases, and the latest index would already be your max value.
Another thing that I notice, if you use your else statement like in your answer, if you have two elements in your list that are larger than value K, you will be printing max() twice. It's redundant
else:
print(max(b))

Sliding window over a string using python

I am working on a dataset as a part of my course practice and am stuck in a particular step. I have tried that using R, but I wish to do the same in python. I am comparatively new to python and so require help.
The data set consists of a column with name 'Seq' with seq(5000+) records. I have another column of name 'MainSeq' that contains the substring seq values in it. I need to check the presence of seq on MainSeq based on the start position given and then print 7 letters before and after each letter of the seq. i.e.
I have a a value in col 'MainSeq' as 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'.
Col 'Seq' contains value JKLMNO
Start Position of J= 10 and O= 15
I need to create a new column such that it takes 7 letters before and after the start letter from J till O i.e. having a total length of 15
CDEFGHI**J**KLMNOPQ
DEFGHIJ**K**LMNOPQR
EFGHIJK**L**MNOPQRS
FGHIJKL**M**NOPQRST
GHIJKLM**N**OPQRSTU
HIJKLMN**O**PQRSTUV
I know to apply the logic on a specific seq. But since I have around 5000+ seq records, I need to figure out a way to apply the same on all the seq records.
seq = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
i = seq.index('J')
j = seq.index('O')
value = 7
for mid in range(i, 1+j):
print(seq[mid-value:mid+value+1])
I'm not sure this will do exactly what you want, you've not really supplied a lot of data to test with, but it might work or at least give you a start.
import pandas as pd
df = pd.DataFrame({'MainSeq':['ABCDEFGHIJKLMNOPQRSTUVWZYZ','ABCDEFGHIJKLMNOPQRSTUVWZYZ'], 'Seq':'JKLMNO'})
def get_sequences(seq, letters, value):
sequences = [seq[seq.index(letter)-value:seq.index(letter)+value+1] for letter in letters]
return sequences
df['new_seq'] = df.apply(lambda row : get_sequences(row['MainSeq'], row['Seq'], 7), axis = 1)
df = df.explode('new_seq')
print(df)

Grabbing the most duplicated letter from a string

What I want to get accomplished is an algorithm that finds the most duplicated letter from the entire list of strings. I'm new to Python so its taken me roughly two hours to get to this stage. The problem with my current code is that it returns every duplicated letter, when I'm only looking for the most duplicated letter. Additionally, I would like to know of a faster way that doesn't use two for loops.
Code:
rock_collections = ['aasdadwadasdadawwwwwwwwww', 'wasdawdasdasdAAdad', 'WaSdaWdasSwd', 'daWdAWdawd', 'QaWAWd', 'fAWAs', 'fAWDA']
seen = []
dupes = []
for words in rock_collections:
for letter in words:
if letter not in seen:
seen.append(letter)
else:
dupes.append(letter)
print(dupes)
If you are looking for the letter which appears the greatest number of times, I would recommend the following code:
def get_popular(strings):
full = ''.join(strings)
unique = list(set(full))
return max(
list(zip(unique, map(full.count, unique))), key=lambda x: x[1]
)
rock_collections = [
'aasdadwadasdadawwwwwwwwww',
'wasdawdasdasdAAdad',
'WaSdaWdasSwd',
'daWdAWdawd',
'QaWAWd',
'fAWAs',
'fAWDA'
]
print(get_popular(rock_collections)) # ('d', 19)
Let me break down the code for you:
full contains each of the strings together with without any letters between them. set(full) produces a set, meaning that it contains every unique letter only once. list(set(full)) makes this back into a list, meaning that it retains order when you iterate over the elements in the set.
map(full.count, unique) iterates over each of the unique letters and counts how many there are of them in the string. zip(unique, ...) puts those numbers with their respective letters. key=lambda x: x[1] is a way of saying, don't take the maximum value of the tuple, instead take the maximum value of the second element of the tuple (which is the number of times the letter appears). max finds the most common letter, using the aforementioned key.

How to multiply 2 values in list of (3) numbers and letters

I have a list(or it can be a dictionary):
A = [
['soda',9,3],
['cake',56,6],
['beer',17,10],
['candies',95,8],
['sugar',21,20]
]
And i need to find a multiply of last 2 values in each sublist and sum up this:
9*3+56*6+17*10+95*8+21*20
How can i do this?
It's a very basic question and has a really simple answer. Until you are sure that the format is the same, the following code will help you:
result = 0
for sub_list in A:
result += sub_list[-1] * sub_list[-2]
The result variable will store the result you want. sub_result is one of the sublists in A in each iteration.
sub_list[-1] is the last element of sub-list and `sub_list[-2] is the element before that.

How to get list of indices for elements whose value is the maximum in that list

Suppose I have a list l=[3,4,4,2,1,4,6]
I would like to obtain a subset of this list containing the indices of elements whose value is max(l).
In this case, list of indices will be [1,2,5].
I am using this approach to solve a problem where, a list of numbers are provided, for example
l=[1,2,3,4,3,2,2,3,4,5,6,7,5,4,3,2,2,3,4,3,4,5,6,7]
I need to identify the max occurence of an element, however in case more than 1 element appears the same number of times,
I need to choose the element which is greater in magnitude,
suppose I apply a counter on l and get {1:5,2:5,3:4...}, I have to choose '2' instead of '1'.
Please suggest how to solve this
Edit-
The problem begins like this,
1) a list is provided as an input
l=[1 4 4 4 5 3]
2)I run a Counter on this to obtain the counts of each unique element
3)I need to obtain the key whose value is maximum
4)Suppose the Counter object contains multiple entries whose value is maximum,
as in Counter{1:4,2:4,3:4,5:1}
I have to choose 3 as the key whose value is 4.
5)So far, I have been able to get the Counter object, I have seperated key/value lists using k=counter.keys();v=counter.values()
6)I want to get the indices whose values are max in v
If I run v.index(max(v)), I get the first index whose value matches max value, but I want to obtain the list of indices whose value is max, so that I can obtain corresponding list of keys and obtain max key in that list.
With long lists, using NumPy or any other linear algebra would be helpful, otherwise you can simply use either
l.index(max(l))
or
max(range(len(l)),key=l)
These however return only one of the many argmax's.
So for your problem, you can choose to reverse the array, since you want the maximum that appears later as :
len(l)-l[::-1].index(max(l))-1
If I understood correctly, the following should do what you want.
from collections import Counter
def get_largest_most_freq(lst):
c = Counter(lst)
# get the largest frequency
freq = max(c.values())
# get list of all the values that occur _max times
items = [k for k, v in c.items() if v == freq]
# return largest most frequent item
return max(items)
def get_indexes_of_most_freq(lst):
_max = get_largest_most_freq(lst)
# get list of all indexes that have a value matching _max
return [i for i, v in enumerate(lst) if v == _max]
>>> lst = [3,4,4,2,1,4,6]
>>> get_largest_most_freq(lst)
4
>>> get_indexes_of_most_freq(lst)
[1, 2, 5]
>>> lst = [1,2,3,4,3,2,2,3,4,5,6,7,5,4,3,2,2,3,4,3,4,5,6,7]
>>> get_largest_most_freq(lst)
3
>>> get_indexes_of_most_freq(lst)
[2, 4, 7, 14, 17, 19]

Resources