Python: Create DataFrame from highest frequency of occurrence per item - python-3.x

I have a dataframe as given below:
import pandas as pd
data = {
    'Code': ['P', 'J', 'M', 'Y', 'P', 'Z', 'P', 'P', 'J', 'P', 'J', 'M', 'P', 'Z', 'Y', 'M', 'Z', 'J', 'J'],
    'Value': [10, 10, 20, 30, 10, 40, 50, 10, 10, 20, 10, 50, 60, 40, 30, 20, 40, 20, 10]
}
example = pd.DataFrame(data)
Using Python 3, I want to create another dataframe from the dataframe example that holds, for each Value, the Code that occurs most often with it.
The new dataframe should look like solution below:
output = {'Code': ['J', 'M', 'Y', 'Z', 'P', 'M'], 'Value': [10, 20, 30, 40, 50, 50]}
solution = pd.DataFrame(output)
As can be seen, J is associated with Value 10 more often than any other Code, so J is selected, and so on.

You could define a function that returns the most frequently occurring items and apply it to each group. Finally, explode the resulting lists into rows.
>>> from collections import Counter
>>> def most_occurring(grp):
...     res = Counter(grp)
...     highest = max(res.values())
...     return [k for k, v in res.items() if v == highest]
...
>>> example.groupby('Value')['Code'].apply(most_occurring).explode().reset_index()
   Value Code
0     10    J
1     20    M
2     30    Y
3     40    Z
4     50    P
5     50    M
6     60    P

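If I understood correctly, you need something like this, which counts how often each (Code, Value) pair occurs and sorts the pairs by that count:
# Map each (Code, Value) pair to the row positions where it occurs
grouped = example.groupby(['Code', 'Value']).indices
# One row per pair: Code, Value and the number of occurrences
arr_tmp = [[code, value, len(rows)] for (code, value), rows in grouped.items()]
output = pd.DataFrame(data=arr_tmp, columns=['Code', 'Value', 'index_count'])
output = output.sort_values(by=['index_count'], ascending=False)
output.reset_index(inplace=True)
output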
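As a rough sketch that stays entirely in pandas, the two ideas above (count each pair, then keep the most frequent Code per Value) can be combined with groupby, size and transform. The names counts, is_top and top_codes are made up for this illustration. Ties are kept, so Value 50 yields both M and P, and Value 60 is returned as well.

```python
import pandas as pd

data = {
    'Code': ['P', 'J', 'M', 'Y', 'P', 'Z', 'P', 'P', 'J', 'P', 'J', 'M', 'P', 'Z', 'Y', 'M', 'Z', 'J', 'J'],
    'Value': [10, 10, 20, 30, 10, 40, 50, 10, 10, 20, 10, 50, 60, 40, 30, 20, 40, 20, 10],
}
example = pd.DataFrame(data)

# Count how often each (Value, Code) pair occurs
counts = example.groupby(['Value', 'Code']).size().reset_index(name='count')

# Keep every pair whose count equals the maximum count within its Value
is_top = counts['count'] == counts.groupby('Value')['count'].transform('max')
top_codes = counts.loc[is_top, ['Code', 'Value']].reset_index(drop=True)
print(top_codes)
```

Because groupby sorts its keys by default, the tied codes for Value 50 come out in alphabetical order (M before P) rather than in order of first appearance.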

Related

Make a list inside the dictionary in python

I have a data frame like the one below, and I want to turn it into a dictionary of lists. Could you please help me get the expected output?
You can use the handy groupby function in Pandas:
df = pd.DataFrame({
    'Department': ['y1', 'y1', 'y1', 'y2', 'y2', 'y2'],
    'Section': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Cost': [10, 20, 30, 40, 50, 60]
})
output = {dept: group['Cost'].tolist() for dept, group in df.groupby('Department')}
gives
{'y1': [10, 20, 30], 'y2': [40, 50, 60]}
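A possible alternative, sketched under the assumption that only the Cost column is needed: aggregate each group into a list and convert the resulting Series to a dictionary. The frame is the same example as above.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['y1', 'y1', 'y1', 'y2', 'y2', 'y2'],
    'Section': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Cost': [10, 20, 30, 40, 50, 60],
})

# Collect the Cost values of each Department into a list,
# then turn the resulting Series into a plain dictionary
output = df.groupby('Department')['Cost'].agg(list).to_dict()
print(output)
```

This avoids the explicit loop over groups, at the cost of being a little less obvious to readers who are new to agg.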

Remove items in between 2 indexes Python

How could I remove all items in between 2 indexes in a list/tuple?
e.g 'abcdefghijklmnop' with begin = 4 and end = 7 should result in 'abcdhijklmnop' ('efg' removed)
You can use list slicing:
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = a[:3] + a[7:]
print(b)
The result is [1, 2, 3, 8]
Try this:
ip = '123456789'
begin = 3
end = 6
res = ip[:begin]+ip[end:]
output:
123789
You can use list slicing as below:
li = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
del li[4:7]
print(li)
output:
['a', 'b', 'c', 'd', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p']
Since the example in your post is a string, you can use string slicing in that case:
s = 'abcdefghijklmnop'
start = 4
end = 7
s = s[:start] + s[end:]
print(s)
output:
abcdhijklmnop
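The slicing answers above all follow the same pattern, so they could be wrapped in one small helper. This is only a sketch: remove_between is a hypothetical name, and it assumes begin is the first index to drop and end is the first index to keep, as in the question's example.

```python
def remove_between(seq, begin, end):
    """Return a copy of seq without the items at positions begin .. end - 1."""
    # Slicing and + behave the same way for str, list and tuple
    return seq[:begin] + seq[end:]


print(remove_between('abcdefghijklmnop', 4, 7))        # abcdhijklmnop
print(remove_between([1, 2, 3, 4, 5, 6, 7, 8], 3, 7))  # [1, 2, 3, 8]
print(remove_between(('a', 'b', 'c', 'd'), 1, 3))      # ('a', 'd')
```

Unlike del li[4:7], this leaves the original sequence untouched, which is the only option for immutable strings and tuples.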

How do I order double list of elements of this type: [[1,2,3], [a,b,c]]?

I have a double list of this type: dl = [[13, 22, 41], ['c', 'b', 'a']], in which each element dl[0][i] corresponds to the value dl[1][i] at the same index. How can I sort the list using the dl[0] values as the sort key while keeping both sublists linked? The sublists are a kind of 'linked data', so after sorting the whole parent list by the values of the first sublist, each dl[0][i] must still be paired with its original dl[1][i].
I expect something like:
input: dl = [ [14,22,7,17], ['K', 'M', 'F','A'] ]
output: dl = [ [7, 14, 17, 22], ['F', 'K', 'A', 'M'] ]
This was way too much fun to write. I don't doubt that this function can be greatly improved, but this is what I've gotten in a very short amount of time and should get you started.
I've included some tests just so you can verify that this does indeed do what you want.
from unittest import TestCase, main

def sort_by_first(data):
    sorted_data = []
    for seq in data:
        zipped_to_first = zip(data[0], seq)
        sorted_by_first = sorted(zipped_to_first)
        unzipped_data = zip(*sorted_by_first)
        sorted_data.append(list(tuple(unzipped_data)[1]))
    return sorted_data

class SortByFirstTestCase(TestCase):
    def test_sort(self):
        output_1 = sort_by_first([[1, 3, 5, 2, 4], ['a', 'b', 'c', 'd', 'e']])
        self.assertEqual(output_1, [[1, 2, 3, 4, 5], ['a', 'd', 'b', 'e', 'c']])
        output_2 = sort_by_first([[9, 1, 5], [21, 22, 23], ['spam', 'foo', 'bar']])
        self.assertEqual(output_2, [[1, 5, 9], [22, 23, 21], ['foo', 'bar', 'spam']])

if __name__ == '__main__':
    main()
Updated for what you're looking for: this is a selection sort with one extra line that applies the same swap to the second list so that it stays matched to the first.
for i in range(len(dl[0])):
    min_idx = i
    for j in range(i+1, len(dl[0])):
        if dl[0][min_idx] > dl[0][j]:
            min_idx = j
    dl[0][i], dl[0][min_idx] = dl[0][min_idx], dl[0][i]
    dl[1][i], dl[1][min_idx] = dl[1][min_idx], dl[1][i]
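You can also solve this with a for loop. Sorting each sublist on its own would break the pairing, so work out the sorted order of the first sublist once and apply it to every sublist:
dl = [ [3,2,1], ['c', 'b', 'a'] ]
# Positions of dl[0] in ascending order of their values
order = sorted(range(len(dl[0])), key=lambda i: dl[0][i])
for i in range(0,len(dl)):
    dl[i] = [dl[i][j] for j in order]
print(dl)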
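A shorter route, offered as a sketch rather than the definitive answer, is to zip the sublists into pairs, sort the pairs by their first element, and zip them back. The names pairs, sorted_pairs and dl_sorted are illustrative; the input is the example from the question.

```python
dl = [[14, 22, 7, 17], ['K', 'M', 'F', 'A']]

# Pair up the items that share an index: (14, 'K'), (22, 'M'), ...
pairs = zip(*dl)

# Sort the pairs by the value that came from the first sublist only
sorted_pairs = sorted(pairs, key=lambda pair: pair[0])

# Transpose back into one list per original sublist
dl_sorted = [list(column) for column in zip(*sorted_pairs)]
print(dl_sorted)  # [[7, 14, 17, 22], ['F', 'K', 'A', 'M']]
```

Using a key on the first element means the second sublist never takes part in comparisons, so its items do not have to be orderable, and equal keys keep their original relative order because sorted is stable.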

How to aggregate string length sequence based on an indicator sequence

I have a dictionary with two keys and their values are lists of strings.
I want to calculate the string lengths of one list based on an indicator in the other list.
It's difficult to frame the question in words, so let's look at an example.
Here is an example dictionary:
thisdict = {
    'brand': ['Ford','bmw','toyota','benz','audi','subaru','ferrari','volvo','saab'],
    'type': ['O','B','O','B','I','I','O','B','B']
}
Now, I want to add an item to the dictionary that holds the cumulative string length of the "brand" sequence, conditioned on the "type" sequence.
Here are the criteria:
If type = 'O', set the string length to 0 for that index.
If type = 'B', set the string length to the length of the corresponding string.
If type = 'I', things get complicated: look back along the sequence and sum the string lengths up to and including the nearest preceding 'B'.
Here is an example output:
thisdict = {
    'brand': ['Ford','bmw','toyota','benz','audi','subaru','ferrari','volvo','saab'],
    'type': ['O','B','O','B','I','I','O','B','B'],
    'cumulative-length': [0,3,0,4,8,14,0,5,4]
}
where 8=len(benz)+len(audi) and 14=len(benz)+len(audi)+len(subaru)
Note that in the real data I'm working on, a single 'B' can be followed by an arbitrary number of 'I's, i.e. ['B','I','I','I','I','I','I',...,'O'], so I'm looking for a solution that is robust in that situation.
Thanks
You can use the zip function to tie the brand and type together. Then keep a running total as you loop through the paired values. This solution supports a series of any length and strings of any length in the brand list. I am assuming that len(thisdict['brand']) == len(thisdict['type']).
thisdict = {
    'brand': ['Ford','bmw','toyota','benz','audi','subaru','ferrari','volvo','saab'],
    'type': ['O','B','O','B','I','I','O','B','B']
}
lengths = []
running_total = 0
for b, t in zip(thisdict['brand'], thisdict['type']):
    if t == 'O':
        lengths.append(0)
    elif t == 'B':
        running_total = len(b)
        lengths.append(running_total)
    elif t == 'I':
        running_total += len(b)
        lengths.append(running_total)
print(lengths)
# [0, 3, 0, 4, 8, 14, 0, 5, 4]
Generating random data:
import random
import string

def get_random_brand_and_type():
    n = random.randint(1,8)
    b = ''.join(random.choice(string.ascii_uppercase) for _ in range(n))
    t = random.choice(['B', 'I', 'O'])
    return b, t

thisdict = {
    'brand': [],
    'type': []
}
for i in range(random.randint(1,20)):
    b, t = get_random_brand_and_type()
    thisdict['brand'].append(b)
    thisdict['type'].append(t)
yields the following result:
{'type': ['B', 'B', 'O', 'I', 'B', 'O', 'O', 'I', 'O'],
 'brand': ['O', 'BSYMLFN', 'OF', 'SO', 'KPQGRW', 'DLCWW', 'VLU', 'ZQE', 'GEUHERHE']}
[1, 7, 0, 9, 6, 0, 0, 9, 0]
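To store the result under the 'cumulative-length' key that the question asks for, the loop can be wrapped in a function. This is a sketch with one labelled assumption: an 'O' resets the running total, so an 'I' that follows an 'O' without a 'B' in between starts again from zero. The question does not define that case, and the loop above instead carries the earlier total forward, as the random run shows. The name cumulative_lengths is hypothetical.

```python
def cumulative_lengths(brands, types):
    """Return the running string lengths described in the question."""
    lengths = []
    running_total = 0
    for brand, kind in zip(brands, types):
        if kind == 'B':
            running_total = len(brand)   # a 'B' starts a new run
        elif kind == 'I':
            running_total += len(brand)  # an 'I' extends the current run
        else:
            running_total = 0            # assumption: 'O' (or anything else) ends the run
        lengths.append(running_total)
    return lengths


thisdict = {
    'brand': ['Ford', 'bmw', 'toyota', 'benz', 'audi', 'subaru', 'ferrari', 'volvo', 'saab'],
    'type': ['O', 'B', 'O', 'B', 'I', 'I', 'O', 'B', 'B'],
}
thisdict['cumulative-length'] = cumulative_lengths(thisdict['brand'], thisdict['type'])
print(thisdict['cumulative-length'])  # [0, 3, 0, 4, 8, 14, 0, 5, 4]
```

On the question's example both variants give the same list; they differ only for sequences such as ['B', 'O', 'I'].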

Determining if all items in a range are equal to a specific value

This question is related to the tic-tac-toe problem in Python. Let's say I have a list: my_list = ['X', 'O', 'X', 'O', 'X', '-', 'O', 'X', 'X']. I want to determine whether all the items in range(0, 2), range(3, 5) or range(6, 8) are equal to 'X'. So far I have tried the following, but I get a syntax error:
my_list = ['X', 'O', 'X', 'O', 'X', '-', 'O', 'X', 'X']
for i in range(0, 3):
    if all(board[i]) == 'X':
        print('X is the winner')
    elif all(board[i]) == 'Y':
        print('Y is the winner')
The problem really stems from setting up the range on the second line, but I also feel I am not using the all function correctly. Can you shed light on my mistake here? Side note: I also want to check whether the items at indices [0, 3, 6], [1, 4, 7] and [2, 5, 8] (the "columns"), as well as the diagonals at indices [0, 4, 8] and [6, 4, 2], are all of a specific value.
Listing the winner indices explicitly works:
my_list = ['X', 'O', 'X', 'O', 'X', '-', 'O', 'X', 'X']
winner_indices = [[0, 1, 2], [3, 4, 5], [6, 7, 8],
                  [0, 3, 6], [1, 4, 7], [2, 5, 8],
                  [0, 4, 8], [6, 4, 2]]
no_winner = True
for indices in winner_indices:
    selected = [my_list[index] for index in indices]
    for party in ['X', 'O']:
        if all(item == party for item in selected):
            print('{} is the winner'.format(party))
            no_winner = False
if no_winner:
    print('nobody wins')
You are not considering all winning combinations when deciding the winner.
The approach I am going to use generates the grid of winning combinations in a generic way, which helps if you want to scale to larger boards.
My solution uses the numpy package. Install it if you don't already have it.
import numpy as np
from itertools import chain
# Define the size of your grid
n_dim = 3
# Input list from players
my_list = ['X', 'O', 'X', 'O', 'X', '-', 'O', 'X', 'X']
# Generate an n_dim*n_dim grid of indices and reshape it into rows and columns
grid = np.arange(n_dim*n_dim).reshape(n_dim, n_dim)
# Get all rows and columns out of the grid, as they are all winning combinations
grid_l = list(chain.from_iterable((grid[i].tolist(), grid[:,i].tolist()) for i in range(n_dim)))
# Get the forward diagonal
grid_l.append(grid.diagonal().tolist())
# Get the reverse diagonal
grid_l.append(np.diag(np.fliplr(grid)).tolist())
# Check if any player's combination matches any winning combination
result = [i for i in grid_l if all(my_list[k] == 'X' for k in i) or all(my_list[k] == 'O' for k in i)]
# result: [[0, 4, 8]]
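If installing numpy is not an option, the same winning combinations can be generated with plain index arithmetic. This is a minimal sketch; winning_lines and find_winner are hypothetical helper names, and the board is assumed to be a flat list of length n * n stored row by row.

```python
def winning_lines(n):
    """Return every row, column and diagonal of an n x n board as index lists."""
    rows = [[r * n + c for c in range(n)] for r in range(n)]
    cols = [[r * n + c for r in range(n)] for c in range(n)]
    diagonals = [
        [i * n + i for i in range(n)],            # top-left to bottom-right
        [i * n + (n - 1 - i) for i in range(n)],  # top-right to bottom-left
    ]
    return rows + cols + diagonals


def find_winner(board, n=3, players=('X', 'O')):
    """Return the first player who fills a whole line, or None if nobody does."""
    for line in winning_lines(n):
        for player in players:
            if all(board[i] == player for i in line):
                return player
    return None


my_list = ['X', 'O', 'X', 'O', 'X', '-', 'O', 'X', 'X']
print(find_winner(my_list))  # X
```

For the 3 x 3 case this produces the same eight index lists as the explicit winner_indices table, with the second diagonal written as [2, 4, 6] instead of [6, 4, 2].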

Resources