I want to arrange the list of strings with a certain condition - python-3.x

I want to sort a list of strings alphabetically, but with the condition that strings starting with 'x' come first. For example, the input is list=['apple','pear','xanadu','stop'].
I'm sure I need to pass some condition to the sort function, but I'm not sure what to put.
list2=[]
string=input("Enter a string:")
list2.append(string)
while string!="stop":
    string=input("Enter a string:")
    list2.append(string)
list2.remove("stop")
print("Your list is:",list2)
print("Sorted list:",sorted(list2))
I want the output to be list=['xanadu','apple','pear']. I removed the 'stop' btw.

Use the key argument to determine the ordering of the elements:
>>> sorted(['apple','pear','xanadu','stop'], key=lambda val: (0, val) if val.startswith('x') else (1, val))
['xanadu', 'apple', 'pear', 'stop']
The lambda means the following:
lambda val: (           # determine the ordering of the element `val`
    (0, val)            # make the algorithm compare tuples!
    if val.startswith('x')
    else (1, val)       # use default alphabetical ordering otherwise
)
Since we're now comparing tuples (but ordering by the actual values within each group), tuples whose first element is 0 will always sort before those whose first element is 1, so the strings starting with 'x' come out first.
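A slightly shorter key (my variant, not from the original answer) exploits the fact that False sorts before True, so no conditional is needed:
words = ['apple', 'pear', 'xanadu', 'stop']
# not s.startswith('x') is False for the 'x' strings, and False sorts before True
print(sorted(words, key=lambda s: (not s.startswith('x'), s)))
# ['xanadu', 'apple', 'pear', 'stop']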

Related

How can I count all deleted elements when I convert a list into a set (Python)

I need to make a function which will take a list or tuple and convert it into a set. Since a set contains no duplicate elements, I need to report the number of all deleted elements. This is my code:
def find_type(arg):
    if isinstance(arg, list):
        arg = set(arg)
        return arg
    elif isinstance(arg, tuple):
        a = list(arg)
        return set(a)

print(find_type((1, 2, 2, 3)))
and the output:
{1, 2, 3}
The function works; I just do not know how to count and report the number of deleted elements.
You can use the following relation:
number1 = len(arg)
number2 = len(find_type(arg))
number = number1-number2
The number of deleted elements is simply the difference between the length of the list/tuple and the length of the set.
Also, you don't need to check whether arg is an instance of list or tuple (and even if you do, do it in one conditional): set() accepts any iterable (lists, tuples, strings, etc.).
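Putting both points together, a minimal sketch of the whole function (the name count_deleted is mine, for illustration):
def count_deleted(arg):
    # set() accepts any iterable, so no isinstance checks are needed
    return len(arg) - len(set(arg))

print(count_deleted((1, 2, 2, 3)))  # 1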

How can I logically test the output of a np.where result?

I was trying to scan an array for values and take action depending on the result. However, when I had a closer look at what the code was doing I noticed that my logical condition was ill-posed.
I will illustrate what I mean with the following example:
#importing numpy
import numpy as np
#creating a test array
a = np.zeros((3,3))
#searching items bigger than 1 in 'a'
index = np.where(a > 1)
I was expecting my index to return an empty list. In fact it returns a tuple object, like:
index
Out[5]: (array([], dtype=int64), array([], dtype=int64))
So, the test I was imposing:
#testing if there are values
#in 'a' that fulfil the where condition
if index[0] != []:
    print('Values found.')
#testing if there are no values
#in 'a' that fulfil the where condition
if index[0] == []:
    print('No values found.')
Will not achieve its purpose because I was comparing different objects (is that correct to say?).
So what is the correct way to create this test?
Thanks for your time!
For your 2D array, np.where returns a tuple of arrays of indices (one for each axis), so that a[index] gives you an array of the elements fulfilling the condition.
Indeed, you compared an empty list to an empty array. Instead, I would compare the size property (or e.g. len()) of the first element of this tuple:
if index[0].size == 0:
    print('No values found.')
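For completeness, a minimal runnable sketch covering both branches, using the same test array as above:
import numpy as np

a = np.zeros((3, 3))
index = np.where(a > 1)

if index[0].size == 0:
    print('No values found.')
else:
    # a[index] selects exactly the elements fulfilling the condition
    print('Values found:', a[index])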

How to get list of indices for elements whose value is the maximum in that list

Suppose I have a list l=[3,4,4,2,1,4,6]
I would like to obtain the indices of the elements whose value is max(l).
In this case, the list of indices will be [1,2,5].
I am using this approach to solve a problem where a list of numbers is provided, for example
l=[1,2,3,4,3,2,2,3,4,5,6,7,5,4,3,2,2,3,4,3,4,5,6,7]
I need to identify the element with the maximum number of occurrences; however, in case more than one element appears the same number of times, I need to choose the element which is greater in magnitude. Suppose I apply a Counter on l and get {1:5, 2:5, 3:4, ...}; I have to choose '2' instead of '1'.
Please suggest how to solve this.
Edit:
The problem begins like this:
1) A list is provided as input:
l = [1, 4, 4, 4, 5, 3]
2) I run a Counter on this to obtain the counts of each unique element.
3) I need to obtain the key whose value is maximum.
4) Suppose the Counter object contains multiple entries whose value is maximum, as in Counter({1: 4, 2: 4, 3: 4, 5: 1}); I have to choose 3 as the key whose value is 4.
5) So far, I have been able to get the Counter object, and I have separated the key/value lists using k = counter.keys(); v = counter.values().
6) I want to get the indices whose values are max in v.
If I run v.index(max(v)), I get the first index whose value matches the max value, but I want to obtain the list of all indices whose value is max, so that I can obtain the corresponding list of keys and take the max key in that list.
With long lists, using NumPy or another vectorized library would be helpful; otherwise you can simply use either
l.index(max(l))
or
max(range(len(l)), key=lambda i: l[i])
These, however, return only one of the possibly many argmax indices.
So for your problem, since you want the max that appears later, you can search the reversed list:
len(l)-l[::-1].index(max(l))-1
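If NumPy is available, all argmax indices can be collected in one call; a quick sketch (my addition, not part of the original answer), applied to the Counter's value list from step 5:
import numpy as np

v = np.asarray([4, 4, 4, 1])          # e.g. list(counter.values())
idx = np.flatnonzero(v == v.max())    # every index holding the max value
print(idx)                            # [0 1 2]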
If I understood correctly, the following should do what you want.
from collections import Counter

def get_largest_most_freq(lst):
    c = Counter(lst)
    # get the largest frequency
    freq = max(c.values())
    # get list of all the values that occur `freq` times
    items = [k for k, v in c.items() if v == freq]
    # return largest most frequent item
    return max(items)

def get_indexes_of_most_freq(lst):
    _max = get_largest_most_freq(lst)
    # get list of all indexes that have a value matching _max
    return [i for i, v in enumerate(lst) if v == _max]
>>> lst = [3,4,4,2,1,4,6]
>>> get_largest_most_freq(lst)
4
>>> get_indexes_of_most_freq(lst)
[1, 2, 5]
>>> lst = [1,2,3,4,3,2,2,3,4,5,6,7,5,4,3,2,2,3,4,3,4,5,6,7]
>>> get_largest_most_freq(lst)
3
>>> get_indexes_of_most_freq(lst)
[2, 4, 7, 14, 17, 19]

Two dictionary nested inside

I have a nested dictionary like this:
dic = {'dic1': {'a': ..., 'b': ...}, 'dic2': {'a': ..., 'b': ...}, 'dic3': {'a': ..., 'b': ...}}
Each inner dictionary has many rows of data.
There are two problems:
1. I want to compare the values of 'a' in each nested dictionary against an HDF5 file containing two datasets, dataset1 and dataset2: if a value of 'a' exists in dataset1, I want to access the corresponding dataset2 values.
2. How do I access the 'b' information that corresponds to the 'a' data?
For the first part I'm doing the following procedure, which is a never-ending solution, and for the second question I don't know how to access the 'b' in the same tuple as 'a'!
Does anybody have any clue how I can solve this?
for key, value in dic.items():
    for k, v in value.items():
        if 'a' in k:
            for t in entry[key][k]:
                if t in file['/dataset1']:
                    joint = file['/dataset2'][file['/dataset1'] == t]
You probably don't need the second loop, if your 'a' and 'b' keys are always present and known in advance (if not, you could add a test if 'a' in inner_dict and 'b' in inner_dict). Your test 'a' in k probably doesn't do what you expect (it's doing a substring match on an inner key string, which might give false positives if not all the keys are single characters).
Try something like this:
for outer_key, inner_dict in dic.items():
    for t in inner_dict['a']:
        if t in file['/dataset1']:
            joint = file['/dataset2'][file['/dataset1'] == t]  # not sure this makes sense
            b_value = inner_dict['b']
            # I think you want to do something with b_value here, but I'm not sure what
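A minimal runnable sketch of the same pattern, with NumPy arrays standing in for the two HDF5 datasets (all names and values below are made up for illustration):
import numpy as np

dataset1 = np.array([10, 20, 30])        # hypothetical stand-in for file['/dataset1']
dataset2 = np.array(['u', 'v', 'w'])     # hypothetical stand-in for file['/dataset2']

dic = {'dic1': {'a': [10, 99], 'b': 'b1'},
       'dic2': {'a': [30], 'b': 'b2'}}

for outer_key, inner_dict in dic.items():
    for t in inner_dict['a']:
        if t in dataset1:
            joint = dataset2[dataset1 == t]   # dataset2 rows where dataset1 == t
            print(outer_key, t, joint, inner_dict['b'])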

Find distinct values for each column in an RDD in PySpark

I have an RDD that is both very long (a few billion rows) and decently wide (a few hundred columns). I want to create sets of the unique values in each column (these sets don't need to be parallelized, as they will contain no more than 500 unique values per column).
Here is what I have so far:
data = sc.parallelize([["a", "one", "x"], ["b", "one", "y"], ["a", "two", "x"], ["c", "two", "x"]])
num_columns = len(data.first())
empty_sets = [set() for index in xrange(num_columns)]
d2 = data.aggregate((empty_sets), (lambda a, b: a.add(b)), (lambda x, y: x.union(y)))
What I am doing here is trying to initialize a list of empty sets, one for each column in my RDD. For the first part of the aggregation, I want to iterate row by row through data, adding the value in column n to the nth set in my list of sets. If the value already exists, nothing happens. Afterwards, it performs the union of the sets so only distinct values are returned across all partitions.
When I try to run this code, I get the following error:
AttributeError: 'list' object has no attribute 'add'
I believe the issue is that I am not accurately making it clear that I am iterating through the list of sets (empty_sets) and that I am iterating through the columns of each row in data. I believe in (lambda a, b: a.add(b)) that a is empty_sets and b is data.first() (the entire row, not a single value). This obviously doesn't work, and isn't my intended aggregation.
How can I iterate through my list of sets, and through each row of my dataframe, to add each value to its corresponding set object?
The desired output would look like:
[set(['a', 'b', 'c']), set(['one', 'two']), set(['x', 'y'])]
P.S I've looked at this example here, which is extremely similar to my use case (it's where I got the idea to use aggregate in the first place). However, I find the code very difficult to convert into PySpark, and I'm very unclear what the case and zip code is doing.
There are two problems. One, your combiner functions assume each row is a single set, but you're operating on a list of sets. Two, add doesn't return anything (try a = set(); b = a.add('1'); print b), so your first combiner function returns a list of Nones. To fix this, make your first combiner function non-anonymous and have both of them loop over the lists of sets:
def set_plus_row(sets, row):
    for i in range(len(sets)):
        sets[i].add(row[i])
    return sets

unique_values_per_column = data.aggregate(
    empty_sets,
    set_plus_row,  # can't be a lambda b/c add doesn't return anything
    lambda x, y: [a.union(b) for a, b in zip(x, y)]
)
I'm not sure what zip does in Scala, but in Python, it takes two lists and puts each corresponding element together into tuples (try x = [1, 2, 3]; y = ['a', 'b', 'c']; print zip(x, y);) so you can loop over two lists simultaneously.
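To sanity-check the combiner logic outside Spark, you can run set_plus_row over a plain list of rows (a quick local sketch, not part of the original answer):
rows = [["a", "one", "x"], ["b", "one", "y"], ["a", "two", "x"], ["c", "two", "x"]]
sets = [set() for _ in range(len(rows[0]))]
for row in rows:
    sets = set_plus_row(sets, row)
print(sets)  # [{'a', 'b', 'c'}, {'one', 'two'}, {'x', 'y'}]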
