Creating nested list for elements with same counts - python-3.x

Here's an example list:
['hello', 'hell', 'hel', 'he', 'h', 'he', 'hell', 'hello', 'hel', 'hello', 'hell']
How would I go about making a nested list for the elements with the same count? To be clearer: elements that appear the same number of times in the list should be nested together. The output would look like this:
[['hello','hell'], ['hel', 'he'], ['h']]
Because 'hello' and 'hell' each have a count of 3, they are grouped together, and likewise for the rest of the elements in the list.

With some imports it could be done like this:
from collections import Counter
from itertools import groupby
words = ['hello', 'hell', 'hel', 'he', 'h', 'he', 'hell', 'hello', 'hel', 'hello', 'hell']
counts = Counter(words)
res = [list(group) for _, group in groupby(counts, key=lambda k: counts[k])]
res will be:
[['hello', 'hell'], ['hel', 'he'], ['h']]
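Note that groupby only groups consecutive equal keys; the code above relies on Counter preserving insertion order, which happens to keep equal counts adjacent for this particular input. A safer sketch sorts the keys by count first (reusing counts from above):

ordered = sorted(counts, key=counts.get, reverse=True)
res = [list(group) for _, group in groupby(ordered, key=lambda k: counts[k])]
# res is again [['hello', 'hell'], ['hel', 'he'], ['h']]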

Related

Is there a way to split strings inside a list?

I am trying to split strings inside a list, but I could not find any solution on the internet. This is only a sample, but it should help you understand my problem.
array=['a','b;','c','d)','void','plasma']
for i in array:
    print(i.split())
My desired output should look like this:
output: ['a','b',';','c','d',')','void','plasma']
One approach uses re.findall on each term of the input list, along with a list comprehension to flatten the resulting 2D list:
import re

inp = ['a', 'b;', 'c', 'd)', 'void', 'plasma']
output = [j for sub in [re.findall(r'\w+|\W+', x) for x in inp] for j in sub]
print(output) # ['a', 'b', ';', 'c', 'd', ')', 'void', 'plasma']
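Equivalently, the flattening step can be done with itertools.chain.from_iterable, which some find easier to read than the nested comprehension:

import re
from itertools import chain

inp = ['a', 'b;', 'c', 'd)', 'void', 'plasma']
output = list(chain.from_iterable(re.findall(r'\w+|\W+', x) for x in inp))
print(output)  # ['a', 'b', ';', 'c', 'd', ')', 'void', 'plasma']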

Apparently empty groups generated with itertools.groupby

I am having some trouble with groupby from itertools:
from itertools import groupby
for k, grp in groupby("aahfffddssssnnb"):
    print(k, list(grp), list(grp))
output is:
a ['a', 'a'] []
h ['h'] []
f ['f', 'f', 'f'] []
d ['d', 'd'] []
s ['s', 's', 's', 's'] []
n ['n', 'n'] []
b ['b'] []
It works as expected; itertools._grouper objects seem to be readable only once (maybe they are iterators?). But:
li = [grp for k, grp in groupby("aahfffddssssnnb")]
list(li[0])
[]
list(li[1])
[]
The groups seem empty... I don't understand why.
This one works:
["".join(grp) for k, grp in groupby("aahfffddssssnnb")]
['aa', 'h', 'fff', 'dd', 'ssss', 'nn', 'b']
I am using Python 3.9.9. I already asked this question on the comp.lang.python newsgroup, without any answers.
grp is a sub-iterator over the same underlying iterator given to groupby; a new one is created for every key. When you skip to the next key, the old grp is no longer available, because the main iterator has been advanced beyond the current group.
It is stated clearly in the Python documentation:
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list:
groups = []
uniquekeys = []
for k, g in groupby(data, keyfunc):
    groups.append(list(g))      # Store group iterator as a list
    uniquekeys.append(k)
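Applied to the string from the question, storing each group as a list before groupby advances gives:

from itertools import groupby

groups = []
for k, g in groupby("aahfffddssssnnb"):
    groups.append(list(g))  # materialize the group before groupby moves on
print(groups)
# [['a', 'a'], ['h'], ['f', 'f', 'f'], ['d', 'd'], ['s', 's', 's', 's'], ['n', 'n'], ['b']]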

combining thousands of list strings in python

I have a .txt file of "Alice in Wonderland" and need to strip all the punctuation and make all of the words lowercase, so I can find the number of unique words in the file. The wordlist referred to below is a single list of all the individual words from the book, as strings, so wordlist looks like this:
["Alice's", 'Adventures', 'in', 'Wonderland', "ALICE'S",
'ADVENTURES', 'IN', 'WONDERLAND', 'Lewis', 'Carroll', 'THE',
'MILLENNIUM', 'FULCRUM', 'EDITION', '3.0', 'CHAPTER', 'I',
'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning',
'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her',
'sister', 'on', 'the', 'bank,'
The code I have for the solution so far is:
from string import punctuation

def wordcount(book):
    for word in wordlist:
        no_punc = word.strip(punctuation)
        lower_case = no_punc.lower()
        newlist = lower_case.split()
        print(newlist)
This works for stripping punctuation and making all words lowercase; however, newlist = lower_case.split() makes an individual list out of every word, so I cannot iterate over one big list to find the number of unique words. The reason I did the .split() is so that, when iterated over, Python does not count every letter as a word; each word is kept intact since it is its own list item. Any ideas on how I can improve this, or a more efficient approach? Here is a sample of the output:
['down']
['the']
['rabbit-hole']
['alice']
['was']
['beginning']
['to']
['get']
['very']
['tired']
['of']
['sitting']
['by']
['her']
Here is a modification of your code, with outputs:
from string import punctuation

wordlist = "Alice fell down down down!.. down into, the hole."
single_list = []
for word in wordlist.split(" "):
    no_punc = word.strip(punctuation)
    lower_case = no_punc.lower()
    newlist = lower_case.split()
    #print(newlist)
    single_list.append(newlist[0])
print(single_list)

# to get the unique words
single_list_unique = set(single_list)
print(single_list_unique)
print(len(single_list_unique))
and that produces:
['alice', 'fell', 'down', 'down', 'down', 'down', 'into', 'the', 'hole']
and the unique set:
{'fell', 'alice', 'down', 'into', 'the', 'hole'}
and the length of the unique:
6
(This may not be the most efficient approach, but it stays close to your current code and will suffice for a book of thousands of words. If this were a backend process serving multiple requests, you would want to optimize it further.)
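For reference, a more compact sketch of the same idea, using a list comprehension and collections.Counter (which also gives per-word frequencies for free); the guard skips tokens that strip down to nothing, such as pure punctuation or empty strings:

from string import punctuation
from collections import Counter

text = "Alice fell down down down!.. down into, the hole."
words = [w.strip(punctuation).lower() for w in text.split() if w.strip(punctuation)]
counts = Counter(words)
print(len(counts))  # 6 unique words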
EDIT----------
You may be importing from a file using a library that passes in a list, in which case you get AttributeError: 'list' object has no attribute 'split', or you might see IndexError: list index out of range because of an empty string. In that case, use this modification:
from string import punctuation

wordlist2 = ["", "Alice fell down down down!.. down into, the hole.",
             "There was only one hole for Alice to fall down into"]
single_list = []
for wordlist in wordlist2:
    for word in wordlist.split(" "):
        no_punc = word.strip(punctuation)
        lower_case = no_punc.lower()
        newlist = lower_case.split()
        #print(newlist)
        if len(newlist) > 0:
            single_list.append(newlist[0])
print(single_list)

# to get the unique words
single_list_unique = set(single_list)
print(single_list_unique)
print(len(single_list_unique))
producing:
['alice', 'fell', 'down', 'down', 'down', 'down', 'into', 'the', 'hole', 'there', 'was', 'only', 'one', 'hole', 'for', 'alice', 'to', 'fall', 'down', 'into']
{'there', 'fall', 'fell', 'alice', 'for', 'down', 'was', 'into', 'the', 'to', 'only', 'hole', 'one'}
13

Iterating through values of a paired RDD (Pyspark) and replacing null values

I am collecting data using the Spark RDD API and have created a paired RDD, as shown below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('app').getOrCreate()
sc = spark.sparkContext
raw_rdd = sc.textFile("data.csv")
paired_rdd = raw_rdd \
    .map(lambda x: x.split(",")) \
    .map(lambda x: (x[2], [x[1], x[3], x[5]]))
Here is a sample excerpt of the paired RDD:
[('VXIO456XLBB630221', ['I', 'Nissan', '2003']),
('VXIO456XLBB630221', ['A', '', '']),
('VXIO456XLBB630221', ['R', '', '']),
('VXIO456XLBB630221', ['R', '', ''])]
As you can see, the keys in this paired RDD are the same for all elements, but only one element has all the fields filled in.
What do we want to accomplish? We want to replace the empty fields with the values of the element with complete fields. So we would have an expected output like this:
[('VXIO456XLBB630221', ['I', 'Nissan', '2003']),
('VXIO456XLBB630221', ['A', 'Nissan', '2003']),
('VXIO456XLBB630221', ['R', 'Nissan', '2003']),
('VXIO456XLBB630221', ['R', 'Nissan', '2003'])]
I know the first step would be to do a groupByKey, i.e.,
paired_rdd.groupByKey().map(lambda kv: ____)
I am just not sure how to iterate through the values and how this would fit into one lambda function.
The best way would probably be to go with DataFrames and window functions. With RDDs, you could also work something out with an aggregation (reduceByKey) that fills in the blanks while keeping in memory the list of first elements of each value. We can then re-flatten based on that memory, creating the same number of rows as before but with the values filled in.
# let's define a function that selects the non-empty value between two strings
def select_value(a, b):
    if a is None or len(a) == 0:
        return b
    else:
        return a

# let's use mapValues to separate the first element of the list from the rest.
# Then we use reduceByKey to aggregate the list of all first elements (first
# element of the tuple). For the other elements, we only keep non-empty values
# (second element of the tuple).
# Finally, we use flatMapValues to recreate the rows based on the memorized
# first elements of the lists.
paired_rdd \
    .mapValues(lambda x: ([x[0]], x[1:])) \
    .reduceByKey(lambda a, b: (
        a[0] + b[0],
        [select_value(a[1][i], b[1][i]) for i in range(len(a[1]))]
    )) \
    .flatMapValues(lambda x: [[k] + x[1] for k in x[0]]) \
    .collect()
Which yields:
[('VXIO456XLBB630221', ['I', 'Nissan', '2003']),
('VXIO456XLBB630221', ['A', 'Nissan', '2003']),
('VXIO456XLBB630221', ['R', 'Nissan', '2003']),
('VXIO456XLBB630221', ['R', 'Nissan', '2003'])
]
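For completeness, here is a minimal sketch of the DataFrame/window-function route mentioned at the top of this answer. The column names are hypothetical, chosen only so the CSV columns line up with the fields used in the RDD version:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local").appName("app").getOrCreate()

# hypothetical names: _c1 -> flag, _c2 -> vin, _c3 -> make, _c5 -> year
df = spark.read.csv("data.csv").select(
    F.col("_c2").alias("vin"), F.col("_c1").alias("flag"),
    F.col("_c3").alias("make"), F.col("_c5").alias("year"))

w = Window.partitionBy("vin")
# empty strings are mapped to null first, so that
# first(..., ignorenulls=True) picks the one complete value per key
filled = df.select(
    "vin", "flag",
    F.first(F.when(F.col("make") != "", F.col("make")), ignorenulls=True).over(w).alias("make"),
    F.first(F.when(F.col("year") != "", F.col("year")), ignorenulls=True).over(w).alias("year"))
filled.show()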

Tokenize in Python

I am trying to build a function in Python that allows me to tokenize a string. I have written the following function:
import nltk

def tokenize(string):
    words = nltk.word_tokenize(string)
    return words
This function prints the following:
tokenize("Hello. What’s your name?")
['Hello', '.', 'What', '’', 's', 'your', 'name', '?']
But I need it to produce the following:
['Hello', '.', 'What’s', 'your', 'name', '?']
How could I implement this?
Thank you
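One possible direction, sketched with the standard re module rather than NLTK: treat an apostrophe (straight or curly) flanked by word characters as part of the word, and every other non-word, non-space character as its own token:

import re

def tokenize(string):
    # words may contain an internal straight or curly apostrophe;
    # any other non-word, non-space character becomes its own token
    return re.findall(r"\w+(?:[’']\w+)*|[^\w\s]", string)

print(tokenize("Hello. What’s your name?"))
# ['Hello', '.', 'What’s', 'your', 'name', '?']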
