I am new to map-reduce and coding. I am trying to write Python code that calculates the average number of characters and "#" symbols in a tweet.
Sample data:
1469453965000;757570956625870854;RT #lasteven04: La jeune Rebecca
#Kpossi, nageuse, 18 ans à peine devrait être la porte-drapeau du #Togo à #Rio2016 hyperlink;Twitter for Android 1469453965000;757570957502394369;Over 30 million women
footballers in the world. Most of us would trade places with this lot
for #Rio2016 ⚽️ hyperlink;Twitter for iPhone
fields/columns details:
0: epoch_time 1: tweetId 2: tweet 3: device
Here is the code that I've written. I need help calculating the average in the reducer function; any help or guidance would be really appreciated.
(Updated as per the answer provided by #oneCricketeer.)
import re
from mrjob.job import MRJob

class Lab3(MRJob):
    def mapper(self,_,line):
        try:
            fields=line.split(";")
            if(len(fields)==4):
                tweet=fields[2]
                tweet_id=fields[0]
                yield(None,tweet_id,("{},{}".format(len(tweet),tweet.count('#')))
        except:
            pass

    def reduce(self,tweet_id,tweet_info):
        total_tweet_length=0
        total_tweet_hash=0
        count=0
        for v in tweet_info:
            tweet_length,hashes = map(int,v.split())
            tweet_length_sum+= tweet_length
            total_tweet_hash+=hashes
            count+=1
            yield(total_tweet_length/(1.0*count),total_tweet_hash/(1.0*count))

if __name__=="__main__":
    Lab3.run()
Your mapper needs to yield a key and a value: two elements, not three. Ideally, average length and average hashtag count would be separate MapReduce jobs, but in this case you can combine them because you're processing the entire line rather than separate words.
# you could use the tweetId as the key, too, but would only help if tweets shared ids
yield (None, "{} {}".format(len(tweet), tweet.count('#')))
Note: len(tweet) includes spaces and emojis, which you may want to exclude as "characters"
Using _ as a parameter name in the mapper definition is valid Python; it is the conventional name for an argument you don't use, so that part can stay.
Your reduce function is syntactically incorrect. You cannot put a string as a function parameter, nor use += on a variable that wasn't already defined. Also, an average calculation requires you to divide after you've totalled and counted (so, one returned result per reducer, not one per value inside the loop).
# mrjob calls the method named "reducer"; a method named "reduce" is never invoked
def reducer(self, key, tweet_info):
    total_tweet_length = 0
    total_tweet_hash = 0
    count = 0
    for v in tweet_info:
        tweet_length, hashes = map(int, v.split())
        total_tweet_length += tweet_length
        total_tweet_hash += hashes
        count += 1
    # divide once, after the loop; 1.0 * count forces a floating point output
    yield (total_tweet_length / (1.0 * count), total_tweet_hash / (1.0 * count))
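To check the averaging logic without a Hadoop or mrjob setup, the same map and reduce steps can be sketched as plain Python functions. This is only an illustration: map_tweet and reduce_averages are hypothetical names, and the tuple values stand in for the "length hashes" strings used above.

```python
def map_tweet(line):
    """Yield (key, (length, hash_count)) for a well-formed 4-field line."""
    fields = line.split(";")
    if len(fields) == 4:
        tweet = fields[2]
        # A single shared key (None) sends every tweet to the same reducer.
        yield None, (len(tweet), tweet.count("#"))


def reduce_averages(values):
    """Total everything first, then divide once after the loop."""
    total_length = 0
    total_hashes = 0
    count = 0
    for length, hashes in values:
        total_length += length
        total_hashes += hashes
        count += 1
    if count == 0:
        return None  # no valid tweets, so no average to report
    return total_length / count, total_hashes / count


# Hypothetical input lines; the last one is malformed and gets skipped.
lines = ["1;2;ab #c;dev", "1;2;#x #y;dev", "bad line"]
mapped = [value for line in lines for _, value in map_tweet(line)]
averages = reduce_averages(mapped)
print(averages)
```

Keeping the division outside the loop is what makes this one result per key rather than a running average per tweet.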
I need to generate a random string of a certain length – say ten characters, for the sake of argument – composed of the characters a, b, c, (, ), with the rule that parentheses must be matched.
So for example aaaaaaaaaa, abba()abba and ((((())))) are valid strings, but )aaaabbbb( is not.
What algorithm would generate a random string, uniformly sampled from the set of all strings consistent with those rules? (And run faster than 'keep generating strings without regard to the balancing rule, discard the ones that fail it', which could end up generating very many invalid strings before finding a valid one.)
A string consisting only of balanced parentheses (for any arbitrary pair of characters representing an open and a close) is called a "Dyck string", and the number of such strings with p pairs of parentheses is the pth Catalan number, which can be computed as C(2p, p)/(p+1), where C(a, b) is the binomial coefficient "a choose b" (a formula which would be much easier to make readable if only SO allowed MathJax). If you also want to allow k other non-parenthetic characters in a string of total length 2n, you need to consider, for each number p ≤ n of pairs of balanced parentheses, the number of different combinations of the non-parenthesis characters, k^(2n-2p), and the number of ways you can interpolate 2n-2p characters in a string of total length 2n, C(2n, 2p). The count for a given p is the product of these two numbers with the pth Catalan number. If you sum these counts over all possible values of p, you get the size of the total universe of possibilities; you can then choose a random number in that range and select whichever p that number falls into. Then you can select a random placement of random non-parenthesis characters.
Finally, you need to get a uniformly distributed Dyck string; a simple procedure is to decompose the Dyck string into its shortest balanced prefix and the remainder (i.e. (A)B, where A and B are balanced subsequences). Select a random length for (A), weighted by the number of Dyck strings that have that split so the result stays uniform, then recursively generate a random A and a random B.
Precomputing the tables of counts (or memoising the function which generates them) will produce a speedup if you expect to generate a lot of random strings.
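The counting argument above can be checked numerically. The sketch below is not a sampler; it only evaluates the sum over p and compares it with brute-force enumeration for short strings. The function names are hypothetical, and length plays the role of 2n (the formula also works for odd lengths, where at least one non-parenthesis character is forced).

```python
from itertools import product
from math import comb


def catalan(p):
    """Number of Dyck strings with p pairs of parentheses."""
    return comb(2 * p, p) // (p + 1)


def count_balanced(length, k):
    """Count strings of the given length over k letters plus '(' and ')'
    whose parentheses are balanced, summing over the number of pairs p."""
    return sum(
        catalan(p) * k ** (length - 2 * p) * comb(length, 2 * p)
        for p in range(length // 2 + 1)
    )


def is_balanced(chars):
    depth = 0
    for ch in chars:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def brute_force_count(length, alphabet):
    """Enumerate every string; only feasible for very short lengths."""
    return sum(is_balanced(s) for s in product(alphabet + "()", repeat=length))


for length in range(6):
    print(length, count_balanced(length, 3), brute_force_count(length, "abc"))
```

The per-p terms of the sum are exactly the weights needed to pick p before placing the letters.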
Use dynamic programming to generate a data structure that knows how many there are for each choice, recursively. Then use that data structure to find a random choice.
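I seem to be the only person who uses the technique. And I always write it from scratch. But here is working code that hopefully explains it. Building the structure visits every (position, open-paren depth) pair, so it takes time O(length_of_string^2 * (length_of_alphabet + 2)) and a similar amount of memory; drawing each random string afterwards is much cheaper.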
import random

class DPPath:
    def __init__(self):
        self.count = 0
        self.next = None

    def add_option(self, transition, tail):
        if self.next is None:
            self.next = {}
        self.next[transition] = tail
        self.count += tail.count

    def random(self):
        if 0 == self.count:
            return None
        else:
            # randrange stays exact for counts too large for a float.
            return self.find(random.randrange(self.count))

    def find(self, pos):
        result = self._find(pos)
        return "".join(reversed(result))

    def _find(self, pos):
        if self.next is None:
            return []
        for transition, tail in self.next.items():
            if pos < tail.count:
                result = tail._find(pos)
                result.append(transition)
                return result
            else:
                pos -= tail.count
        raise IndexError("find out of range")

def balanced_dp(n, alphabet):
    # Record that there is 1 empty string with balanced parens.
    base_dp = DPPath()
    base_dp.count = 1
    dps = [base_dp]
    for _ in range(n):
        # We are working backwards towards the start.
        prev_dps = [DPPath()]
        for i in range(len(dps)):
            # prev_dps needs to be bigger in case of closed paren.
            prev_dps.append(DPPath())
            # If there are closed parens, we can open one.
            if 0 < i:
                prev_dps[i-1].add_option('(', dps[i])
            # alphabet chars don't change paren balance.
            for char in alphabet:
                prev_dps[i].add_option(char, dps[i])
            # Add a closed paren.
            prev_dps[i+1].add_option(")", dps[i])
        # And we are done with this string position.
        dps = prev_dps
    # Return the one that wound up balanced.
    return dps[0]

# And a quick demo of several random strings.
for _ in range(10):
    print(balanced_dp(10, "abc").random())
I saved the results as a string and am now checking it in a loop, but it's not working:
counter is still 0 even when 66 appears in the result.
from random import *

trial = int(randint(1, 10))
print(trial)
result = ''
for i in range(trial):
    init_num = str(randint(1, 6))
    result += init_num
print(result)
last_dice = 0
counter = 0
for i in range(trial):
    if result[i] == 6 and last_dice == 6:
        counter += 1
        last_dice = 0
    else:
        last_dice = result[i]
print(counter)
if result[i] == 6
This condition will never be true because the value in 'result' is a string. If you change the 6 to '6' it should work.
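A minimal sketch of that fix, wrapped in a function so it can be tried on fixed inputs instead of random ones. The function name is hypothetical; the logic is the question's loop with the comparison done against the string '6'.

```python
def count_consecutive_sixes(result):
    """Count non-overlapping runs of two '6' characters in a string of rolls."""
    counter = 0
    last_die = ""
    for die in result:
        # Compare strings with strings: '6', not the integer 6.
        if die == "6" and last_die == "6":
            counter += 1
            last_die = ""  # reset so "666" counts as one double, not two
        else:
            last_die = die
    return counter


print(count_consecutive_sixes("4665"))
```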
EDIT
"how many time double 6 appeared while rolling the dice" implies the dice are being rolled in pairs. As it is, your script will count any two consecutive sixes as having been rolled together. For example if the rolls are 4,6,6,5 your script counts this as an instance of double sixes, which is not accurate if the rolls are occurring in pairs. For this reason you should generate the random 1-6 values in pairs so it is clear when sixes are actually rolled together.
You could create 'results' as a list where each item is a pair of numbers representing a roll of two dice:
results = [str(randint(1,6))+str(randint(1,6)) for i in range(randint(1, 10))]
Your 'trial' variable is only used to print the number of rolls. If each roll is for a pair of dice, the number of rolls is equal to the number of items in the above 'results' list. Printing this value doesn't require a variable, but merely the length of the list:
print(len(results))
One way to print all the roll results is as an easily readable string of comma-separated roll pair values, which can be done printing this list joined together with each item separated by a comma:
print(','.join(results))
Counting the number of double sixes rolled can be done by summing the number of times the value '66' appears in the 'results' list:
print(sum(i == '66' for i in results))
All together the script could be written as:
from random import randint
results = [str(randint(1,6))+str(randint(1,6)) for i in range(randint(1, 10))]
print(len(results))
print(','.join(results))
print(sum(i == '66' for i in results))
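If you would rather keep the dice as numbers than as two-character strings, the same idea can be sketched with tuples. The function names and the seeded random.Random instance are assumptions added here so that a run is repeatable; they are not part of the script above.

```python
import random


def roll_pairs(n_rolls, rng):
    """Roll two dice n_rolls times; each roll is a (die1, die2) tuple."""
    return [(rng.randint(1, 6), rng.randint(1, 6)) for _ in range(n_rolls)]


def count_double_sixes(pairs):
    """Count rolls where both dice show six."""
    return sum(pair == (6, 6) for pair in pairs)


rng = random.Random(0)  # fixed seed only to make the demo repeatable
pairs = roll_pairs(rng.randint(1, 10), rng)
print(len(pairs))
print(pairs)
print(count_double_sixes(pairs))
```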
Could you give me a hint about where the time-consuming part of this code is?
It's my temporary solution for the kata Generate Numbers from Digits #2 from codewars.com.
Thanks!
from collections import Counter
from itertools import permutations

def proc_arrII(arr):
    length = Counter(arr).most_common()[-1][1]
    b = [''.join(x) for x in list(set(permutations(arr,length)))]
    max_count = [max(Counter(x).values()) for x in b]
    total = 0
    total_rep = 0
    maximum_pandigit = 0
    for i in range(len(b)):
        total+=1
        if max_count[i] > 1:
            total_rep+=1
        elif int(b[i]) > maximum_pandigit:
            maximum_pandigit = int(b[i])
    if maximum_pandigit == 0:
        return([total])
    else:
        return([total,total_rep,maximum_pandigit])
When posting this, it would have been helpful to offer example input, or link to the original question, or include some python -m cProfile output.
Here is a minor item; it inflates the running time very slightly. In the expression [''.join(x) for x in list(set(permutations(arr, length)))] there's no need to call list( ... ). The comprehension just needs an iterable, and a set works fine for that.
Here is a bigger item. permutations already makes the promise that "if the input elements are unique, there will be no repeat values in each permutation." Seems like you want to dedup (with set( ... )) on the way in, rather than on the way out, for an algorithmic win: reduced complexity. Note that this changes the result when arr contains repeated digits that are meant to show up in the output (total_rep counts exactly those strings), so check it against the kata's expected values.
The rest looks nice enough. You might try benching without the elif clause, using the expression max(map(int, b)) instead. If there's any gain it would only be minor, turning O(n) into O(n) with a slightly smaller coefficient.
Similarly, you should just assign total = len(b) and be done with it; there is no need to increment it that many times.
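Put together, the last two suggestions might look like the sketch below. It only covers the counting part: summarize is a hypothetical helper, and its argument stands in for the list b of digit strings built in the question's code. Unlike a bare max(map(int, b)), it still skips strings with repeated digits so that it returns what the original loop returns.

```python
from collections import Counter


def summarize(numbers):
    """Return [total] or [total, total_rep, maximum_pandigit] for digit strings."""
    total = len(numbers)  # no need to count one by one
    has_repeat = [max(Counter(s).values()) > 1 for s in numbers]
    total_rep = sum(has_repeat)
    # Largest number among the strings without a repeated digit (0 if none).
    maximum_pandigit = max(
        (int(s) for s, rep in zip(numbers, has_repeat) if not rep),
        default=0,
    )
    if maximum_pandigit == 0:
        return [total]
    return [total, total_rep, maximum_pandigit]


print(summarize(["12", "21", "11"]))
```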
I'm trying to take a string and break it into small chunks if it is over a certain number of words.
I keep on getting a RecursionError: maximum recursion depth exceeded in comparison
What in my code is making this happen?
import math

# Shorten Sentence into small pieces
def shorten(sentenceN):
    # If it is a string - and length over 6 - then shorten recursively
    if (isinstance(sentenceN, str)):
        sentence = sentenceN.split(' ')
        array = []
        length = len(sentenceN)
        halfed = math.floor(length / 2)
        if length < 6:
            return [sentenceN]
        # If sentence is long - break into two parts then rerun shorten on each part
        else:
            first = shorten(" ".join(sentence[:halfed]))
            second = shorten(" ".join(sentence[halfed:]))
            array.append(first)
            array.append(second)
            return array
    # If the object is an array (sentence is already broken up) - run shorten on each - append
    # result to array for returning
    if(isinstance(sentenceN, list)):
        array = []
        for sentence in sentenceN:
            array.append(shorten(sentence))
        return array

# example sentences to use
longSentence = "On offering to help the blind man, the man who then stole his car, had not, at that precise moment."
shortSentence = "On offering to help the blind man."
shorten(shortSentence)
shorten(longSentence)
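Python raises RecursionError ("maximum recursion depth exceeded") when the depth of nested calls exceeds the interpreter's limit, which is 1000 by default (see sys.getrecursionlimit()).
Here is the recursion:
first = shorten(" ".join(sentence[:halfed]))
second = shorten(" ".join(sentence[halfed:]))
Each call has to be kept on the stack until it returns, so if the calls never bottom out, the stack overflows and you hit the maximum recursion depth. Your sentences are not too long, though. length = len(sentenceN) counts characters, while sentence is a list of words, so halfed is normally larger than the number of words. sentence[:halfed] is then the whole sentence, and the first recursive call receives exactly the same string again, forever.
You have to change the logic so that every call works on a strictly shorter input, for example by measuring length in words (len(sentence)) before this check:
if length < 6:
    return [sentenceN]
Raising the recursion depth only helps when the recursion is deep but finite; it will not fix a call that never terminates:
import sys
sys.setrecursionlimit(10**6)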
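As a sketch of what a terminating version could look like, assuming the intent is to split on word count and to keep the nested-list result of the original: the max_words parameter is an addition for illustration, and the list-input branch of the question's function is left out.

```python
def shorten(sentence, max_words=6):
    """Recursively split a sentence in half until each piece has fewer than
    max_words words. Returns nested lists, like the original function."""
    words = sentence.split()
    if len(words) < max_words:
        return [sentence]
    # Both halves contain at least one word fewer than the input,
    # so the recursion is guaranteed to stop.
    half = len(words) // 2
    first = shorten(" ".join(words[:half]), max_words)
    second = shorten(" ".join(words[half:]), max_words)
    return [first, second]


short_sentence = "On offering to help the blind man."
print(shorten(short_sentence))
```

Because the length check and the slice now use the same unit (words), each recursive call is made on a shorter list and the default recursion limit is never a concern for ordinary sentences.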
Instructions: Write a script that will calculate the %GC of a dna string
based on a sliding window of adjustable size. So say the length of
the window is L = 10 bases, then you will move the window along
the dna strand from position 0 to the end (careful, not too far...)
and 'extract' the bases into a substring and analyze GC content.
Put the numbers in a list. The dna string may be very large so you
will want to read the string in from an infile, and print the results
to a comma-delimited outfile that can be ported into Excel to plot.
For the final data analysis, use a window of L = 100 and analyze the two genomes in files:
Bacillus_amyloliquefaciens_genome.txt
Deinococcus_radiodurans_R1_chromosome_1.txt
But first, to get your script functioning, use the following trainer data set. Let window L=4. Example input and output follow:
INPUT:
AACGGTT
OUTPUT:
0,0.50
1,0.75
2,0.75
3,0.50
My code:
dna = ['AACGGTT']

def slidingWindow(dna,winSize,step):
    """Returns a generator that will iterate through
    the defined chunks of input sequence. Input sequence
    must be iterable."""
    # Verify the inputs
    #try: it = iter(dna)
    # except TypeError:
    #raise Exception("**ERROR** sequence must be iterable.")
    if not ((type(winSize) == type(0)) and (type(step) == type(0))):
        raise Exception("**ERROR** type(winSize) and type(step) must be int.")
    if step > winSize:
        raise Exception("**ERROR** step must not be larger than winSize.")
    if winSize > len(dna):
        raise Exception("**ERROR** winSize must not be larger than sequence length.")
    # Pre-compute number of chunks to emit
    numOfwins = ((len(dna)-winSize)/step)+1
    # Do the work
    for i in range(0,numOfwins*step,step):
        yield dna[i:i+winSize]

chunks = slidingWindow(dna,len(dna),step)
for y in chunks:
    total = 1
    search = dna[y]
    percentage = (total/len(dna))
    if search == "C":
        total = total+1
        print ("#", y,percentage)
    elif search == "G":
        total = total+1
        print ("#", y,percentage)
    else:
        print ("#", y, "0.0")

"""
MAIN
calling the functions from here
"""
# YOUR WORK HERE
#print ("#", z,percentage)
When approaching a complex problem, it is helpful to divide it into simpler sub-problems. Here, you have at least two separate concepts: a window of bases, and statistics on such a window. Why don't you tackle them one at a time?
Here is a simple generator that produces chunks of the desired size:
def get_chunks(dna, window_size=4, stride=1):
    for i in range(0, len(dna) - window_size + 1, stride):
        chunk = dna[i:i + window_size]
        assert len(chunk) == window_size
        yield chunk

for chunk in get_chunks('AACGGTT'):
    print(chunk)
It displays this output:
AACG
ACGG
CGGT
GGTT
Now, with that in hand, could you write a simple function that accepts a four-character string and produces an appropriate statistical summary of it? [Please post it as a separate answer to your question. Yes, it might sound odd at first, but StackOverflow does encourage you to post answers to your questions, so you can share what you have learned.]
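For readers who want to check their own attempt, here is one possible sketch of that second piece, assuming the summary wanted is the fraction of G and C bases formatted as in the trainer output. gc_fraction and gc_profile_lines are hypothetical names, and get_chunks is repeated only so the example runs on its own.

```python
def get_chunks(dna, window_size=4, stride=1):
    """Yield every full-size window of the sequence."""
    for i in range(0, len(dna) - window_size + 1, stride):
        yield dna[i:i + window_size]


def gc_fraction(chunk):
    """Fraction of bases in the chunk that are G or C."""
    return sum(base in "GC" for base in chunk.upper()) / len(chunk)


def gc_profile_lines(dna, window_size=4):
    """Format one 'position,fraction' line per window, ready for a CSV file."""
    return [
        "{},{:.2f}".format(position, gc_fraction(chunk))
        for position, chunk in enumerate(get_chunks(dna, window_size))
    ]


for line in gc_profile_lines("AACGGTT"):
    print(line)
```

Reading the genome from an infile and writing these lines to an outfile can then be layered on top without touching the two small functions.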