For a dictionary "a", with the keys "x, y and z" containing integer values.
What is the most efficient way to produce a joint list if I want to merge two keys in the dictionary (considering the size of the keys are identical and the values are of interger type)?
x+y and y+z ? .
Explanation:
Suppose you have to merge two keys into a new list or a new dict without altering the original dictionary.
Example:
a = {"x" : {1,2,3,....,50}
"y" : {1,2,3,.....50}
"z" : {1,2,3,.....50}
}
Desired lists:
x+y = [2, 4, 6, 8, ..., 100]
y+z = [2, 4, 6, ..., 100]
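In plain Python, the intended element-wise sum can be sketched as a baseline like this (a minimal sketch, assuming the values are kept as ordered lists rather than sets, since sets have no guaranteed order):
a = {"x": list(range(1, 51)), "y": list(range(1, 51)), "z": list(range(1, 51))}

# element-wise sum of two keys via zip
x_plus_y = [i + j for i, j in zip(a["x"], a["y"])]   # [2, 4, 6, ..., 100]
y_plus_z = [i + j for i, j in zip(a["y"], a["z"])]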
A very efficient way is to convert the dictionary to a pandas DataFrame and let it do the job for you with its vectorized operations:
import pandas as pd
a = {"x" : range(1,51), "y" : range(1,51), "z" : range(1,51)}
df = pd.DataFrame(a)
x_plus_y = (df['x'] + df['y']).to_list()
y_plus_z = (df['y'] + df['z']).to_list()
print(x_plus_y)
#[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100]
It seems like you're trying to mimic a join-type operation. That isn't native to Python dicts, so if you really want that kind of functionality, I'd recommend looking at the pandas library.
If you just want to merge dict keys without more advanced features, this function should help:
from itertools import chain
from collections import Counter
from typing import Dict, List, Set

def merge_keys(data: Dict[str, Set[int]], *merge_list: str) -> Dict[str, List[int]]:
    merged_data = dict()
    # Count how many of the selected sets each value appears in, then
    # multiply each value by its count to "add" the matching elements together.
    merged_counts = Counter(chain(*(data.get(k, set()) for k in merge_list)))
    merged_data['+'.join(merge_list)] = [k * v for k, v in merged_counts.items()]
    return merged_data
You can run this as merge_keys(a, "x", "y") or merge_keys(a, "x", "y", "z", ...), where a is your dict; you can pass as many keys as you want, since the function takes a variable number of arguments.
If you want two separate merges in the same dict, all you need to do is:
b = merge_keys(a, "x", "y") | merge_keys(a, "y", "z")
Note that the order of the keys changes the final merged key ("y+z" vs "z+y") but not the value of their merged sets.
P.S.: This was actually a little tricky, since the original dict has set values rather than lists; sets aren't ordered, so you can't just add them element-wise. That's why I used Counter here, in case you were wondering.
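To see what the Counter trick does, here is a tiny illustration (a sketch, not part of the answer above): merging two identical sets makes every element appear twice, and multiplying each element by its count doubles it:
from collections import Counter
from itertools import chain

counts = Counter(chain({1, 2, 3}, {1, 2, 3}))    # every element ends up with count 2
doubled = [k * v for k, v in counts.items()]     # [2, 4, 6]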
I have two datasets, one containing 650k records and the other 20k records, and I want to find matching or approximately matching data in a single column of both datasets. How can I speed up the process, as Python is very slow?
Note: the data type is string in both columns of the two datasets.
Here is my simple code:
from fuzzywuzzy import fuzz
# df1 and df2 are the two dataframes, with the columns 'string1' and 'string2' respectively
for i in df1['string1']:
    for j in df2['string2']:
        Ratio = fuzz.ratio(i.lower(), j.lower())
        print(Ratio)
You might want to replace your usage of FuzzyWuzzy with RapidFuzz (I am the author), which is significantly faster.
Using RapidFuzz your algorithm can be implemented in the following way:
import pandas as pd
from rapidfuzz import fuzz, process

def lower(s, **kwargs):
    return s.lower()

d = {'string1': ["abcd", "abcde"] * 1000, 'string2': ["abcd", "abc"] * 1000}
df = pd.DataFrame(data=d)

process.cdist(df["string1"], df["string2"],
              processor=lower, scorer=fuzz.ratio)
which returns
array([[100, 86, 100, ..., 86, 100, 86],
[ 89, 75, 89, ..., 75, 89, 75],
[100, 86, 100, ..., 86, 100, 86],
...,
[ 89, 75, 89, ..., 75, 89, 75],
[100, 86, 100, ..., 86, 100, 86],
[ 89, 75, 89, ..., 75, 89, 75]], dtype=uint8)
You can further improve the performance using multithreading:
process.cdist(df["string1"], df["string2"],
processor=lower, scorer=fuzz.ratio, workers=-1)
In case you do not want to use all available cores, you can specify a different count for workers.
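If the end goal is the single closest match for each record rather than the full score matrix, a small follow-up sketch could look like this (continuing the snippet above; best_idx and matches are just illustrative names, not part of the RapidFuzz API):
import numpy as np

scores = process.cdist(df["string1"], df["string2"],
                       processor=lower, scorer=fuzz.ratio, workers=-1)
best_idx = scores.argmax(axis=1)                           # column of the best match per row
matches = pd.DataFrame({
    "string1": df["string1"],
    "best_match": df["string2"].to_numpy()[best_idx],
    "score": scores[np.arange(len(best_idx)), best_idx],
})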
I have 3 lists as below:
names = ["paul", "saul", "steve", "chimpy"]
ages = [28, 59, 22, 5]
scores = [59, 85, 55, 60]
And I need to convert them to a dictionary like this:
{'steve': [22, 55, 'fail'], 'saul': [59, 85, 'pass'], 'paul': [28, 59, 'fail'], 'chimpy': [5, 60, 'pass']}
'pass' and 'fail' are coming from the score if it is >=60 or not.
I can do this with a series of for loops, but I'm looking for a neater, more professional method.
Thank you.
Using zip you can do at least this "condensed" implementation:
res = dict()
for n, a, s in zip(names, ages, scores):
    res[n] = [a, s, 'fail' if s < 60 else 'pass']
You could do this very neatly using a dictionary comprehension:
D = {name: [age, score, 'fail' if score < 60 else 'pass'] for name, age, score in zip(names, ages, scores)}
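With the original lists this produces the desired mapping:
print(D)
# {'paul': [28, 59, 'fail'], 'saul': [59, 85, 'pass'], 'steve': [22, 55, 'fail'], 'chimpy': [5, 60, 'pass']}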
Write a Python function which generates a list of n dictionaries representing n students. Each dictionary should have two keys, Name and Marks; the value of the Marks key should be a list of 10 elements representing marks in 10 subjects, as [a1, a2, a3, a4, ..., a10].
Create another function which accepts such a dictionary, performs an operation on the marks, and returns a list [α_1, α_2, α_3, ..., α_10]
where
α_i = β·α_{i−1} + (1−β)·a_i, with β = 0.99, α_0 = 1, i = 1, 2, 3, ..., 10
Example of the dictionaries:
[{'Marks': [80, 60, 57, 84, 52, 98, 49, 58, 73, 65], 'Name': 'Rahul'},
 {'Marks': [58, 66, 50, 94, 87, 98, 82, 62, 83, 67], 'Name': 'Deepak'}]
def create_dict(n, names, marks):           # build the list of n student dictionaries
    dicts = []
    for i in range(n):
        student = {}
        student["Name"] = names[i]          # names and marks are supplied by the caller
        student["Marks"] = marks[i]         # marks[i] is the list of 10 subject marks
        dicts.append(student)
    return dicts

def alpha_i(a, n):                          # function to calculate alpha_1 .. alpha_n
    alpha = [1]                             # alpha_0 = 1
    beta = 0.99
    for i in range(1, n + 1):
        # alpha_i = beta * alpha_{i-1} + (1 - beta) * a_i   (a is 0-indexed, so a_i -> a[i-1])
        alpha.append(beta * alpha[i - 1] + (1 - beta) * a[i - 1])
    return alpha[1:]                        # drop the alpha_0 seed

def marks_operation(smpl_dict):             # function to perform the operation on one student's marks
    marks = smpl_dict['Marks']
    return alpha_i(marks, len(marks))
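A quick usage sketch with the sample data from the question (assuming the corrected functions above; the alpha values in the comments are hand-checked):
students = create_dict(2,
                       ["Rahul", "Deepak"],
                       [[80, 60, 57, 84, 52, 98, 49, 58, 73, 65],
                        [58, 66, 50, 94, 87, 98, 82, 62, 83, 67]])

print(marks_operation(students[0])[:2])
# alpha_1 = 0.99*1 + 0.01*80 = 1.79
# alpha_2 = 0.99*1.79 + 0.01*60 = 2.3721 (up to floating-point rounding)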
I have 2 lists, whereby the sequence of values in the second list maps to the months in the first list:
['Apr-16', 'Jul-16', 'Dec-15', 'Sep-16', 'Aug-16', 'Feb-16', 'Mar-16', 'Jan-16', 'May-16', 'Jun-16', 'Oct-15', 'Nov-15']
[15, 15, 6, 81, 60, 36, 6, 18, 36, 27, 24, 29]
I need to retain 2 separate lists for use in another function. Using Python, how do I sort the lists into month order whilst retaining the existing mapping of values to months?
The idea is to:
associate both lists,
sort the resulting list of pairs according to the year/month criteria (the months must first be converted to month indexes using an auxiliary dictionary),
then separate the list of pairs back into 2 lists, now sorted by date.
Here's commented code which does what you want; maybe not the most compact or academic, but it works and is simple enough.
a = ['Apr-16', 'Jul-16', 'Dec-15', 'Sep-16', 'Aug-16', 'Feb-16', 'Mar-16', 'Jan-16', 'May-16', 'Jun-16', 'Oct-15', 'Nov-15']
b = [15, 15, 6, 81, 60, 36, 6, 18, 36, 27, 24, 29]
# create a dictionary key=month, value=month index
m = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
monthdict = dict(zip(m,range(len(m))))
# the sort function: returns sort key as (year as integer,month index)
def date_sort(d):
month,year = d[0].split("-")
return int(year),monthdict[month]
# zip both lists together and apply sort
t = sorted(zip(a,b),key=date_sort)
# unzip lists
asort = [e[0] for e in t]
bsort = [e[1] for e in t]
print(asort)
print(bsort)
result:
['Oct-15', 'Nov-15', 'Dec-15', 'Jan-16', 'Feb-16', 'Mar-16', 'Apr-16', 'May-16', 'Jun-16', 'Jul-16', 'Aug-16', 'Sep-16']
[24, 29, 6, 18, 36, 6, 15, 36, 27, 15, 60, 81]
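A variant worth noting (not from the answer above): the same sort key can be built with datetime.strptime and the '%b-%y' format, which removes the need for the month-index dictionary (assuming an English locale for the abbreviated month names):
from datetime import datetime

def date_key(pair):
    # 'Apr-16' -> datetime(2016, 4, 1), which sorts chronologically
    return datetime.strptime(pair[0], "%b-%y")

asort, bsort = map(list, zip(*sorted(zip(a, b), key=date_key)))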
I've got a big RDD (1 GB) in a YARN cluster. On the local machine which uses this cluster I have only 512 MB. I'd like to iterate over the values of the RDD on my local machine. I can't use collect(), because it would create an array locally that is bigger than my heap. I need some iterative way. There is the method iterator(), but it requires some additional information that I can't provide.
UPD: committed to the toLocalIterator method
Update: the RDD.toLocalIterator method, which appeared after the original answer was written, is a more efficient way to do the job. It uses runJob to evaluate only a single partition at each step.
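For reference, the PySpark equivalent is essentially a one-liner (a minimal sketch; the print is a stand-in for whatever you do with each value):
for value in rdd.toLocalIterator():
    print(value)   # or any other per-value processing on the driver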
TL;DR: the original answer below gives a rough idea of how it works:
First of all, get the array of partition indexes:
val parts = rdd.partitions
Then create smaller RDDs by filtering out everything but a single partition. Collect the data from the smaller RDDs and iterate over the values of a single partition:
for (p <- parts) {
    val idx = p.index
    val partRdd = rdd.mapPartitionsWithIndex((index, it) => if (index == idx) it else Iterator(), true)
    // The second argument is true to avoid RDD reshuffling
    val data = partRdd.collect // data contains all values from a single partition
                               // in the form of an array
    // Now you can do with the data whatever you want: iterate, save to a file, etc.
}
I didn't try this code, but it should work. Please write a comment if it doesn't compile. Of course, it will work only if the partitions are small enough. If they aren't, you can always increase the number of partitions with rdd.coalesce(numParts, true).
Wildfire's answer seems semantically correct, but I'm sure you could be vastly more efficient by using Spark's API. If you want to process each partition in turn, I don't see why you can't use the map/filter/reduce/reduceByKey/mapPartitions operations. The only time you'd want to have everything in one place in one array is when you're going to perform a non-monoidal operation - but that doesn't seem to be what you want. You should be able to do something like:
rdd.mapPartitions(recordsIterator => your code that processes a single chunk)
Or this
rdd.foreachPartition(partition => {
    partition.toArray
    // Your code
})
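In PySpark, the same per-partition processing would look roughly like this (a sketch; process_chunk is a placeholder for your own logic):
def process_chunk(records):
    # records is an iterator over one partition's elements
    for record in records:
        pass   # your per-record code here

rdd.foreachPartition(process_chunk)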
Here is the same approach as suggested by @Wildfire, but written in pyspark.
The nice thing about this approach is that it lets the user access records in the RDD in order. I'm using this code to feed data from an RDD into the STDIN of a machine learning tool's process.
rdd = sc.parallelize(range(100), 10)

def make_part_filter(index):
    def part_filter(split_index, iterator):
        if split_index == index:
            for el in iterator:
                yield el
    return part_filter

for part_id in range(rdd.getNumPartitions()):
    part_rdd = rdd.mapPartitionsWithIndex(make_part_filter(part_id), True)
    data_from_part_rdd = part_rdd.collect()
    print("partition id: %s elements: %s" % (part_id, data_from_part_rdd))
Produces output:
partition id: 0 elements: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
partition id: 1 elements: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
partition id: 2 elements: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
partition id: 3 elements: [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
partition id: 4 elements: [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
partition id: 5 elements: [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
partition id: 6 elements: [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
partition id: 7 elements: [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
partition id: 8 elements: [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
partition id: 9 elements: [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
pyspark dataframe solution using RDD.toLocalIterator():
import cStringIO

separator = '|'
df_results = hiveCtx.sql(sql)
columns = df_results.columns
print(separator.join(columns))

# Use toLocalIterator() rather than collect(), as this avoids pulling all of the
# data to the driver at one time. Rather, "the iterator will consume as much memory
# as the largest partition in this RDD."
MAX_BUFFERED_ROW_COUNT = 10000
row_count = 0
output = cStringIO.StringIO()
for record in df_results.rdd.toLocalIterator():
    d = record.asDict()
    output.write(separator.join([str(d[c]) for c in columns]) + '\n')
    row_count += 1
    if row_count % MAX_BUFFERED_ROW_COUNT == 0:
        print(output.getvalue().rstrip())
        # it is faster to create a new StringIO rather than clear the existing one
        # http://stackoverflow.com/questions/4330812/how-do-i-clear-a-stringio-object
        output = cStringIO.StringIO()
if row_count % MAX_BUFFERED_ROW_COUNT:
    print(output.getvalue().rstrip())
Map/filter/reduce using Spark and download the results later? I think the usual Hadoop approach will work.
The API has map, filter and saveAsTextFile operations: https://spark.incubator.apache.org/docs/0.8.1/scala-programming-guide.html#transformations
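A rough PySpark sketch of that approach (the lambdas and the output path are only placeholders; saveAsTextFile is the actual RDD method):
# Shrink the data on the cluster first, write the result to storage,
# then download the part files and iterate over them locally.
(rdd
 .map(lambda x: str(x))                      # replace with your own transformation
 .filter(lambda line: len(line) > 0)         # replace with your own predicate
 .saveAsTextFile("hdfs:///tmp/rdd_subset"))  # path is just an example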
For Spark 1.3.1, the format is as follows:
val parts = rdd.partitions
for (p <- parts) {
    val idx = p.index
    val partRdd = rdd.mapPartitionsWithIndex {
        case (index: Int, value: Iterator[(String, String, Float)]) =>
            if (index == idx) value else Iterator()
    }
    val dataPartitioned = partRdd.collect
    // Apply further processing on dataPartitioned
}