Is there a performance difference between `dedupe.match(generator=True)` and `dedupe.matchBlocks()` for large datasets? - python-dedupe

I'm preparing to run dedupe on a fairly large dataset (400,000 rows) with Python. In the documentation for the DedupeMatching class, there are both the match and matchBlocks functions. For match, the docs suggest using it only on small to moderately sized datasets. From looking through the code, I can't tell how matchBlocks in tandem with block_data performs better than just match with generator=True on larger datasets.
I've tried running both methods on a small-ish dataset (10,000 entities) and didn't notice a difference.
data_d = {'id1': {'name': 'George Bush', 'address': '123 main st.'},
          'id2': {'name': 'Bill Clinton', 'address': '1600 pennsylvania ave.'},
          ...
          'id10000': {...}}
then either method A:
blocks = deduper._blockData(data_d)
clustered_dupes = deduper.matchBlocks(blocks, threshold=threshold)
or method B
clustered_dupes = deduper.match(data_d, threshold=threshold, generator=True)
(The computationally intensive part is then running a for-loop over the clustered_dupes object.)
cluster_membership = {}
for (cluster_id, cluster) in enumerate(clustered_dupes):
    # do something with each cluster, e.g.
    cluster_membership[cluster_id] = cluster
Is there a performance difference? If so, could you point me to the code that shows it and explain why?

There is no difference between calling _blockData and then matchBlocks versus just calling match. Indeed, if you look at the code, you'll see that match calls those two methods.
The reason matchBlocks is exposed is that _blockData can take a lot of memory, and you may want to generate the blocks another way, for example by taking advantage of a relational database.
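If you go the database route, the idea is to stream blocks into matchBlocks rather than materialising them all at once. A minimal sketch with sqlite3 — the table, columns, and per-record tuple shape below are illustrative assumptions; check what record format your dedupe version's _blockData produces before wiring this up:

```python
import sqlite3

def blocks_from_db(db_path):
    """Yield one block of records at a time, grouped by a precomputed block key.

    The exact record tuples matchBlocks expects depend on your dedupe version;
    this only sketches streaming blocks from SQL instead of holding them in RAM.
    """
    conn = sqlite3.connect(db_path)
    cur = conn.execute(
        "SELECT block_key, record_id, name, address "
        "FROM records ORDER BY block_key"
    )
    block, current_key = [], None
    for block_key, record_id, name, address in cur:
        if block_key != current_key and block:
            yield block  # emit the finished block before starting the next
            block = []
        current_key = block_key
        block.append((record_id, {'name': name, 'address': address}))
    if block:
        yield block
    conn.close()

# clustered_dupes = deduper.matchBlocks(blocks_from_db('records.db'), threshold=0.5)
```

Because blocks_from_db is a generator, only one block is in memory at a time, which is the point of exposing matchBlocks separately.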

Another problem with: PerformanceWarning: DataFrame is highly fragmented

I'm still learning Python, so I'm running into some performance problems here.
I keep getting the warning
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
and my code takes a while to run as it is.
Here is my code:
import numpy as np
import pandas as pd

def Monte_Carlo_for_Tracking_Error(N, S, K, Ru, Rd, r, I, a):
    ldv = []
    lhp = []
    lsp = []
    lod = []
    Tracking_Error_df = pd.DataFrame()
    # Go through different time steps of rebalancing
    for y in range(1, I + 1):
        i = 0
        # do the same step a amount of times
        while i < a:
            Sample_Stock_Prices = []
            Sample_Hedging_Portfolio = []
            Hedging_Portfolio_Value = np.zeros(N)  # initialize hedging PF
            New_Path = Portfolio_specification(N, S, K, Ru, Rd, r)  # get a new sample path
            Sample_Stock_Prices.append(New_Path[0])
            Sample_Hedging_Portfolio.append(Changing_Rebalancing_Rythm(New_Path, y))
            Call_Option_Value = []
            Call_Option_Value.append(New_Path[1])
            Differences = np.zeros(N)
            for x in range(N):
                Hedging_Portfolio_Value[x] = Sample_Stock_Prices[0][x] * Sample_Hedging_Portfolio[0][x]
            for z in range(N):
                Differences[z] = Call_Option_Value[0][z] - Hedging_Portfolio_Value[z]
            lhp.append(Hedging_Portfolio_Value)
            lsp.append(np.asarray(Sample_Stock_Prices))
            ldv.append(np.asarray(Sample_Hedging_Portfolio))
            lod.append(np.asarray(Differences))
            Tracking_Error_df[f'Index{i+(y-1)*200}'] = Differences
            i = i + 1
    return (Tracking_Error_df, lod, lsp, lhp, ldv)
Code starts to give me warnings when I try to run:
Simulation=MCTE(100,100,104,1.05,0.95,0,10,200)
Small part of the warning:
C:\Users\xxx\AppData\Local\Temp\ipykernel_1560\440260239.py:30: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
Tracking_Error_df[f'Index{i+(y-1)*200}']=Differences
I am using a Jupyter notebook for this. If somebody could help me optimise it, I would appreciate it.
I tested the code and am hoping for a more performance-oriented version of it.
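The warning message itself points at the fix: each `Tracking_Error_df[...] = Differences` inserts a new column into an existing frame. A sketch of the pattern it suggests instead — collect the columns in a dict and build the DataFrame once at the end (the loop bounds and random data below are placeholders for the real Differences arrays):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Collect each iteration's column in a plain dict...
columns = {}
for j in range(400):
    differences = rng.standard_normal(100)  # stand-in for the real Differences array
    columns[f'Index{j}'] = differences

# ...then build the DataFrame in a single step: no repeated frame.insert,
# so pandas never emits the fragmentation warning.
tracking_error_df = pd.DataFrame(columns)
```

The same idea works with pd.concat(list_of_series, axis=1) if you prefer to keep a list; the point is one construction instead of hundreds of inserts.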

How to do always necessary pre processing / cleaning with intake?

I'm having a use case where:
- I always need to apply a pre-processing step to the data before being able to use it (because the naming etc. doesn't follow the community conventions enforced by some software further down the processing chain).
- I cannot change the raw data (because it might be in a repo I don't control, or because it's too big to duplicate, ...).
If I aim at providing a user with the easiest and most transparent way of obtaining the data in a pre-processed way, I can see two ways of doing this:
1. Load unprocessed data with intake and apply the pre-processing immediately:
import intake
from my_tools import pre_process
cat = intake.open_catalog('...')
raw_df = cat.some_data.read()
df = pre_process(raw_df)
2. Apply the pre-processing step with the .read() call.
Catalog:
sources:
  some_data:
    args:
      urlpath: "/path/to/some_raw_data.csv"
    description: "Some data (already preprocessed)"
    driver: csv
    preprocess: my_tools.pre_process
And:
import intake
cat = intake.open_catalog('...')
df = cat.some_data.read()
Option 2. is not possible in Intake right now; Intake was designed to be "load" rather than "process", so we've avoided the pipeline idea for now, but we might come back to it in the future.
However, you have a couple of options within Intake that you could consider alongside Option 1., above:
- Make your own driver, which implements the load and any processing exactly how you like. Writing drivers is pretty easy and can involve arbitrary code/complexity.
- Write an alias-type driver, which takes the output of an entry in the same catalog and does something to it. See the docs and code for pointers.

Why doesn't an operation on an accumulator work without collect()?

I am learning Spark and following a tutorial. In an exercise I am trying to do some analysis on a data set. This data set has data in each line like:
userid | age | gender | ...
I have the following piece of code:
....
under_age = sc.accumulator(0)
over_age = sc.accumulator(0)
def count_outliers(data):
    global under_age, over_age
    if data[1] == '0-10':
        under_age += 1
    if data[1] == '80+':
        over_age += 1
    return data
data_set.map(count_outliers).collect()
print('Kids: {}, Seniors: {}'.format(under_age, over_age))
I found that I must call ".collect()" to make this code work; without it, the code won't update the two accumulators. But in my understanding, ".collect()" is used to bring the whole dataset into memory. Why is it necessary here? Is it related to lazy evaluation? Please advise.
Yes, it is due to lazy evaluation.
Spark doesn't calculate anything until you execute an action such as collect, and the accumulators are only updated as a side-effect of that calculation.
Transformations such as map define what work needs to be done, but it's only executed once an action is triggered to "pull" the data through the transformations.
This is described in the documentation:
Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action. Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like map().
It's also important to note that:
In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
so your accumulators will not necessarily give correct answers; they may overstate the totals.
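The same behaviour can be reproduced with a plain Python generator, which is lazy in the same sense: the side effect inside the mapping function runs only when something consumes the pipeline, which is the analogue of a Spark action:

```python
counter = {'n': 0}

def count_and_pass(x):
    counter['n'] += 1  # side effect, like the accumulator update in count_outliers
    return x

# Building the pipeline runs nothing yet, just like map() on an RDD.
pipeline = (count_and_pass(x) for x in range(5))
assert counter['n'] == 0  # no element has been processed

# Consuming it (the analogue of collect()) triggers the side effects.
result = list(pipeline)
assert counter['n'] == 5
```

In Spark itself, if you only want the side effects, running the counting function with the foreach action instead of map(...).collect() triggers the computation without pulling the whole dataset back to the driver.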

How to implement SUM with @QuerySqlFunction?

The examples of @QuerySqlFunction I have seen so far are trivial; I put one below. However, I'm looking for an example / solution / hint for providing a cross-row calculation, e.g. average, sum, ... Is this possible?
In the example, the function returns value 0 from an array, basically an implementation of ARRAY_GET(x, 0). All other examples I've seen are similar: one row, get a value, do something with it. But I need to be able to calculate the sum of a grouped result, or possibly a lot more business logic. If somebody could provide me with the @QuerySqlFunction for SUM, I assume that would allow me to do much more than just SUM.
Step 1: Write a function
public class MyIgniteFunctions {
    @QuerySqlFunction
    public static double value1(double[] values) {
        return values[0];
    }
}
Step 2: Register the function
CacheConfiguration<Long, MyFact> factResultCacheCfg = ...
factResultCacheCfg.setSqlFunctionClasses(new Class[] { MyIgniteFunctions.class });
Step 3: Use it in a query
SELECT
    MyDimension.groupBy1,
    MyDimension.groupBy2,
    SUM(VALUE1(MyFact.values))
FROM
    "dimensionCacheName".DimDimension,
    "factCacheName".FactResult
WHERE
    MyDimension.uid = MyFact.dimensionUid
GROUP BY
    MyDimension.groupBy1,
    MyDimension.groupBy2
I don't believe Ignite currently has clean API support for a custom user-defined QuerySqlFunction that spans multiple rows.
If you need something like this, I would suggest that you make use of IgniteCompute APIs and distribute your computations, lambdas, or closures to the participating Ignite nodes. Then from inside of your closure, you can either execute local SQL queries, or perform any other cache operations, including predicate-based scans over locally cached data.
This approach will be executed across multiple Ignite nodes in parallel and should perform well.

Spark: getting cumulative frequency from frequency values

My question is rather simple to answer in a single-node environment, but I don't know how to do the same thing in a distributed Spark environment. What I have now is a "frequency plot", in which for each item I have the number of times it occurs. For instance, it may be something like this: (1, 2), (2, 3), (3, 1), which means that 1 occurred 2 times, 2 occurred 3 times, and so on.
What I would like to get is the cumulative frequency for each item, so the result I need from the example data above is: (1, 2), (2, 3+2=5), (3, 1+3+2=6).
So far, I have tried to do this by using mapPartitions, which gives the correct result if there is only one partition, but obviously not otherwise.
How can I do that?
Thanks.
Marco
I don't think what you want is possible as a distributed transformation in Spark unless your data is small enough to be aggregated into a single partition. Spark functions work by distributing jobs to remote processes, and the only way to communicate back is using an action which returns some value, or using an accumulator. Unfortunately, accumulators can't be read by the distributed jobs, they're write-only.
If your data is small enough to fit in memory on a single partition/process, you can coalesce(1), and then your existing code will work. If not, but a single partition will fit in memory, then you might use a local iterator:
var total = 0L
rdd.sortBy(_._1).toLocalIterator.foreach(tuple => {
  total = total + tuple._2
  println((tuple._1, total)) // or write to a local file
})
If I understood your question correctly, it really looks like a fit for one of the combiner functions – take a look at the different versions of the aggregateByKey and reduceByKey functions.
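For reference, the underlying computation — whether streamed through toLocalIterator or run inside a single partition — is just a running sum over the counts sorted by key, e.g. in plain Python:

```python
from itertools import accumulate

freq = [(3, 1), (1, 2), (2, 3)]  # (item, count) pairs, in arbitrary order

# Sort by item, then take a running sum of the counts.
items = sorted(freq)
totals = accumulate(count for _, count in items)
cumulative = [(item, total) for (item, _), total in zip(items, totals)]
print(cumulative)  # [(1, 2), (2, 5), (3, 6)]
```

The distributed difficulty is exactly that this running total depends on every earlier key, which is why it doesn't map onto an ordinary per-partition transformation.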
