How to do always-necessary pre-processing / cleaning with intake?

I'm having a use case where:
I always need to apply a pre-processing step to the data before being able to use it. (Because the naming etc. don't follow community conventions enforced by some software further down the processing chain.)
I cannot change the raw data. (Because it might be in a repo I don't control, or because it's too big to duplicate, ...)
If I want to give a user the easiest and most transparent way of obtaining the pre-processed data, I can see two ways of doing this:
1. Load unprocessed data with intake and apply the pre-processing immediately:
import intake
from my_tools import pre_process
cat = intake.open_catalog('...')
raw_df = cat.some_data.read()
df = pre_process(raw_df)
2. Apply the pre-processing step with the .read() call.
Catalog:
sources:
  some_data:
    args:
      urlpath: "/path/to/some_raw_data.csv"
    description: "Some data (already preprocessed)"
    driver: csv
    preprocess: my_tools.pre_process
And:
import intake
cat = intake.open_catalog('...')
df = cat.some_data.read()

Option 2. is not possible in Intake right now; Intake was designed to be "load" rather than "process", so we've avoided the pipeline idea for now, but we might come back to it in the future.
However, you have a couple of options within Intake that you could consider alongside Option 1 above:
1. make your own driver, which implements the load and any processing exactly how you like (a sketch follows below). Writing drivers is pretty easy, and can involve arbitrary code/complexity.
2. write an alias-type driver, which takes the output of an entry in the same catalog and does something to it. See the docs and code for pointers.
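For the first option, here is a minimal sketch of what such a driver could look like, based on intake's documented DataSource/Schema interface; the class name, the my_tools.pre_process import, and the single-partition layout are assumptions for illustration, not anything intake requires:
import pandas as pd
from intake.source.base import DataSource, Schema
from my_tools import pre_process  # your existing cleaning function

class PreprocessedCSVSource(DataSource):
    """Loads a CSV with pandas and applies pre_process before handing data to the user."""
    container = 'dataframe'
    name = 'preprocessed_csv'
    version = '0.0.1'
    partition_access = False

    def __init__(self, urlpath, metadata=None):
        super().__init__(metadata=metadata)
        self._urlpath = urlpath
        self._df = None

    def _load(self):
        if self._df is None:
            self._df = pre_process(pd.read_csv(self._urlpath))

    def _get_schema(self):
        self._load()
        return Schema(datashape=None,
                      dtype={k: str(v) for k, v in self._df.dtypes.items()},
                      shape=self._df.shape,
                      npartitions=1,
                      extra_metadata={})

    def _get_partition(self, i):
        self._load()
        return self._df

    def read(self):
        return self._get_partition(0)

    def _close(self):
        self._df = None
In the catalog you would then point the entry's driver at this class, e.g. driver: my_package.PreprocessedCSVSource (or register it through an intake.drivers entry point), so users just call cat.some_data.read() and always get the cleaned frame.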

Related

Profiling the Spark Analyzer: how to access the QueryPlanningTracker for a pyspark query?

Any Spark & Py4J gurus available to explain how to reliably access Spark's java objects and variables from the Python side of pyspark? Specifically, how to access the data in Spark's QueryPlanningTracker from python?
Details
I am trying to profile the creation of a pyspark dataframe (df = spark_session.sql(thousand_line_query)). Not running the query; just creating the dataframe so I can inspect its schema. Merely waiting for the return from that .sql() call, which initializes the dataframe with no data, takes a long time (10-30 seconds). I have tracked the slow steps to Spark's Analyzer stage. Logging (below) suggests Spark is recomputing the same sub-query too many times, so I'm trying to dig in and see what is going on by profiling Spark's work on my query. I tried methods from a number of articles for profiling the Spark Optimizer stage for executing queries (e.g. Luca Canali's sparkMeasure, Rose Toomey's Care and Feeding of Catalyst Optimizer). But I have found no guide that focuses on profiling the Spark Analyzer stage that runs before the Optimizer stage. (Hence I also include extra details below on what I've found, which others may find helpful.)
Reading Spark's Scala sourcecode, I see the Analyzer is a RuleExecutor, and RuleExecutors have a QueryPlanningTracker which seems to record details on each invocation of each Analyzer Rule that Spark runs, specifically to allow a person to reconstruct a timeline of what the analyzer is doing for a single query.
However, I cannot seem to access the data in the Analyzer's QueryPlanningTracker from python. I would like to be able to retrieve a QueryPlanningTracker java object with the full details of the run of one query, and to inspect what fields & methods are available on it from the Python code. Any advice?
Examples
In python using pyspark, request a dataframe for my 1,000-line query and find it is slow:
query_sql = 'SELECT ... <long query here>'
spark_df = spark_session.sql(query_sql) # takes 10-30 seconds
Turn on copious logging, rerun the query above, and look at the output: the slow steps all mention the PlanChangeLogger, which is in the Spark Analyzer. Also access Spark's RuleExecutor to see how much time is used by each rule and which rules are not effective:
spark_session.sparkContext.setLogLevel('ALL')
rule_executor = spark_session._jvm.org.apache.spark.sql.catalyst.rules.RuleExecutor
rule_executor.resetMetrics()
spark_df = spark_session.sql(query_sql) # logs 10,000+ lines of output, lines with keyword `PlanChangeLogger` give timestamps showing the slow steps are in the Analyzer, but not the order of steps that occur
print(rule_executor.dumpTimeSpent()) # prints Analyzer rules that ran, how much time was 'effective' for each rule, but no specifics on order of rules run, no details on why rules took up a lot of time but were not effective.
Next: Try (unsuccessfully) to access Spark's QueryPlanningTracker data to drill down to a timeline of rules run, how long each call to each rule took, and any other specifics I can get:
tracker = spark_session._jvm.org.apache.spark.sql.catalyst.QueryPlanningTracker
# Use some call here to show the data contents of the tracker; e.g. an initial exploration:
tracker.measurePhase.topRulesByTime(10)
*** TypeError: 'JavaPackage' object is not callable ....
The above is one example; the tracker code suggests it has other methods & fields I could use, but I do not see how to access those, nor how to inspect from Python what methods & fields are available, so it is just trial & error from reading Spark's GitHub repository ...
You can try this:
>>> df = spark.range(1000).selectExpr("count(*)")
>>> tracker = df._jdf.queryExecution().tracker()
>>> print(tracker)
org.apache.spark.sql.catalyst.QueryPlanningTracker#5702d8be
>>> print(tracker.topRulesByTime(10))
Stream((org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions,RuleSummary(27004600, 2, 1)), ?)
I'm not sure what kind of info you need, but if you want to see the generated query plan, you can use df.explain().
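If topRulesByTime is not enough detail, the same tracker object should also expose per-phase summaries, and plain Java reflection through py4j lets you list what else is callable; a small sketch, assuming the phases() accessor that recent Spark versions define on QueryPlanningTracker:
>>> # Per-phase timing summaries (e.g. analysis, optimization, planning); returns a Java Map
>>> print(tracker.phases())
>>> # Enumerate the tracker's public Java methods via reflection to explore further from Python
>>> for method in tracker.getClass().getMethods():
...     print(method.getName())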

How to access output folder from a PythonScriptStep?

I'm new to azure-ml, and have been tasked with making some integration tests for a couple of pipeline steps. I have prepared some input test data and some expected output data, which I store on a 'test_datastore'. The following example code is a simplified version of what I want to do:
ws = Workspace.from_config('blabla/config.json')
ds = Datastore.get(ws, datastore_name='test_datastore')

main_ref = DataReference(datastore=ds,
                         data_reference_name='main_ref'
                         )
data_ref = DataReference(datastore=ds,
                         data_reference_name='data_ref',
                         path_on_datastore='/data'
                         )

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='/.',
    arguments=['--main_path', main_ref,
               '--data_ref_folder', data_ref
               ],
    inputs=[main_ref, data_ref],
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
I would like:
my data_prep_step to run,
have it store some data on the path of my data_ref, and
then to access this stored data afterwards, outside of the pipeline.
But I can't find a useful function in the documentation. Any guidance would be much appreciated.
two big ideas here -- let's start with the main one.
main ask
With an Azure ML Pipeline, how can I access the output data of a PythonScriptStep outside of the context of the pipeline?
short answer
Consider using OutputFileDatasetConfig (docs example) instead of DataReference.
To your example above, I would just change your last two definitions.
data_ref = OutputFileDatasetConfig(
    name='data_ref',
    destination=(ds, '/data')
).as_upload()

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='/.',
    arguments=[
        '--main_path', main_ref,
        '--data_ref_folder', data_ref
    ],
    inputs=[main_ref, data_ref],
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
some notes:
be sure to check out how DataPaths work. Can be tricky at first glance.
set overwrite=False in the .as_upload() method if you don't want future runs to overwrite the first run's data.
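To then get at that data outside of the pipeline, one route is to register the uploaded output as a dataset and pull it down from the workspace after the run; a rough sketch, assuming the SDK v1 register_on_complete() and Dataset APIs, with 'data_prep_output' as a made-up dataset name:
# Register the uploaded output under a name so it is retrievable after the run
data_ref = OutputFileDatasetConfig(
    name='data_ref',
    destination=(ds, '/data')
).as_upload().register_on_complete(name='data_prep_output')  # hypothetical dataset name

# ... build and submit the pipeline as above, then wait for the run to finish ...

# Outside the pipeline: fetch the registered FileDataset and download it locally
from azureml.core import Dataset
output_ds = Dataset.get_by_name(ws, name='data_prep_output')
local_paths = output_ds.download(target_path='./data_prep_output', overwrite=True)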
more context
PipelineData used to be the de facto object to pass data ephemerally between pipeline steps. The idea was to make it easy to:
stitch steps together
get the data after the pipeline runs if need be (datastore/azureml/{run_id}/data_ref)
The downside was that you had no control over where the pipeline data was saved. If you wanted the data for more than just a baton that gets passed between steps, you could have a DataTransferStep land the PipelineData wherever you please after the PythonScriptStep finishes.
This downside is what motivated OutputFileDatasetConfig.
auxiliary ask
how might I programmatically test the functionality of my Azure ML pipeline?
there are not enough people talking about data pipeline testing, IMHO.
There are three areas of data pipeline testing:
unit testing (does the code in the step work?)
integration testing (does the code work when submitted to the Azure ML service?)
data expectation testing (does the data coming out of the step meet my expectations?)
For #1, I think it should be done outside of the pipeline, perhaps as part of a package of helper functions.
For #2, why not just see if the whole pipeline completes? I think you get more information that way. That's how we run our CI.
#3 is the juiciest, and we do this in our pipelines with the Great Expectations (GE) Python library. The GE community calls these "expectation tests". To me you have two options for including expectation tests in your Azure ML pipeline:
1. within the PythonScriptStep itself, i.e. run whatever code you have, then test the outputs with GE before writing them out; or,
2. for each functional PythonScriptStep, hang a downstream PythonScriptStep off of it in which you run your expectations against the output data.
Our team does #1, but either strategy should work. What's great about this approach is that you can run your expectation tests by just running your pipeline (which also makes integration testing easy).
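For reference, a minimal sketch of option 1 using Great Expectations' pandas-level API; the column names and bounds are made up, and the exact result objects depend on your GE version:
import great_expectations as ge

def check_output(df):
    """Run expectation tests against the step's output before it gets written out."""
    ge_df = ge.from_pandas(df)
    results = [
        ge_df.expect_column_values_to_not_be_null('id'),             # hypothetical column
        ge_df.expect_column_values_to_be_between('amount', 0, 1e6),  # hypothetical bounds
    ]
    failed = [r for r in results if not r.success]  # .success on recent GE result objects
    if failed:
        raise ValueError(f'{len(failed)} expectation test(s) failed: {failed}')

# inside pipeline_steps/data_prep.py, after producing `df` and before writing it out:
# check_output(df)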

Is there a performance difference between `dedupe.match(generator=True)` and `dedupe.matchBlocks()` for large datasets?

I'm preparing to run dedupe on a fairly large dataset (400,000 rows) with Python. In the documentation for the DedupeMatching class, there are both the match and matchBlocks functions. For match, the docs suggest only using it on small to moderately sized datasets. From looking through the code, I can't tell how matchBlocks in tandem with block_data performs better than just match on larger datasets when generator=True is passed to match.
I've tried running both methods on a small-ish dataset (10,000 entities) and didn't notice a difference.
data_d = {'id1': {'name': 'George Bush', 'address': '123 main st.'},
          'id2': {'name': 'Bill Clinton', 'address': '1600 pennsylvania ave.'},
          ...
          'id10000': {...}}
then either method A:
blocks = deduper._blockData(data_d)
clustered_dupes = deduper.matchBlocks(blocks, threshold=threshold)
or method B:
clustered_dupes = deduper.match(data_d, threshold=threshold, generator=True)
(Then the computationally intensive part is running a for-loop on the clustered_dupes object.)
cluster_membership = {}
for (cluster_id, cluster) in enumerate(clustered_dupes):
    # Do something with each cluster_id like below
    cluster_membership[cluster_id] = cluster
I expect/wonder if there is a performance difference. If so, could you point me to the code that shows that and explain why?
there is no difference between calling _blockData and then matchBlocks versus just match. Indeed if you look at the code, you'll see that match calls those two methods.
The reason why matchBlocks is exposed is that _blockData can take a lot of memory, and you may want to generate the blocks another way, such as taking advantage of a relational database.

Are the same IDs always given out across the same logical plan?

Below you see a simplified version of what I'm trying to do. I load a DataFrame from 150 parquet files (>10TB) stored in S3, then I give this dataframe an id column with func.monotonically_increasing_id(). Afterwards I save a couple of derivatives of this dataframe. The functions I apply are a little bit more complicated than I present here, but I hope this gets the point across.
DF_loaded = spark.read.parquet('/some/path/*/')
DF_with_IDs = DF_loaded.withColumn('id',func.monotonically_increasing_id())
#creating parquet_1
DF_with_IDs.where(col('a').isNotNull()).write.parquet('/path/parquet_1/')
#creating parquet_2
DF_with_IDs.where(col('b').isNotNull()).write.parquet('/path/parquet_2/')
Now I noticed that Spark, after creating parquet_1, loads all the data from S3 again to create parquet_2. I'm worried that the IDs given in parquet_1 do not match those of parquet_2, i.e. that the same row has different IDs in both parquets, because as far as I understand it the logical plan Spark comes up with looks like this:
#parquet_1
load_data -> give_ids -> make_selection -> write_parquet
#parquet_2
load_data -> give_ids -> make_selection -> write_parquet
So are the same IDs given to the same rows in both parquets?
As long as:
You use a recent version of Spark (SPARK-13473, SPARK-14241).
There is no configuration change between actions (changes in configuration can affect the number of partitions and, as a result, the ids).
monotonically_increasing_id should be stable. Note that this disables predicate pushdown.
rdd.zipWithIndex().toDF() should be stable independent of configuration, so it might be preferable.
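A rough sketch of the zipWithIndex route, reusing the names from the question above; rebuilding the DataFrame from the RDD is my own addition and worth checking against your actual schema:
from pyspark.sql import functions as func
from pyspark.sql.types import StructType, StructField, LongType

DF_loaded = spark.read.parquet('/some/path/*/')

# zipWithIndex pairs every row with a consecutive long index; append it to each row
rdd_with_ids = DF_loaded.rdd.zipWithIndex().map(lambda row_idx: row_idx[0] + (row_idx[1],))

schema_with_id = StructType(DF_loaded.schema.fields + [StructField('id', LongType(), False)])
DF_with_IDs = spark.createDataFrame(rdd_with_ids, schema_with_id)

# optionally persist DF_with_IDs so the second write does not re-read everything from S3
DF_with_IDs.where(func.col('a').isNotNull()).write.parquet('/path/parquet_1/')
DF_with_IDs.where(func.col('b').isNotNull()).write.parquet('/path/parquet_2/')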

pyspark - using MatrixFactorizationModel in RDD's map function

I have this model:
from pyspark.mllib.recommendation import ALS

model = ALS.trainImplicit(ratings,
                          rank,
                          seed=seed,
                          iterations=iterations,
                          lambda_=regularization_parameter,
                          alpha=alpha)
I have successfully used it to recommend users for all products with the simple approach:
recRDD = model.recommendUsersForProducts(number_recs)
Now, if I just want recommendations for a set of items, I first load the target items:
target_items = sc.textFile(items_source)
And then map the recommendUsers() function like this:
recRDD = target_items.map(lambda x: model.recommendUsers(int(x), number_recs))
This fails after any action I try, with the following error:
It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I'm trying this locally, so I'm not sure if this error persists in client or cluster mode. I have tried broadcasting the model, which only raises the same error at broadcast time instead.
Am I thinking straight? I could eventually just recommend for all items and then filter, but I'm really trying to avoid recommending for every item due to the large number of them.
Thanks in advance!
I don't think there is a way to call recommendUsers from the workers because it ultimately calls callJavaFunc which needs the SparkContext as an argument. If target_items is sufficiently small you could call recommendUsers in a loop on the driver (this would be the opposite extreme of predicting for all users and then filtering).
Alternatively, have you looked at predictAll? Roughly speaking, you could run predictions for all users for the target items, and then rank them yourself.
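A rough sketch of both suggestions, reusing names from the question (target_items holding product ids as text, ratings as the training RDD, number_recs as before); the ranking logic at the end is illustrative only:
# (a) If the target set is small enough, collect it and loop over recommendUsers on the driver
target_ids = [int(x) for x in target_items.collect()]
recs_by_item = {pid: model.recommendUsers(pid, number_recs) for pid in target_ids}

# (b) Otherwise, score every (user, target item) pair with predictAll, then rank per item
users = ratings.map(lambda r: r[0]).distinct()
pairs = users.cartesian(sc.parallelize(target_ids))
scored = model.predictAll(pairs)  # RDD of Rating(user, product, rating)
top_users_per_item = (scored
                      .map(lambda r: (r.product, (r.rating, r.user)))
                      .groupByKey()
                      .mapValues(lambda s: [u for _, u in sorted(s, reverse=True)[:number_recs]]))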
