Using MatrixFactorizationModel in an RDD's map function (pyspark / apache-spark)

I have this model:
from pyspark.mllib.recommendation import ALS
model = ALS.trainImplicit(ratings,
                          rank,
                          seed=seed,
                          iterations=iterations,
                          lambda_=regularization_parameter,
                          alpha=alpha)
I have successfully used it to recommend users for all products with the simple approach:
recRDD = model.recommendUsersForProducts(number_recs)
Now, if I only want recommendations for a set of items, I first load the target items:
target_items = sc.textFile(items_source)
And then map the recommendUsers() function like this:
recRDD = target_items.map(lambda x: model.recommendUsers(int(x), number_recs))
This fails after any action I try, with the following error:
It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I'm running this locally, so I'm not sure whether the error persists in client or cluster mode. I have also tried broadcasting the model, which only raises the same error at the broadcast step instead.
Am I thinking about this the right way? I could just recommend for all items and then filter, but I'm really trying to avoid recommending for every item because there are so many of them.
Thanks in advance!

I don't think there is a way to call recommendUsers from the workers, because it ultimately calls callJavaFunc, which needs the SparkContext as an argument. If target_items is sufficiently small, you could call recommendUsers in a loop on the driver (this would be the opposite extreme of predicting for all users and then filtering).
Alternatively, have you looked at predictAll? Roughly speaking, you could run predictions for all users for the target items, and then rank them yourself.
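For reference, a rough sketch of that predictAll route. It assumes user_ids is an RDD of all user ids (a hypothetical name; the question doesn't show how users are loaded) and that target_items contains one integer product id per line:

# Hypothetical sketch: score every (user, target item) pair, then rank per item.
target_ids = target_items.map(lambda x: int(x))
user_product = user_ids.cartesian(target_ids)  # all (user, item) pairs

# predictAll is called on the driver and returns an RDD of Rating(user, product, rating)
predictions = model.predictAll(user_product)

# Keep the top-N users per target item
top_users = (predictions
             .map(lambda r: (r.product, (r.user, r.rating)))
             .groupByKey()
             .mapValues(lambda us: sorted(us, key=lambda u: -u[1])[:number_recs]))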

Related

Using PySpark instead of the multiprocessing module - How can I translate my code?

I am currently using the multiprocessing module to parallelize iterations, as shown in this example. The problem is that this way I use only one worker and its cores, not all the workers available. I am also unable to parallelize experiments (I'm running several experiments, with several iterations each).
This code takes too long to run, and I understand the runtime could be greatly reduced using PySpark. My Spark knowledge is very limited and I don't know how to translate this code to use it with Spark.
All the functions and classes used here are written in pure Python (numpy and pandas):
import concurrent.futures
import multiprocessing as mp

def process_simulation(experiment):
    number_of_workers = mp.cpu_count()
    with concurrent.futures.ProcessPoolExecutor(max_workers=number_of_workers) as executor:
        # Pass the function itself (not a call) and the iterable of inputs
        results = list(executor.map(Simulation.simulation_steps, iterations_generator()))
    experiment.simulations = []
    for result in results:
        experiment.simulations.append(result)
For context, Experiment and Simulation are classes (there's no inheritance). One experiment needs multiple simulations to be completed.
Thank you!
You can use Fugue to bring this type of logic to PySpark with a minimal wrapper. The only thing is that you need to start with a DataFrame of inputs; then you can do something like:
from fugue import transform

transform(input_df,
          Simulation.simulation_steps,
          schema="<your output schema here>",
          partition={"how": "per_row"},
          engine="spark")
I'm happy to help more with getting it into this shape if I have more details about the logic. It may just need one wrapper function (contact info in bio).
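If you'd rather not add a dependency, here is a rough plain-RDD sketch of the same idea, using the names from the question; treating each generated iteration spec as one Spark task is my assumption about the intent:

# Assumed sketch: one Spark task per simulation input, results collected back.
# Simulation and its dependencies must be importable/picklable on the workers.
inputs = list(iterations_generator())
results = (sc.parallelize(inputs, numSlices=len(inputs))
             .map(Simulation.simulation_steps)
             .collect())
experiment.simulations = results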

Profiling the Spark Analyzer: how to access the QueryPlanningTracker for a pyspark query?

Any Spark & Py4J gurus available to explain how to reliably access Spark's java objects and variables from the Python side of pyspark? Specifically, how to access the data in Spark's QueryPlanningTracker from python?
Details
I am trying to profile the creation of a pyspark dataframe (df = spark_session.sql(thousand_line_query)). Not running the query, just creating the dataframe so I can inspect its schema. Merely waiting for the return from that .sql() call, which initializes the dataframe with no data, takes a long time (10-30 seconds). I have tracked the slow steps to Spark's Analyzer stage. Logging (below) suggests Spark is recomputing the same sub-query too many times, so I'm trying to dig in and see what is going on by profiling Spark's work on my query. I tried methods from a number of articles on profiling the Spark Optimizer stage for executing queries (e.g. Luca Canali's sparkMeasure, Rose Toomey's Care and Feeding of Catalyst Optimizer), but I have found no guide that focuses on profiling the Spark Analyzer stage that runs before the Optimizer stage. (Hence I also include extra details below on what I've found, which others may find helpful.)
Reading Spark's Scala sourcecode, I see the Analyzer is a RuleExecutor, and RuleExecutors have a QueryPlanningTracker which seems to record details on each invocation of each Analyzer Rule that Spark runs, specifically to allow a person to reconstruct a timeline of what the analyzer is doing for a single query.
However, I cannot seem to access the data in the Analyzer's QueryPlanningTracker from python. I would like to be able to retrieve a QueryPlanningTracker java object with the full details of the run of one query, and to inspect what fields & methods are available on the Python code. Any advice?
Examples
In python using pyspark, request a dataframe for my 1,000-line query and find it is slow:
query_sql = 'SELECT ... <long query here>'
spark_df = spark_session.sql(query_sql) # takes 10-30 seconds
Turn on copious logging, rerun the query above, and look at the output: the slow steps all mention the PlanChangeLogger, which is in the Spark Analyzer. Also access Spark's RuleExecutor to see how much time is used by each rule and which rules are not effective:
spark_session.sparkContext.setLogLevel('ALL')
rule_executor = spark_session._jvm.org.apache.spark.sql.catalyst.rules.RuleExecutor
rule_executor.resetMetrics()
spark_df = spark_session.sql(query_sql) # logs 10,000+ lines of output, lines with keyword `PlanChangeLogger` give timestamps showing the slow steps are in the Analyzer, but not the order of steps that occur
print(rule_executor.dumpTimeSpent()) # prints Analyzer rules that ran, how much time was 'effective' for each rule, but no specifics on order of rules run, no details on why rules took up a lot of time but were not effective.
Next: Try (unsuccessfully) to access Spark's QueryPlanningTracker data to drill down to a timeline of rules run, how long each call to each rule took, and any other specifics I can get:
tracker = spark_session._jvm.org.apache.spark.sql.catalyst.QueryPlanningTracker
## Use some call here to show the data contents of the tracker. E.g. an initial exploration:
tracker.measurePhase.topRulesByTime(10)
*** TypeError: 'JavaPackage' object is not callable ....
The above is one example. The tracker code suggests it has other methods and fields I could use; however, I do not see how to access those, nor how to inspect the object from Python to see what methods and fields are available, so it is trial and error from reading Spark's GitHub repository...
You can try this:
>>> df = spark.range(1000).selectExpr("count(*)")
>>> tracker = df._jdf.queryExecution().tracker()
>>> print(tracker)
org.apache.spark.sql.catalyst.QueryPlanningTracker#5702d8be
>>> print(tracker.topRulesByTime(10))
Stream((org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions,RuleSummary(27004600, 2, 1)), ?)
I'm not sure what kind of info you need, but if you want to see the generated query plan, you can use df.explain().
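To the discovery question in the post: the tracker returned above is a py4j handle to the JVM object, so you can call its Scala methods directly with parentheses. topRulesByTime and phases both exist on QueryPlanningTracker in the Spark source; how their Scala return values print from Python is the only assumption in this sketch:

df = spark.range(1000).selectExpr("count(*)")
tracker = df._jdf.queryExecution().tracker()

# Per-phase wall-clock summaries (analysis, optimization, planning, ...)
print(tracker.phases())

# The ten rules with the largest cumulative time, as in the answer above
print(tracker.topRulesByTime(10))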

Submitting multiple runs to the same node on AzureML

I want to perform hyperparameter search using AzureML. My models are small (around 1 GB), so I would like to run multiple models on the same GPU/node to save costs, but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use MpiConfiguration there, but then the run fails with an MPI error that reads as if the cluster is configured to allow only one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run per GPU, and additionally providing an MpiConfiguration leads to the same error as shown above.
I guess I could always rewrite my training script to train multiple models in parallel, so that each run wraps multiple trainings. I would like to avoid this option, though, because logging and checkpoint writes become increasingly messy, and it would require a large refactor of the training pipeline. Also, this functionality seems so basic that I hope there is a way to do it gracefully. Any ideas?
Use the Run.create_children method, which starts child runs that are “local” to the parent run and don't need authentication.
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes that will be used for a hyperparameter tuning run, so there would be one execution per node.
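A minimal sketch of the child-run approach, assuming one parent run per node with a handful of small models trained inside it (the count of 4 and the logging are illustrative, not from the original answer):

from azureml.core import Run

# Inside train.py, running on one GPU node
parent_run = Run.get_context()

# Start child runs that all share this node's GPU
children = parent_run.create_children(count=4)
for i, child in enumerate(children):
    child.log("model_index", i)
    # ... train the i-th small model here ...
    child.complete()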
Another option: deploy a single service, but load multiple model versions in init(); the score function then, depending on the request's param, uses a particular model version to score.
Or use the new ML Endpoints (Preview):
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs

Why the operation on accumulator doesn't work without collect()?

I am learning Spark and following a tutorial. In an exercise I am trying to do some analysis on a data set. The data set has one record per line, like:
userid | age | gender | ...
I have the following piece of code:
....
under_age = sc.accumulator(0)
over_age = sc.accumulator(0)
def count_outliers(data):
    global under_age, over_age
    if data[1] == '0-10':
        under_age += 1
    if data[1] == '80+':
        over_age += 1
    return data
data_set.map(count_outliers).collect()
print('Kids: {}, Seniors: {}'.format(under_age, over_age))
I found that I must call .collect() to make this code work; without it, the code won't update the two accumulators. But to my understanding, .collect() is used to bring the whole dataset into memory. Why is it necessary here? Is it related to lazy evaluation? Please advise.
Yes, it is due to lazy evaluation.
Spark doesn't calculate anything until you execute an action such as collect, and the accumulators are only updated as a side-effect of that calculation.
Transformations such as map define what work needs to be done, but it's only executed once an action is triggered to "pull" the data through the transformations.
This is described in the documentation:
Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action. Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like map().
It's also important to note that:
In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
so your accumulators will not necessarily give correct answers; they may overstate the totals.
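As an aside, if the only goal is to update the accumulators, a foreach sketch like this avoids pulling the whole dataset back to the driver; foreach is an action, so it still triggers the computation (this restructuring is a suggestion, not part of the answer above):

# foreach is an action: it runs count_outliers on the executors
# without shipping the whole dataset back to the driver.
data_set.foreach(count_outliers)
print('Kids: {}, Seniors: {}'.format(under_age, over_age))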

Bad performance with window function in streaming job

I use Spark 2.0.2, Kafka 0.10.1 and the spark-streaming-kafka-0-8 integration. I want to do the following:
I extract features out of NetFlow connections in a streaming job and then apply the records to a k-means model. Some of the features are simple ones calculated directly from the record, but I also have more complex features that depend on records from a specified time window beforehand: they count how many connections in the last second went to the same host or service as the current one. I decided to use the SQL window functions for this.
So I build window specifications:
val hostCountWindow = Window.partitionBy("plainrecord.ip_dst").orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
val serviceCountWindow = Window.partitionBy("service").orderBy(desc("timestamp")).rangeBetween(-1L, 0L)
And a function that is called to extract these features for every batch:
def extractTrafficFeatures(dataset: Dataset[Row]) = {
  dataset
    .withColumn("host_count", count(dataset("plainrecord.ip_dst")).over(hostCountWindow))
    .withColumn("srv_count", count(dataset("service")).over(serviceCountWindow))
}
And use this function as follows:
stream.map(...).map(...).foreachRDD { rdd =>
  val dataframe = rdd.toDF(featureHeaders: _*).transform(extractTrafficFeatures(_))
  ...
}
The problem is that the performance is very bad: a batch takes between 1 and 3 seconds for an average input rate of less than 100 records per second. I guess it comes from the partitioning, which produces a lot of shuffling?
I tried to use the RDD API and countByValueAndWindow(). This seems to be much faster, but the code looks way nicer and cleaner with the DataFrame API.
Is there a better way to calculate these features on the streaming data? Or am I doing something wrong here?
Relatively low performance is to be expected here. Your code has to shuffle and sort data twice, once for:
Window
  .partitionBy("plainrecord.ip_dst")
  .orderBy(desc("timestamp"))
  .rangeBetween(-1L, 0L)
and once for:
Window
  .partitionBy("service")
  .orderBy(desc("timestamp"))
  .rangeBetween(-1L, 0L)
This has a huge impact on the runtime, and if these are hard requirements you won't be able to do much better.
