update:
I think I finally understood the issue: since Databricks initializes the Spark session itself, the RayDP Spark session is never really set up.
So, is there a way to make the RayDP context available in the existing Spark session?
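For reference, outside Databricks the RayDP Spark session is normally created by raydp.init_spark, which also makes the RayDP JVM classes (like the ObjectStoreWriter in the traceback) available to Spark. A minimal sketch, reusing the commented-out init_spark call visible in the traceback below (the values are placeholders):
import ray
import raydp

ray.init()

# Placeholder values mirroring the commented-out call in the question's code.
app_name = "raydp_example"
num_executors = 1
cores_per_executor = 2
memory_per_executor = "500M"

# init_spark starts a Spark session whose JVM has the RayDP classes available,
# which ray.data.from_spark relies on.
spark = raydp.init_spark(app_name, num_executors, cores_per_executor, memory_per_executor)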
pre-update:
I am currently running Spark on Databricks and have set up Ray on it (head node only).
It seems to work; however, when I try to transfer data from Spark to a Ray Dataset, I run into an issue:
TypeError Traceback (most recent call last)
<command-2445755691838> in <module>
5 memory_per_executor = "500M"
6 # spark = raydp.init_spark(app_name, num_executors, cores_per_executor, memory_per_executor)
----> 7 dataset = ray.data.from_spark(df)
/databricks/python/lib/python3.7/site-packages/ray/data/read_api.py in from_spark(df, parallelism)
1046 import raydp
1047
-> 1048 return raydp.spark.spark_dataframe_to_ray_dataset(df, parallelism)
1049
1050
/databricks/python/lib/python3.7/site-packages/raydp/spark/dataset.py in spark_dataframe_to_ray_dataset(df, parallelism, _use_owner)
176 if parallelism != num_part:
177 df = df.repartition(parallelism)
--> 178 blocks, _ = _save_spark_df_to_object_store(df, False, _use_owner)
179 return from_arrow_refs(blocks)
180
/databricks/python/lib/python3.7/site-packages/raydp/spark/dataset.py in _save_spark_df_to_object_store(df, use_batch, _use_owner)
150 jvm = df.sql_ctx.sparkSession.sparkContext._jvm
151 jdf = df._jdf
--> 152 object_store_writer = jvm.org.apache.spark.sql.raydp.ObjectStoreWriter(jdf)
153 obj_holder_name = df.sql_ctx.sparkSession.sparkContext.appName + RAYDP_OBJ_HOLDER_SUFFIX
154 if _use_owner is True:
TypeError: 'JavaPackage' object is not callable
with the following vanilla code:
# loading the data
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
df = spark.read.format("csv") \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load("dbfs:/databricks-datasets/nyctaxi/tripdata/green")
import ray
import sys
# work around Databricks' sys.stdout replacement, which lacks fileno
sys.stdout.fileno = lambda: False
# connect to ray cluster on a single instance
ray.init()
ray.cluster_resources()
import raydp
dataset = ray.data.from_spark(df)
What am I doing wrong?
pyspark 3.0.1
ray 2.0.0
I was trying to handle real-time Kafka data streaming using PySpark.
I have a table that gets updated in real time. Whenever there is content in the table, I need to aggregate it and stream the count to another consumer. When I tried to do this, I got the error shown below.
I went through a lot of references but couldn't find a solution.
Could anyone please help me resolve it?
My main reference was Handling real-time Kafka data streams using PySpark.
def func_count(df):
    dic_new = {}
    rows_count = df.count()
    if rows_count != 0:
        df_count = df.filter(df.c == 1)\
            .groupBy('a', 'b').count('c').alias('count')
        print("row count:", rows_count)
        dic_new[df['a']] = df_count.to_dict(orient='records')
    return df_count, rows_count

selected_col = df_predict.select('a', 'b', 'c')
result, rows_count = func_count(selected_col)

result_1 = result.selectExpr(
        "CAST(a AS STRING)",
        "CAST(b AS STRING)",
        "CAST(count AS STRING)",
    )\
    .withColumn("value", to_json(struct("*")).cast("string"))

result2_1 = result_1 \
    .select("value") \
    .writeStream \
    .trigger(processingTime="5 seconds") \
    .outputMode("complete") \
    .format("kafka") \
    .option("topic", "send_data") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .start() \
    .awaitTermination()
ERROR:
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<ipython-input-96-c68859698611> in <module>
15 selected_col = df_predict.select('a', 'b','c')
16
---> 17 result, rows_count = func_count(selected_col)
18
19
<ipython-input-96-c68859698611> in func_count(df)
4 def func_count(df):
5 dic_new = {}
----> 6 rows_count = df.count()
7 if rows_count != 0:
8 df_count = df.filter(df.c == 1)\\
~/.local/lib/python3.8/site-packages/pyspark/sql/dataframe.py in count(self)
583 2
584 """
--> 585 return int(self._jdf.count())
586
587 #ignore_unicode_prefix
~/.local/lib/python3.8/site-packages/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
~/.local/lib/python3.8/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
135 # Hide where the exception came from that shows a non-Pythonic
136 # JVM exception message.
--> 137 raise_from(converted)
138 else:
139 raise
~/.local/lib/python3.8/site-packages/pyspark/sql/utils.py in raise_from(e)
AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
So, this dataframe (per the error message) is from a streaming source:
selected_col = df_predict.select('a', 'b', 'c')
When you call:
result, rows_count = func_count(selected_col)
You in turn call:
df.count() on the resulting dataframe.
Per:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations
count() - Cannot return a single count from a streaming Dataset.
Instead, use ds.groupBy().count() which returns a streaming Dataset
containing a running count.
It's also a little unclear what you're really trying to accomplish, since you're doing a lot of stuff to create a dic_new and get a rows_count that you don't appear to use.
It seems like instead of:
result, rows_count = func_count(selected_col)
you could just call:
df_count = selected_col.filter(selected_col.c == 1).groupBy('a', 'b').count()
It's a bit unclear without context why you're checking if num rows is zero in the first place.
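For what it's worth, here is a minimal sketch of how the streaming aggregation could be wired directly into the Kafka sink (column names, topic and bootstrap servers are taken from the question; the checkpoint path is an assumption, and this is untested against your setup):
from pyspark.sql.functions import to_json, struct

# Running count per (a, b) for rows where c == 1; this stays a streaming
# DataFrame, so no .count() action is ever triggered on the driver.
df_count = (selected_col
            .filter(selected_col.c == 1)
            .groupBy('a', 'b')
            .count())

query = (df_count
         .selectExpr("CAST(a AS STRING)", "CAST(b AS STRING)", "CAST(count AS STRING)")
         .withColumn("value", to_json(struct("*")).cast("string"))
         .select("value")
         .writeStream
         .trigger(processingTime="5 seconds")
         .outputMode("complete")  # required for a streaming aggregation like this
         .format("kafka")
         .option("topic", "send_data")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("checkpointLocation", "/tmp/send_data_checkpoint")  # assumed path
         .start())

query.awaitTermination()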
I want to download some XML files (50 MB each, about 3000 files = ~150 GB), process them, and upload the results to BigQuery using PySpark. For development purposes I was using a Jupyter notebook and a small number of files (10). I wrote fairly complex code and set up a cluster on Dataproc. My Dataproc cluster has 6 TB of HDFS, 10 nodes (4 cores each) and 120 GB of RAM.
def context():
    import os
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
    import pyspark
    conf = pyspark.SparkConf()
    conf = (conf.setMaster('local[*]')
            .set('spark.executor.memory', '4G')
            .set('spark.driver.memory', '45G')
            .set('spark.driver.maxResultSize', '10G')
            .set("spark.python.profile", "true"))
    sc = pyspark.SparkContext(conf=conf)
    return sc

def job(sc):
    print("Job started")
    RDDread = sc.wholeTextFiles("s3a://custom-bucket/*/*.gz")
    models = RDDread.flatMap(process_xmls).groupByKey()
    tracking_all = (models.filter(lambda x: x[0] == TrackInformation)
                    .flatMap(lambda x: x[1])
                    .map(lambda model: (model.flight_ref, model))
                    .groupByKey())
    tracking_merged = tracking_all.map(lambda x: x[1]).map(merge_ti)
    flight_plans = (models.filter(lambda x: x[0] == FlightPlan).flatMap(lambda x: x[1]).map(lambda fp: (fp.flight_ref, fp)))
    fps_tracking = tracking_merged.union(flight_plans).groupByKey().filter(lambda x: len(x[1]) == 2)
    in_bq_batch = 1000
    n = fps_tracking.count()
    parts = ceil(n / in_bq_batch)
    many_n = fps_tracking.repartition(parts).mapPartitions(upload_fpm2)
    print("Job ended")
    return fps_tracking, tracking_merged, flight_plans, models, many_n
After about 200 org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.gz] messages I get two errors: java.lang.OutOfMemoryError and MemoryError, mostly MemoryError. I thought I had only 2 partitions after RDDread, so I changed the call to sc.wholeTextFiles("s3a://custom-bucket/*/*.gz", minPartitions=40), and it broke even faster. I also tried adding persist(DISK_ONLY) in some random places.
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 684, in loads
return s.decode("utf-8") if self.use_unicode else s
MemoryError
19/05/20 14:09:23 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.gz]
19/05/20 14:09:30 ERROR org.apache.spark.util.Utils: Uncaught exception in thread stdout writer for /opt/conda/default/bin/python
java.lang.OutOfMemoryError: Java heap space
What am I doing wrong, and how can I debug my code?
You seem to be running Spark in local mode (local[*]). This means that you are using a single JVM with 45 GB of RAM (spark.driver.memory) and that all your worker threads run within that JVM. The spark.executor.memory option has no effect (see What does setMaster `local[*]` mean in Spark?).
You should set your Spark master to the YARN scheduler, or, if you have no YARN, use standalone mode: https://spark.apache.org/docs/latest/spark-standalone.html
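As an illustration, here is a minimal sketch of the context() function from the question rewritten for YARN (the memory and instance values are placeholders, not tuned recommendations):
import pyspark

def context():
    # Run on the YARN cluster instead of a single local JVM, so that
    # spark.executor.* settings actually apply to the worker nodes.
    conf = (pyspark.SparkConf()
            .setMaster('yarn')
            .set('spark.executor.memory', '4G')        # per executor, placeholder
            .set('spark.executor.instances', '20')     # placeholder
            .set('spark.driver.memory', '8G')          # placeholder
            .set('spark.driver.maxResultSize', '4G'))
    return pyspark.SparkContext(conf=conf)
On Dataproc the cluster already ships a YARN-aware Spark configuration, so it may be simpler to drop the custom SparkConf entirely and let the cluster defaults (or job properties) drive executor sizing.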
I'm trying to return a specific structure from a pandas_udf. It worked on one cluster but fails on another.
I'm trying to run a UDF on groups, which requires the return type to be a DataFrame.
from pyspark.sql.functions import pandas_udf
import pandas as pd
import numpy as np
from pyspark.sql.types import *

schema = StructType([
    StructField("Distance", FloatType()),
    StructField("CarId", IntegerType())
])

def haversine(lon1, lat1, lon2, lat2):
    # Calculate distance, return scalar
    return 3.5  # Removed logic to facilitate reading

@pandas_udf(schema)
def totalDistance(oneCar):
    dist = haversine(oneCar.Longtitude.shift(1),
                     oneCar.Latitude.shift(1),
                     oneCar.loc[1:, 'Longitude'],
                     oneCar.loc[1:, 'Latitude'])
    return pd.DataFrame({"CarId": oneCar['CarId'].iloc[0], "Distance": np.sum(dist)}, index=[0])

## Calculate the overall distance made by each car
distancePerCar = df.groupBy('CarId').apply(totalDistance)
This is the exception I'm getting:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in returnType(self)
114 try:
--> 115 to_arrow_type(self._returnType_placeholder)
116 except TypeError:
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\types.py in to_arrow_type(dt)
1641 else:
-> 1642 raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
1643 return arrow_type
TypeError: Unsupported type in conversion to Arrow: StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true)))
During handling of the above exception, another exception occurred:
NotImplementedError Traceback (most recent call last)
<ipython-input-35-4f2194cfb998> in <module>()
18 km = 6367 * c
19 return km
---> 20 #pandas_udf("CarId: int, Distance: float")
21 def totalDistance(oneUser):
22 dist = haversine(oneUser.Longtitude.shift(1), oneUser.Latitude.shift(1),
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in _create_udf(f, returnType, evalType)
62 udf_obj = UserDefinedFunction(
63 f, returnType=returnType, name=None, evalType=evalType, deterministic=True)
---> 64 return udf_obj._wrapped()
65
66
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in _wrapped(self)
184
185 wrapper.func = self.func
--> 186 wrapper.returnType = self.returnType
187 wrapper.evalType = self.evalType
188 wrapper.deterministic = self.deterministic
C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in returnType(self)
117 raise NotImplementedError(
118 "Invalid returnType with scalar Pandas UDFs: %s is "
--> 119 "not supported" % str(self._returnType_placeholder))
120 elif self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
121 if isinstance(self._returnType_placeholder, StructType):
NotImplementedError: Invalid returnType with scalar Pandas UDFs: StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true))) is not supported
I've also tried changing the schema to
@pandas_udf("<CarId:int,Distance:float>")
and
@pandas_udf("CarId:int,Distance:float")
but get the same exception. I suspect it has to do with my pyarrow version, which isn't compatible with my pyspark version.
Any help would be appreciated. Thanks!
As reported in the error message ("Invalid returnType with scalar Pandas UDFs"), you are trying to create a SCALAR vectorized pandas UDF, but using a StructType schema and returning a pandas DataFrame.
You should rather declare your function as a GROUPED MAP pandas UDF, i.e.:
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
The difference between scalar and grouped vectorized UDFs is explained in the PySpark docs: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf.
A scalar UDF defines a transformation: One or more pandas.Series -> A pandas.Series. The returnType should be a primitive data type, e.g., DoubleType(). The length of the returned pandas.Series must be the same as that of the input pandas.Series.
To summarize, a scalar pandas UDF processes a column at a time (a pandas Series), leading to better performance than traditional UDFs that process one row element at a time. Note that the performance improvement is due to efficient python serialization using PyArrow.
A grouped map UDF defines a transformation: A pandas.DataFrame -> A pandas.DataFrame. The returnType should be a StructType describing the schema of the returned pandas.DataFrame. The length of the returned pandas.DataFrame can be arbitrary, and the columns must be indexed so that their position matches the corresponding field in the schema.
A grouped map pandas UDF processes multiple rows and columns at a time (using a pandas DataFrame, not to be confused with a Spark DataFrame), and is extremely useful and efficient for multivariate operations (especially when using local Python numerical analysis and machine learning libraries like numpy, scipy, scikit-learn etc.). In this case, the output is a single-row DataFrame with several columns.
Note that I did not check the internal logic of the code, only the methodology.
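Putting it together, a minimal sketch of the suggested fix, reusing the schema and function from the question (the haversine logic stays omitted, as in the original, and the returned columns follow the schema order since grouped map UDFs match columns by position):
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, FloatType, IntegerType
import pandas as pd

schema = StructType([
    StructField("Distance", FloatType()),
    StructField("CarId", IntegerType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def totalDistance(oneCar):
    # One pandas.DataFrame per CarId group comes in; one pandas.DataFrame goes out.
    # The Distance value is a placeholder for the omitted haversine computation.
    return pd.DataFrame([[0.0, int(oneCar['CarId'].iloc[0])]],
                        columns=["Distance", "CarId"])

distancePerCar = df.groupBy('CarId').apply(totalDistance)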
I'm doing image classification using sparkdl on Databricks Community Edition.
I added all the libraries.
I have created a DataFrame from the image data.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from sparkdl import DeepImageFeaturizer
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label")
p = Pipeline(stages=[featurizer, lr])
p_model = p.fit(train_df)
AttributeError Traceback (most recent call last)
<command-2468766328144961> in <module>()
7 p = Pipeline(stages=[featurizer, lr])
8
----> 9 p_model = p.fit(train_df)
/databricks/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
62 return self.copy(params)._fit(dataset)
63 else:
---> 64 return self._fit(dataset)
65 else:
66 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
/databricks/spark/python/pyspark/ml/pipeline.py in _fit(self, dataset)
104 if isinstance(stage, Transformer):
105 transformers.append(stage)
--> 106 dataset = stage.transform(dataset)
107 else: # must be an Estimator
108 model = stage.fit(dataset)
From the title of your question, it sounds like you're hitting an AttributeError: 'ConsoleBuffer' object has no attribute 'isatty' error in a Databricks Python notebook.
If you are using Databricks Runtime 3.3 or later, this bug should be fixed.
In earlier Databricks Runtime releases, you should be able to work around this problem by monkeypatching sys.stdout, running the following code snippet at the beginning of your Python notebook:
import sys
sys.stdout.isatty = lambda: False
sys.stdout.encoding = sys.getdefaultencoding()
Databricks' Python REPL overrides sys.stdout to use our own ConsoleBuffer class and prior to Databricks Runtime 3.3 this class did not implement the isatty and encoding methods.
Source: I'm a Databricks employee who worked on this bugfix.
I work with a Spark cluster on Dataproc and my job fails at the end of processing.
My data source is text log files in CSV format on Google Cloud Storage (total volume 3.5 TB, 5000 files).
The processing logic is as follows:
read files into a DataFrame (schema ["timestamp", "message"]);
group all messages into windows of 1 second;
apply a pipeline [Tokenizer -> HashingTF] to every grouped message to extract words and their frequencies and build feature vectors;
save the feature vectors with their time windows to GCS.
The issue I'm having is that on a small subset of the data (around 10 files) the processing works well, but when I run it on all the files it fails at the very end with an error like "Container killed by YARN for exceeding memory limits. 25.0 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead."
My cluster has 25 workers with n1-highmem-8 machines. I googled this error and increased the "spark.yarn.executor.memoryOverhead" parameter to 6500 MB.
Now my Spark job still fails, but with the error "Job aborted due to stage failure: Total size of serialized results of 4293 tasks (1920.0 MB) is bigger than spark.driver.maxResultSize (1920.0 MB)".
I'm new to Spark and I believe I'm doing something wrong, either at the configuration level or in my code. If you can help me clean this up, that would be great!
Here is my code for the Spark task:
import logging
import string
from datetime import datetime
import pyspark
import re
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml import Pipeline
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType, TimestampType, ArrayType
from pyspark.sql import functions as F
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Constants
NOW = datetime.now().strftime("%Y%m%d%H%M%S")
START_DATE = '2016-01-01'
END_DATE = '2016-03-01'
sc = pyspark.SparkContext()
spark = SparkSession\
.builder\
.appName("LogsVectorizer")\
.getOrCreate()
spark.conf.set('spark.sql.shuffle.partitions', 10000)
logger.info("Start log processing at {}...".format(NOW))
# Filenames to read/write locations
logs_fn = 'gs://databucket/csv/*'
vectors_fn = 'gs://databucket/vectors_out_{}'.format(NOW)
pipeline_fn = 'gs://databucket/pipeline_vectors_out_{}'.format(NOW)
model_fn = 'gs://databucket/model_vectors_out_{}'.format(NOW)
# CSV data schema to build DataFrame
schema = StructType([
StructField("timestamp", StringType()),
StructField("message", StringType())])
# Helpers to clean strings in log fields
def cleaning_string(s):
    try:
        # Remove ids (like: app[2352] -> app)
        s = re.sub('\[.*\]', 'IDTAG', s)
        if s == '':
            s = 'EMPTY'
    except Exception as e:
        print("Skip string with exception {}".format(e))
    return s

def normalize_string(s):
    try:
        # Remove punctuation
        s = re.sub('[{}]'.format(re.escape(string.punctuation)), ' ', s)
        # Remove digits
        s = re.sub('\d*', '', s)
        # Remove extra spaces
        s = ' '.join(s.split())
    except Exception as e:
        print("Skip string with exception {}".format(e))
    return s

def line_splitter(line):
    line = line.split(',')
    timestamp = line[0]
    full_message = ' '.join(line[1:])
    full_message = normalize_string(cleaning_string(full_message))
    return [timestamp, full_message]
# Read line from csv, split to date|message
# Read CSV to DataFrame and clean its fields
logger.info("Read CSV to DF...")
logs_csv = sc.textFile(logs_fn)
logs_csv = logs_csv.map(lambda line: line_splitter(line)).toDF(schema)
# Keep only lines for our date interval
logger.info("Filter by dates...")
logs_csv = logs_csv.filter((logs_csv.timestamp>START_DATE) & (logs_csv.timestamp<END_DATE))
logs_csv = logs_csv.withColumn("timestamp", logs_csv.timestamp.cast("timestamp"))
# Helpers to join messages into window and convert sparse to dense
join_ = F.udf(lambda x: "| ".join(x), StringType())
asDense = F.udf(lambda v: v.toArray().tolist())
# Agg by time window
logger.info("Group log messages by time window...")
logs_csv = logs_csv.groupBy(F.window("timestamp", "1 second"))\
.agg(join_(F.collect_list("message")).alias("messages"))
# Turn message to hashTF
tokenizer = Tokenizer(inputCol="messages", outputCol="message_tokens")
hashingTF = HashingTF(inputCol="message_tokens", outputCol="tokens_counts", numFeatures=1000)
pipeline_tf = Pipeline(stages=[tokenizer, hashingTF])
logger.info("Fit-Transform ML Pipeline...")
model_tf = pipeline_tf.fit(logs_csv)
logs_csv = model_tf.transform(logs_csv)
logger.info("Spase vectors to Dense list...")
logs_csv = logs_csv.sort("window.start").select(["window.start", "tokens_counts"])\
.withColumn("tokens_counts", asDense(logs_csv.tokens_counts))
# Save to disk
# Save Pipeline and Model
logger.info("Save models...")
pipeline_tf.save(pipeline_fn)
model_tf.save(model_fn)
# Save to GCS
logger.info("Save results to GCS...")
logs_csv.write.parquet(vectors_fn)
spark.driver.maxResultSize is an issue with the size of your driver, which in Dataproc runs on the master node.
By default, 1/4 of the master's memory is given to the driver, and 1/2 of that is set as spark.driver.maxResultSize (the largest result set Spark will let you .collect()).
I'm guessing Tokenizer or HashingTF are moving "metadata" through the driver that is the size of your keyspace. To increase the allowable size you can increase spark.driver.maxResultSize, but you might also want to increase spark.driver.memory and/or use a larger master as well. Spark's configuration guide has more information.
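For illustration, a sketch of raising those limits when the session is created (the values are placeholders, not tuned recommendations; note that on Dataproc/YARN spark.driver.memory generally has to be supplied at submit time, e.g. as a job property, since the driver JVM is already running by the time your Python code executes, and spark.driver.maxResultSize must likewise be set before the SparkContext exists):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("LogsVectorizer")
         .config("spark.driver.maxResultSize", "4g")  # placeholder; must exceed the 1920 MB in the error
         .config("spark.driver.memory", "12g")        # placeholder; pair with a larger master machine
         .getOrCreate())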