Working with hdf files in Databricks cluster - apache-spark

I am trying to create a simple .hdf file in the Databricks environment. I can create the file on the driver, but when the same code is executed with rdd.map(), it throws the following exception.
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 287.0 failed 4 times, most recent failure: Lost task 3.3 in stage 287.0 (TID 1080) (10.67.238.26 executor 1): org.apache.spark.api.python.PythonException: 'RuntimeError: Can't decrement id ref count (file write failed: time = Tue Nov 1 11:38:44 2022
, filename = '/dbfs/mnt/demo.hdf', file descriptor = 7, errno = 95, error message = 'Operation not supported', buf = 0x30ab998, total write size = 40, bytes this sub-write = 40, bytes actually written = 18446744073709551615, offset = 0)', from <command-1134257524229717>, line 13. Full traceback below:
I can write the same file on the worker and copy it back to /dbfs/mnt. However, I am looking for a way to write/modify the .hdf files stored in the /dbfs/mnt location directly from the worker nodes.
def create_hdf_file_tmp(x):
    import numpy as np
    import h5py, os, subprocess
    import pandas as pd
    dummy_data = [1, 2, 3, 4, 5]
    df_data = pd.DataFrame(dummy_data, columns=['Numbers'])
    with h5py.File('/dbfs/mnt/demo.hdf', 'w') as f:
        dset = f.create_dataset('default', data=df_data)  # write to .hdf file
    return True

def read_hdf_file(file_name, dname):
    import numpy as np
    import h5py, os, subprocess
    import pandas as pd
    with h5py.File(file_name, 'r') as f:
        data = f[dname]
        print(data[:5])
    return data
#driver code
rdd = spark.sparkContext.parallelize(['/dbfs/mnt/demo.hdf'])
result = rdd.map(lambda x: create_hdf_file_tmp(x)).collect()
Above is the minimal code that I am trying to run in a Databricks notebook with 1 driver and 2 worker nodes.

Judging from the error message (Operation not supported): when writing the HDF file, the library most probably uses random writes, which aren't supported by DBFS (see the DBFS local file API limitations in the docs). You will need to write the file to a local disk and then move that file to the DBFS mount, as sketched below. But it will work only on the driver node...
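A minimal sketch of that workaround (the local /tmp path, the h5py usage and the shutil copy are illustrative assumptions, not from the original post): build the file on node-local disk, which supports random writes, then copy the finished file onto the DBFS mount in one sequential write.
import shutil
import h5py
import pandas as pd

def create_hdf_file_local(x):
    local_path = '/tmp/demo.hdf'                    # node-local disk, random writes OK
    df_data = pd.DataFrame([1, 2, 3, 4, 5], columns=['Numbers'])
    with h5py.File(local_path, 'w') as f:
        f.create_dataset('default', data=df_data)   # HDF5 file is written locally
    shutil.copy(local_path, '/dbfs/mnt/demo.hdf')   # sequential copy onto the mount
    return True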

Related

Getting java.lang.IllegalArgumentException after running Pandas UDF on a Spark DF

I'm new to PySpark and I'm trying to do inference from a saved model here. I've tried almost every way to print the returned output from the pandas UDF, but it always gives me an error. More often than not, it's java.lang.IllegalArgumentException. I don't understand what I'm doing wrong here, and it'd be great if someone could help me out by debugging the pandas UDF code, if that is where the problem lies.
Code:
saved_xgb.load_model("baseline_xgb.json")
sc = spark.sparkContext
broadcast_model = sc.broadcast(saved_xgb)
prediction_set = spark.sql("select *, floor(rand()*100) as prediction_group from test_dataset_view")
flattened_schema = StructType(prediction_set.schema.fields +
                              [StructField('pred_label', FloatType(), nullable=True),
                               StructField('score', FloatType(), nullable=True)])

@pandas_udf(flattened_schema, PandasUDFType.GROUPED_MAP)
def model_scoring(pdf):
    pdf = pd.DataFrame(pdf)
    pdf = pdf.replace('\\N', np.nan)
    y_test = pdf['label']
    X_test = pdf.drop(['label', 'prediction_group'], axis=1)
    y_pred = broadcast_model.value.predict(X_test)
    auc_test = roc_auc_score(y_test, broadcast_model.value.predict_proba(X_test)[:, 1])
    pdf['pred_label'] = y_pred
    pdf['score'] = auc_test
    return pdf

prediction_set.groupby(F.col('prediction_group')).apply(model_scoring).show()
Error Stack from stdout on Spark Submit:
XGBoost Version1.5.2
PySpark Version :2.3.2.3.1.0.0-78
PySpark Version :2.3.2.3.1.0.0-78
Traceback (most recent call last):
File "baseline_h2o.py", line 130, in <module>
prediction_set.groupby(F.col('prediction_group')).apply(model_scoring).show()
File ".../pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File ".../py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File ".../pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File ".../py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1241.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 603, executor 26): java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNextMessage(MessageChannelReader.java:64)
I have tried printing out the Spark DataFrame in multiple ways and displaying just the 'score' column alone, but all of them lead to errors. I even tried writing the DataFrame out directly, and that gave the same error. Can anyone guide me to a solution?
Edit: Solved
I had to set the environment variable inside the pandas_udf function for PyArrow to work properly on the executors.
import os
os.environ['ARROW_PRE_0_15_IPC_FORMAT']='1'
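For reference, a sketch of where that assignment has to live, reusing the model_scoring UDF from the question (everything except the two lines setting the variable is unchanged from above):
@pandas_udf(flattened_schema, PandasUDFType.GROUPED_MAP)
def model_scoring(pdf):
    # Set inside the UDF body so it takes effect in each executor's Python
    # worker process, not only on the driver.
    import os
    os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'
    # ... scoring logic exactly as in the question ...
    return pdf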

How to take a subset of parquet files to create a deltatable using deltalake_rs python library

I am using the deltalake 0.4.5 Python library to read .parquet files into a deltatable and then convert it into a pandas DataFrame, following the instructions here: https://pypi.org/project/deltalake/.
Here's the Python code to do this:
from deltalake import DeltaTable

table_path = "s3://example_bucket/data/poc"
dt = DeltaTable(table_path)
files = dt.files()  # OK, returns the list of parquet files with the full s3 path
# ['s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/part-00001-8765abc67.parquet',
#  's3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/part-00002-7643adc87.parquet',
#  ........]
total_file_count = len(files)  # OK, returns 115530
pt = dt.to_pyarrow_table()  # hangs
df = dt.to_pyarrow_table().to_pandas()  # hangs
I believe it hangs because the number of files is high (115K+). For my PoC I only want to read the files for a single day or hour, so I tried setting the table_path variable down to the hour partition, but that gives a Not a Delta table error, as shown below:
table_path = "s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16"
dt = DeltaTable(table_path)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python3.7/site-packages/deltalake/table.py", line 19, in __init__
self._table = RawDeltaTable(table_path, version=version)
deltalake.PyDeltaTableError: Not a Delta table
How can I achieve this?
If the deltalake Python library can't be used to achieve this, what other tools/libraries should I try?
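One possible workaround (a sketch, not from the original post; the partition prefix and the s3fs/pyarrow calls are assumptions that may need adjusting for your library versions): keep table_path at the table root, filter the paths returned by dt.files() down to the partition of interest, and read only those parquet files with pyarrow.
import s3fs
import pyarrow.parquet as pq
from deltalake import DeltaTable

dt = DeltaTable("s3://example_bucket/data/poc")   # table root, not a partition path

# Keep only one hour's worth of files (the partition prefix is illustrative).
subset = [f for f in dt.files() if "/y=2021/m=4/d=13/h=16/" in f]

fs = s3fs.S3FileSystem()
df = pq.ParquetDataset(subset, filesystem=fs).read().to_pandas()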

Creating an Apache Spark RDD of a Class in PySpark

I have to convert some Scala code to Python. The Scala code converts an RDD of strings to an RDD of a case class. The code is as follows:
case class Stock(
  stockName: String,
  dt: String,
  openPrice: Double,
  highPrice: Double,
  lowPrice: Double,
  closePrice: Double,
  adjClosePrice: Double,
  volume: Double
)

def parseStock(inputRecord: String, stockName: String): Stock = {
  val column = inputRecord.split(",")
  Stock(
    stockName,
    column(0),
    column(1).toDouble,
    column(2).toDouble,
    column(3).toDouble,
    column(4).toDouble,
    column(5).toDouble,
    column(6).toDouble)
}

def parseRDD(rdd: RDD[String], stockName: String): RDD[Stock] = {
  val header = rdd.first
  rdd.filter((data) => {
    data(0) != header(0) && !data.contains("null")
  })
    .map(data => parseStock(data, stockName))
}
Is it possible to implement this in PySpark? I tried the following code and it gave an error.
from dataclasses import dataclass

@dataclass(eq=True, frozen=True)
class Stock:
    stockName: str
    dt: str
    openPrice: float
    highPrice: float
    lowPrice: float
    closePrice: float
    adjClosePrice: float
    volume: float

def parseStock(inputRecord, stockName):
    column = inputRecord.split(",")
    return Stock(stockName,
                 column[0],
                 column[1],
                 column[2],
                 column[3],
                 column[4],
                 column[5],
                 column[6])

def parseRDD(rdd, stockName):
    header = rdd.first()
    res = rdd.filter(lambda data: data != header).map(lambda data: parseStock(data, stockName))
    return res
Error
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 31, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 364, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 173, in _read_with_length
return self.loads(obj)
File "/content/spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 587, in loads
return pickle.loads(obj, encoding=encoding)
AttributeError: Can't get attribute 'main' on <module 'builtins' (built-in)>
The Dataset API is not available for Python.
"A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have the support for the Dataset API. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally row.columnName). The case for R is similar."
https://spark.apache.org/docs/latest/sql-programming-guide.html
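Since typed Datasets are not available, a common PySpark equivalent (a sketch, not from the original answer; it assumes an active SparkSession) is to parse each line into a Row and work with a DataFrame, mirroring the Scala .toDouble calls with float():
from pyspark.sql import Row

def parse_stock(input_record, stock_name):
    c = input_record.split(",")
    return Row(stockName=stock_name, dt=c[0],
               openPrice=float(c[1]), highPrice=float(c[2]),
               lowPrice=float(c[3]), closePrice=float(c[4]),
               adjClosePrice=float(c[5]), volume=float(c[6]))

def parse_rdd(rdd, stock_name):
    header = rdd.first()
    parsed = (rdd.filter(lambda line: line != header and "null" not in line)
                 .map(lambda line: parse_stock(line, stock_name)))
    return parsed.toDF()   # DataFrame with named columns, queryable much like a Dataset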

Error 0085 while executing python script in Azure Web service but not in ML Experiment

My workflow runs perfectly in the experiment, but after deploying it to a web service, I receive this error on POST.
Python Code:
# -*- coding: utf-8 -*-
#import sys
import pickle
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

def azureml_main(dataframe1=None, dataframe2=None):
    print('input dataframe1 ', dataframe1)
    decision_tree_pkl_predictive_maint = r'.\Script Bundle\decision_tree_pkl_predictive_maint.pkl'
    #sys.path.insert(0,".\Script Bundle")
    #model = pickle.load(open(".\Script Bundle\decision_tree_pkl_predictive_maint.pkl", 'rb'))
    modle_file = open(decision_tree_pkl_predictive_maint, "rb")
    model = pickle.load(modle_file)
    # return the model's prediction
    result = model.predict(dataframe1)
    print(result)
    result_df = pd.DataFrame({'prediction_class': result})
    return result_df,
ERROR:
Execute Python Script RRS : Error 0085: The following error occurred during script evaluation, please view the output log for more information:
---------- Start of error message from Python interpreter ----------
Caught exception while executing function: Traceback (most recent call last):
  File "\server\InvokePy.py", line 120, in executeScript
    outframe = mod.azureml_main(*inframes)
  File "\temp-1036260731852293620.py", line 46, in azureml_main
    modle_file = open(decision_tree_pkl_predictive_maint,"rb")
FileNotFoundError: [Errno 2] No such file or directory: '.\Script Bundle\decision_tree_pkl_predictive_maint.pkl'
---------- End of error message from Python interpreter ----------
Please advise.
The issue has to do with your file path. Ensure that you have included the correct path.
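A hedged sketch of a more defensive lookup (the helper function and the explicit existence check are my additions, not from the original answer): build the path with os.path.join so backslash escapes in the literal can't bite, and fail with the absolute path if the bundle file is missing, which makes the web-service error easier to diagnose.
import os
import pickle

def load_bundled_model(filename="decision_tree_pkl_predictive_maint.pkl"):
    # "Script Bundle" is the folder the zipped assets are attached under.
    model_path = os.path.join(".", "Script Bundle", filename)
    if not os.path.exists(model_path):
        raise FileNotFoundError("Expected model at %s" % os.path.abspath(model_path))
    with open(model_path, "rb") as fh:
        return pickle.load(fh)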

read CSV and load into gcp bucket using google cloud composer issue

I started having a strange issue when I try to read a CSV from a GCP bucket and write back to the same bucket.
Please note that the code below used to work for me before, but now an exception is thrown in the Airflow logs:
{models.py:1796} ERROR - Error executing an HTTP request: libcurl code 23 meaning 'Failed writing received data to disk/application', error details: Received 134221820 response bytes for a 134217728-byte buffer
when reading gs://file_bucket/abc.csv
Traceback (most recent call last):
  File "/usr/local/lib/airflow/airflow/models.py", line 1664, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/airflow/airflow/operators/python_operator.py", line 103, in execute
    return_value = self.execute_callable()
  File "/usr/local/lib/airflow/airflow/operators/python_operator.py", line 108, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/home/airflow/gcs/dags/handle_split_rows.py", line 56, in handle_split_rows
    lines = file_stream.read()
  File "/opt/python3.6/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 132, in read
    pywrap_tensorflow.ReadFromStream(self._read_buf, length, status)
  File "/opt/python3.6/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Error executing an HTTP request: libcurl code 23 meaning 'Failed writing received data to disk/application', error details: Received 134221820 response bytes for a 134217728-byte buffer
when reading gs://file_bucket/abc.csv
code:
#!/usr/bin/env python
import os
import json
from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import Variable
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.utils.trigger_rule import TriggerRule
from airflow.operators import python_operator
from airflow.contrib.hooks import gcs_hook
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_table_delete_operator import BigQueryTableDeleteOperator
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator
from airflow.contrib.operators.bigquery_operator import BigQueryCreateEmptyTableOperator
from airflow.contrib.operators.gcs_list_operator import GoogleCloudStorageListOperator
from airflow.operators.sensors import ExternalTaskSensor
from airflow.operators import PythonOperator, BranchPythonOperator
from airflow.operators import BashOperator
from lib import notification_utility

default_args = {
    'owner': os.environ["OWNER"],
    'depends_on_past': False,
    'start_date': '2019-10-10 09:31:00'
}

with DAG('parse_bad_rows',
         default_args=default_args,
         catchup=False,
         schedule_interval=None
         ) as dag:

    def parse_rows(**context):
        import pandas as pd
        import numpy as np
        import csv
        import os
        import gcsfs
        from tensorflow.python.lib.io import file_io
        from pandas.compat import StringIO
        import io
        # tf.disable_v2_behavior() also tried disabling v1 just in case but i dont think it makes any sense
        #updated_file_list = context['ti'].xcom_pull(task_ids='list_files_delta_bucket_test')
        fs = gcsfs.GCSFileSystem(project='project_name')
        updated_file_list = fs.ls('/bucket_name/foldername/')
        updated_file_list = [x for x in updated_file_list if "abc" in x]
        print("updated_file_list------------------>", updated_file_list)
        for f in updated_file_list:
            print("File Being processed------->", f)
            file_name = os.path.splitext(f)[0]
            # this is where the job is failing while reading the file so I am assuming it has to do something with tensorflow.python.lib.io import file_io
            file_stream = file_io.FileIO("gs://" + f, mode='r')
            lines = file_stream.read()
            file_stream_less_cols = io.StringIO(lines)
            Split_Rows = [x for x in file_stream_less_cols if x.count('|') < 297]
            Split_Rows = ' '.join(map(str, Split_Rows))
            file_stream.close()
            Split_Rows_Stream = pd.DataFrame(io.StringIO(Split_Rows), columns=['BLOB_COLUMN'], dtype='str')
            #Split_Rows_Stream['File_Name'] = Split_Rows_Stream.index
            parse_names = file_name.split('/')
            filename = parse_names[2]
            bucketname = parse_names[0]
            Split_Rows_Stream['FILE_NAME'] = filename
            print("bucketname------------>", bucketname)
            print("filename------------->", filename)
            Split_Rows_Stream.to_csv("gs://" + bucketname + "/ERROR_FILES/" + filename + ".csv",
                                     encoding='utf-8', quoting=csv.QUOTE_NONE, escapechar='|')

    Python_Task_Split_Rows = PythonOperator(
        task_id='split_rows_to_process_test',
        provide_context=True,
        python_callable=parse_rows,
        #op_kwargs={'project':'project_name','bucket':'bucket_name','table_name':'abc','delim_num':297},
        #trigger_rule=TriggerRule.ALL_SUCCESS,
        dag=dag
    )

    # Orchestration
    Python_Task_Split_Rows
I also tried the same thing locally, to make sure the CSV itself is not the issue.
import pandas as pd
import numpy as np
import csv
import io
import os

# Read the file
directory = 'c:\\Users\BG/Downloads/file_Cleansing'
for filename in os.listdir(directory):
    file_name = filename.split('.')[0]
    f = open('c:\\Users\BG/Downloads/file_Cleansing/' + filename, 'r', encoding="utf8")
    # Read lines from the text file
    lines = f.read()
    # cleanse the lines
    file_stream = io.StringIO(lines)
    Split_Rows = [x for x in file_stream if x.count('|') < 297]
    Split_Rows = ' '.join(map(str, Split_Rows))
    f.close()
    Split_Rows_Stream = pd.DataFrame(io.StringIO(Split_Rows), columns=['blob'])
    Split_Rows_Stream["File_Name"] = file_name
    Split_Rows_Stream.to_csv("c:\\Users\BG/Downloads/file_Cleansed/" + filename + "_error.csv",
                             escapechar='|', encoding='utf-8')
The above worked as expected.
My goal is to find records that don't have the expected number of delimiters per row (my delimiter is a pipe, and 297 pipes are expected per row since the CSV has 298 columns, but some rows contain pipes inside the data), capture those records, load them into a CSV and then into a BigQuery table, and concatenate the rows back together (using SQL LEAD or LAG, with the filename and index number for ordering and grouping) to repair and recover as many records as possible.
Also, my service account has recently changed; could this be a permission issue on GCP?
Any advice is appreciated. Thank you for your time.
It seems to be an issue related to permissions. Verify that your service account is listed in the bucket permissions and that it has a role allowing it to read and/or write.
I replicated your scenario with your code to read the file and it works correctly:
from tensorflow.python.lib.io import file_io
import gcsfs
import os, io

fs = gcsfs.GCSFileSystem(project='project_name')
updated_file_list = fs.ls('bucket')
updated_file_list = [x for x in updated_file_list if "filename" in x]
print("updated_file_list------------------>", updated_file_list)
for f in updated_file_list:
    print("File Being processed------->", f)
    file_name = os.path.splitext(f)[0]
    # this is where the job is failing while reading the file so I am assuming it has to do something with tensorflow.python.lib.io import file_io
    file_stream = file_io.FileIO("gs://" + f, mode='r')
    lines = file_stream.read()
    print(lines)
OUTPUT:
('updated_file_list------------------>', [u'bucket/file'])
('File Being processed------->', u'bucket/file')
this is a text from a file
Yes, this could be a permission issue on GCP. Could you also check the GCS logs? You may find more information about the problem there.
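If you want to confirm the binding programmatically, here is a sketch using the google-cloud-storage client (the project and bucket names are placeholders, and the check itself is my addition, not from the original answer):
from google.cloud import storage

client = storage.Client(project='project_name')
bucket = client.bucket('bucket_name')

# Print every role granted on the bucket and check that the Composer
# service account appears with a role that allows reading and writing objects.
policy = bucket.get_iam_policy(requested_policy_version=3)
for binding in policy.bindings:
    print(binding["role"], sorted(binding["members"]))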
