Continuation of Managing huge zip files in Databricks
Databricks hangs after 30 files. What to do?
I have split a huge 32 GB zip into 100 stand-alone pieces. I've split the header off from the file and can thus process each piece like any CSV file. I need to filter the data based on columns. The files are in Azure Data Lake Storage Gen1 and must be stored there.
Trying to read a single file (or all 100 files) at once fails after working for ~30 minutes (see the linked question above).
What I've done:
import time
import pandas as pd

def lookup_csv(CR_nro, hlo_lista=[], output=my_output_dir):
    base_lib = 'adl://azuredatalakestore.net/<address>'
    all_files = pd.DataFrame(dbutils.fs.ls(base_lib + f'CR{CR_nro}'), columns=['full', 'name', 'size'])
    done = pd.DataFrame(dbutils.fs.ls(output), columns=['full', 'name', 'size'])
    all_files = all_files[~all_files['name'].isin(done['name'].str.replace('/', ''))]
    all_files = all_files[~all_files['name'].str.contains('header')]

    my_scema = spark.read.csv(base_lib + f'CR{CR_nro}/header.csv', sep='\t', header=True, maxColumns=1000000).schema
    tmp_lst = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT'] + [i for i in hlo_lista if i in my_scema.fieldNames()]

    for my_file in all_files.iterrows():
        print(my_file[1]['name'], time.ctime(time.time()))
        data = spark.read.option('comment', '#').option('maxColumns', 1000000).schema(my_scema).csv(my_file[1]['full'], sep='\t').select(tmp_lst)
        data.write.csv(output + my_file[1]['name'], header=True, sep='\t')
This works... kinda. It works through ~30 files and then hangs with
Py4JJavaError: An error occurred while calling o70690.csv.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 154.0 failed 4 times, most recent failure: Lost task 0.3 in stage 154.0 (TID 1435, 10.11.64.46, executor 7): com.microsoft.azure.datalake.store.ADLException: Error creating file <my_output_dir>CR03_pt29.vcf.gz/_started_1438828951154916601
Operation CREATE failed with HTTP401 : null
Last encountered exception thrown after 2 tries. [HTTP401(null),HTTP401(null)]
I tried adding some cleanup and sleeps after each file:
data.unpersist()
data = []
time.sleep(5)
I also tried wrapping the call in try/except with retries:
for j in range(1, 24):
    for i in range(4):
        try:
            lookup_csv(j, hlo_lista=FN_list, output=blake + f'<my_output>/CR{j}/')
        except Exception as e:
            print(i, j, e)
            time.sleep(60)
No luck with these. Once it fails, it keeps failing.
Any idea how to handle this issue? I'm thinking that the connection to the ADL drive fails after a while, but if I queue the commands in separate cells:
lookup_csv(<inputs>)
<next cell>
lookup_csv(<inputs>)
it works, fails, and then the next cell works just fine. I can live with this, but it is highly annoying that a basic loop fails to work in this environment.
The best solution would be to permanently mount the ADLS storage and use an Azure app registration for that.
In Azure, go to App registrations and register an app, named for example "databricks_mount". Then add the IAM role "Storage Blob Data Contributor" for that app on your data lake storage account.
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<your-client-id>",
"fs.azure.account.oauth2.client.secret": "<your-secret>",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<your-endpoint>/oauth2/token"}
dbutils.fs.mount(
source = "abfss://delta#yourdatalake.dfs.core.windows.net/",
mount_point = "/mnt/delta",
extra_configs = configs)
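Once mounted, the files behave like ordinary DBFS paths, so the per-file loop from the question could read and write through the mount point instead of raw storage URLs. A minimal sketch, assuming the data sits under placeholder sub-folders raw/ and filtered/ on the mount (those folder names are not from the original post):

# Sketch only: 'raw/CR03' and 'filtered/CR03' are placeholder folders on the mount.
df = spark.read.option('comment', '#').option('maxColumns', 1000000).csv('/mnt/delta/raw/CR03/CR03_pt01.csv', sep='\t', header=True)
df.write.csv('/mnt/delta/filtered/CR03/CR03_pt01.csv', header=True, sep='\t')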
You can also access the storage without mounting, but you still need to register an app and apply the config via Spark settings in your notebook to get access to ADLS. The settings stay in effect for the whole session thanks to the Azure app:
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"),
spark.conf.set("fs.azure.account.oauth2.client.id", "<your-client-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<your-secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<your-endpoint>/oauth2/token")
This explanation is the best one: https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access.html#access-adls-gen2-directly although I remember that the first few times I also had problems with it. That page also explains how to register an app. Maybe it will be OK for your company policies.
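One more hedged note: rather than pasting the client secret into the notebook, it can be kept in a Databricks secret scope. A sketch assuming a scope named "my-scope" with a key "sp-secret" has already been created (both names are placeholders):

# Assumes the secret scope and key already exist; both names are placeholders.
client_secret = dbutils.secrets.get(scope="my-scope", key="sp-secret")
spark.conf.set("fs.azure.account.oauth2.client.secret", client_secret)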
Related
I have data saved in parquet format. Petastorm is a library I am using to obtain batches of data for training.
I was able to do this on my local system, but the same code is not working in Databricks.
Code I used in my local system
# create an iterator object train_reader. num_epochs is the number of epochs for which we want to train our model
with make_batch_reader('file:///config/workspace/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)
Code I used in Databricks
with make_batch_reader('dbfs://output/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)
The error I am getting with the Databricks code is:
TypeError: __init__() missing 2 required positional arguments: 'instance' and 'token'
I have checked the documentation, but couldn't find any argument that goes by the name of instance or token. However, in the similar method make_reader in Petastorm, for Azure Databricks I see the code below:
# create sas token for storage account access, use your own adls account info
remote_url = "abfs://container_name@storage_account_url"
account_name = "<<adls account name>>"
linked_service_name = '<<linked service name>>'
TokenLibrary = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
sas_token = TokenLibrary.getConnectionString(linked_service_name)

with make_reader('{}/data_directory'.format(remote_url), storage_options={'sas_token': sas_token}) as reader:
    for row in reader:
        print(row)
Here I see some 'sas_token' being passed as input.
Please suggest how I can resolve this error.
I tried changing the path of the parquet file, but that did not work out for me.
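One avenue that might be worth trying (an assumption on my part, not a confirmed fix): Petastorm may not resolve the dbfs:// scheme, while Databricks exposes DBFS through the local /dbfs FUSE mount, so a file:// URL pointing at /dbfs/... is sometimes enough. A sketch that mirrors the code from the question:

from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset
import tensorflow as tf

# Sketch only: assumes the data is visible at /dbfs/output/scaled.parquet through the DBFS FUSE mount.
with make_batch_reader('file:///dbfs/output/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: tf.convert_to_tensor(x)).batch(2)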
Attempting to read a view that was created in AWS Athena (based on a Glue table that points to a Parquet file in S3) using PySpark on a Databricks cluster throws the following error for an unknown reason:
java.lang.IllegalArgumentException: Can not create a Path from an empty string;
The first assumption was that access permissions are missing, but that wasn't the case.
While researching further, I found the following Databricks documentation about the reason for this issue: https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system
I was able to come up with a Python script to fix the problem. It turns out that this exception occurs because Athena and Presto store a view's metadata in a format that is different from what Databricks Runtime and Spark expect. You'll need to re-create your views through Spark.
Python script example with execution example:
import boto3
import time

def execute_blocking_athena_query(query: str, athenaOutputPath, aws_region):
    # Run an Athena query and block until it finishes (or fails).
    athena = boto3.client("athena", region_name=aws_region)
    res = athena.start_query_execution(QueryString=query,
                                       ResultConfiguration={'OutputLocation': athenaOutputPath})
    execution_id = res["QueryExecutionId"]
    while True:
        res = athena.get_query_execution(QueryExecutionId=execution_id)
        state = res["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return
        if state in ["FAILED", "CANCELLED"]:
            raise Exception(res["QueryExecution"]["Status"]["StateChangeReason"])
        time.sleep(1)

def create_cross_platform_view(db: str, table: str, query: str, spark_session, athenaOutputPath, aws_region):
    glue = boto3.client("glue", region_name=aws_region)

    # Create the view through Athena first and keep its Presto metadata.
    glue.delete_table(DatabaseName=db, Name=table)
    create_view_sql = f"create view {db}.{table} as {query}"
    execute_blocking_athena_query(create_view_sql, athenaOutputPath, aws_region)
    presto_schema = glue.get_table(DatabaseName=db, Name=table)["Table"]["ViewOriginalText"]
    glue.delete_table(DatabaseName=db, Name=table)

    # Re-create the view through Spark so Databricks can read it,
    # stripping the read-only fields Glue will not accept back.
    spark_session.sql(create_view_sql).show()
    spark_view = glue.get_table(DatabaseName=db, Name=table)["Table"]
    for key in [
        "DatabaseName",
        "CreateTime",
        "UpdateTime",
        "CreatedBy",
        "IsRegisteredWithLakeFormation",
        "CatalogId",
    ]:
        if key in spark_view:
            del spark_view[key]

    # Put the Presto metadata back so Athena keeps working too.
    spark_view["ViewOriginalText"] = presto_schema
    spark_view["Parameters"]["presto_view"] = "true"
    spark_view = glue.update_table(DatabaseName=db, TableInput=spark_view)

create_cross_platform_view("<YOUR DB NAME>", "<YOUR VIEW NAME>", "<YOUR VIEW SQL QUERY>", <SPARK_SESSION_OBJECT>, "<S3 BUCKET FOR OUTPUT>", "<YOUR-ATHENA-SERVICE-AWS-REGION>")
Again, note that this script keeps your views compatible with Glue/Athena.
References:
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/29
https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system
I am trying to load data into delta lake from azure blob storage.
I am using the below code snippet:
storage_account_name = "xxxxxxxxdev"
storage_account_access_key = "xxxxxxxxxxxxxxxxxxxxx"
file_location = "wasbs://bicc-hdspk-eus-qc#xxxxxxxxdev.blob.core.windows.net/FSHC/DIM/FSHC_DIM_SBU"
file_type = "csv"
spark.conf.set("fs.azure.account.key."+storage_account_name+".blob.core.windows.net",storage_account_access_key)
df = spark.read.format(file_type).option("header","true").option("inferSchema", "true").option("delimiter", '|').load(file_location)
dx = df.write.format("parquet")
Up to this step it is working, and I am also able to load it into a Databricks table.
dx.write.format("delta").save(file_location)
error : AttributeError: 'DataFrameWriter' object has no attribute 'write'
P.S. Am I passing the wrong file location into the write statement? If that is the cause, then what is the correct file path for Delta Lake?
Please get back to me in case additional information is needed.
Thanks,
Abhirup
dx is a DataFrameWriter, so what you're trying to do doesn't make sense. You could do this instead:
df = spark.read.format(file_type).option("header","true").option("inferSchema", "true").option("delimiter", '|').load(file_location)
df.write.format("parquet").save()
df.write.format("delta").save()
Using a very simple-minded approach to read data, select a subset of it, and write it out, I'm getting a "'DataFrameWriter' object is not callable" error.
I'm surely missing something basic.
Using an AWS EMR:
$ pyspark
> dx = spark.read.parquet("s3://my_folder/my_date*/*.gz.parquet")
> dx_sold = dx.filter("keywords like '%sold%'")
# select customer ids
> dc = dx_sold.select("agent_id")
Question
The goal is to now save the values of dc ... e.g. to s3 as a line-separated text file.
What's a best-practice to do so?
Attempts
I tried
dc.write("s3://my_folder/results/")
but received
TypeError: 'DataFrameWriter' object is not callable
Also tried
X = dc.collect()
but eventually received a TimeOut error message.
Also tried
dc.write.format("csv").options(delimiter=",").save("s3://my_folder/results/")
But eventually received messages of the form
TaskSetManager: Lost task 4323.0 in stage 9.0 (TID 88327, ip-<hidden>.internal, executor 96): TaskKilled (killed intentionally)
The first comment is correct: it was an FS problem.
The ad-hoc solution was to convert the desired results to a list and then serialize the list, e.g.:
import pickle

dc = dx_sold.select("agent_id").distinct()
result_list = [str(c) for c in dc.collect()]
# result_path is a local file path for the pickled list
pickle.dump(result_list, open(result_path, "wb"))
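For completeness, once the underlying filesystem issue is sorted out, a distributed write avoids pulling everything onto the driver. A sketch (not the author's solution) that writes one value per line to the results prefix from the question:

# Cast to string so the text writer accepts the single column; output is split across part files.
(dx_sold.select("agent_id")
        .distinct()
        .selectExpr("cast(agent_id as string) as agent_id")
        .write.mode("overwrite")
        .text("s3://my_folder/results/"))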
I have a file-processing job in Azure Spark. It takes a long time to process the files. Can anyone please suggest an optimized way to achieve shorter processing times? I have also attached my sample code.
// Azure container filesystem; it contains source, destination, archive and result files
val azureContainerFs = FileSystem.get(sc.hadoopConfiguration)

// Read source file list
val sourceFiles = azureContainerFs.listStatus(new Path("/" + sourcePath + "/"), new PathFilter {
  override def accept(path: Path): Boolean = {
    val name = path.getName
    name.endsWith(".json")
  }
}).toList.par

// Ingestion processing for each file
for (sourceFile <- sourceFiles) {
  // Tokenize file name from path
  val sourceFileName = sourceFile.getPath.toString.substring(sourceFile.getPath.toString.lastIndexOf('/') + 1)

  // Create a customer invoice DF from source json
  val customerInvoiceDf = sqlContext.read.format("json").schema(schemaDf.schema).json("/" + sourcePath + "/" + sourceFileName).cache()
Thanks in Advance!
Please tell us a bit more about your stack and processing power (number of master and worker nodes, how you deploy code, things like that).