Workaround for timeout error in Dataset.download() - azure-machine-learning-service

azureml-sdk version: 1.0.85
Calling below (as given in the Dataset UI), I get this
ds_split = Dataset.get_by_name(workspace, name='ret- holdout-split')
ds_split.download(target_path=dir_outputs, overwrite=True)
UnexpectedError:
{'errorCode': 'Microsoft.DataPrep.ErrorCodes.Unknown', 'message':
'The client could not finish the operation within specified timeout.',
'errorData': {}}
The FileDataset 1GB pickled file stored in blob.
Here's a gist with the full traceback

I also faced this same issue(timeout error) while loading sqlpool dataset. After spent some amount of time, I figured out issue in SQL Query and by optimizing the SQL query solved the timeout issue.(It may be useful for someone)

Tried again this AM and it worked. let's file this under "transient error"

Related

Manually Deleted data file from delta lake

I have manually deleted a data file from delta lake and now the below command is giving error
mydf = spark.read.format('delta').load('/mnt/path/data')
display(mydf)
Error
A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.microsoft.com/azure/databricks/delta/delta-intro#frequently-asked-questions
i have tried restarting the cluster with no luck
also tried the below
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.databricks.io.cache.enabled", "false")
Any help on repairing the transaction log or fix the error
as explained before you must use vacuum to remove files as manually deleting files does not lead to the delta transaction log being updated which is what spark uses to identify what files to read.
In your case you can also use the FSCK REPAIR TABLE command.
as per the docs :
"Removes the file entries from the transaction log of a Delta table that can no longer be found in the underlying file system. This can happen when these files have been manually deleted."
The above error indicates that you have manually deleted a data file without using the proper DELETE Statement.
As per MS Doc, you can try vacuum command. Using the vacuum command fix the error.
%sql
vacuum 'Your_path'
For more information refer this link
FSCK Command worked for me. Thanks All

How can I use Brightway2 with US LCI database?

Short version:
I am trying to upload US LCI database to Brightway2 and I am failing miserably. Has anyone succeeded? If so, could you share it with me? :D
Long version:
I am following the notebook IO - Importing the US LCI database notebook and I am having a lot of problems. I am aware that, as the notebook indicates, it is a work in progress. Anyhow, I wanted to give it a try:
I tried uploading every ecospold version database found here, following the method from the notebook. The only one that gave me a similar results was version FY20.Q3.02. However, right off the bat I get the following differences/errors:
Same as the notebook, I get this error: Couldn't apply strategy link_technosphere_by_activity_hash: Object in source database can't be uniquely linked to target database. And two activities that are linked. When I follow the instructions of ignoring these datasets, it throws me that error over and over again.
Trying to move on with the tutorial, I get more errors and at the end I end up with all exchanges unlinked:
633 datasets
37513 exchanges
37505 unlinked exchanges
Finally, after running the code in line [15]:
import functools
f = functools.partial(link_iterable_by_fields,
other=Database(config.biosphere),
kind='biosphere'
)
sp.apply_strategy(f)
sp.statistics(f)
I end up with:
0 datasets
0 exchanges
0 unlinked exchanges
Which is hilarious and sad at the same time. Since I am new with Python and BW, my troubleshooting is clumpsy and probably erroneous (I promise I googled a lot and went through the code). And concluded I am failing and it is time to ask questions:
Has anybody succeeded uploading the US LCI database to Brightway2?
If so, how? Which file did you use?
Thank you!!!!
This is an excellent question. I have added text to the offending notebook to note that it is obsolete.
In general, I think trying to import the ecospold files is a fools errand, as though they are labeled ecospold2, they are actually ecospold1 (which is a totally different format):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ecoSpold xmlns="http://www.EcoInvent.org/EcoSpold01">
The most recent export also raises an error when I try the ecospold1 importer:
AttributeError: no such child: {http://www.EcoInvent.org/EcoSpold01}modellingAndValidation
This is a required attribute in ecospold1.
I think the best way forward would be to consume the JSON-LD directly. Note that it is important not to run bw2setup(), as you would also want to use their list of elementary flows and LCIA methods. Currently the experimental JSON-LD importer fails because the provided datasets need allocation, but don't provide a set of consistent allocation methods. When I use the git checkout of bw2io and do the following:
uslci = JSONLDImporter(
"/Users/cmutel/Downloads/National_Renewable_Energy_Laboratory-USLCI_Database/",
"US LCI",
preferred_allocation="CAUSAL_ALLOCATION"
)
uslci.apply_strategies()
I get the following error:
UnallocatableDataset: We currently only support exchange-specific CAUSAL_ALLOCATION
This is fixable, but someone would need to step through this and fix the allocation procedure, and I don't have the time to do that now.

Google Colab - downloads some files, TypeError: Failed to fetch on others

I have a Google Colab notebook with PyTorch code running in it.
At the beginning of the train function, I create, save and download word_to_ix and tag_to_ix dictionaries without a problem, using the following code:
from google.colab import files
torch.save(tag_to_ix, pos_dict_path)
files.download(pos_dict_path)
torch.save(word_to_ix, word_dict_path)
files.download(word_dict_path)
I train the model, and then try to download it with the code:
torch.save(model.state_dict(), model_path)
files.download(model_path)
Then I get a MessageError: TypeError: Failed to fetch.
Obviously, the problem is not with the third party cookies (as suggested here), because the first files are downloaded without a problem. (I actually also tried adding the link in my Allow section, but, surprise surprise, it made no difference.)
I was originally trying to save the model as is (which, to my understanding, saves it as a Pickle), and I thought maybe Colab files doesn't handle downloading Pickles well, but as you can see above, I'm now trying to save a dict object (which is also what word_to_ix and tag_to_ix) are, and it's still not working.
Downloading the file manually with right-click isn't a solution, because sometimes I leave the code running while I do other things, and by the time I get back to it, the runtime has disconnected, and the files are gone.
Any suggestions?

Google Colab: "Unable to connect to the runtime" after uploading Pytorch model from local

I am using a simple (not necessarily efficient) method for Pytorch model saving.
import torch
from google.colab import files
torch.save(model, filename) # save a trained model on the VM
files.download(filename) # download the model to local
best_model = files.upload() # select the model just downloaded
best_model[filename] # access the model
Colab disconnects during execution of the last line, and hitting RECONNECT tab always shows ALLOCATING -> CONNECTING (fails, with "unable to connect to the runtime" message in the left bottom corner) -> RECONNECT. At the same time, executing any one of the cells gives Error message "Failed to execute cell, Could not send execute message to runtime: [object CloseEvent]"
I know it is related to the last line, because I can successfully connect with my other google accounts which doesn't execute that.
Why does it happen? It seems the google accounts which have executed the last line can no longer connect to the runtime.
Edit:
One night later, I can reconnect with the google account after session expiration. I just attempted the approach in the comment, and found that just files.upload() the Pytorch model would lead to the problem. Once the upload completes, Colab disconnects.
Try disabling your ad-blocker. Worked for me
(I wrote this answer before reading your update. Think it may help.)
files.upload() is just for uploading files. We have no reason to expect it to return some pytorch type/model.
When you call a = files.upload(), a is a dictionary of filename - a big bytes array.
{'my_image.png': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR....' }
type(a['my_image.png'])
Just like when you do open('my_image', 'b').read()
So, I think the next line best_model[filename] try to print the whole huge bytes array, which bugs the colab.

Spark (PySpark) File Already Exists Exception

I am trying to save a data frame as a text file, however, I am getting a File Already Exists exception. I tried adding the mode to the code but to no avail. Furthermore, the file does not actually exists. Would anyone have an idea how I can solve this problem? I am using PySpark
This is the code:
distFile = sc.textFile("/Users/jeremy/Downloads/sample2.nq")
mapper = distFile.map(lambda q: __q2v(q))
reducer = mapper.reduceByKey(lambda a, b: a + os.linesep + b)
data_frame = reducer.toDF(["context", "triples"])
data_frame.coalesce(1).write.partitionBy("context").text("/Users/jeremy/Desktop/so")
May I add that the exception is being raised after some time and that some data is actually stored in temporary files (which are obviously deleted).
Thanks!
Edit: Exception can be found here: https://gist.github.com/jerdeb/c30f65dc632fb997af289dac4d40c743
you can used overwrite or append for replacing the file or adding the data into same file.
data_frame.coalesce(1).write.mode('overwrite').partitionBy("context").text("/Users/jeremy/Desktop/so")
or
data_frame.coalesce(1).write.mode('append').partitionBy("context").text("/Users/jeremy/Desktop/so")
I had the same problem and was able get around it with this:
outputDir = "/FileStore/tables/my_result/"
dbutils.fs.rm(outputDir , True)
Just change the outputDir variable to whatever directory you are writing to.
You should check your executors and look at the logs of the ones that are failing.
In my case, I had a coalesce(1) on a large DF. 4 of my executors failed - 3 of them had the same error of org.apache.hadoop.fs.FileAlreadyExistsException: File already exists.
However, 1 of them had a different exception: org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 262144 bytes of memory, got 148328
I was able to fix it by increasing the executor memory so that the coalesce did not cause an out of memory error.

Resources