How to download the pretrained dataset of huggingface RagRetriever to a custom directory [duplicate] - pytorch

The disk holding the default cache directory is running out of capacity, so I need to change the default cache directory configuration.

You can specify the cache directory every time you load a model with .from_pretrained by setting the parameter cache_dir. You can define a default location by exporting the environment variable TRANSFORMERS_CACHE before you use the library (i.e. before importing it!).
Example for python:
import os
os.environ['TRANSFORMERS_CACHE'] = '/blabla/cache/'
Example for bash:
export TRANSFORMERS_CACHE=/blabla/cache/

As cronoik mentioned, as an alternative to modifying the cache path in the terminal, you can set the cache directory directly in your code. I will just provide the actual code in case you have difficulty looking it up on Hugging Face:
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base", cache_dir="new_cache_dir/")
model = AutoModelForMaskedLM.from_pretrained("roberta-base", cache_dir="new_cache_dir/")

I'm writing this answer because, besides the model cache, there are other Hugging Face cache directories that also eat space in the home directory, and the previous answers and comments did not make this clear.
The Transformers documentation describes how the default cache directory is determined:
Cache setup
Pretrained models are downloaded and locally cached at ~/.cache/huggingface/transformers/. This is the default directory given by the shell environment variable TRANSFORMERS_CACHE. On Windows, the default directory is C:\Users\username\.cache\huggingface\transformers. You can change the shell environment variables shown below - in order of priority - to specify a different cache directory:
Shell environment variable (default): TRANSFORMERS_CACHE.
Shell environment variable: HF_HOME + transformers/.
Shell environment variable: XDG_CACHE_HOME + /huggingface/transformers.
What this piece of documentation doesn't explicitly mention is that HF_HOME defaults to $XDG_CACHE_HOME/huggingface and is used for other Hugging Face caches, e.g. the datasets cache, which is separate from the transformers cache. The value of XDG_CACHE_HOME is machine dependent, but usually it is $HOME/.cache (and HF defaults to this value if XDG_CACHE_HOME is not set) - hence the usual default of $HOME/.cache/huggingface.
So you probably will want to set the HF_HOME environment variable (and possibly set a symlink to catch cases where the environment variable is not set).
export HF_HOME=/path/to/cache/directory
This environment variable is also respected by the Hugging Face datasets library, although the documentation does not explicitly state this.
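Putting the advice above together, here is a minimal Python sketch (the cache path is a placeholder) that sets HF_HOME before any Hugging Face import, so the transformers and datasets caches both land on the larger disk:

```python
import os

# Set HF_HOME *before* importing transformers or datasets; the safe
# rule is that cache-related variables must be in the environment
# before the libraries are first imported.
os.environ['HF_HOME'] = '/mnt/bigdisk/hf_cache'  # placeholder path

# Only now import and use the library, e.g.:
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("roberta-base")
```

Subsequent downloads of models and datasets should then end up under /mnt/bigdisk/hf_cache instead of $HOME/.cache/huggingface.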

Related

Azure ML release bug AZUREML_COMPUTE_USE_COMMON_RUNTIME

On 2021-10-13, our application on the Azure ML platform started hitting a new error that causes failures in pipeline steps: Python module import failures, with a warning stack that leads to a pipeline runtime error.
We needed to set it to false. Why is it failing? What exactly are the exact (and long-term) consequences of opting out? Also, Azure ML users: do you think it was rolled out appropriately?
Try adding a new variable to your environment like this:
environment.environment_variables = {"AZUREML_COMPUTE_USE_COMMON_RUNTIME":"false"}
Long term (throughout 2022), AzureML will be fully migrating to the new Common Runtime on AmlCompute. Short term, this change is a large undertaking, and we're on the lookout for tricky functionality of the old Compute Runtime we're not yet handling correctly.
One small note on disabling Common Runtime: it can be more efficient (it avoids an Environment rebuild) to add the environment variable directly to the RunConfig:
run_config.environment_variables["AZUREML_COMPUTE_USE_COMMON_RUNTIME"] = "false"
We'd like to get more details about the import failures, so we can fix the regression. Are you setting the PYTHONPATH environment variable to make your custom scripts importable? If so, this is something we're aware isn't working as expected and are looking to fix it within the next two weeks.
We identified the issue and have rolled out a hotfix on our end addressing it. There are two problems that could have caused the import issue. One is that we are overwriting the PYTHONPATH environment variable. The second is that we are not adding the Python script's containing directory to Python's module search path if the containing directory is not the current working directory.
It would be great if you could please try again without setting the AZUREML_COMPUTE_USE_COMMON_RUNTIME environment variable and see if the problem is still there. If it is, please reply to either Lucas's thread or mine with a minimal repro, or a description of where the module you are trying to import is located relative to the script being run and the root of the snapshot (which is the current working directory).

Is there a way to use a global variable for the whole npm project?

I need to share a variable between the node_modules folder and the src folder. Is there a way to define a variable that can be used within the whole project?
Thanks.
Typically this is done with process environment variables. Any project can look for a common environment variable and handle that case accordingly, as well as the node modules of said project. As long as they agree on the name, they all share the same environment. Have a good default, don't force people to set this environment variable.
The environment can be set using an environment file (do not check this into source control!), on your container or cloud configuration, or even right on the command line itself.
TLOG=info npm test
That is an example I use frequently. My project runs logging for the test cases only at alert level - there are a lot of tests, so this makes the output less verbose. However, sometimes while developing I want to see all the logs, so my project looks for an environment variable TLOG (short for "test logging") and I can set it just for that run. Also, no code change is needed, which is nicer than JavaScript variables that need to be set back to their original values, can be forgotten and left on, etc.
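The pattern in this answer - one agreed-upon environment variable with a sensible default - is language-agnostic; in Node it would be read via process.env.TLOG. As a sketch, here is the same idea in Python, reusing the TLOG name from the answer:

```python
import os

def log_level() -> str:
    # Look up the shared environment variable; fall back to a quiet
    # default so nobody is forced to set it.
    return os.environ.get('TLOG', 'alert')

print(log_level())           # -> alert (when TLOG is unset)

# Override for a single run without any code change,
# the equivalent of `TLOG=info npm test`:
os.environ['TLOG'] = 'info'
print(log_level())           # -> info
```

Any module in the process (including third-party dependencies) that agrees on the name TLOG sees the same value, which is exactly why environment variables work as project-wide "globals".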

Running Brightway with project dir on user-defined directory

The default directory in which Brightway stores projects and all associated components is determined by appdirs. Indeed, in bw2data.projects, the project directory is set as:
data_dir = appdirs.user_data_dir(LABEL, "pylca")
For example, on my Windows install, this is C:\users\me\AppData\Local\pylca\Brightway3.
I would like one of my projects to be on an external network disk. This is for a project in active use, not just for cold storage. Is there functionality within Brightway to change the location of a project?
Yes, and the best way to do this is in the activation script for your project-specific virtual environment. See the FAQs (and please report an issue if more detail is needed or something is wrong):
https://docs.brightwaylca.org/faq.html#how-do-i-find-where-my-data-is-saved
https://docs.brightwaylca.org/faq.html#setting-brightway2-dir-in-a-virtual-environment
As an alternative procedure if you want to change BRIGHTWAY2_DIR in Python, this works:
import os
os.environ['BRIGHTWAY2_DIR'] = 'path/to/my/other/dir'
from brightway2 import *
Despite interesting leads such as this one on reload, I have been unable to make this work if there was a brightway2 import before setting the BRIGHTWAY2_DIR environment variable.

change directory of tmp file for sqlite3 [duplicate]

I'm trying to run the VACUUM command on my database, but I seem to run out of space:
> sqlite3 mydatabase.db "VACUUM"
Error: database or disk is full
The database is about 36 GB and the drive that I'm running it on looks like (via df -h):
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 406G 171G 215G 45% /home
So I clearly have more than the roughly double-the-database-size free space needed. What can I do to allow the VACUUM command to run?
To allow the VACUUM command to run, change the directory for temporary files to one that has enough free space.
SQLite's documentation says the temporary directory is (in order):
whatever is set with the PRAGMA temp_store_directory command; or
whatever is set with the SQLITE_TMPDIR environment variable; or
whatever is set with the TMPDIR environment variable; or
/var/tmp; or
/usr/tmp; or
/tmp; or
., the current working directory
The OP noted that during a VACUUM, SQLite creates a temporary file that is approximately the same size as the original database. It does this in order to maintain the database ACID properties. SQLite uses a directory to hold the temporary files it needs while doing the VACUUM. In order to determine what directory to use, it descends a hierarchy looking for the first directory that has the proper access permissions. If it finds a directory that it doesn't have access to, it will ignore that directory and continue to descend the hierarchy looking for one that it does. I mention this in case anyone has specified an environment variable and SQLite seems to be ignoring it.
In his answer CL gives the hierarchy for Linux, and in his comment he mentions that the hierarchy is OS-dependent. For completeness, here are the hierarchies (in so far as I can determine them from the code).
For Unix (and Linux) the hierarchy is:
Whatever is specified by the SQLITE_TMPDIR environment variable,
Whatever is specified by the TMPDIR environment variable,
/var/tmp,
/usr/tmp,
/tmp, and finally
The current working directory.
For Cygwin the hierarchy is:
Whatever is specified by the SQLITE_TMPDIR environment variable,
Whatever is specified by the TMPDIR environment variable,
Whatever is specified by the TMP environment variable,
Whatever is specified by the TEMP environment variable,
Whatever is specified by the USERPROFILE environment variable,
/var/tmp,
/usr/tmp,
/tmp, and finally
The current working directory.
For Windows the hierarchy is:
GetTempPath(), which is documented to return:
Whatever is specified by the TMP environment variable,
Whatever is specified by the TEMP environment variable,
Whatever is specified by the USERPROFILE environment variable, and finally
the Windows directory.
Hope this helps.
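The Unix lookup order above can be exercised from Python's built-in sqlite3 module: point SQLITE_TMPDIR at a directory with enough free space before running VACUUM. A minimal sketch (the directory here is a stand-in for your roomy disk):

```python
import os
import sqlite3
import tempfile

# Point SQLite's temporary files at a directory with enough free
# space. This must be in the environment before VACUUM runs, since
# SQLITE_TMPDIR takes priority over TMPDIR, /var/tmp, /usr/tmp, /tmp.
spare_dir = tempfile.mkdtemp()            # stand-in for a large disk
os.environ['SQLITE_TMPDIR'] = spare_dir

db_path = os.path.join(spare_dir, 'mydatabase.db')
con = sqlite3.connect(db_path)
con.execute('CREATE TABLE t (x)')
con.executemany('INSERT INTO t VALUES (?)', [(i,) for i in range(1000)])
con.execute('DELETE FROM t')              # leaves free pages behind
con.commit()
con.execute('VACUUM')                     # temp copy goes to SQLITE_TMPDIR
con.close()
```

The same effect is available from the shell with `SQLITE_TMPDIR=/big/disk sqlite3 mydatabase.db "VACUUM"`; the Python version just makes the ordering (environment variable set before the VACUUM statement) explicit.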
Probably the drive where your temporary files are created does not have enough space. See here:
Vacuum-command-is-failing-with-SQL-Error-Database-or-disk-is-full

Can InstallShield use environment variables in part of a source file path at build time?

We have a build script which build three types of projects - C++, Java and finally the respective InstallShield installers.
Right now the installer build script relies on the fact that the C++ projects are always built in the Release configuration.
But now I wish to allow building them in an additional configuration, namely Profile.
We are using the Jenkins CI server and thus the desired configuration is provided through a dedicated Jenkins build parameter DRIVER_PROXY_CONFIG, which is surfaced as an environment variable with the same name.
Now the problem. According to our InstallShield guy, IS cannot use an environment variable in part of a source file path. I quote:
You can use either 'environment variables' or 'user-defined path variables
defined through InstallShield' as a file path.
So we can:
Create 'environment variable' for each component (since 'DRIVER_PROXY_CONFIG' is only part of the component path) – not desirable.
Make the 'environment variable' part of the component 'user-defined path variable' – not possible, I have just tried it.
Has anyone done anything like this? The installer depends on multiple source files in different locations, where a part of such a location path is the value of the DRIVER_PROXY_CONFIG environment variable. Note that this part is neither the path prefix nor the suffix.
You absolutely can create it as part of a path. Some exact behaviors do depend on the version of InstallShield, but for the last several you can even use relative parent directories. Just go to the Path Variables view, add a new environment path variable (say Env), and set the environment variable it references. Then either add any number of standard path variables (say Stn) that are defined as <Env>\Sub\Dir, or skip this step and just reference those for the ISBuildSourcePath of the relevant files. Typically adding a file from a path under a defined path variable will use that path variable as part of its path.
If you've already added the files, the convert source paths wizard may help here, but you might find it easier to visit the File table directly to update the ISBuildSourcePath.
However there is at least one exception. If your environment variable has the value Sub and your full directory name is SubDirectory, you cannot always reference <Env>Directory. Typically the path variable support will turn that into Sub\Directory instead.
Michael:
What if 'env' is neither a prefix nor a suffix of the path ("SomeDir\<env>\SubDir")?
I have created system env config=release
I have created IS variable 'MyConf' that reference the env 'config'
I have created IS standard path MyPath = "SomeDir\<MyConf>\SubDir"
If I add a file from this path, IS won't suggest 'MyPath' as the path variable!
The only way I have found, is to add the files, and then visit the File table directly to update the ISBuildSourcePath.
I added the environment variable as a path variable; you can set environment-variable types there (not string types!).
Then you can use it anywhere you'd use a path variable - though I did have to enclose it in square brackets rather than the usual angle ones. It should work in the middle of a path, as I have done that with ordinary path variables.
