pd.read_parquet produces: OSError: Passed non-file path - python-3.x

I'm trying to retrieve a DataFrame with pd.read_parquet and I get the following error:
OSError: Passed non-file path: my/path
I access the .parquet file in GCS with a path prefixed with gs://
For some unknown reason, the OSError shows the path without the gs:// prefix.
I did suspect it was related to credentials, but from my Mac I can use gsutil with no problem, and I can access and download the parquet file via the Google Cloud Platform. I just can't read the .parquet file directly from the gs:// path. Any ideas why not?

Arrow does not currently support loading data from GCS natively. There is some progress on implementing this; the tasks are tracked in JIRA.
As a workaround, you should be able to use the fsspec GCS filesystem implementation to access the objects (I would try installing gcsfs and using 'gcs://' instead of 'gs://'), or open the file directly with the gcsfs API and pass the file object through.
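A minimal sketch of that workaround, assuming gcsfs is installed and Google credentials are available (e.g. via GOOGLE_APPLICATION_CREDENTIALS); the bucket and object names below are placeholders:
import gcsfs
import pandas as pd

# Option 1: let pandas resolve the path through fsspec/gcsfs
df = pd.read_parquet("gcs://my-bucket/my/path/data.parquet")

# Option 2: open the object with gcsfs and pass the file object through
fs = gcsfs.GCSFileSystem()
with fs.open("my-bucket/my/path/data.parquet", "rb") as f:
    df = pd.read_parquet(f)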

Related

Give direct access of local files to Document AI

I know there is a way to call Document AI from a local Python environment. In that process, one needs to upload the local file to a GCS bucket so that Document AI can access it from there. Is there any way to give Document AI direct access to local files (i.e., without uploading the file to a GCS bucket) using Python? [Note that it's a mandatory requirement for me to run the Python code on my local system, not in GCP.]
Document AI cannot "open" files from your local filesystem by itself.
If you don't want to, or cannot, upload the documents to a bucket, you can send them inline as part of the REST API request. But in that case you cannot use batch processing: you must process the files one by one and wait for each response.
The relevant REST API documentation is here: https://cloud.google.com/document-ai/docs/reference/rest/v1/projects.locations.processors/process
The Python quickstart documentation has this sample code that reads a file and sends it inline as part of the request:
from google.cloud import documentai_v1 as documentai

# Create the client (assumes application default credentials; a regional
# endpoint via client_options may be needed depending on the processor location)
client = documentai.DocumentProcessorServiceClient()

# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processors/processor-id
# You must create new processors in the Cloud Console first
name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

# Read the file into memory
with open(file_path, "rb") as image:
    image_content = image.read()

# Send the document inline as raw bytes (a PDF in this example)
document = {"content": image_content, "mime_type": "application/pdf"}

# Configure the process request and call the synchronous API
request = {"name": name, "raw_document": document}
result = client.process_document(request=request)

Can we read files from server path using any fs method in NodeJs

In my case I need to read file/icon.png from a cloud storage bucket, where the path is a token-based URL; the token resides in the request header.
I tried fs.readFile('serverpath'), but it gave back an 'ENOENT' ('No such file or directory') error, even though the file exists at that path. So are the fs methods able to make calls and read files from a server, or do they only work with local paths? If the latter, how do I read the file from the cloud bucket/server in my case?
I need to pass that file path to the UI to show the icon.
Use this lib to handle GCS operations.
https://www.npmjs.com/package/@google-cloud/storage
If you do need to use fs, install Cloud Storage FUSE (https://cloud.google.com/storage/docs/gcs-fuse), mount the bucket to your local filesystem, and then use fs as you normally would.
I would like to complement Cloud Ace's answer by saying that if you have the Storage Object Admin permission you can make the URL of the image public and use it like any other public URL.
If you don't want to make the URL public you can get temporary access to the file by creating a signed URL.
Otherwise, you'll have to download the file using the GCS Node.js Client.
I posted this as an answer as it is quite long to be a comment.
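For illustration, here is a minimal Python sketch of the signed-URL approach using the google-cloud-storage client (the Node.js client offers a similar getSignedUrl method); the bucket and object names are placeholders, and signing assumes service-account credentials that can sign:
from datetime import timedelta
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").blob("file/icon.png")

# Generate a V4 signed URL valid for 15 minutes; anyone holding the URL
# can read the object during that window.
url = blob.generate_signed_url(
    version="v4",
    expiration=timedelta(minutes=15),
    method="GET",
)
print(url)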

How can TensorFlow read a file from an S3 byte stream

I have built a deep learning model in TensorFlow for image recognition, and it works when reading an image file from a local directory with the tf.read_file() method. Now, however, I need TensorFlow to read the file from a variable holding a byte stream extracted from an Amazon S3 bucket, without storing the stream in a local directory.
You should be able to pass the fully formed s3 path to tf.read_file(), like:
s3://bucket-name/path/to/file.jpeg, where bucket-name is the name of your s3 bucket and path/to/file.jpeg is where the file is stored in your bucket. It seems possible you might be running into an access permissions issue, depending on whether your bucket is private. You can follow https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/s3.md to set up your credentials.
Is there an error you ran into when doing this?
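A minimal sketch of that approach, assuming a TensorFlow build with the S3 filesystem available (on newer TF 2.x releases this may require installing and importing tensorflow-io) and AWS credentials exposed through the usual environment variables; the bucket and key are placeholders:
import tensorflow as tf
# import tensorflow_io  # may be needed on newer TF 2.x to register the s3:// filesystem

# Assumes AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY (and AWS_REGION) are set.
raw_bytes = tf.io.read_file("s3://bucket-name/path/to/file.jpeg")

# Decode the JPEG bytes into an image tensor for the model.
image = tf.io.decode_jpeg(raw_bytes, channels=3)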

Downloading file from Dropbox API for use in Python Environment with Apache Tika on Heroku

I'm trying to use Dropbox as a cloud-based file receptacle for an app/script. The script, written in Python, needs to take PDFs from Dropbox and use the tika-python wrapper to convert them to strings.
I'm able to connect to the Dropbox API and use the files_download_to_file() method to download the PDFs to disk, and then use tika's from_file() method to pull that downloaded file from disk for processing. Example:
# Download ex.pdf to local disk
dbx.files_download_to_file('/my_local_path/ex_on_disk.pdf', '/my_dropbox_path/ex.pdf')
from tika import parser
parsed = parser.from_file('ex_on_disk.pdf')
The problem is that I'm planning on running this app on something like Heroku. I don't think I'm able to save anything locally and then access it again. I'm not sure how to get something from the Dropbox API that can be directly referenced by the tika wrapper to run the same as above. I think the PHP SDK has a file_get_contents and a file_put_contents set of methods but it doesn't appear to have a companion in the Python SDK.
I've tried using the shareable links in place of a filename but that hasn't worked. Any ideas? I know there's also the files_download method which downloads the FileMetadata object but I have no idea what to do with this and am having trouble finding more about it.
TLDR; How can I reference a file on Dropbox with a filename string such as 'example.pdf' to be used in another function that is trying to read a file from disk, without saving that Dropbox file to disk?
I figured it out. I used the files_download method to get the byte string and then used tika's from_buffer method instead:
# files_download returns a (FileMetadata, requests.Response) pair
md, response = dbx.files_download(path)
# Raw bytes of the PDF, kept in memory instead of being written to disk
file_contents = response.content
parsed = parser.from_buffer(file_contents)

Setting Metadata in Google Cloud Storage (Export from BigQuery)

I am trying to update the metadata (programmatically, from Python) of several CSV/JSON files that are exported from BigQuery. The application that exports the data is the same as the one modifying the files (and thus uses the same server certificate). The export goes all well, that is until I try to use the objects.patch() method to set the metadata I want. The problem is that I keep getting the following error:
apiclient.errors.HttpError: <HttpError 403 when requesting https://www.googleapis.com/storage/v1/b/<bucket>/<file>?alt=json returned "Forbidden">
Obviously, this has something to do with bucket or file permissions, but I can't manage to get around it. How come, if the same certificate is being used for writing the files and updating their metadata, I'm unable to update it? The bucket is created with the same certificate.
If that's the exact URL you're using, it's a URL problem: you're missing the /o/ between the bucket name and the object name.
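For reference, a minimal sketch of the patch call with the discovery-based client the question appears to use; the bucket name, object name, and metadata keys are placeholders, and the client builds the correct .../b/<bucket>/o/<object> URL for you:
from googleapiclient import discovery

# Build the Cloud Storage JSON API client (uses application default credentials).
service = discovery.build("storage", "v1")

# objects.patch targets https://www.googleapis.com/storage/v1/b/<bucket>/o/<object>,
# i.e. with the /o/ segment between the bucket name and the object name.
service.objects().patch(
    bucket="my-bucket",
    object="exports/data.csv",
    body={"metadata": {"source": "bigquery-export"}},
).execute()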
