Read a gpkg file from memory/zipfile - python-3.x

I know that it is possible to read a shapefile from a zipfile by extracting it in memory and then reading it:
https://gis.stackexchange.com/questions/250092/using-pyshp-to-read-a-file-like-object-from-a-zipped-archive
Fiona also has ways to read a shapefile from memory:
https://pypi.org/project/Fiona/1.5.0/
However, I haven't been able to find a way to read in a .gpkg (geopackage) in the same way.
How do I extract a geopackage from a zipfile and then into a geopandas geodataframe?

You can read it directly by specifying the path to gpkg within zip.
df = gpd.read_file('zip:///path/to/file.zip!data.gpkg')
for relative path:
df = gpd.read_file('zip://../path/to/file.zip!data.gpkg')
(in the case of needing to go back a directory and then into 'path/to/' etc

Related

Read a large hdf5 file from url by chunk in python

I have a 1.5 terabyte sized hdf5 file on an Amazon Simple Storage Service located at the link below. I don't have the disk space to save it nor do I have the memory to read it. Accordingly, I want to read it by chunk, process it, and discard the read part. I was hoping to use pandas' read_hdf to read it but it does not support urls. Neither does the h5py library it seems. Though it does mention a ros3 driver but I haven't been able to get it to work yet. I also tried the response to this question but the chunks cannot be read by h5py or I have not found a way yet. So I'm rather left with no idea on how to process this file. Does anyone have any idea how to do so? The link to the file is this:
https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5
After having this exact same issue, I believe I've cobbled together a working solution for this using fsspec:
import h5py
import fsspec
URL = "..." # Assuming a publicly accessible url
remote_f = fsspec.open(URL, mode="rb")
if hasattr(remote_f, "open"):
remote_f = remote_f.open()
f = h5py.File(remote_f)
# Do regular hdf5 things...
I've confirmed, using your link above, that this does not read the data into memory, just as if it were a local file:
import h5py
import fsspec
URL = "https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5"
remote_f = fsspec.open(URL, mode="rb")
if hasattr(remote_f, "open"):
remote_f = remote_f.open()
f = h5py.File(remote_f)
f.visititems(print)
# 1. README <HDF5 dataset "1. README": shape (), type "|O">
# 2. Resources <HDF5 group "/2. Resources" (2 members)>
# 2. Resources/2.1. Building Models <HDF5 group "/2. Resources/2.1. Building Models" (9 members)>
...

pandas : read_csv not accepting relative path

I have python code in Jupyter notebook and accompanying data in the same folder. I will be bundling both the code and data into a zip file and submitting for evaluation. I am trying to read the data inside the Notebook using pandas.read_csv using a relative path and thats not working. the API doesnt seem to work with relative path. What is the correct way to handle this?
Update:
My findings so far seem to suggest that, I should be using os.chdir() to set the current working directory. But I wouldn't know where the zip file will get extracted. The code is supposed to be read-only..So I cannot expect the receiver to update the path as appropriate.
You could append the current working directory with the relative path to avoid problem as such:
import os
import pandas as pd
BASE_DIR = os.getcwd()
csv_path = "csvname.csv"
df = pd.read_csv(os.path.join(BASE_DIR, csv_path)
where csv_path is the relative path.
I think first of all you should make a unzip file then you can run.
You may use the below code to unzip file,
from zipfile import ZipFile
file_name = "folder_name.zip"
with ZipFile(file_name, 'r') as zip:
zip.extractall()
print("Done !")

How to load large multi file parquet files for tensorflow/pytorch

I am trying to load a few parquet files from a directory into Python for tensorflow/pytorch.
The files are too large to be loaded through the pyarrow.parquet functions
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('dir')
table = dataset.read()
This gives out of memory error.
I have also tried using petastorm, but that doesn't work for make_reader() because it isn't of the petastorm type.
with make_batch_reader('dir') as reader:
dataset = make_petastorm_dataset(reader)
When I used the make_batch_reader() and then the make_petastorm_dataset(reader), it again gave an zip not iterable error or something along those lines.
I am not sure how to load the file into Python for ML training.
Some quick help would be greatly appreciated.
Thanks
Zash
For pyarrow, you can list the directory with Python, iterate over *.parquet files, open each one as pq.ParquetFile, and read it one row group at a time. This will alleviate the memory pressure, but won't be super fast without parallelization.
For petastorm, you are right to use make_batch_reader(). Indeed, the error messages are not always helpful; but you can inspect the stack trace and investigate where in petastorm code it originates from.
You can load entire data using dask using below code.
You can also load only chucks of data whenever needed by computing only those lines using the index. [Assuming you have different index].
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob
#delayed
def load_chunk(pth):
x = ParquetFile(pth).to_pandas()
x = x.drop('[unwanted_columns_to_save_space]',axis=1)
return x
files = glob.glob('./your_path/*.parquet')
ddf = dd.from_delayed([load_chunk(f) for f in files])
df = ddf.compute()

read_csv one file from several files in a gzip?

I have several files in my tar.gz zip file. I want to read only one of them into a pandas data frame. Is there any way to do that?
Pandas can read a file inside a gz. But seems like there is no way to tell it specifically read one of them if there are several files inside the gz.
Would appreciate any thoughts.
Babak
To read a specific file in any compressed folder we just need to give its name or position for e.g to read a specific csv file in a zipped folder we can just open that file and read the content.
from zipfile import ZipFile
import pandas as pd
# opening the zip file in READ mode
with ZipFile("results.zip") as z:
read = pd.read_csv(z.open(z.infolist()[2].filename))
print(read)
Here the folder structure of results looks like and I want to read test.csv :
$ data_description.txt sample_submission.csv test.csv train.csv
If you use pardata, you can do this in one line:
import pardata
data = pardata.load_dataset_from_location('path-to-zip.zip')['table/csv']
The returned data variable should be a dictionary of all csv files in the zip archive.
Disclaimer: I'm one of the main co-authors of pardata.

Get HDFS file path in PySpark for files in sequence file format

My data on HDFS is in Sequence file format. I am using PySpark (Spark 1.6) and trying to achieve 2 things:
Data path contains a timestamp in yyyy/mm/dd/hh format that I would like to bring into the data itself. I tried SparkContext.wholeTextFiles but I think that might not support Sequence file format.
How do I deal with the point above if I want to crunch data for a day and want to bring in the date into the data? In this case I would be loading data like yyyy/mm/dd/* format.
Appreciate any pointers.
If stored types are compatible with SQL types and you use Spark 2.0 it is quite simple. Import input_file_name:
from pyspark.sql.functions import input_file_name
Read file and convert to a DataFrame:
df = sc.sequenceFile("/tmp/foo/").toDF()
Add file name:
df.withColumn("input", input_file_name())
If this solution is not applicable in your case then universal one is to list files directly (for HDFS you can use hdfs3 library):
files = ...
read one by one adding file name:
def read(f):
"""Just to avoid problems with late binding"""
return sc.sequenceFile(f).map(lambda x: (f, x))
rdds = [read(f) for f in files]
and union:
sc.union(rdds)

Resources