read_csv one file from several files in a gzip? - python-3.x

I have several files in a tar.gz archive. I want to read only one of them into a pandas data frame. Is there any way to do that?
Pandas can read a file inside a gz, but there seems to be no way to tell it to read one specific file when the archive contains several.
I would appreciate any thoughts.
Babak

To read a specific file from a compressed archive, you just need to give its name or position. For example, to read a specific CSV file from a zip archive, open that member and pass it to pandas:
from zipfile import ZipFile
import pandas as pd
# opening the zip file in READ mode
with ZipFile("results.zip") as z:
    # infolist() follows archive order; index 2 is assumed to be test.csv
    read = pd.read_csv(z.open(z.infolist()[2].filename))
print(read)
Here the contents of results.zip look like this, and I want to read test.csv:
$ data_description.txt sample_submission.csv test.csv train.csv
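The question is about a tar.gz rather than a zip. A minimal sketch of the same idea with the standard-library tarfile module follows; the helper name read_tar_member and the file names in the usage note are hypothetical.

```python
import io
import tarfile


def read_tar_member(archive_path, member_name):
    """Return one named member of a .tar.gz archive as an in-memory file."""
    # "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive_path, mode="r:gz") as tar:
        # extractfile raises KeyError for an unknown name and
        # returns None for members that are not regular files.
        member = tar.extractfile(member_name)
        if member is None:
            raise ValueError(f"{member_name!r} is not a regular file")
        # Copy the bytes out so the result stays usable after the tar is closed.
        return io.BytesIO(member.read())
```

Assuming pandas is installed, the result can be handed straight to it, for example pd.read_csv(read_tar_member("data.tar.gz", "test.csv")).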

If you use pardata, you can do this in one line:
import pardata
data = pardata.load_dataset_from_location('path-to-zip.zip')['table/csv']
The returned data variable should be a dictionary of all csv files in the zip archive.
Disclaimer: I'm one of the main co-authors of pardata.

Related

How to stream-read multiple .zip archives through Spark, unzip them, and stream-write each file they contain?

I have an archive of zip files that I would like to open through Spark in streaming, writing the unzipped files (one by one, also in streaming) to another directory that keeps the name of the zip file.
import zipfile
import io
def zip_extract(x):
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = [i for i in file_obj.namelist()]
    return dict(zip(files, [file_obj.open(file).read() for file in files]))
Is there an easy way to run the above code as a streaming read and write? Thank you for your help.
As far as I know, Spark can't read archives out of the box.
A ZIP file both archives and compresses data. If you can, use a program like gzip to compress each file separately instead of archiving multiple files into a single one.
If the archive is a given and can't be changed, you can consider reading it with sparkContext.binaryFiles (https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html). This gives you each zipped file as a byte array in Spark, so you can write a mapper function that unzips it and returns the content of its files. You can then flatten that result to get an RDD of the files' contents.
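The mapper described above could look roughly like this. It is a sketch that uses only the standard library, so it runs without Spark; the function name unzip_pair and the path in the usage comment are hypothetical.

```python
import io
import zipfile


def unzip_pair(pair):
    """Turn one (archive_path, archive_bytes) pair into (name, content) pairs."""
    archive_path, archive_bytes = pair
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return [
            # Prefix with the archive path so the output can keep the zip's name.
            (f"{archive_path}!{info.filename}", archive.read(info))
            for info in archive.infolist()
            if not info.is_dir()
        ]


# Assumed PySpark usage (needs a running SparkContext `sc`):
#   files = sc.binaryFiles("/data/zips/*.zip").flatMap(unzip_pair)
```

Returning a list per archive is what lets flatMap flatten the result into one record per contained file.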

How to read or open a qrel format file?

I was working with a TREC qrel file and I would like to have a look at it. How do I read or open a qrel file? What is the format, and what library should I use?
If you rename the file to a .txt file, you will see that it has multiple columns, one of which is the relevance judgment.
If you are used to working with CSV files and Python pandas DataFrames, you can follow these steps:
1. Rename the qrel file with a .txt extension (just so that you can read it in Notepad or a similar editor).
2. Read the file line by line as a usual .txt file and push it into a CSV file.
Off the top of my head, here is an easy snippet in Python which you could try:
import pandas as pd
rel_query = []
with open('/content/renamed_qrel.qrel.txt', 'r') as fp:
    Lines = fp.readlines()
for line in Lines:
    # The line below may need to be changed based on the type of data in the qrel file
    rel_query.append(line.split())
qrel_df = pd.DataFrame(rel_query)
NOTE: This may not be the right way to do it, but it can help you get started.
I think the right way of doing this would be as follows:
import pandas as pd
df = pd.read_csv('abcd.txt',
                 sep=r"\s+",  # Or whichever separator
                 names=["A", "B", "C", "D"])  # For header names
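For reference, TREC qrels files usually have four whitespace-separated columns: topic, iteration, document ID, and relevance. A minimal sketch that parses such lines without pandas follows; the names Qrel and parse_qrels are hypothetical, and a given file may deviate from this layout.

```python
from collections import namedtuple

# Field names follow the usual TREC qrels layout; adjust if a file differs.
Qrel = namedtuple("Qrel", ["topic", "iteration", "doc_id", "relevance"])


def parse_qrels(lines):
    """Parse whitespace-separated qrels lines into Qrel records."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        topic, iteration, doc_id, relevance = fields
        records.append(Qrel(topic, iteration, doc_id, int(relevance)))
    return records
```

With pandas installed, the records can be turned into a data frame with pd.DataFrame(records, columns=Qrel._fields).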

Read a gpkg file from memory/zipfile

I know that it is possible to read a shapefile from a zipfile by extracting it in memory and then reading it:
https://gis.stackexchange.com/questions/250092/using-pyshp-to-read-a-file-like-object-from-a-zipped-archive
Fiona also has ways to read a shapefile from memory:
https://pypi.org/project/Fiona/1.5.0/
However, I haven't been able to find a way to read in a .gpkg (geopackage) in the same way.
How do I extract a geopackage from a zipfile and then into a geopandas geodataframe?
You can read it directly by specifying the path to the gpkg within the zip:
import geopandas as gpd
df = gpd.read_file('zip:///path/to/file.zip!data.gpkg')
For a relative path (for example, when you need to go up one directory and then into 'path/to/'):
df = gpd.read_file('zip://../path/to/file.zip!data.gpkg')
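If the zip:// path form does not work in a given setup, a fallback is to extract the .gpkg member to a temporary directory and read it from there. A minimal sketch with only the standard library follows; the helper name extract_member and the file names are hypothetical, and the geopandas call is left as a comment because it assumes geopandas is installed.

```python
import tempfile
import zipfile


def extract_member(zip_path, member_name, dest_dir=None):
    """Extract one member of a zip archive and return its path on disk."""
    # A fresh temporary directory is created when no destination is given.
    if dest_dir is None:
        dest_dir = tempfile.mkdtemp()
    with zipfile.ZipFile(zip_path) as archive:
        # extract() returns the normalized path of the file it created.
        return archive.extract(member_name, path=dest_dir)


# Assumed usage:
#   import geopandas as gpd
#   gdf = gpd.read_file(extract_member("file.zip", "data.gpkg"))
```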

How to store multiple files in one file in Python?

How can I store multiple files in one file using python?
I mean my own file format not a zip or a rar.
For example, I want to create an archive from a folder but with my own file format (like 'Files.HR').
Or just storing files in one file without any dictionary or file format. ( 'Files' No file format )
You may want to use "tar" files. In Python, you can use the tarfile module to pack files into a single archive and later extract them back into real files.
You do not have to name the file *.tar. You can give it a name specific to your application, such as Files.HR.
Please see this nice tutorial or read the official docs to see how to use tarfile.
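A minimal sketch of that approach with the standard-library tarfile module follows. The function names pack_folder and unpack are hypothetical; the archive can be given any name, such as Files.HR.

```python
import os
import tarfile


def pack_folder(folder, archive_path):
    """Store every file under `folder` in one uncompressed tar archive."""
    # Mode "w" writes a plain tar; the archive name needs no .tar extension.
    with tarfile.open(archive_path, mode="w") as tar:
        # arcname keeps paths inside the archive relative to the folder itself.
        tar.add(folder, arcname=os.path.basename(os.path.normpath(folder)))


def unpack(archive_path, dest_dir):
    """Restore the archived files as real files under `dest_dir`."""
    with tarfile.open(archive_path, mode="r") as tar:
        # Only unpack archives you trust; member paths are taken from the archive.
        tar.extractall(dest_dir)
```

For example, pack_folder("docs", "Files.HR") followed by unpack("Files.HR", "restored") would recreate the folder as restored/docs.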

Download multiple Dropbox zip files from csv file

I have a .csv file that contains ~100 links to Dropbox files. My current method downloads the files without the ?dl=0 suffix, which seems to be critical:
#import packages
import pandas as pd
import wget
#read the .csv file, iterate through each row and download it
data = pd.read_csv("BRAIN_IMAGING_SUMSTATS.csv")
for index, row in data.iterrows():
    print(row['Links'])
    filename = row['Links']
    wget.download(filename)
Output:
https://www.dropbox.com/s/xjtu071g7o6gimg/metal_roi_volume_dec12_2018_pheno1.txt.zip?dl=0
https://www.dropbox.com/s/9oc9j8zhd4mn113/metal_roi_volume_dec12_2018_pheno2.txt.zip?dl=0
https://www.dropbox.com/s/0jkdrb76i7rixa5/metal_roi_volume_dec12_2018_pheno3.txt.zip?dl=0
https://www.dropbox.com/s/gu5p46bakgvozs5/metal_roi_volume_dec12_2018_pheno4.txt.zip?dl=0
https://www.dropbox.com/s/8zfpfscp8kdwu3h/metal_roi_volume_dec12_2018_pheno5.txt.zip?dl=0
These look like the correct links, but the downloaded files are named like
metal_roi_volume_dec12_2018_pheno1.txt.zip instead of metal_roi_volume_dec12_2018_pheno1.txt.zip?dl=0, so I cannot unzip them. Any ideas on how to download the actual Dropbox files?
By default (without extra URL parameters, or with dl=0 like in your example), Dropbox shared links point to an HTML preview page for the linked file, not the file data itself. Your code as-is will download the HTML, not the actual zip file data.
You can modify these links for direct file access though, as documented in this Dropbox help center article.
So, you should modify the link, e.g., to use raw=1 instead of dl=0, before calling wget.download on it.
A quick fix would be something like:
#import packages
import pandas as pd
import wget
import os
from urllib.parse import urlparse
#read the .csv file, iterate through each row and download it
data = pd.read_csv("BRAIN_IMAGING_SUMSTATS.csv")
for index, row in data.iterrows():
    print(row['Links'])
    filename = row['Links']
    parsed = urlparse(filename)
    fname = os.path.basename(parsed.path)
    wget.download(filename, fname)
Basically, you extract the filename from the URL and then pass it as the output parameter of wget.download.
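The two answers can be combined: rewrite each link to use raw=1 instead of dl=0, and take the local filename from the URL path. A minimal sketch using only urllib.parse follows; the helper names to_direct_link and local_name are hypothetical, and the download call is left as a comment because it needs network access and the wget package.

```python
import os
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def to_direct_link(url):
    """Replace any dl/raw query parameter with raw=1."""
    parts = urlparse(url)
    # Keep unrelated query parameters, drop existing dl/raw values.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in ("dl", "raw")]
    query.append(("raw", "1"))
    return urlunparse(parts._replace(query=urlencode(query)))


def local_name(url):
    """Use the last path segment of the URL as the local filename."""
    return os.path.basename(urlparse(url).path)


# Assumed usage inside the loop above:
#   wget.download(to_direct_link(link), local_name(link))
```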
