How to read all CSV files that begin with a consonant? - apache-spark

import os
for file in os.listdir("/content/drive/MyDrive/BigData2021/Lecture23/datasets"):
    if file.endswith(".csv"):
        print(os.path.join(file))
cities.csv
airports.csv
data_scientist_salaries.csv
I want to read with Spark only the CSV files whose names begin with a consonant, without listing each filename. How can I do that?

Using wildcard [b-df-hj-np-tv-z]*.csv in the path should do the job:
df = spark.read.csv("/your_directory/datasets/[b-df-hj-np-tv-z]*.csv")
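If some filenames start with an uppercase consonant as well, the same idea extends by widening the character class; a minimal sketch (assuming the files have a header row, which the question does not state):
# hypothetical variant: match lower- and upper-case consonants and read with a header
df = (spark.read
      .option("header", True)
      .csv("/your_directory/datasets/[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]*.csv"))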

Related

Downloading zip files with more than 1 csv and append them all to a dataframe

I am trying to read csv files from a .zip file. The code works fine when the zip file has only one file but fails when there's more than one file.
import pandas as pd

DATE = ['2010', '2011']
x = []
for i in range(len(DATE)):
    file_i = 'http://url_of_file/file_' + DATE[i] + '.zip'
    x.append(file_i)
xi = []
for filename in x:
    dfi = pd.read_csv(filename, index_col=None, header=0, sep=';')
    xi.append(dfi)
dfi = pd.concat(xi, axis=0, ignore_index=True)
What's the best way to deal with zip files that contain more than one CSV file?
I know I can manually unzip each zip file and read from there, but there are a lot of zip files, so I want to avoid unzipping them by hand. Unzipping them locally through code and reading from there would be fine.
What's the best way here?
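A hypothetical sketch of one way to handle this, not taken from the original post: download each zip into memory and read every CSV member it contains, assuming the same URL pattern as above:
import io
import zipfile
import urllib.request
import pandas as pd

frames = []
for year in ['2010', '2011']:
    url = 'http://url_of_file/file_' + year + '.zip'
    # download the archive into memory
    with urllib.request.urlopen(url) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    # read every CSV member of the archive
    for name in archive.namelist():
        if name.endswith('.csv'):
            frames.append(pd.read_csv(archive.open(name), sep=';', header=0))
df = pd.concat(frames, axis=0, ignore_index=True)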

Python on Chrome OS through Linux: Cannot Export DataFrame [duplicate]

I am trying to write a DataFrame to a .csv file:
import datetime

now = datetime.datetime.now()
date = now.strftime("%Y-%m-%d")
enrichedDataDir = "/export/market_data/temp"
enrichedDataFile = enrichedDataDir + "/marketData_optam_" + date + ".csv"
dbutils.fs.ls(enrichedDataDir)
df.to_csv(enrichedDataFile, sep='; ')
This throws the following error:
IOError: [Errno 2] No such file or directory:
'/export/market_data/temp/marketData_optam_2018-10-12.csv'
But when I do
dbutils.fs.ls(enrichedDataDir)
Out[72]: []
there is no error! And when I go one directory level higher:
enrichedDataDir = "/export/market_data"
dbutils.fs.ls(enrichedDataDir)
Out[74]:
[FileInfo(path=u'dbfs:/export/market_data/temp/', name=u'temp/', size=0L)
FileInfo(path=u'dbfs:/export/market_data/update/', name=u'update/', size=0L)]
This works, too, which tells me that all the folders I want to access really exist. But I don't know why the .to_csv call throws the error. I have also checked the permissions, which are fine!
The main problem was that I am using Microsoft Azure Data Lake Store to store those .csv files, and for whatever reason it is not possible to write to Azure Data Lake Store through df.to_csv.
Because I was trying to use df.to_csv, I was working with a Pandas DataFrame instead of a Spark DataFrame.
I changed to
from pyspark.sql import *
df = spark.createDataFrame(result,['CustomerId', 'SalesAmount'])
and then write to CSV with the following lines:
from pyspark.sql import *
df.coalesce(2).write.format("csv").option("header", True).mode("overwrite").save(enrichedDataFile)
And it works.
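Note that Spark's DataFrameWriter creates a directory of part files at the given path rather than a single CSV with that exact name; to load the result back you point the reader at that directory (a small sketch, assuming the same enrichedDataFile path):
# read the written CSV part files back as one DataFrame
df_back = spark.read.option("header", True).csv(enrichedDataFile)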
Here is a more general answer.
If you want to load a file from DBFS into a Pandas DataFrame, you can use this trick.
Copy the file from DBFS to the local file system:
%fs cp dbfs:/FileStore/tables/data.csv file:/FileStore/tables/data.csv
Then read the data from the local path:
data = pd.read_csv('file:/FileStore/tables/data.csv')
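If you prefer to stay in Python instead of the %fs magic, the same copy can be done with dbutils.fs.cp (a sketch, assuming a Databricks notebook where dbutils is available):
# copy from DBFS to the driver's local file system, then read with pandas
dbutils.fs.cp("dbfs:/FileStore/tables/data.csv", "file:/FileStore/tables/data.csv")
data = pd.read_csv('file:/FileStore/tables/data.csv')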
Thanks
Have you tried opening the file first? (Replace the last line of your first example with the code below.)
from os import makedirs
makedirs(enrichedDataDir)
with open(enrichedDataFile, 'w') as output_file:
    df.to_csv(output_file, sep='; ')
Check the permissions on the SAS token you used for the container when you mounted this path. If it starts with "sp=racwdlmeopi", then you have a SAS token with immutable storage; your token should start with "sp=racwdlmeop".

Read a UTF-16LE file directly in a Cloud Function - Python/GCP

I have a CSV file with UTF-16LE encoding. I tried to open it in a Cloud Function using
import pandas as pd
from io import StringIO as sio

with open("gs://bucket_name/my_file.csv", "r", encoding="utf16") as f:
    read_all_once = f.read()
read_all_once = read_all_once.replace('"', "")
file_like = sio(read_all_once)
df = pd.read_csv(file_like, sep=";", skiprows=5)
I get an error that the file is not found at that location. What is the issue? When I run the same code locally with a local path, it works.
Also, when the file is in UTF-8 encoding I can read it directly with
df = pd.read_csv("gs://bucket_name/my_file.csv", delimiter=";", encoding="utf-8", skiprows=0, low_memory=False)
Can I read the UTF-16 file directly with pd.read_csv()? If not, how do I make open() recognize the path?
Thanks in advance!
Yes, you can read the UTF-16 CSV file directly with the pd.read_csv() method.
For the method to work, please make sure that the service account attached to your function has access to read the CSV file in the Cloud Storage bucket.
Please check whether the encoding of the CSV file you are using is "utf-16", "utf-16le", or "utf-16be", and pass the appropriate one to the method.
I used the Python 3.7 runtime.
My main.py and requirements.txt files look as below; you can modify main.py according to your use case.
main.py
import pandas as pd

def hello_world(request):
    # please change the file's URI
    data = pd.read_csv('gs://bucket_name/file.csv', encoding='utf-16le')
    print(data)
    return 'check the results in the logs'
requirements.txt
pandas==1.1.0
gcsfs==0.6.2

How to read or open a qrel format file?

I was working with a TREC qrel file and I would like to have a look at it. How do I read a qrel file, or how can I open it? What is the format? What library should I use?
If you rename the file to a .txt file, you will see that it has multiple columns, one of which is the relevance judgment.
If you are used to working with CSV files and Python Pandas DataFrames, you can follow these steps:
Rename the qrel file with a .txt extension (just so that you can read it in Notepad or a similar editor).
Read the file as a usual .txt line by line and push it into a CSV file.
Off the top of my head, here is a simple snippet in Python you could try:
import pandas as pd

rel_query = []
with open('/content/renamed_qrel.qrel.txt', 'r') as fp:
    Lines = fp.readlines()
for line in Lines:
    # The line below may need to be changed based on the type of data in the qrel file
    rel_query.append(line.split())
qrel_df = pd.DataFrame(rel_query)
NOTE: Although this may or may not be the right way to do it, it can surely help you get started.
I think the right way of doing this would be as follows:
import pandas as pd

df = pd.read_csv('abcd.txt',
                 sep="\s+",                   # or whichever separator
                 names=["A", "B", "C", "D"])  # for header names

read_csv one file from several files in a gzip?

I have several files in my tar.gz archive. I want to read only one of them into a pandas DataFrame. Is there any way to do that?
Pandas can read a file inside a .gz, but there seems to be no way to tell it to read a specific one when there are several files inside the archive.
Would appreciate any thoughts.
Babak
To read a specific file in a compressed archive we just need to give its name or position. For example, to read a specific CSV file in a zipped folder we can just open that file and read its contents:
from zipfile import ZipFile
import pandas as pd

# open the zip file in read mode
with ZipFile("results.zip") as z:
    read = pd.read_csv(z.open(z.infolist()[2].filename))
    print(read)
Here is what the contents of results.zip look like; I want to read test.csv:
$ data_description.txt sample_submission.csv test.csv train.csv
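Since the question mentions a tar.gz archive rather than a zip, tarfile can pull out a single member by name in the same spirit; a sketch, assuming an archive results.tar.gz that contains test.csv:
import tarfile
import pandas as pd

# open the tar.gz archive and read one member by name
with tarfile.open("results.tar.gz", "r:gz") as t:
    member = t.extractfile("test.csv")
    df = pd.read_csv(member)
print(df.head())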
If you use pardata, you can do this in one line:
import pardata
data = pardata.load_dataset_from_location('path-to-zip.zip')['table/csv']
The returned data variable should be a dictionary of all csv files in the zip archive.
Disclaimer: I'm one of the main co-authors of pardata.
