I am trying to write a DataFrame to a .csv file:
import datetime

now = datetime.datetime.now()
date = now.strftime("%Y-%m-%d")
enrichedDataDir = "/export/market_data/temp"
enrichedDataFile = enrichedDataDir + "/marketData_optam_" + date + ".csv"
dbutils.fs.ls(enrichedDataDir)
df.to_csv(enrichedDataFile, sep='; ')
This throws the following error:
IOError: [Errno 2] No such file or directory:
'/export/market_data/temp/marketData_optam_2018-10-12.csv'
But when I do
dbutils.fs.ls(enrichedDataDir)
Out[72]: []
There is no error! When I go one directory level higher:
enrichedDataDir = "/export/market_data"
dbutils.fs.ls(enrichedDataDir)
Out[74]:
[FileInfo(path=u'dbfs:/export/market_data/temp/', name=u'temp/', size=0L),
 FileInfo(path=u'dbfs:/export/market_data/update/', name=u'update/', size=0L)]
This works, too, which tells me that all the folders I want to access really exist. But I don't know why the .to_csv call throws the error. I have also checked the permissions, which are fine!
The main problem was that I am using Microsoft Azure Data Lake Store for storing those .csv files, and for whatever reason it is not possible to write to Azure Data Lake Store through df.to_csv.
Because I was trying to use df.to_csv, I was working with a Pandas DataFrame instead of a Spark DataFrame.
I changed to
from pyspark.sql import *
df = spark.createDataFrame(result,['CustomerId', 'SalesAmount'])
and then write to CSV via the following line:
df.coalesce(2).write.format("csv").option("header", True).mode("overwrite").save(enrichedDataFile)
And it works.
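For completeness, a small sketch (my addition, not part of the original answer) to read the written data back and confirm it landed in the Data Lake Store:

# Sketch: read the folder Spark just wrote and show a few rows (path reused from above).
check = spark.read.format("csv").option("header", True).load(enrichedDataFile)
check.show(5)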
Here is a more general answer.
If you want to load a file from DBFS into a Pandas DataFrame, you can use this trick.
Copy the file from dbfs:/ to the local file:/ filesystem:
%fs cp dbfs:/FileStore/tables/data.csv file:/FileStore/tables/data.csv
Read the data from the local path:
data = pd.read_csv('file:/FileStore/tables/data.csv')
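Equivalently (a small sketch of mine, not from the answer above), the copy can be done with dbutils instead of the %fs magic:

# Copy from DBFS to the driver-local filesystem, then read with pandas.
dbutils.fs.cp("dbfs:/FileStore/tables/data.csv", "file:/tmp/data.csv")

import pandas as pd
data = pd.read_csv("/tmp/data.csv")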
Thanks
Have you tried opening the file first? (Replace the last line of your first example with the code below.)
from os import makedirs
makedirs(enrichedDataDir)

with open(enrichedDataFile, 'w') as output_file:
    df.to_csv(output_file, sep='; ')
Check the permissions on the SAS token you used for the container when you mounted this path. If it starts with "sp=racwdlmeopi", then you have a SAS token with immutable storage; your token should start with "sp=racwdlmeop".
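For reference, a sketch of one common SAS-based mount pattern on Databricks (Blob Storage shown; ADLS Gen2 uses abfss and different configs), so you can remount with a token whose permission string does not include the immutable-storage flag. All names below are placeholders, not taken from the question:

# Placeholder names throughout; adjust container, account, mount point, and token.
dbutils.fs.unmount("/mnt/market_data")

dbutils.fs.mount(
    source="wasbs://<container>@<storage-account>.blob.core.windows.net",
    mount_point="/mnt/market_data",
    extra_configs={
        "fs.azure.sas.<container>.<storage-account>.blob.core.windows.net": "<sas-token>"
    },
)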
Related
Currently I load multiple parquet files with this code:
df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")
(Inside the Voucher folder there is one folder per date, each containing one parquet file.)
How can I add the creation date of each parquet file into my DataFrame ?
Thanks
EDIT 1:
Thanks rainingdistros, I wrote this:
import os
from datetime import datetime, timedelta
Path = "/dbfs/mnt/dev/bronze/Voucher/2022-09-23/"
fileFull = Path +'/'+'XXXXXX.parquet'
statinfo = os.stat(fileFull)
create_date = datetime.fromtimestamp(statinfo.st_ctime)
display(create_date)
Now I must find a way to loop through all the files and add a column in the DataFrame.
The information returned by os.stat might not be accurate unless adding the additional column with the creation time is the first operation performed on these files.
Each time a file is modified, both st_mtime and st_ctime are updated to the modification time, and the change can be observed in the information returned by os.stat.
So, if adding this column is the first operation that is going to be performed on these files, then you can use the following code to add this date as a column to your files.
import os
from datetime import datetime
import pandas as pd

path = "/dbfs/mnt/repro/2022-12-01"
fileinfo = os.listdir(path)

for file in fileinfo:
    pdf = pd.read_csv(f"{path}/{file}")
    display(pdf)  # Databricks notebook display
    statinfo = os.stat(f"{path}/{file}")
    create_date = datetime.fromtimestamp(statinfo.st_ctime)
    pdf['creation_date'] = [create_date.date()] * len(pdf)
    pdf.to_csv(f"{path}/{file}", index=False)
After running the code, these files will have the new column.
It might be better to take the date directly from the folder name in this case, since the information is already available; all that needs to be done is to extract it and add it as a column, in a similar manner to the code above.
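For instance, a minimal sketch along those lines (my assumption about the layout: .../Voucher/<yyyy-MM-dd>/<file>.parquet, as described in the question) that pulls the date out of each row's source file path:

from pyspark.sql.functions import input_file_name, regexp_extract, to_date

df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")

# Extract the yyyy-MM-dd folder name from the file path of each row.
df = df.withColumn(
    "creation_date",
    to_date(regexp_extract(input_file_name(), r"Voucher/(\d{4}-\d{2}-\d{2})/", 1))
)
df.select("creation_date").distinct().show()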
See if the steps below help.
Refer to this link to get the list of files in DBFS: SO - Loop through Files in DBFS
Once you have the files, loop through them and for each file use the code you have written in your question.
Please note that dbutils only exposes the mtime of a file. The os module provides a way to get ctime, i.e. the time of the most recent metadata change on Unix; ideally st_birthtime would give the creation time, but that did not seem to work in my trials. Hope it works for you.
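If your runtime does expose modification times through dbutils (newer Databricks runtimes add a modificationTime field, in epoch milliseconds, to the FileInfo objects returned by dbutils.fs.ls), a sketch like this can list them; treat the field name as an assumption to verify on your runtime:

from datetime import datetime

# Path taken from the question; adjust to your folder.
for info in dbutils.fs.ls("dbfs:/mnt/dev/bronze/Voucher/2022-09-23/"):
    print(info.path, datetime.fromtimestamp(info.modificationTime / 1000))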
I am using a Databricks notebook to read a file from Azure Blob Storage and write it back into the same location. But when I write, I get a lot of files with different names in the location I specified, and I am not sure why they are created there. A folder with the name "new_location" was also created after I performed the write operation.
What I want is that, after reading the file from Azure Blob Storage, I write it back into the same location with the same name as the original, but I am unable to do so. Please help me out, as I am new to PySpark.
I have already mounted the storage, and now I am reading the CSV file stored in an Azure Blob Storage container.
The overwritten file is created with the name "part-00000-tid-84371752119947096-333f1e37-6fdc-40d0-97f5-78cee0b108cf-31-1-c000.csv"
Code:
df = spark.read.csv("/mnt/ndemo/nsalman/addresses.csv", inferSchema=True)
df = df.toDF("firstName", "lastName", "street", "town", "city", "code")
df.show()

file_location_new = "/mnt/ndemo/nsalman/new_location"

# write the dataframe as a single file to blob storage
df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
Spark saves one partial CSV file per partition of your dataset. To generate a single CSV file, you can convert it to a Pandas DataFrame and then write it out.
Try to change these lines:
df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
to this line
df.toPandas().to_csv(file_location_new, header=True)
You might need to prepend "/dbfs/" to file_location_new for this to work.
Here is a minimal self-contained example that demonstrates how to write a CSV file with pandas:
df = spark.createDataFrame([(1,3),(2,2),(3,1)], ["Testing", "123"])
df.show()
df.toPandas().to_csv("/dbfs/mnt/ndemo/nsalman/testfile.csv", header=True)
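If you specifically need the output to be a single file with a fixed name (as asked in the question), an alternative sketch (my suggestion, not from the answer above) keeps the Spark writer and then copies the part file to the desired name with dbutils; the target name is an assumption:

# Paths reuse the ones from the question; adjust the target name as needed.
tmp_dir = "/mnt/ndemo/nsalman/new_location"
target = "/mnt/ndemo/nsalman/addresses.csv"

df.coalesce(1).write.mode("overwrite").option("header", "true").csv(tmp_dir)

# Spark writes a directory of part files; copy the single part file to the target name.
part_file = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, target)
dbutils.fs.rm(tmp_dir, True)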
I have a CSV file with UTF-16LE encoding, and I tried to open it in a Cloud Function using
import pandas as pd
from io import StringIO as sio
with open("gs://bucket_name/my_file.csv", "r", encoding="utf16") as f:
    read_all_once = f.read()
read_all_once = read_all_once.replace('"', "")
file_like = sio(read_all_once)
df = pd.read_csv(file_like, sep=";", skiprows=5)
I get an error that the file is not found at that location. What is the issue? When I run the same code locally with a local path, it works.
Also, when the file is in UTF-8 encoding I can read it directly with
df = pd.read_csv("gs://bucket_name/my_file.csv", delimiter=";", encoding="utf-8", skiprows=0, low_memory=False)
I need to know: can I read the UTF-16 file directly with pd.read_csv()? If not, how do I make open() recognize the gs:// path?
Thanks in advance!
Yes, you can read the UTF-16 CSV file directly with the pd.read_csv() method; pandas resolves gs:// URLs through gcsfs, whereas the builtin open() does not understand them.
For the method to work, please make sure that the service account attached to your function has access to read the CSV file in the Cloud Storage bucket.
Please check whether the encoding of the CSV file you are using is "utf-16", "utf-16le", or "utf-16be", and use the appropriate one in the method.
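If you are unsure which UTF-16 variant you have, one way to check (a sketch, assuming the gcsfs package that pandas already uses for gs:// paths is installed; bucket and file names are placeholders) is to peek at the byte-order mark:

import gcsfs

# Read the first two bytes and inspect the BOM.
fs = gcsfs.GCSFileSystem()
with fs.open("bucket_name/my_file.csv", "rb") as f:
    bom = f.read(2)

if bom == b"\xff\xfe":
    print("utf-16le (or plain 'utf-16', which consumes the BOM)")
elif bom == b"\xfe\xff":
    print("utf-16be (or plain 'utf-16', which consumes the BOM)")
else:
    print("no BOM found; try 'utf-16le' or 'utf-16be' explicitly")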
I used the Python 3.7 runtime. My main.py and requirements.txt files look as below; you can modify main.py according to your use case.
main.py
import pandas as pd

def hello_world(request):
    # please change the file's URI
    data = pd.read_csv('gs://bucket_name/file.csv', encoding='utf-16le')
    print(data)
    return 'check the results in the logs'
requirements.txt
pandas==1.1.0
gcsfs==0.6.2
I am writing a Lambda to read some data from a CSV into a DataFrame, manipulate that data, then convert it back to a CSV and make an API call with the new CSV, all in a Python Lambda.
I am running into an issue with the pandas.read_csv command: it ends my Lambda's trigger execution with no errors.
os.chdir('/tmp')
for root, dirs, files in os.walk('/tmp', topdown=True):
    for name in files:
        if '.csv' in name:
            testdic[name] = root
            print(os.path.isfile('/tmp/' + name))
            print(os.path.isfile(name))
            df = pd.read_csv(name)
            df = pd.read_csv('/tmp/' + name)
Both os.path.isfile calls return True, and I have tried both versions of read_csv; neither works, and the Lambda ends prematurely without an error.
I have confirmed the CSV is downloaded into the Lambda's /tmp directory: I can read and print rows of the CSV from /tmp. However, when I run pd.read_csv('/tmp/file.csv'), or change my directory to /tmp and run pd.read_csv('file.csv'), the Lambda ends with no error and does not get past that point in the code. I am using pandas 0.23.4, as that is what I need to use, and the code works locally. Any suggestions would be helpful.
The expected result is the CSV being read into a DataFrame so I can manipulate it.
FIXED: I could not just use '/tmp/' + filename; I had to use os.path.join(root, filename). I also had to increase the timeout of my Lambda due to the file size.
os.path.join works across platforms.
Use
file_path = os.path.join(root, name)
and then
pd.read_csv(file_path)
NOTE: Increase the AWS Lambda timeout as suggested in the comments by @Gabe Maurer.
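Put together, a minimal sketch of the corrected loop from the question (filenames and the print are just illustrative):

import os
import pandas as pd

for root, dirs, files in os.walk('/tmp', topdown=True):
    for name in files:
        if name.endswith('.csv'):
            file_path = os.path.join(root, name)  # platform-safe path
            df = pd.read_csv(file_path)
            print(name, df.shape)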
I am currently working on an automation project for a company, and one of the tasks requires that I loop through a directory and convert all the PDF files into CSV files. I am using the camelot-py library (which has been better than the others I have tried). When I apply the code below to a single file, it works just fine; however, I wish to make it loop through all PDF files in the directory. I get the following error with the code below:
"OSError: [Errno 22] Invalid argument"
import camelot
import csv
import pandas as pd
import os
directoryPath = r'Z:\testDirectory'
os.chdir(directoryPath)
print(os.listdir())
folderList = os.listdir(directoryPath)
for folders, sub_folders, file in os.walk(directoryPath):
    for name in file:
        if name.endswith(".pdf"):
            filename = os.path.join(folders, name)
            print(filename)
            print(name)
            tables = camelot.read_pdf(filename, flavor='stream', columns=['72,73,150,327,442,520,566,606,683'])
            tables = tables[0].df
            print(tables[0].parsing_report)
            tables.to_csv('foo2.csv')
I expect all files to be converted to .csv files, but I get the error 'OSError: [Errno 22] Invalid argument'. My error appears to come from line 16.
I don't know if you have the same problem, but in my case I made the really silly mistake of not putting the files in the correct directory. I was getting the same error, but once I found the problem, the script worked within a regular for loop.
Instead of the to_* methods, I am using the bulk export to export the results to SQL, but that should not be a problem.
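For anyone hitting the same thing, here is a sketch of the loop with the files in the right directory and a unique output name per PDF (my adaptation of the question's code, not the poster's final script); camelot's bulk export, e.g. tables.export('out.sqlite', f='sqlite'), is what the comment above refers to:

import os
import camelot

directoryPath = r'Z:\testDirectory'

for folders, sub_folders, files in os.walk(directoryPath):
    for name in files:
        if name.endswith(".pdf"):
            filename = os.path.join(folders, name)
            tables = camelot.read_pdf(filename, flavor='stream',
                                      columns=['72,73,150,327,442,520,566,606,683'])
            # one CSV per PDF instead of overwriting foo2.csv on every iteration
            out_csv = os.path.splitext(filename)[0] + '.csv'
            tables[0].df.to_csv(out_csv, index=False)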