How to efficiently read the data lake files' metadata [duplicate] - azure

This question already has answers here:
script to get the file last modified date and file name pyspark
(3 answers)
Closed 1 year ago.
I want to read the last modified datetime of the files in the data lake in a Databricks script. If I could read it efficiently as a column when reading data from the data lake, that would be perfect.
Thank you:)
UPDATE:
If you're working in Databricks: since Databricks Runtime 10.4, released on Mar 18, 2022, the dbutils.fs.ls() command returns the modificationTime of folders and files as well.
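For example, a minimal sketch (the abfss path is a placeholder and the column names are illustrative) that turns the listing into a DataFrame with a proper timestamp column:

from pyspark.sql.functions import col

files = dbutils.fs.ls("abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/")

# modificationTime is reported in milliseconds since the epoch
files_df = (
    spark.createDataFrame(
        [(f.path, f.name, f.size, f.modificationTime) for f in files],
        ["path", "name", "size", "modificationTime"],
    )
    .withColumn("modified", (col("modificationTime") / 1000).cast("timestamp"))
)
files_df.show(truncate=False)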

Otherwise, you can go through the Hadoop FileSystem API; please refer to the following code:
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem

conf = sc._jsc.hadoopConfiguration()
conf.set(
    "fs.azure.account.key.<account-name>.dfs.core.windows.net",
    "<account-access-key>")

path = Path('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/')
fs = path.getFileSystem(conf)
status = fs.listStatus(path)
for i in status:
    print(i)
    # getModificationTime() returns milliseconds since the epoch
    print(i.getModificationTime())
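If the goal from the question, reading the modification time directly as a column while loading the data, is the priority, newer runtimes (Apache Spark 3.3+ / Databricks Runtime 10.5+, an assumption to verify for your workspace) expose a hidden _metadata column for file-based sources. A minimal sketch with a placeholder path and format:

# Select the hidden _metadata struct alongside the data columns;
# _metadata.file_modification_time is a timestamp column.
df = (
    spark.read.format("parquet")  # or csv/json/... depending on your files
    .load("abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/")
    .select("*", "_metadata.file_name", "_metadata.file_modification_time")
)
df.show(truncate=False)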

On older runtimes without these options, we can get those details with plain Python code via the Azure Storage SDK, since there is no direct Spark method to get the files' modified date and time.
Here is the code:
from azure.storage.blob import BlockBlobService
from datetime import datetime

block_blob_service = BlockBlobService(account_name='account-name', account_key='account-key')
container_name = 'container-firstname'
second_container_name = 'container-Second'
#block_blob_service.create_container(container_name)

generator = block_blob_service.list_blobs(container_name, prefix="Recovery/")
report_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
for blob in generator:
    # fetch the blob's properties once instead of making several identical calls
    properties = block_blob_service.get_blob_properties(container_name, blob.name).properties
    file_size = properties.content_length
    last_modified = properties.last_modified
    line = container_name + '|' + second_container_name + '|' + blob.name + '|' + str(file_size) + '|' + str(last_modified) + '|' + str(report_time)
    print(line)
For more details, refer to the SO thread addressing a similar issue.
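As a side note, if you're on the current azure-storage-blob (v12) SDK rather than the legacy BlockBlobService used above, the listing itself already carries the size and last-modified metadata, so no per-blob properties call is needed. A minimal sketch with placeholder credentials:

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<account-name>.blob.core.windows.net",
    credential="<account-key>",
)
container = service.get_container_client("<container-name>")

# list_blobs yields BlobProperties objects with name, size and last_modified populated
for blob in container.list_blobs(name_starts_with="Recovery/"):
    print(blob.name, blob.size, blob.last_modified)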

Related

Read excel files and append to make one data frame in Databricks from azure data lake without specific file names

I am storing Excel files in Azure Data Lake (Gen1). The filenames follow the same pattern, e.g. "2021-06-18T09_00_07ONR_Usage_Dataset", "2021-06-18T09_00_07DSS_Usage_Dataset", etc., depending on the date and time. I want to read all the files in the folder in Azure Data Lake into Databricks without having to name each specific file, so that in the future new files are read and appended to make one big dataset. The files all have the same schema, the columns are in the same order, etc.
So far I have tried for loops with regex expressions:
path = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
for fi in path:
    print(fi)
    read = spark.read.format("com.crealytics.spark.excel").option("header", "True").option("inferSchema", "true").option("dataAddress", "'Usage Dataset'!A2").load(fi.path)
    display(read)
    print(read.count())
The output prints all the paths and the count of each dataset being read, but it only displays the last one. I understand that this is because I'm not storing or appending the results in the for loop, but when I add an append it breaks.
appended_data = []
path = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
for fi in path:
    print(fi)
    read = spark.read.format("com.crealytics.spark.excel").option("header", "True").option("inferSchema", "true").option("dataAddress", "'Usage Dataset'!A2").load(fi.path)
    display(read)
    print(read.count())
    appended_data.append(read)
But I get this error:
FileInfo(path='dbfs:/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/Initialization_DSS.xlsx', name='Initialization_DSS.xlsx', size=39781)
TypeError: not supported type: <class 'py4j.java_gateway.JavaObject'>
The final way I tried:
li = []
for f in glob.glob('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/*_Usage_Dataset.xlsx'):
    df = pd.read_xlsx(f)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
This says that there are no objects to concatenate. I have been researching everywhere and trying everything. Please help.
If you want to use pandas to read Excel files in Databricks, the path should be like /dbfs/mnt/....
For example
import os
import glob
import pandas as pd

li = []
os.chdir(r'/dbfs/mnt/<mount-name>/<>')
allFiles = glob.glob("*.xlsx")  # match your Excel files
for file in allFiles:
    df = pd.read_excel(file)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
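If you would rather stay in Spark instead of going through pandas, one option is to read each matching file with the com.crealytics reader from the question and union the results. A sketch, assuming (as the question states) that all files share the same schema:

from functools import reduce
from pyspark.sql import DataFrame

files = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')

# read every *_Usage_Dataset.xlsx file into its own DataFrame
dfs = [
    spark.read.format("com.crealytics.spark.excel")
        .option("header", "True")
        .option("inferSchema", "true")
        .option("dataAddress", "'Usage Dataset'!A2")
        .load(fi.path)
    for fi in files
    if fi.name.endswith("_Usage_Dataset.xlsx")
]

# union the per-file DataFrames into one big dataset
big_df = reduce(DataFrame.unionByName, dfs)
display(big_df)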

How to read csv file with using pandas and cloud functions in GCP?

I am trying to read a CSV file that was uploaded to GCS.
I want to read the CSV file with Cloud Functions in GCP and work with the CSV data as a DataFrame, but I can't read the CSV file using pandas.
This is the code that reads the CSV file on GCS using Cloud Functions.
from io import BytesIO
import pandas as pd
from google.cloud import storage as gcs

def read_csvfile(data, context):
    try:
        bucket_name = "my_bucket_name"
        file_name = "my_csvfile_name.csv"
        project_name = "my_project_name"
        # create gcs client
        client = gcs.Client(project_name)
        bucket = client.get_bucket(bucket_name)
        # create blob
        blob = gcs.Blob(file_name, bucket)
        content = blob.download_as_string()
        train = pd.read_csv(BytesIO(content))
        print(train.head())
    except Exception as e:
        print("error:{}".format(e))
When I ran my Python code, I got the following error:
No columns to parse from file
Some websites say that this error means the CSV file is empty, but the file I uploaded is not empty.
So how can I solve this problem?
Please give me your help. Thanks.
----added on 2020/08/08-------
Thank you for your help!
But in the end I could not read the CSV file with your code; I still get the error "No columns to parse from file".
So I tried a new way of reading the CSV file as bytes. The new Python code is below.
MAIN.PY
from google.cloud import storage
import pandas as pd
import io
import csv
from io import BytesIO
def check_columns(data, context):
    try:
        object_name = data['name']
        bucket_name = data['bucket']
        storage_client = storage.Client()
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(object_name)
        data = blob.download_as_string()
        # read the uploaded csv file as bytes
        f = io.StringIO(str(data))
        df = pd.read_csv(f, encoding="shift-jis")
        print("df:{}".format(df))
        print("df.columns:{}".format(df.columns))
        print("The number of columns:{}".format(len(df.columns)))
    except Exception as e:
        print("error:{}".format(e))
REQUIREMENTS.TXT
Click==7.0
Flask==1.0.2
itsdangerous==1.1.0
Jinja2==2.10
MarkupSafe==1.1.0
Pillow==5.4.1
qrcode==6.1
six==1.12.0
Werkzeug==0.14.1
google-cloud-storage==1.30.0
gcsfs==0.6.2
pandas==1.1.0
The output I got is below.
df:Empty DataFrame
Columns: [b'Apple, Lemon, Orange, Grape]
Index: []
df.columns:Index(['b'Apple', 'Lemon', 'Orange', 'Grape'])
The number of columns:4
So I could only read the first record of the CSV file, and it ended up in df.columns!? I could not get the other records of the CSV file, and that first "column" is not really a column but an ordinary record.
So how can I get the records of the CSV file as a DataFrame using pandas?
Could you help me again? Thank you.
Pandas, since version 0.24.1, can directly read a Google Cloud Storage URI.
For example:
gs://awesomefakebucket/my.csv
Your service account attached to your function must have access to read the CSV file.
Please, feel free to test and modify this code.
I used Python 3.7
function.py
from google.cloud import storage
import pandas as pd
def hello_world(request):
    # it is mandatory to initialize the storage client
    client = storage.Client()
    # please change the file's URI
    temp = pd.read_csv('gs://awesomefakebucket/my.csv', encoding='utf-8')
    print(temp.head())
    return 'check the results in the logs'
requirements.txt
google-cloud-storage==1.30.0
gcsfs==0.6.2
pandas==1.1.0
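If you prefer to keep the blob-download approach from the question's update, note that the "Empty DataFrame" result came from wrapping the raw bytes in str(), which turns the whole file into one b'...' literal. Decoding the bytes before handing them to pandas should fix it; a minimal sketch, assuming the file really is Shift-JIS encoded as in the question:

import io

import pandas as pd
from google.cloud import storage

def check_columns(data, context):
    storage_client = storage.Client()
    bucket = storage_client.bucket(data['bucket'])
    blob = bucket.blob(data['name'])

    # decode the downloaded bytes instead of calling str() on them
    text = blob.download_as_string().decode("shift-jis")
    df = pd.read_csv(io.StringIO(text))
    print(df.head())
    print("The number of columns:{}".format(len(df.columns)))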

Direct way to slice a string with f.read() to ultimately read in csv as pandas dataframe [duplicate]

This question already has answers here:
Pandas: how to designate starting row to extract data
(2 answers)
Closed 4 years ago.
I have a .csv file which I want to open and ultimately save as a pandas DataFrame. The file has some junk text above the data frame proper, whose header starts at the string Sample_ID. I wrote code that does the job in multiple steps, and I am now wondering if there's a more elegant way to do it. Here's my code:
import pandas as pd
import re
from io import StringIO
with open('SampleSheet.csv') as f:
    ## read in the .csv file as a string
    step1 = f.read()

## subset the step1 file
# define where my df should start
start = 'Sample_ID'
step2 = step1[step1.index(start):]

## read in step2 as a pandas dataframe with stringio
step3 = pd.read_csv(StringIO(step2))
I was wondering if there's a way to slice directly with f.read(), so that I would have one step fewer.
I also tried to use pd.read_csv() with skiprows, but I am having a hard time determining the row number of the line that starts with Sample_ID.
You can import and read the file using only read_csv() as follows:
df = pd.read_csv('SampleSheet.csv', header=3)
where header is the zero-based row number that holds your column names; all rows above it are skipped.
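If the amount of junk text varies between files, you can first find the line that starts with Sample_ID and pass its index to skiprows; a small sketch built on the question's file name:

import pandas as pd

with open('SampleSheet.csv') as f:
    # index of the first line starting with the header marker
    header_row = next(i for i, line in enumerate(f) if line.startswith('Sample_ID'))

df = pd.read_csv('SampleSheet.csv', skiprows=header_row)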

How to load information in two columns?

Question 1: The file phone.txt stores the lines in the format code:number
import pandas as pd
import sqlite3
con = sqlite3.connect('database.db')
data = pd.read_csv('phone.txt', sep='\t', header=None)
data.to_sql('post_table', con, if_exists='replace', index=False)
I want to load all the data from the phone.txt file into the database.db database, but everything ends up in one column. I need it loaded into two columns:
code
number
How to do it?
Question 2: after loading the information into the database, how can I look up the number by code? For example, I want to find out which number corresponds to code = 7 (answer: 9062621390).
Question 1
In your example pandas is not able to distinguish between the code and the number, since your file is :-separated rather than tab-separated. When reading the file you need to change the separator to : and also specify column names, since your file doesn't seem to have a header, like so:
data = pd.read_csv('phone.txt',
                   sep=':',
                   names=['code', 'number'])
Question 2
After putting your data into the database you can query it as follows:
number = pd.read_sql_query('SELECT number FROM post_table WHERE code = (?)',
                           con,
                           params=(code,))
where con is your sqlite connection.
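The query returns a one-row DataFrame, so for the question's example (code 7) the value itself can be pulled out like this:

code = 7
number = pd.read_sql_query('SELECT number FROM post_table WHERE code = (?)',
                           con,
                           params=(code,))
print(number['number'].iloc[0])  # 9062621390, per the example in the question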

pandas creating a dataframe from mysql database

So I have been trying to create a DataFrame from a MySQL database using pandas and Python, but I have encountered an issue I need help with.
The issue is that when writing the DataFrame to Excel, it only writes the last row, i.e. it overwrites all the previous entries and only the last row is written. Please see the code below:
import pandas as pd
import numpy
import csv

with open('C:path_to_file\\extract_job_details.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        jobid = str(row[1])
        statement = """select jt.job_id, jt.vendor_data_type, jt.id as TaskId, jt.create_time as CreatedTime, jt.job_start_time as StartedTime, jt.job_completion_time, jt.worker_path, j.id as JobId from dspe.job_task jt JOIN dspe.job j on jt.job_id = j.id where jt.job_id = %(jobid)s"""
        df_mysql = pd.read_sql(statement, con=mysql_cn)
        try:
            with pd.ExcelWriter(timestr + 'testResult.xlsx', engine='xlsxwriter') as writer:
                df_mysql.to_excel(writer, sheet_name='Sheet1')
        except pymysql.err.OperationalError as error:
            code, message = error.args

mysql_cn.close()
Please can anyone help me identify where I am going wrong?
PS: I am new to pandas and Python.
Thanks, Carlos
I'm not really sure what you're trying to do reading from disk and a database at the same time...
First, you don't need the csv module when you're already using pandas:
df = pd.read_csv("path/to/input/csv")
Next you can simply provide a file path as an argument to to_excel instead of an ExcelWriter instance:
df.to_excel("path/to/desired/excel/file")
If it doesn't actually need to be an excel file you can use:
df.to_csv("path/to/desired/csv/file")
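Coming back to the original symptom (only the last row ends up in the workbook): the Excel file is rewritten on every loop iteration, so only the final job's data survives. A hedged sketch that collects the per-job results first and writes the workbook once, reusing statement, mysql_cn and timestr from the question:

import csv

import pandas as pd

frames = []
with open('C:path_to_file\\extract_job_details.csv', 'r') as f:
    for row in csv.reader(f):
        jobid = str(row[1])
        # run the parameterised query once per job id
        frames.append(pd.read_sql(statement, con=mysql_cn, params={'jobid': jobid}))

# concatenate everything and write the Excel file a single time
pd.concat(frames, ignore_index=True).to_excel(timestr + 'testResult.xlsx',
                                              sheet_name='Sheet1', index=False)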
