How to read parquet files from Azure Blobs into a Pandas DataFrame?

I need to read .parquet files into a Pandas DataFrame in Python on my local machine without downloading the files. The parquet files are stored in Azure Blob Storage with a hierarchical directory structure.
I am doing something like the following and am not sure how to proceed:
from azure.storage.blob import BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = blob_service_client.get_blob_client(container="abc", blob="/xyz/pqr/folder_with_parquet_files")
I have used dummy names here for privacy reasons. Assuming the directory "folder_with_parquet_files" contains n parquet files, how can I read them into a single Pandas DataFrame?

Hi, you could use pandas and read the parquet from a stream. This is very handy for small data sets since no Spark session is required here, and it can be the fastest way, especially for testing purposes.
import pandas as pd
from io import BytesIO
from azure.storage.blob import ContainerClient

path = '/path_to_blob/..'          # path to the blob inside the container
conn_string = '<conn_string>'      # storage account connection string
blob_name = f'{path}.parquet'

container = ContainerClient.from_connection_string(conn_str=conn_string, container_name='<name_of_container>')
blob_client = container.get_blob_client(blob=blob_name)

# Download the blob into an in-memory buffer and let pandas/pyarrow parse it
stream_downloader = blob_client.download_blob()
stream = BytesIO()
stream_downloader.readinto(stream)
processed_df = pd.read_parquet(stream, engine='pyarrow')
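The question mentions a folder holding several parquet files, which the snippet above does not cover. A minimal sketch of one way to handle that, assuming the same placeholder connection string and container name and the dummy path from the question: list the blobs with list_blobs(name_starts_with=...), read each one into a DataFrame and concatenate them.
# Sketch: read every parquet blob under a prefix into a single DataFrame.
# Connection string, container name and prefix are placeholders taken from the question.
import pandas as pd
from io import BytesIO
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(conn_str='<conn_string>', container_name='<name_of_container>')

frames = []
for blob in container.list_blobs(name_starts_with='xyz/pqr/folder_with_parquet_files/'):
    if not blob.name.endswith('.parquet'):
        continue
    buf = BytesIO()
    container.download_blob(blob.name).readinto(buf)
    buf.seek(0)
    frames.append(pd.read_parquet(buf, engine='pyarrow'))

single_df = pd.concat(frames, ignore_index=True)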

Here is a very similar solution, slightly different in that it uses the newer method azure.storage.blob._download.StorageStreamDownloader.readall:
import pandas as pd
from io import BytesIO
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(container="parquet")

# upload_name is the name of the parquet blob inside the container
downloaded_blob = container_client.download_blob(upload_name)
bytes_io = BytesIO(downloaded_blob.readall())
df = pd.read_parquet(bytes_io)
print(df.head())

The get_blob_to_bytes method can also be used; note that BlockBlobService comes from the legacy azure-storage-blob 2.x SDK.
Here the file is fetched from blob storage and held in memory, and pandas then reads the byte array as parquet.
from azure.storage.blob import BlockBlobService
import pandas as pd
from io import BytesIO

# Source account and key
source_account_name = 'testdata'
source_account_key = '****************'
SOURCE_CONTAINER = 'my-data'
eachFile = 'test/2021/oct/myfile.parq'

source_block_blob_service = BlockBlobService(account_name=source_account_name, account_key=source_account_key)

# Fetch the blob into memory and let pandas parse the bytes as parquet
f = source_block_blob_service.get_blob_to_bytes(SOURCE_CONTAINER, eachFile)
df = pd.read_parquet(BytesIO(f.content))
print(df.shape)

Related

Saving CSV output from an Azure function to an Azure container (Python 3.x)

I want to run an Azure function and save a CSV output to an Azure container.
I currently have two blocks of code: one generates a CSV file, the other loads a CSV file into my container.
Each block works on my local PC in a Jupyter Notebook, but I am struggling to combine them into a working Azure function, so I am looking for help.
Block 1 (Generate the CSV)
import yfinance as yf
import pandas as pd
from datetime import date
import csv

# Stock names
NZX = [["Ascension Capital Limited", "ACE"], ["AFC Group Holdings Limited", "AFC"], ["Z Energy Limited", "ZEL"]]

today = str(date.today().isoformat())
directory = "C:\\Users\\Etc...\\SharePrices\\CSVs\\"
df_list = list()

for i in NZX:
    code = i[1]
    name = i[0]
    cmpy = f"{code}.NZ"
    tickerStrings = [cmpy]
    for ticker in tickerStrings:
        data = yf.download(ticker, group_by="Ticker", period='1d')
        data['ticker'] = ticker
        df_list.append(data)

df = pd.concat(df_list)
df.to_csv(f"{directory}_{today}.csv")
Block 2
from azure.storage.blob import BlobClient
blob = BlobClient.from_connection_string(conn_str="Myconnectionstring", container_name="container1", blob_name="StevesBlob3.csv")
with open("./output.csv", "rb") as data:
blob.upload_blob(data)
Can anyone point me in the right direction? The issues I am currently struggling with:
Do I need to save the file in a temp folder inside the Azure function before uploading it, or can I push it directly to the container?
How do I reference the destination folder/container when I save the CSV?
Any guidance would be much appreciated; I am new to Azure Functions.
Example with a generated CSV file
# Creates a random CSV and uploads it to blob storage
import numpy as np
import pandas as pd
from datetime import datetime
from azure.storage.blob import BlobClient

# Create a dynamic filename
dateTimeObj = datetime.now()
timestampStr = dateTimeObj.strftime("%d%b%Y%H%M%S")
filename = f"{timestampStr}.csv"

df = pd.DataFrame(np.random.randn(5, 3), columns=['Column1', 'Column2', 'Column3'])
df.to_csv(filename, index=False)

blob = BlobClient.from_connection_string(
    conn_str="DefaultEndpointsProtocol=https;AccountName=storageaccountXXXXXXX;AccountKey=XXXXXXXXXXXXXXXX;EndpointSuffix=core.windows.net",
    container_name="container2",
    blob_name=filename)

with open(filename, "rb") as data:
    blob.upload_blob(data)
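On the temp-folder question: you do not have to write the CSV to disk inside the function at all. A minimal in-memory sketch, assuming the same placeholder connection string and container as above and a made-up blob name, serializes the DataFrame with to_csv() and uploads the resulting string directly:
# Sketch: upload a DataFrame without touching the local filesystem.
# Connection string, container name and blob name are placeholders.
import numpy as np
import pandas as pd
from azure.storage.blob import BlobClient

df = pd.DataFrame(np.random.randn(5, 3), columns=['Column1', 'Column2', 'Column3'])

blob = BlobClient.from_connection_string(
    conn_str="<connection_string>",
    container_name="container2",
    blob_name="share_prices.csv")

# to_csv() with no path argument returns the CSV content as a string,
# which upload_blob() accepts directly.
blob.upload_blob(df.to_csv(index=False), overwrite=True)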

Load data to Azure Blob from a Python DataFrame

I am trying to upload data from a Python DataFrame to Azure Blob Storage.
I have been using the following to download data from Azure Blob Storage, and it works:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
import os
import pandas as pd
from pandas import DataFrame as df
from io import StringIO

blob_service_client = BlobServiceClient.from_connection_string(os.environ["blob_conn_string"])
blob_client = blob_service_client.get_blob_client(blob_container, file_name)
Localfile = blob_client.download_blob().content_as_text()
df_data = pd.read_csv(StringIO(Localfile))
I want to load df_data back into the Azure blob container.
I tried the following code:
blob_client.upload_blob(df_data)
Can anyone suggest what I am doing wrong?
As flow_me_over says in the comment, blob_client.upload_blob(df_data) will not work.
The data accepted by upload_blob() must be Union[Iterable[AnyStr], IO[AnyStr]], but df_data is the object returned by pd.read_csv, which is neither.
Either of the following works without problems:
blob_client2.upload_blob(Localfile)
or
blob_client2.upload_blob(data=df_data.to_csv(index=False))
This is what I did to resolve this:
def blob_conn(df, blob_name):
    blob_service_client = BlobServiceClient.from_connection_string(os.environ["blob_conn_str"])
    blob_client = blob_service_client.get_blob_client(container=os.environ["container"], blob=blob_name)
    blob_client.upload_blob(df, overwrite=True)
I pass 'df' in as CSV text generated from the DataFrame, and blob_name as the destination path in the container.
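For reference, a short usage sketch of how that helper is then called; the blob path 'reports/output.csv' is just a placeholder:
# Hypothetical call: serialize the DataFrame to CSV text and hand it to the helper.
csv_text = df_data.to_csv(index=False)
blob_conn(csv_text, 'reports/output.csv')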

Upload multiple pandas DataFrames as a single Excel file with multiple sheets to Google Cloud Storage

I am new to Google Cloud Storage.
In my Python code I have a couple of DataFrames that I want to store in a GCS bucket as a single Excel file with multiple sheets.
In a local directory I am able to do that using ExcelWriter. Here is the code for that:
writer = pd.ExcelWriter(filename)
dataframe1.to_excel(writer, 'sheet1', index=False)
dataframe2.to_excel(writer, 'sheet2', index=False)
writer.save()
I don't want to save a temp file in a local directory and then upload it to GCS.
You can skip gcsfs entirely and use the ExcelWriter object directly with the Storage client:
import io
import pandas as pd
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('name_of_your_bucket')
blob = bucket.blob('path/to/excel')

with io.BytesIO() as output:
    writer = pd.ExcelWriter(output, engine='xlsxwriter')
    dataframe1.to_excel(writer, sheet_name='sheet1', index=False)
    dataframe2.to_excel(writer, sheet_name='sheet2', index=False)
    writer.save()  # on newer pandas versions use writer.close() instead
    output.seek(0)
    blob.upload_from_file(output, content_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
OLD answer:
You can instantiate your ExcelWriter() with engine='xlsxwriter' and use fs-gcsfs to write the byte array to an Excel file in your GCS bucket.
In your case you can do the following:
import io
import pandas as pd
from fs_gcsfs import GCSFS

gcsfs = GCSFS(bucket_name='name_of_your_bucket',
              root_path='path/to/excel',
              # set a different root path if you wish to upload multiple files to different locations
              strict=False)
gcsfs.fix_storage()

output = io.BytesIO()
writer = pd.ExcelWriter(output, engine='xlsxwriter')
dataframe1.to_excel(writer, sheet_name='sheet1', index=False)
dataframe2.to_excel(writer, sheet_name='sheet2', index=False)
writer.save()
xlsx_data = output.getvalue()

with gcsfs.open('./excel_file.xlsx', 'wb') as f:
    f.write(xlsx_data)
PS: I had to use strict=False because fs-gcsfs wasn't able to locate the root path (do check the limitations section in the fs-gcsfs documentation).
Source: https://xlsxwriter.readthedocs.io/working_with_pandas.html#saving-the-dataframe-output-to-a-string

How to write JSON back to S3 in AWS Glue?

I am new to AWS Glue. I am trying to read a CSV and transform it to a JSON object. As I understand it, the approach is to read the CSV via a crawler, convert it to a PySpark DataFrame, and then convert that to a JSON object.
So far I have converted it to a JSON object. How would I now write this JSON back to an S3 bucket?
Below is the code
#########################################
### IMPORT LIBRARIES AND SET VARIABLES
#########################################
#Import python modules
from datetime import datetime
#Import pyspark modules
from pyspark.context import SparkContext
import pyspark.sql.functions as f
#Import glue modules
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
import json
import boto3
#Initialize contexts and session
spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)
session = glue_context.spark_session
s3_source = boto3.resource('s3')
#Parameters
glue_db = "umesh-db"
glue_tbl = "read"
#########################################
### EXTRACT (READ DATA)
#########################################
#Read movie data to Glue dynamic frame
dynamic_frame_read = glue_context.create_dynamic_frame.from_catalog(database = glue_db, table_name = glue_tbl)
#Convert dynamic frame to data frame to use standard pyspark functions
data_frame = dynamic_frame_read.toDF()
## Show DF data
print("Showing Df data")
data_frame.show()
### Convert the DF to the json
jsonContent = data_frame.toJSON()
jsonValue = {}
arrraYObj = []
for row in jsonContent.collect():
    print("Row data ", row)
    arrraYObj.append(row)
print("Array Obj", arrraYObj)
jsonValue['Employee'] = arrraYObj
print("Json object ", jsonValue)
#Log end time
#dt_end = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
#print("Start time:", dt_end)
I would appreciate it if anyone could suggest the right approach.
Thanks
data_frame.write.format('json').save('s3://bucket/key')
Or directly from the dynamic frame:
glue_context.write_dynamic_frame.from_options(frame=dynamic_frame_read,
                                              connection_type="s3",
                                              connection_options={"path": "s3://bucket/key"},
                                              format="json")

Loading data files one by one into a database table in Python pandas

I have 6 files, named Data_20190823101010, Data_20190823101112, Data_20190823101214, Data_20190823101310, Data_20190823101410 and Data_20190823101510.
These are daily files to be loaded into a SQL Server DB table.
For size and performance reasons they need to be loaded one by one: the Python code must pick one file at a time, process it and load it into the DB table.
How should I write the code?
Thanks in advance.
import glob
import os
import pandas as pd
import time
from datetime import datetime
import numpy as np

folder_name = 'Data_Folder'
file_type = 'csv'
file_titles = ['C1', 'C2', 'C3', 'C4', 'C5']
df = pd.concat([pd.read_csv(f, header=None, skiprows=1, names=file_titles, low_memory=False) for f in glob.glob(folder_name + "//*Data_*")])
You can import those CSV files into a DataFrame one at a time and use the pandas to_sql function to connect and upload the data to the MS SQL Server DB:
from sqlalchemy import create_engine
import urllib
import pyodbc
import pandas as pd
import glob

connection = urllib.parse.quote_plus("DRIVER={SQL Server Native Client 11.0};SERVER=Server_name;DATABASE=DB Name")
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(connection))

path = r'C:\file_path'  # local drive file path
all_csv_files = glob.glob(path + "/*.csv")

for filename in all_csv_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    # if_exists='append' keeps adding rows to the same table on each iteration
    df.to_sql('Table_Name', schema='dbo', con=engine, if_exists='append')
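Since the timestamp is embedded in each filename, one small refinement of the loop above, as a sketch reusing the same path, engine and table placeholders and assuming the files carry a .csv extension, is to sort the matches so the daily files are loaded oldest first, one per iteration:
# Sketch: load the Data_YYYYMMDDHHMMSS files in chronological order, one at a time.
# The fixed-width timestamp means plain lexicographic sorting is chronological.
for filename in sorted(glob.glob(path + "/Data_*.csv")):
    df = pd.read_csv(filename, index_col=None, header=0)
    df.to_sql('Table_Name', schema='dbo', con=engine, if_exists='append')
    print(f"Loaded {filename}")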
