Saving CSV output from Azure function to Azure container (Python 3X) - python-3.x

I want to run an Azure function and save the CSV output to an Azure container.
I currently have two blocks of code:
one that generates a CSV file, and
one that loads that CSV file into my container.
Each block works on my local PC in a Jupyter Notebook, but I am struggling to combine them to work together in an Azure function, so I am looking for help.
Block 1 (Generate the CSV)
import yfinance as yf
import pandas as pd
from datetime import date
import csv

# Stock names
NZX = [["Ascension Capital Limited", "ACE"], ["AFC Group Holdings Limited", "AFC"], ["Z Energy Limited", "ZEL"]]
today = str(date.today().isoformat())
directory = "C:\\Users\\Etc...\\SharePrices\\CSVs\\"

df_list = list()
for i in NZX:
    code = i[1]
    name = i[0]
    cmpy = f"{code}.NZ"
    tickerStrings = [cmpy]
    for ticker in tickerStrings:
        data = yf.download(ticker, group_by="Ticker", period='1d')
        data['ticker'] = ticker
        df_list.append(data)

df = pd.concat(df_list)
df.to_csv(f"{directory}_{today}.csv")
Block 2 (Upload to the container)
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(conn_str="Myconnectionstring", container_name="container1", blob_name="StevesBlob3.csv")

with open("./output.csv", "rb") as data:
    blob.upload_blob(data)
Can anyone point me in the right direction? Current issues I am struggling with:
Do I need to save the file in a temp folder in the Azure function before trying to move it, or can I push it directly to the container?
How do I reference the destination folder/container when I save the CSV?
Any guidance would be much appreciated. I am new to Azure Functions.
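For the two questions just above: upload_blob() accepts a plain string, so the DataFrame can be serialised with df.to_csv() and pushed straight to the container without a temporary file, and the destination is identified purely by container_name plus blob_name (a "folder" is just a prefix in the blob name). A minimal sketch combining the two blocks, with the connection string and names as placeholders:
import pandas as pd
import yfinance as yf
from datetime import date
from azure.storage.blob import BlobClient

NZX = [["Ascension Capital Limited", "ACE"], ["Z Energy Limited", "ZEL"]]
today = date.today().isoformat()

df_list = []
for name, code in NZX:
    data = yf.download(f"{code}.NZ", group_by="Ticker", period="1d")
    data["ticker"] = f"{code}.NZ"
    df_list.append(data)
df = pd.concat(df_list)

# Serialise in memory and push straight to the container; no temp file needed.
blob = BlobClient.from_connection_string(
    conn_str="Myconnectionstring",         # placeholder connection string
    container_name="container1",           # destination container
    blob_name=f"SharePrices/{today}.csv")  # "SharePrices/" acts as a virtual folder
blob.upload_blob(df.to_csv(), overwrite=True)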

Example with a generated CSV file

# Creates a random CSV and uploads it to blob storage
import numpy as np
import pandas as pd
from datetime import datetime
from azure.storage.blob import BlobClient

# Create a dynamic filename
dateTimeObj = datetime.now()
timestampStr = dateTimeObj.strftime("%d%b%Y%H%M%S")
filename = f"{timestampStr}.csv"

df = pd.DataFrame(np.random.randn(5, 3), columns=['Column1', 'Column2', 'Column3'])
df.to_csv(filename, index=False)

blob = BlobClient.from_connection_string(
    conn_str="DefaultEndpointsProtocol=https;AccountName=storageaccountXXXXXXX;AccountKey=XXXXXXXXXXXXXXXX;EndpointSuffix=core.windows.net",
    container_name="container2",
    blob_name=filename)

with open(filename, "rb") as data:
    blob.upload_blob(data)
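To run the same thing on a schedule inside an Azure Function, the upload code can sit in the function body largely unchanged; below is a minimal sketch, assuming the v1 Python programming model with a timer trigger (the function.json binding is not shown) and a hypothetical app setting STORAGE_CONN_STR holding the connection string:
import os
from datetime import datetime

import azure.functions as func
import numpy as np
import pandas as pd
from azure.storage.blob import BlobClient


def main(mytimer: func.TimerRequest) -> None:
    # Build the CSV entirely in memory; the function's file system is not needed.
    filename = f"{datetime.now():%d%b%Y%H%M%S}.csv"
    df = pd.DataFrame(np.random.randn(5, 3), columns=["Column1", "Column2", "Column3"])

    blob = BlobClient.from_connection_string(
        conn_str=os.environ["STORAGE_CONN_STR"],  # hypothetical app setting
        container_name="container2",
        blob_name=filename)
    blob.upload_blob(df.to_csv(index=False))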

Related

Load data to Azure Blob from python data frame

I am trying to upload data from a Python dataframe into Azure Blob.
I have been using this to download data from Azure Blob, which works:
import os
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
import pandas as pd
from pandas import DataFrame as df
from io import StringIO

blob_service_client = BlobServiceClient.from_connection_string(os.environ["blob_conn_string"])
blob_client = blob_service_client.get_blob_client(blob_container, file_name)
Localfile = blob_client.download_blob().content_as_text()
df_data = pd.read_csv(StringIO(Localfile))
I want to load df_data back into the Azure blob container.
I tried the following code:
blob_client.upload_blob(df_data)
Can anyone suggest what I am doing wrong?
As flow_me_over says in the comment, blob_client.upload_blob(df_data) will not work.
The type expected by upload_blob() is Union[Iterable[AnyStr], IO[AnyStr]], but the type of df_data is TextFileReader.
The code below works with no problem:
blob_client2.upload_blob(Localfile)
Or
blob_client2.upload_blob(data=df_data.to_csv(index=False))
This is what I did to resolve this:
def blob_conn(df, blob_name):
    blob_service_client = BlobServiceClient.from_connection_string(os.environ["blob_conn_str"])
    blob_client = blob_service_client.get_blob_client(container=os.environ["container"], blob=blob_name)
    blob_client.upload_blob(df, overwrite=True)
I am passing 'df' as CSV text built from the data frame, and blob_name as the destination path.
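For example, a hypothetical call to that helper, serialising the df_data frame from the question to CSV text first (the blob path here is made up):
# Assumes blob_conn() and the environment variables above are already defined.
blob_conn(df_data.to_csv(index=False), "processed/df_data.csv")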

How to read parquet files from Azure Blobs into Pandas DataFrame?

I need to read .parquet files into a Pandas DataFrame in Python on my local machine without downloading the files. The parquet files are stored in Azure blobs with a hierarchical directory structure.
I am doing something like the following and am not sure how to proceed:
from azure.storage.blob import BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = blob_service_client.get_blob_client(container="abc", blob="/xyz/pqr/folder_with_parquet_files")
I have used dummy names here for privacy reasons. Assuming the directory "folder_with_parquet_files" contains 'n' parquet files, how can I read them into a single Pandas DataFrame?
You could use pandas and read the parquet from a stream. This is very helpful for small data sets, since no Spark session is required. It can be the fastest way, especially for testing purposes.
import pandas as pd
from io import BytesIO
from azure.storage.blob import ContainerClient
path = '/path_to_blob/..'
conn_string = <conn_string>
blob_name = f'{path}.parquet'
container = ContainerClient.from_connection_string(conn_str=conn_string, container_name=<name_of_container>)
blob_client = container.get_blob_client(blob=blob_name)
stream_downloader = blob_client.download_blob()
stream = BytesIO()
stream_downloader.readinto(stream)
processed_df = pd.read_parquet(stream, engine='pyarrow')
Here is a very similar but slightly different solution, using the newer method azure.storage.blob._download.StorageStreamDownloader.readall:
from io import BytesIO
import pandas as pd
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(container="parquet")
downloaded_blob = container_client.download_blob(upload_name)
bytes_io = BytesIO(downloaded_blob.readall())
df = pd.read_parquet(bytes_io)
print(df.head())
The get_blob_to_bytes method (from the older BlockBlobService client) can also be used.
Here the file is fetched from blob storage and held in memory; pandas then reads the byte content as parquet.
from azure.storage.blob import BlockBlobService
import pandas as pd
from io import BytesIO
#Source account and key
source_account_name = 'testdata'
source_account_key ='****************'
SOURCE_CONTAINER = 'my-data'
eachFile = 'test/2021/oct/myfile.parq'
source_block_blob_service = BlockBlobService(account_name=source_account_name, account_key=source_account_key)
f = source_block_blob_service.get_blob_to_bytes(SOURCE_CONTAINER, eachFile)
df = pd.read_parquet(BytesIO(f.content))
print(df.shape)
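None of the snippets above handle the 'n parquet files under one folder' part of the question. With the v12 SDK you could list the blobs under the folder prefix and concatenate the resulting frames; a rough sketch, reusing the dummy container and folder names from the question and the same connection_string variable:
import pandas as pd
from io import BytesIO
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(conn_str=connection_string,
                                                   container_name="abc")

frames = []
# name_starts_with walks the virtual directory, including nested "subfolders".
for blob in container.list_blobs(name_starts_with="xyz/pqr/folder_with_parquet_files/"):
    if not blob.name.endswith(".parquet"):
        continue
    stream = BytesIO(container.download_blob(blob.name).readall())
    frames.append(pd.read_parquet(stream, engine="pyarrow"))

df = pd.concat(frames, ignore_index=True)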

Upload multiple pandas dataframe as single excel file with multiple sheets to Google Cloud Storage

I am new to Google Cloud Storage.
In my Python code I have a couple of DataFrames, and I want to store them in a GCS bucket as a single Excel file with multiple sheets.
In a local directory I am able to do that using ExcelWriter. Here is the code for that:
writer = pd.ExcelWriter(filename)
dataframe1.to_excel(writer, 'sheet1', index=False)
dataframe2.to_excel(writer, 'sheet2', index=False)
writer.save()
I don't want to save a temp file to a local directory and then upload it to GCS.
You can skip the use of gcsfs and directly use the ExcelWriter object with the storage client:
import io
import pandas as pd
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('name_of_your_bucket')
blob = bucket.blob('path/to/excel')

with io.BytesIO() as output:
    writer = pd.ExcelWriter(output, engine='xlsxwriter')
    dataframe1.to_excel(writer, sheet_name='sheet1', index=False)
    dataframe2.to_excel(writer, sheet_name='sheet2', index=False)
    writer.save()
    output.seek(0)
    blob.upload_from_file(output, content_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
OLD Answer:
You can instantiate your ExcelWriter() with engine='xlsxwriter' and use fs-gcsfs to write the byte array to an Excel file in your GCS bucket.
In your case you can do the following:
import io
import pandas as pd
from fs_gcsfs import GCSFS

gcsfs = GCSFS(bucket_name='name_of_your_bucket',
              root_path='path/to/excel',
              # set a different root path if you wish to upload multiple files in different locations
              strict=False)
gcsfs.fix_storage()

output = io.BytesIO()
writer = pd.ExcelWriter(output, engine='xlsxwriter')
dataframe1.to_excel(writer, sheet_name='sheet1', index=False)
dataframe2.to_excel(writer, sheet_name='sheet2', index=False)
writer.save()

xlsx_data = output.getvalue()
with gcsfs.open('./excel_file.xlsx', 'wb') as f:
    f.write(xlsx_data)
PS: I had to use strict=False as fs-gcsfs wasn't able to locate the root path (do check the limitations section in the fs-gcsfs documentation).
Source: https://xlsxwriter.readthedocs.io/working_with_pandas.html#saving-the-dataframe-output-to-a-string
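If you are on a pandas version where writer.save() is deprecated (1.5) or removed (2.0+), the same in-memory approach can be written with the writer as a context manager and the result uploaded with upload_from_string; a sketch using the same placeholder bucket and path, with dataframe1 and dataframe2 as in the question:
import io
import pandas as pd
from google.cloud import storage

storage_client = storage.Client()
blob = storage_client.bucket('name_of_your_bucket').blob('path/to/excel.xlsx')

with io.BytesIO() as output:
    # Leaving the inner with-block closes the writer and finalises the workbook bytes.
    with pd.ExcelWriter(output, engine='xlsxwriter') as writer:
        dataframe1.to_excel(writer, sheet_name='sheet1', index=False)
        dataframe2.to_excel(writer, sheet_name='sheet2', index=False)
    blob.upload_from_string(
        output.getvalue(),
        content_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")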

Loading same data files one by one in Database table in python pandas

I have 6 files named Data_20190823101010, Data_20190823101112, Data_20190823101214, Data_20190823101310, Data_20190823101410 and Data_20190823101510.
These are daily files to be loaded into a SQL Server DB table.
For size and performance reasons they need to be loaded one by one.
The Python code must pick one file at a time, process it and load it into the DB table.
How do I write this code?
Thanks in advance.
import glob
import os
import pandas as pd
import time
from datetime import datetime
import numpy as np
folder_name = 'Data_Folder'
file_type = 'csv'
file_titles = ['C1', 'C2', 'C3', 'C4', 'C5']
df = pd.concat([pd.read_csv(f, header=None, skiprows=1, names=file_titles, low_memory=False) for f in glob.glob(folder_name + "//*Data_*")])
You can import those csv files into dataframes, concatenate them and use the pandas to_sql function to upload the data to the MS SQL Server DB:
from sqlalchemy import create_engine
import urllib
import pyodbc
import pandas as pd
import glob

connection = urllib.parse.quote_plus("DRIVER={SQL Server Native Client 11.0};SERVER=Server_name;DATABASE=DB Name")
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(connection))

path = r'C:\file_path'  # local drive file path
all_csv_files = glob.glob(path + "/*.csv")
for filename in all_csv_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    df.to_sql('Table_Name', schema='dbo', con=engine)
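If the files really must be picked up one at a time in chronological order, the timestamp embedded in the file name can serve as the sort key; a rough sketch, assuming the Data_YYYYMMDDHHMMSS.csv naming above and reusing the engine from the previous snippet:
import glob
import os
import pandas as pd

path = r'C:\file_path'
all_csv_files = glob.glob(os.path.join(path, 'Data_*.csv'))

# Sort on the 14-digit timestamp after "Data_" so the oldest file loads first.
for filename in sorted(all_csv_files, key=lambda f: os.path.basename(f)[5:19]):
    df = pd.read_csv(filename, index_col=None, header=0)
    df.to_sql('Table_Name', schema='dbo', con=engine, if_exists='append', index=False)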

Combine multiple csv files into a single xls workbook Python 3

We are in the transition at work from Python 2.7 to Python 3.5. It's a company-wide change and most of our current scripts were written in 2.7 with no additional libraries. I've taken advantage of the Anaconda distro we are using and have already changed most of our scripts over using the 2to3 module or by completely rewriting them. I am stuck on one piece of code, though, which I did not write, and the original author is not here. He also did not supply comments, so I can only guess at the intent of the script. 95% of the script works correctly until the end, where, after it creates 7 csv files with different parsed information, it calls a custom function to combine the csv files into an xls workbook with each csv as a new tab.
import csv
import os
import xlwt
import glob
import openpyxl
from openpyxl import Workbook

Parsefiles = glob.glob(directory + '/' + "Parsed*.csv")

def xlsmaker():
    for f in Parsefiles:
        (path, name) = os.path.split(f)
        (short_name, extension) = os.path.splitext(name)
        ws = wb.add_sheet(short_name)
        xreader = csv.reader(open(f, 'rb'))
        newdata = [line for line in xreader]
        for rowx, row in enumerate(newdata):
            for colx, value in enumerate(row):
                if value.isdigit():
                    ws.write(rowx, colx, value)

xlsmaker()

for f in Parsefiles:
    os.remove(f)

wb.save(directory + '/' + "Finished" + '_' + oshort + '_' + timestr + ".xls")
This was all written in Python 2.7 and still works correctly if I run it in Python 2.7. The issue is that it throws an error when running in Python 3.5:
File "parsetool.py", line 521, in <module>
    xlsmaker()
File "parsetool.py", line 511, in xlsmaker
    ws = wb.add_sheet(short_name)
File "c:\pythonscripts\workbook.py", line 168, in add_sheet
    raise TypeError("The parameter you have given is not of the type '%s'" % self._worksheet_class.__name__)
TypeError: The parameter you have given is not of the type "Worksheet"
Any ideas about what should be done to fix the above error? I've tried multiple rewrites, but I get similar or new errors. I'm considering just figuring out a whole new method to create the xls, possibly with pandas instead.
Not sure why it errs, but it is worth the effort to rewrite the code and use pandas instead. Pandas can read each csv file into a separate dataframe and save all dataframes as separate sheets in an xls(x) file. This can be done using pandas' ExcelWriter, e.g.:
import pandas as pd
writer = pd.ExcelWriter('yourfile.xlsx', engine='xlsxwriter')
df = pd.read_csv('originalfile.csv')
df.to_excel(writer, sheet_name='sheetname')
writer.save()
Since you have multiple csv files, you would probably want to read all csv files and store them as a df in a dict. Then write each df to Excel with a new sheet name.
Multi-csv Example:
import pandas as pd
import sys
import os

writer = pd.ExcelWriter('default.xlsx')  # arbitrary output name
for csvfilename in sys.argv[1:]:
    df = pd.read_csv(csvfilename)
    df.to_excel(writer, sheet_name=os.path.splitext(csvfilename)[0])
writer.save()
(Note that it may be necessary to pip install openpyxl to resolve errors about a missing xlsxwriter import.)
You can use the code below to read multiple .csv files into one big .xlsx Excel file.
I also added code to replace ',' with '.' (or vice versa) for improved compatibility on Windows environments, depending on your locale settings.
import pandas as pd
import sys
import os
import glob
from pathlib import Path

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
writer = pd.ExcelWriter('fc15.xlsx')  # arbitrary output name
for csvfilename in all_filenames:
    txt = Path(csvfilename).read_text()
    txt = txt.replace(',', '.')
    text_file = open(csvfilename, "w")
    text_file.write(txt)
    text_file.close()
    print("Loading " + csvfilename)
    df = pd.read_csv(csvfilename, sep=';', encoding='utf-8')
    df.to_excel(writer, sheet_name=os.path.splitext(csvfilename)[0])
    print("done")
writer.save()
print("task completed")
Here's a slight extension to the accepted answer. Pandas 1.5 complains about the call to writer.save(). The fix is to use the writer as a context manager.
import sys
from pathlib import Path
import pandas as pd

with pd.ExcelWriter("default.xlsx") as writer:
    for csvfilename in sys.argv[1:]:
        p = Path(csvfilename)
        sheet_name = p.stem[:31]
        df = pd.read_csv(p)
        df.to_excel(writer, sheet_name=sheet_name)
This version also trims the sheet name down to fit in Excel's maximum sheet name length, which is 31 characters.
If your csv files are in Chinese with GBK encoding, you can use the following code:
import pandas as pd
import glob
import datetime
from pathlib import Path

now = datetime.datetime.now()
extension = "csv"
all_filenames = [i for i in glob.glob(f"*.{extension}")]
with pd.ExcelWriter(f"{now:%Y%m%d}.xlsx") as writer:
    for csvfilename in all_filenames:
        print("Loading " + csvfilename)
        df = pd.read_csv(csvfilename, encoding="gb18030")
        df.to_excel(writer, index=False, sheet_name=Path(csvfilename).stem)
        print("done")
print("task completed")
