Creating a graph of credits used by warehouse over a year using Python - python-3.x

I want to generate a graph. First I will query Snowflake to fetch the data for the credits/resources consumed by each warehouse over a year, and I want to use that data to generate a line graph showing how each warehouse has consumed costs/resources over the past year. For example, if I have 5 warehouses, I want to see a line for each of them showing the trend over the past year.
I am new to graphing in Python and need help with this.
Regards
Vivek

You can do this with the matplotlib, pandas and snowflake-connector-python modules installed (Python 3.x).
You'll need to build a query that aggregates your warehouse metering history the way you need it, using the WAREHOUSE_METERING_HISTORY account usage view or an equivalent. The example below aggregates by month.
With the query results in a pandas DataFrame, you can then use pivot to reshape the data so that each warehouse appears as a line of its own.
import matplotlib.pyplot as plt
import pandas as pd
from snowflake import connector

# Establish your connection here
con = connector.connect(…)

q = """
select
    warehouse_name as warehouse,
    date_trunc('month', end_time)::date as in_month,
    sum(credits_used) as credits
from snowflake.account_usage.warehouse_metering_history
where warehouse_id != 0
group by warehouse, in_month
order by warehouse, in_month;
"""

# Run the query and pull the results straight into a pandas DataFrame
df = con.cursor().execute(q).fetch_pandas_all()

# Explicitly specify datatypes for all columns so they behave well
df['IN_MONTH'] = pd.to_datetime(df['IN_MONTH'])
tdf = df.astype({'WAREHOUSE': 'string', 'CREDITS': 'float'})

# Pivot so each warehouse becomes its own column, i.e. its own line in the plot
pdf = tdf.pivot(index='IN_MONTH', columns='WAREHOUSE', values='CREDITS')
pdf.plot()
plt.show()
This yields a line chart with one line per warehouse and one point per month.
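One note, since the question asks specifically for the past year: the example query returns the full metering history. As a minimal sketch (my addition, not part of the original answer), you could restrict the data to the last 12 months just before the pivot step:
# Keep only the last 12 months (assumes "past one year" means the 12 months up to today)
cutoff = pd.Timestamp.today() - pd.DateOffset(years=1)
tdf = tdf[tdf['IN_MONTH'] >= cutoff]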
P.S. You can alternatively try the native Snowsight features in Snowflake to plot interactive charts right from its SQL editor interface.

Related

How to filter specific values in Azure Table Storage in a column?

I have a column called Id in Azure Table Storage, as shown here. I would like to query all the rows that contain 'ActPow'. I understand that I can't use "like" here due to a limitation of Azure Table Storage, and that I need to use le and ge for filtering, as per the documentation. How can I do this for my data? I am using the Azure Table Storage SDK for Python.
Using the Azure Table Storage SDK for Python, I've written a Python script that retrieves the Name/ID from the table entities by filtering on the row key.
I've taken a few row entities from your data table and passed them to query_entities as shown below:
Table_Service.query_entities(
    '<tablename>',
    filter="Name ge 'SH3.PV01.PCS1_1.ActPow' and Name le 'SH6.PV01.PCS1_1.ActPow'")
Try the script below, which worked for me:
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity

# Use your own storage account name and key here
Table_Service = TableService(account_name='<account_name>', account_key='<account_key>')

# Range filter on Name: ge/le bound the values between the SH3 and SH6 ActPow entries
rowoutput = Table_Service.query_entities(
    'new', filter="Name ge 'SH3.PV01.PCS1_1.ActPow' and Name le 'SH6.PV01.PCS1_1.ActPow'")

for row in rowoutput:
    print(row.Name)
Before executing the code, install the required azure-cosmosdb-table package:
pip install azure-cosmosdb-table
The output lists the matching Name values.
For reference, see the related SO answer by Ivan Glasenberg and the SDK samples for tables.

Execution failed on sql ' SELECT * FROM Reviews WHERE Score != 3 LIMIT 5000': no such table: Reviews

I need help with the issue below.
I am trying to connect to SQLite and read the data using read_sql_query from pandas, and I am stuck with this error:
Execution failed on sql ' SELECT * FROM Reviews WHERE Score != 3 LIMIT 5000': no such table: Reviews
Below is the code snippet for opening the SQLite connection and listing the working directory:
import os
import sqlite3

con = sqlite3.connect('database.sqlite')
print(con)

os.getcwd()
os.listdir()
Output of the above code (note the Reviews.csv file in the directory):
<sqlite3.Connection object at 0x00000240C2C6C570>
['.ipynb_checkpoints',
'03 Amazon Fine Food Reviews Analysis_KNN.ipynb',
'Assignment_SAMPLE_SOLUTION.ipynb',
'database.sqlite',
'K NN Implementation with Sample Data for regression and classification.ipynb',
'Reviews.csv']
Now, as the file is in the directory, I use this:
import sqlite3
import pandas as pd
import numpy as np
con = sqlite3.connect('database.sqlite')
filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 5000""", con)
The above snippet of code gives the error:
Execution failed on sql ' SELECT * FROM Reviews WHERE Score != 3 LIMIT 5000': no such table: Reviews
Can anyone please let me know where I am going wrong?
The error message you get is explicit: there is no table Reviews in your database 'database.sqlite'.
The .csv file you mention is just a CSV file; by definition you can only run SQL queries against a database.
To find out which tables are available in the database, open the SQLite command line with something like sqlite3 database.sqlite and run the command .tables. This will list the tables in the database.
If you want to learn more about SQLite, you can use sqlitetutorial.net, for example.
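If you'd rather stay in Python, here is a minimal sketch of the same check (my addition), querying SQLite's sqlite_master catalog table:
import sqlite3

con = sqlite3.connect('database.sqlite')
# List all table names present in the SQLite file
tables = con.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)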
Give the actual location of the SQLite database file, and if it still doesn't work, pass the path as a raw string: r'location'.
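A minimal illustration of that suggestion (the path below is only a placeholder, not your actual file location):
import sqlite3
import pandas as pd

# A raw string avoids backslash-escape problems in Windows paths
con = sqlite3.connect(r'C:\full\path\to\database.sqlite')
filtered_data = pd.read_sql_query("SELECT * FROM Reviews WHERE Score != 3 LIMIT 5000", con)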
As Bluexm said, there is no table called Reviews available.
I was using Colab; I followed the steps below and it worked for me. Maybe you can try these steps.
Generate a JSON API key from your Kaggle profile (Account -> Generate API Key).
Upload that JSON file to the "/root/.kaggle/" folder.
Download the dataset using the API key, e.g.:
!kaggle datasets download -d snap/amazon-fine-food-reviews
!unzip archive

import sqlite3
import pandas as pd

con = sqlite3.connect('/content/database.sqlite')
filtered_data = pd.read_sql_query("""SELECT * FROM Reviews WHERE Score != 3 LIMIT 5000""", con)

Is there a way to split a DF using column name comparison?

I am extremely new to Python. I've created a DataFrame from a CSV file. My file is a complex nested JSON file with header values at the lowest granular level.
[Example] df.columns = [ID1, fullID2, total.count, total.value, seedValue.id, seedValue.value1, seedValue.value2, seedValue.largeFile.id, seedValue.largeFile.value1, seedValue.largeFile.value2......]
Requirement: I have to create multiple smaller CSVs using each of the granular columns together with ID1 and fullID2.
The approach I figured out is to save the smaller slices by splitting on the header value.
Problem 1: Not able to split the value correctly or traverse to the first location for comparison.
[Example]
I'm using df.columns.str.split('.').tolist(). Suppose I get the value listed below; I want to compare the seedValue of id with the seedValue of value1 and pull this entire part out as a new df.
[['seedValue', 'id'], ['seedValue', 'value1'], ['seedValue', 'value2']]
Problem 2: Adding ID1 and fullID2 to this new df.
Any help or direction to achieve this would be super helpful!
[Final output]
df.columns = [ID1, fullID2, total.count, total.value, seedValue.id, seedValue.value1, seedValue.value2, seedValue.largeFile.id, seedValue.largeFile.value1, seedValue.largeFile.value2......]
After post-processing the file:
seedValue.columns = ID1,fullID2,id,value1,value2
total.columns = ID1,fullID2,count,value
seedValue.largeFile.columns = ID1,fullID2,id,value1,value2
While I don't have your complex data to give a more specific solution, I was able to reproduce a similar case with sample .csv data, which shows how to achieve what you're aiming for with your own data.
To save each ID in a different file, we need to loop through the IDs. Also, assuming there may be duplicate IDs, the script saves each group of IDs into its own .csv file. Below is the script, already with sample data:
import pandas as pd

# Sample data standing in for the .csv file
my_dict = {'ids': [11, 11, 33, 55, 55],
           'info': ["comment_1", "comment_2", "comment_3", "comment_4", "comment_5"],
           'other_column': ["something", "something", "something", "", "something"]}

# Creating a dataframe from the sample data
df = pd.DataFrame(my_dict)

# Sorting the values
df = df.sort_values('ids')

# Looping through each group of ids and saving each group to its own file
for id, g in df.groupby('ids'):
    g.to_csv('id_{}.csv'.format(id), index=False)
And the output:
id_11.csv
id_33.csv
id_55.csv
For instance, within id_11.csv:
ids info other_column
11 comment_1 something
11 comment_2 something
Notice that we use the ids field in the name of each file. Moreover, index=False means that a new column with an index for each line of data won't be written.
ADDITIONAL INFO: I used a Notebook in AI Platform within GCP to execute and test the code.
Compared to the more widely known pd.read_csv, pandas offers more granular JSON support through pd.json_normalize, which allows you to specify how to unnest the data, which metadata to use, etc.
Apart from this, reading nested fields from a CSV into a two-dimensional DataFrame might not be the ideal solution here, and having nested objects inside a DataFrame can often be tricky to work with.
Try to read the file as a pure dictionary or list of dictionaries. You can then loop through the keys and write custom logic to decide how many more levels you want to go down, how to return the values, and so on. Once you are at a lower level and prefer to have this inside a DataFrame, create a new temporary DataFrame, then append these parts together inside the loop; a rough sketch follows below.
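A rough sketch of that idea (my addition), assuming the source is the nested JSON file; the file name data.json and the column prefixes are placeholders based on the example columns in the question:
import json
import pandas as pd

# Load the nested JSON and flatten it to dotted column names (ID1, fullID2, total.count, seedValue.id, ...)
with open('data.json') as f:
    records = json.load(f)
df = pd.json_normalize(records, sep='.')

# Split out one smaller frame per prefix, always carrying the two ID columns along
id_cols = ['ID1', 'fullID2']
for prefix in ['total.', 'seedValue.largeFile.', 'seedValue.']:
    cols = [c for c in df.columns if c.startswith(prefix)]
    part = df[id_cols + cols].rename(columns={c: c[len(prefix):] for c in cols})
    part.to_csv(prefix.rstrip('.') + '.csv', index=False)
    # Drop these columns so the shorter 'seedValue.' prefix doesn't pick up the largeFile columns again
    df = df.drop(columns=cols)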

Using quantopian for data analysis

I want to know where Quantopian gets its data from.
If I want to do an analysis on a stock market other than the NYSE, will I get the data? If not, can I manually upload the data so that I can run my algorithms on it?
1.) Quantopian gets its data from several places and provides most of it online, although some sources are premium and require a subscription.
2.) Yes, you can get standard stock market data, but if you have something like a Bloomberg subscription, another data feed, or something you've built yourself and want to pull it in, you can use the fetcher.
The basic code is:
fetch_csv(url, pre_func=None, post_func=None, date_column='date',
          date_format='%m/%d/%y', timezone='UTC', symbol=None, **kwargs)
Here is an example using a file hosted somewhere like Dropbox:
def initialize(context):
    # Fetch data from a CSV file somewhere on the web.
    # Note that one of the columns must be named 'symbol' for
    # the data to be matched to the stock symbol.
    fetch_csv('https://dl.dropboxusercontent.com/u/169032081/fetcher_sample_file.csv',
              date_column='Settlement Date',
              date_format='%m/%d/%y')
    context.stock = symbol('NFLX')

def handle_data(context, data):
    record(Short_Interest=data.current(context.stock, 'Days To Cover'))
You can get data for non-NYSE stocks as well, such as Nasdaq securities. Screens on fundamentals (market, exchange, market cap) are also available; these screens can limit the stocks analyzed from the broad universe.
You can get stock data from Yahoo or other quant sites.

How to filter a CSV file without Pandas? (Best Substitute for Pandas in Pythonista)

I am trying to do some data analysis with Pythonista 3 (an iOS app for Python); however, because of pandas's C libraries, it does not compile on the iOS device.
Is there any substitute for Pandas?
Would numpy be an option for data of type string?
The data set I have at the moment is the history of messages between my friends and me.
The whole history is in one CSV file. Each row has the columns 'day_of_the_week', 'date', 'time_of_message', 'author_of_message', 'message_body'.
The goal of the analysis is to produce a report of our chat for the past year.
I want to be able to count the number of messages each friend sent. I want to be able to plot a histogram of the hours in which the messages were sent by each friend.
Then, I want to do some word counting individually and as a group.
In Pandas I know how to do that. For example:
df = pd.read_csv("messages.csv")
number_of_messages_friend1 = len(df[df.author_of_message == 'friend1'])
How can I filter a csv file without Pandas?
Since Pythonista does have numpy, you will want to look at recarrays, which are numpy's approach to this type of problem. The following worked out of the box in Pythonista for me:
import numpy as np
df = np.recfromcsv('messages.csv')
len(df[df.author_of_message == b'friend1'])
Depending on your data format, you may find that recfromcsv "just works", since it tries to guess data types, or you might need to customize things a bit. See genfromtxt for a number of options, such as explicitly specifying data types or using converters to turn string dates into datetime objects; recfromcsv is just a convenience wrapper around genfromtxt:
https://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html#
Once in a recarray, many of the simple indexing operations work the same as in pandas. Note that you may need to do string comparisons using b-prefixed strings (bytes objects), as shown above, unless you convert to Unicode strings.
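For example, here is a hedged sketch of the genfromtxt route with explicit options (it assumes the CSV has a header row and that the date column is formatted as DD/MM/YYYY; adjust the format string to your data):
from datetime import datetime
import numpy as np

data = np.genfromtxt(
    'messages.csv',
    delimiter=',',
    names=True,             # read column names from the header row
    dtype=None,             # let numpy guess the remaining column types
    encoding='utf-8',       # read fields as str rather than bytes
    converters={'date': lambda s: datetime.strptime(s, '%d/%m/%Y')},
)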
Use the csv module from the standard library to read the messages.
You could store them in a list of dictionaries, or in collections.namedtuple objects for easier access (both approaches are shown below).
import csv

messages = []
with open('messages.csv') as csvfile:
    reader = csv.DictReader(csvfile, fieldnames=('day_of_the_week', 'date', 'time_of_message', 'author_of_message', 'message_body'))
    for row in reader:
        messages.append(row)
That gives you all the messages as a list of dictionaries.
Alternatively you could use a normal csv reader combined with a collections.namedtuple to make a list of named tuples, which are slightly easier to access.
import csv
from collections import namedtuple

Msg = namedtuple('Msg', ('day_of_the_week', 'date', 'time_of_message', 'author_of_message', 'message_body'))

messages = []
with open('messages.csv') as csvfile:
    msgreader = csv.reader(csvfile)
    for row in msgreader:
        messages.append(Msg(*row))
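Building on that, here is a small sketch (my addition) of the counting part of the question, using the messages list of Msg tuples built above; it assumes time_of_message begins with the hour, e.g. 'HH:MM':
from collections import Counter

# Number of messages per friend
per_author = Counter(m.author_of_message for m in messages)
print(per_author.most_common())

# Histogram of the hour each message was sent (assumes 'HH:MM...' formatting)
per_hour = Counter(m.time_of_message.split(':')[0] for m in messages)
print(sorted(per_hour.items()))

# Simple word counts across all messages
words = Counter(w.lower() for m in messages for w in m.message_body.split())
print(words.most_common(20))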
Pythonista now has competition on iOS. The pyto app provides python 3.8 with pandas. https://apps.apple.com/us/app/pyto-python-3-8
