Problems with graphing excel data off an internet source with dates

Problems with graphing excel data off an internet source with dates - excel

this is my first post on stackoveflow and I'm pretty new to programming especially python. I'm in engineering and am learning python to compliment that going forward, mostly at math and graphing applications.
Basically my question is how do I download csv excel data off a source (in my case stock data from google), and plot only certain rows against the date. For myself I want the date against the close value.
Right now the error message I'm getting is timedata '5-Jul-17' does not match '%d-%m-%Y'
previously I was also getting tuple data does not match
The description of the opened csv data in excel is
[7 columns (Date,Open,High,Low,Close,AdjClose,Volume, and the date is organized as 2017-05-30][1]
I'm sure there are other errors as well unfortunately
I would really be grateful for any help on this,
thank you in advance!
--edit--
Upon fiddling some more I don't think names and dtypes are necessary, when I check the matrix dimensions without those identifiers I get (250L, 6L) which seems right. Now my main problem is coverting the dates to something usable, My error now is strptime only accepts strings, so I'm not sure what to use. (see updated code below)
import matplotlib.pyplot as plt
importnumpy as np
from datetime import datetime
def graph_data(stock):
%getting the data off google finance
data = np.genfromtxt('urlgoeshere'+stock+'forthecsvdata', delimiter=',',
skip_header=1)
# checking format of matrix
print data.shape (returns 250L,6L)
time_format = '%d-%m-%Y'
# I only want the 1st column (dates) and 5 column (close), all rows
date = data[:,0][:,]
close = data[:,4][:,]
dates = [datetime.strptime(date, time_format)]
%plotting section
plt.plot_date(dates,close, '-')
plt.legend()
plt.show()
graph_data('stockhere')

Assuming the dates in the csv file are in the format '5-Jul-17', the proper format string to use is %d-%b-%y.
In [6]: datetime.strptime('5-Jul-17','%d-%m-%Y')
ValueError: time data '5-Jul-17' does not match format '%d-%m-%Y'
In [7]: datetime.strptime('5-Jul-17','%d-%b-%y')
Out[7]: datetime.datetime(2017, 7, 5, 0, 0)
See the Python documentation on strptime() behavior.

Related

Plotting graphs with Altair from a Pandas Dataframe

I am trying to read table values from a spreadsheet and plot different charts using Altair.
The spreadsheet can be found here
import pandas as pd
xls_file = pd.ExcelFile('PET_PRI_SPT_S1_D.xls')
xls_file
crude_df = xls_file.parse('Data 1')
crude_df
I am setting the second row values as column headers of the data frame.
crude_df.columns = crude_df.iloc[1]
crude_df.columns
Index(['Date', 'Cushing, OK WTI Spot Price FOB (Dollars per Barrel)',
'Europe Brent Spot Price FOB (Dollars per Barrel)'],
dtype='object', name=1)
The following is a modified version of Altair code got from documentation examples
crude_df_header = crude_df.head(100)
import altair as alt
alt.Chart(crude_df_header).mark_circle().encode(
# Mapping the WTI column to y-axis
y='Cushing, OK WTI Spot Price FOB (Dollars per Barrel)'
)
This does not work.
Error is shown as
TypeError: Object of type datetime is not JSON serializable
How to make 2 D plots with this data?
Also, how to make plots for number of values exceeding 5000 in Altair? Even this results in errors.

Your error is due to the way you parsed the file. You have set the column name but forgot to remove the first two rows, including the ones which are now the column names. The presence of these string values resulted in the error.
The proper way of achieving what you are looking for will be as follow:
import pandas as pd
import altair as alt
crude_df = pd.read_excel(open('PET_PRI_SPT_S1_D.xls', 'rb'),
sheet_name='Data 1',index_col=None, header=2)
alt.Chart(crude_df.head(100)).mark_circle().encode(
x ='Date',
y='Cushing, OK WTI Spot Price FOB (Dollars per Barrel)'
)
For the max rows issue, you can use the following
alt.data_transformers.disable_max_rows()
But be mindful of the official warning
If you choose this route, please be careful: if you are making multiple plots with the dataset in a particular notebook, the notebook will grow very large and performance may suffer.

Convert all dates from a data frame column python

I have a csv file that have a column with the date that ppl get vaccinated, in format 'YYYY-MM-DD' as string. Then, my goal its add X days to the respective date, with X based on the vaccine that these person got. In order to add days to a date, i've to convert the string date to iso date, so i need to loop each element in that column conveting those dates. Im kinda new to Python and im not getting really right how do deal with it.
So i read and create a data frame with pandas, then i tryed as follow in the image:
df column content and for try
I dont know why im getting this error, i tryed different ways to deal with it but cant figure it out.
Thx

This is because the type of values is 'str,' and 'str' does not have 'fromisoformat' method. I would recommend you to convert a type of the values to 'datetime' instead of 'str,' so that you can do whatever you want regarding date calculation such as calculating X days from a specific date.
You can convert the values from 'str' to 'datetime' and do what you want as follows:
import pandas as pd
import datetime
df_reduzido['vacina_dataAplicacao'] = pd.to_datetime(df_reduzido['vacina_dataAplicacao'] , format='%Y-%m-%d')
df_reduzido['vacina_dataAplicacao'] = df_reduzido['vacina_dataAplicacao'] + datetime.datetime.timedelta(days=3)
print(df_reduzido['vacina_dataAplicacao']) # 3 days added
You can study how to deal with datetime in detail here: https://docs.python.org/3/library/datetime.html

Thanks for your help Sangkeun. Just want to point out that, for some reason, python was returning me error saying: "'AttributeError: type object 'datetime.datetime' has no attribute 'datetime'".
Then i've found a solution by calling
import datetime
from datetime import timedelta, date, datetime
Then using " + timedelta() ", like this:
df_reduzido['vacina_dataAplicacao'] = ( pd.to_datetime(df_reduzido['vacina_dataAplicacao'] , format='%Y-%m-%d', utc=False) + timedelta(days=10) ).dt.date
At the end, i set ().dt.date in order to rid off the time from pd.to_datetime(). Look that i tryed to set utc=False hoping that this would do the job but nothing happened. Anyway,
i'm grateful for your help.
Problem solved.

Need to sort a numpy array based on column value, but value is present in String format

I am a newbie in python and need to extract info from a csv file containing terrorism data.
I need to extract top 5 cities in India, having maximum casualities, where Casuality = Killed(given in CSV) + Wounded(given in CSV).
City column is also given in the CSV file.
Output format should be like below in descending order of casuality
city_1 casualty_1 city_2 casualty_2 city_3 casualty_3 city_4
casualty_4 city_5 casualty_5
Link to CSV- https://ninjasdatascienceprod.s3.amazonaws.com/3571/terrorismData.csv?AWSAccessKeyId=AKIAIGEP3IQJKTNSRVMQ&Expires=1554719430&Signature=7uYCQ6pAb1xxPJhI%2FAfYeedUcdA%3D&response-content-disposition=attachment%3B%20filename%3DterrorismData.csv
import numpy as np
import csv
file_obj=open("terrorismData.csv",encoding="utf8")
file_data=csv.DictReader(file_obj,skipinitialspace=True)
country=[]
killed=[]
wounded=[]
city=[]
final=[]
#Making lists
for row in file_data:
if row['Country']=='India':
country.append(row['Country'])
killed.append(row['Killed'])
wounded.append(row['Wounded'])
city.append(row['City'])
final.append([row['City'],row['Killed'],row['Wounded']])
#Making numpy arrays out of lists
np_month=np.array(country)
np_killed=np.array(killed)
np_wounded=np.array(wounded)
np_city=np.array(city)
np_final=np.array(final)
#Fixing blank values in final arr
for i in range(len(np_final)):
for j in range(len(np_final[0])):
if np_final[i][j]=='':
np_final[i][j]='0.0'
#Counting casualities(killed+wounded) and storing in 1st column of final array
for i in range(len(np_final)):
np_final[i,1]=float(np_final[i,1])+float(np_final[i,2])
#Descending sort on casualities column
np_final=np_final[np_final[:,1].argsort()[::-1]]
I expect np_final to get sorted on column casualities , but it's not happening because type(casualities) is coming as 'String'
Any help is appreciated.

I would offer for you to use Pandas. It would be easier for you to manipulate date.
Read everything to DataFrame. It should read numbers into number formats.
If you must to use np, while reading data, you could simply cast your values to float or integer and everything should work, if there are no other bugs.
Something like this:
for row in file_data:
if row['Country']=='India':
country.append(row['Country'])
killed.append(int(row['Killed']))
wounded.append(int(row['Wounded']))
city.append(row['City'])
final.append([row['City'],row['Killed'],row['Wounded']])

Slicing files in python with conditons

Suppose i have a txt. file that looks like this:
0 day0 event_data0
1 day1 event_data1
2 day2 event_data2
3 day3 event_data3
4 day4 event_data4
........
n dayn event_datan
#where:
#n is the event index
#dayn is the day when the event happened. year-month-day format
#event_datan is what happened at the event.
From this file, i need to create a new one with all the events that happened between two specific dates. like after september the 7th 2003 and before christmas 2006.
Could someone help me this problem? Much appreciated!

Looks like the datetime module is what you'll want. Iterate through the file line by line until the timedelta between the current line's date and your beginning threshold date (Sept 7, 2003 in your example) is positive; stop iterating when you breach Christmas 2006. Load the lines into either a pandas dataframe or numpy array.

Lucas, you can try this:
import re
import os
from datetime import datetime as dt
__date_start__ = dt.strptime('2003-09-07', "%Y-%m-%d").date()
__date_end__ = dt.strptime('2006-12-25', "%Y-%m-%d").date()
f = open('file.txt', 'r').read()
os.remove('events.txt')
for i in f:
date = re.search('\d{4}\-\d{2}-\d{2}',i).group(0)
if date != '':
date_converted = dt.strptime(date, '%Y-%m-%d').date()
if (date_converted > __date_start__) and (date_converted < __date_end__):
open('events.txt', 'a').write(i)
You will change __date_start__ and __date_end__ values to your desire interval, then, the code will search in lines a regex that match with the format of date yyyy-mm-dd. So on, it going to compare in range (date start & end) and, if true, append a events.txt file the content of line.

I assume your file is tab delimited so you can use the pandas package to read it. Just add a the first row with the column names (index, date, event) in your .txt file separated by tab and then read in the data.
df = pandas.read_csv('txt_file.txt', sep='\t', index_col=0)
#index_col=0 just sets your first column as index
After you've done so, follow the steps from this link. That will essentially answer your question on how to select events between two dates by simply using this package. That way you can return a new data frame only with those events you need.

You have not described that you want especially for "after September the 7th 2003 and before Christmas 2006." or you have other options for these two dates ?
if specially for "after september the 7th 2003 and before christmas 2006." then you can get result with regex module in my opinion :
import re
c=r"([0-9]{1,2}\s+)(2003-09-07).+(2006-12-25)\s+\w+"
with open("event.txt","r") as f:
file_data=f.readlines()
regex_search=re.search(c,str(file_data))
print(regex_search.group())
You can also use conditions with group() , or you can use findall() method.

Matplotlib: Import and plot multiple time series with legends direct from .csv

I have several spreadsheets containing data saved as comma delimited (.csv) files in the following format: The first row contains column labels as strings ('Time', 'Parameter_1'...). The first column of data is Time and each subsequent column contains the corresponding parameter data, as a float or integer.
I want to plot each parameter against Time on the same plot, with parameter legends which are derived directly from the first row of the .csv file.
My spreadsheets have different numbers of (columns of) parameters to be plotted against Time; so I'd like to find a generic solution which will also derive the number of columns directly from the .csv file.
The attached minimal working example shows what I'm trying to achieve using np.loadtxt (minus the legend); but I can't find a way to import the column labels from the .csv file to make the legends using this approach.
np.genfromtext offers more functionality, but I'm not familiar with this and am struggling to find a way of using it to do the above.
Plotting data in this style from .csv files must be a common problem, but I've been unable to find a solution on the web. I'd be very grateful for your help & suggestions.
Many thanks
"""
Example data: Data.csv:
Time,Parameter_1,Parameter_2,Parameter_3
0,10,0,10
1,20,30,10
2,40,20,20
3,20,10,30
"""
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('Data.csv', skiprows=1, delimiter=',') # skip the column labels
cols = data.shape[1] # get the number of columns in the array
for n in range (1,cols):
plt.plot(data[:,0],data[:,n]) # plot each parameter against time
plt.xlabel('Time',fontsize=14)
plt.ylabel('Parameter values',fontsize=14)
plt.show()

Here's my minimal working example for the above using genfromtxt rather than loadtxt, in case it is helpful for anyone else.
I'm sure there are more concise and elegant ways of doing this (I'm always happy to get constructive criticism on how to improve my coding), but it makes sense and works OK:
import numpy as np
import matplotlib.pyplot as plt
arr = np.genfromtxt('Data.csv', delimiter=',', dtype=None) # dtype=None automatically defines appropriate format (e.g. string, int, etc.) based on cell contents
names = (arr[0]) # select the first row of data = column names
for n in range (1,len(names)): # plot each column in turn against column 0 (= time)
plt.plot (arr[1:,0],arr[1:,n],label=names[n]) # omitting the first row ( = column names)
plt.legend()
plt.show()

The function numpy.genfromtxt is more for broken tables with missing values rather than what you're trying to do. What you can do is simply open the file before handing it to numpy.loadtxt and read the first line. Then you don't even need to skip it. Here is an edited version of what you have here above that reads the labels and makes the legend:
"""
Example data: Data.csv:
Time,Parameter_1,Parameter_2,Parameter_3
0,10,0,10
1,20,30,10
2,40,20,20
3,20,10,30
"""
import numpy as np
import matplotlib.pyplot as plt
#open the file
with open('Data.csv') as f:
#read the names of the colums first
names = f.readline().strip().split(',')
#np.loadtxt can also handle already open files
data = np.loadtxt(f, delimiter=',') # no skip needed anymore
cols = data.shape[1]
for n in range (1,cols):
#labels go in here
plt.plot(data[:,0],data[:,n],label=names[n])
plt.xlabel('Time',fontsize=14)
plt.ylabel('Parameter values',fontsize=14)
#And finally the legend is made
plt.legend()
plt.show()

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Problems with graphing excel data off an internet source with dates - excel

Related

Plotting graphs with Altair from a Pandas Dataframe

Convert all dates from a data frame column python

Need to sort a numpy array based on column value, but value is present in String format

Slicing files in python with conditons

Matplotlib: Import and plot multiple time series with legends direct from .csv

Categories

Resources