extract information from single cells from pandas dataframe - python-3.x

I'm looking to pull specific information from the table below to use in other functions. For example, extracting the volume on 1/4/16 to check whether the volume traded is > 1 million. Any thoughts on how to do this would be greatly appreciated.
import pandas as pd
import pandas.io.data as web # Package and modules for importing data; this code may change depending on pandas version
import datetime
start = datetime.datetime(2016, 1, 1)  # January 1, 2016
end = datetime.date.today()
apple = web.DataReader("AAPL", "yahoo", start, end)
type(apple)
apple.head()
Results:

The datareader will return a DataFrame with a DatetimeIndex, so you can use partial datetime string matching to select the specific row and column with loc:
apple.loc['2016-04-01', 'Volume']
To test whether this is larger than 1 million, just compare it:
apple.loc['2016-04-01', 'Volume'] > 1000000
which will return True or False.
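A minimal, self-contained sketch of the same idea, using hypothetical data in place of the Yahoo download (that reader has since been deprecated), so it runs on its own:
import pandas as pd
# Hypothetical stand-in for the DataReader result
idx = pd.to_datetime(['2016-03-31', '2016-04-01', '2016-04-04'])
apple = pd.DataFrame({'Volume': [950000, 1250000, 870000]}, index=idx)
volume = apple.loc['2016-04-01', 'Volume']  # scalar selection by label
print(volume > 1000000)  # True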

Related

Get second column of a data frame using pandas

I am new to Pandas in Python and I am having some difficulty returning the second column of a dataframe that has no column names, only numeric indexes.
import pandas as pd
import os
directory = 'A://'
sample = 'test.txt'
# Test with Air Sample
fileAir = os.path.join(directory,sample)
dataAir = pd.read_csv(fileAir,skiprows=3)
print(dataAir.iloc[:,1])
The data I am working with would be similar to:
data = [[1,2,3],[1,2,3],[1,2,3]]
Then, using pandas I wanted to have only
[[2,2,2]].
You can use dataframe_name[column_index].values, e.g. df[1].values, or dataframe_name['column_name'].values, e.g. df['col1'].values.
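A minimal runnable sketch using the sample data from the question; the column labels here are just pandas' default integer labels:
import pandas as pd
data = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
df = pd.DataFrame(data)
print(df.iloc[:, 1].values)  # array([2, 2, 2]) via positional indexing
print(df[1].values)          # the same values via the default integer column label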

Loop through a list of tickers with separate outputs

I have a list of tickers of which I would like to output individual datasets with financial information from pandas datareader.
I have tried to create a simple loop that takes a list of tickers and inputs it into the pandas datareader function.
import pandas as pd
import pandas_datareader as pdr
myTickers = ['AAPL', 'PG']
for ticks in myTickers:
    print(ticks)
    ticks = pdr.DataReader(ticks, 'yahoo', start='2019-01-01', end='2019-01-08')['Adj Close']
The problem here seems to be that the loop only substitutes the myTickers values inside the DataReader function; it does not change the name of the dataframe from "ticks" to e.g. AAPL, so every result is overwritten by whichever ticker is processed last.
What do I need to modify in order for this loop to output two different dataframes with the names in the ticker list?
You can save everything into one DataFrame and then retrieve each column as its own DataFrame with a small function.
import pandas as pd
import pandas_datareader as pdr
myTickers = ['AAPL', 'PG']
df = pd.DataFrame(columns=myTickers)
for ticks in myTickers:
    df[ticks] = pdr.DataReader(ticks, 'yahoo', start='2019-01-01', end='2019-01-08')['Adj Close']

def ticks(s):
    return df[s].to_frame()

ticks('AAPL')
Output:
                  AAPL
Date
2019-01-02  156.049484
2019-01-03  140.505798
2019-01-04  146.503891
2019-01-07  146.177811
2019-01-08  148.964386
ticks('PG')
Output:
                   PG
Date
2019-01-02  89.350105
2019-01-03  88.723633
2019-01-04  90.534523
2019-01-07  90.172348
2019-01-08  90.505157
As you pointed out, the loop variable is forgotten, so it needs to be stored somewhere. You could replace ticks in myTickers with its corresponding DataFrame. However, a reference to the ticker will be useful. Perhaps the following may help.
tickers_df_dict = {
    ticks: pdr.DataReader(ticks, 'yahoo', start='2019-01-01', end='2019-01-08')['Adj Close']
    for ticks in myTickers
}
That being said, as far as I'm aware, using the yahoo API will result in the following error. You may need to revise your chosen data source.
ImmediateDeprecationError:
Yahoo Daily has been immediately deprecated due to large breaks in the API without the
introduction of a stable replacement. Pull Requests to re-enable these data
connectors are welcome.
See https://github.com/pydata/pandas-datareader/issues
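A minimal sketch of the dictionary approach against a different source; 'stooq' is used here purely as an illustrative assumption (check which sources your pandas_datareader version supports, and note that its column names differ from Yahoo's, so there is no 'Adj Close' selection):
import pandas_datareader as pdr
myTickers = ['AAPL', 'PG']
# One DataFrame per ticker, keyed by the ticker symbol
tickers_df_dict = {
    ticks: pdr.DataReader(ticks, 'stooq', start='2019-01-01', end='2019-01-08')
    for ticks in myTickers
}
print(tickers_df_dict['AAPL'].head())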

Use Pandas to extract the values from a column based on some condition

I'm trying to pick a particular column from a csv file using Python's Pandas module, where I would like to fetch the Hostname whenever the Group column is SJ or DC.
Below is what I'm trying but it's not printing anything:
import csv
import pandas as pd
pd.set_option('display.height', 500)
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 500)
low_memory=False
data = pd.read_csv('splnk.csv', usecols=['Hostname', 'Group'])
for line in data:
    if 'DC' and 'SJ' in line:
        print(line)
The data variable contains the values for Hostname & Group columns as follows:
11960 NaN DB-Server
11961 DC Sap-Server
11962 SJ comput-server
Note: when printing the data, the output is truncated, so the complete data is not shown.
PS: I have used pd.set_option to get the complete data on the terminal!
for line in data: doesn't iterate over row contents; it iterates over the column names. Pandas has several good ways to filter rows by their contents.
For example, you can use Series.isin() to select rows matching one of several values:
print(data[data['Group'].isin(['DC', 'SJ'])]['Hostname'])
If it's important that you iterate over rows, you can use df.iterrows():
for index, row in data.iterrows():
    if row['Group'] == 'DC' or row['Group'] == 'SJ':
        print(row['Hostname'])
If you're just getting started with Pandas, I'd recommend trying a tutorial to get familiar with the basic structure.
Try this:
import csv
import pandas as pd
import numpy as np  # numpy is not needed here; you can remove this import
low_memory=False
data = pd.read_csv('splnk.csv', usecols=['Hostname', 'Group'])
hostnames = data[(data['Group']=='DC') | (data['Group']=='SJ')]['Hostname'] # corrected the `hostname` to `Hostname`
print(hostnames)
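A self-contained sketch with hypothetical data shaped like the question's columns, showing both the isin() filter and the boolean OR filter:
import pandas as pd
# Hypothetical stand-in for splnk.csv
data = pd.DataFrame({
    'Hostname': ['DB-Server', 'Sap-Server', 'comput-server'],
    'Group': [None, 'DC', 'SJ'],
})
print(data[data['Group'].isin(['DC', 'SJ'])]['Hostname'])
print(data[(data['Group'] == 'DC') | (data['Group'] == 'SJ')]['Hostname'])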

Unable to Parse pandas Series to datetime

I'm importing a csv file which contains a datetime column. After importing the csv, my data frame contains the Date column, whose type is pandas.Series, and I need another column that will contain the weekday:
import pandas as pd
from datetime import datetime
data = pd.read_csv("C:/Users/HP/Desktop/Fichiers/Proj/CONSOMMATION_1h.csv")
print(data.head())
all the data are okay, but when I do the following:
data['WDay'] = pd.to_datetime(data['Date'])
print(type(data['WDay']))
# the output is
<class 'pandas.core.series.Series'>
the data is not converted to datetime, so I can't get the weekday.
The problem is that you need dt.weekday, i.e. the .dt accessor:
data['WDay'] = data['WDay'].dt.weekday
Without .dt, weekday is used on a DatetimeIndex (not your case) - DatetimeIndex.weekday:
data['WDay'] = data.index.weekday
Use data.dtypes to check the types of the columns.
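A minimal sketch of the full round trip on hypothetical data (the 'Date' column name is taken from the question; the sample values are made up):
import pandas as pd
data = pd.DataFrame({'Date': ['2021-03-01 00:00', '2021-03-02 01:00', '2021-03-06 02:00']})
data['WDay'] = pd.to_datetime(data['Date']).dt.weekday  # Monday=0 ... Sunday=6
print(data)
print(data.dtypes)  # WDay has an integer dtype: it holds weekday numbers, not datetimes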

Python - Filtering Pandas Timestamp Index

Given Timestamp indices with many entries per day, how can I get a list containing only the last Timestamp of each day? So if I have this:
import pandas as pd
all = [pd.Timestamp('2016-05-01 10:23:45'),
       pd.Timestamp('2016-05-01 18:56:34'),
       pd.Timestamp('2016-05-01 23:56:37'),
       pd.Timestamp('2016-05-02 03:54:24'),
       pd.Timestamp('2016-05-02 14:32:45'),
       pd.Timestamp('2016-05-02 15:38:55')]
I would like to get:
# End of Day:
EoD = [pd.Timestamp('2016-05-01 23:56:37'),
       pd.Timestamp('2016-05-02 15:38:55')]
Thx in advance!
Try pandas groupby
all = pd.Series(all)
all.groupby([all.dt.year, all.dt.month, all.dt.day]).max()
You get:
2016  5  1   2016-05-01 23:56:37
         2   2016-05-02 15:38:55
I've created an example dataframe.
import pandas as pd
all = [pd.Timestamp('2016-05-01 10:23:45'),
       pd.Timestamp('2016-05-01 18:56:34'),
       pd.Timestamp('2016-05-01 23:56:37'),
       pd.Timestamp('2016-05-02 03:54:24'),
       pd.Timestamp('2016-05-02 14:32:45'),
       pd.Timestamp('2016-05-02 15:38:55')]
df = pd.DataFrame({'values': 0}, index=all)
Assuming your data frame is structured like the example and, most importantly, is sorted by its index, the code below should help.
for date in set(df.index.date):
    print(df[df.index.date == date].iloc[-1, :])
For each unique date in your dataframe, this code returns the last row of that day's slice, so as long as the index is sorted it gives you the last record of the day. And hey, it's pythonic. (I believe so at least.)
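As a compact alternative (a sketch of the same idea, not from either answer above): group the timestamps by calendar date and take the maximum of each group, which yields the desired EoD list directly:
import pandas as pd
s = pd.Series([pd.Timestamp('2016-05-01 10:23:45'),
               pd.Timestamp('2016-05-01 18:56:34'),
               pd.Timestamp('2016-05-01 23:56:37'),
               pd.Timestamp('2016-05-02 03:54:24'),
               pd.Timestamp('2016-05-02 14:32:45'),
               pd.Timestamp('2016-05-02 15:38:55')])
EoD = s.groupby(s.dt.date).max().tolist()
print(EoD)  # [Timestamp('2016-05-01 23:56:37'), Timestamp('2016-05-02 15:38:55')]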
