I am trying to create a new column on my stock market DataFrame that was imported from Yahoo. I am dealing with just one symbol at the moment.
symbol['profit']= [[symbol.loc[ei, 'close1']-symbol.loc[ei, 'close']] if symbol[ei, 'shares']==1 else 0 for ei in symbol.index]
I am expecting to have a new column in the dataframe labeled 'profit', but instead I am getting this as an output:
KeyError: (Timestamp('1993-01-29 00:00:00'), 'shares')
I imported the CSV to a DataFrame with
parse_dates=True
index_col='Date'
setting the 'Date' column as a DatetimeIndex, which has been working. I am not sure how to overcome this roadblock at the moment. Any help would be appreciated!
In your if statement, you forgot the .loc:
symbol['profit']= [symbol.loc[ei, 'close1']-symbol.loc[ei, 'close'] if symbol.loc[ei, 'shares']==1 else 0 for ei in symbol.index]
Also, in pandas we usually try to avoid for loops as much as we can:
symbol['profit']=np.where(symbol.shares==1,symbol.close1-symbol.close,0)
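As a quick check, here is a minimal, self-contained version of that vectorized approach on toy data (the column values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the question's columns
symbol = pd.DataFrame({
    "close":  [10.0, 11.0, 12.0],
    "close1": [11.0, 10.5, 13.0],
    "shares": [1, 0, 1],
})

# Vectorized: profit where shares == 1, else 0
symbol["profit"] = np.where(symbol["shares"] == 1,
                            symbol["close1"] - symbol["close"], 0)
```

np.where evaluates the whole condition column at once, so there is no per-row Python loop at all.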
I think it may be related to the fact that Jan 29th, 1993 was a Saturday
Try shifting the date to the next trading day
I have a very simple problem, I guess. I have loaded a CSV file into Python of the form:
Date        Time
18.07.2018  12:00 AM
18.07.2018  12:30 AM
...         ...
19.07.2018  12:00 AM
19.07.2018  12:30 AM
...         ...
I basically just want to extract all rows with the date 18.07.2018, plus the single row from 19.07.2018 at 12:00 AM, to calculate some statistical measures from the data.
My current Code (Klimadaten is the Name of the Dataframe):
Klimadaten = pd.read_csv("Klimadaten_18-20-July.csv")
Day_1 = Klimadaten[Klimadaten.Date == "18.07.2018"]
I guess it could be solved with something like an if statement?
I have just a very basic knowledge of Python, but I'm willing to learn the necessary steps. I'm currently doing my bachelor's thesis with simulated climate data, and I will have to perform statistical tests and work with a lot of data, so maybe someone could also tell me which concepts I should look into further (I have access to an online Python course but will not have the time to watch all the lessons).
Thanks in advance
From what I understand, you want to take only the data for the date 18.07.2018. The example below prints the date only if it is equal to 18.07.2018; by changing the row and column indices you can search an entire column or an entire row (depending on your spreadsheet layout).
for i in range(len(Klimadaten)):
    date = Klimadaten.values[i][0]
    if date == "18.07.2018":
        print(date)
        element = Klimadaten.values[i][1]
        print(element)
Hope I was helpful.
Klimadaten = pd.read_csv("Klimadaten_18-20-July.csv")
Day_1 = Klimadaten[(Klimadaten.Date == "18.07.2018") | ((Klimadaten.Date == "19.07.2018") & (Klimadaten.Time == "12:00 AM"))]
Basically what it means is: bring me all the rows whose date is 18.07.2018 OR (date is 19.07.2018 AND time is 12:00 AM), so you can construct more complex queries like that :)
With the help of the pandas documentation I figured out the right syntax for my problem:
Day_1 = Klimadaten[(Klimadaten["Date"] == "18.07.2018") | ((Klimadaten["Date"] == "19.07.2018") & (Klimadaten["Time"] == "12:00:00 AM"))]
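For reference, a self-contained version of that filter on toy data (the rows here are made up to match the sample layout above):

```python
import pandas as pd

# Toy frame standing in for the CSV sample
Klimadaten = pd.DataFrame({
    "Date": ["18.07.2018", "18.07.2018", "19.07.2018", "19.07.2018"],
    "Time": ["12:00 AM", "12:30 AM", "12:00 AM", "12:30 AM"],
})

# All of 18.07.2018, plus the single 12:00 AM row from 19.07.2018
Day_1 = Klimadaten[(Klimadaten["Date"] == "18.07.2018") |
                   ((Klimadaten["Date"] == "19.07.2018") &
                    (Klimadaten["Time"] == "12:00 AM"))]
```

Note the parentheses around each comparison: & and | bind tighter than ==, so pandas needs the explicit grouping.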
First of all, thanks in advance; there are always answers here, so we learn a lot from the experts. I'm a noob using pandas (it's super handy for what I've tried and achieved so far).
I have this data, handed to me as-is (I don't have access to the origin), 20k rows or more sometimes. The 'in' and 'out' columns may have one or more entries per date, so after an 'in' the next entry could be an 'out' or another 'in', which leaves me a blank cell; that's the problem (see first image).
I want to filter the first datetime-in into one column and the last datetime-out into another, but both in one row (see second image); the data comes in a CSV file. I am currently doing this work manually with LibreOffice Calc (yep).
So far I have tried locating and relocating, merging, grouping... nothing works for me, so I feel frustrated. Would you please lend me a hand? Here is a minimal sample of the file.
By the way, English is not my language. Thanks so much!
First:
out_column = df["out"].tolist()
This gives you all the out dates as a list, we will need that later.
in_column = df["in"].tolist()  # "in" is a Python keyword, so I suggest renaming that column
I treat NaT as NaN (null) in this case.
Now we have to find which rows to keep, which we do by going through the in column and only keeping the first row and each row that follows a NaN:
filtered_df = []
tracker = False
for index, element in enumerate(in_column):
    if index == 0 or tracker is True:
        filtered_df.append(True)
        tracker = False
        continue
    if pd.isna(element):  # NaT/NaN are not None, so "is None" would miss them
        tracker = True
    filtered_df.append(False)
Then you filter your df by this Boolean List:
df = df[filtered_df]
Now you fix up your out column by removing the null values:
out_column = [x for x in out_column if pd.notna(x)]
Last but not least you overwrite your old out column with the new one:
df["out"] = out_column
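Putting the steps above together, here is a minimal runnable sketch; the in/out times are hypothetical stand-ins for the real file:

```python
import pandas as pd

# Hypothetical frame: alternating 'in' and 'out' events, blanks as None/NaN
df = pd.DataFrame({
    "in":  ["08:00", None, "08:30", None],
    "out": [None, "17:00", None, "17:15"],
})

in_column = df["in"].tolist()
# Drop the nulls from the out column up front
out_column = [x for x in df["out"].tolist() if pd.notna(x)]

# Keep the first row and every row that follows a blank 'in'
filtered_df = []
tracker = False
for index, element in enumerate(in_column):
    if index == 0 or tracker:
        filtered_df.append(True)
        tracker = False
        continue
    if pd.isna(element):
        tracker = True
    filtered_df.append(False)

df = df[filtered_df].copy()
df["out"] = out_column  # positional overwrite: lengths now match
```

Each kept row now pairs an 'in' time with the corresponding 'out' time.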
I have a column of times expressed as seconds since Jan 1, 1990, that I need to convert to a DateTime. I can figure out how to do this for a constant (e.g. add 10 seconds), but not a series or column.
I eventually tried writing a loop to do this one row at a time. (Probably not the right way, and I'm new to Python.)
This code works for a single row:
from datetime import datetime, timedelta

def addSecs(secs):
    fulldate = datetime(1990, 1, 1)
    fulldate = fulldate + timedelta(seconds=secs)
    return fulldate
b= addSecs(intag112['outTags_1_2'].iloc[1])
print(b)
2018-06-20 01:05:13
Does anyone know an easy way to do this for a whole column in a dataframe?
I tried this:
for i in range(len(intag112)):
    intag112['TransactionTime'].iloc[i] = addSecs(intag112['outTags_1_2'].iloc[i])
but it errored out.
If you want to do something with a column (Series) in a DataFrame you can use the apply method, for example:
import datetime
# New column 'datetime' is created from the old 'seconds' column.
# Note: datetime.fromtimestamp would count from the 1970 Unix epoch; your base is 1990,
# so add a timedelta to an explicit 1990-01-01 start instead.
df['datetime'] = df['seconds'].apply(lambda x: datetime.datetime(1990, 1, 1) + datetime.timedelta(seconds=x))
Check documentation for more examples. Overall advice - try to think in terms of vectors (or series) of values. Most operations in pandas can be done with entire series or even dataframe.
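Following that advice, the whole conversion can be done with one vectorized call: pd.to_datetime accepts a custom origin, so no apply is needed at all (the sample seconds below are made up):

```python
import pandas as pd

# Hypothetical column of seconds counted from 1990-01-01
df = pd.DataFrame({"seconds": [0, 60, 86400]})

# unit='s' with a custom origin converts the whole column at once
df["datetime"] = pd.to_datetime(df["seconds"], unit="s",
                                origin=pd.Timestamp("1990-01-01"))
```

This is both shorter and much faster than looping or applying row by row.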
So I have a rather easy question regarding some plotting issues. I don't have the greatest level of Python knowledge; it's been a few months since I last looked at it, and there isn't anything I can see that would help me.
I have the following data frame:
Date Open High Low Close Adj Close Volume
0 11/01/2018 86.360001 87.370003 85.930000 86.930000 86.930000 143660001
1 10/01/2018 87.000000 87.190002 85.980003 86.080002 86.080002 108223002
This isn't all of the data; there's 3000+ rows of it.
QUESTION: I'm trying to plot Adj Close vs. Date. However, due to the index column, which I don't actually want, I end up with a plot of Adj Close vs. the index column. No use obviously.
I've used:
bp['Adj Close'].plot(label='BP',figsize=(16,8),title='Adjusted Closing Price')
So really it's a case of, where do I put the ['Date'] part into the code, so the Index column isn't used?
Many thanks for any help.
You first need to convert the column with to_datetime:
bp['Date'] = pd.to_datetime(bp['Date'])
and then use the x and y parameters in DataFrame.plot:
bp.plot(x='Date', y='Adj Close', label='BP',figsize=(16,8),title='Adjusted Closing Price')
Or set_index from column Date and then use Series.plot:
bp.set_index('Date')['Adj Close'].plot(label='BP',figsize=(16,8),title='Adjusted Closing Price')
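As a quick sanity check, here is a toy version of that flow on two made-up rows; the day-first date format is an assumption based on the sample shown:

```python
import pandas as pd

# Toy frame with the question's column layout
bp = pd.DataFrame({
    "Date": ["11/01/2018", "10/01/2018"],
    "Adj Close": [86.93, 86.08],
})

bp["Date"] = pd.to_datetime(bp["Date"], dayfirst=True)  # assuming day/month/year
series = bp.set_index("Date")["Adj Close"]
# series.plot(label='BP', figsize=(16, 8), title='Adjusted Closing Price')
# would now put the dates on the x-axis instead of the integer index
```

Once the index is a DatetimeIndex, pandas plots against the dates automatically.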
I am attempting to create a new column of values in a Pandas dataframe that are calculated from another column in the same dataframe:
df['ema_ideal'] = df['Adj Close'].ewm(span=df['ideal_moving_average'], min_periods=0, ignore_na=True).mean
However, I am receiving the error:
ValueError: The truth of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any(), or a.all().
If I have the span set to 30, or some integer, I do not receive this error. Also, ideal_moving_average is a column of float.
My two questions are:
Why exactly am I receiving the error?
How can I incorporate the column values from ideal_moving_average into the df['ema_ideal'] column (subquestion as I am new to Pandas - is this column a Series within the dataframe?)
Thanks for the help!
EDIT: Example showing Adj Close data, in bad formatting
Date Open High Low Close Adj Close
2017-01-03 225.039993 225.830002 223.880005 225.240005 222.073914
2017-01-04 225.619995 226.750000 225.610001 226.580002 223.395081
2017-01-05 226.270004 226.580002 225.479996 226.399994 223.217606
2017-01-06 226.529999 227.750000 225.899994 227.210007 224.016220
2017-01-09 226.910004 227.070007 226.419998 226.460007 223.276779
2017-01-10 226.479996 227.449997 226.009995 226.460007 223.276779
I think something like this will work for you:
df['ema_ideal'] = df.apply(lambda row: df['Adj Close'].ewm(span=row['ideal_moving_average'], min_periods=0, ignore_na=True).mean().loc[row.name], axis=1)
Providing axis=1 to DataFrame.apply lets you access the data row-wise like you need; taking .loc[row.name] keeps only that row's value from the resulting EWM series, so the result is a single column rather than a frame.
There's absolutely no issue creating a DataFrame column from another column of the same DataFrame.
The error you're receiving is something different: it's raised when you try to combine Series with Python's logical operators such as and, or, not, etc.
In general, to avoid this error you must compare Series element-wise, using for example & instead of and, or ~ instead of not, or using numpy to do element-wise comparisons.
Here, the issue is that you're trying to use a Series as the span of your EWM, and pandas' ewm function only accepts an integer span.
You could, for example, calculate the EMA for each possible period, then regroup them into a Series that you set as the ema_ideal column of your dataframe.
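That idea can be sketched roughly like this; the data and column names are assumed from the question:

```python
import pandas as pd

# Hypothetical data: each row carries its own ideal span
df = pd.DataFrame({
    "Adj Close": [10.0, 11.0, 12.0, 13.0, 14.0],
    "ideal_moving_average": [2, 2, 3, 3, 2],
})

# Compute the full EWM series once per distinct span...
emas = {span: df["Adj Close"].ewm(span=span, min_periods=0, ignore_na=True).mean()
        for span in df["ideal_moving_average"].unique()}

# ...then pick each row's value from the series matching that row's span
df["ema_ideal"] = [emas[span].iloc[i]
                   for i, span in enumerate(df["ideal_moving_average"])]
```

Computing one EWM per distinct span and indexing into it avoids recomputing the whole series for every row.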
For anyone wondering, the problem was that span cannot take multiple values, which was happening when I tried to pass df['ideal_moving_average'] into it. Instead, I used the code below, which passes the single value from the last row of ideal_ma into span.
df['30ema'] = df['Adj Close'].ewm(span=df.iloc[-1]['ideal_ma'], min_periods=0, ignore_na=True).mean()
EDIT: I will accept this as correct for now, until someone shows that it doesn't work or can create something better, thanks for the help.