pandas apply over a single column - python-3.x

I have a few columns in a dataframe, read from a CSV file, that appear to contain a mix of nan and strings (the dataframe also has a few other float columns with some nan values), for example:
[nan '12/31/1990 12:00:00 AM' '06/03/1991 12:00:00 AM'
'09/15/1991 12:00:00 AM' '11/11/1991 12:00:00 AM']
I would like to convert this to
[nan '12/31/1990' '06/03/1991'
'09/15/1991' '11/11/1991']
This question is in four parts:
Suppose I want to strip the time from the strings in the example above, for example by using the function
def rem_t_from_d(x): return x.split(sep = ' ')[0]
How can I adapt the function above to handle the nan type (float)? Would any missing value in a string column (when read from the CSV) be converted to nan (float), even if the rest of the column is strings?
How can I apply this function over a single column of a pandas dataframe? In the pandas documentation, the structure of the function is given as
DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
but I don't see any way to apply it to just one column of the dataframe.
How can I check if an element in the dataframe is nan? The documentation gives examples of how to check whether there are nan values in a whole column (e.g. using .notnull()) and how to assign an element to nan (e.g. using = np.nan), but not how to check a single element. I tried using np.isnan to check for nan, but that raises a TypeError on the string elements.
Also, does pandas have an equivalent of R's NA_integer_, NA_real_, NA_character_, NA_complex_, such that an entire column can be designated as a string type if needed, instead of a mix of string and float?
I am using Python 3.4
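For reference, a minimal sketch covering the first three parts (the column name date_col is invented for this example; pd.isnull works on scalars where np.isnan raises a TypeError on strings, and selecting a single column gives a Series with its own .apply):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"date_col": [np.nan, "12/31/1990 12:00:00 AM",
                                "06/03/1991 12:00:00 AM"]})

def rem_t_from_d(x):
    # pd.isnull handles the float nan that pandas uses for missing
    # values in string columns; np.isnan(x) would raise a TypeError
    # when x is a string
    if pd.isnull(x):
        return x
    return x.split(sep=" ")[0]

# Selecting one column yields a Series, and Series.apply maps the
# function over that column only
df["date_col"] = df["date_col"].apply(rem_t_from_d)
```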

Related

Date stuck as unformattable in pandas dataframe

I am trying to plot time series data and my date column is stuck like this, and I cannot seem to figure out what datatype it is to change it, as adding verbose = True doesn't yield any explanation for the data.
[Screenshot of the malformed Date output omitted]
Here is the code I have for importing the dataset and all the formatting I've done to it. Ultimately, I want date and time to be separate values, but I cannot figure out why it's auto formatting it and applying today's date.
df = pd.read_csv('dataframe.csv')
df['Avg_value'] = (df['Open'] + df['High'] + df['Low'] + df['Close'])/4
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_datetime(df['Time'])
Any help would be appreciated
The output I'd be looking for would be something like:
Date: 2016-01-04 10:00:00
As one column rather than 2 separate ones.
When you pass a Pandas Series into pd.to_datetime(...), it parses the values and returns a new Series of dtype datetime64[ns] containing the parsed values:
>>> pd.to_datetime(pd.Series(["12:30:00"]))
0 2021-08-24 12:30:00
dtype: datetime64[ns]
The reason you see today's date is that a datetime64 value must have both date and time information. Since the date is missing, the current date is substituted by default.
Pandas does not really have a dtype that supports "just time of day, no date" (there is one that supports time intervals, such as "5 minutes"). The Python standard library does, however: the datetime.time class.
Pandas provides a convenience accessor, .dt, for extracting a Series of datetime.time objects from a Series of datetime64 objects:
>>> pd.to_datetime(pd.Series(["12:30:00"])).dt.time
0 12:30:00
dtype: object
Note that the dtype is just object which is the fallback dtype Pandas uses for anything which is not natively supported by Pandas or NumPy. This means that working with these datetime.time objects will be a bit slower than working with a native Pandas dtype, but this probably won't be a bottleneck for your application.
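For the single combined column the question asks for, one approach (a sketch; the Date and Time column names are taken from the snippet in the question) is to join the two text columns before parsing, so the time attaches to its real date rather than today's:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2016-01-04"], "Time": ["10:00:00"]})

# Parse once from the concatenated strings; parsing the Time column
# on its own would substitute today's date
df["Datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"])
```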
Recommended reference: https://pandas.pydata.org/docs/user_guide/timeseries.html

Resample-interpolate Pandas Dataframe, interpolate to NaN

I have a Pandas dataframe with datetime indices and a single column. The datetime indices are not regularly spaced (but ordered), and the corresponding values in the column are either numbers or NaN.
My goal is to interpolate these entries to hourly datetimes. I can do so with the following snippet:
df = dataframe.resample('H')
interpolated = df.interpolate(method='linear')
where dataframe is my raw dataframe, and interpolated is the interpolated dataframe.
This works perfectly well, but the problem is that this seems to interpolate over the NaN values, which is not what I want. Ideally, I'd like it to return NaN whenever it tries to interpolate between two entries at least one of which is NaN (so interpolating between NaN and NaN returns NaN; between NaN and 5 returns NaN; halfway between 3 and 5 returns 4).
I could code my own brute-force interpolator which does this, but I'd much rather get my grip on Pandas. This seems like a pretty straightforward problem - does anyone know a method to achieve this?
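No built-in interpolate option does exactly this, but here is one sketch of a masked interpolation (the hourly grid and the mask logic are my own construction, not a pandas feature): interpolate normally, then blank out every hourly slot whose bracketing original observations include a NaN.

```python
import numpy as np
import pandas as pd

# Irregularly spaced datetime index; the 04:00 observation is NaN
idx = pd.to_datetime(["2021-01-01 00:00", "2021-01-01 02:00",
                      "2021-01-01 04:00", "2021-01-01 06:00"])
s = pd.Series([3.0, 5.0, np.nan, 7.0], index=idx)

# Upsample to an hourly grid; new slots start out as NaN placeholders
hourly = s.resample("h").asfreq()

# Track which ORIGINAL observations were NaN (1.0) vs present (0.0)
flags = s.isna().astype(float).resample("h").asfreq()

# A slot is "bad" if either bracketing original observation was NaN
bad = (flags.ffill() > 0) | (flags.bfill() > 0)

# Interpolate normally, then blank out the bad stretches again
result = hourly.interpolate(method="time").mask(bad)
```

Between the valid 00:00 and 02:00 observations this yields the ordinary linear value, while the whole stretch touching the NaN at 04:00 stays NaN.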

Column of NaN created when concatenating series into dataframe

I've created an output variable a = pd.Series(), then run a number of simulations in a for loop, appending each result (temporarily stored in x) to a as a new column renamed to the simulation number, starting at the zero-th position, using the following code:
a = pd.concat([a, x.rename(sim_count)], axis=1)
For some reason, the resulting dataframe includes a column of "NaN" values to the left of my first column of simulated results that I can't get rid of, as follows (example shows the results of three simulations):
0 0 1 2
0 NaN 0.136799 0.135325 -0.174987
1 NaN -0.010517 0.108798 0.003726
2 NaN 0.116757 0.030352 0.077443
3 NaN 0.148347 0.045051 0.211610
4 NaN 0.014309 0.074419 0.109129
Any idea how to prevent this column of NaN values from being generated?
Basically, by creating your output variable via pd.Series() you are creating an empty dataset. This is carried over in the concatenation: the empty Series is given the same number of rows as x, and the only way pandas can represent an "empty" column of that length is with NaN values. When you concatenate you are effectively saying: add my new series alongside the "empty" series...and the empty series just becomes a column of NaN.
A more effective way of doing this is to initialize "a" as an empty DataFrame, then concatenate:
a = pd.DataFrame()
a = pd.concat([a, x.rename(sim_count)], axis=1)
You might be asking yourself why this works while pd.Series() forces a column of NaNs. My understanding is that the empty DataFrame is simply an empty container for the data to be added (i.e. you are putting your new data INTO an empty dataframe), whereas when you do pd.concat([pd.Series(), x.rename(sim_count)], axis=1) you are telling pandas that the empty series is a real, zero-length column that should be retained, and that the new data should be added alongside it. Hence the column of NaNs.
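An even more idiomatic pattern (a sketch; sim_count and the simulated values here are stand-ins) is to collect the renamed Series in a plain list and concatenate once at the end, which avoids the empty placeholder entirely and is faster than repeated concatenation:

```python
import pandas as pd

results = []
for sim_count in range(3):
    # Stand-in for a real simulation result
    x = pd.Series([sim_count * 1.0, sim_count * 2.0])
    results.append(x.rename(sim_count))

# One concat at the end; no empty Series, so no NaN column
a = pd.concat(results, axis=1)
```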

How to obtain the difference of values of two specific dates with Pandas [duplicate]

This questions is similar to Python: Pandas Divide DataFrame by first row
I have a DataFrame which looks like this:
1125400 5430095 1095751
2013-04-02 98.91 NaN 5626.79
2013-04-03 99.29 NaN 5727.53
2013-04-04 99.79 NaN 5643.75
2013-04-07 100.55 NaN 5630.78
2013-04-08 100.65 NaN 5633.77
I would like to divide the values of the last row by the values of the first row in order to obtain the percentage difference over time.
A clearer way is to use iloc (note the order: the last row divided by the first, as the question asks):
df.iloc[-1] / df.iloc[0]
Alternatively, take the last row and the first row values, then divide them, like this: df.T[df.index[-1]] / df.T[df.index[0]]
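A quick sketch on toy data (the values are made up, and the all-NaN middle column from the question is omitted) showing the iloc version:

```python
import pandas as pd

df = pd.DataFrame({"a": [100.0, 110.0], "b": [50.0, 45.0]},
                  index=["2013-04-02", "2013-04-08"])

# Ratio of the last row to the first row, column by column
ratio = df.iloc[-1] / df.iloc[0]
```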

string to pandas dataframe

Following the parsing of a large PDF document, I end up with a string in the following format in Python:
Company Name;(Code) at End of Month;Reason for Alteration No. of Shares;Bond Symbol, etc.; Value, etc.; after Alteration;Remarks
Shares;Shares
TANSEISHA CO.,LTD.;(9743)48,424,071;0
MEITEC CORPORATION;(9744)31,300,000;0
TKC Corporation;(9746)26,731,033;0
ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0;May 2013 Resolution based 1;0Shares
May 2013 Resolution based 2;0Shares
Would it be possible to transform this into a pandas dataframe, where the columns are delimited by the ";"? Looking at the above section of the string, my df should look like:
Company Name (Code) at End of Month Reason for Alteration ....
Value,etc after Alteration Remarks Shares .....
As an additional problem, my rows don't always have the same number of strings delimited by ";", meaning I need a way to handle a variable number of columns (I don't mind setting up a dataframe with, say, 15 columns and deleting the ones I don't need afterwards).
Thanks
This is a nice opportunity to use StringIO to make your result look like an open file handle so that you can just use pd.read_csv:
In [1]: import pandas as pd
In [2]: from io import StringIO
In [3]: s = """Company Name;(Code) at End of Month;Reason for Alteration No. of Shares;Bond Symbol, etc.; Value, etc.; after Alteration;Remarks
...: Shares;Shares
...: TANSEISHA CO.,LTD.;(9743)48,424,071;0
...: MEITEC CORPORATION;(9744)31,300,000;0
...: TKC Corporation;(9746)26,731,033;0
...: ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0;May 2013 Resolution based 1;0Shares
...: May 2013 Resolution based 2;0Shares"""
In [4]: pd.read_csv(StringIO(s), sep=";")
Out [4]: Company Name (Code) at End of Month Reason for Alteration No. of Shares Bond Symbol, etc. Value, etc. after Alteration Remarks
0 Shares Shares NaN NaN NaN NaN NaN
1 TANSEISHA CO.,LTD. (9743)48,424,071 0 NaN NaN NaN NaN
2 MEITEC CORPORATION (9744)31,300,000 0 NaN NaN NaN NaN
3 TKC Corporation (9746)26,731,033 0 NaN NaN NaN NaN
4 ASATSU-DK INC. (9747) 42,155,400 Exercise of Subscription Warrants 0.0 May 2013 Resolution based 1 0Shares
5 May 2013 Resolution based 2 0Shares NaN NaN NaN NaN NaN
Note that it does look like there are some obvious data cleanup problems to tackle from here, but that should at least give you a start.
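For the ragged-row concern in the question, one workaround (a sketch; the data here is a stand-in, and the oversized column count echoes the question's "15 columns" idea) is to hand read_csv an explicit, generously sized names list so short rows get padded with NaN, then drop the all-NaN columns:

```python
import pandas as pd
from io import StringIO

s = "a;b\n1;2;3\n4;5;6;7\n"

# More names than any row has fields: short rows are padded with NaN
cols = [f"col{i}" for i in range(15)]
df = pd.read_csv(StringIO(s), sep=";", header=None, names=cols)

# Columns that never received any data are entirely NaN; drop them
df = df.dropna(axis=1, how="all")
```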
I would split your read-in string into a list of lists. Possibly use a regex to find the beginning of each record (or at least match something you know shows up in every record; it looks like the "(code)amount" element, e.g. (9743)48,424,071, might work) and slice your way through. Something like this:
import re
import pandas as pd
# Start your list of lists off with your expected headers
mystringlist = [["Company Name",
                 "(Code) at End of Month",
                 "Reason for Alteration",
                 "Value, etc.",
                 "after Alteration",
                 "Remarks Shares"]]
# This will be used to store the start index of each record
indexlist = []
# A recursive function to find the start location of each record.
# It expects a string of "1"s and "0"s
def find_start(flagstring, startloc=0):
    foundindex = flagstring.find("1", startloc)
    if foundindex == -1:
        return
    indexlist.append(foundindex)
    find_start(flagstring, foundindex + 1)
# Split on your delimiter (thestring is your raw parsed-PDF string)
mystring = thestring.split(";")
# Build the string of "1"s and "0"s, with a "1" wherever an element
# matches the regular-expressible "(code)amount" part of a record
# (note that re.match takes the pattern first, then the string)
stringloc = "".join("1" if re.match(r"\(\d+\)[\d,]+", x) else "0"
                    for x in mystring)
find_start(stringloc)
# Make your list of lists based on the found indexes.
# We subtract 1 from each index because we want to start from the
# element that immediately precedes the matched element (the company
# name); it's an easier regex to make when it's a consistent structure.
for pos, start in enumerate(indexlist):
    if pos + 1 < len(indexlist):
        mystringlist.append(mystring[start - 1:indexlist[pos + 1] - 1])
    else:
        mystringlist.append(mystring[start - 1:])
# Turn mystringlist into a data frame
mydf = pd.DataFrame(mystringlist)