Creating a new DataFrame with only rows where a variable = True - python-3.x

Still new to Python.
I have found that the Y-axis is always scaled to the whole DataFrame when I use plt.show() to plot only part of the x-axis (Date).
I have added a column 'plot' to the complete DataFrame with values False/True depending on the Date range I want to plot.
How do I make a new DataFrame plotdata containing all rows where 'plot' is True, so that the Y-axis is rescaled?
data['plot'] = (data['Date'] > startplot) & (data['Date'] <= end)
print(data)
plotdata = pd.DataFrame()
plotdata = pd.DataFrame([data('plot') == 'True'])
.
.
My print(data) looks like this:
Date Open High Low Close Volume plot
0 2015-01-02 47.00 47.23 46.91 46.96 11421233 False
1 2015-01-05 47.08 47.16 46.56 46.57 18964458 False
2 2015-01-06 46.79 47.38 46.46 47.04 22950060 False
3 2015-01-07 46.92 47.04 46.05 46.19 20793637 False
.
.
644 2017-07-25 43.90 44.17 43.81 43.98 9818802 True
645 2017-07-26 44.80 44.83 44.28 44.40 19045166 True
646 2017-07-27 46.26 47.84 45.95 47.81 44702494 True
647 2017-07-28 47.70 48.38 47.12 47.94 25296508 True
I have searched and looked around a lot without success. I hope someone has a solution. Thanks / Henning

Something like
plotdata = data[data['plot']]
plotdata
Out[470]:
Date Open High Low Close Volume plot
644 2017-07-25 43.90 44.17 43.81 43.98 9818802 True
645 2017-07-26 44.80 44.83 44.28 44.40 19045166 True
646 2017-07-27 46.26 47.84 45.95 47.81 44702494 True
647 2017-07-28 47.70 48.38 47.12 47.94 25296508 True
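
This is boolean indexing: passing a boolean Series selects only the rows where it is True. A minimal sketch reusing the names from the question (data, startplot, end), filtering first and then plotting so the Y-axis autoscales to the selected range only:

import pandas as pd
import matplotlib.pyplot as plt

# Sketch only: 'data', 'startplot' and 'end' are the objects from the question.
mask = (data['Date'] > startplot) & (data['Date'] <= end)
plotdata = data[mask]                # keeps only rows where the mask is True
plotdata.plot(x='Date', y='Close')   # Y-axis now scales to the filtered data
plt.show()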

Related

Calculation of stock values with yfinance and python

I would like to make some calculations on stock prices in Python 3 and I have installed the module yfinance.
I try to get an individual value like this:
import yfinance as yf
#define the ticker symbol
tickerSymbol = 'MSFT'
#get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
#get the historical prices for this ticker
tickerDf = tickerData.history(period='1d', start='2015-1-1', end='2020-12-30')
row_date = tickerDf[tickerDf['Date']=='2020-12-30']
value = row_date.Open.item()
#see your data
print (value)
But when I run this, it says:
KeyError: 'Date'
Which is strange because when I do this, it works well and I have the column Date:
import yfinance as yf
#define the ticker symbol
tickerSymbol = 'MSFT'
#get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
#get the historical prices for this ticker
tickerDf = tickerData.history(period='1d', start='2015-1-1', end='2020-12-30')
#row_date = tickerDf[tickerDf['Date']=='2020-12-30']
#value = row_date.Open.item()
#see your data
print (tickerDf)
I get the following result:
G:\python> python test.py
Open High Low Close Volume Dividends Stock Splits
Date
2014-12-31 41.512481 42.143207 41.263744 41.263744 21552500 0.0 0
2015-01-02 41.450302 42.125444 41.343701 41.539135 27913900 0.0 0
2015-01-05 41.192689 41.512495 41.086088 41.157158 39673900 0.0 0
2015-01-06 41.201567 41.530255 40.455355 40.553074 36447900 0.0 0
2015-01-07 40.846223 41.272629 40.410934 41.068310 29114100 0.0 0
... ... ... ... ... ... ... ...
2020-12-22 222.690002 225.630005 221.850006 223.940002 22612200 0.0 0
2020-12-23 223.110001 223.559998 220.800003 221.020004 18699600 0.0 0
2020-12-24 221.419998 223.610001 221.199997 222.750000 10550600 0.0 0
2020-12-28 224.449997 226.029999 223.020004 224.960007 17933500 0.0 0
2020-12-29 226.309998 227.179993 223.580002 224.149994 17403200 0.0 0
[1510 rows x 7 columns]
Under the hood, yfinance uses a pandas DataFrame to hold a Ticker's history. In this DataFrame, Date isn't an ordinary column, but is instead a name given to the index (see line 240 in base.py of yfinance). The index behaves differently from other columns and can't be referenced by name with tickerDf['Date']. You can access it using tickerDf.index == '2020-12-30', or turn it into a regular column using reset_index as explained in another question. Searching an index is faster than searching a regular column, so if you are looking through a lot of data it is to your advantage to leave it as the index.
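
A minimal sketch of both approaches, reusing the ticker setup from the question (the string comparison assumes a timezone-naive index, which is what yfinance returned here):

import yfinance as yf

tickerDf = yf.Ticker('MSFT').history(period='1d', start='2015-1-1', end='2020-12-30')

# Option 1: query the index directly (Date is the index, not a column)
row_date = tickerDf[tickerDf.index == '2020-12-30']

# Option 2: turn the index into a regular 'Date' column first
tickerDf2 = tickerDf.reset_index()
row_date2 = tickerDf2[tickerDf2['Date'] == '2020-12-30']

print(row_date.Open.item())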

Finding out the NAN values for Summary report

def drag_mis(data):
    list = []
    for val in data.values:
        if np.any(val) == None:
            list.append(val)
    return list.count(val)
""" Need a summary report like a file attached in xls format need to automate this boring stuff"""
**
The Above function will help us to drag nan values give the count
**
df.groupby(["Operator","Model"],axis=0)[['Jan-17', 'Feb-17', 'Mar-17', 'Apr-17', 'May-17',
'Jun-17', 'Jul-17', 'Aug-17', 'Sep-17', 'Oct-17', 'Nov-17', 'Dec-17',
'Jan-18', 'Feb-18', 'Mar-18', 'Apr-18', 'May-18', 'Jun-18', 'Jul-18',
'Aug-18', 'Sep-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Jan-19', 'Feb-19',
'Mar-19', 'Apr-19', 'May-19', 'Jun-19', 'Jul-19', 'Aug-19', 'Sep-19',
'Oct-19', 'Nov-19', 'Dec-19', 'Jan-20', 'Feb-20', 'Mar-20', 'Apr-20',
'May-20']].apply(drag_mis)
I want to pull out all the NaN values and count them for a summary report in a new CSV file.
The output is as follows:
AAL 737 0
757 0
767 0
777 0
787 0
MD80 0
AAR 747 0
767 0
777 0
ABM 747 0
ACN 737 0
Please add your ideas on where my function is going wrong.
I tried the code below, but I need a summary like value_counts, which I could not get from the DataFrame:
df.groupby(["Operator","Model"])[['Jan-17', 'Feb-17', 'Mar-17', 'Apr-17', 'May-17', 'Jun-17', 'Jul-17', 'Aug-17', 'Sep-17', 'Oct-17', 'Nov-17', 'Dec-17', 'Jan-18', 'Feb-18', 'Mar-18', 'Apr-18', 'May-18', 'Jun-18', 'Jul-18', 'Aug-18', 'Sep-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Jan-19', 'Feb-19', 'Mar-19', 'Apr-19', 'May-19', 'Jun-19', 'Jul-19', 'Aug-19', 'Sep-19', 'Oct-19', 'Nov-19', 'Dec-19', 'Jan-20', 'Feb-20', 'Mar-20', 'Apr-20', 'May-20']].apply(lambda x: x.isnull().sum())
Please look into this snapshot of the xls file: https://i.stack.imgur.com/E1FTN.jpg
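
For reference, a hedged sketch of a more direct way to get a value_counts-style summary of NaNs per group and write it out (the column list is abbreviated here for illustration; use the full month list from above):

import pandas as pd

# Sketch only: 'df' and the month columns come from the question.
month_cols = ['Jan-17', 'Feb-17', 'Mar-17']  # illustrative subset

# isna().sum() counts NaNs per column; the second .sum() collapses that
# to one number per (Operator, Model) group.
nan_summary = (df.groupby(['Operator', 'Model'])[month_cols]
                 .apply(lambda g: g.isna().sum().sum()))
nan_summary.to_csv('nan_summary.csv')  # export for the report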

Obtaining hyperpolarization depth from electrophysiological graph

I am working on electrophysiological data which is in .abf format.
I want to obtain the hyperpolarization depth indicated in the figure above. This is what I have done so far:
import matplotlib.pyplot as plt
import pyabf
import pandas as pd
abf = pyabf.ABF("test.abf")
abf.setSweep(10) # I can access a given sweep. Here sweep 10
df = pd.DataFrame({'time': abf.sweepX, 'current':abf.sweepY})
df1 = df.loc[15650:15800]
df1.plot(x='time', y='current')
I am thinking of applying a change-in-derivative approach to find the first point of interest (x1, y1) and then the lower point (x2, y2), but it looks complex. I would appreciate it if someone could give a hint or procedure.
The dataset as follow,
time current
0.7825 -63.323975
0.78255 -63.171387
0.7826 -62.89673
0.78265 -62.713623
0.7827 -62.469482
0.78275 -62.37793
0.7828 -62.10327
0.78285 -61.950684
0.7829 -61.76758
0.78295 -61.584473
0.783 -61.401367
0.78305 -61.24878
0.7831 -61.035156
0.78315 -60.85205
0.7832 -60.72998
0.78325 -60.516357
0.7833 -60.455322
0.78335 -60.2417
0.7834 -60.08911
0.78345 -59.96704
0.7835 -59.814453
0.78355 -59.661865
0.7836 -59.509277
0.78365 -59.417725
0.7837 -59.23462
0.78375 -59.11255
0.7838 -58.95996
0.78385 -58.86841
0.7839 -58.685303
0.78395 -58.59375
0.784 -58.441162
0.78405 -58.34961
0.7841 -58.19702
0.78415 -58.044434
0.7842 -57.922363
0.78425 -57.769775
0.7843 -57.678223
0.78435 -57.434082
0.7844 -57.34253
0.78445 -56.9458
0.7845 -56.274414
0.78455 -54.96216
0.7846 -53.253174
0.78465 -51.208496
0.7847 -48.950195
0.78475 -46.325684
0.7848 -43.09082
0.78485 -38.42163
0.7849 -31.036377
0.78495 -22.033691
0.785 -13.397217
0.78505 -6.072998
0.7851 -0.61035156
0.78515 2.7160645
0.7852 3.9367676
0.78525 3.4179688
0.7853 1.3427734
0.78535 -1.4953613
0.7854 -5.0964355
0.78545 -9.185791
0.7855 -13.641357
0.78555 -18.249512
0.7856 -23.132324
0.78565 -27.98462
0.7857 -32.714844
0.78575 -37.261963
0.7858 -41.47339
0.78585 -45.22705
0.7859 -48.553467
0.78595 -51.54419
0.786 -53.985596
0.78605 -56.18286
0.7861 -58.013916
0.78615 -59.539795
0.7862 -60.760498
0.78625 -61.88965
0.7863 -62.652588
0.78635 -63.323975
0.7864 -63.934326
0.78645 -64.2395
0.7865 -64.60571
0.78655 -64.78882
0.7866 -65.00244
0.78665 -64.971924
0.7867 -65.093994
0.78675 -65.03296
0.7868 -64.971924
0.78685 -64.819336
0.7869 -64.78882
0.78695 -64.66675
0.787 -64.48364
0.78705 -64.42261
0.7871 -64.2395
0.78715 -64.11743
0.7872 -63.964844
0.78725 -63.842773
0.7873 -63.659668
0.78735 -63.568115
0.7874 -63.446045
0.78745 -63.26294
0.7875 -63.171387
0.78755 -62.98828
0.7876 -62.89673
0.78765 -62.74414
0.7877 -62.713623
0.78775 -62.530518
0.7878 -62.438965
0.78785 -62.37793
0.7879 -62.25586
0.78795 -62.164307
0.788 -62.042236
0.78805 -62.01172
0.7881 -61.88965
0.78815 -61.88965
0.7882 -61.73706
0.78825 -61.706543
0.7883 -61.645508
0.78835 -61.61499
0.7884 -61.523438
0.78845 -61.462402
0.7885 -61.431885
0.78855 -61.340332
0.7886 -61.37085
0.78865 -61.279297
0.7887 -61.279297
0.78875 -61.157227
0.7888 -61.187744
0.78885 -61.09619
0.7889 -61.157227
0.78895 -61.12671
0.789 -61.09619
0.78905 -61.12671
0.7891 -61.00464
0.78915 -61.00464
0.7892 -60.97412
0.78925 -60.97412
0.7893 -60.943604
0.78935 -61.00464
0.7894 -60.913086
0.78945 -60.97412
0.7895 -60.943604
0.78955 -60.913086
0.7896 -60.943604
0.78965 -60.85205
0.7897 -60.85205
0.78975 -60.821533
0.7898 -60.88257
0.78985 -60.88257
0.7899 -60.913086
0.78995 -60.88257
0.79 -60.913086
We can plot the difference in current between consecutive points (which is essentially the derivative up to a constant factor, since the times are evenly spaced). The first chart shows the actual diffs. Based on this we can set some threshold, such as 0.4, and apply it to filter the main DataFrame. The filtered values are shown in orange on the second chart:
fig, ax = plt.subplots(2, figsize=(8, 8))
# plot the derivative (diff of consecutive samples)
df['current'].diff().plot(ax=ax[0])
# keep the current only where the absolute diff exceeds the threshold (NaN elsewhere)
threshold = 0.4
df['filtered'] = df['current'].where(df['current'].diff().abs() > threshold)
df.plot(ax=ax[1])
# add spans marking the extent of the filtered region
x = df['filtered'].dropna()
ax[1].axhspan(x.iloc[0], x.iloc[-1], alpha=0.3, edgecolor='skyblue', facecolor="none", hatch='////')
ax[1].axvspan(x.index.min(), x.index.max(), alpha=0.3, edgecolor='orange', facecolor="none", hatch='\\\\')
Output: (two-panel chart: the diffs in the top panel; the current trace with the filtered points in orange and the spans in the bottom panel)
If you're interested in the range values, you can drop the NaN values in the filtered subset and find the min and max of the index:
print('min', df['filtered'].dropna().index.min())
print('max', df['filtered'].dropna().index.max())
Output:
min 0.78445
max 0.7865
For the value of the gap you can use:
abs(df['filtered'].dropna().iloc[-1] - df['filtered'].dropna().iloc[0])
Output:
7.6599100000000035
Note: alternatively, we can get the left edges of these spans as the points where the diff is below the threshold while the diff at the next point is at or above it, and similarly for the right edges. This also works when there are multiple peaks:
threshold = 0.3
x = df['current'].diff().abs()
spanA = df.loc[(x < threshold) & (x.shift(-1) >= threshold)]
spanB = df.loc[(x >= threshold) & (x.shift(-1) < threshold)]
print(spanA)
current
time
0.7844 -57.34253
print(spanB)
current
time
0.7865 -64.60571
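
If what you need is the hyperpolarization depth itself (how far the trace undershoots below its pre-spike level), one possible sketch reusing the filtered series from above (assuming df is indexed by time, as in this answer):

# Sketch only: the first filtered point marks the pre-spike level,
# and the minimum from there on is the hyperpolarization trough.
x = df['filtered'].dropna()
depth = x.iloc[0] - df['current'].loc[x.index.min():].min()
print('hyperpolarization depth:', depth)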

Simple datetime conversion from integer or string

Is there a simple way to convert a start and end time input into a list of evenly separated times? The input can be a string or an integer, in formats like 1000, "1000", or "10:00", on the 2400-hr clock. I've managed to accomplish this in a messy-looking way; is there a tighter, more efficient way to create this list? As you'll notice, I created an array first and then called .tolist() to make the time transformation easier to iterate over. The problem is that an input of 1030 or 1015 needs to be translated into 1050 or 1025 to create the right spacing. Could I instead call datetime.timedelta or something similar and cleanly build the array?
start="1000"
end="1600"
total_minutes=(int(end[:2])*60)+int(end[2:])-(int(start[:2])*60)-
int(start[2:])
dog=list(range(0,int(total_minutes),25))
walk=dog_df["Walk Length"][dog_df.index[dog_df["Name"]==self.name][0]]
if walk=='half':
self.dogarr=np.array([(x-25,x,x+25,x+50) for x in dog])
elif walk=='full':
self.dogarr=np.array([(x-25,x,x+25,x+50,x+75,x+100) for x in dog])
else:
self.dogarr=np.array([(x,x+25,x+50) for x in dog])
if int(start[2])!=0:
start=start[:2]+str(int(int(start[2:])*1.667))
self.dogarr+=(int(start))
self.dogarr=self.dogarr.tolist()
z=0
while z<len(self.dogarr):
for timespot in self.dogarr[z].copy():
self.dogarr[z][self.dogarr[z].index(timespot)]=time.strftime('%H%M', time.gmtime(self.dogarr[z][self.dogarr[z].index(timespot)]*36))
z+=1
self.dogarr=np.array(self.dogarr)```
array([['1115', '1130', '1145', '1200'],
['1130', '1145', '1200', '1215'],
['1145', '1200', '1215', '1230'],
['1200', '1215', '1230', '1245'],
['1215', '1230', '1245', '1300']], dtype='<U4')
I'm sure you can figure out how to parse times from any number of existing questions. The crux of your question seems to be how to create evenly separated times within a range. Here's a simple way:
import datetime

start = datetime.datetime(2018, 12, 20, 10)  # or use strptime etc.
end = datetime.datetime(2018, 12, 24, 18)
count = 10
interval = (end - start) / count
dt = start
while dt <= end:
    print(dt)
    dt += interval
The output is:
2018-12-20 10:00:00
2018-12-20 20:24:00
2018-12-21 06:48:00
2018-12-21 17:12:00
2018-12-22 03:36:00
2018-12-22 14:00:00
2018-12-23 00:24:00
2018-12-23 10:48:00
2018-12-23 21:12:00
2018-12-24 07:36:00
2018-12-24 18:00:00
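
Applied to the HHMM strings from the question, a hedged sketch (the 25-minute spacing is illustrative, matching the step used in the question's code):

import datetime

start = datetime.datetime.strptime('1000', '%H%M')
end = datetime.datetime.strptime('1600', '%H%M')
step = datetime.timedelta(minutes=25)

times = []
dt = start
while dt <= end:
    times.append(dt.strftime('%H%M'))  # back to HHMM strings
    dt += step
print(times)  # ['1000', '1025', '1050', ..., '1550']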

svm train output file has less lines than that of the input file

I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, four features, and two classes [0, 1].
i.e
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39 ......................
My problem is that when I count the number of lines in the output model generated by svm-train (svm_train_model.txt), it has fewer data lines than the input file. The count shows 450 lines, of which 9 are the header at the beginning showing the various parameters generated,
i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
That leaves 441 data lines, so 12 lines in total from the original input of 453 have gone. I am new to SVM and was hoping that someone could shed some light on why this might have happened.
Thanks in advance
Update:
I now believe that in generating the model, it has removed lines where the label and all the feature values are exactly the same.
To explain: my input is a set of miRNAs which have been classified as 1 or 0 depending on whether they are involved in a particular process (1 = Yes, 0 = No). The input file looks something like:
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Here lines one and three are exactly the same, and as a result one is removed from the output model. My question is then both why the output model would do this and how I can get around it (whilst using the same features).
Whilst some of the labels and their corresponding feature values are identical within the input file, these are still different miRNAs.
NOTE: the input file does not have a feature for the miRNA name (which would clearly show the differences between lines). In terms of the features used (nucleotide percentage content), some of the miRNAs have exactly the same percentage content of A, U, G & C, and as a result are treated as duplicates and removed from the output model, even though they are not (hence the output model has fewer lines).
The format of the input file is as follows:
Where:
Column 0 - label (i.e 1 or 0): 1=Yes & 0=No
Column 1 - Feature 1 = Percentage Content "A"
Column 2 - Feature 2 = Percentage Content "U"
Column 3 - Feature 3 = Percentage Content "G"
Column 4 - Feature 4 = Percentage Content "C"
The input file actually looks something like this (see the first two lines below: they appear identical, yet each represents a different miRNA):
1 1:23 2:36 3:23 4:18
1 1:23 2:36 3:23 4:18
0 1:36 2:32 3:5 4:27
1 1:14 2:41 3:36 4:9
1 1:18 2:50 3:18 4:14
0 1:36 2:23 3:23 4:18
0 1:15 2:40 3:30 4:15
In terms of software, I am using libsvm-3.22 and python 2.7.5
My first observation: align your input file properly. The libsvm code doesn't look for exactly 4 features; it identifies them by the string literals you have provided separating the features from the labels. I suggest manually converting your input file to create the desired input argument.
Try the following Python code.
Requirement: h5py, if your input comes from MATLAB (.mat file):
pip install h5py
import h5py
import numpy as np

f = h5py.File('traininglabel.mat', 'r')  # label .mat file for training
variables = f.items()
labels = []
c = []
for var in variables:
    data = var[1]
    lables = (data.value[0])

trainlabels = []
for i in lables:
    trainlabels.append(str(i))

finaltrain = []
trainlabels = np.array(trainlabels)
for i in range(0, len(trainlabels)):
    if trainlabels[i] == '0.0':
        trainlabels[i] = '0'
    if trainlabels[i] == '1.0':
        trainlabels[i] = '1'
    print trainlabels[i]

f = h5py.File('training_features.mat', 'r')  # give features here
variables = f.items()
lables = []
file = open('traindata.txt', 'w+')
for var in variables:
    data = var[1]
    lables = data.value

for i in range(0, 1000):  # no. of training samples in features.mat
    file.write(str(trainlabels[i]))
    file.write(' ')
    for j in range(0, 49):
        file.write(str(lables[j][i]))
        file.write(' ')
    file.write('\n')
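
Once traindata.txt exists, a minimal sketch of training with libsvm's bundled Python bindings (svmutil ships with libsvm-3.22; the parameter string here is illustrative, echoing the gamma from the model header):

from svmutil import svm_read_problem, svm_train, svm_save_model

# Sketch only: file names match the script above and the question.
y, x = svm_read_problem('traindata.txt')   # labels and features in libsvm format
model = svm_train(y, x, '-s 0 -t 2 -g 1')  # C-SVC, RBF kernel, gamma 1
svm_save_model('svm_train_model.txt', model)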
