Selecting data with boolean - python-3.x

I have this data:
Time Filename 60Ni 61Ni 62Ni 63Cu 64Ni 65Cu 66Zn
0 input/25.03.2022/220310001_Blk.TXT 0.004213561117649 0.0004941140553758 0.0008264054505464 0.0031620886176824 0.0027940864441916 0.0015828267166701 0.0016676299002331
0 input/25.03.2022/220310002_s01.TXT 10.450854313373563 0.4648053120821714 1.515178832411766 6.799761742353439 0.4018541350960731 3.1670226126909258 -0.0004149001091036
0 input/25.03.2022/220310003_Blk.TXT 0.0046728020068258 0.0005113642575452 0.0008742292275807 0.0038990775356069 0.0016948795023684 0.0018952833449778 0.0010220745559005
0 input/25.03.2022/220310004_NBS1.TXT 10.198064727783205 0.4535825699567795 1.4786694407463077 6.644256343841553 0.3992856484651566 3.094865655899048 0.0037580380868166
0 input/25.03.2022/220310004_NBS1.TXT 10.079309902191165 0.4483201849460602 1.46148056268692 6.564267292022705 0.3941892045736312 3.0576720571517946 0.0034487369190901
0 input/25.03.2022/220310004_NBS1.TXT 10.055493621826171 0.4472497546672821 1.4580038189888 6.549974250793457 0.3931813925504684 3.050980033874512 0.0033985046250745
0 input/25.03.2022/220310005_Blk.TXT 0.0054564320637534 0.000552704881799 0.0009742037393152 0.0051360657749076 0.0016950297751463 0.0024249535597239 0.0009928302722983
0 input/25.03.2022/220310006_s02.TXT 10.01531049013138 0.4454658553004265 1.4522675812244414 6.514146220684052 0.386061905324459 3.0344271540641783 0.000107468456008
0 input/25.03.2022/220310007_Blk.TXT 0.0058440430865933 0.0005767503850317 0.0010166236781515 0.005754723213613 0.0016812159873855 0.0026916580740362 0.0009854609476557
0 input/25.03.2022/220310008_IC11.TXT 9.994083042144776 0.4445306330919266 1.4492901706695556 7.470176630020141 0.3894518518447876 3.4796351766586304 0.0025492239557206
0 input/25.03.2022/220310008_IC11.TXT 9.96211524963379 0.4431046521663666 1.4446095180511476 7.446916389465332 0.3881216913461685 3.4687003660202027 0.0025008138502016
0 input/25.03.2022/220310008_IC11.TXT 9.986034145355225 0.4441730308532715 1.4480724453926086 7.463679447174072 0.3892696779966354 3.4765151739120483 0.0026400250289589
0 input/25.03.2022/220310009_Blk.TXT 0.0063325553511579 0.0006097427569329 0.0010847105528227 0.0069349051453173 0.001787266593116 0.0031935206692044 0.0010376764985267
0 input/25.03.2022/220310010_IC21.TXT 9.907929096221924 0.4407092106342316 1.436798827648163 6.06647294998169 0.381802784204483 2.8266239309310914 -1.9165552830600065e-05
0 input/25.03.2022/220310010_IC21.TXT 9.898069801330566 0.440289745926857 1.4353946447372437 6.061149816513062 0.381249166727066 2.824158968925476 -0.000127439211501
0 input/25.03.2022/220310010_IC21.TXT 9.873182182312013 0.4391724157333374 1.4317991995811463 6.046364660263062 0.3803336149454117 2.8172996759414675 -0.0001105861682299
0 input/25.03.2022/220310011_Blk.TXT 0.006776000233367 0.0006151825053772 0.0011437231053908 0.0076008092767248 0.0019233425652297 0.0034851033162946 0.0010968211592019
0 input/25.03.2022/220310012_s03.TXT 9.919624531269074 0.4412683457136154 1.4386062547564509 6.453964179754257 0.3823180440813303 3.006967249512672 -5.925155056729638e-06
0 input/25.03.2022/220310013_Blk.TXT 0.0070609439785281 0.000630383901686 0.0011728596175089 0.0080440094384054 0.002036696835421 0.0036809448773662 0.0011725965615672
0 input/25.03.2022/220310014_NBS2.TXT 9.981303691864014 0.4439967930316925 1.4475916290283204 6.529758939743042 0.3842234253883362 3.0423321390151976 -0.0003124714840669
0 input/25.03.2022/220310014_NBS2.TXT 9.914944686889648 0.4410672289133072 1.43802725315094 6.483507022857666 0.3816052615642548 3.020912132263184 -0.0003612653422169
0 input/25.03.2022/220310014_NBS2.TXT 9.886322116851806 0.4397846907377243 1.433876838684082 6.466586313247681 0.3804377472400665 3.012995510101318 -0.0004034811019664
0 input/25.03.2022/220310015_Blk.TXT 0.0073987463644395 0.0006374313224417 0.00121240914256 0.0086917337340613 0.0023169121006503 0.0039539939956739 0.0013138331084822
0 input/25.03.2022/220310016_IC12.TXT 9.85557996749878 0.4384415912628174 1.429577150344849 7.381738519668579 0.3795971584320068 3.4393920421600344 -0.0002216772679821
0 input/25.03.2022/220310016_IC12.TXT 9.82486707687378 0.4370819771289825 1.4251339054107666 7.356578054428101 0.3785045564174652 3.4277337741851808 -0.0001708243161556
0 input/25.03.2022/220310016_IC12.TXT 9.80790719985962 0.4363300800323486 1.4227154779434203 7.344034671783447 0.3781744015216827 3.421962966918945 4.260804889781866e-06
0 input/25.03.2022/220310017_Blk.TXT 0.0075777747668325 0.0006548739698094 0.0012614171913204 0.0094458545868595 0.0020529380068182 0.0042742350138723 0.0011683187098242
0 input/25.03.2022/220310018_s04.TXT 9.812697958946227 0.4365527924150228 1.4234635725617408 6.387677395343781 0.3783831790089607 2.9768310248851777 -2.183340569104075e-05
0 input/25.03.2022/220310019_Blk.TXT 0.0078119638841599 0.0006677620811387 0.001280953696308 0.0097443414541582 0.0025180794376259 0.0044003704718003 0.0014272088340173
I read it and drop everything I don't want:
import pandas as pd
df = pd.read_csv(fullname, delimiter='\t', index_col=False)
df.drop(['Time', '60Ni', '61Ni', '62Ni', '64Ni', '66Zn'], axis=1, inplace=True)
Then I insert a ratio column:
df['Cu_ratio'] = df['65Cu'] / df['63Cu']
Then I detect the row types by filename:
df['blank'] = df['Filename'].str.contains('_Blk|_BLK')
df['standard'] = df['Filename'].str.contains('_s')
df['sample'] = ~df['blank'] & ~df['standard']
dfSamples = df[df['sample']]
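A quick sanity check, as a minimal sketch, that every row lands in exactly one of the three categories:
# booleans sum as 0/1, so every row should add up to exactly 1
assert (df[['blank', 'standard', 'sample']].sum(axis=1) == 1).all()
print(df[['Filename', 'blank', 'standard', 'sample']].head(10))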
Then I need to detect the "standard" before and after each "sample":
for sampleIndex in dfSamples.index:
    # nearest "standard" above the sample: take the rows before it, reverse
    # them, and grab the first standard (assumes the default RangeIndex, so
    # positions and labels coincide)
    dfBefore = df.iloc[0:sampleIndex]
    dfBeforeInverse = dfBefore[::-1]
    dfStandardBefore = dfBeforeInverse[dfBeforeInverse['standard']]
    indexBefore = dfStandardBefore.index[0]
    df.at[sampleIndex, 'standardBefore'] = indexBefore

    # nearest "standard" below the sample: same idea, scanning forwards
    lastIndex = df.index[-1]
    dfAfter = df.iloc[sampleIndex:lastIndex]
    dfStandardAfter = dfAfter[dfAfter['standard']]
    indexAfter = dfStandardAfter.index[0]
    df.at[sampleIndex, 'standardAfter'] = indexAfter
This works so far.
But here is where I'm struggling: I need to make a calculation.
First, I need to calculate the average of the "standard" before and after each "sample":
I tried it with
av_std = sum(dfStandardBefore['Cu_ratio'], dfStandardAfter['Cu_ratio']) / 2
but it calculates the sum of ALL standards before and after, not only the ones selected by the boolean mask. What I need is, for example:
s01 and s02 for NBS1; s02 and s03 for IC11 and IC21; s03 and s04 for NBS2 and IC12; and so on.
The last step would be to make a calculation for each sample:
delta = (av_std / sample) * 1000
and then export the results to a CSV file.
How can I solve this?
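One possible way to finish, as a minimal sketch building on the loop above: it assumes standardBefore and standardAfter now hold the index labels of the bracketing standards, that "sample" in the delta formula means the sample row's Cu_ratio, and that delta.csv is just a placeholder filename:
for sampleIndex in dfSamples.index:
    # the new columns were created with NaN, so the stored labels come back
    # as floats; cast them back to int before using them as index labels
    before = int(df.at[sampleIndex, 'standardBefore'])
    after = int(df.at[sampleIndex, 'standardAfter'])
    # average of only the two bracketing standards
    av_std = (df.at[before, 'Cu_ratio'] + df.at[after, 'Cu_ratio']) / 2
    df.at[sampleIndex, 'delta'] = (av_std / df.at[sampleIndex, 'Cu_ratio']) * 1000

# export only the sample rows
df.loc[dfSamples.index, ['Filename', 'Cu_ratio', 'delta']].to_csv('delta.csv', index=False)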

Related

Panda returns 50x1 matrix instead of 50x7? (read_csv gone wrong)

I'm quite new to Python. I'm trying to load a .csv file with pandas, but it returns a 50x1 matrix instead of the expected 50x7. I'm a bit uncertain whether it is because my data contains numbers with "," (although I thought the quotechar attribute would solve that problem).
EDIT: I should perhaps mention that including the attribute sep=',' doesn't solve the issue.
My code looks like this:
df = pd.read_csv('data.csv', header=None, quotechar='"')
print(df.head())
print(len(df.columns))
print(len(df.index))
Any ideas? Thanks in advance
Here is a subset of the data as text
10-01-2021,813,116927,"2,01",-,-,-
11-01-2021,657,117584,"2,02",-,-,-
12-01-2021,462,118046,"2,03",-,-,-
13-01-2021,12728,130774,"2,24",-,-,-
14-01-2021,17895,148669,"2,55",-,-,-
15-01-2021,15206,163875,"2,81",5,5,"0,0001"
16-01-2021,4612,168487,"2,89",7,12,"0,0002"
17-01-2021,2536,171023,"2,93",717,729,"0,01"
18-01-2021,3883,174906,"3,00",2147,2876,"0,05"
Here is the output of the head() call:
0
0 27-12-2020,6492,6492,"0,11",-,-,-
1 28-12-2020,1987,8479,"0,15",-,-,-
2 29-12-2020,8961,17440,"0,30",-,-,-
3 30-12-2020,11477,28917,"0,50",-,-,-
4 31-12-2020,6197,35114,"0,60",-,-,-
5 01-01-2021,2344,37458,"0,64",-,-,-
6 02-01-2021,8895,46353,"0,80",-,-,-
7 03-01-2021,6024,52377,"0,90",-,-,-
8 04-01-2021,2403,54780,"0,94",-,-,-
Using your data I got the expected result, even without quotechar='"'.
Could you maybe show us your output?
import pandas as pd
df = pd.read_csv('data.csv', header=None)
print(df)
> 0 1 2 3 4 5 6
> 0 10-01-2021 813 116927 2,01 - - -
> 1 11-01-2021 657 117584 2,02 - - -
> 2 12-01-2021 462 118046 2,03 - - -
> 3 13-01-2021 12728 130774 2,24 - - -
> 4 14-01-2021 17895 148669 2,55 - - -
> 5 15-01-2021 15206 163875 2,81 5 5 0,0001
> 6 16-01-2021 4612 168487 2,89 7 12 0,0002
> 7 17-01-2021 2536 171023 2,93 717 729 0,01
> 8 18-01-2021 3883 174906 3,00 2147 2876 0,05
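If the result still comes out as one column, a quick check (a debugging sketch, assuming the file is named data.csv) is to print the raw first line; a single-column parse usually means the real separator or the encoding differs from what read_csv expects:
# Inspect the raw bytes of the first line
with open('data.csv', 'rb') as f:
    print(f.readline())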
You need to define the separator, like this (in pandas, delimiter is just an alias for sep, so one of the two is enough):
df = pd.read_csv('data.csv', header=None, sep=',', quotechar='"')

Calculation of stock values with yfinance and python

I would like to make some calculations on stock prices in Python 3 and I have installed the module yfinance.
I try to get an individual value like this:
import yfinance as yf
#define the ticker symbol
tickerSymbol = 'MSFT'
#get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
#get the historical prices for this ticker
tickerDf = tickerData.history(period='1d', start='2015-1-1', end='2020-12-30')
row_date = tickerDf[tickerDf['Date']=='2020-12-30']
value = row_date.Open.item()
#see your data
print (value)
But when I run this, it says:
KeyError: 'Date'
This is strange, because when I run the following, it works fine and the Date column is there:
import yfinance as yf
#define the ticker symbol
tickerSymbol = 'MSFT'
#get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
#get the historical prices for this ticker
tickerDf = tickerData.history(period='1d', start='2015-1-1', end='2020-12-30')
#row_date = tickerDf[tickerDf['Date']=='2020-12-30']
#value = row_date.Open.item()
#see your data
print (tickerDf)
I get the following result:
G:\python> python test.py
Open High Low Close Volume Dividends Stock Splits
Date
2014-12-31 41.512481 42.143207 41.263744 41.263744 21552500 0.0 0
2015-01-02 41.450302 42.125444 41.343701 41.539135 27913900 0.0 0
2015-01-05 41.192689 41.512495 41.086088 41.157158 39673900 0.0 0
2015-01-06 41.201567 41.530255 40.455355 40.553074 36447900 0.0 0
2015-01-07 40.846223 41.272629 40.410934 41.068310 29114100 0.0 0
... ... ... ... ... ... ... ...
2020-12-22 222.690002 225.630005 221.850006 223.940002 22612200 0.0 0
2020-12-23 223.110001 223.559998 220.800003 221.020004 18699600 0.0 0
2020-12-24 221.419998 223.610001 221.199997 222.750000 10550600 0.0 0
2020-12-28 224.449997 226.029999 223.020004 224.960007 17933500 0.0 0
2020-12-29 226.309998 227.179993 223.580002 224.149994 17403200 0.0 0
[1510 rows x 7 columns]
Under the hood, yfinance uses a pandas DataFrame to hold a Ticker's history. In this DataFrame, Date isn't an ordinary column but the name given to the index (see line 240 in base.py of yfinance). The index behaves differently from ordinary columns and can't be referenced by name with tickerDf['Date']. You can access it with tickerDf.index == '2020-12-30', or turn it into a regular column using reset_index as explained in another question. Searching an index is faster than searching a regular column, so if you are looking through a lot of data it is to your advantage to leave it as the index.
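A minimal sketch of both options; note that in the output above the last row is 2020-12-29 even though end='2020-12-30' was requested, because the end date is exclusive, so that is the date used here:
# Option 1: filter on the index directly
row_date = tickerDf[tickerDf.index == '2020-12-29']
print(row_date.Open.item())

# Option 2: promote the index to a regular 'Date' column, then filter by name
df2 = tickerDf.reset_index()
row_date = df2[df2['Date'] == '2020-12-29']
print(row_date.Open.item())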

Counting the number of times the values are more than the mean for a specific column in Dataframe

I'm trying to find the number of times the value in a certain column (in this case "AveragePrice") is more than its mean and median. I calculated both like this:
mean_AveragePrice = avocadodf["AveragePrice"].mean(axis = 0)
median_AveragePrice = avocadodf["AveragePrice"].median(axis = 0)
How do I count the number of times the values were more than the mean?
Sample of the Dataframe:
Date AveragePrice Total Volume PLU4046 PLU4225 PLU4770 Total Bags
0 27/12/2015 1.33 64236.62 1036.74 54454.85 48.16 8696.87
1 20/12/2015 1.35 54876.98 674.28 44638.81 58.33 9505.56
2 13/12/2015 0.93 118220.22 794.70 109149.67 130.50 8145.35
3 06/12/2015 1.08 78992.15 1132.00 71976.41 72.58 5811.16
4 29/11/2015 1.28 51039.60 941.48 43838.39 75.78 6183.95
5 22/11/2015 1.26 55979.78 1184.27 48067.99 43.61 6683.91
6 15/11/2015 0.99 83453.76 1368.92 73672.72 93.26 8318.86
7 08/11/2015 0.98 109428.33 703.75 101815.36 80.00 6829.22
8 01/11/2015 1.02 99811.42 1022.15 87315.57 85.34 11388.36
So you have the data you need; now you need the test, and np.where will help you out:
import numpy as np
mean_AveragePrice = avocadodf["AveragePrice"].mean(axis=0)
median_AveragePrice = avocadodf["AveragePrice"].median(axis=0)
# 1 where the price beats both the mean and the median, 0 otherwise
where_bigger = np.where((avocadodf["AveragePrice"] > mean_AveragePrice) & (avocadodf["AveragePrice"] > median_AveragePrice), 1, 0)
where_bigger.sum()
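If the two counts are wanted separately, a boolean Series sums as 0/1, so each count can also be taken directly (a minimal sketch using the same names):
above_mean = (avocadodf["AveragePrice"] > mean_AveragePrice).sum()
above_median = (avocadodf["AveragePrice"] > median_AveragePrice).sum()
print(above_mean, above_median)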

Finding out the NAN values for Summary report

```
def drag_mis(data):
    list = []
    for val in data.values:
        if np.any(val) == None:
            list.append(val)
    return list.count(val)
```
""" Need a summary report like a file attached in xls format need to automate this boring stuff"""
**
The Above function will help us to drag nan values give the count
**
df.groupby(["Operator","Model"],axis=0)[['Jan-17', 'Feb-17', 'Mar-17', 'Apr-17', 'May-17',
'Jun-17', 'Jul-17', 'Aug-17', 'Sep-17', 'Oct-17', 'Nov-17', 'Dec-17',
'Jan-18', 'Feb-18', 'Mar-18', 'Apr-18', 'May-18', 'Jun-18', 'Jul-18',
'Aug-18', 'Sep-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Jan-19', 'Feb-19',
'Mar-19', 'Apr-19', 'May-19', 'Jun-19', 'Jul-19', 'Aug-19', 'Sep-19',
'Oct-19', 'Nov-19', 'Dec-19', 'Jan-20', 'Feb-20', 'Mar-20', 'Apr-20',
'May-20']].apply(drag_mis)
I want to pull out all NaN values so that I can count them for a summary report in a new CSV file.
The output is as follows:
AAL 737 0
757 0
767 0
777 0
787 0
MD80 0
AAR 747 0
767 0
777 0
ABM 747 0
ACN 737 0
Please add your ideas on where my function is going wrong.
I also tried the code below, but I need a summary like value_counts gives, which I could not get from the DataFrame:
df.groupby(["Operator","Model"])[['Jan-17', 'Feb-17', 'Mar-17', 'Apr-17', 'May-17', 'Jun-17', 'Jul-17', 'Aug-17', 'Sep-17', 'Oct-17', 'Nov-17', 'Dec-17', 'Jan-18', 'Feb-18', 'Mar-18', 'Apr-18', 'May-18', 'Jun-18', 'Jul-18', 'Aug-18', 'Sep-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Jan-19', 'Feb-19', 'Mar-19', 'Apr-19', 'May-19', 'Jun-19', 'Jul-19', 'Aug-19', 'Sep-19', 'Oct-19', 'Nov-19', 'Dec-19', 'Jan-20', 'Feb-20', 'Mar-20', 'Apr-20', 'May-20']].apply(lambda x: x.isnull().sum())
Please look into this snapshot of the xls file: https://i.stack.imgur.com/E1FTN.jpg
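As for where drag_mis goes wrong: np.any(val) == None never matches a missing value, because NaN is not None; isnull is the right test. A minimal sketch of the summary, assuming every column other than Operator and Model is a month column and that nan_summary.csv is just a placeholder name:
# Count NaNs per (Operator, Model) group across all month columns
month_cols = [c for c in df.columns if c not in ("Operator", "Model")]
nan_summary = (
    df.groupby(["Operator", "Model"])[month_cols]
      .apply(lambda g: g.isnull().sum().sum())   # total NaNs in the group
      .rename("nan_count")
      .reset_index()
)
nan_summary.to_csv("nan_summary.csv", index=False)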

Axis is specified but .drop() errors out

I've a data set with 3 columns:
UserID ProductID Ratings
0 AKM1MP6P0OYPR 0132793040 5.0
1 A2CX7LUOHB2NDG 0321732944 5.0
2 A2NWSAGRHCP8N5 0439886341 1.0
3 A2WNBOD3WNDNKT 0439886341 3.0
4 A1GI0U4ZRJA8WN 0439886341 1.0
I'm trying to drop the rows where the UserID count is less than 50, that is, if the user has rated fewer than 50 products. So I ran the code below:
for i in data.UserID.unique():
    print(i)
    if (data['UserID'].loc[data['UserID'] == i].value_counts().values < 50):
        data.drop(data.loc[data['UserID'] == i], axis=0, inplace=True)
But I receive an error message stating the labels are not contained in an axis, like so:
KeyError: "labels ['UserID' 'ProductID' 'Ratings'] not contained in axis"
Could anyone please tell me what I'm getting wrong?
Thanks!
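What the error points at: .drop() expects index labels, but data.loc[data['UserID'] == i] is a whole DataFrame, and pandas ends up treating its column names as the labels to drop, which is exactly why the KeyError lists ['UserID' 'ProductID' 'Ratings']. A minimal sketch of a fix, plus a loop-free alternative:
# Inside the loop: pass the row labels, not the DataFrame itself
data.drop(data.loc[data['UserID'] == i].index, inplace=True)

# Loop-free alternative: keep only users with at least 50 ratings
counts = data.groupby('UserID')['UserID'].transform('count')
data = data[counts >= 50]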
