python - cannot make corr work - python-3.x

I'm struggling to get a simple correlation done. I've tried everything suggested under similar questions.
Here are the relevant parts of the code, the various attempts I've made, and their results.
import numpy as np
import pandas as pd
try01 = data[['ESA Index_close_px', 'CCMP Index_close_px' ]].corr(method='pearson')
print (try01)
Out:
Empty DataFrame
Columns: []
Index: []
try04 = data['ESA Index_close_px'][5:50].corr(data['CCMP Index_close_px'][5:50])
print (try04)
Out:
AttributeError: 'float' object has no attribute 'sqrt'
Using numpy:
try05 = np.corrcoef(data['ESA Index_close_px'],data['CCMP Index_close_px'])
print (try05)
Out:
AttributeError: 'float' object has no attribute 'sqrt'
Converting the columns to lists:
ESA_Index_close_px_list = list()
start_value = 1
end_value = len(data['ESA Index_close_px']) + 1
for items in data['ESA Index_close_px']:
    ESA_Index_close_px_list.append(items)
    start_value = start_value + 1
    if start_value == end_value:
        break
    else:
        continue
CCMP_Index_close_px_list = list()
start_value = 1
end_value = len(data['CCMP Index_close_px']) + 1
for items in data['CCMP Index_close_px']:
    CCMP_Index_close_px_list.append(items)
    start_value = start_value + 1
    if start_value == end_value:
        break
    else:
        continue
try06 = np.corrcoef(['ESA_Index_close_px_list','CCMP_Index_close_px_list'])
print (try06)
Out:
TypeError: cannot perform reduce with flexible type
Also tried .astype, but it made no difference.
data['ESA Index_close_px'].astype(float)
data['CCMP Index_close_px'].astype(float)
Using Python 3.5, pandas 0.18.1 and numpy 1.11.1
Would really appreciate any suggestion.
edit1:
Data is coming from an excel spreadsheet
data = pd.read_excel('C:\\Users\\Ako\\Desktop\\ako_files\\for_corr_tool.xlsx')
Prior to the correlation attempts, there are only column renames and
data = data.drop(data.index[0])
to get rid of a line.
Regarding the types:
print (type (data['ESA Index_close_px']))
print (type (data['ESA Index_close_px'][1]))
Out:
edit2:
Parts of the data:
print (data['ESA Index_close_px'][1:10])
print (data['CCMP Index_close_px'][1:10])
Out:
2 2137
3 2138
4 2132
5 2123
6 2127
7 2126.25
8 2131.5
9 2134.5
10 2159
Name: ESA Index_close_px, dtype: object
2 5241.83
3 5246.41
4 5243.84
5 5199.82
6 5214.16
7 5213.33
8 5239.02
9 5246.79
10 5328.67
Name: CCMP Index_close_px, dtype: object

Well, I've encountered the same problem today.
Try using .astype('float64') to make the types correct:
data['ESA Index_close_px'][5:50].astype('float64').corr(data['CCMP Index_close_px'][5:50].astype('float64'))
This works well for me. Hope it can help you as well.
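A minimal related sketch, assuming the columns are object dtype because of stray non-numeric values: astype returns a new Series, so its result has to be assigned back (likely why the bare .astype calls above had no effect), and pd.to_numeric with errors='coerce' is a more forgiving alternative:
import pandas as pd

# astype/to_numeric return new Series; assign the results back to the frame
data['ESA Index_close_px'] = pd.to_numeric(data['ESA Index_close_px'], errors='coerce')
data['CCMP Index_close_px'] = pd.to_numeric(data['CCMP Index_close_px'], errors='coerce')
print(data[['ESA Index_close_px', 'CCMP Index_close_px']].corr(method='pearson'))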

You can try the following (scaling before the int conversion keeps some decimal precision):
Top15['Citable docs per capita'] = Top15['Citable docs per capita'] * 100000
Top15['Citable docs per capita'].astype('int').corr(Top15['Energy Supply per Capita'].astype('int'))
It worked for me.

Related

How to read in pandas column as column of lists?

Probably a simple solution, but I couldn't find a fix scrolling through previous questions, so I thought I would ask.
I'm reading in a csv using pd.read_csv(). One column is giving me issues:
0 ['Bupa', 'O2', 'EE', 'Thomas Cook', 'YO! Sushi...
1 ['Marriott', 'Evans']
2 ['Toni & Guy', 'Holland & Barrett']
3 []
4 ['Royal Mail', 'Royal Mail']
It looks fine here but when I reference the first value in the column i get:
df['brand_list'][0]
Out : '[\'Bupa\', \'O2\', \'EE\', \'Thomas Cook\', \'YO! Sushi\', \'Costa\', \'Starbucks\', \'Apple Store\', \'HMV\', \'Marks & Spencer\', "Sainsbury\'s", \'Superdrug\', \'HSBC UK\', \'Boots\', \'3 Store\', \'Vodafone\', \'Marks & Spencer\', \'Clarks\', \'Carphone Warehouse\', \'Lloyds Bank\', \'Pret A Manger\', \'Sports Direct\', \'Currys PC World\', \'Warrens Bakery\', \'Primark\', "McDonald\'s", \'HSBC UK\', \'Aldi\', \'Premier Inn\', \'Starbucks\', \'Pizza Hut\', \'Ladbrokes\', \'Metro Bank\', \'Cotswold Outdoor\', \'Pret A Manger\', \'Wetherspoon\', \'Halfords\', \'John Lewis\', \'Waitrose\', \'Jessops\', \'Costa\', \'Lush\', \'Holland & Barrett\']'
Which is obviously a string, not a list as expected. How can I retain the list type when I read in this data?
I've tried the import ast method I've seen in other posts, df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x)), which didn't work.
I've also tried to replicate with dummy dataframes:
df1 = pd.DataFrame({'a': [['test','test1','test3'], ['test59'], ['test'], ['rhg','wreg']],
                    'b': [['erg','retbn','ert','eb'], ['g','eg','egr'], ['erg'], 'eg']})
df1['a'][0]
Out: ['test', 'test1', 'test3']
Which works as I would expect; this suggests to me that the solution lies in how I am importing the data.
Apologies, I was being stupid. The following should work:
import ast
df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x))
df['brand_list_new'][0]
Out: ['Bupa','O2','EE','Thomas Cook','YO! Sushi',...]
As desired
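As a side note, the conversion can also be done at read time, as in this sketch (the file name is a placeholder, and it assumes every cell parses cleanly with ast.literal_eval):
import ast
import pandas as pd

# parse the stringified lists while reading, instead of fixing them afterwards
df = pd.read_csv('brands.csv', converters={'brand_list': ast.literal_eval})  # 'brands.csv' is a placeholder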

running for loop until arbitrary index (python 3.x)

So I have these strings that I split by spaces (' ') and rolled into a single list I called 'keyLabelrun',
so it looks like this:
keyLabelrun[0-12]:
0 OS=Dengue
1 virus
2 3
3 PE=4
4 SV=1
5 Split=0
6
7 OS=Bacillus
8 subtilis
9 XF-1
10 GN=opuBA
11 PE=4
12 SV=1
I only want the elements that include and follow "OS="; anything else, whether it's "SV=" or "PE=" etc., I want to skip until I get to the next "OS=".
The number of elements before the next "OS=" is arbitrary, and that's where I'm having the problem.
This is what I'm currently trying:
OSarr = []
for i in range(len(keyLabelrun)):
    if keyLabelrun[i].count('OS='):
        OSarr.append(keyLabelrun[i])
        if keyLabelrun[i+1].count('=') != 1:
            continue
But the elements where "OS=" is not included are what's tripping me up, I think.
Also, at the end I'm going to join them all back together into their own elements, but I feel like I will be able to handle that after this.
In my attempt, I am trying to append all the elements I'm looking for, in order, to a new list 'OSarr'.
If anyone can lend a hand, it would be much appreciated.
Thank you.
This list of strings came from a dataset that is a text file of the form:
>tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) OS=Dengue virus 3 PE=4 SV=1 Split=0
MNNQRKKTGKPSINMLKRVRNRVSTGSQLAKRFSKGLLNGQGPMKLVMAFIAFLRFLAIPPTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINKRKKTSLCLMMILPAALAFHLTSRDGEPRMIVGKNERGKSLLFKTASGINMCTLIAMDLGEMCDDTVTYKCPHITEVEPEDIDCWCNLTSTWVTYGTCNQAGEHRRDKRSVALAPHVGMGLDTRTQTWMSAEGAWRQVEKVETWALRHPGFTILALFLAHYIGTSLTQKVVIFILLMLVTPSMTMRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLCIEGKITNITTDSRCPTQGEATLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGLECSPRTGLDFNEMILLTMKNKAWMVHRQWFFDLPLPWTSGATTETPTWNRKELLVTFKNAHAKKQEVVVLGSQEGAMHTALTGATEIQNSGGTSIFAGHLKCRLKMDKLELKGMSYAMCTNTFVLKKEVSETQHGTILIKVEYKGEDVPCKIPFSTEDGQGKAHNGRLITANPVVTKKEEPVNIEAEPPFGESNIVIGIGDNALKINWYKKGSSIGKMFEATARGARRMAILGDTAWDFGSVGGVLNSLGKMVHQIFGSAYTALFSGVSWVMKIGIGVLLTWIGLNSKNTSMSFSCIAIGIITLYLGAVVQADMGCVINWKGKELKCGSGIFVTNEVHTWTEQYKFQADSPKRLATAIAGAWENGVCGIRSTTRMENLLWKQIANELNYILWENNIKLTVVVGDIIGVLEQGKRTLTPQPMELKYSWKTWGKAKIVTAETQNSSFIIDGPNTPECPSVSRAWNVWEVEDYGFGVFTTNIWLKLREVYTQLCDHRLMSAAVKDERAVHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTLWSNGVLESDMIIPKSLAGPISQHNHRPGYHTQTAGPWHLGKLELDFNYCEGTTVVITENCGTRGPSLRTTTVSGKLIHEWCCRSCTLPPLRYMGEDGCWYGMEIRPISEKEENMVKSLVSAGSGKVDNFTMGVLCLAILFEEVMRGKFGKKHMIAGVFFTFVLLLSGQITWRDMAHTLIMIGSNASDRMGMGVTYLALIATFKIQPFLALGFFLRKLTSRENLLLGVGLAMATTLQLPEDIEQMANGIALGLMALKLITQFETYQLWTALISLTCSNTIFTLTVAWRTATLILAGVSLLPVCQSSSMRKTDWLPMAVAAMGVPPLPLFIFGLKDTLKRRSWPLNEGVMAVGLVSILASSLLRNDVPMAGPLVAGGLLIACYVITGTSADLTVEKAADITWEEEAEQTGVSHNLMITVDDDGTMRIKDDETENILTVLLKTALLIVSGIFPYSIPATLLVWHTWQKQTQRSGVLWDVPSPPETQKAELEEGVYRIKQQGIFGKTQVGVGVQKEGVFHTMWHVTRGAVLTYNGKRLEPNWASVKKDLISYGGGWRLSAQWQKGEEVQVIAVEPGKNPKNFQTMPGTFQTTTGEIGAIALDFKPGTSGSPIINREGKVVGLYGNGVVTKNGGYVSGIAQTNAEPDGPTPELEEEMFKKRNLTIMDLHPGSGKTRKYLPAIVREAIKRRLRTLILAPTRVVAAEMEEALKGLPIRYQTTATKSEHTGREIVDLMCHATFTMRLLSPVRVPNYNLIIMDEAHFTDPASIAARGYISTRVGMGEAAAIFMTATPPGTADAFPQSNAPIQDEERDIPERSWNSGNEWITDFAGKTVWFVPSIKAGNDIANCLRKNGKKVIQLSRKTFDTEYQKTKLNDWDFVV
>tr|M4KW32|M4KW32_BACIU Choline ABC transporter (ATP-binding protein) OS=Bacillus subtilis XF-1 GN=opuBA PE=4 SV=1 Split=0
MLTLENVSKTYKGGKKAVNNVNLKIAKGEFICFIGPSGCGKTTTMKMINRLIEPSAGKIFIDGENIMDQDPVELRRKIGYVIQQIGLFPHMTIQQNISLVPKLLKWPEQQRKERARELLKLVDMGPEYVDRYPHELSGGQQQRIGVLRALAAEPPLILMDEPFGALDPITRDSLQEEFKKLQKTLHKTIVFVTHDMDEAIKLADRIVILKAGEIVQVGTPDDILRNPADEFVEEFIGKERLIQSSSPDVERVDQIMNTQPVTITADKTLSEAIQLMRQERVDSLLVVDDEHVLQGYVDVEIIDQCRKKANLIGEVLHEDIYTVLGGTLLRDTVRKILKRGVKYVPVVDEDRRLIGIVTRASLVDIVYDSLWGEEKQLAALS
>sp|Q8AWH3|SX17A_XENTR Transcription factor Sox-17-alpha OS=Xenopus tropicalis GN=sox17a PE=2 SV=1 Split=0
MSSPDGGYASDDQNQGKCSVPIMMTGLGQCQWAEPMNSLGEGKLKSDAGSANSRGKAEARIRRPMNAFMVWAKDERKRLAQQNPDLHNAELSKMLGKSWKALTLAEKRPFVEEAERLRVQHMQDHPNYKYRPRRRKQVKRMKRADTGFMHMAEPPESAVLGTDGRMCLESFSLGYHEQTYPHSQLPQGSHYREPQAMAPHYDGYSLPTPESSPLDLAEADPVFFTSPPQDECQMMPYSYNASYTHQQNSGASMLVRQMPQAEQMGQGSPVQGMMGCQSSPQMYYGQMYLPGSARHHQLPQAGQNSPPPEAQQMGRADHIQQVDMLAEVDRTEFEQYLSYVAKSDLGMHYHGQESVVPTADNGPISSVLSDASTAVYYCNYPSA
I got it! :D
OSarr = []
G = 0
for i in range(len(keyLabelrun)):
    OSarr.append(keyLabelrun[G])
    G += 1
    if keyLabelrun[G].count('='):
        while keyLabelrun[G].count('OS=') != 1:
            G += 1
Maybe next time everyone, thank you!
Due to the syntax, you have to keep track of which part (OS, PE, etc) you're currently parsing. Here's a function to extract the species name from the FASTA header:
def extract_species(description):
    species_parts = []
    is_os = False
    for word in description.split():
        if word[:3] == 'OS=':
            is_os = True
            species_parts.append(word[3:])
        elif '=' in word:
            is_os = False
        elif is_os:
            species_parts.append(word)
    return ' '.join(species_parts)
You can call it when processing your input file, e.g.:
from Bio import SeqIO

for record in SeqIO.parse('input.fa', 'fasta'):
    species = extract_species(record.description)
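For a quick check without Biopython, the function can be run directly on one of the sample headers above:
header = ('tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) '
          'OS=Dengue virus 3 PE=4 SV=1 Split=0')
print(extract_species(header))  # prints: Dengue virus 3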

How to find the shortest distance between two line segments capturing the sign values with python

I have a pandas dataframe of the form:
benchmark_x benchmark_y ref_point_x ref_point_y
0 525039.140 175445.518 525039.145 175445.539
1 525039.022 175445.542 525039.032 175445.568
2 525038.944 175445.558 525038.954 175445.588
3 525038.855 175445.576 525038.859 175445.576
4 525038.797 175445.587 525038.794 175445.559
5 525038.689 175445.609 525038.679 175445.551
6 525038.551 175445.637 525038.544 175445.577
7 525038.473 175445.653 525038.459 175445.594
8 525038.385 175445.670 525038.374 175445.610
9 525038.306 175445.686 525038.289 175445.626
I am trying to find the shortest distance from the line to the benchmark, such that if the line is above the benchmark the distance is positive, and if it is below the benchmark the distance is negative.
I used the KDTree from scipy like so:
from scipy.spatial import KDTree
tree=KDTree(df[["benchmark_x", "benchmark_y"]])
test = df.apply(lambda row: tree.query(row[["ref_point_x", "ref_point_y"]]), axis=1)
test=test.apply(pd.Series, index=["distance", "index"])
This seems to work, except that it fails to capture the negative values when the line is below the benchmark.
# recreating your example
import numpy as np
import pandas as pd

columns = "benchmark_x benchmark_y ref_point_x ref_point_y".split(" ")
data = """525039.140 175445.518 525039.145 175445.539
525039.022 175445.542 525039.032 175445.568
525038.944 175445.558 525038.954 175445.588
525038.855 175445.576 525038.859 175445.576
525038.797 175445.587 525038.794 175445.559
525038.689 175445.609 525038.679 175445.551
525038.551 175445.637 525038.544 175445.577
525038.473 175445.653 525038.459 175445.594
525038.385 175445.670 525038.374 175445.610
525038.306 175445.686 525038.289 175445.626"""
data = [float(x) for x in data.replace("\n"," ").split(" ") if len(x)>0]
arr = np.array(data).reshape(-1,4)
df = pd.DataFrame(arr, columns=columns)
# adding your two new columns to the df
from scipy.spatial import KDTree
tree=KDTree(df[["benchmark_x", "benchmark_y"]])
df["distance"], df["index"] = tree.query(df[["ref_point_x", "ref_point_y"]])
Now, to compare whether one line is above the other, we have to evaluate y at the same x position. Therefore we need to interpolate the y values at the x positions of the other line.
df = df.sort_values("ref_point_x") # sorting is required for interpolation
xy_refpoint = df[["ref_point_x", "ref_point_y"]].values
df["ref_point_y_at_benchmark_x"] = np.interp(df["benchmark_x"], xy_refpoint[:,0], xy_refpoint[:,1])
And finally your criterion can be evaluated and applied:
df["distance"] = np.where(df["ref_point_y_at_benchmark_x"] < df["benchmark_y"], -df["distance"], df["distance"])
# swap < for >, <=, or >= as needed
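An alternative sketch for the sign, under the assumption that the benchmark points are ordered along the line: the 2D cross product of the local benchmark direction with the offset to the reference point tells which side of the line each point falls on.
# direction of the benchmark line at each point (first row backfilled)
d = df[["benchmark_x", "benchmark_y"]].diff().bfill().values
# offset from each benchmark point to its reference point
v = df[["ref_point_x", "ref_point_y"]].values - df[["benchmark_x", "benchmark_y"]].values
# 2D cross product: positive on one side of the line, negative on the other
cross = d[:, 0] * v[:, 1] - d[:, 1] * v[:, 0]
df["signed_distance"] = np.sign(cross) * df["distance"]  # "signed_distance" is a new column name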

How do I convert numpy array to days, hours, mins?

Running with this series:
X = number_of_logons_all.values
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
I get:
mean1=60785.792548, mean2=61291.266868
variance1=7483553053.651829, variance2=7603208729.348722
But I wanted something like this in my PyCharm console (pulled from another result):
>>> -103 days +04:37:13.802435724...
Tried to place the np.array in a pd.DataFrame() to get the expected value by adding
.apply(pd.to_timedelta, unit='s')
...this didn't work, so I tried
new = pd.DataFrame([mean1]).to_numpy(dtype='timedelta64[ns]')
...and (still) got something like this:
>>>> [[63394]]
Is there anyone out there who could assist me in converting my means calculation above into an easily comprehended datetime result?
Thanks in advance for your kind support.
You can use f-strings:
import pandas as pd

mean1, mean2 = 60785.792548, 61291.266868
variance1, variance2 = 7483553053.651829, 7603208729.348722
print(f'mean1={pd.Timedelta(mean1, unit="s")}, mean2={pd.Timedelta(mean2, unit="s")}')
print(f'variance1={pd.Timedelta(variance1, unit="s")}, variance2={pd.Timedelta(variance2, unit="s")}')
mean1=0 days 16:53:05.792548, mean2=0 days 17:01:31.266868
variance1=86615 days 04:44:13.651828766, variance2=88000 days 02:25:29.348722458
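One caveat: a variance of seconds is in seconds squared, so showing it as a Timedelta is not physically meaningful. Converting the standard deviation instead gives an interpretable duration, as in this sketch:
import numpy as np
import pandas as pd

# the variance has units of seconds**2; take the square root first
print(f'std1={pd.Timedelta(np.sqrt(7483553053.651829), unit="s")}')  # roughly 1 day 00:01:47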

Keras Prediction result (getting score,use of argmax)

I am trying to use the ELMo model for text classification on my own dataset. The training is completed, and the number of classes is 4 (a Keras model with ELMo embedding). From the prediction I get a numpy array. I am attaching the sample code and the result below.
import tensorflow as tf
import keras.backend as K

new_text_pr = np.array(data, dtype=object)[:, np.newaxis]
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model_elmo = build_model(classes)
    model_elmo.load_weights(model + "/" + elmo_model)
    import time
    t = time.time()
    predicted = model_elmo.predict(new_text_pr)
    print("time: ", time.time() - t)
    print(predicted)
    # print(predicted[0][0])
    print("result:", np.argmax(predicted[0]))
    return np.argmax(predicted[0])
When I print the predicted variable, I get this:
time: 1.561854362487793
[[0.17483692 0.21439584 0.24001297 0.3707543 ]
[0.15607062 0.24448264 0.4398888 0.15955798]
[0.06494818 0.3439018 0.42254424 0.16860574]
[0.08343349 0.37218323 0.32528472 0.2190985 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.0365712 0.4194748 0.3321385 0.21181548]
[0.05350104 0.18225929 0.56712115 0.19711846]
[0.08343349 0.37218323 0.32528472 0.2190985 ]
[0.09541835 0.19085276 0.41069734 0.30303153]
[0.03930932 0.40526104 0.45785302 0.09757669]
[0.06377257 0.33980298 0.32396355 0.27246094]
[0.09784496 0.2292052 0.44426462 0.22868524]
[0.06089798 0.31685832 0.47317514 0.14906852]
[0.03956613 0.46605557 0.3502095 0.14416872]
[0.10513227 0.26166025 0.36598155 0.26722598]
[0.15165758 0.22900137 0.50939053 0.10995051]
[0.06377257 0.33980298 0.32396355 0.27246094]
[0.11404029 0.21311268 0.46880838 0.2040386 ]
[0.07556026 0.20502563 0.52019936 0.19921473]
[0.11096822 0.23295449 0.36192006 0.29415724]
[0.05018891 0.16656907 0.60114646 0.18209551]
[0.08880813 0.2893545 0.44374797 0.1780894 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.09596984 0.18282187 0.5053091 0.2158991 ]
[0.09428936 0.13995855 0.62395805 0.14179407]
[0.10513227 0.26166025 0.36598155 0.26722598]
[0.08244281 0.15743142 0.5462735 0.21385226]
[0.07199708 0.2446867 0.44568574 0.23763043]
[0.1339082 0.27288827 0.43478844 0.15841508]
[0.07354636 0.24499843 0.44873005 0.23272514]
[0.08880813 0.2893545 0.44374797 0.1780894 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.08924995 0.36547357 0.40014726 0.14512917]
[0.05132649 0.28190497 0.5224545 0.14431408]
[0.06377257 0.33980292 0.32396355 0.27246094]
[0.04849219 0.36724472 0.39698333 0.1872797 ]
[0.07206573 0.31368822 0.4667826 0.14746341]
[0.05948553 0.28048623 0.41831577 0.2417125 ]
[0.07582933 0.18771031 0.54879296 0.18766735]
[0.03858965 0.20433436 0.5596278 0.19744818]
[0.07443814 0.20681688 0.3933627 0.32538226]
[0.0639974 0.23687115 0.5357675 0.16336392]
[0.11005415 0.22901568 0.4279426 0.23298755]
[0.12625505 0.22987585 0.31619486 0.32767424]
[0.08893713 0.14554602 0.45740074 0.30811617]
[0.07906891 0.18683094 0.5214609 0.21263924]
[0.06316617 0.30398315 0.4475617 0.185289 ]
[0.07060979 0.17987429 0.4829593 0.26655656]
[0.0720717 0.27058697 0.41439256 0.24294883]
[0.06377257 0.33980292 0.32396355 0.27246094]
[0.04745338 0.25831962 0.46751252 0.22671448]
[0.06624557 0.20708969 0.54820716 0.17845756]]
result:3
Does anyone have any idea why only the 0th index value is used? Considering this as a list of lists, the 0th index means the first list, and argmax returns the index of the maximum value in that list. Then what is the use of the other values in the lists? Why aren't they considered? Also, is it possible to get the score from this? I hope the question is clear. Is this the correct way, or is it wrong?
I have found the issue. Just posting it for others who hit the same problem.
Answer: when predicting, the ELMo model expects a list of strings. In my code, the prediction data were split and the model predicted for each word; that's why I got this huge array. As a temporary fix, the data is appended to a list and an empty string is appended as well; the model predicts both list entries, but I take only the first prediction. This is not the correct way, but I have done it as a quick fix and hope to find a proper one in the future.
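A minimal sketch of that workaround, reusing names from the question (my_sentence and the surrounding code are assumptions):
# wrap the single input string plus a dummy empty string, as described above
texts = [my_sentence, ""]  # my_sentence is a placeholder for the actual input
new_text_pr = np.array(texts, dtype=object)[:, np.newaxis]
predicted = model_elmo.predict(new_text_pr)
result = np.argmax(predicted[0])  # keep only the first row's prediction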
To find the predicted class for each test example, you need to use axis=1. So, in your case the predicted classes will be:
>>> predicted_classes = predicted.argmax(axis=1)
>>> predicted_classes
[3 2 2 1 2 1 2 1 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
2 2 2 2 2 2 3 2 2 2 2 2 1 2 2]
Which means that the first test example belongs to the third class, and the second test example belongs to the second class and so on.
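To also answer the score part of the question: the probability of the predicted class for each example can be read off with max along the same axis:
>>> scores = predicted.max(axis=1)  # probability of the predicted class, per example
>>> scores[:3]
array([0.3707543 , 0.4398888 , 0.42254424])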
The previous part answers your question (I think); now let's see what np.argmax(predicted) does. Using np.argmax() alone, without specifying the axis, will flatten your predicted matrix and return the index of the maximum number.
Let's see this simple example to know what I mean:
>>> x = np.matrix(np.arange(12).reshape((3,4)))
>>> x
matrix([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> x.argmax()
11
11 is the index of the 11, which is the biggest number in the whole matrix.
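And to map that flat index back to a (row, column) pair, np.unravel_index can be used:
>>> np.unravel_index(x.argmax(), x.shape)
(2, 3)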
