Python3: set a range of data - python-3.x

I feel this must be very basic but I cannot find a simple way.
I am using python3
I have many data files with x,y data where x goes from 0 to 140 (floating).
Let's say
0, 2.1
0.5,3.5
0.8,3.2
...
I want to import the values of x within the range 25.4 to 28.1 and their corresponding values of y. Every file might have a different length, so the value x > 25.4 might appear in a different row.
I am looking for something equivalent to the following command in gnuplot:
set xrange [25.4:28.1]
This time I cannot use gnuplot because the data processing requires more than the capabilities of gnuplot.
I imported the data with Pandas but I cannot set a range.
Thank you.

r = range(start, stop, step) is the pattern for this in Python.
So, for example, to get:
r == [0, 1, 2]
You would write:
r = [x for x in range(3)]
And to get:
r == [0, 5, 10]
You would write:
r = [x for x in range(0, 11, 5)]
This doesn't get you very far because:
r = [0, .2, 4.3, 6.3]
r = [x for x in r if x in range(3, 10)]
# r == []
But you can do:
r = [0, .2, 4.3, 6.3]
r = [x for x in r if ((x > 3) & (x < 10))]
# r == [4.3, 6.3]
Pandas and NumPy give you a much more concise way of doing this. Consider the following demo of .between:
import pandas as pd
import io
text = io.StringIO("""Close Top_Barrier Bottom_Barrier
0 441.86 441.964112 426.369888
1 448.95 444.162225 425.227108
2 449.99 446.222271 424.285063
3 449.74 447.947051 423.678282
4 451.97 449.879254 423.029413""")
df = pd.read_csv(text, sep=r'\s+')
df = df[df["Close"].between(449, 452)] # between
df
So for your df you can do the same: df = df[df["x"].between(min, max)]
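A minimal end-to-end sketch for the original question (the file name, the absence of a header row, and the column names are assumptions on my part):

import pandas as pd

# read a headerless two-column file of comma-separated x,y values
df = pd.read_csv("data1.txt", names=["x", "y"])

# keep only the rows whose x falls in the gnuplot-style range [25.4:28.1]
subset = df[df["x"].between(25.4, 28.1)]
print(subset)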

Related

Python optimization of time-series data re-indexing based on multiple-parameter multi-variable input and singular value output

I am trying to optimize a function that maximizes the correlation between two (pandas) time series arrays (X and Y). This is done by using three parameters (a, b, c) and a third time series array (Z). The Z array is used to reindex the values in the X array (based on the parameters a, b, c) in such a way as to maximize the correlation of the reindexed X array (Xnew) with the Y array.
Below is some pseudo-code to demonstrate what I am trying to do. I have attempted this using LMfit and scipy.optimize, but I am not sure how to make this task work in those packages. For example, in LMfit, if I try to minimize the MyOpt function (which passes back a single value of the correlation metric), it complains that I have more parameters than outputs. However, if I pass back the time series of the correlation metric (diff), the parameter values remain fixed at their input values.
I know the reindexing function I am using works, because rather crude methods similar to the code below give significant changes in the mean (diff) metric passed back.
My knowledge of these optimization packages is not up to scratch for this job, so if anyone has a suggestion on how to tackle this, I would be grateful.
def GetNewIndex(Z, a, b, c):
    old_index = np.arange(0, len(Z))
    index_adj = some_func(a, b, c)
    new_index = old_index + index_adj
    max_old = np.max(old_index)
    new_index[new_index > max_old] = max_old
    new_index[new_index < 0] = 0
    return new_index

def MyOpt(params, X, Y, Z):
    a = params['A']
    b = params['B']
    c = params['C']
    # estimate lag (in samples) based on ambient RH
    new_index = GetNewIndex(Z, a, b, c)
    # assign old values to new locations and convert back to a pandas series
    Xnew = np.take(X.values, new_index)
    Xnew = pd.Series(Xnew, index=X.index)
    cc = Y.rolling(1201, center=True).corr(Xnew)
    cc = cc.interpolate(limit_direction='both', limit_area=None)
    diff = 1 - np.abs(cc)
    return np.mean(diff)

#==================================================
X = some long pandas time series data
Y = some long pandas time series data
Z = some long pandas time series data

As = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
Bs = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
Cs = [5, 6, 5, 6, 5, 6, 5, 6, 5, 6, 5, 6]

outs = []
for A, B, C in zip(As, Bs, Cs):
    params = {'A': A, 'B': B, 'C': C}
    out = MyOpt(params, X, Y, Z)
    outs.append(out)
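A hedged sketch of one way this scalar objective could be wired into scipy.optimize.minimize with a gradient-free method (this is not from the original post; it assumes MyOpt, X, Y and Z are defined as in the pseudo-code above, and the starting guess is a placeholder):

import numpy as np
from scipy.optimize import minimize

def objective(theta, X, Y, Z):
    # scipy passes the parameters as a flat array; repack them into the
    # dict layout that MyOpt expects
    params = {'A': theta[0], 'B': theta[1], 'C': theta[2]}
    return MyOpt(params, X, Y, Z)

x0 = np.array([1.0, 0.0, 5.0])  # placeholder starting guess
res = minimize(objective, x0, args=(X, Y, Z), method='Nelder-Mead')
print(res.x, res.fun)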

SymPy result Filtering

I was recently working on a Codeforces problem.
So, I was using SymPy to solve this.
My code is:
from sympy import *
x,y = symbols("x,y", integer = True)
m,n = input().split(" ")
sol = solve([x**2 + y - int(n), y**2 + x - int(m)], [x, y])
print(sol)
What I wanted to do:
Filter only the positive integer values from the SymPy result.
Ex: If I put 14 28 in the terminal it gives me tons of results, but I just want it to show [(5, 3)].
I don't think that this is the intended way to solve the Codeforces problem (I think you're just supposed to loop over the possible values for one of the variables).
I'll show how to make use of SymPy here anyway, though. Your problem is a diophantine system of equations. Although SymPy has a diophantine solver, it only works for individual equations rather than systems.
Usually, though, the idea of using a CAS for something like this is to symbolically find a general result that then helps you write faster concrete numerical code. Here are your equations with m and n as arbitrary symbols:
In [62]: x, y, m, n = symbols('x, y, m, n')
In [63]: eqs = [x**2 + y - n, y**2 + x - m]
Using the polynomial resultant we can eliminate either x or y from this system to obtain a quartic polynomial for the remaining variable:
In [31]: py = resultant(eqs[0], eqs[1], x)
In [32]: py
Out[32]: m**2 - 2*m*y**2 - n + y + y**4
While there is a general quartic formula that SymPy can use (if you use solve or roots here), it is too complicated to be useful for a problem like the one you are describing. Instead, the rational root theorem tells us that an integer root for y must be a divisor of the constant term:
In [33]: py.coeff(y, 0)
Out[33]: m**2 - n
Therefore the possible values for y are:
In [64]: yvals = divisors(py.coeff(y, 0).subs({m:14, n:28}))
In [65]: yvals
Out[65]: [1, 2, 3, 4, 6, 7, 8, 12, 14, 21, 24, 28, 42, 56, 84, 168]
Since x is m - y**2 the corresponding values for x are:
In [66]: solve(eqs[1], x)
Out[66]: [m - y**2]
In [67]: xvals = [14 - yv**2 for yv in yvals]
In [68]: xvals
Out[68]: [13, 10, 5, -2, -22, -35, -50, -130, -182, -427, -562, -770, -1750, -3122, -7042, -28210]
The candidate solutions are then given by:
In [69]: candidates = [(xv, yv) for xv, yv in zip(xvals, yvals) if xv > 0]
In [70]: candidates
Out[70]: [(13, 1), (10, 2), (5, 3)]
From there you can test which values are solutions:
In [74]: eqsmn = [eq.subs({m:14, n:28}) for eq in eqs]
In [75]: [c for c in candidates if all(eq.subs(zip([x,y],c))==0 for eq in eqsmn)]
Out[75]: [(5, 3)]
The algorithmically minded will probably see from the above example how to make a much more efficient way of implementing the solver.
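For instance, here is a sketch of such a direct solver (not part of the original answer; it assumes m**2 - n is a positive integer so that divisors() applies, and it checks every candidate against both equations):

from sympy import divisors

def solve_positive_integer_system(m, n):
    # integer roots of the quartic in y must divide its constant term m**2 - n
    solutions = []
    for yv in divisors(m**2 - n):       # positive divisors only
        xv = m - yv**2                  # from the second equation: y**2 + x = m
        if xv > 0 and xv**2 + yv == n:  # check the first equation: x**2 + y = n
            solutions.append((xv, yv))
    return solutions

print(solve_positive_integer_system(14, 28))  # [(5, 3)]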
I've figured out the answer to my question! At first, I was trying to filter the result from solve(), but there is an easier way to do this.
Pseudo code:
solve() gives the intersection points of both parabolic equations as a list.
I just need to filter() out the other types of values, which in my case are <sympy.core.add.Add>.
def rem(_list):
    return list(filter(lambda v: type(v) != Add, _list))
You could also filter on an integer type instead, but note that solve() returns SymPy numbers such as sympy.Integer rather than Python int, so type(v) == int would not match.
Final code:
from sympy import *

# the other values were <sympy.core.add.Add> type, so I just defined a function
# to filter out these specific types from my list.
def rem(_list):
    return list(filter(lambda v: type(v) != Add, _list))

x, y = symbols("x,y", integer=True, negative=False)
output = []
m, n = input().split(' ')

# I need to solve these 2 equations separately; otherwise, my defined function
# will not work without a loop.
solX = rem(solve((x + (int(n) - x**2)**2 - int(m)), x))
solY = rem(solve((int(m) - y**2)**2 + y - int(n), y))

if len(solX) == 0 or len(solY) == 0:
    print(0)
else:
    output.extend(solX)  # using "extend" to add multiple values to the list
    output.extend(solY)
    print(int(len(output) / 2))  # results come in pairs, so divide the length of the list by 2
Why I used this approach:
I tried to solve it algorithmically, but I still had some floating-point issues, and I wanted to skip the loop approach here.
Since SymPy's solve() had already found the values, I skipped the other way and focused on filtering.
Sadly, the Codeforces judge shows a runtime error! I guess it can't import SymPy. However, it works fine in VS Code.

Find the index location of an element in a Numpy array

If I have:
x = np.array(([1,4], [2,5], [2,6], [3,4], [3,6], [3,7], [4,3], [4,5], [5,2]))
for item in range(3):
    choice = random.choice(x)
How can I get the index number of the random choice taken from the array?
I tried:
indexNum = np.where(x == choice)
print(indexNum[0])
But it didn't work.
I want the output, for example, to be something like:
chosenIndices = [1 5 8]
Another possibility is using np.where and np.intersect1d. Here is random choice without repetition:
import random
import numpy as np

x = np.array(([1,4], [2,5], [2,6], [3,4], [3,6], [3,7], [4,3], [4,5], [5,2]))
res = []
cont = 0
while cont < 3:
    choice = random.choice(x)
    ind = np.intersect1d(np.where(choice[0] == x[:, 0]), np.where(choice[1] == x[:, 1]))[0]
    if ind not in res:
        res.append(ind)
        cont += 1
print(res)
# Output [8, 1, 5]
You can achieve this by converting the numpy array to a list of tuples and then applying the index function.
This would work:
import random
import numpy as np

chosenIndices = []
x = np.array(([1,4], [2,5], [2,6], [3,4], [3,6], [3,7], [4,3], [4,5], [5,2]))
x = x.T
x = list(zip(x[0], x[1]))

while len(chosenIndices) != 3:
    choice = random.choice(x)
    indexNum = x.index(choice)
    if indexNum not in chosenIndices:  # if the index already exists, that iteration simply runs again
        chosenIndices.append(indexNum)

print(chosenIndices)  # thus all different results
Output:
[1, 3, 2]
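As a side note (not part of the original answers): if the goal is simply three distinct row indices, one alternative is to sample the indices directly with numpy and then look up the rows:

import numpy as np

x = np.array([[1, 4], [2, 5], [2, 6], [3, 4], [3, 6], [3, 7], [4, 3], [4, 5], [5, 2]])

# sample three distinct row indices, then use them to recover the chosen rows
chosenIndices = np.random.choice(len(x), size=3, replace=False)
chosenRows = x[chosenIndices]
print(chosenIndices, chosenRows)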

Generate a list with two unique elements with specific length [duplicate]

Simple question here:
I'm trying to get an array that alternates values (1, -1, 1, -1.....) for a given length. np.repeat just gives me (1, 1, 1, 1,-1, -1,-1, -1). Thoughts?
I like @Benjamin's solution. An alternative, though, is:
import numpy as np
a = np.empty((15,))
a[::2] = 1
a[1::2] = -1
This also allows for odd-length lists.
EDIT: Also, just to note speeds for an array of 10000 elements:
import numpy as np
from timeit import Timer

if __name__ == '__main__':

    setupstr = """
import itertools
import numpy as np
N = 10000
"""

    method1 = """
a = np.empty((N,), int)
a[::2] = 1
a[1::2] = -1
"""

    method2 = """
a = np.tile([1,-1], N)
"""

    method3 = """
a = np.array([1,-1] * N)
"""

    method4 = """
a = np.array(list(itertools.islice(itertools.cycle((1,-1)), N)))
"""

    nl = 1000
    t1 = Timer(method1, setupstr).timeit(nl)
    t2 = Timer(method2, setupstr).timeit(nl)
    t3 = Timer(method3, setupstr).timeit(nl)
    t4 = Timer(method4, setupstr).timeit(nl)

    print('method1', t1)
    print('method2', t2)
    print('method3', t3)
    print('method4', t4)
Results in timings of:
method1 0.0130500793457
method2 0.114426136017
method3 4.30518102646
method4 2.84446692467
If N = 100, things start to even out but starting with the empty numpy arrays is still significantly faster (nl changed to 10000)
method1 0.05735206604
method2 0.323992013931
method3 0.556654930115
method4 0.46702003479
Numpy arrays are special awesome objects and should not be treated like python lists.
use resize():
In [38]: np.resize([1,-1], 10) # 10 is the length of result array
Out[38]: array([ 1, -1, 1, -1, 1, -1, 1, -1, 1, -1])
It can also produce an odd-length array:
In [39]: np.resize([1,-1], 11)
Out[39]: array([ 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1])
Use numpy.tile!
import numpy
a = numpy.tile([1,-1], 15)
use multiplication:
[1,-1] * n
If you want a memory efficient solution, try this:
def alternator(n):
    for i in range(n):
        if i % 2 == 0:
            yield 1
        else:
            yield -1
Then you can iterate over the answers like so:
for i in alternator(n):
    # do something with i
Maybe you're looking for itertools.cycle?
import itertools

list_ = (1, -1, 2, -2)  # , 3, -3, ...
for n, item in enumerate(itertools.cycle(list_)):
    if n == 30:
        break
    print(item)
I'll just throw these out there because they could be more useful in some circumstances.
If you just want to alternate between positive and negative:
[(-1)**i for i in range(n)]
or for a more general solution
nums = [1, -1, 2]
[nums[i % len(nums)] for i in range(n)]

How to iterate over dfs and append data with combined names

I have this problem to solve. This is a continuation of a previous question, How to iterate over pandas df with a def function variable function, and the given answer worked perfectly, but now I have to append all the data into a 2-column dataframe (Adduct_name and mass).
This is from the previous question:
My goal: I have to calculate the "adducts" for a given "Compound"; both represent numbers, but for each "Compound" there are 46 different "Adducts".
Each adduct is calculated as follows:
Adduct 1 = [Exact_mass*M/Charge + Adduct_mass]
where Exact_mass is a number, M and Charge are numbers (1, 2, 3, etc.) according to each type of adduct, and Adduct_mass is a number (positive or negative) according to each adduct.
My data: 2 data frames. One with the Adduct names, M, Charge, and Adduct_mass. The other one corresponds to the Compound_name and Exact_mass of the compounds I want to iterate over (I just put a small data set).
Adducts: df_al
import pandas as pd
data = [["M+3H", 3, 1, 1.007276], ["M+3Na", 3, 1, 22.989], ["M+H", 1, 1,
1.007276], ["2M+H", 1, 2, 1.007276], ["M-3H", 3, 1, -1.007276]]
df_al = pd.DataFrame(data, columns=["Ion_name", "Charge", "M", "Adduct_mass"])
Compounds: df
import pandas as pd
data1 = [[1, "C3H64O7", 596.465179], [2, "C30H42O7", 514.293038], [4,
"C44H56O8", 712.397498], [4, "C24H32O6S", 448.191949], [5, "C20H28O3",
316.203834]]
df = pd.DataFrame(data1, columns=["CdId", "Formula", "exact_mass"])
The solution to this problem was:
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
#Defining general function
def Adduct(x, i):
    return x * df_M[i] / df_div[i] + df_mass[i]

#Applying general function in a range from 0 to 5.
for i in range(5):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x, i))
Output
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
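As a quick sanity check of the formula (not part of the original post): for the first compound (exact_mass = 596.465179) and the M+3H row of df_al (M = 1, Charge = 3, Adduct_mass = 1.007276), Exact_mass*M/Charge + Adduct_mass = 596.465179 * 1 / 3 + 1.007276 ≈ 199.829002, which matches the first value in the table above.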
Now, those are the right calculations, but I need a file where:
- only 2 columns exist (Name and Mass)
- all the different adducts are appended one after another
Desired output:
Name Mass
a_M+3H 199.82902
a_M+3Na 221.810726
a_M+H 597.472455
a_2M+H 1193.937634
a_M-3H 197.814450
b_M+3H 514.293038
.
.
.
c_M+3H
and so on.
Also, I need to combine the name of the respective compound with the ion form (M+3H, M+H, etc.).
At this point I have no code for that.
I would appreciate any advice and a better approach from the beginning.
This part is an update of the question above:
Is it possible to obtain an output like this one:
Name Mass RT
a_M+3H 199.82902 1
a_M+3Na 221.810726 1
a_M+H 597.472455 1
a_2M+H 1193.937634 1
a_M-3H 197.814450 1
b_M+3H 514.293038 3
.
.
.
c_M+3H 2
The RT is the same value for all forms of a compound; in this example the RT for a = 1, b = 3, c = 2, etc.
Is it possible to incorporate (keep) this column from the data set df (which I update here below)? As you can see, df has more columns, like "Formula" and "RT", which disappear after the calculations.
import pandas as pd
data1 = [["a", "C3H64O7", 596.465179, 1], ["b", "C30H42O7", 514.293038, 3],
         ["c", "C44H56O8", 712.397498, 2], ["d", "C24H32O6S", 448.191949, 4],
         ["e", "C20H28O3", 316.203834, 1.5]]
df = pd.DataFrame(data1, columns=["Name", "Formula", "exact_mass", "RT"])
Part three! (Sorry, and thank you.)
This is a trial I did on a small data set (df) using the code below, with the same df_al as above.
Code
#Defining variables for calculation
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
df_ID= df["Name"]
#Defining the RT dictionary
RT = dict(zip(df["Name"], df["RT"]))
#Removing RT column
df=df.drop(columns=["RT"])
#Defining general function
def Adduct(x, i):
    return x * df_M[i] / df_div[i] + df_mass[i]

#Applying general function in a range from 0 to 46.
for i in range(47):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x, i))

df
output
#Melting
df = pd.melt(df, id_vars=['Name'], var_name = "Adduct", value_name= "Exact_mass", value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x:x[0] + "_" + x[1], axis=1)
df['RT'] = df.Name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
del df['Name']
del df['Adduct']
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
output
Why NaN?
Here is how I would go about it; pandas.melt comes to the rescue:
import pandas as pd
import numpy as np
from io import StringIO
s = StringIO('''
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
''')
df = pd.read_csv(s, sep="\s+")
df = pd.melt(df, id_vars=['Name'], value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x:x[0] + "_" + x[1], axis=1)
del df['Name']
del df['variable']
RT = {'a':1, 'b':2, 'c':3, 'd':5, 'e':1.5}
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
Here is the output:
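A small follow-up sketch for the RT part of the question (not from the original answer): if RT is already a column of the wide dataframe, it can be carried through the reshape directly by listing it in id_vars, with no lookup dictionary needed. The wide dataframe below is a tiny illustrative stand-in for the one built above.

import pandas as pd

# tiny example in the shape produced by the code above (values are illustrative)
wide = pd.DataFrame({
    "Name": ["a", "b"],
    "RT": [1, 3],
    "exact_mass": [596.465179, 514.293038],
    "M+3H": [199.829002, 172.438289],
    "M+H": [597.472455, 515.300314],
})

value_cols = [c for c in wide.columns if c not in ("Name", "RT", "exact_mass")]
long = pd.melt(wide, id_vars=["Name", "RT"], value_vars=value_cols,
               var_name="Adduct", value_name="Mass")
long["Name"] = long["Name"] + "_" + long["Adduct"]
print(long[["Name", "Mass", "RT"]])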
