Pine Script in TradingView: adding HH/LH and HL/LL labels to the default ZigZag indicator pivots

Here is the default ZigZag indicator in Pine Script on TradingView. It displays the price, volume, and price difference at each pivot point.
Is there a way to display "HH" or "LH" on each top pivot point, depending on whether it is higher or lower than the previous top pivot point, and likewise "HL" or "LL" on the bottom pivot points?
//@version=5
indicator("Zig Zag", overlay=true, max_lines_count=500, max_labels_count=500)
dev_threshold = input.float(title="Deviation (%)", defval=5.0, minval=0.00001, maxval=100.0)
depth = input.int(title="Depth", defval=10, minval=1)
line_color = input(title="Line Color", defval=#2962FF)
extend_to_last_bar = input(title="Extend to Last Bar", defval=true)
display_reversal_price = input(title="Display Reversal Price", defval=true)
display_cumulative_volume = input(title="Display Cumulative Volume", defval=true)
display_reversal_price_change = input(title="Display Reversal Price Change", defval=true, inline="price rev")
difference_price = input.string("Absolute", "", options=["Absolute", "Percent"], inline="price rev")
pivots(src, length, isHigh) =>
    p = nz(src[length])
    if length == 0
        [time, p]
    else
        isFound = true
        for i = 0 to math.abs(length - 1)
            if isHigh and src[i] > p
                isFound := false
            if not isHigh and src[i] < p
                isFound := false
        for i = length + 1 to 2 * length
            if isHigh and src[i] >= p
                isFound := false
            if not isHigh and src[i] <= p
                isFound := false
        if isFound and length * 2 <= bar_index
            [time[length], p]
        else
            [int(na), float(na)]
[iH, pH] = pivots(high, math.floor(depth / 2), true)
[iL, pL] = pivots(low, math.floor(depth / 2), false)
calc_dev(base_price, price) =>
    100 * (price - base_price) / base_price
price_rotation_aggregate(price_rotation, pLast, cum_volume) =>
    str = ""
    if display_reversal_price
        str += str.tostring(pLast, format.mintick) + " "
    if display_reversal_price_change
        str += price_rotation + " "
    if display_cumulative_volume
        str += "\n" + cum_volume
    str
caption(isHigh, iLast, pLast, price_rotation, cum_volume) =>
    price_rotation_str = price_rotation_aggregate(price_rotation, pLast, cum_volume)
    if display_reversal_price or display_reversal_price_change or display_cumulative_volume
        if not isHigh
            label.new(iLast, pLast, text=price_rotation_str, style=label.style_none, xloc=xloc.bar_time, yloc=yloc.belowbar, textcolor=color.red)
        else
            label.new(iLast, pLast, text=price_rotation_str, style=label.style_none, xloc=xloc.bar_time, yloc=yloc.abovebar, textcolor=color.green)
price_rotation_diff(pLast, price) =>
    if display_reversal_price_change
        tmp_calc = price - pLast
        str = difference_price == "Absolute"? (math.sign(tmp_calc) > 0? "+" : "") + str.tostring(tmp_calc, format.mintick) : (math.sign(tmp_calc) > 0? "+" : "-") + str.tostring((math.abs(tmp_calc) * 100)/pLast, format.percent)
        str := "(" + str + ")"
        str
    else
        ""
var line lineLast = na
var label labelLast = na
var int iLast = 0
var float pLast = 0
var bool isHighLast = true // otherwise the last pivot is a low pivot
var int linesCount = 0
var float sumVol = 0
var float sumVolLast = 0
pivotFound(dev, isHigh, index, price) =>
    if isHighLast == isHigh and not na(lineLast)
        // same direction
        if isHighLast ? price > pLast : price < pLast
            if linesCount <= 1
                line.set_xy1(lineLast, index, price)
            line.set_xy2(lineLast, index, price)
            label.set_xy(labelLast, index, price)
            label.set_text(labelLast, price_rotation_aggregate(price_rotation_diff(line.get_y1(lineLast), price), price, str.tostring(sumVol + sumVolLast, format.volume)))
            [lineLast, labelLast, isHighLast, false, sumVol + sumVolLast]
        else
            [line(na), label(na), bool(na), false, float(na)]
    else // reverse the direction (or create the very first line)
        if na(lineLast)
            id = line.new(index, price, index, price, xloc=xloc.bar_time, color=line_color, width=2)
            lb = caption(isHigh, index, price, price_rotation_diff(pLast, price), str.tostring(sumVol, format.volume))
            [id, lb, isHigh, true, sumVol]
        else
            // price move is significant
            if math.abs(dev) >= dev_threshold
                id = line.new(iLast, pLast, index, price, xloc=xloc.bar_time, color=line_color, width=2)
                lb = caption(isHigh, index, price, price_rotation_diff(pLast, price), str.tostring(sumVol, format.volume))
                [id, lb, isHigh, true, sumVol]
            else
                [line(na), label(na), bool(na), false, float(na)]
sumVol += nz(volume[math.floor(depth / 2)])
if not na(iH) and not na(iL) and iH == iL
    dev1 = calc_dev(pLast, pH)
    [id2, lb2, isHigh2, isNew2, sum2] = pivotFound(dev1, true, iH, pH)
    if isNew2
        linesCount := linesCount + 1
    if not na(id2)
        lineLast := id2
        labelLast := lb2
        isHighLast := isHigh2
        iLast := iH
        pLast := pH
        sumVolLast := sum2
        sumVol := 0
    dev2 = calc_dev(pLast, pL)
    [id1, lb1, isHigh1, isNew1, sum1] = pivotFound(dev2, false, iL, pL)
    if isNew1
        linesCount := linesCount + 1
    if not na(id1)
        lineLast := id1
        labelLast := lb1
        isHighLast := isHigh1
        iLast := iL
        pLast := pL
        sumVolLast := sum1
        sumVol := 0
else
    if not na(iH)
        dev1 = calc_dev(pLast, pH)
        [id, lb, isHigh, isNew, sum] = pivotFound(dev1, true, iH, pH)
        if isNew
            linesCount := linesCount + 1
        if not na(id)
            lineLast := id
            labelLast := lb
            isHighLast := isHigh
            iLast := iH
            pLast := pH
            sumVolLast := sum
            sumVol := 0
    else
        if not na(iL)
            dev2 = calc_dev(pLast, pL)
            [id, lb, isHigh, isNew, sum] = pivotFound(dev2, false, iL, pL)
            if isNew
                linesCount := linesCount + 1
            if not na(id)
                lineLast := id
                labelLast := lb
                isHighLast := isHigh
                iLast := iL
                pLast := pL
                sumVolLast := sum
                sumVol := 0
var line extend_line = na
var label extend_label = na
if extend_to_last_bar == true and barstate.islast == true
    isHighLastPoint = not isHighLast
    curSeries = isHighLastPoint ? high : low
    if na(extend_line) and na(extend_label)
        extend_line := line.new(line.get_x2(lineLast), line.get_y2(lineLast), time, curSeries, xloc=xloc.bar_time, color=line_color, width=2)
        extend_label := caption(not isHighLast, time, curSeries, price_rotation_diff(line.get_y2(lineLast), curSeries), str.tostring(sumVol, format.volume))
    line.set_xy1(extend_line, line.get_x2(lineLast), line.get_y2(lineLast))
    line.set_xy2(extend_line, time, curSeries)
    price_rotation = price_rotation_diff(line.get_y1(extend_line), curSeries)
    remaingRealTimeVol = 0.
    for i = math.abs(math.floor(depth / 2) - 1) to 0
        remaingRealTimeVol += volume[i]
    label.set_xy(extend_label, time, curSeries)
    label.set_text(extend_label, price_rotation_aggregate(price_rotation, curSeries, str.tostring(sumVol+remaingRealTimeVol, format.volume)))
    label.set_textcolor(extend_label, isHighLastPoint? color.green : color.red)
    label.set_yloc(extend_label, yloc= isHighLastPoint? yloc.abovebar : yloc.belowbar)
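The labelling rule being asked for can be stated independently of Pine Script: remember the price of the previous top and the previous bottom, and compare each new pivot only with the previous pivot of the same kind. The sketch below shows that rule in Python rather than Pine Script. It is an illustration of the logic only, not a patch for the indicator; the function name `label_pivots` and the `(is_high, price)` tuple layout are assumptions made for this example, and treating an equal price as "LH" or "LL" is an arbitrary choice.

```python
def label_pivots(pivots):
    """Tag each pivot as HH/LH (tops) or HL/LL (bottoms).

    `pivots` is a chronological list of (is_high, price) tuples.
    The first top and first bottom have nothing to compare against,
    so they are tagged plain "H" and "L".
    """
    last_high = None  # price of the previous top pivot
    last_low = None   # price of the previous bottom pivot
    labels = []
    for is_high, price in pivots:
        if is_high:
            if last_high is None:
                tag = "H"
            else:
                # Equal prices fall into "LH" here (arbitrary choice).
                tag = "HH" if price > last_high else "LH"
            last_high = price
        else:
            if last_low is None:
                tag = "L"
            else:
                # Equal prices fall into "LL" here (arbitrary choice).
                tag = "HL" if price > last_low else "LL"
            last_low = price
        labels.append(tag)
    return labels


example = [(True, 10.0), (False, 5.0), (True, 12.0),
           (False, 6.0), (True, 11.0), (False, 4.0)]
print(label_pivots(example))
```

In the indicator itself, the equivalent state would presumably be two extra `var float` variables updated whenever a confirmed pivot replaces `pLast`, with the tag appended to the text built by `price_rotation_aggregate`; that placement is a suggestion, not something the original script does.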


Low of a specific period (e.g. 20 Jan 2015 to 15 Nov 2021)

This script draws the ATH (all-time high) on the chart, with its date and duration.
I want to store the low of the period as well (like the high shown in the label), so that the same label can show the (high - low) range. The period does not end today; its end is determined by the script.
indicator("Previous Year(s) ATH", overlay = true)
num_years = input.int(1, title = "Number of years back", minval = 1)
var float ATH = high
var int ATH_time = time
var float[] ATH_vals = array.new_float()
var int[] ATH_time_vals = array.new_int()
var int[] ATH_time_change_vals = array.new_int()
ATH := math.max(ATH, high)
if ta.change(ATH) != 0
    ATH_time := time
array.unshift(ATH_vals, ATH)
array.unshift(ATH_time_vals, time)
array.unshift(ATH_time_change_vals, ATH_time)
var float ATH1Y = na
if barstate.isconfirmed
    search_time = time - 31536000000 * num_years
    for i = 0 to array.size(ATH_time_vals) - 1
        if array.get(ATH_time_vals, i) < search_time
            ATH1Y := array.get(ATH_vals, i)
            ATH1Y_time = array.get(ATH_time_change_vals, i)
            y = year(ATH1Y_time)
            m = month(ATH1Y_time)
            d = dayofmonth(ATH1Y_time)
            days_ago = (time - ATH1Y_time) / 86400000
            date_text = str.tostring(y) + "/" + str.tostring(m) + "/" + str.tostring(d) + " : " + str.tostring(ATH1Y) + "\nDays Ago : " + str.tostring(math.round(days_ago, 2))
            if ATH > ATH1Y and ATH[3] <= ATH1Y[3]
                label.new(x = bar_index[3], y = ATH[3], text = date_text, style = label.style_label_lower_right)
            break
ATH_val = ATH > ATH1Y ? na : ATH1Y
buy_signal = ATH > ATH1Y and ATH[3] <= ATH1Y[3]
plotshape(buy_signal, color = color.green, location = location.belowbar, size = size.small, style = shape.triangleup)
plot(ATH_val, title = "ATH", style = plot.style_linebr)

Longest Common Substring (find the substring)

I do not want only the length of the longest common substring as my result; I also need the substring itself.
For example:
s1: abcee12345
s2: abcrd12345
The result I need is:
value: 5, 12345
I need to do this with memoization.
My code is:
def memoization(s1, s2):
    mem = {}
    def getKey(l1, l2, count):
        key = str(l1) + "|" + str(l2) + "|" + str(count)
        return key
    def findLengthLCS(mem, s1, s2, l1, l2, count):
        key = getKey(l1, l2, count)
        if l1 == len(s1) or l2 == len(s2):
            return count
        if key not in mem:
            c1 = count
            if s1[l1] == s2[l2]:
                c1 = findLengthLCS(mem, s1, s2, l1+1, l2+1, count+1)
            c2 = findLengthLCS(mem, s1, s2, l1, l2+1, 0)
            c3 = findLengthLCS(mem, s1, s2, l1+1, l2, 0)
            mem[key] = max(c1, max(c2, c3))
        return mem[key]
    def getstring(s1, s2):
        resultado = ""
        i = len(s1)
        k = len(s2)
        while k > 0 and i >= 0:
            key_i = getKey(i, k, 0)
            key_i1 = getKey(i - 1, k, 0)
            assert key_i in mem
            assert key_i1 in mem
            if mem[key_i] != mem[key_i1]:
                resultado += s1[i]
                k = k - 1
            i = i - 1
        return resultado
    value = findLengthLCS(mem, s1, s2, 0, 0, 0)
    resultadofinal = getstring(s1, s2)
    return value, resultadofinal
Try this:
# Longest Common Substring – Memoization
mem = None
def LCStr(s1, s2, ancester):
    if s1 == "" or s2 == "":
        return ancester
    if mem[len(s1)][len(s2)] != 0 and ancester == "":
        return mem[len(s1)][len(s2)]
    case1 = ""
    if s1[-1] == s2[-1]:
        ancester = s1[-1] + ancester
        case1 = LCStr(s1[:-1], s2[:-1], ancester)
    case2 = LCStr(s1, s2[:-1], "")
    case3 = LCStr(s1[:-1], s2, "")
    res = max_len(case1, case2, case3, ancester)
    mem[len(s1)][len(s2)] = res
    return res
def max_len(s1, s2, s3, s4):
    m = max(len(s1), len(s2), len(s3), len(s4))
    if len(s1) == m:
        return s1
    elif len(s2) == m:
        return s2
    elif len(s3) == m:
        return s3
    else:
        return s4
X = "dxidjfhuxxc"
Y = "uxxcfidipxc"
mem = [[0] * (len(Y)+1) for _ in range(len(X)+1)]
ans = LCStr(X, Y, "")
print(len(ans), ans)
Output:
4 uxxc
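For comparison, here is a different memoized formulation, offered as a sketch rather than as a correction of the answer above. Instead of carrying the partial match through the recursion, it memoizes the length of the common suffix ending at each pair of positions and remembers where the longest one ends, so the substring can be sliced out at the end. The names `longest_common_substring` and `common_suffix` are made up for this example.

```python
from functools import lru_cache


def longest_common_substring(s1, s2):
    """Return (length, substring) of the longest common substring."""

    @lru_cache(maxsize=None)
    def common_suffix(i, j):
        # Length of the longest common suffix of s1[:i] and s2[:j].
        if i == 0 or j == 0 or s1[i - 1] != s2[j - 1]:
            return 0
        return 1 + common_suffix(i - 1, j - 1)

    best_len, best_end = 0, 0
    # Visiting (i, j) in increasing order means common_suffix(i-1, j-1)
    # is already cached, so the recursion never goes deeper than one call.
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            k = common_suffix(i, j)
            if k > best_len:
                best_len, best_end = k, i
    return best_len, s1[best_end - best_len:best_end]


print(longest_common_substring("abcee12345", "abcrd12345"))
print(longest_common_substring("dxidjfhuxxc", "uxxcfidipxc"))
```

When several substrings share the maximum length, this version returns the one that ends earliest in `s1`.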

Scipy optimize.minimize with multiple parameters

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import linalg, optimize
%matplotlib inline
Data load
data = pd.read_csv("D:/Stat/TimeSeries/KRW_month_0617_1.csv",index_col="Date") / 100
para = open("D:/Stat/TimeSeries/KRW_month_0617_1.txt").readlines()[0:2]
data.index = pd.to_datetime(data.index)
Parameters
cond = []
params = []
time = []
for i in para:
    j = i.split()
    for k in j:
        cond.append(k)
cond = cond[1:]
for i in range(len(cond)):
    cond[i] = round(float(cond[i]),4)
params = cond[0:23]
time = cond[23:]
maturity = np.array(time[1:])
timegap = 1/cond[23]
Functions We need
def Paramcheck(Params, checkStationary = 1):
    result = 0
    Kappa = np.array([[params[20],0,0], [0,params[21],0], [0,0,params[22]]])
    Sigma = np.array([[params[1],0,0], [params[2],params[3],0], [params[4],params[5],params[6]]])
    State = np.array([params[7], params[8], params[9]])
    Lambda = params[0]
    SigmaEps = np.identity(10)
    for i in range(10):
        SigmaEps[i][i] = params[i+10]
    for i in range(len(Sigma)):
        if Sigma[i][i] < 0:
            result = 1
    for j in SigmaEps:
        # compare element-wise first, then reduce with any()
        if np.any(SigmaEps < 0):
            result = 1
    if Lambda < 0.05 or Lambda > 2:
        result = 1
    elif State[0] < 0:
        result = 1
    elif Kappa[0][0] < 0:
        result = 1
    if result == 0 and checkStationary > 0:
        if max(np.linalg.eigvals(-Kappa).real) > 0:
            result = 2
    return result
def CheckDet(x):
    # x == np.nan is always False, so NaN is tested with np.isnan
    if x == np.inf or np.isnan(x):
        result = 1
    elif x < 0:
        result = 2
    elif abs(x) < 10**-250:
        result = 3
    else:
        result = 0
    return result
def NS_factor(lambda_val, maturity):
    col1 = np.ones(len(maturity))
    col2 = (1 - np.exp(-lambda_val*maturity))/(lambda_val*maturity)
    col3 = col2 - np.exp(-lambda_val*maturity)
    factor = np.array([col1,col2,col3]).transpose()
    return factor
def DNS_Kalman_filter(Params, *args):
    N = Paramcheck(Params)
    if N == 0:
        Kappa = np.array([[params[20],0,0], [0,params[21],0], [0,0,params[22]]])
        Sigma = np.array([[params[1],0,0], [params[2],params[3],0],
                          [params[4],params[5],params[6]]])
        State = np.array([params[7], params[8], params[9]])
        Lambda = params[0]
        SigmaEps = np.identity(10)
        for i in range(10):
            SigmaEps[i][i] = params[i+10]
        Obs_Yield = args[0]
        Obs_Date = args[1]
        Timegap = args[2]
        Obs_Mty = args[3]
        Finalstate = args[4]
        Mty_length = len(Obs_Mty)
        B = NS_factor(lambda_val = Lambda,maturity = Obs_Mty)
        H_large = SigmaEps **2
        N_obs = len(Obs_Date)
        LLH_vec = np.zeros(N_obs)
        phi1 = linalg.expm(-Kappa*Timegap)
        phi0 = (np.identity(3)-phi1) @ State
        Eigenvalues = np.linalg.eig(Kappa)[0]
        Eigen_vec = np.linalg.eig(Kappa)[1]
        Eigen_vec_inv = np.linalg.inv(Eigen_vec)
        S = Eigen_vec_inv @ Sigma @ Sigma.transpose() @ Eigen_vec_inv.transpose()
        Atilde = np.dot(Sigma[0], Sigma[0])
        Btilde = np.dot(Sigma[1], Sigma[1])
        Ctilde = np.dot(Sigma[2], Sigma[2])
        Dtilde = np.dot(Sigma[0], Sigma[1])
        Etilde = np.dot(Sigma[0], Sigma[2])
        Ftilde = np.dot(Sigma[1], Sigma[2])
        res1= Atilde* Obs_Mty* Obs_Mty/6
        res2= Btilde*(1/(2*Lambda**2)
                      - (1-np.exp(-Lambda*Obs_Mty))/(Lambda**3*Obs_Mty)
                      + (1-np.exp(-2*Lambda*Obs_Mty))/(4*Lambda**3*Obs_Mty))
        res3= Ctilde*(1/(2*Lambda**2) + np.exp(-Lambda*Obs_Mty)/(Lambda**2)
                      - Obs_Mty*np.exp(-2*Lambda*Obs_Mty)/(4*Lambda)
                      - 3*np.exp(-2*Lambda*Obs_Mty)/(4*Lambda**2)
                      - 2*(1-np.exp(-Lambda*Obs_Mty))/(Lambda**3*Obs_Mty)
                      + 5*(1-np.exp(-2*Lambda*Obs_Mty))/(8*Lambda**3*Obs_Mty))
        res4= Dtilde*(Obs_Mty/(2*Lambda) + np.exp(-Lambda*Obs_Mty)/(Lambda**2)
                      - (1-np.exp(-Lambda*Obs_Mty))/(Lambda**3*Obs_Mty))
        res5= Etilde*(3*np.exp(-Lambda*Obs_Mty)/(Lambda**2) + Obs_Mty/(2*Lambda)
                      + Obs_Mty*np.exp(-Lambda*Obs_Mty)/(Lambda)
                      - 3*(1-np.exp(-Lambda*Obs_Mty))/(Lambda**3*Obs_Mty))
        res6= Ftilde*(1/(Lambda**2) + np.exp(-Lambda*Obs_Mty)/(Lambda**2)
                      - np.exp(-2*Lambda*Obs_Mty)/(2*Lambda**2)
                      - 3*(1-np.exp(-Lambda*Obs_Mty))/(Lambda**3*Obs_Mty)
                      + 3*(1-np.exp(-2*Lambda*Obs_Mty))/(4*Lambda**3*Obs_Mty))
        val = res1 + res2 + res3 + res4 + res5 + res6
        V_mat = np.zeros([3,3])
        V_lim = np.zeros([3,3])
        for i in range(3):
            for j in range(3):
                V_mat[i][j] = S[i][j]*(1-np.exp(-(Eigenvalues[i] + Eigenvalues[j])*Timegap))/(Eigenvalues[i] + Eigenvalues[j])
                V_lim[i][j] = S[i][j]/(Eigenvalues[i] + Eigenvalues[j])
        Q = (Eigen_vec @ V_mat @ Eigen_vec.transpose()).real
        Sigma_lim = (Eigen_vec @ V_lim @ Eigen_vec.transpose()).real
        for i in range(N_obs):
            y = Obs_Yield[i]
            xhat = phi0 + phi1 @ State
            y_implied = B @ xhat
            v = y - y_implied + val
            Sigmahat = phi1 @ Sigma_lim @ phi1.transpose() + Q
            F = B @ Sigmahat @ B.transpose() + H_large
            detF = np.linalg.det(F)
            if CheckDet(detF) > 0:
                N = 3
                break
            Finv = np.linalg.inv(F)
            State = xhat + Sigmahat @ B.transpose() @ Finv @ v
            Sigma_lim = Sigmahat - Sigmahat @ B.transpose() @ Finv @ B @ Sigmahat
            LLH_vec[i] = np.log(detF) + v.transpose() @ Finv @ v
    if N == 0:
        if Finalstate:
            yDate = Obs_Date[-1]
            result = np.array([yDate,State])
        else:
            result = 0.5 * (sum(LLH_vec) + Mty_length*N_obs*np.log(2*np.pi))
    else:
        result = 7000000
    return result
I wrote code that implements the arbitrage-free Nelson-Siegel model. The data are bond return rates (1Y, 1.5Y, ..., 20Y). I want to optimize that function with scipy's optimize.minimize, keeping *args fixed.
Assume the initial params have already been verified to be close to the optimized params, based on empirical experiments with the Dynamic Nelson-Siegel model.
LLC_new = 0
while True:
    LLC_old = LLC_new
    OPT = optimize.minimize(x0=params,fun=DNS_Kalman_filter, args=
                            (data.values,data.index,timegap,maturity,0))
    params = OPT.x
    LLC_new = round(OPT.fun,5)
    print("Current LLC: %0.5f" %LLC_new)
    if LLC_old == LLC_new:
        OPT_para = params
        FinalState = DNS_Kalman_filter(params,data.values,data.index,timegap,maturity,True)
        break
Result is
Current LLC: -7613.70146
Current LLC: -7613.70146
The LLC (log-likelihood value) isn't being maximized; it stays the same from one pass to the next. This is not the result I want from the optimizer.
Is there a solution for this?
In R, the optim() function works much like scipy.optimize.minimize() and works really well. I also have R code for this that is very similar to this Python code.
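One thing worth checking in the code as posted, offered as an observation rather than a confirmed diagnosis: both `Paramcheck(Params, ...)` and `DNS_Kalman_filter(Params, *args)` take the trial vector as `Params` (capital P) but read every entry from the module-level `params` list. If that is what actually runs, the objective returns the same value for every trial vector, and a minimizer that estimates the gradient by finite differences (which, as far as I know, is what `optimize.minimize` does by default when no `jac` is supplied) would see a zero gradient and stop at `x0`. The toy below reproduces that pattern without SciPy; the quadratic objective and the helper `finite_difference_gradient` are made up for illustration.

```python
params = [1.0, 2.0]  # module-level starting values (hypothetical)


def objective_ignoring_argument(Params):
    # Same pattern as the posted code: the argument is `Params`,
    # but the body reads the global `params`.
    return (params[0] - 3.0) ** 2 + (params[1] + 1.0) ** 2


def objective_using_argument(Params):
    # Reads the trial vector that the optimizer passes in.
    return (Params[0] - 3.0) ** 2 + (Params[1] + 1.0) ** 2


def finite_difference_gradient(fun, x, h=1e-6):
    """Forward-difference gradient estimate of fun at x."""
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h
        grad.append((fun(shifted) - fun(x)) / h)
    return grad


flat = finite_difference_gradient(objective_ignoring_argument, params)
sloped = finite_difference_gradient(objective_using_argument, params)
print("ignoring argument:", flat)    # exactly zero in every direction
print("using argument:   ", sloped)  # non-zero, so a step can be taken
```

If this is the cause, replacing `params[...]` with `Params[...]` inside both functions would let the optimizer see the effect of its trial values.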

python3, difflib SequenceMatcher

The following code takes in two strings, compares them, and returns both their identical parts and their differences, separated by spaces (maintaining the length of the longest string).
The commented lines in the code are the 4 strings that should be returned.
from difflib import SequenceMatcher
t1 = 'betty: backstreetvboysareback"give.jpg"LAlarrygarryhannyhref="ang"_self'
t2 = 'bettyv: backstreetvboysareback"lifeislike"LAlarrygarryhannyhref="in.php"_self'
#t1 = 'betty : backstreetvboysareback" i e "LAlarrygarryhannyhref=" n "_self'
#t2 = 'betty : backstreetvboysareback" i e "LAlarrygarryhannyhref=" n "_self'
#o1 = ' g v .jpg g '
#o2 = ' v l f islike i .php '
matcher = SequenceMatcher(None, t1, t2)
blocks = matcher.get_matching_blocks()
bla1 = []
bla2 = []
for i in range(len(blocks)):
    if i != len(blocks)-1:
        bla1.append([t1[blocks[i].a + blocks[i].size:blocks[i+1].a], blocks[i].a + blocks[i].size, blocks[i+1].a])
        bla2.append([t2[blocks[i].b + blocks[i].size:blocks[i+1].b], blocks[i].b + blocks[i].size, blocks[i+1].b])
cnt = 0
for i in range(len(bla1)):
    if bla1[i][1] < bla2[i][1]:
        num = bla2[i][1] - bla1[i][1]
        t2 = t2[0:bla2[i][1]] + ' '*num + t2[bla2[i][1]:len(t2)]
        bla2[i][0] = ' '*num + bla2[i][0]
        bla2[i][1] = bla1[i][1]
    if bla2[i][1] < bla1[i][1]:
        num = bla1[i][1] - bla2[i][1]
        t1 = t1[0:bla1[i][1]] + ' '*num + t1[bla1[i][1]:len(t1)]
        bla1[i][0] = ' '*num + bla1[i][0]
        bla1[i][1] = bla2[i][1]
    if bla1[i][2] > bla2[i][2]:
        num = bla1[i][2] - bla2[i][2]
        t2 = t2[0:bla2[i][2]] + ' '*num + t2[bla2[i][2]:len(t2)]
        bla2[i][0] = bla2[i][0] + ' '*num
        bla2[i][2] = bla1[i][2]
    if bla2[i][2] > bla1[i][2]:
        num = bla2[i][2] - bla1[i][2]
        t1 = t1[0:bla1[i][2]] + ' '*num + t1[bla1[i][2]:len(t1)]
        bla1[i][0] = bla1[i][0] + ' '*num
        bla1[i][2] = bla2[i][2]
t11 = []
t11 = t1[0:bla1[0][1]]
t11 += t1[bla1[0][2]:bla1[1][1]]
t11 += t1[bla1[1][2]:bla1[2][1]]
t11 += t1[bla1[2][2]:bla1[3][1]]
t11 += t1[bla1[3][2]:bla1[4][1]]
t11 += t1[bla1[5][2]:bla1[6][1]]
t11 += t1[bla1[6][2]:len(t1)]
t12 = []
t12 = t2[0:bla1[0][1]]
t12 += t2[bla1[0][2]:bla1[1][1]]
t12 += t2[bla1[1][2]:bla1[2][1]]
t12 += t2[bla1[2][2]:bla1[3][1]]
t12 += t2[bla1[3][2]:bla1[4][1]]
t12 += t2[bla1[5][2]:bla1[6][1]]
t12 += t2[bla1[6][2]:len(t2)]
The blocks are first arranged into an organised format, bla1 and bla2, where each difference is stored as a string with its start and end positions, e.g. ['v', 33, 34], separately for each string. After this, I attempt to insert spaces to match the lengths and offsets required, and this is where the code starts to break.
Could someone please take a look?
I have worked on resolving this, and since no one has posted a response I will post my progress and solution. The following code is work in progress: it worked well on variations with small offsets, but began to break on larger differences, specifically in maintaining the spacing (offset) needed to line the two strings up.
from difflib import SequenceMatcher
import pdb
t1 = 'betty: backstreetvboysareback"give.jpg"LAlarrygarryhannyhref="ang"_self'
t2 = 'betty: backstreetvboysareback"lol.jpg"LAlarrygarryhannyhref="ang"_self'
#t2 = 'bettyv: backstreetvboysareback"lifeislike"LAlarrygarryhannyhref="in.php"_selff'
#t2 = 'LA'
#t2 = 'c give.'
#t2 = 'give.'
#t1 = 'betty : backstreetvboysareback" i e "LAlarrygarryhannyhref=" n "_self'
#t2 = 'betty : backstreetvboysareback" i e "LAlarrygarryhannyhref=" n "_self'
#o1 = ' g v .jpg g '
#o2 = ' v l f islike i .php '
matcher = SequenceMatcher(None, t1, t2)
blocks = matcher.get_matching_blocks()
#print(len(blocks))
bla1 = []
bla2 = []
#bla = (string), (first pos), (second pos), (pos1 + pos2), (pos + pos2 total positions added together)
dnt = False
for i in range(len(blocks)):
if i == 0:
if blocks[i].a != 0 and dnt == False:
bla1.append([t1[blocks[i].a:blocks[i].b], 0, blocks[i].a, 0, 0])
bla2.append([t2[blocks[i].a:blocks[i].b], 0, blocks[i].b, 0, 0])
dnt = True
if blocks[i].b != 0 and dnt == False:
bla2.append([t2[blocks[i].a:blocks[i].b], 0, blocks[i].b, 0, 0])
bla1.append([t1[blocks[i].a:blocks[i].b], 0, blocks[i].a, 0, 0])
dnt = True
if i != len(blocks)-1:
print(blocks[i])
bla1.append([t1[blocks[i].a + blocks[i].size:blocks[i+1].a], blocks[i].a + blocks[i].size, blocks[i+1].a, 0, 0])
bla2.append([t2[blocks[i].b + blocks[i].size:blocks[i+1].b], blocks[i].b + blocks[i].size, blocks[i+1].b, 0, 0])
#pdb.set_trace()
ttl = 0
for i in range(len(bla1)):
cnt = bla1[i][2] - bla1[i][1]
if cnt != 0:
bla1[i][3] = cnt
ttl = ttl + cnt
bla1[i][4] = ttl
ttl = 0
for i in range(len(bla2)):
cnt = bla2[i][2] - bla2[i][1]
if cnt != 0:
bla2[i][3] = cnt
ttl = ttl + cnt
bla2[i][4] = ttl
print(bla1)
print(bla2)
tt1 = ''
dif = 0
i = 0
while True:
if i == 0:
if bla1[i][3] >= bla2[i][3]: dif = bla1[i][3]
if bla1[i][3] < bla2[i][3]: dif = bla2[i][3]
tt1 += t1[:bla1[i][1]] + '_'*dif
if i <= len(bla1) -1:
if bla1[i][3] >= bla2[i][3]: dif = bla1[i][3]
if bla1[i][3] < bla2[i][3]: dif = bla2[i][3]
if len(bla1) != 1:
if i == 0: tt1 += t1[bla1[i][1] + bla1[i][3]:bla1[i+1][1]]
if i != 0 and i != len(bla1)-1: tt1 += '_'*dif + t1[bla1[i][1] + bla1[i][3]:bla1[i+1][1]]
if i == len(bla1)-1: tt1 += '_'*dif + t1[bla1[i][1] + bla1[i][3]:len(t1)]
i = i+1
print('t1 = ' + tt1)
else:
break
tt2 = ''
i = 0
dif = 0
while True:
if i == 0:
if bla1[i][3] >= bla2[i][3]: dif = bla1[i][3]
if bla1[i][3] < bla2[i][3]: dif = bla2[i][3]
tt2 += t2[:bla2[i][1]] + '_'*dif
if i <= len(bla2) -1:
if bla1[i][3] >= bla2[i][3]: dif = bla1[i][3]
if bla1[i][3] < bla2[i][3]: dif = bla2[i][3]
if len(bla2) != 1:
if i == 0: tt2 += t2[bla2[i][1] + bla2[i][3]:bla2[i+1][1]]
if i != 0 and i != len(bla1)-1: tt2 += '_'*dif + t2[bla2[i][1] + bla2[i][3]:bla2[i+1][1]]
if i == len(bla2)-1: tt2 += '_'*dif + t2[bla2[i][1] + bla2[i][3]:len(t2)]
i = i+1
print('t2 = ' + tt2)
else:
break
print()
Solution:
Unfortunately I have been too busy to continue coding this and have resorted to sub-processing diffutils ... this is a wonderful alternative to a lot of painstaking coding!
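For anyone who wants to stay inside difflib, one possible approach (a sketch, not the code the poster ended up using) is to walk `SequenceMatcher.get_opcodes()` instead of `get_matching_blocks()`. Each opcode already pairs a span of the first string with a span of the second, so padding every non-equal span to the wider of the two keeps all outputs the same length. The function name `split_common_and_diffs` is made up for this example, and the exact alignment depends on the matches SequenceMatcher finds, so the result is not guaranteed to reproduce the commented strings in the question character for character.

```python
from difflib import SequenceMatcher


def split_common_and_diffs(t1, t2):
    """Return (common, only1, only2), three strings of equal length.

    `common` holds the shared text with spaces where the inputs differ;
    `only1` and `only2` hold each input's differing text with spaces
    where the inputs agree.
    """
    common, only1, only2 = [], [], []
    matcher = SequenceMatcher(None, t1, t2)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            width = i2 - i1
            common.append(t1[i1:i2])
            only1.append(" " * width)
            only2.append(" " * width)
        else:
            # 'replace', 'delete' or 'insert': pad to the wider span
            width = max(i2 - i1, j2 - j1)
            common.append(" " * width)
            only1.append(t1[i1:i2].ljust(width))
            only2.append(t2[j1:j2].ljust(width))
    return "".join(common), "".join(only1), "".join(only2)


common, only1, only2 = split_common_and_diffs("abcXdef", "abcYYdef")
print(repr(common))
print(repr(only1))
print(repr(only2))
```

Because the shared text is identical in both inputs, a single `common` string stands in for the two identical strings listed in the question.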

Extrapolation -- awk based

I need help with the following. I have a data file, data.dat, whose columns are separated by tabs ("\t"), like this:
# y1 y2 y3 y4
17.1685 21.6875 20.2393 26.3158
These are x values of 4 points for a linear fit. The four y values are constant: 0, 200, 400, 600.
I can create a linear fit of the point pairs (x,y): (x1,y1)=(17.1685,0), (x2,y2)=(21.6875,200), (x3,y3)=(20.2393,400), (x4,y4)=(26.3158,600).
Now I would like to make a linear fit on three of these point pairs at a time: (x1,y1), (x2,y2), (x3,y3); then (x2,y2), (x3,y3), (x4,y4); then (x1,y1), (x3,y3), (x4,y4); and finally (x1,y1), (x2,y2), (x4,y4).
For each of these three-point fits, I would like to know the x value that the fitted line gives for the remaining point, i.e. the one left out of the fit.
So far I have this awk code:
#!/usr/bin/awk -f
BEGIN{
    z[1] = 0;
    z[2] = 200;
    z[3] = 400;
    z[4] = 600;
}
{
    split($0,str,"\t");
    n = 0.0;
    for(i=1; i<=NF; i++)
    {
        centr[i] = str[i];
        n += 1.0;
        # printf("%d\t%f\t%.1f\t",i,centr[i],z[i]);
    }
    # print "";
    if (n > 2)
    {
        lsq(n,z,centr);
    }
}
function lsq(n,x,y)
{
    sx = 0.0
    sy = 0.0
    sxx = 0.0
    syy = 0.0
    sxy = 0.0
    eps = 0.0
    for (i=1;i<=n;i++)
    {
        sx += x[i]
        sy += y[i]
        sxx += x[i]*x[i]
        sxy += x[i]*y[i]
        syy += y[i]*y[i]
    }
    if ( (n==0) || ((n*sxx-sx*sx)==0) )
    {
        next;
    }
    # print "number of data points = " n;
    a = (sxx*sy-sxy*sx)/(n*sxx-sx*sx)
    b = (n*sxy-sx*sy)/(n*sxx-sx*sx)
    for(i=1;i<=n;i++)
    {
        ycalc[i] = a+b*x[i]
        dy[i] = y[i]-ycalc[i]
        eps += dy[i]*dy[i]
    }
    print "# Intercept =\t" a
    print "# Slope =\t" b
    for (i=1;i<=n;i++)
    {
        printf("%8g %8g %8g \n",x[i],y[i],ycalc[i])
    }
} # function lsq()
So,
If we extrapolate to the place of the 4th point:
0 17.1685 <--(x1,y1)
200 21.6875 <--(x2,y2)
400 20.2393 <--(x3,y3)
600 22.7692 <<< (x4 = 600,y4 = 22.7692)
If we extrapolate to the place of the 3rd point:
0 17.1685 <--(x1,y1)
200 21.6875 <--(x2,y2)
400 23.6867 <<< (x3 = 400,y3 = 23.6867)
600 26.3158 <--(x4,y4)
0 17.1685
200 19.35266 <<<
400 20.2393
600 26.3158
0 18.1192 <<<
200 21.6875
400 20.2393
600 26.3158
My current output is the following:
$> ./prog.awk data.dat
# Intercept = 17.4537
# Slope = 0.0129968
0 17.1685 17.4537
200 21.6875 20.0531
400 20.2393 22.6525
600 26.3158 25.2518
Assuming the core calculation in the lsq function is OK (it looks about right, but I haven't scrutinized it), then that gives you the slope and intercept for the least sum of squares line of best fit for the input data set (parameters x, y, n). I'm not sure I understand the tail end of the function.
For your 'take three points and calculate the fourth' problem, the simplest way is to generate the 4 subsets (logically, by deleting one point from the set of four on each of four calls), and redo the calculation.
You need to call another function that takes the line data (slope, intercept) from lsq and interpolates (extrapolates) the value at another y value. That's a straight-forward calculation (x = m * y + c), but you need to determine which y value is missing from the set of 3 you pass in.
You could 'optimize' (meaning 'complicate') this scheme by dropping one value at a time from the 'sums of squares' and 'sums' and 'sum of products' values, recalculating the slope, intercept, and then calculating the missing point again.
(I'll also observe that normally it would be the x-coordinates with the fixed values 0, 200, 400, 600 and the y-coordinates would be the values read. However, that's just a matter of orientation, so it is not crucial.)
Here's at least plausibly working code. Since awk automatically splits on white space, there's no need for you to split on tabs specifically; the read loop takes this into account.
The code needs serious refactoring; there is a ton of repetition in it - however, I also have a job that I'm supposed to do.
#!/usr/bin/awk -f
BEGIN{
    z[1] = 0;
    z[2] = 200;
    z[3] = 400;
    z[4] = 600;
}
{
    for (i = 1; i <= NF; i++)
    {
        centr[i] = $i
    }
    if (NF > 2)
    {
        lsq(NF, z, centr);
    }
}
function lsq(n, x, y)
{
    if (n == 0) return
    sx = 0.0
    sy = 0.0
    sxx = 0.0
    syy = 0.0
    sxy = 0.0
    for (i = 1; i <= n; i++)
    {
        print "x[" i "] = " x[i] ", y[" i "] = " y[i]
        sx += x[i]
        sy += y[i]
        sxx += x[i]*x[i]
        sxy += x[i]*y[i]
        syy += y[i]*y[i]
    }
    if ((n*sxx - sx*sx) == 0) return
    # print "number of data points = " n;
    a = (sxx*sy-sxy*sx)/(n*sxx-sx*sx)
    b = (n*sxy-sx*sy)/(n*sxx-sx*sx)
    for (i = 1; i <= n; i++)
    {
        ycalc[i] = a+b*x[i]
    }
    print "# Intercept = " a
    print "# Slope = " b
    print "Line: x = " a " + " b " * y"
    for (i = 1; i <= n; i++)
    {
        printf("x = %8g, yo = %8g, yc = %8g\n", x[i], y[i], ycalc[i])
    }
    print ""
    print "Different subsets\n"
    for (drop = 1; drop <= n; drop++)
    {
        print "Subset " drop
        sx = sy = sxx = sxy = syy = 0
        j = 1
        for (i = 1; i <= n; i++)
        {
            if (i == drop) continue
            print "x[" j "] = " x[i] ", y[" j "] = " y[i]
            sx += x[i]
            sy += y[i]
            sxx += x[i]*x[i]
            sxy += x[i]*y[i]
            syy += y[i]*y[i]
            j++
        }
        if (((n-1)*sxx - sx*sx) == 0) continue
        a = (sxx*sy-sxy*sx)/((n-1)*sxx-sx*sx)
        b = ((n-1)*sxy-sx*sy)/((n-1)*sxx-sx*sx)
        print "Line: x = " a " + " b " * y"
        xt = x[drop]
        yt = a + b * xt;
        print "Interpolate: x = " xt ", y = " yt
    }
}
Since awk doesn't provide an easy way to pass back multiple values from a function, nor does it provide structures other than arrays (sometimes associative), it is not perhaps the best language for this task. On the other hand, it can be made to do the job. You might be able to bundle the Least Squares calculation in a function that returns an array containing the slope and intercept, and then use that. Your turn to explore options.
Given the script lsq.awk and the input file lsq.data shown, I get the output shown:
$ cat lsq.data
17.1685 21.6875 20.2393 26.3158
$ awk -f lsq.awk lsq.data
x[1] = 0, y[1] = 17.1685
x[2] = 200, y[2] = 21.6875
x[3] = 400, y[3] = 20.2393
x[4] = 600, y[4] = 26.3158
# Intercept = 17.4537
# Slope = 0.0129968
Line: x = 17.4537 + 0.0129968 * y
x = 0, yo = 17.1685, yc = 17.4537
x = 200, yo = 21.6875, yc = 20.0531
x = 400, yo = 20.2393, yc = 22.6525
x = 600, yo = 26.3158, yc = 25.2518
Different subsets
Subset 1
x[1] = 200, y[1] = 21.6875
x[2] = 400, y[2] = 20.2393
x[3] = 600, y[3] = 26.3158
Line: x = 18.1192 + 0.0115708 * y
Interpolate: x = 0, y = 18.1192
Subset 2
x[1] = 0, y[1] = 17.1685
x[2] = 400, y[2] = 20.2393
x[3] = 600, y[3] = 26.3158
Line: x = 16.5198 + 0.0141643 * y
Interpolate: x = 200, y = 19.3526
Subset 3
x[1] = 0, y[1] = 17.1685
x[2] = 200, y[2] = 21.6875
x[3] = 600, y[3] = 26.3158
Line: x = 17.7985 + 0.0147205 * y
Interpolate: x = 400, y = 23.6867
Subset 4
x[1] = 0, y[1] = 17.1685
x[2] = 200, y[2] = 21.6875
x[3] = 400, y[3] = 20.2393
Line: x = 18.163 + 0.007677 * y
Interpolate: x = 600, y = 22.7692
$
Edit: In the previous version of the answer, the subsets were multiplying by n instead of (n-1). The values in the revised output seem to agree with what you expect. The residual issues are presentational, not computational.
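As a cross-check of the numbers above, the same leave-one-out calculation can be sketched in Python, where returning the intercept and slope together from a function is straightforward. This is only an illustration of the arithmetic, not a replacement for the awk script; the names `fit_line` and `leave_one_out` are made up here, and the fixed values 0, 200, 400, 600 are treated as the independent variable, as in the awk call `lsq(NF, z, centr)`.

```python
def fit_line(xs, ys):
    """Least-squares line ys = intercept + slope * xs."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("degenerate data: all x values are equal")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope


def leave_one_out(xs, ys):
    """For each point, fit the other points and predict the dropped one."""
    predictions = []
    for drop in range(len(xs)):
        kept_x = xs[:drop] + xs[drop + 1:]
        kept_y = ys[:drop] + ys[drop + 1:]
        intercept, slope = fit_line(kept_x, kept_y)
        predictions.append(intercept + slope * xs[drop])
    return predictions


fixed = [0.0, 200.0, 400.0, 600.0]             # the constant values
measured = [17.1685, 21.6875, 20.2393, 26.3158]  # the row from the data file
for x, p in zip(fixed, leave_one_out(fixed, measured)):
    print(f"dropped point at {x:g}: predicted {p:.4f}")
```

The four predictions should agree with the "Interpolate" lines in the awk output (18.1192, 19.3526, 23.6867, 22.7692) to the printed precision.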
