Caret does not end with error or success and continues to run

I'm a beginner and I'm trying to use caret with method = "enet".
The code is:
training = train(GGG, aaa, method = "enet")
PREDICT = predict(training, newdata = PPP, type = "raw")
or type = "prob".
Here "GGG" and "PPP" are matrices and "aaa" is a vector.
The call does not end with an error or with success and just continues to run.
Can someone help me understand the reasons?
Also, I would like to ask which other machine learning methods I could use. The aforementioned "GGG" and "PPP" are matrices of beta values (numbers between 0 and 1) from the Illumina 450k array, and "aaa" is a vector containing the ages of the subjects.
Thank you in advance.

Related

Python error: TypeError: float() argument must be a string or a real number, not 'list'

Hello, I am new to programming and taking online courses at the moment. I am stuck on this specific exercise:
I have a sample.txt that contains a bunch of text with numbers all throughout the text. My goal is to parse the lines for the numbers only and then find the sum.
Here is the sample.txt:
Why should you learn to write programs? 7746
12 1929 8827
Writing programs (or programming) is a very creative
7 and rewarding activity. You can write programs for
many reasons, ranging from making your living to solving
8837 a difficult data analysis problem to having fun to helping 128
someone else solve a problem. This book assumes that
everyone needs to know how to program ...
Here is my code:
import re
name = input('Enter File: ')
if len(name) < 1 :
    name = 'sample.txt'
handle = open(name)
numlist = list()
for line in handle :
    line = line.rstrip()
    items = re.findall('[0-9]+', line)
    if len(items) > 0 :
        #print(items)
        num = float(items[0: ])
        numlist.append(num)
print(sum(numlist))
You are trying to convert a list to a float.
num = float(items[0:])
# note, the [0:] is redundant, it will just return the whole list
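For example, just to show where the TypeError comes from (using numbers from your sample file):
items = ['7746', '12']   # what re.findall returns for one line
float(items)             # TypeError: float() argument must be a string or a real number, not 'list'
float(items[0])          # 7746.0 -- converting a single element works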
You can either convert the whole list at once using map:
nums = list(map(float, items))
Or you can convert them one at a time like this:
nums = []
for item in items:
    nums.append(float(item))
I think the better way is to convert the whole list at once using the map function:
https://www.geeksforgeeks.org/python-map-function/
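Putting it all together, a corrected version of your loop could look like this (just a sketch, keeping your variable names and the same sample.txt default):

import re

name = input('Enter File: ')
if len(name) < 1:
    name = 'sample.txt'

numlist = list()
with open(name) as handle:
    for line in handle:
        # every run of digits on the line, e.g. ['12', '1929', '8827']
        items = re.findall('[0-9]+', line)
        # convert each matched string to a float and collect it
        numlist.extend(map(float, items))

print(sum(numlist))

For the numbers visible in your sample above, this prints 27486.0.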

Setting seeds in multi-threading loop in Julia

I want to generate random numbers in Julia using multi-threading. I am using the
Threads.@threads macro to accomplish it. However, I am struggling to fix the seeds so that I obtain the same result every time I run the code. Here is my attempt:
Random.seed!(1234)
a = [Float64[] for _ in 1:10]
Threads.@threads for i = 1:10
    push!(a[Threads.threadid()], rand())
end
sum(reduce(vcat, a))
The script above delivers different results every time I run it. By contrast, I get the same results if I use a plain for loop:
Random.seed!(12445)
b = []
for i = 1:10
    push!(b, rand())
end
sum(b)
I have the impression that the solution to this issue must be easy. Still, I couldn't find it. Any help is much appreciated.
Thank you.
You need to generate a separate random stream for each thread.
The simplest way is to have a random number generator with a different seed:
using Random
rngs = [MersenneTwister(i) for i in 1:Threads.nthreads()];
Threads.@threads for i = 1:10
    val = rand(rngs[Threads.threadid()])
    # do something with val
end
If you do not want to risk correlations between different random number seeds, you can instead jump ahead within a single random number generator:
julia> using Future

julia> rngs2 = Future.randjump.(Ref(MersenneTwister(0)), big(10)^20 .* (1:Threads.nthreads()))
4-element Vector{MersenneTwister}:
MersenneTwister(0, (200000000000000000000, 0))
MersenneTwister(0, (400000000000000000000, 0))
MersenneTwister(0, (600000000000000000000, 0))
MersenneTwister(0, (800000000000000000000, 0))
Ciao Fabrizio. In BetaML I solved this problem with:
"""
generateParallelRngs(rng::AbstractRNG, n::Integer;reSeed=false)
For multi-threaded models, return n independent random number generators (one per thread) to be used in threaded computations.
Note that each ring is a _copy_ of the original random ring. This means that code that _use_ these RNGs will not change the original RNG state.
Use it with `rngs = generateParallelRngs(rng,Threads.nthreads())` to have a separate rng per thread.
By default the function doesn't re-seed the RNG, as you may want to have a loop index based re-seeding strategy rather than a threadid-based one (to guarantee the same result independently of the number of threads).
If you prefer, you can instead re-seed the RNG here (using the parameter `reSeed=true`), such that each thread has a different seed. Be aware however that the stream of number generated will depend from the number of threads at run time.
"""
function generateParallelRngs(rng::AbstractRNG, n::Integer; reSeed=false)
    if reSeed
        seeds = [rand(rng, 100:18446744073709551615) for i in 1:n] # some RNGs have issues with too small seed
        rngs  = [deepcopy(rng) for i in 1:n]
        return Random.seed!.(rngs, seeds)
    else
        return [deepcopy(rng) for i in 1:n]
    end
end
The function above delivers the same results independently of the number of threads used in Julia and can then be used, for example, like this:
using Test, Random, Statistics

TESTRNG = MersenneTwister(123)

println("** Testing generateParallelRngs()...")
x = rand(copy(TESTRNG), 100)

function innerFunction(bootstrappedx; rng=Random.GLOBAL_RNG)
    sum(bootstrappedx .* rand(rng) ./ 0.5)
end

function outerFunction(x; rng=Random.GLOBAL_RNG)
    masterSeed = rand(rng, 100:9999999999999) # important: with some RNGs it is important to do this before generateParallelRngs to guarantee independence from the number of threads
    rngs = generateParallelRngs(rng, Threads.nthreads()) # make new copy instances
    results = Array{Float64,1}(undef, 30)
    Threads.@threads for i in 1:30
        tsrng = rngs[Threads.threadid()] # thread-safe random number generator: one RNG per thread
        Random.seed!(tsrng, masterSeed + i*10) # but the seeding depends on the i of the loop, not on the thread: we get the same results independently of the number of threads
        toSample = rand(tsrng, 1:100, 100)
        bootstrappedx = x[toSample]
        innerResult = innerFunction(bootstrappedx, rng=tsrng)
        results[i] = innerResult
    end
    overallResult = mean(results)
    return overallResult
end

# Different sequences..
@test outerFunction(x) != outerFunction(x)

# Different values, but same sequence
mainRng = copy(TESTRNG)
a = outerFunction(x, rng=mainRng)
b = outerFunction(x, rng=mainRng)

mainRng = copy(TESTRNG)
A = outerFunction(x, rng=mainRng)
B = outerFunction(x, rng=mainRng)

@test a != b && a == A && b == B

# Same value at each call
a = outerFunction(x, rng=copy(TESTRNG))
b = outerFunction(x, rng=copy(TESTRNG))
@test a == b
Assuming you are on Julia 1.6 you can do e.g. the following:
julia> using Random
julia> foreach(i -> Random.seed!(Random.default_rng(i), i), 1:Threads.nthreads())
The point is that Julia currently already has a separate random number generator per thread, so you do not need to generate your own (of course you could do it as in the other answers, but you do not have to).
Also note that in future versions of Julia the:
Threads.@threads for i = 1:10
    push!(a[Threads.threadid()], rand())
end
part is not guaranteed to produce reproducible results. In Julia 1.6 Threads.@threads uses static scheduling, but as you can read in its docstring this is subject to change.

Shortest path through a set of points

I have a set of points (represented by complex values), and I need to find the shortest path through them. It looks a bit like the travelling salesman problem, but I can't seem to find (or understand) a solution that isn't O(n!). I know how to compute short-enough solutions in O(n^3) or O(n²), but I wanted to know if it is possible to get THE best one. Thank you!
Here is the code I use for a "short enough" path:
def insert(x, liste, taille):
    # Insert point x into the tour "liste" at the position that adds the least length.
    max_add = 10**9
    n = len(liste) - 1
    for i in range(n):
        # extra length if x is inserted between liste[i] and liste[i+1]
        test = abs(liste[i] - x) + abs(liste[i+1] - x) - taille[i]
        if test < max_add:
            max_add = test
            i_max = i
    taille[i_max] = abs(liste[i_max] - x)
    taille.insert(i_max + 1, abs(liste[i_max + 1] - x))
    liste.insert(i_max + 1, x)

def sort(x, i=0):
    # Build a closed tour starting (and ending) at x[i] by cheapest insertion.
    taille = [0]
    tri = [x[i]] * 2
    for y in x[:i] + x[i+1:]:
        insert(y, tri, taille)
    return tri, taille

def the_best(liste):
    # Try every starting point and keep the shortest tour found.
    n = len(liste)
    shortest = 10**9
    for i in range(n):
        a, b = sort(liste, i)
        if sum(b) < shortest:
            shortest = sum(b)
            back = a, b
    return back
Of course the "the_best" function is O(n^3), so I usually use the "sort" function only.
The list called "taille" is built like this:
taille[i] = abs(liste[i] - liste[i+1])
where liste[-1] = liste[0], i.e. the path is treated as a closed loop.
From what I understand of your description, this is indeed the TSP problem. It is a well-known NP-hard problem, and as such no efficient algorithm for it is known (one may exist, but we haven't found it yet). It's one of the famous open problems in Computer Science.
Indeed, do give it a try to solve it, but do not hold your breath :)
General reading: https://en.wikipedia.org/wiki/Travelling_salesman_problem
You may also want to give a quick read to: https://en.wikipedia.org/wiki/P_versus_NP_problem
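For small inputs (say, fewer than about ten points) you can still get THE best path by simply trying every ordering, at the full O(n!) cost. A minimal sketch of that idea (my own illustration, assuming, as in your code, that the points are complex numbers and the path is a closed loop):

from itertools import permutations

def exact_shortest_cycle(points):
    """Brute-force exact solution: try every ordering of the points.

    Only feasible for small inputs, since there are (n-1)! orderings.
    points is a list of complex numbers; the tour returns to its start.
    """
    start, rest = points[0], points[1:]
    best_tour, best_length = None, float('inf')
    for perm in permutations(rest):
        tour = [start, *perm, start]                        # closed loop
        length = sum(abs(a - b) for a, b in zip(tour, tour[1:]))
        if length < best_length:
            best_tour, best_length = tour, length
    return best_tour, best_length

# Example: the four corners of the unit square in the complex plane
tour, length = exact_shortest_cycle([0 + 0j, 1 + 0j, 1 + 1j, 0 + 1j])
print(tour, length)   # an optimal square tour of length 4.0

Beyond roughly ten points this becomes impractical, which is why heuristics like your insertion approach (or exact dynamic programming such as Held-Karp, in O(n²·2ⁿ)) are used instead.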

How to create identical variables in MATLAB from an array of variable names?

I have the following code in Matlab:
a = zeros(23,1)
b = zeros(23,1)
c = zeros(23,1)
How can I write it more compactly? I was looking for a solution that is something like this:
str = {'a','b','c'}
for i = str{i}
    i = zeros(23,1)
end
But I can't find a way to do it properly without an error message. Can someone help please?
Here is a compact way using deal:
[a, b, c] = deal(zeros(23,1));
You can also use a struct if the variable name is important:
str = {'a','b','c'};
data = struct;
for ii = 1:numel(str)
    data.(str{ii}) = zeros(23,1);
end
The struct is more efficient than the table. You can now address data.a, data.b, etc.
But if the name is not useful, it's best to put your data into a cell array:
N = 3;
data = cell(N,1);
for ii = 1:N
    data{ii} = zeros(23,1);
end
or simply:
data = cell(3,1);
[data{:}] = deal(zeros(23,1));
Now you address your arrays as data{1}, data{2}, etc., and they're always easy to address in loops.
What you're tempted to do is very bad practice, but can be done like this:
str = {'a','b','c'};
for ii = 1:numel(str)
    eval( [str{ii} ' = zeros(23,1)'] );
end
Why is this bad practice?
Your code's legibility has just gone way down; you can't clearly see where variables are declared.
eval should be avoided.
You could use deal to make things a bit nicer, but this doesn't use the array of variable names
[a, b, c] = deal( zeros(23, 1) );
Even better, it's likely you can optimise your code by using a matrix or table instead of separate 1D arrays. The table option means you can still use your variable name array, but you're not using eval for anything!
% Matrix
M = zeros( 23, 3 ); % Index each column as a/b/c using M(:,1) etc
% Table, index using T.a, T.b, T.c
T = array2table( zeros(23,3), 'VariableNames', {'a','b','c'} );

Find similar texts based on paraphrase detection [closed]

I am interested in finding similar content (text) based on paraphrasing. How do I do this?
Are there any specific tools which can do this? In Python, preferably.
I believe the tool you are looking for is Latent Semantic Analysis.
Given that my post is going to be quite lengthy, I'm not going to go into much detail explaining the theory behind it - if you think that it is indeed what you are looking for, then I recommend you look it up. The best place to start would be here:
http://staff.scm.uws.edu.au/~lapark/lt.pdf
In summary, LSA attempts to uncover the underlying / latent meaning of words and phrases based on the assumption that similar words appear in similar documents. I'll be using R to demonstrate how it works.
I'm going to set up a function that is going to retrieve similar documents based on their latent meaning:
# Setting up all the needed functions:
SemanticLink = function(text, expression, LSAS, n = length(text), Out = "Text") {

  # Query Vector
  LookupPhrase = function(phrase, LSAS) {
    lsatm = as.textmatrix(LSAS)
    QV = function(phrase) {
      q = query(phrase, rownames(lsatm))
      t(q) %*% LSAS$tk %*% diag(LSAS$sk)
    }
    q = QV(phrase)
    qd = 0
    for (i in 1:nrow(LSAS$dk)) {
      qd[i] <- cosine(as.vector(q), as.vector(LSAS$dk[i,]))
    }
    qd
  }

  # Handling Synonyms
  Syns = function(word) {
    wl = gsub("(.*[[:space:]].*)", "",
              gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$", "",
                   unlist(strsplit(PlainTextDocument(synonyms(word)), ","))))
    wl = wl[wl != ""]
    return(wl)
  }

  ex = unlist(strsplit(expression, " "))
  for (i in seq(ex)) { ex = c(ex, Syns(ex[i])) }
  ex = unique(wordStem(ex))
  cache = LookupPhrase(paste(ex, collapse = " "), LSAS)

  if (Out == "Text")           { return(text[which(match(cache, sort(cache, decreasing = T)[1:n]) != "NA")]) }
  if (Out == "ValuesSorted")   { return(sort(cache, decreasing = T)[1:n]) }
  if (Out == "Index")          { return(which(match(cache, sort(cache, decreasing = T)[1:n]) != "NA")) }
  if (Out == "ValuesUnsorted") { return(cache) }
}
Note that we make use of synonyms here when assembling our query vector. This approach isn't perfect, because some of the synonyms in the qdap library are remote at best... This may interfere with your search query, so to achieve more accurate but less generalizable results, you can simply get rid of the synonyms bit and manually select all relevant terms that make up your query vector.
Let's try it out. I'll also be using the US Congress dataset from the package RTextTools:
library(tm)
library(RTextTools)
library(lsa)
library(data.table)
library(stringr)
library(qdap)
data(USCongress)
text = as.character(USCongress$text)
corp = Corpus(VectorSource(text))
parameters = list(minDocFreq = 1,
                  wordLengths = c(2, Inf),
                  tolower = TRUE,
                  stripWhitespace = TRUE,
                  removeNumbers = TRUE,
                  removePunctuation = TRUE,
                  stemming = TRUE,
                  stopwords = TRUE,
                  tokenize = NULL,
                  weighting = function(x) weightSMART(x, spec = "ltn"))
tdm = TermDocumentMatrix(corp,control=parameters)
tdm.reduced = removeSparseTerms(tdm,0.999)
# setting up LSA space - this may take a little while...
td.mat = as.matrix(tdm.reduced)
td.mat.lsa = lw_bintf(td.mat)*gw_idf(td.mat) # you can experiment with weightings here
lsaSpace = lsa(td.mat.lsa,dims=dimcalc_raw()) # you don't have to keep all dimensions
lsa.tm = as.textmatrix(lsaSpace)
l = 50
exp = "support trade"
SemanticLink(text,exp,n=5,lsaSpace,Out="Text")
[1] "A bill to amend the Internal Revenue Code of 1986 to provide tax relief for small businesses, and for other purposes."
[2] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the vessel AJ."
[3] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the yacht EXCELLENCE III."
[4] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the vessel M/V Adios."
[5] "A bill to amend the Internal Revenue Code of 1986 to provide tax relief for small business, and for other purposes."
As you can see, while "support trade" may not appear as such in the documents above, the function has retrieved a set of documents which are relevant to the query. The function is designed to retrieve documents with semantic linkages rather than exact matches.
We can also see how "close" these documents are to the query vector by plotting the cosine distances:
plot(1:l, SemanticLink(text, exp, lsaSpace, n = l, Out = "ValuesSorted"),
     type = "b", pch = 16, col = "blue",
     main = paste("Query Vector Proximity", exp, sep = " "),
     xlab = "observations", ylab = "Cosine")
I don't have enough reputation yet to produce the plot though, sorry.
As you would see, the first 2 entries appear to be more associated with the query vector than the rest (there are about 5 that are particularly relevant though), even though reading through them you wouldn't have thought so. I would say that this is the effect of using synonyms to build the query vectors. Ignoring that however, the graph shows how many other documents are remotely similar to the query vector.
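Since you mentioned Python: a rough sketch of the same LSA retrieval idea using scikit-learn (my own illustration, not a drop-in equivalent of the R function above) would be:

# Minimal LSA retrieval sketch in Python; assumes scikit-learn is installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "A bill to provide tax relief for small businesses.",
    "A bill to authorize a certificate of documentation for the coastwise trade.",
    "A resolution supporting international trade agreements.",
]

# Term weighting (roughly the role of the weightSMART / lw_bintf * gw_idf step above)
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Project the documents into a low-rank "latent semantic" space
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = lsa.fit_transform(X)

# Build a query vector in the same space and rank documents by cosine similarity
query_vec = lsa.transform(vectorizer.transform(["support trade"]))
scores = cosine_similarity(query_vec, doc_vecs).ravel()
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(round(score, 3), doc)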
EDIT:
Just recently, I've had to solve the problem you are trying to solve, but the above function just wouldn't work well, simply because the data was atrocious - the text was short, there was very little of it and not many topics were explored. So to find relevant entries in such situations, I've developed another function that is purely based on regular expressions.
Here it goes:
HLS.Extract = function(pattern, text = active.text) {

  require(qdap)
  require(tm)
  require(RTextTools)

  p = unlist(strsplit(pattern, " "))
  p = unique(wordStem(p))
  p = gsub("(.*)i$", "\\1y", p)

  Syns = function(word) {
    wl = gsub("(.*[[:space:]].*)", "",
              gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$", "",
                   unlist(strsplit(PlainTextDocument(synonyms(word)), ","))))
    wl = wl[wl != ""]
    return(wl)
  }

  trim = function(x) {
    temp_L = nchar(x)
    if (temp_L < 5)                {N = 0}
    if (temp_L > 4 && temp_L < 8)  {N = 1}
    if (temp_L > 7 && temp_L < 10) {N = 2}
    if (temp_L > 9)                {N = 3}
    x = substr(x, 0, nchar(x) - N)
    x = gsub("(.*)", "\\1\\\\\\w\\*", x)
    return(x)
  }

  # SINGLE WORD SCENARIO
  if (length(p) < 2) {
    # EXACT
    p = trim(p)
    ndx_exact = grep(p, text, ignore.case = T)
    text_exact = text[ndx_exact]

    # SEMANTIC
    p = unlist(strsplit(pattern, " "))
    express = new.exp = list()
    express = c(p, Syns(p))
    p = unique(wordStem(express))
    temp_exp = unlist(strsplit(express, " "))
    temp.p = double(length(seq(temp_exp)))
    for (j in seq(temp_exp)) {
      temp_exp[j] = trim(temp_exp[j])
    }
    rgxp = paste(temp_exp, collapse = "|")
    ndx_s = grep(paste(temp_exp, collapse = "|"), text, ignore.case = T, perl = T)
    text_s = as.character(text[ndx_s])

    f.object = list("ExactIndex"    = ndx_exact,
                    "SemanticIndex" = ndx_s,
                    "ExactText"     = text_exact,
                    "SemanticText"  = text_s)
  }

  # MORE THAN 2 WORDS
  if (length(p) > 1) {
    require(combinat)

    # EXACT
    for (j in seq(p)) {p[j] = trim(p[j])}
    fp = factorial(length(p))
    pmns = permn(length(p))
    tmat = matrix(0, fp, length(p))
    permut = double(fp)
    temp = double(length(p))
    for (i in 1:fp) {
      tmat[i,] = pmns[[i]]
    }
    for (i in 1:fp) {
      for (j in seq(p)) {
        temp[j] = paste(p[tmat[i,j]])
      }
      permut[i] = paste(temp, collapse = " ")
    }
    permut = gsub("[[:space:]]",
                  "[[:space:]]+([[:space:]]*\\\\w{,3}[[:space:]]+)*(\\\\w*[[:space:]]+)?([[:space:]]*\\\\w{,3}[[:space:]]+)*", permut)
    ndx_exact = grep(paste(permut, collapse = "|"), text)
    text_exact = as.character(text[ndx_exact])

    # SEMANTIC
    p = unlist(strsplit(pattern, " "))
    express = list()
    charexp = permut = double(length(p))
    for (i in seq(p)) {
      express[[i]] = c(p[i], Syns(p[i]))
      express[[i]] = unique(wordStem(express[[i]]))
      express[[i]] = gsub("(.*)i$", "\\1y", express[[i]])
      for (j in seq(express[[i]])) {
        express[[i]][j] = trim(express[[i]][j])
      }
      charexp[i] = paste(express[[i]], collapse = "|")
    }
    charexp = gsub("(.*)", "\\(\\1\\)", charexp)
    charexpX = double(length(p))
    for (i in 1:fp) {
      for (j in seq(p)) {
        temp[j] = paste(charexp[tmat[i,j]])
      }
      permut[i] = paste(temp, collapse =
        "[[:space:]]+([[:space:]]*\\w{,3}[[:space:]]+)*(\\w*[[:space:]]+)?([[:space:]]*\\w{,3}[[:space:]]+)*")
    }
    rgxp = paste(permut, collapse = "|")
    ndx_s = grep(rgxp, text, ignore.case = T)
    text_s = as.character(text[ndx_s])

    temp.f = function(x) {
      if (length(x) == 0) {x = 0}
    }
    temp.f(ndx_exact); temp.f(ndx_s)
    temp.f(text_exact); temp.f(text_s)

    f.object = list("ExactIndex"    = ndx_exact,
                    "SemanticIndex" = ndx_s,
                    "ExactText"     = text_exact,
                    "SemanticText"  = text_s,
                    "Synset"        = express)
  }

  cat(paste("Exact Matches:", length(ndx_exact), sep = ""))
  cat(paste("\n"))
  cat(paste("Semantic Matches:", length(ndx_s), sep = ""))

  return(f.object)
}
Trying it out:
HLS.Extract("buy house",
c("we bought a new house",
"I'm thinking about buying a new home",
"purchasing a brand new house"))[["SemanticText"]]
$SemanticText
[1] "I'm thinking about buying a new home" "purchasing a brand new house"
As you can see, the function is quite flexible. It would also pick up "home buying". It didn't pick up "we bought a new house" though, because "bought" is an irregular verb - it's the kind of thing that LSA would have picked up.
So you may like to try both and see which one works better. The SemanticLink function also requires a ton of memory, and when you have a particularly large corpus you won't be able to use it.
Cheers
I recommend you read the answers to this question; the first two answers in particular are really good.
I can also recommend the Natural Language Toolkit (NLTK) (I haven't personally tried it).
For similarity between news articles, you could extract keywords using part of speech tagging. NLTK provides a good POS tagger. Using nouns and noun phrases as keywords, represent each news article as a keyword vector.
Then use cosine similarity or some such text similarity measure to quantify similarity.
Further enhancements include handling synonyms, word stemming, handling adjectives if required, using TF-IDF as keyword weights in the vector, etc.
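A minimal sketch of that pipeline (my own illustration; it assumes NLTK is installed and its tokenizer and POS-tagger data have been downloaded, and the helper names are made up):

from collections import Counter
from math import sqrt
import nltk

def noun_keywords(text):
    """Return a bag of noun tokens (our keywords) as a Counter."""
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    return Counter(word.lower() for word, tag in tagged if tag.startswith('NN'))

def cosine(c1, c2):
    """Cosine similarity between two keyword Counters."""
    common = set(c1) & set(c2)
    dot = sum(c1[w] * c2[w] for w in common)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

a = noun_keywords("The government announced new trade tariffs on steel imports.")
b = noun_keywords("Steel import tariffs were raised by the government this week.")
print(cosine(a, b))   # closer to 1.0 the more keywords the two articles share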
