Regular expression to stop at first match immediately - python-3.x

So this is a given string where you can see in some places there are missing values.
We have to fill those missing values in a certain specified way.
s = "_, _, 30, _, _, _, 50, _, _ "
My concern for the first part of the problem is to extract the "_, _, 30" part from the string (so that I can take it apart, modify it, and replace the modified bit in the original string). I tried to do it using:
import re
res = re.findall("_.*[0-9]",s)
print(res)
The output I am getting is:
_, _, 30, _, _, _, 50
whereas the desired result is:
_, _, 30
How can I do this using the re module?

Your problem comes from the fact that regex quantifiers are greedy by default, which means they match as much as possible. There are two ways to solve your problem:
(1) Switch from the greedy to the non-greedy quantifier:
>>> re.findall("_.*?[0-9]+",s)
['_, _, 30', '_, _, _, 50']
(2) Replace "any character" with "any non-digit":
>>> re.findall(r"[^0-9]*[0-9]+", s)
['_, _, 30', ', _, _, _, 50']
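Since the question asks only for the first chunk, a minimal sketch using re.search (which stops after the first match) instead of re.findall might look like the following. The names first and first_chunk are illustrative and not from the original post.

```python
import re

s = "_, _, 30, _, _, _, 50, _, _ "

# re.search returns only the first match; the non-greedy ".*?" stops
# at the first run of digits that follows an underscore.
first = re.search(r"_.*?[0-9]+", s)

first_chunk = first.group(0) if first else None
span = first.span() if first else None

print(first_chunk)  # _, _, 30
print(span)         # (0, 8)
```

The span gives the start and end offsets of the chunk, which is what you would need to splice a modified piece back into the original string.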


python - nested loops robust to nulls/blanks

Here's a slice of my data:
[
['2002','','28','102','5'],
['2003','32','15','88','2'],
]
Each row is a year followed by counts for four possible types: A, B, C, D.
I want to loop through each of these so that I get a 'long' form of the data, e.g.:
{'year':'2002','type':'B'} x28 (28 of these)
{'year':'2002','type':'C'} x102
{'year':'2002','type':'D'} x5
And so forth; note that the first occurrence of A should be skipped because it has a blank.
I got:
ValueError: invalid literal for int() with base 10: ''
Which is odd because I made sure to cast the value to an int using int():
master_array = []
for i in range(len(tsv_data)):
    for j in range(int(tsv_data[i][1])):
        temp = {}
        temp['type']='A'
        master_array.append(temp)
I just moved on and tried:
master_array = []
for i in range(len(tsv_data)):
    try:
        for j in range(int(tsv_data[i][1])):
            temp = {}
            temp['type']='A'
            temp['year']=tsv_data[i][0]
            master_array.append(temp)
    except:
        try:
            for j in range(int(tsv_data[i][2])):
                temp = {}
                temp['type']='B'
                temp['year']=tsv_data[i][0]
                master_array.append(temp)
        except:
            try:
                for j in range(int(tsv_data[i][3])):
                    temp = {}
                    temp['type']='C'
                    temp['year']=tsv_data[i][0]
                    master_array.append(temp)
            except:
                for j in range(int(tsv_data[i][4])):
                    temp = {}
                    temp['type']='D'
                    temp['year']=tsv_data[i][0]
                    master_array.append(temp)
Which ran, but it gave me a tiny fraction of what the actual list should be. The behavior I was expecting was for try/except to skip all the cells with blanks and iterate over the cells with values.
Question
How can I read the value in each cell and use it as an iteration count for my master_array, while being robust to blanks?
The result should be one item for each year and type configuration, based on whatever the value was.
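The nested try/except version only reaches the B, C and D loops when an earlier conversion has raised, so most cells are never visited. Instead, only take the range if the value consists of digits. This does that and is a simpler, more readable implementation.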
master_array = []
for y, a, b, c, d in tsv_data:
    if a.isdigit():
        master_array.extend([{"year": y, "type": "A"} for _ in range(int(a))])
    if b.isdigit():
        master_array.extend([{"year": y, "type": "B"} for _ in range(int(b))])
    if c.isdigit():
        master_array.extend([{"year": y, "type": "C"} for _ in range(int(c))])
    if d.isdigit():
        master_array.extend([{"year": y, "type": "D"} for _ in range(int(d))])
You could do this with a simpler truthiness check, but it is better to check that the value is a digit string before the conversion.
if a: # Will work for blanks, but better to check that it converts to int.
    master_array.extend([{"year": y, "type": "A"} for _ in range(int(a))])
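The four near-identical branches can also be folded into one inner loop. The following is a sketch under the assumption that every row has exactly one year followed by four count cells, as in the sample data; the TYPES constant is an illustrative name, not part of the original code.

```python
tsv_data = [
    ['2002', '', '28', '102', '5'],
    ['2003', '32', '15', '88', '2'],
]

# Assumed column order: the four cells after the year map to A, B, C, D.
TYPES = "ABCD"

master_array = []
for year, *counts in tsv_data:
    for type_name, count in zip(TYPES, counts):
        # Blank (or otherwise non-numeric) cells are skipped.
        if count.isdigit():
            master_array.extend(
                {"year": year, "type": type_name} for _ in range(int(count))
            )

print(len(master_array))  # 272
```

Each appended dict is a fresh object, so later edits to one entry do not leak into the others.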

python 3, differences between two strings

I'd like to record the location of differences from both strings in a list (to remove them) ... preferably recording the highest separation point for each section, as these areas will have dynamic content.
Compare these
total chars 178. Two unique sections
t1 = 'WhereTisthetotalnumberofght5y5wsjhhhhjhkmhm Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofapxxxxxxxproximation,although'
and
total chars 211. Two unique sections
t2 = 'WhereTisthetotalnumberofdofodfgjnjndfgu><rgregw><sssssuguyguiygis>gggs<GS,Gs Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentrexxxxxxxsultsduetodifferinglevelsofapproximation,although'
I know difflib can do this but the output is bad.
I'd like to store (in a list) the char positions, preferably the larger separation values.
pattern location
t1 = 'WhereTisthetotalnumberof 24 ght5y5wsjhhhhjhkmhm 43 Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofap 151 xxxxxxx 158 proximation,although'
t2 = 'WhereTisthetotalnumberof 24 dofodfgjnjndfgu><rgregw><sssssuguyguiygis>gggs<GS,Gs 76 Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentre 155 xxxxxxx 162 sultsduetodifferinglevelsofapproximation,although'
output:
output list = [24, 76, 151, 162]
Update
Response to @Olivier's post
Position of all y's, separated by ***
t1
WhereTisthetotalnumberofght5***y***5wsjhhhhjhkmhm Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofapxxxxxxxproximation,although
t2
WhereTisthetotalnumberofdofodfgjnjndfgu><rgregw><sssssugu***y***gui***y***gis>gggs<GS,Gs Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentrexxxxxxxsultsduetodifferinglevelsofapproximation,although
output after matcher.get_matching_blocks()
and string = ''.join([t1[a:a+n] for a, _, n in blocks])
WhereTisthetotalnumberof***y*** Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofapproximation,although
Using difflib is probably your best bet as you are unlikely to come up with a more efficient solution than the algorithms it provides. What you want is to use SequenceMatcher.get_matching_blocks. Here is what it will output according to the doc.
Return list of triples describing matching subsequences. Each triple
is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The
triples are monotonically increasing in i and j.
Here is a way you could use this to reconstruct a string from which you removed the delta.
from difflib import SequenceMatcher
x = "abc_def"
y = "abc--ef"
matcher = SequenceMatcher(None, x, y)
blocks = matcher.get_matching_blocks()
# blocks: [Match(a=0, b=0, size=3), Match(a=5, b=5, size=2), Match(a=7, b=7, size=0)]
string = ''.join([x[a:a+n] for a, _, n in blocks])
# string: "abcef"
Edit: It was also pointed out that there is a problem when you have two strings like these:
t1 = 'WordWordaayaaWordWord'
t2 = 'WordWordbbbybWordWord'
Then the above code would return 'WordWordyWordWord'. This is because get_matching_blocks will catch the 'y' that is present in both strings between the expected blocks. A way around this is to filter the returned blocks by length.
string = ''.join([x[a:a+n] for a, _, n in blocks if n > 1])
If you want more complex analysis of the returned blocks you could also do the following.
def block_filter(substring):
    """Outputs True if the substring is to be merged, False otherwise"""
    ...
string = ''.join([x[a:a+n] for a, _, n in blocks if block_filter(x[a:a+n])])
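To get the positions of the differing sections rather than the cleaned string, the same filtered blocks can be walked to collect the gaps between them. This is only a sketch: unmatched_spans and min_size are hypothetical names, and passing autojunk=False is an assumption made here so that difflib's "popular element" heuristic (which applies to sequences of 200 or more items) does not interfere with long strings.

```python
from difflib import SequenceMatcher

def unmatched_spans(a, b, min_size=2):
    """Return (start, end) index pairs of `a` that are not covered by a
    matching block of at least `min_size` characters."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    spans = []
    cursor = 0
    for block in matcher.get_matching_blocks():
        if block.size < min_size:
            continue  # also skips the zero-length sentinel block at the end
        if block.a > cursor:
            spans.append((cursor, block.a))
        cursor = block.a + block.size
    if cursor < len(a):
        spans.append((cursor, len(a)))
    return spans

t1 = 'WordWordaayaaWordWord'
t2 = 'WordWordbbbybWordWord'
print(unmatched_spans(t1, t2))  # [(8, 13)]
```

Swapping the arguments gives the corresponding spans in the second string; raising min_size merges short coincidental matches (such as the lone 'y') into the surrounding difference.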

Unpack multiple variables from sequence

I am expecting the code below to print chr7.
import strutils
var splitLine = "chr7 127471196 127472363 Pos1 0 +".split()
var chrom, startPos, endPos = splitLine[0..2]
echo chrom
Instead it prints @[chr7, 127471196, 127472363].
Is there a way to unpack multiple values from sequences at the same time?
And what would the tersest way to do the above be if the elements weren't contiguous? For example:
var chrom, startPos, strand = splitLine[0..1, 5]
Gives the error:
read_bed.nim(8, 40) Error: type mismatch: got (seq[string], Slice[system.int], int literal(5))
but expected one of:
system.[](a: array[Idx, T], x: Slice[system.int])
system.[](s: string, x: Slice[system.int])
system.[](a: array[Idx, T], x: Slice[[].Idx])
system.[](s: seq[T], x: Slice[system.int])
var chrom, startPos, strand = splitLine[0..1, 5]
^
This can be accomplished using macros.
import macros

macro `..=`*(lhs: untyped, rhs: tuple|seq|array): auto =
  # Check that the lhs is a tuple of identifiers.
  expectKind(lhs, nnkPar)
  for i in 0..len(lhs)-1:
    expectKind(lhs[i], nnkIdent)
  # Result is a statement list starting with an
  # assignment to a tmp variable of rhs.
  let t = genSym()
  result = newStmtList(quote do:
    let `t` = `rhs`)
  # assign each component to the corresponding
  # variable.
  for i in 0..len(lhs)-1:
    let v = lhs[i]
    # skip assignments to _.
    if $v.toStrLit != "_":
      result.add(quote do:
        `v` = `t`[`i`])

macro headAux(count: int, rhs: seq|array|tuple): auto =
  let t = genSym()
  result = quote do:
    let `t` = `rhs`
    ()
  for i in 0..count.intVal-1:
    result[1].add(quote do:
      `t`[`i`])

template head*(count: static[int], rhs: untyped): auto =
  # We need to redirect this through a template because
  # of a bug in the current Nim compiler when using
  # static[int] with macros.
  headAux(count, rhs)

var x, y: int
(x, y) ..= (1, 2)
echo x, y
(x, _) ..= (3, 4)
echo x, y
(x, y) ..= @[4, 5, 6]
echo x, y
let z = head(2, @[4, 5, 6])
echo z
(x, y) ..= head(2, @[7, 8, 9])
echo x, y
The ..= macro unpacks tuple or sequence assignments. You can accomplish the same with var (x, y) = (1, 2), for example, but ..= works for seqs and arrays, too, and allows you to reuse variables.
The head template/macro extracts the first count elements from a tuple, array, or seq and returns them as a tuple (which can then be used like any other tuple, e.g. for destructuring with let or var).
For anyone that's looking for a quick solution, here's a nimble package I wrote called unpack.
You can do sequence and object destructuring/unpacking with syntax like this:
someSeqOrTupleOrArray.lunpack(a, b, c)
[a2, b2, c2] <- someSeqOrTupleOrArray
{name, job} <- tim
tom.lunpack(job, otherName = name)
{job, name: yetAnotherName} <- john
Currently pattern matching in Nim only works with tuples. This also makes sense, because pattern matching requires a statically known arity. For instance, what should happen in your example, if the seq does not have a length of three? Note that in your example the length of the sequence can only be determined at runtime, so the compiler does not know if it is actually possible to extract three variables.
Therefore I think the solution which was linked by @def- was going in the right direction. This example uses arrays, which do have a statically known size. In this case the compiler knows the tuple arity, i.e., the extraction is well defined.
If you want an alternative (maybe convenient but unsafe) approach you could do something like this:
import macros

macro extract(args: varargs[untyped]): typed =
  ## assumes that the first expression is an expression
  ## which can take a bracket expression. Let's call it
  ## `arr`. The generated AST will then correspond to:
  ##
  ## let <second_arg> = arr[0]
  ## let <third_arg> = arr[1]
  ## ...
  result = newStmtList()
  # the first vararg is the "array"
  let arr = args[0]
  var i = 0
  # all other varargs are now used as "injected" let bindings
  for arg in args.children:
    if i > 0:
      var rhs = newNimNode(nnkBracketExpr)
      rhs.add(arr)
      rhs.add(newIntLitNode(i-1))
      let assign = newLetStmt(arg, rhs) # could be replaced by newVarStmt
      result.add(assign)
    i += 1
  #echo result.treerepr

let s = @["X", "Y", "Z"]
s.extract(a, b, c)
# this essentially produces:
# let a = s[0]
# let b = s[1]
# let c = s[2]
# check if it works:
echo a, b, c
I have not included a check for the seq length yet, so you would simply get an out-of-bounds error if the seq does not have the required length. Another warning: if the first expression is not a literal, it will be evaluated several times.
Note that the _ literal is allowed in let bindings as a placeholder, which means that you could do things like this:
s.extract(a, b, _, _, _, x)
This would address your splitLine[0..1, 5] example, which btw is simply not a valid indexing syntax.
Yet another option is the definesugar package:
import strutils, definesugar
# need to use splitWhitespace instead of split to prevent empty string elements in sequence
var splitLine = "chr7 127471196 127472363 Pos1 0 +".splitWhitespace()
echo splitLine
block:
  (chrom, startPos, endPos) := splitLine[0..2]
  echo chrom # chr7
  echo startPos # 127471196
  echo endPos # 127472363
block:
  (chrom, startPos, strand) := splitLine[0..1] & splitLine[5] # splitLine[0..1, 5] not supported
  echo chrom
  echo startPos
  echo strand # +
# alternative syntax
block:
  (chrom, startPos, *_, strand) := splitLine
  echo chrom
  echo startPos
  echo strand
see https://forum.nim-lang.org/t/7072 for recent discussion

Detecting the index in a string that is not printable character with Scala

I have a method that detects the index in a string that is not printable as follows.
def isPrintable(v:Char) = v >= 0x20 && v <= 0x7E
val ba = List[Byte](33,33,0,0,0)
ba.zipWithIndex.filter { v => !isPrintable(v._1.toChar) } map {v => v._2}
> res115: List[Int] = List(2, 3, 4)
The first element of the result list is the index, but I wonder if there is a simpler way to do this.
If you want an Option[Int] of the first non-printable character (if one exists), you can do:
ba.zipWithIndex.collectFirst{
case (char, index) if (!isPrintable(char.toChar)) => index
}
> res4: Option[Int] = Some(2)
If you want all the indices like in your example, just use collect instead of collectFirst and you'll get back a List.
For getting only the first index that meets the given condition:
ba.indexWhere(v => !isPrintable(v.toChar))
(it returns -1 if nothing is found)
You can use a regexp directly to find unprintable characters by their Unicode category.
Resource: Regexp page
That way you can filter your string directly with such a pattern, for instance:
val text = "this is \n sparta\t\r\n!!!"
text.zipWithIndex.filter(_._1.toString.matches("\\p{C}")).map(_._2) // Char has no matches method, so convert to String first
> res3: Vector(8, 16, 17, 18)
As a result you'll get a Vector with the indices of all unprintable characters in the String.
If only the first occurrence of a non-printable char is desired:
The span method applied to a List returns two sublists: the first holds the leading elements that all satisfy a condition, and the second starts with the first element that fails it. In this case, consider:
val (l,r) = ba.span(b => isPrintable(b.toChar))
l: List(33, 33)
r: List(0, 0, 0)
To get the index of the first non printable char,
l.size
res: Int = 2
If all occurrences of non-printable chars are desired:
Consider partitioning the List by a criterion. For instance, for
val ba2 = List[Byte](33,33,0,33,33)
val (l,r) = ba2.zipWithIndex.partition(b => isPrintable(b._1.toChar))
l: List((33,0), (33,1), (33,3), (33,4))
r: List((0,2))
where r includes tuples with non printable chars and their position in the original List.
I am not sure whether list of indexes or tuples is needed and I am not sure whether 'ba' needs to be an list of bytes or starts off as a string.
for { i <- 0 until ba.length if !isPrintable(ba(i).toChar) } yield i
here, because people need performance :)
def getNonPrintable(ba: List[Byte]): List[Int] = {
  import scala.annotation.tailrec
  import scala.collection.mutable.ListBuffer
  val buffer = ListBuffer[Int]()
  @tailrec
  def go(xs: List[Byte], cur: Int): ListBuffer[Int] = {
    xs match {
      case Nil => buffer
      case y :: ys => {
        if (!isPrintable(y.toChar)) buffer += cur
        go(ys, cur + 1)
      }
    }
  }
  go(ba, 0)
  buffer.toList
}

Scala split string to tuple

I would like to split a string on whitespace that has 4 elements:
1 1 4.57 0.83
and I am trying to convert it into a List[(String,String,Point)] such that the first two tokens are the first two elements of the tuple and the last two form the Point. I am doing the following, but it doesn't seem to work:
Source.fromFile(filename).getLines.map(string => {
  val split = string.split(" ")
  (split(0), split(1), split(2))
}).map{t => List(t._1, t._2, t._3)}.toIterator
How about this:
scala> case class Point(x: Double, y: Double)
defined class Point
scala> s43.split("\\s+") match { case Array(i, j, x, y) => (i.toInt, j.toInt, Point(x.toDouble, y.toDouble)) }
res00: (Int, Int, Point) = (1,1,Point(4.57,0.83))
You could use pattern matching to extract what you need from the array:
case class Point(pts: Seq[Double])
val lines = List("1 1 4.34 2.34")
val coords = lines.collect(_.split("\\s+") match {
  case Array(s1, s2, points @ _*) => (s1, s2, Point(points.map(_.toDouble)))
})
You are not converting the third and fourth tokens into a Point, nor are you converting the lines into a List. Also, you are not rendering each element as a Tuple3, but as a List.
The following should be more in line with what you are looking for.
case class Point(x: Double, y: Double) // Simple point class
Source.fromFile(filename).getLines.map(line => {
  val tokens = line.split("""\s+""") // Use a regex to avoid empty tokens
  (tokens(0), tokens(1), Point(tokens(2).toDouble, tokens(3).toDouble))
}).toList // Convert from an Iterator to List
case class Point(pts: Seq[Double])
val lines = "1 1 4.34 2.34"
val splitLines = lines.split("\\s+") match {
  case Array(s1, s2, points @ _*) => (s1, s2, Point(points.map(_.toDouble)))
}
And for the curious, the @ in pattern matching binds a variable to the pattern, so points @ _* binds the variable points to the pattern _*. And _* matches the rest of the array, so points ends up being a Seq[String].
There are ways to convert a Tuple to a List or Seq. One way is:
scala> (1,2,3).productIterator.toList
res12: List[Any] = List(1, 2, 3)
But as you can see, the element type is Any and not Int.
To preserve the individual element types, you can use HList from shapeless:
https://github.com/milessabin/shapeless
