Split without separator with different arrays - python-3.x

Could you please help me? I need to split a string that has no consistent separator into tokens of different types (names, operators, numbers, parentheses).
For example, the following strings should generate the same list as output:
"ak = bib+c*(data+1005)\n"
" ak= bib +c* (data +1005 )\n"
" ak =bib + c * (data + 1005)"
The output should be:
['ak', '=', 'bib', '+', 'c', '*', '(', 'data', '+', '1005', ')']
Thank you!

You can use re.findall with a pattern that matches either a word or a non-space character:
import re
re.findall(r'\w+|\S', "ak = bib+c*(data+1005) ")
This returns:
['ak', '=', 'bib', '+', 'c', '*', '(', 'data', '+', '1005', ')']
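As a quick check, the sketch below runs the same pattern over all three inputs from the question and compares each result with the requested list. The names tokenize, TOKEN, samples and expected are made up for this illustration.

```python
import re

# \w+ grabs a whole run of letters, digits or underscores (a name or a number);
# \S grabs any other single non-whitespace character (an operator or a parenthesis).
TOKEN = re.compile(r"\w+|\S")

def tokenize(expr):
    """Return the tokens of expr, ignoring all whitespace."""
    return TOKEN.findall(expr)

samples = [
    "ak = bib+c*(data+1005)\n",
    " ak= bib +c* (data +1005 )\n",
    " ak =bib + c * (data + 1005)",
]
expected = ['ak', '=', 'bib', '+', 'c', '*', '(', 'data', '+', '1005', ')']

for sample in samples:
    print(tokenize(sample) == expected)
```

Because \S matches one character at a time, a multi-character operator such as == would come out as two separate tokens; the pattern would need an extra alternative if that matters for your input.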

Related

regex in python: Can you split a string by delimiter with exceptions?

I am trying to parse a long string of 'objects' enclosed in quotes and delimited by commas. Example:
s='"12345","X","description of x","X,Y",,,"345355"'
output=['"12345"','"X"','"description of x"','"X,Y"','','','"345355"']
I am using split to split on commas:
s='"12345","X","description of x","X,Y",,,"345355"'
s.split(',')
This almost works, but the segment ...,"X,Y",... is split into "X and Y". I need the split to ignore commas inside quotes.
Is there a way I can split on commas except for those inside quotes?
I tried using a regex, but it ignores the ...,,,... in the data because there are no quotes around blank fields in the file I'm parsing. I am not an expert with regex, and I took this sample from "Python split string on quotes". I do understand what that example is doing, but I am not sure how to modify it so that it also handles data that is not enclosed in quotes.
Thanks!
Split by " (quote) instead of by , (comma). This splits the string into a list with extra comma elements, which you can then remove. Note that this also drops the quotes and the empty fields:
s='"12345","X","description of x","X,Y",,,"345355"'
temp = s.split('"')
print(temp)
#> ['', '12345', ',', 'X', ',', 'description of x', ',', 'X,Y', ',,,', '345355', '']
values_to_remove = ['', ',', ',,,']
result = list(filter(lambda val: val not in values_to_remove, temp))
print(result)
#> ['12345', 'X', 'description of x', 'X,Y', '345355']
This should work:
In [1]: import re
In [2]: s = '"12345","X","description of x","X,Y",,,"345355"'
In [3]: pattern = r"(?<=[\",]),(?=[\",])"
In [4]: re.split(pattern, s)
Out[4]: ['"12345"', '"X"', '"description of x"', '"X,Y"', '', '', '"345355"']
Explanation:
(?<=...) is a "positive lookbehind assertion". It causes your pattern (in this case, just a comma, ",") to match commas in the string only if they are preceded by the pattern given by .... Here, ... is [\",], which means "either a quotation mark or a comma".
(?=...) is a "positive lookahead assertion". It causes your pattern to match commas in the string only if they are followed by the pattern specified as ... (again, [\",]: either a quotation mark or a comma).
Since both of these assertions must be satisfied for the pattern to match, it will still work correctly if any of your 'objects' begin or end with commas as well.
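To see the two assertions at work, the sketch below lists which commas the pattern accepts as split points and which it skips. The variable names (all_commas, split_commas, skipped, fields) are made up for this illustration.

```python
import re

s = '"12345","X","description of x","X,Y",,,"345355"'
# Same pattern as above, written with single quotes so " needs no escaping.
pattern = re.compile(r'(?<=[",]),(?=[",])')

# Index of every comma in the string.
all_commas = [i for i, ch in enumerate(s) if ch == ","]
# Index of every comma that satisfies both the lookbehind and the lookahead.
split_commas = [m.start() for m in pattern.finditer(s)]
# Commas that are left alone: only the one inside "X,Y".
skipped = [i for i in all_commas if i not in split_commas]

for i in skipped:
    print("kept comma in:", s[i - 1:i + 2])

fields = pattern.split(s)
print(fields)
```

One limitation worth keeping in mind: a comma at the very start or end of the line has no neighbour on one side, so the lookbehind or lookahead fails there and a leading or trailing empty field is not split off.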
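You can strip the quotes by walking through the string one character at a time:
s='"12345","X","description of x","X,Y",,,"345355"'
n = ''
i = 0
# Assumes every field is either quoted or empty, and that no field contains ", ".
while i < len(s):
    if s[i] == '"':
        i+=1  # skip the opening quote
        # copy everything up to the closing quote, commas included
        while i<len(s) and s[i] != '"':
            n+=s[i]
            i+=1
        i+=1  # skip the closing quote
    # a comma outside quotes is a real delimiter
    if i < len(s) and s[i] == ",":
        n+=", "
        i+=1
n.split(", ")
output: ['12345', 'X', 'description of x', 'X,Y', '', '', '345355']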

Define a map function that only returns city names longer than 5 characters, and a hyphen ("-") for the other names

The names of cities are entered on one line, separated by spaces.
You need to define a map function that returns
only the city names longer than 5 characters.
For all other names, return a string containing a hyphen ("-").
Generate a list of the resulting values and display it
on the screen on one line, separated by spaces.
cities = [i.replace(i, '-') for i in input().split() if len(i) < 5]
print(cities)
Input
cities = ['Moscow', 'Ufa', 'Vologda', 'Tula', 'Vladivostok', 'Habarovsk']
Output
['Moscow', '-', 'Vologda', '-', 'Vladivostok', 'Habarovsk']
If you can use a lambda function, then (reading "longer than 5 characters" literally, so names of exactly 5 characters are replaced as well):
cities = ['Moscow', 'Ufa', 'Vologda', 'Tula', 'Vladivostok', 'Habarovsk']
print(list(map(lambda x: '-' if len(x) <= 5 else x, cities)))
output:
['Moscow', '-', 'Vologda', '-', 'Vladivostok', 'Habarovsk']
Also, if you want to do it without the map function, using only a list comprehension:
cities = ['Moscow', 'Ufa', 'Vologda', 'Tula', 'Vladivostok', 'Habarovsk']
print([i.replace(i, '-') if len(i) <= 5 else i for i in cities])
output:
['Moscow', '-', 'Vologda', '-', 'Vladivostok', 'Habarovsk']
As mentioned by @JonSG, this option without the map function is more efficient if you skip the replace call:
['-' if len(i) <= 5 else i for i in cities]
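The task also asks for a named function used with map and for the result to be shown on one line separated by spaces. A minimal sketch of that, assuming "longer than 5 characters" is meant literally; the names mask_short and format_cities are made up for this illustration, and the input line is passed as a string instead of being read with input() so the example is easy to run:

```python
def mask_short(name):
    """Keep names longer than 5 characters; replace the rest with a hyphen."""
    return name if len(name) > 5 else "-"

def format_cities(line):
    """Apply mask_short to every space-separated name and join the results."""
    return " ".join(map(mask_short, line.split()))

print(format_cities("Moscow Ufa Vologda Tula Vladivostok Habarovsk"))
# Moscow - Vologda - Vladivostok Habarovsk
```

In the actual exercise, format_cities(input()) would read the names from the keyboard.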

How to split a string by multiple delimiters and also store it in Python?

I need to split a string by multiple delimiters.
My string is HELLO+WORLD-IT*IS=AMAZING.
I would like the result to be
["HELLO", "+", "WORLD", "-", "IT", "*", "IS", "=", "AMAZING"]
I hear that re.findall() may handle it, but I can't work out the solution.
Using re.split works in this case. Put every delimiter in a capturing group so that the delimiters are kept in the result:
import re
string = "HELLO+WORLD-IT*IS=AMAZING"
pattern = r"(\+|-|\*|=)"
result = re.split(pattern, string)
Given:
s='HELLO+WORLD-IT*IS=AMAZING'
You can split on any break between a word and a non-word character, as a general case, with the word boundary assertion \b (splitting on an empty match like this requires Python 3.7 or later):
>>> re.split(r'\b', s)
['', 'HELLO', '+', 'WORLD', '-', 'IT', '*', 'IS', '=', 'AMAZING', '']
And remove the '' at the start and end like so:
>>> re.split(r'\b', s)[1:-1]
['HELLO', '+', 'WORLD', '-', 'IT', '*', 'IS', '=', 'AMAZING']
Or if you know that is the full set of delimiters that you want to use for a split, define a character class of them and capture the delimiter:
>>> re.split(r'([+\-*=])', s)
['HELLO', '+', 'WORLD', '-', 'IT', '*', 'IS', '=', 'AMAZING']
Since \b is a zero width assertion (it does not consume characters to match) you don't have to capture what the delimiter was that caused the split. The assertion of \b is also true at the start and end of the string so those blanks need to be removed.
Since - is used in a character class to define a range of characters such as [0-9] you have to escape the - in [+\-*=].
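Since the question mentions re.findall(), here is a sketch of that route as well: instead of describing where to split, the pattern describes what a token looks like. The variable names are made up for this illustration.

```python
import re

s = 'HELLO+WORLD-IT*IS=AMAZING'

# A token is either a run of word characters or one of the four delimiters.
tokens = re.findall(r'\w+|[+\-*=]', s)
print(tokens)
# ['HELLO', '+', 'WORLD', '-', 'IT', '*', 'IS', '=', 'AMAZING']

# The two approaches differ when delimiters are adjacent:
# re.split leaves an empty string between them, re.findall does not.
print(re.split(r'([+\-*=])', 'A+-B'))    # ['A', '+', '', '-', 'B']
print(re.findall(r'\w+|[+\-*=]', 'A+-B'))  # ['A', '+', '-', 'B']
```

Note that re.findall silently drops any character that matches neither alternative, so it suits input where everything is either a word or a known delimiter.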

how to extract n digit numbers in a text including special characters

I have a text full of special characters and I want to extract the numbers that have exactly 4 digits:
mytext ="""A text including special characters like 1000+(100)=1100 """
numbers = []
seperators=[
'(', ')', '[', ']', '{', '}', ';', ':', '=', '+', '-', '/', '*', '&', '%', '$', '#', '#', '^', '*', '~', '`', '"', '>', '|', '\\', '?', '.', '<', "'"]
How can I use the split function to extract the numbers?
for word in mytext2.split(seperators):
    if word.isdigit():
        numbers.append(int(word))
#print(numbers)
for mynumbers in numbers:
    if mynumbers >999 and 10000>mynumbers: #for 4 digits
        print(mynumbers)
#this should print all the 4 digit numbers
import re
text = "A text including special characters like 1000+(100)=1100 "
numbers = [int(number) for number in re.findall(r'\b\d{4}\b', text)]
print(numbers)
# Outputs [1000, 1100]
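The title asks about n-digit numbers in general, so the same pattern can be built for any n. A minimal sketch; the function name n_digit_numbers is made up for this illustration:

```python
import re

def n_digit_numbers(text, n):
    """Return every standalone run of exactly n digits in text, as ints."""
    # The doubled braces are literal braces in an f-string,
    # so for n=4 the pattern is \b\d{4}\b.
    pattern = rf"\b\d{{{n}}}\b"
    return [int(match) for match in re.findall(pattern, text)]

text = "A text including special characters like 1000+(100)=1100 "
print(n_digit_numbers(text, 4))  # [1000, 1100]
print(n_digit_numbers(text, 3))  # [100]
```

Two things to be aware of: \b treats letters and underscores as word characters, so the digits in something like x1000 are not matched, and a run with leading zeros such as 0042 counts as four digits but is converted to the int 42.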
mytext ="""Alain Fabien Maurice Marcel Delon (French: [alɛ̃ dəlɔ̃]; born 8 November 1935) is a French actor and businessman. He is known as
one of Europe's most prominent actors and screen sex symbols from the 1960s and 1970s. He achieved critical acclaim for roles in
films such as Rocco and His Brothers (1960), Plein Soleil (1960), L'Eclisse (1962), The Leopard (1963), The Yellow Rolls-
Royce (1965), Lost Command (1966), and Le Samouraï (1967). Over the course of his career Delon worked with many well-known directors, including Luchino Visconti, Jean-Luc Godard, Jean-Pierre Melville, Michelangelo Antonioni, and Louis Malle. He
acquired Swiss citizenship in 1999"""
numbers = []
# every special character that should be treated as a separator
seperators = ['(', ')', '[', ']', '{', '}', ';', ':', '=', '+', '-', '/', '*', '&', '%', '$', '#', '^', '~', '`', '"', '>', '|', '\\', '?', '.', '<', "'"]
mytext2 = mytext
# replace each separator with a space so that split() isolates the numbers
for sep in seperators:
    mytext2 = mytext2.replace(sep, ' ')
#print(mytext2)
for word in mytext2.split():
    if word.isdigit():
        numbers.append(int(word))
#print(numbers)
for mynumbers in numbers:
    if mynumbers > 999 and 10000 > mynumbers:
        print(mynumbers)
This code prints all the 4-digit numbers in the text. If your text contains other special characters, add them to the set of characters that are replaced with spaces.

Unable to append to array element

I'm trying to append a string to the end of an array element by using the following function:
def spread(farm):
    #writing potential fruit spread
    for y in range(0,len(farm[0])):
        for x in range(0,len(farm[0])):
            #if this current space is a tree, write the potential spaces accordingly
            if farm[y][x] == "F" or farm[y][x] == "W" or farm[y][x] == "G" or farm[y][x] == "J" or farm[y][x] == "M":
                for b in [-1,0,1]:
                    #making sure the y-coord is within the bounds of the farm
                    if y+b >= 0 and y+b < len(farm[0]):
                        for a in [-1,0,1]:
                            #making sure the x-coord is within the bounds of the farm and the selected space is not a tree
                            if x+a >= 0 and x+a < len(farm[0]) and farm[y+b][x+a] != "F" and farm[y+b][x+a] != "W" and farm[y+b][x+a] != "G" and farm[y+b][x+a] != "J" and farm[y+b][x+a] != "M":
                                #if space is blank, write over the space outright
                                if farm[y+b][x+a] == "_":
                                    farm[y+b][x+a] = farm[y][x].lower()
                                else:
                                    #wherein my troubles lie :(
                                    farm[y+b][x+a] = farm[y+b][x+a] + farm[y][x].lower()
    return farm
with the following input, an array (in farm):
[['_' '_' '_' 'F' '_' '_' '_']
['_' '_' '_' 'W' '_' '_' '_']
['_' '_' '_' '_' '_' '_' '_']
['_' '_' '_' 'J' '_' '_' '_']
['_' '_' '_' '_' '_' '_' '_']
['_' 'G' '_' '_' '_' 'F' '_']
['W' '_' '_' '_' '_' '_' 'G']]
What the function is supposed to do is to simulate spreading fruit trees. Every tree (represented by a capital letter) will spread to the adjacent squares (represented by a lowercase character or underscore). However, the very last line handles the case in which the selected array element is not an underscore. What is supposed to happen is that it will append the string to the end of the array element instead of replacing it, but instead appends nothing. The output is supposed to look something like this:
[['_' '_' 'fw' 'F' 'fw' '_' '_']
['_' '_' 'fw' 'W' 'fw' '_' '_']
['_' '_' 'wj' 'wj' 'wj' '_' '_']
['_' '_' 'j' 'J' 'j' '_' '_']
['g' 'g' 'jg' 'j' 'jf' 'f' 'f']
['gw' 'G' 'g' '_' 'f' 'F' 'fg']
['W' 'gw' 'g' '_' 'f' 'fg' 'G']]
But instead it outputs this:
[['_' '_' 'f' 'F' 'f' '_' '_']
['_' '_' 'f' 'W' 'f' '_' '_']
['_' '_' 'w' 'w' 'w' '_' '_']
['_' '_' 'j' 'J' 'j' '_' '_']
['g' 'g' 'j' 'j' 'j' 'f' 'f']
['g' 'G' 'g' '_' 'f' 'F' 'f']
['W' 'g' 'g' '_' 'f' 'f' 'G']]
What am I doing wrong?
As noted, Numpy has its own string type, which limits the length of the contained text, so that the data can be stored in a neat "rectangular" way without indirection. (This means for example that Numpy can simply do math to calculate where any element will be, rather than chasing pointers for multiple indices.)
It is possible to work around this by explicitly specifying dtype=object when we create the array. This means that Numpy will store pointers to Python objects in its internal representation; this loses a lot of the benefits, but may still allow you to write overall faster and more elegant code depending on the task.
Let's try to implement that here. My first suggestion will be to use empty '' strings for the empty spots on the farm, rather than '_'; this removes a bunch of special cases from our logic (and as we all know, "special cases aren't special enough to break the rules").
Thus, we start with:
import numpy as np
farm = np.array([
    ['', '', '', 'F', '', '', ''],
    ['', '', '', 'W', '', '', ''],
    ['', '', '', '', '', '', ''],
    ['', '', '', 'J', '', '', ''],
    ['', '', '', '', '', '', ''],
    ['', 'G', '', '', '', 'F', ''],
    ['W', '', '', '', '', '', 'G']
], dtype='object')
The primary way that Numpy helps us here is that it can efficiently:
Apply operators and functions to each element of the array elementwise.
Slice the array in one or more dimensions.
My approach is as follows:
Create a function that tells us what saplings get planted from the local tree.
Use Numpy to create an array of all the saplings that get planted from their corresponding trees, across the farm.
Create a function to plant saplings at a location offset from their source trees, by slicing the sapling array and "adding" (+, but it's string concatenation of course) the new saplings to a corresponding slice of the farm.
Iterate over the directions that the saplings can be planted, to do all the planting.
So, let's go through that....
The first step is pretty straightforward:
# Determine the saplings that will be planted, if any, from a given source plot.
# Handling the case where multiple trees are already present is left as an exercise.
def sapling_for(plot):
    return plot.lower() if plot in 'FGJMW' else ''
Now we need to apply that to the entire array. Applying operators like + is automatic. (If you have two arrays x and y with the same number of dimensions and the same size in each dimension, you can just add them with x + y and everything is added up elementwise. Notice that x * y is not "matrix multiplication", but element-wise multiplication.) However, for user-defined functions, we need a bit of extra work - we can't just pass our farm to sapling_for (after all, it doesn't have a .lower() method, for just one of many problems). It looks like:
saplings = np.vectorize(sapling_for)(farm)
Okay, not too difficult. Onward to the slicing. This is a bit tricky. We can easily enough get, for example, the north-west slice of the saplings: it is saplings[:-1, :-1] (i.e., everything except the last row and column). Notice we are not doing two separate index operations - this is Deep NumPy Magic (TM), and we need to do things NumPy's way.
My idea here is that we can represent saplings "spreading" to the southeast by taking this northwest slice and adding it to a southeast slice of the farm: farm[1:, 1:] += saplings[:-1, :-1]. We could simply do that eight times for each compass direction of spread. But what about a generalization?
It's a little trickier, since e.g. 1: doesn't mean anything by itself. It does, however, have a built-in Python representation: the native slice type, which we can also use for Numpy indexing (and for indexing built-in sequence types, too!). So I wrote a helper function to create those:
def get_slice(dx):
    return slice(dx, None, None) if dx >= 0 else slice(None, dx, None)
These are similar to range objects: the parameters are the start point, end point and "step". The idea here is that a negative value will give a slice taking that many items off the end, while a positive value will take them off the front.
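Since slice objects also work on built-in sequences, the helper can be tried out on a plain list without NumPy. The sketch below repeats get_slice so that it runs on its own; the list row is made up for this illustration.

```python
def get_slice(dx):
    # Positive dx drops dx items from the front, negative dx drops them from the end.
    return slice(dx, None, None) if dx >= 0 else slice(None, dx, None)

row = ['a', 'b', 'c', 'd']

print(row[get_slice(1)])   # ['b', 'c', 'd']
print(row[get_slice(-1)])  # ['a', 'b', 'c']
print(row[get_slice(0)])   # ['a', 'b', 'c', 'd']

# get_slice(dx) and get_slice(-dx) always select the same number of items,
# which is what lets a shifted source slice line up with its target slice.
for dx in (-1, 0, 1):
    target = row[get_slice(dx)]
    source = row[get_slice(-dx)]
    print(dx, target, source)
```

For dx = 1 the target is everything except the first item and the source is everything except the last, so each source item lands one position further along.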
That lets us write a general function to add a slice of one array to a shifted position (in the opposite corner) of a "base" array:
def add_shifted(base, to_add, dx, dy):
    base[get_slice(dx), get_slice(dy)] += to_add[get_slice(-dx), get_slice(-dy)]
Hopefully the logic is clear enough. Finally, we can iterate over all the (dx, dy) pairs that make sense for spreading the saplings: everywhere within one space, except for (0, 0).
for dx in (-1, 0, 1):
    for dy in (-1, 0, 1):
        if dx != 0 or dy != 0:
            add_shifted(farm, saplings, dx, dy)
And we're done.
A NumPy array with a string dtype will silently truncate any strings you try to store that are too big for the dtype. Use a list of lists.
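A minimal sketch of the list-of-lists route, assuming the farm is stored as nested Python lists with '_' for empty plots as in the question. Python strings in a list can grow freely, so the append is no longer truncated. This is a rewrite for illustration, not the asker's exact function: the TREES constant is made up here, and the function returns a new grid instead of modifying the one passed in.

```python
TREES = "FWGJM"

def spread(farm):
    """Return a new grid where each tree adds its lowercase letter to every non-tree neighbour."""
    rows, cols = len(farm), len(farm[0])
    result = [row[:] for row in farm]  # copy, so the input grid is left unchanged
    for y in range(rows):
        for x in range(cols):
            tree = farm[y][x]
            if tree not in TREES:
                continue  # only trees spread
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    # stay inside the grid
                    if not (0 <= ny < rows and 0 <= nx < cols):
                        continue
                    # never write over a tree (this also skips the tree's own plot)
                    if farm[ny][nx] in TREES:
                        continue
                    if result[ny][nx] == "_":
                        result[ny][nx] = tree.lower()   # blank plot: overwrite
                    else:
                        result[ny][nx] += tree.lower()  # occupied plot: append

    return result

small_farm = [
    ['F', '_', 'W'],
    ['_', '_', '_'],
]
print(spread(small_farm))
# [['F', 'fw', 'W'], ['f', 'fw', 'w']]
```

Unlike the original function, this version uses separate row and column counts, so it also works on grids that are not square.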

Resources