How do I detect special strings and classify data from strings in PHP?

Text: Damage{amount=10;cause=custom}
Output like:
$damage = ["amount" => 10, "cause" => "custom"];
I need to parse strings like this so that I can detect the name Damage and extract its fields: the amount is 10 and the cause is custom.
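One possible approach, sketched here in Python rather than PHP: match the name and the braced body with a regular expression, then split the body on ";" and "=". The function name parse_tag is hypothetical, and the sketch assumes values never contain ";", "=" or braces. The pattern uses ordinary PCRE-style syntax, so it should carry over to PHP's preg_match.

```python
import re

def parse_tag(text):
    """Split 'Name{key=value;key=value}' into (name, dict of fields).

    Hypothetical helper; assumes values contain no ';', '=' or braces.
    """
    match = re.fullmatch(r"(\w+)\{(.*)\}", text.strip())
    if match is None:
        return None  # not in the expected Name{...} form
    name, body = match.groups()
    fields = {}
    for pair in body.split(";"):
        if not pair:
            continue  # tolerate an empty body or a trailing ';'
        key, _, value = pair.partition("=")
        value = value.strip()
        # Purely numeric values become ints, everything else stays a string.
        fields[key.strip()] = int(value) if value.isdigit() else value
    return name, fields

print(parse_tag("Damage{amount=10;cause=custom}"))
# ('Damage', {'amount': 10, 'cause': 'custom'})
```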

Related

How to split a String by bodySize in Groovy Script

Before anything else, I hope that this world situation is not affecting you too much and that you can be as long as possible at home and in good health.
You see, I'm very, very new to Groovy Script and I have a question: How can I separate a String based on its body size?
Assuming that the String is 3,000 characters long, I get the body with
def body = message.getBody(java.lang.String) as String
and its size with
def bodySize = body.getBytes().length
I should be able to separate it into 500-character segments and save each segment in a different variable (which I will later set in a property).
I read some examples but I can't adjust them to what I need.
Thank you very much in advance.
Assuming it's ok to have a List of segment strings, you can simply do:
def segments = body.toList().collate(500)*.join()
This splits the body into a list of characters, collates these into 500 length groups, and then joins each group back to a String.
As a small example
def body = 'abcdefghijklmnopqrstuvwxyz'
def segments = body.toList().collate(5)*.join()
Then segments equals
['abcde', 'fghij', 'klmno', 'pqrst', 'uvwxy', 'z']

Identify numbers, in a large data string, that are prefixed to an alphabet up to 2 positions in between other characters

I have a string containing thousands of lines of this data without line break (only a few lines shown for readability with line break)
5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital
7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital
Format is
(entry number)(district)(patient number)(age)(gender)(case of)(symptoms)(comorbidity)(date of death)(place of death)
without spaces, or brackets.
Problem: The data I want to collect is the age.
However, I can't seem to find a way to single out the age, since it's surrounded by a lot of other numbers in the data. I have tried various iterations of count, limiting it to 1 to 99, separating the data, etc., and failed.
My idea: Since the gender is always either 'M' or 'F', and the two digits before the gender are the age, isolating the two digits before the gender seems like an ideal solution.
xxM
xxF
My goal: I would like to collect all the xx numbers irrespective of gender and store them in a list. How do I go about this?
import re
input_str = '5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital'
# [MF] rather than [M,F]: a comma inside a character class matches a literal comma
ages = [found[-3:-1] for found in re.findall(r'[0-9]+[MF]', input_str, re.I)]
print(ages)
# ['62', '65']
This works fine with the sample, but if a district name starts with 'M' or 'F', the entry number will be collected as well.
A workaround is to match exactly seven digits (assuming the patient number is always 5 digits and the age is always 2 digits).
ages = [found[-3:-1] for found in re.findall(r'\d{7}[MF]', input_str, re.I)]
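A variation on the same workaround, as a minimal sketch: a capturing group returns the age directly, so no slicing is needed, and allowing 1 or 2 digits also covers ages below 10. It rests on the same assumption that the patient number always has exactly 5 digits; the names AGE_PATTERN and extract_ages are made up for this illustration.

```python
import re

# Assumption: every patient number has exactly 5 digits, so the age is the
# 1 or 2 digits between the patient number and the gender letter.
AGE_PATTERN = re.compile(r'\d{5}(\d{1,2})[MF]')

def extract_ages(text):
    """Return the captured age of every record as an int."""
    return [int(age) for age in AGE_PATTERN.findall(text)]

sample = ('5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital'
          '7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital')
print(extract_ages(sample))
# [62, 65]
```

Because the pattern needs five digits before the age, a one-digit entry number followed by a district starting with 'M' or 'F' is not picked up.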
Using the structure you gave, I've built a dict of regular expressions to match each component, then put the matched pieces back into a dict.
There are ways I can imagine this will not work:
if age < 10, the age has only 1 digit, so you will pick up a digit of the patient number
there may be strings that don't match the regular expressions, which will give odd results
It's the most structured way I can think of.
import re
data = "5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital"
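# a comma inside [...] would match a literal comma, so the classes omit it
md = {
    "entrynum": "([0-9]+)",
    "district": "([A-Za-z]+)",
    "patnum_age": "([0-9]+)",
    "sex": "([MF])",
    "remainder": "(.*)$"
}
# re.split returns the captured groups; pair them with the key names in order
data_dict = {list(md.keys())[i]: tk
             for i, tk in
             enumerate([tk for tk in re.split("".join(md.values()), data) if tk != ""])
             }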
print(f"Assumed age:{data_dict['patnum_age'][-2:]}\nparsed:{data_dict}\n")
output
Assumed age:62
parsed:{'entrynum': '5', 'district': 'BengaluruUrban', 'patnum_age': '4598962', 'sex': 'M', 'remainder': 'SARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital'}
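As the output shows, everything after the first gender letter ends up in 'remainder', so only the first record is parsed. A hedged sketch of one way to walk through every record uses re.finditer with named groups. It assumes a 5-digit patient number and a 1 or 2 digit age (neither is guaranteed by the question), and the names RECORD and parse_records are invented for this example.

```python
import re

# Assumptions: 5-digit patient number, 1-2 digit age, gender 'M' or 'F'.
RECORD = re.compile(
    r'(?P<entrynum>\d+)'
    r'(?P<district>[A-Za-z]+?)'  # lazy: stop as soon as the digits can match
    r'(?P<patnum>\d{5})'
    r'(?P<age>\d{1,2})'
    r'(?P<sex>[MF])'
)

def parse_records(text):
    """Return one dict per record found; the free-text tail is skipped."""
    return [m.groupdict() for m in RECORD.finditer(text)]

data = ("5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital"
        "7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital")
for record in parse_records(data):
    print(record)
# {'entrynum': '5', 'district': 'BengaluruUrban', 'patnum': '45989', 'age': '62', 'sex': 'M'}
# {'entrynum': '7', 'district': 'DakshinaKannada', 'patnum': '47866', 'age': '65', 'sex': 'F'}
```

The dates in the tail are not mistaken for records because they never contain letters followed by five or more digits and a gender letter.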

Matching all elements from a list with each entry of an input

I have two text files. One is a list of key-value pairs, and the other is an input file against which the keys are to be matched. When a match is found, it is marked in the input file with its corresponding value.
For example:
my list file:
food = ###food123
food microbiology = ###food mircobiology456
mirco organism = ###micro organims789
geo tagging = ###geo tagging614
gross income = ###gross income630
fermentation = fermentation###929
contamination = contamination##878
Salmonella species = Salmonella species###786
Lactic acid bacteria = Lactic acid bacteria###654
input file:
There are certain attributes necessary for fermentation of meat.
It should be fresh, with low level of contamination, the processing should be hygienic and the refrigeration should be resorted at different stages.
Elimination of pathogens like Coliform, Staphylococci, Salmonella species may be undertaken either by heat or by irradiation. There is need to lower the water activity and this can be achieved by either drying or addition of the salts.
Inoculation with an effective, efficient inoculum consisting of Lactic acid bacteria and, or Micrococci which produces lactic acid and also contributes to the flavor development of the product.
Effective controlled time, temperature humidity during the production is essential.
And, Salt ensures the low pH value and extends the shelf-life of the fermented meats like Sausages.
Expected Output:
There are certain attributes necessary for ((fermentation###929)) of meat.
It should be fresh, with low level of ((contamination##878)), the processing should be hygienic and the refrigeration should be resorted at different stages.
Elimination of pathogens like Coliform, Staphylococci, ((Salmonella species###786)) may be undertaken either by heat or by irradiation. There is need to lower the water activity and this can be achieved by either drying or addition of the salts.
Inoculation with an effective, efficient inoculum consisting of ((Lactic acid bacteria###654)) and, or Micrococci which produces lactic acid and also contributes to the flavor development of the product.
Effective controlled time, temperature humidity during the production is essential.
And, Salt ensures the low pH value and extends the shelf-life of the fermented meats like Sausages.
For this I am using Python 3: I parse the list file and store it in a hash, so the hash holds all the elements of the list as key-value pairs. Then each line of the input file is matched against every key in the hash, and when a match is found it is replaced with the corresponding hash value, as shown in the output.
This method works fine when the input and the list are small, but when both grow it takes a lot of time.
How can I improve the time complexity of this matching method?
The algorithm I am using:
import re

# parse list and store in hash (list holds the lines of the list file)
hash = {}
for l in list:
    ll = l.split("=", 1)
    hash[ll[0].strip()] = ll[1].strip()

# iterate input and match with each key (lines holds the lines of the input file)
keys = hash.keys()
for line in lines:
    if line != "":
        for key in keys:
            my_regex = r"([,\"\'\( \/\-\|])" + re.escape(key) + r"([ ,\.!\"।\'\/\-)])"
            if re.search(my_regex, line, re.IGNORECASE | re.UNICODE):
                line = re.sub(my_regex, r"\1" + "((" + hash[key] + "))" + r"\2", line,
                              flags=re.IGNORECASE | re.UNICODE)
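One common way to avoid running a separate regex per key on every line is to compile all keys into a single alternation and make one pass per line, looking up the replacement in a callback. The sketch below is an illustration under assumptions, not a drop-in replacement: the function name annotate is hypothetical, and it uses generic (?<!\w) and (?!\w) lookarounds instead of the explicit delimiter classes above, so its boundary behaviour may differ slightly (for example, it also matches at the start of a line).

```python
import re

def annotate(lines, mapping):
    """Wrap every known key in a line with ((value)), one regex pass per line."""
    lookup = {key.lower(): value for key, value in mapping.items()}
    # Longest keys first, so "food microbiology" wins over "food".
    ordered = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in ordered)
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

    def mark(match):
        value = lookup.get(match.group(0).lower())
        # Leave the text alone if case folding produced an unknown key.
        return match.group(0) if value is None else "((" + value + "))"

    return [pattern.sub(mark, line) for line in lines]

mapping = {
    "fermentation": "fermentation###929",
    "contamination": "contamination##878",
    "Salmonella species": "Salmonella species###786",
}
lines = [
    "There are certain attributes necessary for fermentation of meat.",
    "It should be fresh, with low level of contamination, the processing should be hygienic.",
]
for line in annotate(lines, mapping):
    print(line)
```

The pattern is compiled once, so the per-line cost no longer grows with a Python-level loop over all keys. A single pass also means an inserted value is never matched again by another key, which can happen when keys are substituted one after another.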

How to sort latin after local language in python 3?

There are many situations where the user's language does not use a Latin script (examples include Greek, Russian, Chinese). In most of these cases, sorting is done by:
first sorting the special characters and numbers (numbers in the local language though...),
secondly the words in the local language's script,
and at the end, any non-native words, such as "imported" French, English or German words, in a general Unicode collation.
Or even more specific for the rest...:
is it possible to select the sort based on script?
Example1: Chinese script first then Latin-Greek-Arabic (or even more...)
Example2: Greek script first then Latin-Arabic-Chinese (or even more...)
What is the most effective and pythonic way to create a sort like any of these? (by «any» I mean either the simple «selected script first» and rest as in unicode sort, or the more complicated «selected script first» and then a specified order for rest of the scripts)
Interesting question. Here’s some sample code that classifies strings
according to the writing system of the first character.
import unicodedata

words = ["Japanese",       # English
         "Nihongo",        # Japanese, rōmaji
         "にほんご",        # Japanese, hiragana
         "ニホンゴ",        # Japanese, katakana
         "日本語",          # Japanese, kanji
         "Японский язык",  # Russian
         "जापानी भाषा"      # Hindi (Devanagari)
         ]

def wskey(s):
    """Return a sort key that is a tuple (n, s), where n is an int based
    on the writing system of the first character, and s is the passed
    string. Writing systems not addressed (Devanagari, in this example)
    go at the end."""
    sort_order = {
        # We leave gaps to make later insertions easy
        'CJK': 100,
        'HIRAGANA': 200,
        'KATAKANA': 200,  # hiragana and katakana at same level
        'CYRILLIC': 300,
        'LATIN': 400
    }
    name = unicodedata.name(s[0], "UNKNOWN")
    first = name.split()[0]
    n = sort_order.get(first, 999999)
    return (n, s)

words.sort(key=wskey)
for s in words:
    print(s)
In this example, I am sorting hiragana and katakana (the two Japanese
syllabaries) at the same level, which means pure-katakana strings will
always come after pure-hiragana strings. If we wanted to sort them such
that the same syllable (e.g., に and ニ) sorted together, that would be
trickier.
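One rough way to approach that trickier case, offered only as a sketch: the main katakana letters (U+30A1 to U+30F6) sit exactly 0x60 code points above their hiragana counterparts, so katakana can be folded to hiragana before comparing. The names fold_kana and kana_key are hypothetical, and this ignores half-width katakana, the long-vowel mark and proper linguistic collation.

```python
def fold_kana(s):
    """Map katakana letters U+30A1..U+30F6 onto the matching hiragana."""
    return "".join(
        chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c
        for c in s
    )

def kana_key(s):
    # Compare on the folded form first; the original string breaks ties,
    # so the hiragana spelling comes before the katakana one.
    return (fold_kana(s), s)

words = ["ニホンゴ", "ぬの", "なまえ", "にほんご"]
print(sorted(words))                # code-point order: all hiragana first
print(sorted(words, key=kana_key))  # に and ニ now sort together
# ['なまえ', 'にほんご', 'ぬの', 'ニホンゴ']
# ['なまえ', 'にほんご', 'ニホンゴ', 'ぬの']
```

To combine this with a script-level key such as (n, s), the folded string could be inserted as a middle element, giving (n, fold_kana(s), s).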

Generate sensible strings using a pattern

I have a table of strings (about 100,000) in following format:
pattern , string
e.g. -
*l*ph*nt , elephant
c*mp*t*r , computer
s*v* , save
s*nn] , sunny
]*rr] , worry
To simplify, assume a * denotes a vowel, a consonant stands unchanged and ] denotes either a 'y' or a 'w' (say, for instance, semi-vowels/round-vowels in phonology).
Given a pattern, what is the best way to generate the possible sensible strings? A sensible string is defined as a string having each of its consecutive two-letter substrings, that were not specified in the pattern, inside the data-set.
e.g. -
h*ll* --> hallo, hello, holla ...
'hallo' is sensible because 'ha', 'al', 'lo' can be seen in the data-set as with the words 'have', 'also', 'low'. The two letters 'll' is not considered because it was specified in the pattern.
What are the simple and efficient ways to do this?
Are there any libraries/frameworks for achieving this?
I've no specific language in mind but prefer to use java for this program.
This is particularly well suited to Python itertools, set and re operations:
import re
import itertools

VOWELS = 'aeiou'
SEMI_VOWELS = 'wy'
DATASET = '/usr/share/dict/words'
SENSIBLES = set()

def digraphs(word, digraph=r'..'):
    '''
    >>> sorted(digraphs('bar'))
    ['ar', 'ba']
    '''
    # pairs starting at even offsets, then pairs starting at odd offsets
    base = re.findall(digraph, word)
    base.extend(re.findall(digraph, word[1:]))
    return set(base)

def expand(pattern, wildcard, elements):
    '''
    >>> expand('h?', '?', 'aeiou')
    ['ha', 'he', 'hi', 'ho', 'hu']
    '''
    tokens = re.split(re.escape(wildcard), pattern)
    results = set()
    # product rather than permutations, so a letter may repeat
    # (the two e's in 'elephant', or w...y in 'worry')
    for perm in itertools.product(elements, repeat=len(tokens)):
        results.add(''.join([l for p in zip(tokens, perm) for l in p][:-1]))
    return sorted(results)

def enum(pattern):
    # digraphs fully specified by the pattern are exempt from the check
    not_sensible = digraphs(pattern, r'[^*\]]{2}')
    for p in expand(pattern, '*', VOWELS):
        for q in expand(p, ']', SEMI_VOWELS):
            if (digraphs(q) - not_sensible).issubset(SENSIBLES):
                print(q)

## Init the data-set (may be long...)
## you may want to pre-compute this
## and adapt it to your data-set.
for word in open(DATASET, 'r').readlines():
    for digraph in digraphs(word.rstrip()):
        SENSIBLES.add(digraph)

enum('*l*ph*nt')
enum('s*nn]')
enum('h*ll*')
As there aren't many possibilities for two-letter substrings, you can go through your dataset and generate a table that contains the count for every two-letter substring, so the table will look something like this:
ee 1024 times
su 567 times
...
xy 45 times
xz 0 times
The table will be small as you'll only have about 26*26 = 676 values to store.
You have to do this only once for your dataset (or update the table every time it changes, if the dataset is dynamic) and can then use the table to evaluate possible strings. For your example, add the values for 'ha', 'al' and 'lo' to get a "score" for the string 'hallo'. After that, choose the string(s) with the highest score(s).
Note that the scoring can be improved by checking longer substrings, e.g. three letters, but this will also result in larger tables.
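The table-and-score idea can be sketched in a few lines of Python. The tiny word list and the function names (bigrams, build_table, score) are made up for illustration. For simplicity the sketch adds up every two-letter substring of a candidate, including ones fixed by the pattern such as 'll'; those contribute the same amount to every candidate of a given pattern, so the ranking is unaffected.

```python
from collections import Counter

def bigrams(word):
    """All consecutive two-letter substrings of word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def build_table(words):
    """Count how often each two-letter substring occurs in the dataset."""
    table = Counter()
    for word in words:
        table.update(bigrams(word.lower()))
    return table

def score(candidate, table):
    """Sum of the dataset counts of the candidate's two-letter substrings."""
    return sum(table[b] for b in bigrams(candidate))

# Toy dataset standing in for the real 100,000-entry table.
table = build_table(["have", "also", "low", "hello"])
candidates = ["hallo", "hello", "hxllo"]
for c in sorted(candidates, key=lambda c: score(c, table), reverse=True):
    print(c, score(c, table))
# hallo 5
# hello 5
# hxllo 3
```

A Counter returns 0 for unseen substrings, so candidates containing pairs that never occur in the dataset simply score lower instead of raising an error.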
