sphinx search with special characters (č š đ ž,..) - search

I am using a sphinx search server.
Currently if I want to find words as "čebula", I have to write "čebula" in the search.
I want to configure it to be able to find results with keywords "cebula" or "ćebula" etc.
I also want the same to work with for example "fällen".
It should work both ways, so searching f.e. "cesnja" should also find a result "češnja".
This is my current settings in sphinx.conf:
source src
{
type = mysql
sql_host = ****
sql_user = ****
sql_pass = ****
sql_db = ****
sql_port = 3306
sql_query = \
SELECT id, Title, Content \
FROM Post
sql_ranged_throttle = 0
sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
sql_query_pre = SET NAMES utf8
}
index idx1
{
source = src
path = /usr/local/sphinx/var/data/idx1
docinfo = extern
mlock = 0
morphology = metaphone
expand_keywords = 1
enable_star = 1
min_prefix_len = 3
min_word_len = 1
charset_type = utf-8
charset_table = 0..9, a..z, _, A..Z->a..z,-, U+002C, \
U+010C->U+010D, U+0106->U+0107, U+0160->U+0161, U+017D->U+017E, \
U+010D->c,U+0107->c, U+0161->s, U+017E->z, \
U+010D, U+0107, U+0161, U+017E
}
I've tried using different morphologys but no luck
index idx2: idx1
{
morphology = metaphone, stem_en, soundex
}

The solution to your problem is the charset_table configuration variable. The documentation for that can be found here:
http://sphinxsearch.com/docs/current.html#conf-charset-table
Here is the charset_table I use:
charset_table = 0..9, a..z, _, A..Z->a..z, U+00C0->a, U+00C1->a,
U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c, U+00C8->e,
U+00C9->e, U+00CA->e, U+00CB->e, U+00CC->i, U+00CD->i, U+00CE->i,
U+00CF->i, U+00D1->n, U+00D2->o, U+00D3->o, U+00D4->o, U+00D5->o,
U+00D6->o, U+00D8->o, U+00D9->u, U+00DA->u, U+00DB->u, U+00DC->u,
U+00DD->y, U+00E0->a, U+00E1->a, U+00E2->a, U+00E3->a, U+00E4->a,
U+00E5->a, U+00E7->c, U+00E8->e, U+00E9->e, U+00EA->e, U+00EB->e,
U+00EC->i, U+00ED->i, U+00EE->i, U+00EF->i, U+00F1->n, U+00F2->o,
U+00F3->o, U+00F4->o, U+00F5->o, U+00F6->o, U+00F8->o, U+00F9->u,
U+00FA->u, U+00FB->u, U+00FC->u, U+00FD->y, U+00FF->y, U+0100->a,
U+0101->a, U+0102->a, U+0103->a, U+0104->a, U+0105->a, U+0106->c,
U+0107->c, U+0108->c, U+0109->c, U+010A->c, U+010B->c, U+010C->c,
U+010D->c, U+010E->d, U+010F->d, U+0112->e, U+0113->e, U+0114->e,
U+0115->e, U+0116->e, U+0117->e, U+0118->e, U+0119->e, U+011A->e,
U+011B->e, U+011C->g, U+011D->g, U+011E->g, U+011F->g, U+0120->g,
U+0121->g, U+0122->g, U+0123->g, U+0124->h, U+0125->h, U+0128->i,
U+0129->i, U+012A->i, U+012B->i, U+012C->i, U+012D->i, U+012E->i,
U+012F->i, U+0130->i, U+0134->j, U+0135->j, U+0136->k, U+0137->k,
U+0139->l, U+013A->l, U+013B->l, U+013C->l, U+013D->l, U+013E->l,
U+0142->l, U+0143->n, U+0144->n, U+0145->n, U+0146->n, U+0147->n,
U+0148->n, U+014C->o, U+014D->o, U+014E->o, U+014F->o, U+0150->o,
U+0151->o, U+0154->r, U+0155->r, U+0156->r, U+0157->r, U+0158->r,
U+0159->r, U+015A->s, U+015B->s, U+015C->s, U+015D->s, U+015E->s,
U+015F->s, U+0160->s, U+0161->s, U+0162->t, U+0163->t, U+0164->t,
U+0165->t, U+0168->u, U+0169->u, U+016A->u, U+016B->u, U+016C->u,
U+016D->u, U+016E->u, U+016F->u, U+0170->u, U+0171->u, U+0172->u,
U+0173->u, U+0174->w, U+0175->w, U+0176->y, U+0177->y, U+0178->y,
U+0179->z, U+017A->z, U+017B->z, U+017C->z, U+017D->z, U+017E->z,
U+01A0->o, U+01A1->o, U+01AF->u, U+01B0->u, U+01CD->a, U+01CE->a,
U+01CF->i, U+01D0->i, U+01D1->o, U+01D2->o, U+01D3->u, U+01D4->u,
U+01D5->u, U+01D6->u, U+01D7->u, U+01D8->u, U+01D9->u, U+01DA->u,
U+01DB->u, U+01DC->u, U+01DE->a, U+01DF->a, U+01E0->a, U+01E1->a,
U+01E6->g, U+01E7->g, U+01E8->k, U+01E9->k, U+01EA->o, U+01EB->o,
U+01EC->o, U+01ED->o, U+01F0->j, U+01F4->g, U+01F5->g, U+01F8->n,
U+01F9->n, U+01FA->a, U+01FB->a, U+0200->a, U+0201->a, U+0202->a,
U+0203->a, U+0204->e, U+0205->e, U+0206->e, U+0207->e, U+0208->i,
U+0209->i, U+020A->i, U+020B->i, U+020C->o, U+020D->o, U+020E->o,
U+020F->o, U+0210->r, U+0211->r, U+0212->r, U+0213->r, U+0214->u,
U+0215->u, U+0216->u, U+0217->u, U+0218->s, U+0219->s, U+021A->t,
U+021B->t, U+021E->h, U+021F->h, U+0226->a, U+0227->a, U+0228->e,
U+0229->e, U+022A->o, U+022B->o, U+022C->o, U+022D->o, U+022E->o,
U+022F->o, U+0230->o, U+0231->o, U+0232->y, U+0233->y, U+1E00->a,
U+1E01->a, U+1E02->b, U+1E03->b, U+1E04->b, U+1E05->b, U+1E06->b,
U+1E07->b, U+1E08->c, U+1E09->c, U+1E0A->d, U+1E0B->d, U+1E0C->d,
U+1E0D->d, U+1E0E->d, U+1E0F->d, U+1E10->d, U+1E11->d, U+1E12->d,
U+1E13->d, U+1E14->e, U+1E15->e, U+1E16->e, U+1E17->e, U+1E18->e,
U+1E19->e, U+1E1A->e, U+1E1B->e, U+1E1C->e, U+1E1D->e, U+1E1E->f,
U+1E1F->f, U+1E20->g, U+1E21->g, U+1E22->h, U+1E23->h, U+1E24->h,
U+1E25->h, U+1E26->h, U+1E27->h, U+1E28->h, U+1E29->h, U+1E2A->h,
U+1E2B->h, U+1E2C->i, U+1E2D->i, U+1E2E->i, U+1E2F->i, U+1E30->k,
U+1E31->k, U+1E32->k, U+1E33->k, U+1E34->k, U+1E35->k, U+1E36->l,
U+1E37->l, U+1E38->l, U+1E39->l, U+1E3A->l, U+1E3B->l, U+1E3C->l,
U+1E3D->l, U+1E3E->m, U+1E3F->m, U+1E40->m, U+1E41->m, U+1E42->m,
U+1E43->m, U+1E44->n, U+1E45->n, U+1E46->n, U+1E47->n, U+1E48->n,
U+1E49->n, U+1E4A->n, U+1E4B->n, U+1E4C->o, U+1E4D->o, U+1E4E->o,
U+1E4F->o, U+1E50->o, U+1E51->o, U+1E52->o, U+1E53->o, U+1E54->p,
U+1E55->p, U+1E56->p, U+1E57->p, U+1E58->r, U+1E59->r, U+1E5A->r,
U+1E5B->r, U+1E5C->r, U+1E5D->r, U+1E5E->r, U+1E5F->r, U+1E60->s,
U+1E61->s, U+1E62->s, U+1E63->s, U+1E64->s, U+1E65->s, U+1E66->s,
U+1E67->s, U+1E68->s, U+1E69->s, U+1E6A->t, U+1E6B->t, U+1E6C->t,
U+1E6D->t, U+1E6E->t, U+1E6F->t, U+1E70->t, U+1E71->t, U+1E72->u,
U+1E73->u, U+1E74->u, U+1E75->u, U+1E76->u, U+1E77->u, U+1E78->u,
U+1E79->u, U+1E7A->u, U+1E7B->u, U+1E7C->v, U+1E7D->v, U+1E7E->v,
U+1E7F->v, U+1E80->w, U+1E81->w, U+1E82->w, U+1E83->w, U+1E84->w,
U+1E85->w, U+1E86->w, U+1E87->w, U+1E88->w, U+1E89->w, U+1E8A->x,
U+1E8B->x, U+1E8C->x, U+1E8D->x, U+1E8E->y, U+1E8F->y, U+1E96->h,
U+1E97->t, U+1E98->w, U+1E99->y, U+1EA0->a, U+1EA1->a, U+1EA2->a,
U+1EA3->a, U+1EA4->a, U+1EA5->a, U+1EA6->a, U+1EA7->a, U+1EA8->a,
U+1EA9->a, U+1EAA->a, U+1EAB->a, U+1EAC->a, U+1EAD->a, U+1EAE->a,
U+1EAF->a, U+1EB0->a, U+1EB1->a, U+1EB2->a, U+1EB3->a, U+1EB4->a,
U+1EB5->a, U+1EB6->a, U+1EB7->a, U+1EB8->e, U+1EB9->e, U+1EBA->e,
U+1EBB->e, U+1EBC->e, U+1EBD->e, U+1EBE->e, U+1EBF->e, U+1EC0->e,
U+1EC1->e, U+1EC2->e, U+1EC3->e, U+1EC4->e, U+1EC5->e, U+1EC6->e,
U+1EC7->e, U+1EC8->i, U+1EC9->i, U+1ECA->i, U+1ECB->i, U+1ECC->o,
U+1ECD->o, U+1ECE->o, U+1ECF->o, U+1ED0->o, U+1ED1->o, U+1ED2->o,
U+1ED3->o, U+1ED4->o, U+1ED5->o, U+1ED6->o, U+1ED7->o, U+1ED8->o,
U+1ED9->o, U+1EDA->o, U+1EDB->o, U+1EDC->o, U+1EDD->o, U+1EDE->o,
U+1EDF->o, U+1EE0->o, U+1EE1->o, U+1EE2->o, U+1EE3->o, U+1EE4->u,
U+1EE5->u, U+1EE6->u, U+1EE7->u, U+1EE8->u, U+1EE9->u, U+1EEA->u,
U+1EEB->u, U+1EEC->u, U+1EED->u, U+1EEE->u, U+1EEF->u, U+1EF0->u,
U+1EF1->u, U+1EF2->y, U+1EF3->y, U+1EF4->y, U+1EF5->y, U+1EF6->y,
U+1EF7->y, U+1EF8->y, U+1EF9->y

Here is a resource that can provide you with charsets for most languages. Just copy and paste these charsets for the languages you want to support - just check for overlaps between the charsets. Especially between Latin and other european languages.
http://sphinxsearch.com/wiki/doku.php?id=charset_tables

Related

Python: How to extract addresses from a sentence/paragraph (non-Regex approach)?

I was working on a project which needed me to extract addresses from a sentence.
For e.g. Input sentence: Hi, Mr. Sam D. Richards lives here Shop No / 123, 3rd Floor, ABC Building, Behind CDE Mart, Aloha Road, 12345. If you need any help, call me on 12345678
I am trying to extract just the address i.e. Shop No / 123, 3rd Floor, ABC Building, Behind CDE Mart, Aloha Road, 12345
What I have tried so far:
I tried Pyap which also works on Regex so it is not able to generalize it better for addresses of countries other than US/Canada/UK. I realized that we cannot use Regex as there is no pattern to the address or the sentences whatsoever. Also tried locationtagger which only manages to return the country or the city.
Is there any better way of doing it?
If there is no obvious pattern for regex, you can try an ML-based approach. There is a well known problem named entity recognition (NER), and it is typically solved as a sequence tagging problem: a model is trained to predict for each token (e.g. a word or a subword) whether it is a part of address or not.
You can look for a model that is already trained to extract addresses (e.g. here https://huggingface.co/models?search=address), or fine-tune a BERT-based model on your own dataset (here is a recipe).
Addresses have a well known structure. With a grammar parser it should be possible to parse them.
PyParsing has a feature of scanning that searches for pattern without parsing all the rest of the file. You can try this feature. I have an example for you, that detects three addresses in the example string.
#!/bin/python3
from pyparsing import *
GermanWord = Word("ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ", alphas + "ß")
GermanWordComposition = GermanWord + (ZeroOrMore(Optional(Literal("-")) + GermanWord))
GermanName = GermanWordComposition
GermanStreet = GermanWordComposition
GermanHouseNumber = Word(nums) + Optional(Word(alphas, exact=1) + FollowedBy(White()))
GermanAddressSeparator = Literal(",") | Literal("in")
GermanPostCode = Word(nums, exact=5)
GermanTown = GermanWordComposition
German_Address = GermanName + GermanAddressSeparator + GermanStreet + GermanHouseNumber \
+ GermanAddressSeparator + GermanPostCode + GermanTown
EnglishWord = Word("ABCDEFGHIJKLMNOPQRSTUVWXYZ", alphanums)
EnglishNumber = Word(nums)
EnglishComposition = OneOrMore(EnglishWord)
EnglishExtension = Word("-/", exact=1) + (EnglishComposition | EnglishNumber)
EnglishAddressSeparator = Literal(",")
EnglishFloor = (Literal("1st") | Literal("2nd") | Literal("3rd") | (Combine(EnglishNumber + Literal("th")))) + Literal("Floor")
EnglishWhere = EnglishComposition
EnglishStreet = EnglishComposition
EnglishAddress = EnglishComposition + Optional(EnglishExtension) \
+ EnglishAddressSeparator + Optional(EnglishFloor) \
+ Optional(EnglishAddressSeparator + EnglishWhere) \
+ Optional(EnglishAddressSeparator + EnglishWhere) \
+ EnglishAddressSeparator + EnglishStreet + EnglishAddressSeparator + EnglishNumber
Address = EnglishAddress | German_Address
test_1 = "I am writing to Peter Meyer, Moritzstraße 22, 54543 Musterdorf a letter. But the letter arrived at \
Hubert Figge, Große Straße 14 in 45434 Berlin. In the letter was written: Hi, Mr. Sam D. Richards lives here \
Shop No / 123, 3rd Floor, ABC Building, Behind CDE Mart, Aloha Road, 12345. If you need any help, call \
me on 12345678."
for i in Address.scanString(test_1):
print(i)

Remove leading dollar sign from data and improve current solution

I have string like so:
"Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
I'd like to retieve the datte from the table name, so far I use:
"\$[0-9]+"
but this yields $20220316. How do I get only the date, without $?
I'd also like to get the table name: n_Cars_12345678$20220316
So far I have this:
pattern_table_info = "\(([^\)]+)\)"
pattern_table_name = "(?<=table ).*"
table_info = re.search(pattern_table_info, message).group(1)
table = re.search(pattern_table_name, table_info).group(0)
However I'd like to have a more simpler solution, how can I improve this?
EDIT:
Actually the table name should be:
n_Cars_12345678
So everything before the "$" sign and after "table"...how can this part of the string be retrieved?
You can use a regex with two capturing groups:
table\s+([^()]*)\$([0-9]+)
See the regex demo. Details:
table - a word
\s+ - one or more whitespaces
([^()]*) - Group 1: zero or more chars other than ( and )
\$ - a $ char
([0-9]+) - Group 2: one or more digits.
See the Python demo:
import re
text = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
rx = r"table\s+([^()]*)\$([0-9]+)"
m = re.search(rx, text)
if m:
print(m.group(1))
print(m.group(2))
Output:
n_Cars_1234567
20220316
You can write a single pattern with 2 capture groups:
\(table (\w+\$(\d+))\)
The pattern matches:
\(table
( Capture group 1
\w+\$ match 1+ word characters and $
(\d+) Capture group 2, match 1+ digits
) Close group 1
\) Match )
See a Regex demo and a Python demo.
import re
s = "Job 1233:name_uuid (table n_Cars_1234567$20220316) done. Records: 24, with errors: 0."
m = re.search(r"\(table (\w+\$(\d+))\)", s)
if m:
print(m.group(1))
print(m.group(2))
Output
n_Cars_1234567$20220316
20220316

running for loop until arbitrary index (python 3.x)

So I have these strings that I split by spaces (' ') and I just rolled them into a single list I called 'keyLabelRun'
so it looks like this:
keyLabelRun[0-12]:
0 OS=Dengue
1 virus
2 3
3 PE=4
4 SV=1
5 Split=0
6
7 OS=Bacillus
8 subtilis
9 XF-1
10 GN=opuBA
11 PE=4
12 SV=1
I only want the elements that include and are after "OS=", anything else, whether it be "SV=" or "PE=" etc. I want to skip over those elements until I get to the next "OS="
The number of elements to the next "OS=" is arbitrary so that's where I'm having the problem.
This is what I'm currently trying:
OSarr = []
for i in range(len(keyLabelrun)):
if keyLabelrun[i].count('OS='):
OSarr.append(keyLabelrun[i])
if keyLabelrun[i+1].count('=') != 1:
continue
But the elements where "OS=" is not included is what is tripping me up I think.
Also at the end I'm going to join them all back together in their own elements but I feel like I will be able to handle that after this.
In my attempt, I am trying to append all elements I'm looking for in order to an new list 'OSarr'
If anyone can lend a hand, it would be much appreciated.
Thank you.
These list of strings came from a dataset that is a text file in the form:
>tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) OS=Dengue virus 3 PE=4 SV=1 Split=0
MNNQRKKTGKPSINMLKRVRNRVSTGSQLAKRFSKGLLNGQGPMKLVMAFIAFLRFLAIPPTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINKRKKTSLCLMMILPAALAFHLTSRDGEPRMIVGKNERGKSLLFKTASGINMCTLIAMDLGEMCDDTVTYKCPHITEVEPEDIDCWCNLTSTWVTYGTCNQAGEHRRDKRSVALAPHVGMGLDTRTQTWMSAEGAWRQVEKVETWALRHPGFTILALFLAHYIGTSLTQKVVIFILLMLVTPSMTMRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLCIEGKITNITTDSRCPTQGEATLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGLECSPRTGLDFNEMILLTMKNKAWMVHRQWFFDLPLPWTSGATTETPTWNRKELLVTFKNAHAKKQEVVVLGSQEGAMHTALTGATEIQNSGGTSIFAGHLKCRLKMDKLELKGMSYAMCTNTFVLKKEVSETQHGTILIKVEYKGEDVPCKIPFSTEDGQGKAHNGRLITANPVVTKKEEPVNIEAEPPFGESNIVIGIGDNALKINWYKKGSSIGKMFEATARGARRMAILGDTAWDFGSVGGVLNSLGKMVHQIFGSAYTALFSGVSWVMKIGIGVLLTWIGLNSKNTSMSFSCIAIGIITLYLGAVVQADMGCVINWKGKELKCGSGIFVTNEVHTWTEQYKFQADSPKRLATAIAGAWENGVCGIRSTTRMENLLWKQIANELNYILWENNIKLTVVVGDIIGVLEQGKRTLTPQPMELKYSWKTWGKAKIVTAETQNSSFIIDGPNTPECPSVSRAWNVWEVEDYGFGVFTTNIWLKLREVYTQLCDHRLMSAAVKDERAVHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTLWSNGVLESDMIIPKSLAGPISQHNHRPGYHTQTAGPWHLGKLELDFNYCEGTTVVITENCGTRGPSLRTTTVSGKLIHEWCCRSCTLPPLRYMGEDGCWYGMEIRPISEKEENMVKSLVSAGSGKVDNFTMGVLCLAILFEEVMRGKFGKKHMIAGVFFTFVLLLSGQITWRDMAHTLIMIGSNASDRMGMGVTYLALIATFKIQPFLALGFFLRKLTSRENLLLGVGLAMATTLQLPEDIEQMANGIALGLMALKLITQFETYQLWTALISLTCSNTIFTLTVAWRTATLILAGVSLLPVCQSSSMRKTDWLPMAVAAMGVPPLPLFIFGLKDTLKRRSWPLNEGVMAVGLVSILASSLLRNDVPMAGPLVAGGLLIACYVITGTSADLTVEKAADITWEEEAEQTGVSHNLMITVDDDGTMRIKDDETENILTVLLKTALLIVSGIFPYSIPATLLVWHTWQKQTQRSGVLWDVPSPPETQKAELEEGVYRIKQQGIFGKTQVGVGVQKEGVFHTMWHVTRGAVLTYNGKRLEPNWASVKKDLISYGGGWRLSAQWQKGEEVQVIAVEPGKNPKNFQTMPGTFQTTTGEIGAIALDFKPGTSGSPIINREGKVVGLYGNGVVTKNGGYVSGIAQTNAEPDGPTPELEEEMFKKRNLTIMDLHPGSGKTRKYLPAIVREAIKRRLRTLILAPTRVVAAEMEEALKGLPIRYQTTATKSEHTGREIVDLMCHATFTMRLLSPVRVPNYNLIIMDEAHFTDPASIAARGYISTRVGMGEAAAIFMTATPPGTADAFPQSNAPIQDEERDIPERSWNSGNEWITDFAGKTVWFVPSIKAGNDIANCLRKNGKKVIQLSRKTFDTEYQKTKLNDWDFVV
>tr|M4KW32|M4KW32_BACIU Choline ABC transporter (ATP-binding protein) OS=Bacillus subtilis XF-1 GN=opuBA PE=4 SV=1 Split=0
MLTLENVSKTYKGGKKAVNNVNLKIAKGEFICFIGPSGCGKTTTMKMINRLIEPSAGKIFIDGENIMDQDPVELRRKIGYVIQQIGLFPHMTIQQNISLVPKLLKWPEQQRKERARELLKLVDMGPEYVDRYPHELSGGQQQRIGVLRALAAEPPLILMDEPFGALDPITRDSLQEEFKKLQKTLHKTIVFVTHDMDEAIKLADRIVILKAGEIVQVGTPDDILRNPADEFVEEFIGKERLIQSSSPDVERVDQIMNTQPVTITADKTLSEAIQLMRQERVDSLLVVDDEHVLQGYVDVEIIDQCRKKANLIGEVLHEDIYTVLGGTLLRDTVRKILKRGVKYVPVVDEDRRLIGIVTRASLVDIVYDSLWGEEKQLAALS
>sp|Q8AWH3|SX17A_XENTR Transcription factor Sox-17-alpha OS=Xenopus tropicalis GN=sox17a PE=2 SV=1 Split=0
MSSPDGGYASDDQNQGKCSVPIMMTGLGQCQWAEPMNSLGEGKLKSDAGSANSRGKAEARIRRPMNAFMVWAKDERKRLAQQNPDLHNAELSKMLGKSWKALTLAEKRPFVEEAERLRVQHMQDHPNYKYRPRRRKQVKRMKRADTGFMHMAEPPESAVLGTDGRMCLESFSLGYHEQTYPHSQLPQGSHYREPQAMAPHYDGYSLPTPESSPLDLAEADPVFFTSPPQDECQMMPYSYNASYTHQQNSGASMLVRQMPQAEQMGQGSPVQGMMGCQSSPQMYYGQMYLPGSARHHQLPQAGQNSPPPEAQQMGRADHIQQVDMLAEVDRTEFEQYLSYVAKSDLGMHYHGQESVVPTADNGPISSVLSDASTAVYYCNYPSA
I got it! :D
OSarr = []
G = 0
for i in range(len(keyLabelrun)):
OSarr.append(keyLabelrun[G])
G += 1
if keyLabelrun[G].count('='):
while keyLabelrun[G].count('OS=') != 1:
G+=1
Maybe next time everyone, thank you!
Due to the syntax, you have to keep track of which part (OS, PE, etc) you're currently parsing. Here's a function to extract the species name from the FASTA header:
def extract_species(description):
species_parts = []
is_os = False
for word in description.split():
if word[:3] == 'OS=':
is_os = True
species_parts.append(word[3:])
elif '=' in word:
is_os = False
elif is_os:
species_parts.append(word)
return ' '.join(species_parts)
You can call it when processing your input file, e.g.:
from Bio import SeqIO
for record in SeqIO.parse('input.fa', 'fasta'):
species = extract_species(record.description)

Lua string.gsub with Multiple Patterns

I am working on renaming the Movie titles that has unwanted letters. The string.gsub can replace a string with "" nil value but I have around 200 string patterns that need to be replaces with "".
Right now I have to string.gsub for every pattern. I was thinking is there is a way to put all the string patterns in to single string.gsub line. I have searched around the web for the solution but still didn't got anything.
The movie title is like this B.A.Pass 2013 Hindi 720p DvDRip CROPPED AAC x264 RickyKT
and I want to remove the extra characters like 2013, Hindi, 720p, DvDRip, CROPPED, AAC, x264, RickyKT.
You can pass to string.gsub a table as the third argument like this:
local movie = "B.A.Pass 2013 Hindi 720p DvDRip CROPPED AAC x264 RickyKT"
movie = movie:gsub("%S+", {["2013"] = "", ["Hindi"] = "", ["720p"] = "",
["DvDRip"] = "", ["CROPPED"] = "", ["AAC"] = "",
["x264"] = "", ["RickyKT"] = ""})
print(movie)
Put all of the patterns in a table and then enumerate the table, calling string.gsub() for each pattern:
str = "B.A.Pass 2013 Hindi 720p DvDRip CROPPED AAC x264 RickyKT"
patterns = {"pattern1", "pattern2", "pattern3"}
for i,v in ipairs(patterns) do
str = string.gsub(str, v, "")
end
This will require many invocations of string.gsub(), but the code should be much more maintainable than having a lot of string.gsub() calls.
To avoid to write keys and values on a table for every new entry, i'd write a function to handle a numerically indexed table (the patterns being the values).
This way I dont need to write {["pattern_n"] = ""} for every new pattern.
Ex:
PATTERNS = {"2013", "Hindi", "720p", "DvDRip", "CROPPED", "AAC", "x264", "RickyKT"}
function replace(match)
local ret = nil
for i, v in ipairs(PATTERNS) do
if v:find(match) then
ret = ""
end
end
return ret
end
local movie = "B.A.Pass 2013 Hindi 720p DvDRip CROPPED AAC x264 RickyKT"
movie = movie:gsub("%S+", replace)
print(movie)
You could do it in a simple function, that way you do not need to write the code each time per string, or just put string.gsub, and the replacement value for the string you need
Function:
local large_name = "B.A.Pass 2013 Hindi 720p DvDRip CROPPED AAC x264 RickyKT"
function clean_name(str)
local v = string.gsub(str, "(.-)%s([%(%[']?%d%d%d?%d?[%)%]]?)%s*(.*)", "%1")
return v
end
print(clean_name(large_name))
Only string.gsub for value
local large_name = "B.A.Pass 2013 Hindi 720p DvDRip CROPPED AAC x264 RickyKT"
local clean_name = string.gsub(large_name, "(.-)%s([%(%[']?%d%d%d?%d?[%)%]]?)%s*(.*)", "%1")
print(clean_name)
The replacement pattern places the first value (name of the movie) separated by a space and prints it, also identifies the year as the second value, to avoid error in the titles, so it is not necessary to place all the values ​​that can exist within the name of the movie and will avoid many false positives
I add a testing function to test different movie names
local testing = {"Whiplash 2014 [1080p]",
"Anon (2018) [WEBRip] [1080p] [YTS.AM]",
"Maze Runner The Death Cure 2018 [WEBRip] [1080p] [YTS.AM]",
"12 Strong [2018] [WEBRip] [1080p] [YTS.AM]",
"Kingsman The Secret Service (2014) [1080p]",
"The Equalizer [2014] [1080p]",
"Annihilation 2018 [WEBRip] [1080p] [YTS.AM]",
"The Shawshank Redemption '94",
"Assassin's Creed 2016 HC 720p HDRip 850 MB - iExTV",
"Captain Marvel (2019) [WEBRip] [1080p] [YTS.AM]",}
for k,v in pairs(testing) do
local result = string.gsub(v, "(.-)%s([%(%[']?%d%d%d?%d?[%)%]]?)%s*(.*)", "%1")
print(result)
end
Output:
Whiplash
Anon
Maze Runner The Death Cure
12 Strong
Kingsman The Secret Service
The Equalizer
Annihilation
The Shawshank Redemption
Assassin's Creed
Captain Marvel

data.frame slicing

I hope this question is not too simple for this board.
I have created a data.frame df:
CAS Name CID
89 13010-47-4 Lomustine 3950
90 130209-82-4 Latanoprost 5311221,5282380,46705340,3890
91 130636-43-0 Nifekalant 268083
92 130929-57-6 Entacapone 5281081
and a vector vec
[1] 5282380 18471829 45923789 44308022 44266812 24883465 24867475 24867460
I would like to extract the rows of df which contains any number of vec. I tried to solve this problem by this code:
df$GC[(df$CID %in% vec)] = 1
df[df$GC==1,]
But the problem with this solution is, that I only get the rows, which contain only one number in the CID column. Rows which contain several values in CID like line 90 do not appear.
Is there an elegant solution for this problem?
Thanks in advance
Given your comment on EDi's answer (which I like) I thought I'd make a suggestion.
Squeezing comma separated values into a single column of a data frame is awkward and (in my experience) just leads to frustration. I often find it simpler to keep it in a separate data structure, a list:
dat <- read.table(text = " CAS Name CID
13010-47-4 Lomustine 3950
130209-82-4 Latanoprost 5311221,5282380,46705340,3890
130636-43-0 Nifekalant 268083
130929-57-6 Entacapone 5281081",sep = "",header = TRUE)
cid <- sapply(dat$CID,strsplit,",",USE.NAMES = FALSE)
In this form, things are often easier to work with:
ID <- c(5282380, 18471829, 45923789, 44308022, 44266812, 24883465, 24867475, 24867460, 3950)
dat[sapply(cid,function(x) {any(x %in% as.character(ID))}),]
CAS Name CID
1 13010-47-4 Lomustine 3950
2 130209-82-4 Latanoprost 5311221,5282380,46705340,3890
You can always use rownames in dat and the names of the list to keep each item straight, if you're worried about orderings changing.
(Also note that my anonymous function is assuming that ID will be found eventually by R's scoping rules; you can alter the function to pass in ID explicitly if you like.)
One way is to use grep():
> txt <- " CAS Name CID
+ 13010-47-4 Lomustine 3950
+ 130209-82-4 Latanoprost 5311221,5282380,46705340,3890
+ 130636-43-0 Nifekalant 268083
+ 130929-57-6 Entacapone 5281081
+ "
> con <- textConnection(txt)
> df <- read.table(con, header = TRUE)
> close(con)
> ID <- c(5282380, 18471829, 45923789, 44308022, 44266812, 24883465, 24867475, 24867460, 3950)
> grep(paste("\\b", ID, "\\b", sep="", collapse = "|"), dat$CID)
[1] 1 2

Resources