filter list based on string pattern of hyphen separated values - python-3.x

I have a list of strings in varying formats. Some of them start with a hyphen-separated date and time followed by a 16-character string of digits and letters. I would like to keep only the strings that match this format. I've provided input and output examples below. I'm not a regex expert; could someone please suggest a slick way to do this in Python?
Input:
example_list=['2022-05-05-16-59-25-5840ZQ37F231D95W',
'wereD/22fdas/',
'mnkljlj/124kljf/oaahreljah',
'2022-09-11-16-59-25-5840XY37F231D95Z']
output:
['2022-05-05-16-59-25-5840ZQ37F231D95W',
'2022-09-11-16-59-25-5840XY37F231D95Z']
update:
using the suggestion below with re.match and list comprehension worked fine, thanks!
import re
[x for x in example_list if re.match(r"^\d{4}(-\d\d){5}-[A-Z\d]{16}$", x)]

Try this:
^\d{4}(-\d\d){5}-[A-Z\d]{16}$
Regex breakdown:
^ start of input
\d{4} 4 digits
(-\d\d){5} 5 lots of a dash then 2 digits
-[A-Z\d]{16} a dash, then 16 characters, each a capital letter or a digit
$ end of input
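As a minimal sketch of how the breakdown above could be packaged for reuse: the pattern is compiled once and applied with fullmatch(), which anchors both ends, so ^ and $ are dropped. The names PATTERN and filter_timestamped are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer, without ^ and $ because fullmatch()
# already requires the whole string to match. (?:...) is a non-capturing group.
PATTERN = re.compile(r"\d{4}(?:-\d\d){5}-[A-Z\d]{16}")


def filter_timestamped(strings):
    """Return only the strings that consist entirely of the expected format."""
    return [s for s in strings if PATTERN.fullmatch(s)]


example_list = [
    "2022-05-05-16-59-25-5840ZQ37F231D95W",
    "wereD/22fdas/",
    "mnkljlj/124kljf/oaahreljah",
    "2022-09-11-16-59-25-5840XY37F231D95Z",
]

print(filter_timestamped(example_list))
```

One difference from re.match with $: in Python, $ also matches just before a trailing newline, so a string ending in "\n" would pass the original check but is rejected by fullmatch().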

Related

Python or PySpark Regular Expression for leading or trailing defined string

I am working through a huge list of package names for customers which need to be parsed to find out price information. Sample package names are as follows:
Jan24_Package1_USD2_Rest_Of_String
Jan25_Package2_2USD_Rest_Of_String
Jan26_Package3_USD_2_Rest_Of_String
Jan24_Package4_2_USD_Rest_Of_String
In the first and third strings, USD precedes the value 2; in the other two, USD follows it. I am looking for a regular expression that returns 2 in all four cases.
I was trying with group 3 (\d+) for the following
(USD)(_*)(\d+)(_*)
This works fine for string 1 and 3, but it doesn't work with string 2 and 4.
Looking for a solution here. Thanks a lot.
This can be solved by covering both cases in one pattern and reading whichever of capture groups 2 and 3 matched:
import re
strings = ['Jan24_Package1_USD2_Rest_Of_String',
'Jan25_Package2_2USD_Rest_Of_String',
'Jan26_Package3_USD_2_Rest_Of_String',
'Jan24_Package4_2_USD_Rest_Of_String']
for string in strings:
    match = re.search(r'.*_(USD_?(\d+)|(\d+)_?USD)', string)
    if match:
        # print group 2, or group 3 if group 2 is empty
        if match.group(2):
            print(match.group(2))
        else:
            print(match.group(3))
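The same two-case idea can be wrapped in a small helper, sketched here under some assumptions: the names PRICE and extract_usd_amount are hypothetical, the outer group is made non-capturing so the two cases become groups 1 and 2, and the leading .* is dropped, which means the first occurrence in the string is returned rather than the last.

```python
import re

# Either "USD" followed by the digits, or the digits followed by "USD",
# with an optional underscore in between. A leading underscore is required,
# as in the answer's pattern.
PRICE = re.compile(r"_(?:USD_?(\d+)|(\d+)_?USD)")


def extract_usd_amount(name):
    """Return the amount next to USD as a string, or None if there is none."""
    match = PRICE.search(name)
    if match is None:
        return None
    # Exactly one of the two groups is set when the pattern matches.
    return match.group(1) or match.group(2)


strings = [
    "Jan24_Package1_USD2_Rest_Of_String",
    "Jan25_Package2_2USD_Rest_Of_String",
    "Jan26_Package3_USD_2_Rest_Of_String",
    "Jan24_Package4_2_USD_Rest_Of_String",
]

for s in strings:
    print(s, "->", extract_usd_amount(s))
```

Returning None for names without a price lets the caller decide how to handle unparseable package names instead of failing inside the loop.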

Python pandas Dataframe Regex

I need a regex that can detect whether the value of a cell in a pandas DataFrame is in the right date format. The value of the cell should be formatted like this: "2018-04-01T06:21:48+00:00".
Thanks,
Peter
I think the following would match the format you are looking for:
\d{4}-\d{2}-\d{2}[T]\d{2}:\d{2}:\d{2}[+]\d{2}:\d{2}
At a high level:
\d{4} matches 4 digits
\d{2} matches 2 digits
[T] matches only the 'T' character
[+] matches only the '+' character
This would not check the validity of the date/time - you will need another function for that.
Give the expression above a try and let us know how you make out.
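Here is a rough sketch of how the format check and the separate validity check mentioned above could be combined, using only the standard library. The names ISO_PATTERN and is_valid_timestamp are made up for illustration, and parsing the "+00:00" offset with %z assumes Python 3.7 or later.

```python
import re
from datetime import datetime

# The expression from the answer, with [T] and [+] written as T and \+.
# re.ASCII restricts \d to the digits 0-9.
ISO_PATTERN = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2}", re.ASCII
)


def is_valid_timestamp(value):
    """True if value has the expected shape and is a real date and time."""
    if not isinstance(value, str) or not ISO_PATTERN.fullmatch(value):
        return False
    try:
        # Rejects impossible values such as month 13 or hour 25.
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")
    except ValueError:
        return False
    return True


print(is_valid_timestamp("2018-04-01T06:21:48+00:00"))
print(is_valid_timestamp("2018-13-01T06:21:48+00:00"))
```

With pandas, a function like this could be applied to a column through Series.map, for example df["timestamp"].map(is_valid_timestamp), where the column name is a placeholder.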

Remove all text and characters except some

I have here some text strings
"16cg-301 -request","16cg-3368 - for review","16cg-3684 - for process"
What I would like to do is remove all the text and characters except the numbers, the letters "cg", and the hyphen within the reference code.
If the string you want to extract is always before the first space in the full string then you can use SEARCH and LEFT to extract your reference code:
=LEFT(A1,SEARCH(" ",A1)-1)
This formula would take 16cg-3368 from 16cg-3368 - for review.
I suggest using regular expressions, as described here:
How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
With a replace regex similar to this
[^\dcg]*
or a match regex like this
^([0-9cg- ]+).*
Otherwise, you could work with an unwieldy formula like this:
=CONCATENATE(IF(NOT(ISERROR(SEARCH(MID(A2;1;1);"01234567890cg-")>0));MID(A2;1;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;2;1);"01234567890cg-")>0));MID(A2;2;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;3;1);"01234567890cg-")>0));MID(A2;3;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;4;1);"01234567890cg-")>0));MID(A2;4;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;5;1);"01234567890cg-")>0));MID(A2;5;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;6;1);"01234567890cg-")>0));MID(A2;6;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;7;1);"01234567890cg-")>0));MID(A2;7;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;8;1);"01234567890cg-")>0));MID(A2;8;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;9;1);"01234567890cg-")>0));MID(A2;9;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;10;1);"01234567890cg-")>0));MID(A2;10;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;11;1);"01234567890cg-")>0));MID(A2;11;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;12;1);"01234567890cg-")>0));MID(A2;12;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;13;1);"01234567890cg-")>0));MID(A2;13;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;14;1);"01234567890cg-")>0));MID(A2;14;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;15;1);"01234567890cg-")>0));MID(A2;15;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;16;1);"01234567890cg-")>0));MID(A2;16;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;17;1);"01234567890cg-")>0));MID(A2;17;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;18;1);"01234567890cg-")>0));MID(A2;18;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;19;1);"01234567890cg-")>0));MID(A2;19;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;20;1);"01234567890cg-")>0));MID(A2;20;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;21;1);"01234567890cg-")>0));MID(A2;21;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;22;1);"01234567890cg-")>0));MID(A2;22;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;23;1);"01234567890cg-")>0));MID(A2;23;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;24;1);"01234567890cg-")>0));MID(A2;24;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;25;1);"01234567890cg-")>0));MID(A2;25;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;26;1);"01234567890cg-")>0));MID(A2;26;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;27;1);"01234567890cg-")>0));MID(A2;27;1);"");IF(NOT(
ISERROR(SEARCH(MID(A2;28;1);"01234567890cg-")>0));MID(A2;28;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;29;1);"01234567890cg-")>0));MID(A2;29;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;30;1);"01234567890cg-")>0));MID(A2;30;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;31;1);"01234567890cg-")>0));MID(A2;31;1);"");IF(NOT(ISERROR(SEARCH(MID(A2;32;1);"01234567890cg-")>0));MID(A2;32;1);""))
As written, this only works for strings of up to 32 characters.
The problem is that you will get unexpected results like this:
123cg-123 - Process => 123cg-123-c
After rereading, I think you should try a different approach from the one described in the question ;-)
If you want to return everything up to and including the last digit, then try:
=LEFT(A1,LOOKUP(2,1/ISNUMBER(-MID(A1,seq,1)),seq))
seq is a named formula: Formula ► Define Name
Name: seq
Refers to: =ROW(INDEX($1:$65535,1,1):INDEX($1:$65535,255,1))
seq returns an array of sequential numbers from 1 to 255.
mid(a1,seq,1)
returns an array consisting of the individual characters in the string in A1. The leading minus sign converts the digits from strings to numbers.
The LOOKUP function will then return the position of the last digit, and LEFT returns everything up to that position.

How to count a specific word separated by paragraphs?

So I want to be able to count the number of times a certain sequence such as "AGCT" appears in a document full of letters. However I don't just want the total amount in the document, I want how many times it shows up separated by ">".
So for example if the document contained: asdflkdafagctalkjsdjagctlkdjf>asdlfkjaagct>adjkhfhAGCTlksdjfagct>...
It would tell me:
2
1
1
since the sequence "AGCT" appears twice before the first ">" and once after the next one and once more after the third one and so on.
I do not know how to do this and any help would be appreciated.
You can use a combination of string methods and a Python list comprehension like this:
Split your text into sections on ">", and for each section count the occurrences of the wanted substring. It is actually more concise in Python than in English:
>>> mytext = "asdflkdafagctalkjsdjagctlkdjf>asdlfkjaagct>adjkhfhAGCTlksdjfagct>"
>>> count = [para.count("agct") for para in mytext.split(">")]
>>> count
[2, 1, 1, 0]
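Two details of the session above are worth noting: str.count is case-sensitive, so the uppercase AGCT in the third section is not counted, and the final 0 comes from the empty string after the last ">". If the counts should ignore case and skip empty sections, a variation could look like the following sketch; the function name count_per_section is made up for illustration.

```python
def count_per_section(text, word, sep=">"):
    """Count case-insensitive, non-overlapping occurrences of word per section."""
    word = word.lower()
    return [
        section.lower().count(word)
        for section in text.split(sep)
        if section  # skip empty sections, such as the one after a trailing separator
    ]


mytext = "asdflkdafagctalkjsdjagctlkdjf>asdlfkjaagct>adjkhfhAGCTlksdjfagct>"
print(count_per_section(mytext, "AGCT"))  # [2, 1, 2]
```

Whether the third section should count as 1 or 2 depends on whether uppercase matches are wanted; drop the two lower() calls to keep the case-sensitive behavior.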

Return the characters after Nth character in a string

I need help! Can someone please let me know how to return the characters after the nth character?
For example, the strings I have are "001 baseball" and "002 golf", and I want my code to return baseball and golf, not the number part. Since the word after the number is not always the same length, I cannot use = Right(String, n)
Any help will be greatly appreciated
If your numbers are always 4 digits long:
=RIGHT(A1,LEN(A1)-5) //'0001 Baseball' returns Baseball
If the numbers are variable (i.e. could be more or less than 4 digits) then:
=RIGHT(A1,LEN(A1)-FIND(" ",A1,1)) //'123456 Baseball' returns Baseball
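Mid(strYourString, 4) (i.e. without the optional length argument) returns the substring starting at the 4th character and running to the end of the string. Mid counts from 1, so for "001 baseball" the 4th character is the space; use Mid(strYourString, 5) to get baseball without the leading space.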
Alternately, you could do a Text to Columns with space as the delimiter.
Since there is the [vba] tag, split is also easy:
str1 = "001 baseball"
str2 = Split(str1)
Then use str2(1).
Another formula option is to use REPLACE function to replace the first n characters with nothing, e.g. if n = 4
=REPLACE(A1,1,4,"")
