extract substrings using python regex - python-3.x

I would like to use a regular expression that matches any text between two strings:
sample_string= "Message ID: SM9MatRNTnMAYaylR0QgOH///qUUveBCbw==
2021-07-10T20:48:23.997Z john s (X Y Bank) -
john.s#xy.com:
[EVENT] 347376954900491 (john.s#xy.com) created room
(roomName='CSTest' roomDescription='CS Test Chat Room' COPY_DISABLED=false
READ_ONLY=false DISCOVERABLE=false MEMBER_ADD_USER_ENABLED=false
roomType=PRIVATE conversationScope=internal owningCompany=X Y
Bank)
Message ID: nsabNaqeXfuEj9mBEhvS0n///qUUveAhbw==
2021-07-10T20:48:23.997Z john s (X Y Bank) -
john.s#xy.comsays
[EVENT] 347376954900491 (john.s#xy.com) invited 347376954900486
(kerren.n#xy.com) to room (CSTest|john s|16091907435583)
Message ID: Nu/EYTkTQ5qdbqzZ0Rig8n///qUUvQ42dA==
2021-07-10T20:48:23.997Z john s (X Y Bank) -
john.s#xy.comsays
Catchyou later
Message ID: dy2yaByqhm+n88Gd3VQOhH///qUUrz8odA==
2021-07-10T20:48:23.997Z kerren n (X Y Bank) -
nancy.n#xy.comsays
KeywordContent_ Cricket is a bat-and-ball game played between two teams of
eleven players on a field at the centre of which is a 20-metre (22-yard) pitch
with a wicket at each end, each comprising two bails balanced on three stumps.
The batting side scores runs by striking the ball bowled at the wicket with
the bat, while the bowling and fielding side tries to prevent this and dismiss
each player (so they are "out").
* * *
Generated by Content Export Service | Stream Type: SymphonyPost |
Stream ID: ZZo5pRRPFC18uzlonFjya3///qUUveBHdA== | Room Type: Private |
Conversation Scope: internal | Owning Company: X Y Bank | File
Generated Date: 2021-07-10T20:48:23.997Z | Content Start Date:
2021-07-10T20:48:23.997Z | Content Stop Date: 2021-07-10T20:48:23.997Z
* * *
*** (780787) Disclaimer:
(incorporated in paris with Ref. No. ZC18, is authorised by Prudential Regulation
Authority (PRA) and regulated by Financial Conduct Authority and PRA. oyp and
its affiliates (We) monitor this confidential message meant for your
information only. We make no recommendation or offer. You should get
independent advice. We accept no liability for loss caused hereby. See market
commentary disclaimers (
http://wholesalebanking.com/en/utility/Pages/d-mkt.aspx ),
Dodd-Frank and EMIR disclosures (
http://wholesalebanking.com/en/capabilities/financialmarkets/Pages/default.aspx
) "
In this example, I would like to extract everything after emailID and keyword Messaage ID:
so expected output would be:
extracted_list =[':
[EVENT] 347376954900491 (john.s#xy.com) created room
(roomName='CSTest' roomDescription='CS Test Chat Room' COPY_DISABLED=false
READ_ONLY=false DISCOVERABLE=false MEMBER_ADD_USER_ENABLED=false
roomType=PRIVATE conversationScope=internal owningCompany=X Y
Bank)','says
[EVENT] 347376954900491 (john.s#xy.com) invited 347376954900486
(kerren.n#xy.com) to room (CSTest|john s|16091907435583)','says Catchyou later','says
KeywordContent_ Cricket is a bat-and-ball game played between two teams of
eleven players on a field at the centre of which is a 20-metre (22-yard) pitch
with a wicket at each end, each comprising two bails balanced on three stumps.
The batting side scores runs by striking the ball bowled at the wicket with
the bat, while the bowling and fielding side tries to prevent this and dismiss
each player (so they are "out").']
Note: everything after *** at last is not the part of text
What I tried so far is:
text = re.findall(r'\S+#\S+\s+(.*)Message ID', sample_string)
print (text)
##output: []

You can use
(?s)\S+#\S+?((?:says?|:)?\s.*?)\s+(?:Message ID|\* +\* +\*)
See the regex demo.
Details:
(?s) - same as re.DOTALL, inline modifier to make . match across line breaks
\S+ - one or more non-whitespace chars (can be replaced with [^\s#]+)
# - a # char
\S+? - one or more non-whitespace chars as few as possible
((?:says?|:)?\s.*?) - Group 1: an optional says/say/: and then a whitespace and then any zero or more chars as few as possible
\s+ - one or more whitespaces
(?:Message ID|\* +\* +\*) - either Message ID or * * * like substring.

Related

using regex in python to return line and some lines are skipped

I am new to Python and programming in general. My code gets data from a pdf. Then iterates over the returned lines to pull only lines meeting a regex expression. Most lines have data in the next line that need to be added to the current line which is being iterated over. The code I wrote does this successfully except that lines that do not have data in the next line to be added, returns the current line but skips the very next line.
What can I add or change to make sure that it does not skip the line.
An example of the lines that need to be iterated over are:
#Input text
CATS THIRD PARTY PAYMENT 1,664.58 0320 2,130.05
MUTUAL/IBS /GAL /0000010318908
IB TRANSFER TO 2,000.00- 0323 130.05
578441575425 10H32 28662338
FEE-INTER ACCOUNT TRANSFER ## 5.50- 0323 124.55
8419752
IB PAYMENT FROM 9,000.00 0325 9,124.55
JENNIFER LIVINGSTONE
IB PAYMENT FROM 1,000.00 0401 10,124.55
JENNIFER LIVINGSTONE
MONTHLY MANAGEMENT FEE ## 21.00- 0331 10,103.55 (This line has no description in the following line)
CREDIT TRANSFER 9,000.00 0401 19,103.55 (This line gets skipped)
ABSA BANK rent
IB TRANSFER TO 19,000.00- 0403 103.55
578441575425 11H45 286623383
bpdf = 'test pdf.pdf'
with pdfplumber.open(bpdf) as pdf:
page = pdf.pages[0]
text = page.extract_text()
print(text)
new_trn_line = re.compile(r'(\D+)(\d.*) (\d.*) (\d.*\.\d{2})')
def transactions(sentences):
for lines in sentences.split('\n'):
yield lines
my_list = transactions(text)
my_data = []
for each_line in my_list:
if new_trn_line.search(each_line):
my_next_line = next(my_list)
if not new_trn_line.search(my_next_line):
my_data.append(new_trn_line.search(each_line).group(1) + my_next_line + " " +
new_trn_line.search(each_line).group(2) + " " + new_trn_line.search(each_line).group(3))
elif re.search(new_trn_line,text):
my_data.append(each_line)
else:
continue
my_data
#Output
['CATS THIRD PARTY PAYMENT MUTUAL/IBS /GAL /0000010318908 1,664.58 0320',
'IB TRANSFER TO 578441575425 10H32 286623383 2,000.00- 0323',
'FEE-INTER ACCOUNT TRANSFER ## 8419752 5.50- 0323',
'IB PAYMENT FROM JENNIFER LIVINGSTONE 9,000.00 0325',
'IB PAYMENT FROM JENNIFER LIVINGSTONE 1,000.00 0401',
'MONTHLY MANAGEMENT FEE ## 21.00- 0331 10,103.55',
'IB TRANSFER TO 578441575425 11H45 286623383 19,000.00- 0403'\]
If you compare it to the input you will see CREDIT TRANSFER 9,000.00 0401 19,103.55 gets skipped
Your current pattern would not match (\D+)(\d.*) (\d.*) (\d.\*.\d{2}) as \* matches an asterix which is not in the example data.
If you have read the whole text from the .pdf, perhaps you could use a single pattern with capture groups, and assemble the results.
The pattern matches
^ Start of string
([^\d\n]+) Capture group 1 Match 1+ chars other than digits or newlines
(\d[\d,.-]* \d+) Capture group 2 Match a digit followed by optional repetitions of digits, comma's, dots or a hyphen. Then match a space and 1+ digits
\d[\d,.]*\.\d{2} match a digit followed by digits, comma's and dots and then match . followed by 2 digits
( Capture group 3
(?: Non capture group to repeat as a whole part
\n(?!.*\b\d[\d,.]*\.\d{2}$) Assert that the line does not end on a digit with a decimal part having a dot and 2 digits
.* Match the whole line
)* close the non capture group and optionally repeat to also match just a single previous line
) Close group 3
See a regex101 demo
import re
pattern = r"^([^\d\n]+)(\d[\d,.-]* \d+) \d[\d,.]*\.\d{2}((?:\n(?!.*\b\d[\d,.]*\.\d{2}$).*)*)"
my_data = []
text = ("CATS THIRD PARTY PAYMENT 1,664.58 0320 2,130.05\n"
"MUTUAL/IBS /GAL /0000010318908\n"
"IB TRANSFER TO 2,000.00- 0323 130.05\n"
"578441575425 10H32 28662338\n"
"FEE-INTER ACCOUNT TRANSFER ## 5.50- 0323 124.55\n"
"8419752\n"
"IB PAYMENT FROM 9,000.00 0325 9,124.55\n"
"JENNIFER LIVINGSTONE\n"
"IB PAYMENT FROM 1,000.00 0401 10,124.55\n"
"JENNIFER LIVINGSTONE\n"
"MONTHLY MANAGEMENT FEE ## 21.00- 0331 10,103.55\n"
"CREDIT TRANSFER 9,000.00 0401 19,103.55\n"
"ABSA BANK rent\n"
"IB TRANSFER TO 19,000.00- 0403 103.55\n"
"578441575425 11H45 286623383")
matches = re.finditer(pattern, text, re.MULTILINE)
for _, match in enumerate(matches):
my_data.append(f"{match.group(1)}{match.group(3).strip()} {match.group(2)}")
print(my_data)
Note sure why in your example output only MONTHLY MANAGEMENT FEE ends with 10,103.55 because all the other output lines seems to end with 4 digits.
Output
[
'CATS THIRD PARTY PAYMENT MUTUAL/IBS /GAL /0000010318908 1,664.58 0320',
'IB TRANSFER TO 578441575425 10H32 28662338 2,000.00- 0323',
'FEE-INTER ACCOUNT TRANSFER ## 8419752 5.50- 0323',
'IB PAYMENT FROM JENNIFER LIVINGSTONE 9,000.00 0325',
'IB PAYMENT FROM JENNIFER LIVINGSTONE 1,000.00 0401',
'MONTHLY MANAGEMENT FEE ## 21.00- 0331',
'CREDIT TRANSFER ABSA BANK rent 9,000.00 0401',
'IB TRANSFER TO 578441575425 11H45 286623383 19,000.00- 0403'
]

regex to match paragraph in between 2 substrings

I have a string look like this:
string=""
( 2021-07-10 01:24:55 PM GMT )TEST
---
Badminton is a racquet sport played using racquets to hit a shuttlecock across
a net. Although it may be played with larger teams, the most common forms of
the game are "singles" (with one player per side) and "doubles" (with two
players per side).
( 2021-07-10 01:27:55 PM GMT )PATRICKWARR
---
Good morning, I am doing well. And you?
---
---
* * *""
I am trying to split the String up into parts as:
text=['Badminton is a racquet sport played using racquets to hit a
shuttlecock across a net. Although it may be played with larger teams,
the most common forms of the game are "singles" (with one player per
side) and "doubles" (with two players per side).','Good morning, I am
doing well. And you?']
What I have tried as:
text=re.findall(r'\( \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2} PM GMT \)\w+ [\S\n]--- .*',string)
I'm not able get how to extract multiple lines.
You can use
(?m)^\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n---\s*\n(.*(?:\n(?!(?:\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n)?---).*)*)
See the regex demo. Details:
^ - start of line
{left_rx} - left boundary
--- - three hyphens
\s*\n - zero or more whitespaces and then an LF char
(.*(?:\n(?!(?:{left_rx})?---).*)*) - Group 1:
.* - zero or more chars other than line break chars as many as possible
(?:\n(?!(?:{left_rx})?---).*)* - zero or more (even empty, due to .*) lines that do not start with the (optional) left boundary pattern followed with ---
The boundary pattern defined in left_rx is \(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n, it is basically the same as the original, I used \s* to match any zero or more whitespaces or \s+ to match one or more whitespaces between "words".
See the Python demo:
import re
text = '''string=""\n( 2021-07-10 01:24:55 PM GMT )TEST \n--- \nBadminton is a racquet sport played using racquets to hit a shuttlecock across\na net. Although it may be played with larger teams, the most common forms of\nthe game are "singles" (with one player per side) and "doubles" (with two\nplayers per side). \n \n \n\n \n\n( 2021-07-10 01:27:55 PM GMT )PATRICKWARR \n--- \nGood morning, I am doing well. And you? \n \n \n\n \n \n \n--- \n \n \n \n \n--- \n \n* * *""'''
left_rx = r"\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*\)\w+\s*\n"
rx = re.compile(fr"^{left_rx}---\s*\n(.*(?:\n(?!(?:{left_rx})?---).*)*)", re.M)
print ( [x.strip().replace('\n', ' ') for x in rx.findall(text)] )
Output:
['Badminton is a racquet sport played using racquets to hit a shuttlecock across a net. Although it may be played with larger teams, the most common forms of the game are "singles" (with one player per side) and "doubles" (with two players per side).', 'Good morning, I am doing well. And you?']
One of the approaches:
import re
# Replace all \n with ''
string = string.replace('\n', '')
# Replace the date string '( 2021-07-10 01:27:55 PM GMT )PATRICKWARR ' and string like '* * *' with ''
string = re.sub(r"\(\s*\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2} [AP]M GMT\s*\)\w+|\*+", '', string)
data = string.split('---')
data = [item.strip() for item in data if item.strip()]
print (data)
Output:
['Badminton is a racquet sport played using racquets to hit a shuttlecock acrossa net. Although it may be played with larger teams, the most common forms ofthe game are "singles" (with one player per side) and "doubles" (with twoplayers per side).', 'Good morning, I am doing well. And you?']

unable to get rid of all emojis

I need help removing emojis. I looked at some other stackoverflow questions and this is what I am de but for some reason my code doesn't get rid of all the emojis
d= {'alexveachfashion': 'Fashion Style * Haute Couture * Wearable Tech * VR\n👓👜⌚👠\nSoundCloud is Live #alexveach\n👇New YouTube Episodes ▶️👇', 'andrewvng': 'Family | Fitness | Friends | Gym | Food', 'runvi.official': 'Accurate measurement via SMART insoles & real-time AI coaching. Improve your technique & BOOST your performance with every run.\nSoon on Kickstarter!', 'triing': 'Augmented Jewellery™️ • Montreal. Canada.', 'gedeanekenshima': 'Prof na Etec Albert Einstein, Mestranda em Automação e Controle de Processos, Engenheira de Controle e Automação, Técnica em Automação Industrial.', 'jetyourdaddy': '', 'lavonne_sun': '☄️🔭 ✨\n°●°。Visual Narrative\nA creative heart with a poetic soul.\n————————————\nPARSONS —— Design & Technology', 'taysearch': 'All the World’s Information At Your Fingertips. (Literally) Est. 1991🇺🇸 🎀#PrincessofSearch 🔎Sample 👇🏽 the Search Engine Here 🗽', 'hijewellery': 'Fine 3D printed jewellery for tech lovers #3dprintedjewelry #wearabletech #jewellery', 'yhanchristian': 'Estudante de Engenharia, Maker e viciado em café.', 'femka': 'Fashion Futurist + Fashion Tech Lab Founder #technoirlab + Fashion Designer / Parsons & CSM Grad / Obsessed with #fashiontech #future #cryptocurrency', 'sinhbisen': 'Creator, TRiiNG, augmented jewellery label ⭕️ Transhumanist ⭕️ Corporeal cartographer ⭕️', 'stellawearables': '#StellaWearables ✉️Info#StellaWearables.com Premium Wearable Technology That Monitors Personal Health & Environments ☀️🏝🏜🏔', 'ivoomi_india': 'We are the manufacturers of the most innovative technologies and user-friendly gadgets with a global presence.', 'bgutenschwager': "When it comes to life, it's all about the experience.\nGoogle Mapper 🗺\n360 Photographer 📷\nBrand Rep #QuickTutor", 'storiesofdesign': 'Putting stories at the heart of brands and businesses | Cornwall and London | #storiesofdesign', 'trume.jp': '草創期から国産ウオッチの製造に取り組み、挑戦を続けてきたエプソンが世界に放つ新ブランド「TRUME」(トゥルーム)。目指すのは、最先端技術でアナログウオッチを極めるブランド。', 'themarinesss': "I didn't choose the blog life, the blog life chose me | Aspiring Children's Book Author | www.slayathomemum.com", 'ayowearable': 'The world’s first light-based wearable that helps you sleep better, beat jet lag and have more energy! #goAYO Get yours at:', 'wearyourowntechs': 'Bringing you the latest trends, Current Products and Reviews of Wearable Technology. Discover how they can enhance your Life and Lifestyle', 'roxfordwatches': 'The Roxford | The most stylish and customizable fitness smartwatch. Tracks your steps/calories/dist/sleep. Comes with FOUR bands, and a travel case!', 'playertek': "Track your entire performance - every training session, every match. \nBecause the best players don't hide.", '_kate_hartman_': '', 'hmsmc10': 'Health & Wellness 🍎\nBoston, MA 🏙\nSuffolk MPA ‘17 🎓 \n.\nJust Strong Ambassador 🏋🏻\u200d♀️', 'gadgetxtreme': 'Dedicated to reviewing gadgets, technologies, internet products and breaking tech news. Follow us to see daily vblogs on all the disruptive tech..', 'freedom.journey.leader': '📍MN\n🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿 \n📧Ashleybp5#gmail.com \n#homeschool #bossmom #builder #momlife', 'arts_food_life': 'Life through my phone.', 'medgizmo': 'Wearable #tech: #health #healthcare #wellness #gadgets #apps. Images/links provided as information resource only; doesn’t mean we endorse referenced', 'sawearables': 'The home of wearable tech in South Africa!\n--> #WearableTech #WearableTechnology #FitnessTech Find your wearable #', 'shop.mercury': 'Changing the way you charge.⚡️\nGet exclusive product discounts, and help us reach our goal below!🔋', 'invisawear': 'PRE-ORDERS NOW AVAILABLE! Get yours 25% OFF here: #girlboss #wearabletech'}
for key in d:
print("---with emojis----")
print(d[key])
print("---emojis removed----")
x=''.join(c for c in d[key] if c <= '\uFFFF')
print(x)
output example
---with emojis----
📍MN
🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿
📧Ashleybp5#gmail.com
#homeschool #bossmom #builder #momlife
---emojis removed----
MN
Wife • Homeschooling Mom to 5 • D Y I lover • Small town living in MN.
Ashleybp5#gmail.com
#homeschool #bossmom #builder #momlife
---with emojis----
Changing the way you charge.⚡️
Get exclusive product discounts, and help us reach our goal below!🔋
---emojis removed----
Changing the way you charge.⚡️
Get exclusive product discounts, and help us reach our goal below!
There is no technical definition of what an "emoji" is. Various glyphs may be used to render printable characters, symbols, control characters and the like. What seems like an "emoji" to you may be part of normal script to others.
What you probably want to do is to look at the Unicode category of each character and filter out various categories. While this does not solve the "emoji"-definition-problem per se, you get much better control over what you are actually doing without removing, for example, literally all characters of languages spoken by 2/3 of the planet.
Instead of filtering out certain categories, you may filter everything except the lower- and uppercase letters (and numbers). However, be aware that ꙭ is not "the googly eyes emoji" but the CYRILLIC SMALL LETTER DOUBLE MONOCULAR O, which is a normal lowercase letter to millions of people.
For example:
import unicodedata
s = "🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿"
# Just filter category "symbol"
t = ''.join(c for c in s if unicodedata.category(c) not in ('So', ))
print(t)
...results in
Wife • Homeschooling Mom to 5 • D Y I lover • Small town living in MN.
This may not be emoji-free enough, yet the • is technically a form of punctuation. So filter this as well
# Filter symbols and punctuations. You may want 'Cc' as well,
# to get rid of control characters. Beware that newlines are a
# form of control-character.
t = ''.join(c for c in s if unicodedata.category(c) not in ('So', 'Po'))
print(t)
And you get
Wife Homeschooling Mom to 5 D Y I lover Small town living in MN

Specific string sorting [Python 2.7]

I am fairly new to python, and I was trying to sort this string in a certain way (Taken off a database):
6392079|||| 1.0|03/09/2017|PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP|INSULIN INFUSION PUMP / SENSOR AUGMENTED|MEDTRONIC MINIMED|18000 DEVONSHIRE STREET||NORTHRIDGE|CA|91325||US|91325||MMT-723LNAH|MMT-723LNAH|||0LP|R|01/29/2014|OYC||Y
This is the standard format for these types of strings:
MDR_REPORT_KEY|DEVICE_EVENT_KEY|IMPLANT_FLAG|DATE_REMOVED_FLAG|DEVICE_SEQUENCE_NO|DATE_RECEIVED|BRAND_NAME|GENERIC_NAME|MANUFACTURER_D_NAME|MANUFACTURER_D_ADDRESS_1|MANUFACTURER_D_ADDRESS_2|MANUFACTURER_D_CITY|MANUFACTURER_D_STATE_CODE|MANUFACTURER_D_ZIP_CODE|MANUFACTURER_D_ZIP_CODE_EXT|MANUFACTURER_D_COUNTRY_CODE|MANUFACTURER_D_POSTAL_CODE|EXPIRATION_DATE_OF_DEVICE|MODEL_NUMBER|CATALOG_NUMBER|LOT_NUMBER|OTHER_ID_NUMBER|DEVICE_OPERATOR|DEVICE_AVAILABILITY|DATE_RETURNED_TO_MANUFACTURER|DEVICE_REPORT_PRODUCT_CODE|DEVICE_AGE_TEXT|DEVICE_EVALUATED_BY_MANUFACTUR
Is there any way I can print out this string sorted with the specific datatype next to the value?
For example as an output I would like to have
Report key: 6392079
Device sequence number: 1.0
Date received: 03/09/2017
Brand name: PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP
etc.etc. with the other values. I think I would need to use the "|" as a divider to separate the data, but I'm not sure how to. I also cannot use sorting with the index number, because there are many variations of the string above which are all different lengths.
Also as you can see in the string some of the data such as device_event_key, implant_flag, date_removed_flag, and device_sequence number are absent, but there are still corresponding empty vertical slashes.
Any help would be greatly appreciated, thanks.
#nsortur, you can try the below code to get the output.
I have used the concept of list comprehension, zip() function and split(), join() methods defined on string objects.
You can try to run code online at
http://rextester.com/MBDXB29573 (Code perfectly works with Python2/Python3).
string1 = "6392079|||| 1.0|03/09/2017|PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP|INSULIN INFUSION PUMP / SENSOR AUGMENTED|MEDTRONIC MINIMED|18000 DEVONSHIRE STREET||NORTHRIDGE|CA|91325||US|91325||MMT-723LNAH|MMT-723LNAH|||0LP|R|01/29/2014|OYC||Y"
keys = ["Report key", "Device sequence number","Date received", "Brand name"];
values = [key.strip() for key in string1.split("|") if key.strip()];
output = "\n".join([key + ": " + str(value) for key, value in zip(keys, values)]);
print(output);
Output:
Report key: 6392079
Device sequence number: 1.0
Date received: 03/09/2017
Brand name: PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP
Use zip to merge the two lists into tuple pairs:
data = '6392079|||| 1.0|03/09/2017|PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP|INSULIN INFUSION PUMP / SENSOR AUGMENTED|MEDTRONIC MINIMED|18000 DEVONSHIRE STREET||NORTHRIDGE|CA|91325||US|91325||MMT-723LNAH|MMT-723LNAH|||0LP|R|01/29/2014|OYC||Y'
format = 'MDR_REPORT_KEY|DEVICE_EVENT_KEY|IMPLANT_FLAG|DATE_REMOVED_FLAG|DEVICE_SEQUENCE_NO|DATE_RECEIVED|BRAND_NAME|GENERIC_NAME|MANUFACTURER_D_NAME|MANUFACTURER_D_ADDRESS_1|MANUFACTURER_D_ADDRESS_2|MANUFACTURER_D_CITY|MANUFACTURER_D_STATE_CODE|MANUFACTURER_D_ZIP_CODE|MANUFACTURER_D_ZIP_CODE_EXT|MANUFACTURER_D_COUNTRY_CODE|MANUFACTURER_D_POSTAL_CODE|EXPIRATION_DATE_OF_DEVICE|MODEL_NUMBER|CATALOG_NUMBER|LOT_NUMBER|OTHER_ID_NUMBER|DEVICE_OPERATOR|DEVICE_AVAILABILITY|DATE_RETURNED_TO_MANUFACTURER|DEVICE_REPORT_PRODUCT_CODE|DEVICE_AGE_TEXT|DEVICE_EVALUATED_BY_MANUFACTUR'
for label, value in zip(format.split('|'), data.split('|')):
print("%s: %s" % (label.replace('_', ' ').capitalize(), value))
This outputs:
Mdr report key: 6392079
Device event key:
Implant flag:
Date removed flag:
Device sequence no: 1.0
Date received: 03/09/2017
Brand name: PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP
Generic name: INSULIN INFUSION PUMP / SENSOR AUGMENTED
Manufacturer d name: MEDTRONIC MINIMED
Manufacturer d address 1: 18000 DEVONSHIRE STREET
Manufacturer d address 2:
Manufacturer d city: NORTHRIDGE
Manufacturer d state code: CA
Manufacturer d zip code: 91325
Manufacturer d zip code ext:
Manufacturer d country code: US
Manufacturer d postal code: 91325
Expiration date of device:
Model number: MMT-723LNAH
Catalog number: MMT-723LNAH
Lot number:
Other id number:
Device operator: 0LP
Device availability: R
Date returned to manufacturer: 01/29/2014
Device report product code: OYC
Device age text:
Device evaluated by manufactur: Y
This can be achieved by simple split() method of the str, split('|') would have empty strings for the empty values between two |, and then match it with dict having attribute as key and value as value of dict
query = '6392079|||| 1.0|03/09/2017|PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP|INSULIN INFUSION PUMP / SENSOR AUGMENTED|MEDTRONIC MINIMED|18000 DEVONSHIRE STREET||NORTHRIDGE|CA|91325||US|91325||MMT-723LNAH|MMT-723LNAH|||0LP|R|01/29/2014|OYC||Y'
def get_detail(str_):
key_finder = {'Report Key': 0, 'Device Sequence Number': 4, 'Device Recieved': 5, 'Brand Name': 6}
split_by = str_.split('|')
print('Report Key : {}'.format(split_by[key_finder['Report Key']]))
print('Device Seq Num : {}'.format(split_by[key_finder['Device Sequence Number']]))
print('Device Recieved : {}'.format(split_by[key_finder['Device Recieved']]))
print('Brand Name : {}'.format(split_by[key_finder['Brand Name']]))
>>> get_detail(query)
Report Key : 6392079
Device Seq Num : 1.0
Device Recieved : 03/09/2017
Brand Name : PARADIGM REAL-TIME REVEL INSULIN INFUSION PUMP
This works because the splited string will be indexed from 0, so the Report Key will have the value in 0th index of the splitted string and so on for other values. This will be matched with the dict key_finder which has the stored index for each value.

Search in directory of files based on keywords from another file

Perl Newbie here and looking for some help.
I have a directory of files and a "keywords" file which has the attributes to search for and the attribute type.
For example:
Keywords.txt
Attribute1 boolean
Attribute2 boolean
Attribute3 search_and_extract
Attribute4 chunk
For each file in the directory, I have to:
lookup the keywords.txt
search based on Attribute type
something like the below.
IF attribute_type = boolean THEN
search for attribute;
set found = Y if attribute found;
ELSIF attribute_type = search_and_extract THEN
extract string where attribute is Found
ELSIF attribute_type = chunk THEN
extract the complete chunk of paragraph where attribute is found.
This is what I have so far and I'm sure there is a more efficient way to do this.
I'm hoping someone can guide me in the right direction to do the above.
Thanks & regards,
SiMa
# Reads attributes from config file
# First set boolean attributes. IF keyword is found in text,
# variable flag is set to Y else N
# End Code: For each text file in directory loop.
# Run the below for each document.
use strict;
use warnings;
# open Doc
open(DOC_FILE,'Final_CLP.txt');
while(<DOC_FILE>) {
chomp;
# open the file
open(FILE,'attribute_config.txt');
while (<FILE>) {
chomp;
($attribute,$attribute_type) = split("\t");
$is_boolean = ($attribute_type eq "boolean") ? "N" : "Y";
# For each boolean attribute, check if the keyword exists
# in the file and return Y or N
if ($is_boolean eq "Y") {
print "Yes\n";
# search for keyword in doc and assign values
}
print "Attribute: $attribute\n";
print "Attribute_Type: $attribute_type\n";
print "is_boolean: $is_boolean\n";
print "-----------\n";
}
close(FILE);
}
close(DOC_FILE);
exit;
It is a good idea to start your specs/question with a story ("I have a ..."). But
such a story - whether true or made up, because you can't disclose the truth -
should give
a vivid picture of the situation/problem/task
the reason(s) why all the work must be done
definitions for uncommon(ly used)terms
So I'd start with: I'm working in a prison and have to scan the emails
of the inmates for
names (like "Al Capone") mentioned anywhere in the text; the director
wants to read those mails in toto
order lines (like "weapon: AK 4711 quantity: 14"); the ordnance
officer wants those info to calculate the amount of ammunition and
rack space needed
paragraphs containing 'family'-keywords like "wife", "child", ...;
the parson wants to prepare her sermons efficiently
Taken for itself, each of the terms "keyword" (~running text) and
"attribute" (~structured text) of may be 'clear', but if both are applied
to "the X I have to search for", things get mushy. Instead of general ("chunk")
and technical ("string") terms, you should use 'real-world' (line) and
specific (paragraph) words. Samples of your input:
From: Robin Hood
To: Scarface
Hi Scarface,
tell Al Capone to send a car to the prison gate on sunday.
For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Regards
Robin
and your expected output:
--- Robin.txt ----
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife:
knife: Bowie quantity: 8
machine gun:
stinger rocket:
weapon:
weapon: AK 4711 quantity: 14
social relations paragaphs:
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Pseudo code should begin at the top level. If you start with
for each file in folder
load search list
process current file('s content) using search list
it's obvious that
load search list
for each file in folder
process current file using search list
would be much better.
Based on this story, examples, and top level plan, I would try to come
up with proof of concept code for a simplified version of the "process
current file('s content) using search list" task:
given file/text to search in and list of keywords/attributes
print file name
print "keywords:"
for each boolean item
print boolean item text
if found anywhere in whole text
print "Yes"
else
print "No"
print "order line:"
for each line item
print line item text
if found anywhere in whole text
print whole line
print "social relations paragaphs:"
for each paragraph
for each social relation item
if found
print paragraph
no need to check for other items
first implementation attempt:
use Modern::Perl;
#use English qw(-no_match_vars);
use English;
exit step_00();
sub step_00 {
# given file/text to search in
my $whole_text = <<"EOT";
From: Robin Hood
To: Scarface
Hi Scarface,
tell Al Capone to send a car to the prison gate on sunday.
For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Regards
Robin
EOT
# print file name
say "--- Robin.txt ---";
# print "keywords:"
say "keywords:";
# for each boolean item
for my $bi ("Al Capone", "Billy the Kid", "Scarface") {
# print boolean item text
printf " %s: ", $bi;
# if found anywhere in whole text
if ($whole_text =~ /$bi/) {
# print "Yes"
say "Yes";
# else
} else {
# print "No"
say "No";
}
}
# print "order line:"
say "order lines:";
# for each line item
for my $li ("knife", "machine gun", "stinger rocket", "weapon") {
# print line item text
# if found anywhere in whole text
if ($whole_text =~ /^$li.*$/m) {
# print whole line
say " ", $MATCH;
}
}
# print "social relations paragaphs:"
say "social relations paragaphs:";
# for each paragraph
for my $para (split /\n\n/, $whole_text) {
# for each social relation item
for my $sr ("wife", "son", "husband") {
# if found
if ($para =~ /$sr/) {
## if ($para =~ /\b$sr\b/) {
# print paragraph
say $para;
# no need to check for other items
last;
}
}
}
return 0;
}
output:
perl 16953439.pl
--- Robin.txt ---
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife: Bowie quantity: 8
weapon: AK 4711 quantity: 14
social relations paragaphs:
tell Al Capone to send a car to the prison gate on sunday.
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Such (premature) code helps you to
clarify your specs (Should not-found keywords go into the output?
Is your search list really flat or should it be structured/grouped?)
check your assumptions about how to do things (Should the order line
search be done on the array of lines of thw whole text?)
identify topics for further research/rtfm (eg. regex (prison!))
plan your next steps (folder loop, read input file)
(in addition, people in the know will point out all my bad practices,
so you can avoid them from the start)
Good luck!

Resources