Trouble using regex patterns in Python to find content in a document

I have a list of regex expressions that I want to find in certain docs.
x = ['\bin\sapp\sdata\b','\bin\sapp\sdata\b','\benough\sdata\b']
The patterns repeat themselves so I converted them to a set (see the first and second values in the list)
y = set(x)
When I try to find them in a specific doc it doesn't find them since it doesn't take them as a repr version:
import pandas as pd
import re
results = list()
doc = 'they wanted in app data and we did not provide it'
for value in y:
    results.append(re.findall(pattern=value, string=doc))
results = list(filter(None, results))
results
How do I overcome this?
Thanks

The problem was with Python 3.7. The error I got was "bad escape \l at position 0". Once I switched from the re module to the regex module, it worked perfectly fine, even with the "messed up" coding.
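Independently of which module is used, the patterns in the question are ordinary string literals, so Python turns '\b' into a backspace character before the regex engine ever sees it. A minimal sketch with the standard re module and raw strings, reusing the question's document and patterns (the variable names are only illustrative):

```python
import re

# In a normal string literal, '\b' is a backspace character, not a word boundary.
assert '\b' == '\x08'
# A raw string keeps the backslash, so the regex engine sees \b.
assert r'\b' == '\\b'

# A set literal removes the duplicate pattern.
patterns = {r'\bin\sapp\sdata\b', r'\benough\sdata\b'}
doc = 'they wanted in app data and we did not provide it'

# Collect every match of every pattern; patterns without a match add nothing.
results = [match for pattern in patterns for match in re.findall(pattern, doc)]
print(results)  # ['in app data']
```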

Related

ipython: print numbers with thousands separator

I am using ipython 5.8.0 on Debian 10.
This is what the output looks like:
In [1]: 50*50
Out[1]: 2500
Is it possible to configure ipython to print all numbers with thousands separators? i.e.:
In [1]: 50*50
Out[1]: 2'500
In [2]: 5000*5000
Out[2]: 25'000'000
And perhaps, is it possible to make ipython also understand thousands separators on input?
In [1]: 5'000*5'000
Out[1]: 25'000'000
UPDATE
The accepted answer from @Chayim Friedman works for integers, but does not work for floats:
In [1]: 500.1*500
Out[1]: 250050.0
Also, when it works, it uses , as the character for thousand separator:
In [1]: 500*500
Out[1]: 250,000
Can I use ' instead?

Using ' as a thousands separator in input is quite problematic because Python uses ' to delimit strings, but you can use _ instead (PEP 515, Underscores in Numeric Literals, available since Python 3.6).
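As a small illustration of PEP 515, underscores can be written in numeric literals and are ignored by the parser. The same PEP also added _ as a grouping option in format specifications, which is shown here only as a side note:

```python
# Underscores in numeric literals are purely visual and do not change the value.
a = 5_000 * 5_000
print(a)          # 25000000

# The '_' format option groups digits with underscores on output.
print(f'{a:_}')   # 25_000_000
```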
Regarding output, this is slightly harder, but can be done using IPython extensions.
Put the following Python code in a new file at ~/.ipython/extensions/thousands_separator.py:
default_int_printer = None
def print_int(number, printer, cycle):
    printer.text(f'{number:,}') # You can use `'{:,}'.format(number)` if you're using a Python version older than 3.6
def load_ipython_extension(ipython):
    global default_int_printer
    default_int_printer = ipython.display_formatter.formatters['text/plain'].for_type(int, print_int)
def unload_ipython_extension(ipython):
    ipython.display_formatter.formatters['text/plain'].for_type(int, default_int_printer)
This code tells IPython to replace the default int formatter with one that prints thousand separators when this extension is loaded, and restore the original when it is unloaded.
Edit: If you want a different separator, for instance ', replace the f'{number:,}' with f'{number:,}'.replace(',', "'").
You can load the extension using the magic command %load_ext thousands_separator and unload it using %unload_ext thousands_separator, but if you want it always, you can place it in the default profile.
Run the following code in the terminal:
ipython3 profile create
It will report that a file ~/.ipython/profile_default/ipython_config.py was created. Open it, and search for the following string:
## A list of dotted module names of IPython extensions to load.
#c.InteractiveShellApp.extensions = []
Replace it with the following:
# A list of dotted module names of IPython extensions to load.
c.InteractiveShellApp.extensions = [
    'thousands_separator'
]
This tells IPython to load this extension by default.
Done!
Edit: I saw that you want to a) use ' as separator, and b) do the same for floats:
Using a different separator is quite easy: just use str.replace():
def print_int(number, printer, cycle):
    printer.text(f'{number:,}'.replace(',', "'"))
Doing the same for floats is also easy: just set up print_int so it prints floats too. I also suggest changing the name to print_number.
Final code:
default_int_printer = None
default_float_printer = None
def print_number(number, printer, cycle):
    printer.text(f'{number:,}'.replace(',', "'"))
def load_ipython_extension(ipython):
    global default_int_printer
    global default_float_printer
    default_int_printer = ipython.display_formatter.formatters['text/plain'].for_type(int, print_number)
    default_float_printer = ipython.display_formatter.formatters['text/plain'].for_type(float, print_number)
def unload_ipython_extension(ipython):
    ipython.display_formatter.formatters['text/plain'].for_type(int, default_int_printer)
    ipython.display_formatter.formatters['text/plain'].for_type(float, default_float_printer)
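The formatting expression used by print_number can be tried outside IPython. The sketch below wraps it in a helper whose name, with_separator, is hypothetical and not part of the extension; it only shows what the expression yields for an int and a float:

```python
def with_separator(number, separator="'"):
    """Format an int or float with ',' grouping, then swap in another separator."""
    return f'{number:,}'.replace(',', separator)

print(with_separator(25000000))   # 25'000'000
print(with_separator(250050.0))   # 250'050.0
```

Because the replacement runs on the whole formatted string, this approach assumes the chosen separator is not ',' used for anything else in the output, which holds for plain ints and floats.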

After update: you can subclass int:
class Int(int):
    def __repr__(self):
        return "{:,}".format(self)
Int(1000)
# 1,000

I don't believe you can achieve everything you are looking for without rewriting the IPython interpreter, which would mean changing the Python language specification so that numbers with embedded ' characters can be entered and the separators ignored. But you can achieve some of it. Subclassing the int class is a good start, but you should also overload the various operators you plan on using. For example:
class Integer(int):
    def __str__(self):
        # if you want ' as the separator:
        return "{:,}".format(self).replace(",", "'")
    def __add__(self, x):
        return Integer(int(self) + x)
    def __mul__(self, x):
        return Integer(int(self) * x)
    """
    define other operations: __sub__, __floordiv__, __mod__, __neg__, etc.
    """
i1 = Integer(2)
i2 = Integer(1000) + 4.5 * i1
print(i2)
print(i1 * (3 + i2))
Prints:
1'009
2'024
Update
It seems that for Python 3.7 you need to override the __str__ method rather than the __repr__ method. This works for Python 3.8 and should work for later releases as well.
Update 2
import locale
locale.setlocale(locale.LC_ALL, '') # use the user's locale; the default "C" locale does not group digits
print(locale.format_string("%d", 1255000, grouping=True).replace(",", "'"))
Prints (with a locale such as en_US that groups with commas):
1'255'000
An alternative, if you have the Babel package from the PyPI repository installed:
from babel import Locale
from babel.numbers import format_number
locale = Locale('en', 'US')
locale.number_symbols['group'] = "'"
print(format_number(1255000, locale='en_US'))
Prints:
1'255'000
Or, if you prefer, you can custom-tailor a locale just for this purpose and leave the standard en_US locale unmodified. This also shows how you can parse input values:
from copy import deepcopy
from babel import Locale
from babel.numbers import format_number, parse_number
my_locale = deepcopy(Locale('en', 'US'))
my_locale.number_symbols['group'] = "'"
print(format_number(1255000, locale=my_locale))
print(parse_number("1'125'000", locale=my_locale))
Prints:
1'255'000
1125000

Based on PEP 378 (Format Specifier for Thousands Separator), you can use the following code:
a = 1200
b = 500
c = 10
#res = a
#res = a*b
res = a*b*c
dig = len(str(res)) # to figure out how many digits are required in result
print(format(res, "{},d".format(dig)))
It will produce:
6,000,000

Python: Print entire line of string match and not cut off after the period

See bottom for the solution I came up with.
Hopefully this is an easy question for you guys. I am trying to match a string against a list and print just the matched string. I was successful using re, but it cuts off the rest of the string after the period. The span reported by re is (0, 10), whereas the full string spans (0, 14), so the match drops everything after the period. I would like to learn how to make it print the entire span, or learn a different way to match a string variable against a list and print that exact string. My original attempts printed anything containing TESTPR, three results in total. Of the ones I do not want printed, one has a 1 at the front and the last has an additional R at the end. Here is my current match code:
#OLD See below
for element in catalog:
    z = re.match("((TESTPR )\w+)", element)
    if z:
        print((z.group()))
Output: TESTPR 105
It should show:
Wanted output: TESTPR 105.465
It will go up to 3 decimal places after the period and no more. I am currently taking a Python class to learn Python and love it so far, but this one has me stumped as I am just now learning about re and matching by reading as we have not gotten to that yet in class.
I am open to learning a different way to search for and match a string and print just that string. My first attempt, which prints 3 results, was this:
catalog = [ long list pulled from API then code here to make it a nice column]
prod = 'TESTPR'
print ([s for s in catalog if prod in s])
When I add a space at the end of prod I can get rid of the match with the extra character at the end, but I cannot add a space to do the same thing with the match that has an extra character at the front. This applies to the code above and not to the re match code. Thanks!
Answer below!

Since you are interested in learning about ways to match strings and solve your problem: try fuzzywuzzy.
In your case you could try:
from fuzzywuzzy import process
catalog = [long list pulled from API then code here to make it a nice column]
prod = "TESTPR"
hit = process.extractOne(prod, catalog, score_cutoff = 75) # you can adjust this to suit how close the match should be
print(hit[0]) # hit will be something like ("TESTPR 105.465", 75)
Output: TESTPR 105.465
For information on different ways of using fuzzywuzzy, check out this link.
You can use different ways of matching, such as:
fuzz.ratio
fuzz.partial_ratio
fuzz.token_sort_ratio
fuzz.token_set_ratio
For these, use from fuzzywuzzy import fuzz.

I kept at it with re.match and found the correct regex, so the entire match prints and the numbers after the period are no longer cut off.
My original match, as you can see above, was re.match("((TESTPR )\w+)", element). Some of the parentheses were unneeded, and I needed to add a few more expressions. Now it prints the correct match. See above for the old code and below for the new code that works.
# New code: replaced \w+ with \w*\d*[.,]?\d*$
for element in catalog:
    z = re.match(r"STRING\w*\d*[.,]?\d*$", element)
    if z:
        print(z.group())
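For comparison, here is a stricter sketch of the same idea. It assumes each catalog entry is exactly the product code, a space, and a number with up to three decimal places, as described in the question. The sample catalog entries are made up for illustration, and re.fullmatch is a swapped-in alternative to re.match with a trailing $:

```python
import re

# Hypothetical sample data: one wanted entry, one with a leading 1, one with an extra R.
catalog = ["1TESTPR 220.100", "TESTPR 105.465", "TESTPRR 330.000"]
prod = "TESTPR"

# Product code, one space, digits, then an optional period and 1 to 3 decimals.
pattern = re.compile(re.escape(prod) + r" \d+(?:\.\d{1,3})?")

# fullmatch requires the whole string to match, so near-misses are rejected.
matches = [element for element in catalog if pattern.fullmatch(element)]
print(matches)  # ['TESTPR 105.465']
```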

Find a specific item from a list using python

I have a list of 20000 products with their descriptions.
This shows the variety of the products.
I want to write code that searches for a particular word, say 'TAPA',
and outputs all the TAPAs.
I found this: Find a specific word from a list in python, but it uses startswith, which only matches at the start of each item. For example:
new = [x for x in df1['A'] if x.startswith('00320')]
## output ['00320671-01 Guide rail 25N/1660', '00320165S02 - Miniature rolling table']
How can I find the word when it appears at the second, third, or any other position in an item?
P.S. The list consists of strings, integers, and floats.

You can use string.find(substring) for this purpose. So in your case this should work:
new = [x for x in df1['A'] if x.find('00320') != -1]
The find() method returns the lowest index at which the substring is found, or -1 if it is not found.
To know more about usage of find() refer to Geeksforgeeks.com - Python String | find()
Edit 1:
As suggested by @Thierry in the comments, a cleaner way to do this is:
new = [x for x in df1['A'] if '00320' in x]
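Since the question mentions that the list also contains integers and floats, the in test would raise a TypeError on those items. A minimal sketch that converts each item to a string first (the sample items below are made up for illustration):

```python
# Hypothetical mixed-type list standing in for df1['A'].
items = ['00320671-01 Guide rail 25N/1660', 'TAPA 12 mm', 42, 3.14, 'Cover TAPA blue']
needle = 'TAPA'

# str(x) makes the substring test safe for ints and floats as well.
matches = [x for x in items if needle in str(x)]
print(matches)  # ['TAPA 12 mm', 'Cover TAPA blue']
```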

You can use the built-in functions of pandas to find partial string matches and generate lists:
new = df1['A'][df1['A'].astype(str).str.contains('00320')].tolist()
An advantage of pandas str.contains() is that it supports regular expressions.

Loop json results

I'm totally new to python. I have this code:
import requests
won = 'https://api.pipedrive.com/v1/deals?status=won&start=0&api_token=xxxx'
json_data = requests.get(won).json()
deal_name = json_data ['data'][0]['title']
print(deal_name)
It prints the first title for me, but I would like it to loop through all titles in the json. But I can't figure out how. Can anyone guide me in the right direction?

You want to read up on dictionaries and lists. It seems that json_data["data"] contains a list.
Seeing that you wrote this:
deal_name = json_data ['data'][0]['title']
print(deal_name)
What you are looking for is:
for i in range(len(json_data["data"])):
    print(json_data["data"][i]["title"])
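The same loop can be tried without a network call. The payload below is a made-up stand-in for the API response, assuming only that "data" holds a list of objects with a "title" key. Guarding against a missing or null "data" is an extra precaution, not something the question states is needed:

```python
# Hypothetical response shaped like the one in the question.
json_data = {"success": True, "data": [{"title": "Deal A"}, {"title": "Deal B"}]}

# Fall back to an empty list if "data" is missing or None.
deals = json_data.get("data") or []

# Iterate over the items directly instead of indexing by position.
titles = [deal["title"] for deal in deals]
for title in titles:
    print(title)
```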

Print it with a for loop
1. for item in json_data['data']: will take each element in the list json_data['data']
2. Then we print the title property of the object using the line print(item['title'])
Code:
import requests
won = 'https://api.pipedrive.com/v1/deals?status=won&start=0&api_token=xxxx'
json_data = requests.get(won).json()
for item in json_data['data']:
    print(item['title'])
If you are OK with printing the titles as a list, you can use a list comprehension. Please refer to the links in the references to learn more.
print([x['title'] for x in json_data['data']])
References:
Python Loops
Python Lists
Python Comprehensions

Parse a config-file with Python having more than one variable with the same name

Is there a way to parse a config file like this with Python3?
path = .MyAppData
prefer = newer
path = Dokumente
Please don't blame me. ;) I didn't build the software producing config files like this. But they make sense in that special context.
I know ConfigParser and configobj for Python3 but don't see a way to do this.

The ConfigParser initialiser supports the strict=False argument, which allows duplicates. But which value is retained in that case isn't mentioned in the documentation as far as I can tell.
One simple solution is to convert the lines into a dictionary yourself:
In [1]: txt = '''path = .MyAppData
...: prefer = newer
...: path = Dokumente'''
In [2]: txt.splitlines()
Out[2]: ['path = .MyAppData', 'prefer = newer', 'path = Dokumente']
(After splitting the text in lines, you might want to filter out comments and empty lines.)
In [3]: [ln.split('=') for ln in txt.splitlines()]
Out[3]: [['path ', ' .MyAppData'], ['prefer ', ' newer'], ['path ', ' Dokumente']]
In [4]: vars = [ln.split('=') for ln in txt.splitlines()]
(At this point you might want to add a filter for the inner lists so that you only have lists of length 2, indicating a successful split.)
In [5]: {a.strip(): b.strip() for a, b in vars}
Out[5]: {'path': 'Dokumente', 'prefer': 'newer'}
In the dict comprehension (In [5]), later assignments will override earlier ones.
Of course, if prefer = older, you'd have to reverse the lines before the dict comprehension.
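If every occurrence of a repeated name should be kept rather than overridden, one possible variation is to collect the values in lists. This is only a sketch: the function name parse_config is hypothetical, and treating lines that start with # as comments is an assumption about the file format.

```python
from collections import defaultdict

def parse_config(text):
    """Map each name to the list of all its values, in file order."""
    values = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        # Skip empty lines and (assumed) '#' comments.
        if not line or line.startswith('#'):
            continue
        # partition splits on the first '=' only, so values may contain '='.
        key, sep, value = line.partition('=')
        if not sep:
            continue  # not a "name = value" line
        values[key.strip()].append(value.strip())
    return dict(values)

txt = '''path = .MyAppData
prefer = newer
path = Dokumente'''
config = parse_config(txt)
print(config)  # {'path': ['.MyAppData', 'Dokumente'], 'prefer': ['newer']}
```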
