Python Math - Floating with financial output comes out incorrect - python-3.x

In a small portion of code for a retail auditing calculator, I'm attempting to allow the input of a retail value and multiply it by up to 2 entered quantities The expected (intended) result is $X*Y=$Z.
I've attempted to modify the code a couple of says and seem to be stuck on how this math is (isn't) working correctly.
I've attempted a number of different configurations in the code and the most I've achieved is the following:
#Retail value of item, whole number (i.e. $49.99 entered as 4999)
rtlVAL = input("Retail Value: ")
#Quantity of Items - can be multiplied for full stack items, default if no entry is '1'
qt1 = float(input("Quantity 1: ")) #ex. 4
qt2 = float(input("Quantity 2: ") or "1") #ex " "
#Convert the Retail Value to finacial format (i.e 4999 to $49.99)
rtl = float("{:.2}".format (rtlVAL))
# Screen Output
qtyVAL = int(qt1)*int(qt2)
print("$" + str(qtyVAL*rtl))
The entered values are:
Retail Value: 4999
Quantity 1: 4
Quantity 2: (blank)
The expected performance is 4999 * 4 * (because no entry defaults to value of 1) and the expected result is $199.96
The result of this code is $196.0, so not only is it the wrong conclusion but it's missing the two decimal places.
I'm not entirely certain why the math comes up wrong in context to expectation.
What am I missing here?
On line 9, I've tried the following:
rtl = float("{:.2f}".format (rtlVAL))
rtl = int("{:.2f}".format (rtlVAL))
The return was
ValueError: Unknown format code 'f' for object of type 'str'
if I change line 13 to:
print("$" + float(qtyVAL*rtl))
I get
TypeError: must be str, not float
using either of the prior alterations in conjunction with the latter will return the ValueError:
Python 3.4 and 3.6
I did search a few other SO questions regarding Python, Math, Floating point, and formatting but the questions were looking for and presenting something far more advances and entangled than this so i wasn't able to glean an answer to make a contextual application or it applied mainly to Python 2.7 wherein some of the code such as raw input() is simply input() and altered by int(input())in Python 3.x to step out of str value (as far as I understand for this purpose.
I did not see this as a duplicate, but if I missed something in that I do apologize - it isn't intentional.

No need to mess around with number formats:
rtl = float(rtlVAL)/100
Just divide the retail value by 100 to get the dollar value
EDIT:
Incidentally, the reason it was coming up with 196 was because your number format was taking the first two digits of rtlVAL - 49 in your case - and then multiplying by that.

Related

Web scraping Selenium Python project, TypeError: can't multiply sequence by non-int of type 'list'

I'm new to programming with Python and I'm trying to learn by doing projects.
I'm working on a web scraping project with Python Selenium. I need to get a variable price from a website for airplane tickets, and need to calculate how many times the standard (bottom price) fits into the new price. I need a result like: 1.5x, 1,6, 2.0x etc.
I managed to get the variable prices into the terminal, but when I run the calculation it gives me this error:
standard_price = 0.0411
new_price = driver.find_element(By.XPATH, '/html/body/div[1]/div/div/div[1]/div/div[2]/div[2]/div/span/div/div[3]/div/div[1]/div[2]/div[1]/div[2]/div').text
price_content = price.get_attribute('innerHTML')
emp_str = ""
for m in price_content:
if m.isdigit():
emp_str = emp_str + m
print(standard_price * emp_str)
I put the 'isdigit' function so it will only output numbers. When I run it, it gives me 4 digit number, so that is good.
But when i run the code
print(standard_price * new_price)
I get this error: TypeError: can't multiply sequence by non-int of type 'list'
I know that it has to do something with the fact that standard_price is a float, and new_price is a list, I tried a lot of things, also checked out older questions on the forum but with no results.
Is there a practical workaround?
print(standard_price * float(new_price))

How to get confidence of each line using pytesseract

I have successfully setup Tesseract and can translate the images to text...
text = pytesseract.image_to_string(Image.open(image))
However, I need to get the confidence value for every line. I cannot find a way to do this using pytesseract. Anyone know how to do this?
I know this is possible using PyTessBaseAPI, but I cannot use that, I've spent hours attempting to set it up with no luck, so I need a way to do this using pytesseract.
After much searching, I have figured out a way. Instead of image_to_string, one should use image_to_data. However, this will give you statistics for each word, not each line...
text = pytesseract.image_to_data(Image.open(file_image), output_type='data.frame')
So what I did was saved it as a dataframe, and then used pandas to group by block_num, as each line is grouped into blocks using OCR, I also removed all rows with no confidence values (-1)...
text = text[text.conf != -1]
lines = text.groupby('block_num')['text'].apply(list)
Using this same logic, you can also calculate the confidence per line by calculating the mean confidence of all words within the same block...
conf = text.groupby(['block_num'])['conf'].mean()
#Srikar Appalaraju is right. Take the following example image:
Now use the following code:
text = pytesseract.image_to_data(gray, output_type='data.frame')
text = text[text.conf != -1]
text.head()
Notice that all five rows have the same block_num, so that if we group by using that column, all the 5 words (texts) will be grouped together. But that's not what we want, we want to group only the first 3 words that belong to the first line and in order to do that properly (in a generic manner) for a large enough image we need to group by all the 4 columns page_num, block_num, par_num and line_num simulataneuosly, in order to compute the confidence for the first line, as shown in the following code snippet:
lines = text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])['text'] \
.apply(lambda x: ' '.join(list(x))).tolist()
confs = text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])['conf'].mean().tolist()
line_conf = []
for i in range(len(lines)):
if lines[i].strip():
line_conf.append((lines[i], round(confs[i],3)))
with the following desired output:
[('Ying Thai Kitchen', 91.667),
('2220 Queen Anne AVE N', 88.2),
('Seattle WA 98109', 90.333),
('« (206) 285-8424 Fax. (206) 285-8427', 83.167),
('‘uw .yingthaikitchen.com', 40.0),
('Welcome to Ying Thai Kitchen Restaurant,', 85.333),
('Order#:17 Table 2', 94.0),
('Date: 7/4/2013 7:28 PM', 86.25),
('Server: Jack (1.4)', 83.0),
('44 Ginger Lover $9.50', 89.0),
('[Pork] [24#]', 43.0),
('Brown Rice $2.00', 95.333),
('Total 2 iten(s) $11.50', 89.5),
('Sales Tax $1.09', 95.667),
('Grand Total $12.59', 95.0),
('Tip Guide', 95.0),
('TEK=$1.89, 18%=62.27, 20%=82.52', 6.667),
('Thank you very much,', 90.75),
('Cone back again', 92.667)]
The current accepted answer is not entirely correct. The correct way to get each line using pytesseract is
text.groupby(['block_num','par_num','line_num'])['text'].apply(list)
We need to do this based on this answer: Does anyone knows the meaning of output of image_to_data, image_to_osd methods of pytesseract?
Column block_num: Block number of the detected text or item
Column par_num: Paragraph number of the detected text or item
Column line_num: Line number of the detected text or item
Column word_num: word number of the detected text or item
But above all 4 columns are interconnected.If the item comes from new line then word number will start counting again from 0, it doesn't continue from previous line last word number. Same goes with line_num, par_num, block_num.

Pandas .rolling.corr using date/time offset

I am having a bit of an issue with pandas's rolling function and I'm not quite sure where I'm going wrong. If I mock up two test series of numbers:
df_index = pd.date_range(start='1990-01-01', end ='2010-01-01', freq='D')
test_df = pd.DataFrame(index=df_index)
test_df['Series1'] = np.random.randn(len(df_index))
test_df['Series2'] = np.random.randn(len(df_index))
Then it's easy to have a look at their rolling annual correlation:
test_df['Series1'].rolling(365).corr(test_df['Series2']).plot()
which produces:
All good so far. If I then try to do the same thing using a datetime offset:
test_df['Series1'].rolling('365D').corr(test_df['Series2']).plot()
I get a wildly different (and obviously wrong) result:
Is there something wrong with pandas or is there something wrong with me?
Thanks in advance for any light you can shed on this troubling conundrum.
It's very tricky, I think the behavior of window as int and offset is different:
New in version 0.19.0 are the ability to pass an offset (or
convertible) to a .rolling() method and have it produce variable sized
windows based on the passed time window. For each time point, this
includes all preceding values occurring within the indicated time
delta.
This can be particularly useful for a non-regular time frequency index.
You should checkout the doc of Time-aware Rolling.
r1 = test_df['Series1'].rolling(window=365) # has default `min_periods=365`
r2 = test_df['Series1'].rolling(window='365D') # has default `min_periods=1`
r3 = test_df['Series1'].rolling(window=365, min_periods=1)
r1.corr(test_df['Series2']).plot()
r2.corr(test_df['Series2']).plot()
r3.corr(test_df['Series2']).plot()
This code would produce similar shape of plots for r2.corr().plot() and r3.corr().plot(), but note that the calculation results still different: r2.corr(test_df['Series2']) == r3.corr(test_df['Series2']).
I think for regular time frequency index, you should just stick to r1.
This mainly because the result of two rolling 365 and 365D are different.
For example
sub = test_df.head()
sub['Series2'].rolling(2).sum()
Out[15]:
1990-01-01 NaN
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
sub['Series2'].rolling('2D').sum()
Out[16]:
1990-01-01 -0.043692
1990-01-02 -0.355230
1990-01-03 0.844281
1990-01-04 2.515529
1990-01-05 1.508412
Since there are a lot NaN in rolling 365, so the corr of two series in two way are quit different.

Predicting a users next action based on current day and time

I'm using Microsoft Azure Machine Learning Studio to try an experiment where I use previous analytics captured about a user (at a time, on a day) to try and predict their next action (based on day and time) so that I can adjust the UI accordingly. So if a user normally visits a certain page every Thursday at 1pm, then I would like to predict that behaviour.
Warning - I am a complete novice with ML, but have watched quite a few videos and worked through tutorials like the movie recommendations example.
I have a csv dataset with userid,action,datetime and would like to train a matchbox recommendation model, which, from my research appears to be the best model to use. I can't see a way to use date/time in the training. The idea being that if I could pass in a userid and the date, then the recommendation model should be able to give me a probably result of what that user is most likely to do.
I get results from the predictive endpoint, but the training endpoint gives the following error:
{
"error": {
"code": "ModuleExecutionError",
"message": "Module execution encountered an error.",
"details": [
{
"code": "18",
"target": "Train Matchbox Recommender",
"message": "Error 0018: Training dataset of user-item-rating triples contains invalid data."
}
]
}
}
Here is a link to a public version of the experiment
Any help would be appreciated.
Thanks.
Maybe this answer could be helpful, you may also take a look on this where you can read:
The problem is probably with the range of rating data. There's an upper limit for rating range, because the training gets expensive if the range between smallest and largest rating is too large.
[...]
One option would be to scale the ratings to a narrower range.
According to this MSDN, please note that you cannot have a gap between the min and max note higher than 100.
So you have to make a pre-processing on your csv file column data (userid, action, datetime etc...) in order to keep all column data in the [0-99] range.
Please see bellow a Python implementation (to share the logic):
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
big_gap_arr = [-250,-2350,850,-120,-1235,3212,1,5,65,48,265,1204,65,23,45,895,5000,3,325,3244,5482] #data with big gap
abs_min = abs(min(big_gap_arr)) #get the absolute minimal value
max_diff= ( max(big_gap_arr) + abs_min ) #get the maximal diff
specific_range_arr=[]
for each_value in big_gap_arr:
new_value = ( 99/1. * float( abs_min + each_value) / max_diff ) #get a corresponding value in the [0,99] range
specific_range_arr.append(new_value)
print specific_range_arr #post computed data => all in range [0,99]
Which give you:
[26.54494382022472, 0.0, 40.449438202247194, 28.18820224719101, 14.094101123595506, 70.3061797752809, 29.71769662921348, 29.76825842696629, 30.526685393258425, 30.31179775280899, 33.05477528089887, 44.924157303370784, 30.526685393258425, 29.995786516853933, 30.27387640449438, 41.01825842696629, 92.90730337078652, 29.742977528089888, 33.813202247191015, 70.71067415730337, 99.0]
Note that all data are now in the [0,99] range
Following this process:
User id could be float instead an integer
Action is an integer (if you got less than 100 actions) or float (if more than 100 actions)
Datetime will be splited in two integer (or one integer and one float), please see bellow:
Concerning:
(A) way to use date/time in the training
You may split your datetime in two column, something like:
one column for the weekday:
0: Sunday
1: Monday
2: Tuesday
[...]
6: Saturday
one column for the time in the day:
0: Between 00:00 & 00:15
1: Between 00:15 & 00:30
2: Between 00:30 & 00:40
[...]
95 : Between 23:45 & 00:00
If you need a better granularity (here it is a 15 min window) you may also use float number for the time column.
So from messing with this for a while, I think I may see where the issue may lie. I think that the first three inputs of the Train Matchbox Recommender would need to be filled in for an accurate prediction. I'll include screenshots of the sample for recommending restaurants, as well.
The first input would be the dataset consisting of the user, item, and rating.
The second input would be the features of each user.
And the third input would be the features of each feature (restaurant in this case).
So to help with the date/time issue, I'm wondering if the data would need to be munged to match something similar to the restaurant and user data.
I know it's not much, but I hope it helps lead you down the right track.

Python string with currency formatted

I am attempting to retrieve a dollar amount from my database, concatenate it with other information and have it print with 2 decimal points. I have tried so many different configurations I have lost tract but continue to get only one decimal point (or an error). Can someone please show me where I am losing it.
#get payments on the permits
for i in range(len(IDS)):
paydate = date.today() - timedelta(30)
query = '''select `AmountPaid_credit`, `PayComment`, `DateReceived`, `WPTSinvoiceNo`
from Payments
where `NPDES_ID` = ? and `DateReceived` >= ?'''
PayStrings.append("\tNo Payment Recieved")
for row in cur.execute(query, (IDS[i],paydate)):
WPTSInv.append(row[3])
d= row[2].strftime('%m/%d/%Y')
#create a payment string
PayStrings[i]="\t$"+str(row[0])+" - "+d+" - "+row[1]
returns this
$2000.0 - 02/09/2017 - 2017 APPLICATION FEE
but I need this
$2000.00 - 02/09/2017 - 2017 APPLICATION FEE
As an alternative to Haifeng's answer, you could do the following
from decimal import Decimal
str(round(Decimal(row[0]), 2))
Edit: I would also like to add that there is a locale module and that it is probably a better solution, even if it is a little bit more work. You can see how to use it at this question here: Currency formatting in Python
In your case, you just converting decimal places:
>>> value = "2000.0"
>>> "{:.2f}".format(float(value))
'2000.00'

Resources