Seeking guide for Python3 new feature: "f-string"

How can I use the new Python 3 feature "f-string" to output math.pi with 50 decimal places?
We can achieve it in the following two old ways:
1 ("%.50f" % math.pi)
2 '{:.50f}'.format(math.pi)
For the new "f-string" feature, I know that we can use this format:
f"the value of pi is {math.pi}", but how do I limit the output to the first 50 decimal places?
In [2]: ("%.50f"%math.pi)
Out[2]: '3.14159265358979311599796346854418516159057617187500'

Use the same format specification as with str.format, but put the expression itself in the replacement field, before the colon:
>>> import math
>>> f"{math.pi:.50f}"
'3.14159265358979311599796346854418516159057617187500'
>>> f"the value of pi is {math.pi:.50f}"
'the value of pi is 3.14159265358979311599796346854418516159057617187500'
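The part after the colon is an ordinary format specification, so the precision can also come from a variable through a nested replacement field. A minimal sketch (the names digits, text and short are only illustrative). Keep in mind that math.pi is a binary float: only about the first 15 decimal places agree with the true value of pi, and the remaining digits are the exact expansion of the stored float.

```python
import math

digits = 50  # illustrative name: number of decimal places to show

# A nested replacement field lets the precision come from a variable.
text = f"{math.pi:.{digits}f}"
print(text)

# The f-string, str.format and the % operator all accept the same ".50f" spec.
assert text == "{:.50f}".format(math.pi) == "%.50f" % math.pi

# Shorter precisions round as usual.
short = f"{math.pi:.2f}"
print(short)  # 3.14
```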

Related

How to find the string match between 2 Excel files and return the match percentage in python?

I have 2 Excel files, each of which contains names as its only column:
File 1: file1.xlsx
Names
Vinay adkz
Sagarbhansali
Jeffery Jonas
Kiara Francis
Dominic
File 2: file2.xlsx
Names:
bhansali Sagar
Dominic
Jenny
adkzVinay
Sample Output:
I want to match the names in file 1 with the names in file 2, and I am trying to get an output like the one below:
Names          File2Matchname   Match%
Vinay adkz     adkzVinay        98%
Sagarbhansali  bhansali Sagar   97%
Jeffery Jonas  NA               0%
Kiara Francis  NA               0%
Dominic        Dominic          100%
Is there any logic by which the above output can be produced in Python?
I tried to do this in Excel, but VLOOKUP doesn't help with the match percentage. I know this is possible in Python using cosine similarity, but I am unable to work out the logic that produces this output.
Any help would be much appreciated.
You can use pandas together with Python's built-in difflib library: difflib.get_close_matches() picks the closest name from the second file, and difflib.SequenceMatcher finds the longest common substring between the two names.
Example code:
import pandas as pd
import difflib

# For testing
dict_lists = {"Names": ["Vinay adkz", "Shailesh", "Seema", "Nidhi", "Ajitesh"]}
dict_lists2 = {"Names": ["Ajitesh", "Vinay adkz", "Seema", "Nid"]}

# Excel to dataframes
df1 = pd.DataFrame(dict_lists)   # pd.read_excel('file1.xlsx')
df2 = pd.DataFrame(dict_lists2)  # pd.read_excel('file2.xlsx')

# Empty lists to store the matched name and the match percentage
match_name = []
match_percent = []

# Iterate through the first dataframe
for i, row in df1.iterrows():
    name = row['Names']
    # Closest name in file 2, or an empty list if nothing reaches the cutoff
    match = difflib.get_close_matches(name, df2['Names'], n=1, cutoff=0.8)
    if match:
        match_name.append(match[0])
        # Longest common substring, as a share of the length of the file 1 name
        match_string = difflib.SequenceMatcher(None, name, match[0]).find_longest_match(0, len(name), 0, len(match[0]))
        match_percentage = (match_string.size / len(name)) * 100
        match_percent.append(match_percentage)
    else:
        match_name.append('NA')
        match_percent.append(0)

df1['File2names'] = match_name
df1['Match_per'] = match_percent
print(df1)

# Write in Excel
# df1.to_excel('output.xlsx', index=False)
I hope this helps you. This is the first time I am answering a question here.
Read also: How to use SequenceMatcher to find similarity between two strings?
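SequenceMatcher also has a ratio() method that returns an overall similarity between 0 and 1, which can be turned into a match percentage directly. A minimal sketch, assuming the names are normalised by lowercasing and removing whitespace; the helpers match_percent and best_match and the 50% cutoff are hypothetical choices, and the percentages will not equal the ones in the question's sample output.

```python
import difflib

def match_percent(a, b):
    """Similarity of two names as a percentage, ignoring case and whitespace."""
    norm_a = "".join(a.lower().split())
    norm_b = "".join(b.lower().split())
    return 100 * difflib.SequenceMatcher(None, norm_a, norm_b).ratio()

def best_match(name, candidates, cutoff=50.0):
    """Return (candidate, percent) for the most similar candidate, or ('NA', 0.0)."""
    scored = [(match_percent(name, c), c) for c in candidates]
    percent, candidate = max(scored, default=(0.0, "NA"))
    return (candidate, percent) if percent >= cutoff else ("NA", 0.0)

file1_names = ["Vinay adkz", "Sagarbhansali", "Jeffery Jonas", "Kiara Francis", "Dominic"]
file2_names = ["bhansali Sagar", "Dominic", "Jenny", "adkzVinay"]

for name in file1_names:
    candidate, percent = best_match(name, file2_names)
    print(f"{name}: {candidate} ({percent:.0f}%)")
```

Because ratio() is sensitive to character order, reordered names such as "Vinay adkz" and "adkzVinay" score well below 100%; sorting the words of each name before comparing is one possible refinement.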

sorting a pandas Series not working correctly

I am trying to sort a given Series in pandas, but as far as I can tell the result is not correct; it should be [1, 3, 5, 10, python].
Can you please explain on what basis it is sorted this way?
s1 = pd.Series(['1','3','python','10','5'])
s1.sort_values(ascending=True)
As explained in the comments, you have strings so '5' is greater than '10' (strings are compared character by character and '5' > '1').
One workaround is to use natsort for natural sorting:
from natsort import natsort_key
s1.sort_values(ascending=True, key=natsort_key)
Output:
0         1
1         3
4         5
3        10
2    python
dtype: object
Alternative without natsort (numbers first, strings after):
key = lambda s: (pd.concat([pd.to_numeric(s, errors='coerce')
                              .fillna(float('inf')), s], axis=1)
                   .agg(tuple, axis=1)
                 )
s1.sort_values(ascending=True, key=key)
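The same effect can be seen without pandas. A minimal sketch in plain Python, where the key function numbers_first is a hypothetical helper that places numeric strings first (ordered by value) and all other strings afterwards:

```python
values = ['1', '3', 'python', '10', '5']

# Plain string comparison works character by character, so '10' sorts before '3'.
lexicographic = sorted(values)
print(lexicographic)  # ['1', '10', '3', '5', 'python']

def numbers_first(text):
    """Sort key: numeric strings by value, everything else afterwards."""
    try:
        return (0, float(text), '')
    except ValueError:
        return (1, 0.0, text)

natural = sorted(values, key=numbers_first)
print(natural)  # ['1', '3', '5', '10', 'python']
```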

Pandas for Excel and selenium loop

I am trying to print out values from an Excel file; the values are numbers. My goal is to read these values and search for them on Google one by one. When a value is 'nan', the program should pause for x seconds, skip that 'nan', and then move on to the next value.
Problems faced:
1. The values are printed in scientific notation.
2. I want to pause when the value is 'nan' in Excel.
3. I copy UPC[i] into the Google search box, but I want to copy it only once, because I want to open a new tab and then copy the second UPC[i].
My solution:
1. I have 'lambda x: '%0.2f' % x' inside set_option to make it print xxxxxx.00 with 2 decimals. I would prefer an int, but this is already better than scientific notation.
2. I used 'if' to check whether upc[i] equals 'nan' (nan is what I got from print), but it still prints the range of 20 values including 'nan'.
3. I can't think of anything yet.
Code:
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains
import msvcrt
import datetime
import time
driver = webdriver.Chrome()
#Settings
pd.set_option('display.width',10, 'display.max_rows', 10, 'display.max_colwidth',100, 'display.width',10, 'display.float_format', lambda x: '%0.2f' % x)
df = pd.read_excel(r"BARCODE.xlsx", skiprows = 2, sheet_name = 'products id')
#Unnamed: 1 is also an empty column, i just didn't input UPC as title in excel.
upc = df['Unnamed: 1']
# I can't print these out as integers... they always have a trailing .0
print((upc[0:20]))
count = len(upc)
i = 0
for i in range(count):
    if upc[i] == 'nan':
        'skip for x seconds and continue, i am not sure how to do yet'
    else:
        print(int(upc[i]))
        driver.get('https://www.google.com')
        driver.find_element_by_name('q').send_keys(int(upc[i]))
    i = i + 1
Print out:
3337872411991.0
3433422408159.0
3337875598071.0
3337872412516.0
3337875518451.0
3337875613491.0
3337872413025.0
3337875398961.0
3337872410208.0
nan <- i want program to stop here so i can do something else.
3337872411991.0
3433422408159.0
3337875598071.0
3337872412516.0
3337875518451.0
3337875613491.0
3337872413025.0
3337875398961.0
3337872410208.0
nan
Name: Unnamed: 1, Length: 20, dtype: float64
3337872411991
3433422408159
3337875598071
3337872412516
3337875518451
etc....
I googled number formatting, such as setting the print format, but I got confused between .format and lambda.
It is printing out in scientific notation format
It seems your numbers are UPCs or EANs.
You can probably solve this by storing the numbers as text instead. If they must always have length 13, you can pad them with leading zeroes.
Want to stop doing something when its nan in excel
The simplest solution could be to call input() and accept any character before continuing. If you only want to wait a few seconds, time.sleep() works as well.
Copy UPC[i] into google search, but i wanted to only copy once, due to i want to design it open new tab then copy the second UPC[i]
Some points you may want to reconsider:
Iterating in Python can be done with enumerate() if you need the index values. If you do not need the index, simply iterate over the values: for value in data_frame['UPC']:
With selenium you can scrape the results directly instead of opening new tabs.
Below is a working example (at least on my machine with Python 3, Windows 10 and the Chrome exe driver).
import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys

# Settings
pd.set_option('display.width', 10, 'display.max_rows', 10, 'display.max_colwidth', 100, 'display.width', 10,
              'display.float_format', lambda x: '%0.2f' % x)
data_frame = pd.read_excel('test.xlsx', sheet_name='products id', skip_blank_lines=False)

# I have chrome driver in exe, so this is how I need to inject it to get driver out
driver = webdriver.Chrome('chromedriver.exe')
google = 'https://www.google.com'

for index, value in enumerate(data_frame['UPC']):  # named the column in excel file
    if pd.isna(value):
        print('{}: zzz'.format(index))
        sleep(2)  # will sleep for 2 seconds, use input() if you want to wait indefinitely instead
    else:
        print('{}: {} {}'.format(index, value, type(value)))
        # since given values are float, you can convert it to int
        value = int(value)
        driver.get(google)
        google_search = driver.find_element_by_name('q')
        google_search.send_keys(value)
        google_search.send_keys('\uE007')  # this is "ENTER" for committing your search in google or Keys.ENTER
        sleep(0.5)
        # you may want to wait a bit before page loads fully, then scrape info you want
        # also consider using try-except blocks if something unexpected happens

        # if you want to open new tab (windows + chrome driver)
        # open a link in a new window - workaround
        helping_link = driver.find_element_by_link_text('Help')
        actions = ActionChains(driver)
        actions.key_down(Keys.CONTROL).click(helping_link).key_up(Keys.CONTROL).perform()
        driver.switch_to.window(driver.window_handles[-1])

# close your instance of chrome driver or leave it if you need your tabs
# driver.close()
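For the nan check, see this post. A single float value has no .isnull() method, so test it with pd.isnull():
if pd.isnull(upc[i]):
    time.sleep(3)
For opening a new tab, check out this post, which boils down to: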
driver.execute_script("window.open('https://www.google.com');")
driver.switch_to.window(driver.window_handles[-1])
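The comparison upc[i] == 'nan' in the question never succeeds because an empty cell is read as the float NaN, not as the string 'nan'. A minimal sketch using only the standard library, with made-up sample values and a hypothetical helper is_missing; with pandas, pd.isna(value) performs the same check. It also shows the conversion to a 13-character string with leading zeroes.

```python
import math

upc_values = [3337872411991.0, float('nan'), 3433422408159.0]  # made-up sample data

def is_missing(value):
    """True for the float NaN that an empty Excel cell turns into."""
    return isinstance(value, float) and math.isnan(value)

nan = float('nan')
print(nan == 'nan')  # False: a float never equals a string
print(nan == nan)    # False: NaN is not even equal to itself

search_terms = []
for value in upc_values:
    if is_missing(value):
        continue  # this is where the script could sleep or wait for input()
    # int() drops the trailing .0, zfill() pads to 13 characters
    search_terms.append(str(int(value)).zfill(13))

print(search_terms)  # ['3337872411991', '3433422408159']
```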

I am getting valueError when trying to convert word2number module in pandas

I have written the code below, but it gives me this error: "ValueError: Type of input is not string! Please enter a valid number word (eg. 'two million twenty three thousand and forty nine')"
df.experience = df.experience.apply(w2n.word_to_num)
First fill the missing values (None or NaN) using fillna, and then apply your code:
df['experience'] = df['experience'].fillna('zero')
df.experience = df.experience.apply(w2n.word_to_num)
df
Note that the missing values (NaN or NA) are filled with the string 'zero', not the integer 0, because word_to_num only accepts strings.
Then try running this code.
First make sure the column is of type string; otherwise convert it to string using
df["column_name"] = df.column_name.astype(str)
then try to apply
df["column_name"] = df.column_name.apply(word_to_num)
df.experience=df.experience.fillna('zero').astype(str)
df.experience=df.experience.apply(w2n.word_to_num)
You might be wondering why it is not being executed. First of all you need to do the following:
Step 1: Run pip install word2number in your notebook.
Step 2: Import w2n using from word2number import w2n.
Step 3: dataframe.columnname = dataframe.columnname.fillna("zero") to fill the null values with the string "zero".
Step 4: dataframe.columnname = dataframe.columnname.apply(w2n.word_to_num)
With these steps you can convert words to numbers, for example five to 5.
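An alternative to filling the column first is to guard the conversion itself, so that non-string values never reach the converter. A minimal sketch: to_number is a hypothetical helper, and fake_word_to_num with its tiny WORDS table is only a stand-in for w2n.word_to_num so that the example runs without the package installed.

```python
def to_number(value, convert, default=0):
    """Apply convert() to strings only; return default for anything else (e.g. NaN)."""
    if not isinstance(value, str):
        return default
    return convert(value)

# Stand-in for w2n.word_to_num (assumption: only these few words are needed here).
WORDS = {"zero": 0, "two": 2, "five": 5, "seven": 7}

def fake_word_to_num(text):
    return WORDS[text.strip().lower()]

experience = ["five", float("nan"), "two", None]  # made-up column values
converted = [to_number(v, fake_word_to_num) for v in experience]
print(converted)  # [5, 0, 2, 0]
```

With pandas and the real package, the same guard would be used as df.experience.apply(lambda v: to_number(v, w2n.word_to_num)).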

Formatting for zip function in Python 3.x (Error message:zip argument #1 must support iteration)

I am stuck on how to format my zip function. I am aware that zip function only takes iterable objects (lists, sets, tuples, strings, iterators, etc). So far, I am trying to generate an output file that zips three float values in all separate columns. I would really appreciate getting some feedback on how else I can tackle this problem while getting the same outcome.
FYI, my input file looks something like this:
1600 1
1700 3
1800 2.5
3000 1
7000 5
The following is my code so far.
import numpy as np
import os
import csv

myfiles = os.listdir('input')
for file in myfiles:
    size = []
    norm_intensity = []
    with open('input/'+file, 'r') as f:
        data = csv.reader(f, delimiter=',')
        # skip the two header lines
        next(data)
        next(data)
        for row in data:
            size.append(float(row[0]))
            norm_intensity.append(float(row[1]))
    # keep only the points with a positive intensity
    x_and_y = []
    row = np.array([list(i) for i in zip(size, norm_intensity)])
    for x, y in row:
        if y > 0:
            x_and_y.append((x, y))
    # Sum of intensity from the first pool
    first_x = []
    first_y = []
    for x, y in x_and_y:
        if x > 1600 and x < 2035.549:
            first_x.append(x)
            first_y.append(y)
    first_sum = np.sum(first_y)
Up to this point, I am collecting y value when x is greater than 1600 but smaller than 2035.549
In a similar way, I get second sum and third sum (each has a different x range).
The following is the most troubling part so far.
first_pool = first_sum/(first_sum+second_sum+third_sum)
second_pool = second_sum/(first_sum+second_sum+third_sum)
third_pool = third_sum/(first_sum+second_sum+third_sum)
with open('output_pool/'+file, 'w') as f:
    for a, b, c in zip(first_pool, second_pool, third_pool):
        f.write('{0:f},{1:f},{2:f}\n'.format(a, b, c))
What I wanted to have at the end is the following..
first_pool second_pool third_pool
(first_sum) (second_sum) (third_sum)
Since first_pool, second_pool and third_pool are all floats, I am currently running into an error message saying "zip argument #1 must support iteration". Do you have any suggestions for how I could still achieve the goal?
From what I can tell, you don't need zip. Something like the following should do what you want:
sums = [first_sum, second_sum, third_sum]
pools = [first_pool, second_pool, third_pool]
...
for a, b, c in [pools, sums]:
    f.write('{0:f},{1:f},{2:f}\n'.format(a, b, c))
Zipping would be, for example, if you had these two lists and wanted pairs of sums and pools:
for pool, summation in zip(pools, sums):
    f.write('Pool: {}, Sum: {}'.format(pool, summation))
# Pool: 0.5, Sum: 10
# Pool: 0.3, Sum: 6
# ...
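To see both points in one place, here is a minimal sketch with made-up sums: it reproduces the TypeError that zip raises for plain floats and then writes one row per list with csv.writer. The io.StringIO buffer is only a stand-in for the real output file, and the header row follows the layout requested in the question.

```python
import csv
import io

first_sum, second_sum, third_sum = 10.0, 6.0, 4.0  # made-up sums
total = first_sum + second_sum + third_sum
pools = [first_sum / total, second_sum / total, third_sum / total]
sums = [first_sum, second_sum, third_sum]

# zip() needs iterables; passing plain floats raises TypeError.
try:
    zip(pools[0], pools[1], pools[2])
    zip_failed = False
except TypeError as error:
    zip_failed = True
    print(error)

buffer = io.StringIO()  # stands in for open('output_pool/' + file, 'w')
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["first_pool", "second_pool", "third_pool"])
writer.writerow(f"{p:f}" for p in pools)  # one row of pool fractions
writer.writerow(f"{s:f}" for s in sums)   # one row of raw sums
output = buffer.getvalue()
print(output)
```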
