Getting string output embedded with \n characters

Getting string output embedded with \n characters - python-3.x

While scraping a website data, i am getting below o/p:
['1 tablespoon\nvegetable or coconut oil\n1 tablespoon\npeeled and minced fresh ginger (from a 1-inch piece)\n2 cloves\ngarlic, minced\n3 tablespoons\nvegan Thai red curry paste, such as Thai Kitchen\n2\nmedium sweet potatoes (about 1 pound total), peeled and cut into 1/2-inch cubes\n1 (15-ounce) can\nchickpeas, drained and rinsed\n1 (13- to 14-ounce) can\nfull-fat coconut milk\n1/2 cup\nwater\n1 teaspoon\nkosher salt\n1/4 teaspoon\nfreshly ground black pepper\n1 (5-ounce) bag\nbaby spinach (about 5 packed cups)\nJuice from 1 medium lime (about 2 tablespoons)\nCooked rice, for serving (optional)']
Where the first element is 1 tablespoon\nvegetable or coconut oil, second is
1 tablespoon\npeeled and minced fresh ginger (from a 1-inch piece)
So, you can understand that the individual elements are separated by \n and also the individual elements also contains \n. So I am totally confused, how to make a list of the individual ingredient items with no \n there, like:
['1 tablespoon vegetable or coconut oil, 1 tablespoon peeled and minced fresh ginger (from a 1-inch piece), 2 cloves garlic, minced, 3 tablespoons vegan Thai red curry paste, such as Thai Kitchen, Juice from 1 medium lime (about 2 tablespoons), Cooked rice, for serving (optional)']
For the list you can see that, there is no specific pattern like the if we can grab the \n just preceeding any integer as \n is there before Cooked rice, for serving (optional).
If we replace all the \n then all the occurrences will be replaced. I need to wipe out the \n occurrences from inside individual ingredient and also the \n separator between two ingredients need to be replaced by , as i have shown the expected o/p above.
Actual o/p:
['1 tablespoon\nvegetable or coconut oil\n1 tablespoon\npeeled and minced fresh ginger (from a 1-inch piece)\n2 cloves\ngarlic, minced\n3 tablespoons\nvegan Thai red curry paste, such as Thai Kitchen\n2\nmedium sweet potatoes (about 1 pound total), peeled and cut into 1/2-inch cubes\n1 (15-ounce) can\nchickpeas, drained and rinsed\n1 (13- to 14-ounce) can\nfull-fat coconut milk\n1/2 cup\nwater\n1 teaspoon\nkosher salt\n1/4 teaspoon\nfreshly ground black pepper\n1 (5-ounce) bag\nbaby spinach (about 5 packed cups)\nJuice from 1 medium lime (about 2 tablespoons)\nCooked rice, for serving (optional)']
Expected o/p:
['1 tablespoon vegetable or coconut oil, 1 tablespoon peeled and minced fresh ginger (from a 1-inch piece), 2 cloves garlic, minced, 3 tablespoons vegan Thai red curry paste, such as Thai Kitchen, Juice from 1 medium lime (about 2 tablespoons), Cooked rice, for serving (optional)']

I got something close to what you want, hope it helps:
I found 3 separate occasions to replace in the string:
when there's a line break with a number, replace with ", (number)"
when there's a line break with an uppercase letter, replace with ", (letter)"
when there's a line break that doesn't fit both of these categories, replace with " "
import re
text = "['1 tablespoon\nvegetable or coconut oil\n1 tablespoon\npeeled and minced fresh ginger (from a 1-inch piece)\n2 cloves\ngarlic, minced\n3 tablespoons\nvegan Thai red curry paste, such as Thai Kitchen\n2\nmedium sweet potatoes (about 1 pound total), peeled and cut into 1/2-inch cubes\n1 (15-ounce) can\nchickpeas, drained and rinsed\n1 (13- to 14-ounce) can\nfull-fat coconut milk\n1/2 cup\nwater\n1 teaspoon\nkosher salt\n1/4 teaspoon\nfreshly ground black pepper\n1 (5-ounce) bag\nbaby spinach (about 5 packed cups)\nJuice from 1 medium lime (about 2 tablespoons)\nCooked rice, for serving (optional)']"
text = re.sub("\\n(\d)",", \g<1>", text)
text = re.sub("\\n([A-Z])", ", \g<1>", text)
text = re.sub("\\n"," ", text)
print (text)
output: ['1 tablespoon vegetable or coconut oil, 1 tablespoon peeled and minced fresh ginger (from a 1-inch piece), 2 cloves garlic, minced, 3 tablespoons vegan Thai red curry paste, such
as Thai Kitchen, 2 medium sweet potatoes (about 1 pound total), peeled and cut into 1/2-inch cubes, 1 (15-ounce) can chickpeas, drained and rinsed, 1 (13- to 14-ounce) can full-f
at coconut milk, 1/2 cup water, 1 teaspoon kosher salt, 1/4 teaspoon freshly ground black pepper, 1 (5-ounce) bag baby spinach (about 5 packed cups), Juice from 1 medium lime (about 2 tablespoons), Cooked rice, for serving (optional)']

Related

Fuzzy String Matching using Python

I have a training dataset for eg.
Letter Word
A Apple
B Bat
C Cat
D Dog
E Elephant
and I need to check the dataframe such as
AD Apple Dog
AE Applet Elephant
DC Dog Cow
EB Elephant Bag
AED Apple Elephant Dog
D Door
ABC All Bat Cat
the instances AD,AE,EB are almost accurate (Apple and Applet are considered closer to each other, similar for Bat and Bag) but DC doesn't match.
Output Required:
Letters Words Status
AD Apple Dog Accept
AE Applet Elephant Accept
DC Dog Cow Reject
EB Elephant Bag Accept
AED Apple Elephant Dog Accept
D Door Reject
ABC All Bat Cat Accept
ABC accepted because 2 of 3 words match.
The words accepted need to be matched 70% (Fuzzy Match). yet, threshold subject to change.
How can I find these matches using Python.

You can use thefuzz to solve your problem:
# Python env: pip install thefuzz
# Conda env: conda install thefuzz
from thefuzz import fuzz
THRESHOLD = 70
df2['Others'] = (df2['Letters'].agg(list).explode().reset_index()
.merge(df1, left_on='Letters', right_on='Letter')
.groupby('index')['Word'].agg(' '.join))
df2['Ratio'] = df2.apply(lambda x: fuzz.ratio(x['Words'], x['Others']), axis=1)
df2['Status'] = np.where(df2['Ratio'] > THRESHOLD, 'Accept', 'Reject')
Output:
>>> df2
Letters Words Others Ratio Status
0 AD Apple Dog Apple Dog 100 Accept
1 AE Applet Elephant Apple Elephant 97 Accept
2 DC Dog Cow Dog Cat 71 Accept
3 EB Elephant Bag Elephant Bat 92 Accept
4 AED Apple Elephant Dog Apple Dog Elephant 78 Accept
5 D Door Dog 57 Reject
6 ABC All Bat Cat Apple Cat Bat 67 Reject

Python For Loop enumerate control

Hello I Have This Code
loss = list(range(1,10))
lists_fru = ['apple','banana','strawberry','erdberry','mango']
for index ,i in enumerate(loss):
if i > len(lists_fru):
print('larg')
else:
print(lists_fru[index])
The Resul Of It
apple
banana
strawberry
erdberry
mango
larg
larg
larg
larg
What I'm Looking For Or What I'm Trying To Do
I Wanna when the list_fru end to complete the loop from the begining
Like This
apple
banana
strawberry
erdberry
mango
apple
banana
strawberry
erdberry
like this

You can do what you want using the modulo operator, %.
loss = list(range(1,10))
lists_fru = ['apple','banana','strawberry','erdberry','mango']
for index ,i in enumerate(loss):
print(lists_fru[index % len(lists_fru)])

Unable to fetch the title of products from redmart using python

Creating a scraper for the purpose of parsing the title of different products from a web-page when I run it, I get nothing. im using the python to get the data.
from bs4 import BeautifulSoup as soup
from urllib.request import Request,urlopen
myurl=Request('https://redmart.com/product/concatenatew2-x2-y2-95192', headers={'User-Agent': 'Mozilla/5.0'})
pagehtml=urlopen(myurl).read()
pagesoup=soup(pagehtml,'html.parser')
containers=pagesoup.find_all('div',{'class':'productDetailsWrapper'})
print(containers)
prdtname=container.find_all('div',{'class':'description'})
name=prdtname[0].text
print(name)

The page loads its data dynamically through Ajax API. If you look to your Firefox/Chrome network inspector, you will see the URLs where the page connects. This example will load all data for beverages in category 'big-beverage-boom' in JSON format (I commented out the URL, from where the page loads all categories):
import json
import requests
from pprint import pprint
pagesize = 100
page = 1
category = 'big-beverage-boom'
url = 'https://api.redmart.com/v1.6.0/catalog/search?pageSize={}&sort=1024&category={}&page={}'
# This will load all categories:
# categories_url = 'https://api.redmart.com/v1.6.0/catalog/search?extent=0&depth=1'
# r = requests.get(categories_url)
# data = json.loads(r.content)
r = requests.get(url.format(pagesize, category, page))
data = json.loads(r.content)
pprint(data)
The script outputs:
...snip...
'img': {'h': 0,
'name': '/i/m/img_1527825363355.jpg',
'position': 0,
'w': 0},
'inventories': [{'atp_lots': [{'from_date': '2018-07-26T16:09:13Z',
'qty_in_carts': 0,
'qty_in_stock': 4,
'stock_status': 1,
'to_date': '2019-11-29T15:59:59Z'}],
'atp_status': 0,
'delivery_option': 'standard',
'limited_stock_status': 0,
'max_sale_qty': 48,
'next_available_date': '2018-07-26T16:09:13Z',
'qty_in_carts': 0,
'qty_in_stock': 48,
'stock_status': 1}],
'inventory': {'atp_lots': [{'from_date': '2018-07-26T16:09:13Z',
'qty_in_carts': 0,
'qty_in_stock': 4,
'stock_status': 1,
'to_date': '2019-11-29T15:59:59Z'}],
'atp_status': 0,
'delivery_option': 'standard',
'limited_stock_status': 0,
'max_sale_qty': 48,
'next_available_date': '2018-07-26T16:09:13Z',
'qty_in_carts': 0,
'qty_in_stock': 48,
'stock_status': 1},
'measure': {'size': 0.0, 'wt_or_vol': '24 x 500 ml'},
'pr': 103,
'pricing': {'applicable_discount': 'promo',
'discounts': {'live_up': {'promo_price': 25.95,
'savings': 12.03,
'savings_amount': 3.55,
'savings_text': '12% OFF',
'savings_type': 1},
'promo': {'promo_price': 26.55,
'savings': 10.0,
'savings_amount': 2.95,
'savings_text': '10% OFF',
'savings_type': 1}},
'on_sale': 1,
'price': 29.5,
'promo_id': 188169,
'promo_price': 26.55,
'savings': 10.0,
'savings_amount': 2.95,
'savings_text': '10% OFF',
'savings_type': 1},
...snip...
For getting titles from the data, you can use:
for d in data['products']:
print(d['title'])
This prints:
San Pellegrino Sparkling Natural Mineral Water
Pocari Sweat ION Supply Drink
Volvic Natural Mineral Water Case
Pauls Zymil Lactose Free Low Fat Milk
Pocari Sweat ION Supply Drink
MARIGOLD Less Sweet Chrysanthemum Tea
CoCoWater Pure Coconut Water - Case
Coco Life Coconut Water
Perrier Lemon Sparkling Mineral Water
Pokka Premium Afternoon Red Tea
RedMart Coffee Beans
Asian Story Chrysanthemum Tea (Less Sugar) - Case
Yeo's Soya Bean Drink
Pauls Zymil Lactose Free Full Cream Milk
MARIGOLD Low Fat UHT Milk - Case
Gerolsteiner Sparkling Water
100PLUS Tangy Tangerine Isotonic Drink
Vitasoy Chocolate Flavored Soy Drink
Evian Natural Mineral Water
Schweppes Bitter Lemon - Case
H-TWO-O Original Isotonic Drink
Vittel Natural Mineral Water - Case
Pokka Peppermint Green Tea
MARIGOLD Less Sweet Lemon Barley Drink - Case
Pacific Soy Barista Series
Jia Jia Less Sugar Herbal Tea
Dutch Lady UHT Full Cream Milk
Pacific Organic Soy Unsweetened Original Non-Dairy Beverage
Perrier Lemon Sparkling Mineral Water - Case
Pureharvest Organic Oat Milk Non-Dairy
Evian Natural Mineral Water Case
UFC Velvet Unsweetened Almond Milk
Schweppes Slimline Indian Tonic Water
Monster Energy Ultra Sugar Free Energy Drink
Asian Story Chrysanthemum Tea (Less Sugar)
Farmhouse Low Fat UHT Milk - Case
Pokka Straight Red Tea - Case
CocoMax Coconut Water
Bonsoy Organic Soy Milk - Case
YOUC1000 Vitamin Lemon Health Drink
Cowhead UHT Lactose Free Milk
CoCoWater Pure Coconut Water
Fevertree Naturally Light 4's Tonic Water
Pokka Premium Milk Tea - Case
Pauls UHT Low Fat Milk - Case
Twinings Pure Peppermint Tea
Australia's Own Unsweetened Soy Milk
MARIGOLD Less Sweet Lemon Barley
Twinings English Breakfast Tea 25's
Living Planet Low Fat Organic Dairy Milk
Perrier Lime Sparkling Mineral Water
Schweppes Slimline Indian Tonic 12 Per Pack
Blue Diamond Almond Breeze Unsweetened
HOMESOY No Sugar Added Soy Dairy Free Milk
Coco Life Coconut Water - Case
Red Bull Energy Drink
F&N Magnolia Chocolate Flavoured Milk
Lactel UHT Semi-Skimmed Milk
UFC Refresh 100% Natural Coconut Water
Perrier Sparkling Water
MARIGOLD Less Sweet Soya Bean Drink - Case
Wong Coco All Natural Coconut Juice With Pulp - Case
Pokka Nanyang Coffee
Perrier Lime Sparkling Mineral Water - Case
Ice Mountain Sparkling Lemon - Case
Dilmah Premium Quality 100% Pure Ceylon Tea
Vitasoy Original Soy Drink
Dutch Mill Yoghurt Drink with Strawberry Juice
Pauls Chocolate Milk
F&N Magnolia Smoo Chocolate Flavoured Milk
Devondale UHT Skim Milk
Pauls Strawberry Milk
Perrier Pink Grapefruit Sparkling Natural Mineral Water
Pokka No Sugar Oolong Tea
Vitasoy Melon Flavored Soy Drink
Super Essenso MicroGround Coffee - 2 In 1 Coffee And Creamer
Wong Coco All Natural Coconut Juice With Pulp
Perrier Pink Grapefruit Sparkling Mineral Water - Case
Dutch Mill Yoghurt Drink with Blueberry Juice
Ice Mountain Lemon Sparkling Water
CoCoWater Pure Coconut Water
100PLUS Isotonic Drink
Jeju Samdasoo Natural Mineral Water - Case
Red Bull Energy Drink Sugar Free
Super Essenso MicroGround Coffee - 3 In 1
Pokka Chrysanthemum White Tea Case
100PLUS Zero Sugar 6s
Rude Health Ultimate Organic Almond Drink
Three Legs Guava Flavour Cooling Water
Premium Matcha Green Tea
OSK Japanese Green Tea with Brown Rice
MARIGOLD Chocolate UHT Milk - Case
Dilmah Camomile Tea
Rude Health Organic Gluten Free Almond Drink
Twinings Lemon and Ginger Tea
Pocari Sweat ION Supply Drink - Case
CocoMax 100% Coconut Water - Case
YOUC1000 Vitamin Orange Health Drink
F&N Magnolia Smoo Vanilla Flavoured Milk
MARIGOLD Soya Bean Drink
Edit:
Redmart have own Github page with useful utilities: https://github.com/Redmart. Worth to check that too.

You can do it using Selenium this way:
from bs4 import BeautifulSoup
from selenium import webdriver
scrapeLink = 'https://thelinkyouwanttoscrape.com'
driver = webdriver.Firefox(executable_path = 'C:\geckodriver.exe')
driver.get(scrapeLink)
html = driver.execute_script('return document.body.innerHTML')
driver.close()
soup = BeautifulSoup(html,'html.parser')
titles = soup.find_all('the_tag_that_contains_the_info_you_want')
The website, by the way states:
Without prejudice to the generality of Clause 2.1, you agree not to
reproduce, display or otherwise provide access to the Site, App,
Services or Content, for example through framing, mirroring, linking,
spidering, scraping or any other technological means (including any
technology available in the future), without the prior written
permission of RedMart.

How to clean up troublesome characters in clipboard data so I can paste into a python script in IDLE?

I want to copy tables of data displayed in websites and paste as text directly into scripts as string variables using IDLE. This sometimes doesn't work because of something in the copied material that IDLE won't accept as savable. The resulting behavior is not an error message, but IDLE simply ignoring the save request. It just sits there until I close without saving.
That behavior is fine with me at the moment - I'd of course not want to save a python script that contains troublesome characters.
Is there some way I can get those pesky characters out of what's in my computer's clip board so I can get on with my script?
If I just needed to do this once, I could go in and look at the html of the site and possibly extract it, or in the case of the table of satellites on this page maybe I can go into the google app and get it.
But for the purposes of this question, I'd like a way to "fix" the data in my clip board to I can paste as a string into a script using IDLE and run it.
I've tried "Paste and Match Style" in a .txt file first to clean it up, no luck. I have Sublime Text 2 but not very familliar with it, if there is a relatively easy to use function in there, that would be OK.
Trying to paste inside triple quotes thing = """ """ at the prompt gives the following error message: Unsupported characters in input:
note: using Python and IDLE versions '2.7.11', Tk version '8.5.9' (I know, these are a year old) in OSX.
EDIT: Here is a chunk of data from my clip board, as suggested in the comments. Copying from here (as shown) results in unsuccessful save attempts in IDLE, so at least some of the pesky symbols are in here. I'm pasting between a pair of triple quotes, e.g. thing = """ """
1 2/6/2000 PICOSAT 1&2 (TETHERED) Aerospace Corporation mil Opal Opal T 5 N Minotaur-1
2 2/10/2000 PICOSAT 3 (JAK) Santa Clara University uni Opal Opal E 2 N Minotaur-1
3 2/10/2000 PICOSAT 6 (StenSat) Stensat Group. LLC civ Opal Opal C 2 N Minotaur-1
4 2/12/2000 PICOSAT 4 (Thelma) Santa Clara University uni Opal Opal S 2 N Minotaur-1
5 2/12/2000 PICOSAT 5 (Louise) Santa Clara University uni Opal Opal S 2 N Minotaur-1
6 9/6/2001 PICOSAT 7&8 (TETHERED) Aerospace Corporation mil Opal Opal T 2 D Minotaur-1
7 12/2/2002 MEPSI Aerospace Corporation mil 2U SSPL T 2 D Shuttle
8 6/30/2003 DTUSAT 1 Technical University of Denmark uni 1U PPOD E 2 N Rokot-KM
9 6/30/2003 CUTE-1 (CO-55) Tokyo Institute of Technology uni 1U PPOD E 3 N Rokot-KM
10 6/30/2003 QUAKESAT 1 Stanford University uni 3U PPOD S 5 N Rokot-KM
11 6/30/2003 AAU CUBESAT 1 Aalborg University uni 1U PPOD E 2 N Rokot-KM
12 6/30/2003 CANX-1 UTIAS (University of Toronto) uni 1U PPOD E 2 N Rokot-KM
13 6/30/2003 CUBESAT XI-IV (CO-57) University of Tokyo uni 1U PPOD E 4 S Rokot-KM
14 10/27/2005 UWE-1 University of Würzburg uni 1U TPOD E 3 N Kosmos-3M
15 10/27/2005 CUBESAT XI-V (CO-58) University of Tokyo uni 1U TPOD E 5 N Kosmos-3M
16 10/27/2005 Ncube 2 Norweigan Universities uni 1U TPOD E 2 N Kosmos-3M
17 2/21/2006 CUTE 1.7 Tokyo Institute of Technology uni 2U JPOD C 2 D M-5 (2)
18 7/26/2006 AeroCube 1 Aerospace Corporation mil 1U PPOD T 1 D Dnepr-1
19 7/26/2006 SEEDS Nihon University uni 1U PPOD E 1 D Dnepr-1
20 7/26/2006 SACRED University of Arizona uni 1U PPOD E 1 D Dnepr-1

I'd try to scan the string and find the characters outside the normal printable range. Maybe the strange character will be easier to identify.
text = """ <here comes your pasted text> """
def normal(c):
return (32 <= ord(c) <= 127) or (c in '\n\r\t')
strange = set(ord(c) for c in text if not normal(c))
print strange
I wonder what character codes may end up in strange.

Cross reference to different workbook

I have two Excel files:
File1
A B
1 NameofDeal
2 Nike Man
3 Nike Woman
4 Adidas Man
5 Adidas Woman
with many more rows, and other columns that are not here relevant.
File2
A B C D
1 NameofDeal Company TaxNo
2 NIKE woman Nike 101
3 NIKE man NIKE 101
4 Adidas man ADIDAS 102
5 Reebok shorts Reebok 103
6 Nike shoes Nike 101
and other rows (but fewer than in File1) and columns that are not here relevant. Whether man or woman, the company may be the same. The Tax number is the same for Nike man and Nike woman and the same for Adidas man or Adidas woman, etc.
How do I achieve the likes of the following for the first workbook?
A B C
1 NameofDeal TaxNo
2 Nike Man 101
3 Nike Woman 101
4 Adidas Man 102
5 Adidas Woman 102
6 Reebok shorts 103

Use
=VLOOKUP(A2,'~\Two.xlsx'!TList,3,0)
should suit, provided file Two is not in use ("locked"). ~ above is a place marker for the full path and should be replaced to suit along the lines of D:\Folder1\Folder2 Care should be taken with apostrophes and final backslash.
The formula would be entered to the right of ColumnA and on the row for the first entry (here Row2), then copied down as far as required.
Note that the image does not show the full path because the files were open in the same Excel instance at the time the screenshot was taken.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Getting string output embedded with \n characters - python-3.x

Related

Fuzzy String Matching using Python

Python For Loop enumerate control

Unable to fetch the title of products from redmart using python

How to clean up troublesome characters in clipboard data so I can paste into a python script in IDLE?

Cross reference to different workbook

Categories

Resources