Sed/Awk Paragraph Formatting Solution - text

I need to rebuild paragraphs from run-together text from which, for the most part, carriage returns and/or line feeds have been removed. Dialogue is interspersed with the text, so what I'd like is for a blank line to be inserted after the second quote mark of each pair. The quotations will then set off the reconstructed paragraphs. I've added forward slashes (which are not in the text) because I don't know the convention for quoting code on this site. Here's an example:
To go from this:
Bacon ipsum dolor amet pastrami chuck venison swine, salami prosciutto shank pork belly. Filet mignon beef ribs ham hock, bacon ground round porchetta alcatra. Beef bacon biltong bresaola short loin filet mignon "I want bacon." chuck brisket landjaeger jerky prosciutto ham leberkas pork loin doner. Shoulder tongue meatball tail jerky pork loin filet "I want bacon." mignon shank chuck shankle flank pig. Short loin pork loin hamburger corned beef ribeye tri-tip doner ham hock landjaeger t-bone swine. Swine pork belly frankfurter, t-bone ham hock bacon pastrami. Biltong beef chuck ham hock pork loin shoulder strip "I want bacon."steak short loin tail cupim rump alcatra.Shoulder beef cupim rump ground round. Beef sirloin cupim meatball ham ribeye. "I want bacon." Venison tail ribeye, pastrami tongue pig beef ribs kielbasa bresaola doner. Shankle filet mignon pig, shoulder ball tip pork belly jowl sausage fatback boudin. Prosciutto venison capicola bacon, short loin andouille salami shank tongue corned beef. Sirloin biltong boudin tenderloin brisket tri-tip pancetta kielbasa strip steak leberkas short ribs flank filet mignon ham hock pork. Tri-tip cupim "I want bacon." "I want bacon."
to this:
Bacon ipsum dolor amet pastrami chuck venison swine, salami prosciutto shank pork belly. Filet mignon beef ribs ham hock, bacon ground round porchetta alcatra. Beef bacon biltong bresaola short loin filet mignon
"I want bacon."
chuck brisket landjaeger jerky prosciutto ham leberkas pork loin doner. Shoulder tongue meatball tail jerky pork loin filet
"I want bacon."
mignon shank chuck shankle flank pig. Short loin pork loin hamburger corned beef ribeye tri-tip doner ham hock landjaeger t-bone swine. Swine pork belly frankfurter, t-bone ham hock bacon pastrami. Biltong beef chuck ham hock pork loin shoulder strip
"I want bacon."
steak short loin tail cupim rump alcatra. Shoulder beef cupim rump ground round. Beef sirloin cupim meatball ham ribeye.
"I want bacon."
Venison tail ribeye, pastrami tongue pig beef ribs kielbasa bresaola doner. Shankle filet mignon pig, shoulder ball tip pork belly jowl sausage fatback boudin. Prosciutto venison capicola bacon, short loin andouille salami shank tongue corned beef. Sirloin biltong boudin tenderloin brisket tri-tip pancetta kielbasa strip steak leberkas short ribs flank filet mignon ham hock pork. Tri-tip cupim
"I want bacon."
"I want bacon."

awk -v RS='"' '{
if (NR % 2 == 1) {
if (/[^[:space:]]/) printf "%s%s\n\n", (NR==1? "" : "\n"), $0
} else {
printf "\"%s\"\n", $0
}}' file
outputs
Bacon ipsum dolor amet pastrami chuck venison swine, salami prosciutto shank pork belly. Filet mignon beef ribs ham hock, bacon ground round porchetta alcatra. Beef bacon biltong bresaola short loin filet mignon
"I want bacon."
chuck brisket landjaeger jerky prosciutto ham leberkas pork loin doner. Shoulder tongue meatball tail jerky pork loin filet
"I want bacon."
mignon shank chuck shankle flank pig. Short loin pork loin hamburger corned beef ribeye tri-tip doner ham hock landjaeger t-bone swine. Swine pork belly frankfurter, t-bone ham hock bacon pastrami. Biltong beef chuck ham hock pork loin shoulder strip
"I want bacon."
steak short loin tail cupim rump alcatra.Shoulder beef cupim rump ground round. Beef sirloin cupim meatball ham ribeye.
"I want bacon."
Venison tail ribeye, pastrami tongue pig beef ribs kielbasa bresaola doner. Shankle filet mignon pig, shoulder ball tip pork belly jowl sausage fatback boudin. Prosciutto venison capicola bacon, short loin andouille salami shank tongue corned beef. Sirloin biltong boudin tenderloin brisket tri-tip pancetta kielbasa strip steak leberkas short ribs flank filet mignon ham hock pork. Tri-tip cupim
"I want bacon."
"I want bacon."

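For readers who prefer to see the logic spelled out, here is a minimal Python sketch of the same idea as the awk answer above: splitting on the double-quote character gives pieces that alternate between narration and quoted speech. The function name split_on_quotes is made up for this illustration, the sketch assumes the quotes are balanced, and unlike the awk version it trims whitespace around the narration.

```python
def split_on_quotes(text):
    """Return the narration and quoted pieces of text, one per list item."""
    blocks = []
    for index, piece in enumerate(text.split('"')):
        if index % 2 == 1:
            # Odd pieces sat between a pair of quotes: restore the quotes.
            blocks.append('"' + piece + '"')
        elif piece.strip():
            # Even pieces are narration; skip ones that are only whitespace.
            blocks.append(piece.strip())
    return blocks


sample = 'filet mignon "I want bacon." chuck brisket "I want bacon." "I want bacon."'
# A blank line between blocks turns each one into its own paragraph.
print("\n\n".join(split_on_quotes(sample)))
```

Joining with "\n\n" gives the blank line after each quotation that the question asks for.
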
Try this:
awk 'BEGIN{RS="\ ?\"\ ?"; ORS="\n\n"}
NR%2==0{print "\""$0"\"";next;}
{}1' inputFile
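This will put each quotation ("...") in a paragraph of its own. However, where two quotations are adjacent, as at the end of the sample, an empty record is printed between them, so the last paragraphs end up separated by an extra blank paragraph: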
"I want bacon."
"I want bacon."
To remove the blank paragraph between "I want bacon":
awk 'BEGIN{RS="\ ?\"\ ?"; ORS="\n\n"}
NR%2==0{print "\""$0"\"";next;}
($0!=""){print $0}' inputFile

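sed might be easier (this relies on GNU sed for \n in the replacement, and it only handles quotations that are followed by a space):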
$ sed 's/"[^"]*" /\n\n&\n\n/g' bacon
example:
$ echo "bla bla bla \"This is bacon.\" Starts a new paragraph" | sed 's/"[^"]*" /\n\n&\n\n/g'
bla bla bla
"This is bacon."
Starts a new paragraph

With GNU awk for multi-char RS and gensub():
$ awk -v RS='^$' -v ORS= '{$0=gensub(/\s*("[^"]+")\s*/,"\n\n\\1\n\n","g"); gsub(/\n+/,"\n\n")}1' file
Bacon ipsum dolor amet pastrami chuck venison swine, salami prosciutto shank pork belly. Filet mignon beef ribs ham hock, bacon ground round porchetta alcatra. Beef bacon biltong bresaola short loin filet mignon
"I want bacon."
chuck brisket landjaeger jerky prosciutto ham leberkas pork loin doner. Shoulder tongue meatball tail jerky pork loin filet
"I want bacon."
mignon shank chuck shankle flank pig. Short loin pork loin hamburger corned beef ribeye tri-tip doner ham hock landjaeger t-bone swine. Swine pork belly frankfurter, t-bone ham hock bacon pastrami. Biltong beef chuck ham hock pork loin shoulder strip
"I want bacon."
steak short loin tail cupim rump alcatra.Shoulder beef cupim rump ground round. Beef sirloin cupim meatball ham ribeye.
"I want bacon."
Venison tail ribeye, pastrami tongue pig beef ribs kielbasa bresaola doner. Shankle filet mignon pig, shoulder ball tip pork belly jowl sausage fatback boudin. Prosciutto venison capicola bacon, short loin andouille salami shank tongue corned beef. Sirloin biltong boudin tenderloin brisket tri-tip pancetta kielbasa strip steak leberkas short ribs flank filet mignon ham hock pork. Tri-tip cupim
"I want bacon."
"I want bacon."

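The gensub() answer above can be mirrored in Python with two regular-expression substitutions. This is only a sketch: break_around_quotes is a hypothetical name, and unlike the gawk version it leaves single newlines already present in the input alone and trims newlines at the ends.

```python
import re


def break_around_quotes(text):
    """Put every "..." run in a paragraph of its own."""
    # Surround each quotation with blank lines, swallowing adjacent whitespace.
    spaced = re.sub(r'\s*("[^"]+")\s*', r'\n\n\1\n\n', text)
    # Adjacent quotations leave four newlines in a row; collapse them to two.
    return re.sub(r'\n{3,}', '\n\n', spaced).strip('\n')


sample = 'strip "I want bacon."steak short loin. "I want bacon." "I want bacon."'
print(break_around_quotes(sample))
```

Because \s* also matches nothing, a quotation that runs straight into the next word (as in "I want bacon."steak) is still split off.
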
Related

Replace Substring in a text file with a text file of strings using Sed

I'm trying to replace substrings in a text file [corpus.txt] with some other substrings using sed. I have the list of possible substrings in a file sub.txt containing the following:
dogs chase
birds eat
chase birds
chase cat
chase birds .
and a corpus.txt containing some texts as below:
dogs chase cats around
dogs bark
cats meow
dogs chase birds
cats chase birds , birds eat grains
dogs chase the cats
the birds chirp
with the desired output
<bop> dogs chase <eop> cats around
dogs bark
cats meow
<bop> dogs chase <eop> birds
cats <bop> chase birds <eop> , <bop> birds eat <eop> grains
<bop> dogs chase <eop> the cats
the birds chirp
Using the Command sed -f <(sed 's/.*/s|\\b&\\b|<bop> & <eop>|g/' sub.txt) corpus.txt it returns everything in the desired output correctly, except in the fifth line where it returns :
cats <bop> <bop> chase birds . <eop>eop> , <bop> birds eat <eop> grains
What can I do to get this to work?
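You have to escape the . in the first file to make it a literal match. Unescaped, it matches any character, so the pattern chase birds . also matches the text chase birds < produced by an earlier substitution, which is where the doubled <bop> and the broken <eop>eop> come from: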
$ sed -f <(sed 's/\./\\./;s/.*/s|\\b&\\b|<bop> & <eop>|g/' sub_o.txt) file
<bop> dogs chase <eop> cats around
dogs bark
cats meow
<bop> dogs chase <eop> birds
cats <bop> chase birds <eop> , <bop> birds eat <eop> grains
<bop> dogs chase <eop> the cats
the birds chirp
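As an alternative sketch, and not what the sed answer above does, the same tagging can be done in one pass in Python. re.escape() takes care of the literal dot, and because all phrases are tried in a single alternation, text that has already been wrapped is never matched again. The name tag_phrases and the use of lookarounds in place of \b are assumptions of this example.

```python
import re


def tag_phrases(line, phrases):
    """Wrap every listed phrase found in line with <bop> ... <eop>."""
    # Try longer phrases first so "chase birds ." is not shadowed by "chase birds".
    ordered = sorted(phrases, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    # The lookarounds play the role of \b and also work for phrases ending in punctuation.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return pattern.sub(lambda m: "<bop> " + m.group(0) + " <eop>", line)


phrases = ["dogs chase", "birds eat", "chase birds", "chase cat", "chase birds ."]
print(tag_phrases("cats chase birds , birds eat grains", phrases))
```

In a real script the phrases list would be read from sub.txt and the function applied to each line of corpus.txt.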

Getting string output embedded with \n characters

While scraping data from a website, I am getting the output below:
['1 tablespoon\nvegetable or coconut oil\n1 tablespoon\npeeled and minced fresh ginger (from a 1-inch piece)\n2 cloves\ngarlic, minced\n3 tablespoons\nvegan Thai red curry paste, such as Thai Kitchen\n2\nmedium sweet potatoes (about 1 pound total), peeled and cut into 1/2-inch cubes\n1 (15-ounce) can\nchickpeas, drained and rinsed\n1 (13- to 14-ounce) can\nfull-fat coconut milk\n1/2 cup\nwater\n1 teaspoon\nkosher salt\n1/4 teaspoon\nfreshly ground black pepper\n1 (5-ounce) bag\nbaby spinach (about 5 packed cups)\nJuice from 1 medium lime (about 2 tablespoons)\nCooked rice, for serving (optional)']
Here the first element is 1 tablespoon\nvegetable or coconut oil, the second is
1 tablespoon\npeeled and minced fresh ginger (from a 1-inch piece)
So the individual elements are separated by \n, and each element also contains a \n. I can't work out how to make a list of the individual ingredients with no \n in it, like:
['1 tablespoon vegetable or coconut oil, 1 tablespoon peeled and minced fresh ginger (from a 1-inch piece), 2 cloves garlic, minced, 3 tablespoons vegan Thai red curry paste, such as Thai Kitchen, Juice from 1 medium lime (about 2 tablespoons), Cooked rice, for serving (optional)']
Note that there is no single pattern to rely on: we can't just grab every \n that precedes an integer, because there is also a \n before Cooked rice, for serving (optional).
If we simply replace \n everywhere, every occurrence is treated the same way. I need to remove the \n inside each ingredient and replace the \n separating two ingredients with a comma, as shown in the expected output above.
Actual o/p:
['1 tablespoon\nvegetable or coconut oil\n1 tablespoon\npeeled and minced fresh ginger (from a 1-inch piece)\n2 cloves\ngarlic, minced\n3 tablespoons\nvegan Thai red curry paste, such as Thai Kitchen\n2\nmedium sweet potatoes (about 1 pound total), peeled and cut into 1/2-inch cubes\n1 (15-ounce) can\nchickpeas, drained and rinsed\n1 (13- to 14-ounce) can\nfull-fat coconut milk\n1/2 cup\nwater\n1 teaspoon\nkosher salt\n1/4 teaspoon\nfreshly ground black pepper\n1 (5-ounce) bag\nbaby spinach (about 5 packed cups)\nJuice from 1 medium lime (about 2 tablespoons)\nCooked rice, for serving (optional)']
Expected o/p:
['1 tablespoon vegetable or coconut oil, 1 tablespoon peeled and minced fresh ginger (from a 1-inch piece), 2 cloves garlic, minced, 3 tablespoons vegan Thai red curry paste, such as Thai Kitchen, Juice from 1 medium lime (about 2 tablespoons), Cooked rice, for serving (optional)']
I got something close to what you want, hope it helps:
I found three separate cases to replace in the string:
when a line break is followed by a digit, replace it with ", (digit)"
when a line break is followed by an uppercase letter, replace it with ", (letter)"
when a line break fits neither of these categories, replace it with " "
import re
text = "['1 tablespoon\nvegetable or coconut oil\n1 tablespoon\npeeled and minced fresh ginger (from a 1-inch piece)\n2 cloves\ngarlic, minced\n3 tablespoons\nvegan Thai red curry paste, such as Thai Kitchen\n2\nmedium sweet potatoes (about 1 pound total), peeled and cut into 1/2-inch cubes\n1 (15-ounce) can\nchickpeas, drained and rinsed\n1 (13- to 14-ounce) can\nfull-fat coconut milk\n1/2 cup\nwater\n1 teaspoon\nkosher salt\n1/4 teaspoon\nfreshly ground black pepper\n1 (5-ounce) bag\nbaby spinach (about 5 packed cups)\nJuice from 1 medium lime (about 2 tablespoons)\nCooked rice, for serving (optional)']"
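# Raw strings keep \d and \g from being read as (invalid) string escapes.
# A newline followed by a digit starts a new ingredient.
text = re.sub(r"\n(\d)", r", \g<1>", text)
# So does a newline followed by an uppercase letter.
text = re.sub(r"\n([A-Z])", r", \g<1>", text)
# Any newline left over sits inside an ingredient.
text = re.sub(r"\n", " ", text)
print(text)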
output: ['1 tablespoon vegetable or coconut oil, 1 tablespoon peeled and minced fresh ginger (from a 1-inch piece), 2 cloves garlic, minced, 3 tablespoons vegan Thai red curry paste, such as Thai Kitchen, 2 medium sweet potatoes (about 1 pound total), peeled and cut into 1/2-inch cubes, 1 (15-ounce) can chickpeas, drained and rinsed, 1 (13- to 14-ounce) can full-fat coconut milk, 1/2 cup water, 1 teaspoon kosher salt, 1/4 teaspoon freshly ground black pepper, 1 (5-ounce) bag baby spinach (about 5 packed cups), Juice from 1 medium lime (about 2 tablespoons), Cooked rice, for serving (optional)']
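If an actual Python list with one entry per ingredient is wanted rather than one long string, the same heuristic can be applied with re.split(). This is a sketch under the same assumption as the answer above, namely that a new ingredient always starts after a line break followed by a digit or an uppercase letter; split_ingredients is a name invented for this example, and the sample is a shortened version of the scraped string.

```python
import re


def split_ingredients(raw):
    """Split scraped text into one clean string per ingredient."""
    # A new ingredient is assumed to start after a newline followed by a digit or capital.
    items = re.split(r"\n(?=[0-9A-Z])", raw)
    # Newlines that remain are inside an ingredient and become spaces.
    return [item.replace("\n", " ") for item in items]


raw = "1 tablespoon\nvegetable or coconut oil\n2 cloves\ngarlic, minced\nCooked rice, for serving (optional)"
print(split_ingredients(raw))
```

The heuristic would split in the wrong place if a description line itself began with a digit or a capital letter, so it is worth checking the result against the page.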

Unable to fetch the title of products from redmart using python

I'm creating a scraper to parse the titles of different products from a web page, but when I run it, I get nothing. I'm using Python to get the data.
from bs4 import BeautifulSoup as soup
from urllib.request import Request,urlopen
myurl=Request('https://redmart.com/product/concatenatew2-x2-y2-95192', headers={'User-Agent': 'Mozilla/5.0'})
pagehtml=urlopen(myurl).read()
pagesoup=soup(pagehtml,'html.parser')
containers=pagesoup.find_all('div',{'class':'productDetailsWrapper'})
print(containers)
prdtname=container.find_all('div',{'class':'description'})
name=prdtname[0].text
print(name)
The page loads its data dynamically through an Ajax API. If you look at the network inspector in Firefox/Chrome, you will see the URLs the page connects to. This example loads all data for beverages in the category 'big-beverage-boom' in JSON format (I commented out the URL from which the page loads all categories):
import json
import requests
from pprint import pprint
pagesize = 100
page = 1
category = 'big-beverage-boom'
url = 'https://api.redmart.com/v1.6.0/catalog/search?pageSize={}&sort=1024&category={}&page={}'
# This will load all categories:
# categories_url = 'https://api.redmart.com/v1.6.0/catalog/search?extent=0&depth=1'
# r = requests.get(categories_url)
# data = json.loads(r.content)
r = requests.get(url.format(pagesize, category, page))
data = json.loads(r.content)
pprint(data)
The script outputs:
...snip...
'img': {'h': 0,
'name': '/i/m/img_1527825363355.jpg',
'position': 0,
'w': 0},
'inventories': [{'atp_lots': [{'from_date': '2018-07-26T16:09:13Z',
'qty_in_carts': 0,
'qty_in_stock': 4,
'stock_status': 1,
'to_date': '2019-11-29T15:59:59Z'}],
'atp_status': 0,
'delivery_option': 'standard',
'limited_stock_status': 0,
'max_sale_qty': 48,
'next_available_date': '2018-07-26T16:09:13Z',
'qty_in_carts': 0,
'qty_in_stock': 48,
'stock_status': 1}],
'inventory': {'atp_lots': [{'from_date': '2018-07-26T16:09:13Z',
'qty_in_carts': 0,
'qty_in_stock': 4,
'stock_status': 1,
'to_date': '2019-11-29T15:59:59Z'}],
'atp_status': 0,
'delivery_option': 'standard',
'limited_stock_status': 0,
'max_sale_qty': 48,
'next_available_date': '2018-07-26T16:09:13Z',
'qty_in_carts': 0,
'qty_in_stock': 48,
'stock_status': 1},
'measure': {'size': 0.0, 'wt_or_vol': '24 x 500 ml'},
'pr': 103,
'pricing': {'applicable_discount': 'promo',
'discounts': {'live_up': {'promo_price': 25.95,
'savings': 12.03,
'savings_amount': 3.55,
'savings_text': '12% OFF',
'savings_type': 1},
'promo': {'promo_price': 26.55,
'savings': 10.0,
'savings_amount': 2.95,
'savings_text': '10% OFF',
'savings_type': 1}},
'on_sale': 1,
'price': 29.5,
'promo_id': 188169,
'promo_price': 26.55,
'savings': 10.0,
'savings_amount': 2.95,
'savings_text': '10% OFF',
'savings_type': 1},
...snip...
For getting titles from the data, you can use:
for d in data['products']:
    print(d['title'])
This prints:
San Pellegrino Sparkling Natural Mineral Water
Pocari Sweat ION Supply Drink
Volvic Natural Mineral Water Case
Pauls Zymil Lactose Free Low Fat Milk
Pocari Sweat ION Supply Drink
MARIGOLD Less Sweet Chrysanthemum Tea
CoCoWater Pure Coconut Water - Case
Coco Life Coconut Water
Perrier Lemon Sparkling Mineral Water
Pokka Premium Afternoon Red Tea
RedMart Coffee Beans
Asian Story Chrysanthemum Tea (Less Sugar) - Case
Yeo's Soya Bean Drink
Pauls Zymil Lactose Free Full Cream Milk
MARIGOLD Low Fat UHT Milk - Case
Gerolsteiner Sparkling Water
100PLUS Tangy Tangerine Isotonic Drink
Vitasoy Chocolate Flavored Soy Drink
Evian Natural Mineral Water
Schweppes Bitter Lemon - Case
H-TWO-O Original Isotonic Drink
Vittel Natural Mineral Water - Case
Pokka Peppermint Green Tea
MARIGOLD Less Sweet Lemon Barley Drink - Case
Pacific Soy Barista Series
Jia Jia Less Sugar Herbal Tea
Dutch Lady UHT Full Cream Milk
Pacific Organic Soy Unsweetened Original Non-Dairy Beverage
Perrier Lemon Sparkling Mineral Water - Case
Pureharvest Organic Oat Milk Non-Dairy
Evian Natural Mineral Water Case
UFC Velvet Unsweetened Almond Milk
Schweppes Slimline Indian Tonic Water
Monster Energy Ultra Sugar Free Energy Drink
Asian Story Chrysanthemum Tea (Less Sugar)
Farmhouse Low Fat UHT Milk - Case
Pokka Straight Red Tea - Case
CocoMax Coconut Water
Bonsoy Organic Soy Milk - Case
YOUC1000 Vitamin Lemon Health Drink
Cowhead UHT Lactose Free Milk
CoCoWater Pure Coconut Water
Fevertree Naturally Light 4's Tonic Water
Pokka Premium Milk Tea - Case
Pauls UHT Low Fat Milk - Case
Twinings Pure Peppermint Tea
Australia's Own Unsweetened Soy Milk
MARIGOLD Less Sweet Lemon Barley
Twinings English Breakfast Tea 25's
Living Planet Low Fat Organic Dairy Milk
Perrier Lime Sparkling Mineral Water
Schweppes Slimline Indian Tonic 12 Per Pack
Blue Diamond Almond Breeze Unsweetened
HOMESOY No Sugar Added Soy Dairy Free Milk
Coco Life Coconut Water - Case
Red Bull Energy Drink
F&N Magnolia Chocolate Flavoured Milk
Lactel UHT Semi-Skimmed Milk
UFC Refresh 100% Natural Coconut Water
Perrier Sparkling Water
MARIGOLD Less Sweet Soya Bean Drink - Case
Wong Coco All Natural Coconut Juice With Pulp - Case
Pokka Nanyang Coffee
Perrier Lime Sparkling Mineral Water - Case
Ice Mountain Sparkling Lemon - Case
Dilmah Premium Quality 100% Pure Ceylon Tea
Vitasoy Original Soy Drink
Dutch Mill Yoghurt Drink with Strawberry Juice
Pauls Chocolate Milk
F&N Magnolia Smoo Chocolate Flavoured Milk
Devondale UHT Skim Milk
Pauls Strawberry Milk
Perrier Pink Grapefruit Sparkling Natural Mineral Water
Pokka No Sugar Oolong Tea
Vitasoy Melon Flavored Soy Drink
Super Essenso MicroGround Coffee - 2 In 1 Coffee And Creamer
Wong Coco All Natural Coconut Juice With Pulp
Perrier Pink Grapefruit Sparkling Mineral Water - Case
Dutch Mill Yoghurt Drink with Blueberry Juice
Ice Mountain Lemon Sparkling Water
CoCoWater Pure Coconut Water
100PLUS Isotonic Drink
Jeju Samdasoo Natural Mineral Water - Case
Red Bull Energy Drink Sugar Free
Super Essenso MicroGround Coffee - 3 In 1
Pokka Chrysanthemum White Tea Case
100PLUS Zero Sugar 6s
Rude Health Ultimate Organic Almond Drink
Three Legs Guava Flavour Cooling Water
Premium Matcha Green Tea
OSK Japanese Green Tea with Brown Rice
MARIGOLD Chocolate UHT Milk - Case
Dilmah Camomile Tea
Rude Health Organic Gluten Free Almond Drink
Twinings Lemon and Ginger Tea
Pocari Sweat ION Supply Drink - Case
CocoMax 100% Coconut Water - Case
YOUC1000 Vitamin Orange Health Drink
F&N Magnolia Smoo Vanilla Flavoured Milk
MARIGOLD Soya Bean Drink
Edit:
RedMart has its own GitHub page with useful utilities: https://github.com/Redmart. It's worth checking too.
You can do it using Selenium this way:
from bs4 import BeautifulSoup
from selenium import webdriver
scrapeLink = 'https://thelinkyouwanttoscrape.com'
driver = webdriver.Firefox(executable_path=r'C:\geckodriver.exe')  # raw string, so \g is not read as an escape
driver.get(scrapeLink)
html = driver.execute_script('return document.body.innerHTML')
driver.close()
soup = BeautifulSoup(html,'html.parser')
titles = soup.find_all('the_tag_that_contains_the_info_you_want')
The website, by the way, states:
Without prejudice to the generality of Clause 2.1, you agree not to
reproduce, display or otherwise provide access to the Site, App,
Services or Content, for example through framing, mirroring, linking,
spidering, scraping or any other technological means (including any
technology available in the future), without the prior written
permission of RedMart.

How to make each nth line a column using awk?

I have a single column text file looking like this:
John
Doe
Male
1984
Marie
Parker
Female
1989
And I would like to convert it to look like this:
John Doe Male 1984
Marie Parker Female 1989
I've tried using awk and modulo but I cannot manage to find a working solution.
$ pr -4at file
John Doe Male 1984
Marie Parker Female 1989
or, with single-space separators as in your format:
$ pr -4ats' ' file
John Doe Male 1984
Marie Parker Female 1989
or, of course, with awk:
$ awk 'ORS=NR%4?FS:RS' file
John Doe Male 1984
Marie Parker Female 1989
or with paste:
$ paste -d' ' - - - - < file
John Doe Male 1984
Marie Parker Female 1989
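The awk one-liner above is terse: the assignment ORS=NR%4?FS:RS is itself the pattern, it sets the output separator to a space except after every fourth record, where it becomes a newline, and since the assigned value is a non-empty string the line is always printed. The grouping it performs can be sketched in Python as follows; lines_to_rows is a hypothetical helper, not part of any of the tools above.

```python
def lines_to_rows(lines, n=4, sep=" "):
    """Join every n consecutive lines into one sep-separated row."""
    return [sep.join(lines[i:i + n]) for i in range(0, len(lines), n)]


records = ["John", "Doe", "Male", "1984", "Marie", "Parker", "Female", "1989"]
print("\n".join(lines_to_rows(records)))
```

If the number of lines is not a multiple of n, the last row is simply shorter.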

grep and egrep selecting numbers

I have to find all entries of people whose zip code has “22” in it. NOTE: this should not include something like Mike Keneally whose street address includes “22”.
Here are some samples of data:
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125
George Duke, San Diego, CA 93241
Ruth Underwood, Mariemont, OH 42522
Here is the command I have so far, but I don't know why it's not working.
egrep '.*[A-Z][A-Z]\s*[0-9]+[22][0-9]+$' names.txt
I assume this is your sample names.txt:
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125
George Duke, San Diego, CA 93241
Ruth Underwood, Mariemont, OH 42522
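egrep '.*[A-Z][A-Z]\s*[0-9]+[22][0-9]+$' names.txt
Your pattern matches any line that satisfies these conditions:
.* any run of characters (redundant at the start of an unanchored pattern)
[A-Z][A-Z] two consecutive uppercase characters
\s* zero or more whitespace characters
[0-9]+ one or more digit characters
[22] a single character that is either 2 or 2, i.e. just one 2, not the string 22
[0-9]+$ one or more digit characters at the end of the line
So it accepts any zip code with a 2 somewhere in the middle, such as 90125 or 93241.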
To get the lines that satisfy your requirement (zip code has “22” in it), you can do it this way:
egrep '[A-Z]{2}\s+[0-9]*22' names.txt
If the zip code is always the last field, you can use this awk command:
awk '$NF~/22/' file
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Ruth Underwood, Mariemont, OH 42522
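For comparison, here is a small Python sketch of the same filter on the sample data. It follows the egrep answer above but adds a trailing $ anchor so that the digits must be the zip code at the end of the line; that anchor and the names ZIP_WITH_22 and zip_has_22 are assumptions of this example.

```python
import re

# Two capital letters (the state), whitespace, then a run of digits that
# contains "22" and reaches the end of the line.
ZIP_WITH_22 = re.compile(r"[A-Z]{2}\s+\d*22\d*$")


def zip_has_22(line):
    """Return True when the trailing zip code of line contains 22."""
    return ZIP_WITH_22.search(line) is not None


entries = [
    "Bianca Jones, 612 Charles Blvd, Louisville, KY 40228",
    "Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125",
    "George Duke, San Diego, CA 93241",
    "Ruth Underwood, Mariemont, OH 42522",
]
matches = [line for line in entries if zip_has_22(line)]
print("\n".join(matches))
```

The street number 6221 is ignored because it is not preceded by a two-letter state code.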
