How to scrape this pdf file? - python-3.x

I want to scrape the tables of this Persian PDF file and get the results as a pandas DataFrame, but I get the error "NameError: name 'PDFResourceManager' is not defined" and no usable content is extracted.
Please help me find a solution that handles the encoding correctly. Tested code is appreciated.
from pdfminer.converter import TextConverter
from io import StringIO
from io import open
from urllib.request import urlopen
import io
import requests
import pdfminer as pm

urlpdf = "https://www.codal.ir/Reports/DownloadFile.aspx?id=jck8NF9OtmFW6fpyefK09w%3d%3d"
response = requests.get(urlpdf, verify=False, timeout=5)
f = io.BytesIO(response.content)

def readPDF(f):
    rsrcmgr = PDFResourceManager()   # NameError: PDFResourceManager is never imported
    retstr = StringIO()
    laparams = LAParams()            # LAParams is never imported either
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, f)  # nor is process_pdf
    device.close()
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen(urlpdf)  # downloads the PDF a second time; f above is unused
outputString = readPDF(pdfFile)
proceedings = outputString.encode('utf-8')  # creates a UTF-8 bytes object
proceedings = str(proceedings)              # creates its string representation <- the source of your issue
file = open("extract.txt", "w", encoding="utf-8")  # open() encodes the str as UTF-8 itself
file.write(proceedings)
file.close()
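A minimal working rewrite, as a sketch: it assumes the pdfminer.six fork, where the legacy process_pdf helper is gone and pdfminer.high_level.extract_text covers the same resource-manager/converter flow; the output file name is illustrative:
import io
import requests
from pdfminer.high_level import extract_text

urlpdf = "https://www.codal.ir/Reports/DownloadFile.aspx?id=jck8NF9OtmFW6fpyefK09w%3d%3d"
response = requests.get(urlpdf, verify=False, timeout=5)

# extract_text accepts any binary file-like object and returns a str,
# so the Persian text can be written out as UTF-8 with no encode/str round trip
content = extract_text(io.BytesIO(response.content))
with open("extract.txt", "w", encoding="utf-8") as out:
    out.write(content)
For the tables-to-DataFrame goal, plain text extraction discards the table layout; a dedicated table extractor is usually a better fit. A sketch assuming the camelot-py package (not used in the question) and a locally saved copy of the PDF:
import camelot

# "report.pdf" is a placeholder for the file downloaded above
tables = camelot.read_pdf("report.pdf", pages="all")
df = tables[0].df  # each extracted table exposes a pandas DataFrame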

Related

Extracting data from a UCI dataset Online using python if the file is compressed(.zip)

I want to use web scraping to get the data from the file
https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip
How can I do that using requests in Python?
You can use this example of how to load the zip file using requests and the built-in zipfile module:
import requests
from io import BytesIO
from zipfile import ZipFile

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip"

with ZipFile(BytesIO(requests.get(url).content), "r") as myzip:
    # print the contents of the zip:
    # print(myzip.namelist())
    # print the content of one of the files:
    with myzip.open("Youtube01-Psy.csv", "r") as f_in:
        print(f_in.read())
Prints:
b'COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS\n
...
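If the goal is a DataFrame rather than raw bytes, the open archive member can be handed to pandas directly; a sketch under the same imports, assuming the same member name:
import pandas as pd

with ZipFile(BytesIO(requests.get(url).content)) as myzip:
    with myzip.open("Youtube01-Psy.csv") as f_in:
        df = pd.read_csv(f_in)  # read_csv accepts the binary file-like member
print(df.head())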

How to convert a byte stream (binary form) to a CSV file using Python 3.8?

I have to process a .csv file using an AWS Lambda function. I serve the .csv file to the Lambda function through an AWS API Gateway. The API Gateway transforms the .csv file into a base64 string as it is received in the request. Any idea how to convert it back to a .csv file?
I have mentioned my code below for reference.
import os
import sys

CWD = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, os.path.join(CWD, "lib"))

import json
import base64
import boto3
import numpy as np
import io
from io import BytesIO
import pandas as pd

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    # retrieve the data from the event, which is a base64 string
    get_file_content_from_postman = event["content"]
    # decode the data; the file content is converted back to binary form
    binary_file = base64.b64decode(get_file_content_from_postman)
Since your binary_file will be bytes, you can just wrap it in BytesIO to treat it as a file for pandas:
df = pd.read_csv(BytesIO(binary_file))
print(df)
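If the processed frame then needs to land back in S3 (the handler already creates a client), a short sketch; the bucket and key names below are hypothetical placeholders:
out_buf = io.StringIO()
df.to_csv(out_buf, index=False)
# Bucket and Key are placeholders, not values from the question
s3.put_object(Bucket="my-bucket", Key="processed.csv", Body=out_buf.getvalue())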

getting pandas.errors.ParserError: Error tokenizing data while putting csv objects in send_key()

I am trying to extract links but get the error "pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file." and don't know how to solve it.
I tried using the Selenium chromedriver send_keys, but without success.
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver import ActionChains
import csv
import re
import pandas as pd

# calling read_csv on an .xlsx workbook is what raises the ParserError
links = pd.read_csv('C:\\Users\\dell\\Desktop\\CIN_Name.xlsx', encoding='utf8', dtype=str, header=None, error_bad_lines=False)

for i in range(0, 5):
    link = links.iloc[i, 0]
    url = "https://www.knowyourgst.com/gst-number-search/by-name-pan/"
    driver = webdriver.Chrome(r'C:\chromedriver.exe')
    driver.get(url)
    driver.find_element_by_xpath('//*[@id="gstnumber"]').send_keys(str(link))
    driver.find_element_by_xpath('/html/body/div[1]/div/div[1]/div[1]/div[1]/form/div[2]/input').click()
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    link = soup.find('div', {"id": "searchresult"}).find('a')
    print(link['href'])
I want to extract a link for each object in the loop, reading them one by one from the file's first column. Please help me solve this error.
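The traceback comes from handing pd.read_csv an Excel workbook (.xlsx is a zip container, not plain text). A sketch of the likely fix, assuming an engine for .xlsx files such as openpyxl is installed:
import pandas as pd

# read_excel parses the workbook format that read_csv chokes on
links = pd.read_excel(r'C:\Users\dell\Desktop\CIN_Name.xlsx', dtype=str, header=None)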

Compress a CSV file written to a StringIO Buffer in Python3

I'm parsing text from PDF files into rows of ordered char metadata; I need to serialize these files to cloud storage, which all works fine. However, due to their size I'd also like to gzip them, and I've run into some issues there.
Here is my code:
import io
import csv
import zlib

# This data file is sent over Flask
page_position_data = pdf_parse_page_layouts(data_file)

field_order = ['char', 'position', 'page']
output_buffer = io.StringIO()
writer = csv.DictWriter(output_buffer, field_order)
writer.writeheader()
for page, rows in page_position_data.items():
    for text_char_data_row in rows:
        writer.writerow(text_char_data_row)
stored_format = zlib.compress(output_buffer)
This reads each row into the io.StringIO buffer successfully, but gzip/zlib only work with bytes-like objects such as io.BytesIO, so the last line errors; I also cannot write the CSV into a BytesIO buffer, because DictWriter/Writer error unless io.StringIO() is used.
Thank you for your help!
I figured this out and wanted to show my answer for anyone who runs into this:
The issue is that zlib.compress expects a bytes-like object; that means neither StringIO nor BytesIO themselves, as both of these are "file-like" objects which implement read(), like normal Unix file handles.
All you have to do to fix this is write the CSV to a StringIO(), then get the string out of the StringIO() object and encode it into a bytestring; it can then be compressed by zlib.
import io
import csv
import zlib

# This data file is sent over Flask
page_position_data = pdf_parse_page_layouts(data_file)

field_order = ['char', 'position', 'page']
output_buffer = io.StringIO()
writer = csv.DictWriter(output_buffer, field_order)
writer.writeheader()
for page, rows in page_position_data.items():
    for text_char_data_row in rows:
        writer.writerow(text_char_data_row)
encoded = output_buffer.getvalue().encode()
stored_format = zlib.compress(encoded)
I have an alternative answer for anyone interested which should use less intermediate space; it needs Python 3.3 and over to use the getbuffer() method:
from io import BytesIO, TextIOWrapper
import csv
import zlib

def compress_csv(series):
    byte_buf = BytesIO()
    fp = TextIOWrapper(byte_buf, newline='', encoding='utf-8')
    writer = csv.writer(fp)
    for row in series:
        writer.writerow(row)
    fp.flush()  # push any text still buffered in the wrapper down into byte_buf
    compressed = zlib.compress(byte_buf.getbuffer())
    fp.close()
    byte_buf.close()
    return compressed
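A further variant streams through the gzip module so that no full uncompressed copy is ever held in memory; this is a sketch, assuming gzip output is acceptable in place of raw zlib (both wrap the same deflate stream, differing only in header and trailer):
from io import BytesIO, TextIOWrapper
import csv
import gzip

def compress_csv_gzip(series):
    byte_buf = BytesIO()
    # GzipFile compresses as the csv writer writes, row by row
    with gzip.GzipFile(fileobj=byte_buf, mode="wb") as gz:
        fp = TextIOWrapper(gz, newline='', encoding='utf-8')
        writer = csv.writer(fp)
        for row in series:
            writer.writerow(row)
        fp.flush()   # drain the text buffer into the gzip stream
        fp.detach()  # stop fp's destructor from closing gz before the with-block does
    return byte_buf.getvalue()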

How to remove/delete texts in Python list from different index position

I have tried several ways to extract the words I want, but I couldn't make it work. The result needs to look like this:
https://drive.google.com/open?id=0BzzXkoIWuMAHUjFuT2IteEtLVjQ
After extracting the texts, I can then put them into a CSV file.
from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

AccessCME = Request('http://www.cmegroup.com/trading/energy/natural-gas/natural-gas_contract_specifications.html',
                    headers={"User-Agent": "Mozilla/5.0"})
CMEPage = uReq(AccessCME).read()
page_soup = soup(CMEPage, "html.parser")
possible_tds = page_soup.find_all('td', attrs={'class': 'prodSpecAtribute'})
parent_td = [td for td in possible_tds if 'Trading' in td.text][0]
Target = parent_td.fetchNextSiblings('td')[1].text
print(Target)
New = Target.split(" ")
print(New)
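Since the question is about removing items at several index positions from the split list, a minimal sketch; the positions below are hypothetical placeholders, as the real ones depend on the wording in the linked sheet:
drop = {0, 3, 5}  # hypothetical index positions to remove
wanted = [word for i, word in enumerate(New) if i not in drop]
print(wanted)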
