Loop url from excel file, download pdf files and name them with combination of multiple columns in Python - python-3.x

Given the test data from this link:
I would like to read the Excel file, loop through all the URLs, download the PDF files, and name each one with the combination city-type-year-quarter.pdf, i.e. for the first file it would be guangzhou-retail-2021-q2.pdf.
How could I do that based on the code below? Thanks.
Updated code:
df = pd.read_excel('test1.xlsx')
urls = df['url'].tolist()
# df.columns
for index, row in df.iterrows():
    with open(f"{}_{}_{}_{}.pdf".format(row.city, row.type, row.year, row.quarter), "wb") as f:
        f.write(requests.get(row['url']).content)
Out:
SyntaxError: f-string: empty expression not allowed
Reference link:
Loop url from dataframe and download pdf files in Python
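A hedged fix, assuming the sheet really has city, type, year, quarter and url columns: the SyntaxError comes from mixing an f-string prefix with .format(), since empty {} placeholders are not allowed inside f-strings. Use one or the other, and join the parts with hyphens to match the requested city-type-year-quarter.pdf pattern:
import pandas as pd
import requests

df = pd.read_excel('test1.xlsx')

for _, row in df.iterrows():
    # A plain f-string with real expressions inside the braces
    filename = f"{row['city']}-{row['type']}-{row['year']}-{row['quarter']}.pdf"
    with open(filename, "wb") as f:
        f.write(requests.get(row['url']).content)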

Related

PDF parser in pdfs with multiple images and formats with python and tabula (open to other options)

So first off, what I'm trying to do: create a PDF parser that will take ONLY tables out of any given PDF. I currently have some PDFs that are parts manuals, which contain an image of the part followed by a table with the part details, and I want to scrape and parse the table data from the PDF into a CSV or similar Excel-style file (csv, xls, etc.).
What I've tried / am trying: I am currently using python3 and tabula (I have no preference for either of these and am open to other options). I have a py program that can scrape all the data from any PDF or directory of PDFs; however, it takes EVERYTHING, including the image-file code, which comes out as a bunch of 0 1 NaN (examples at the bottom). I was thinking of writing a filter function that removes these, but that feels like overkill, and I was wondering/hoping there is a way to filter out the images with tabula or another library. (Side note: I've also attempted camelot, but the module does not import correctly even though it is in my pip freeze, and this happens on both my Mac M1 and Mac M2, so I assume there is no ARM support.)
If anyone could help me, or guide me toward a library or method for iterating through all pages in a PDF and JUST grabbing the tables for export to CSV, that would be AMAZING!
current main file:
from tabula.io import read_pdf
from traceback import print_tb
import pandas as pd
from tabulate import tabulate
import os

def parser(fileName, count):
    print("\nFile Number: ", count, "\nNow parsing file: ", fileName)
    df = read_pdf(fileName, pages="all")  # address of pdf file
    for i in range(len(df)):
        df[i].to_excel("./output/test" + str(i) + ".xlsx")
    print(tabulate(df))
    print_tb(df)

def reader(type):
    filecount = 1
    if type == 'f':
        file = input("\nFile(f) type selected\nplease enter full file name with path (ex. Users/Name/directory1/filename.pdf): ")
        parser(file, filecount)
    elif type == 'd':
        # directory selected
        location = input("\nPlease enter directory path, if in the same folder just enter a period(.)")
        print("Opening directory: ", location)
        # loop through and parse directory
        for filename in os.listdir(location):
            f = os.path.join(location, filename)
            # checking if it is a file
            if os.path.isfile(f):
                parser(f, filecount)
                filecount += 1  # was 'filecount + 1', which discards the result
            else:
                print('\n\n ERROR, path given does not contain a file or is not a directory type..')
    else:
        print("Error: please select directory(d) or file(f)")

fileType = input("\n-----> Hello!\n----> Would you like to parse a directory(d) or file(f)?").lower()
reader(fileType)
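One hedged way to approach the filtering idea from the question: read_pdf returns a plain list of DataFrames, so a small heuristic can keep only the frames that look like real tables and drop the image residue that is mostly 0 1 NaN. The threshold below is an assumption to tune, and manual.pdf is a hypothetical file name:
import pandas as pd
from tabula.io import read_pdf

def looks_like_table(df, max_nan_ratio=0.5):
    # Heuristic: a real table has at least two columns and is not
    # dominated by NaN cells, which image residue usually is.
    if df.empty or df.shape[1] < 2:
        return False
    return df.isna().to_numpy().mean() <= max_nan_ratio

frames = read_pdf("manual.pdf", pages="all")  # hypothetical file name
tables = [df for df in frames if looks_like_table(df)]
for i, df in enumerate(tables):
    df.to_csv("./output/table" + str(i) + ".csv", index=False)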

How to convert a particular sheet in excel file to pdf using python

There is a list of Excel files in a directory. The input is a list of sheet names that have to be converted to PDF. So my code has to open each Excel file, look for the particular sheet, and convert that one sheet to PDF. Can anybody suggest which library to use and an approach for this? How can I use a variable holding the list of all the required sheet names from all the Excel files as an argument to open the required sheets? Thank you.
INPUT: file1.xls file2.xls file3.xls
sheets in file1: Title, Contents, Summary
sheets in file2: Title, Contents, Summary
sheets in file3: Title, Contents, Summary
Required sheet in file1: Title
Required sheet in file2: Contents
Required sheet in file3: Summary
OUTPUT:
file1_Title.pdf
file2_Contents.pdf
file3_Summary.pdf
Approach: I have a python list with all the sheets in each excel file. And a python list which contains the required sheet to be converted.
import xlrd

book = xlrd.open_workbook(PathforInputFile)
AllSheets = book.sheet_names()
RequiredSheet = line.split("\t")  # 'line' comes from reading the requirements list (not shown)
Code Output:
['Title', 'Contents', 'Summary']
['Title']
['Title', 'Contents', 'Summary']
['Contents']
['Title', 'Contents', 'Summary']
['Summary']
Openpyxl and aspose-cells seem to be the most relevant, or at least the best general Excel options available that I could find.
This is an article I found: https://blog.aspose.com/2021/04/02/convert-excel-files-to-pdf-in-python/
But I would also recommend going through the documentation of the two libraries I suggested; I think they could get you on the right track.
For going through a directory of files, use glob (the snippet below uses the glob2 package):
import os
import glob2

dir = "..."  # root directory path, without file names
# '*.csv' can be changed to the file extension of choice, like '*.xlsx', etc.
for f_csv in glob2.iglob(os.path.join(dir, '*.csv')):
    ...  # run your ops here per file
Then you can add the base framework, so that you're not re-coding the same steps for every file of the same type. I used openpyxl and pandas, but once you get the worksheet open (sheet_by_index(0) in xlrd) you would pick up right where I left off:
import os
import glob2
import pandas as pd
from openpyxl import load_workbook

dir = "..."  # root directory path, without file names
# load_workbook reads .xlsx workbooks, so glob for those rather than '*.csv'
for f_xlsx in glob2.iglob(os.path.join(dir, '*.xlsx')):
    wb = load_workbook(f_xlsx)
    # Access a worksheet named 'no_header'
    ws = wb['no_header']
    # Convert to DataFrame
    df = pd.DataFrame(ws.values)
Now the last part can be done differently, but I like to convert the sheet into pandas, then use df.to_html() to get it onto a website for download.
df.to_html(buf=None, columns=None, col_space=None, header=True, index=True, na_rep='NaN', formatters=None, float_format=None, sparsify=None, index_names=True, justify=None, max_rows=None, max_cols=None, show_dimensions=False, decimal='.', bold_rows=True, classes=None, escape=True, notebook=False, border=None, table_id=None, render_links=False, encoding=None)
I would read the docs on pandas.DataFrame.to_html() if the args don't make sense or you want to customize the output.
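Putting those pieces together, a minimal sketch: the file-to-sheet mapping is hypothetical, it assumes .xlsx files (openpyxl does not read legacy .xls), and it writes HTML rather than PDF, following the to_html() suggestion above:
import os
import pandas as pd
from openpyxl import load_workbook

# Hypothetical mapping of workbook -> required sheet, per the question
required = {
    "file1.xlsx": "Title",
    "file2.xlsx": "Contents",
    "file3.xlsx": "Summary",
}

for path, sheet in required.items():
    wb = load_workbook(path)
    if sheet not in wb.sheetnames:
        continue  # skip workbooks that lack the required sheet
    df = pd.DataFrame(wb[sheet].values)
    out = os.path.splitext(path)[0] + "_" + sheet + ".html"
    df.to_html(out, header=False, index=False)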

I want to create a corpus in python from multiple text files

I want to do text analytics on some text data. The issue is that so far I have worked with a CSV file or just one file, but here I have multiple text files. So my approach is to combine them all into one file, then use nltk to do text preprocessing and further steps.
I tried to download the gutenberg package from nltk, and I am not getting any error in the code. But I am not able to see the content of the 1st text file in cell 1, the 2nd text file in cell 2, and so on. Kindly help.
import nltk

filenames = [
    "246.txt",
    "276.txt",
    "286.txt",
    "344.txt",
    "372.txt",
    "383.txt",
    "388.txt",
    "392.txt",
    "556.txt",
    "665.txt"
]

with open("result.csv", "w") as f:
    for filename in filenames:
        f.write(nltk.corpus.gutenberg.raw(filename))
Expected result: I should get one CSV file with the contents of these 10 text files listed in 10 different rows.
import nltk

filenames = [
    "246.txt",
    "276.txt",
    "286.txt",
    "344.txt",
    "372.txt",
    "383.txt",
    "388.txt",
    "392.txt",
    "556.txt",
    "665.txt"
]

with open("result.csv", "w") as f:
    for index, filename in enumerate(filenames):
        f.write(nltk.corpus.gutenberg.raw(filename))
        # Append a comma to the file content when
        # filename is not the last file in the list.
        if index != (len(filenames) - 1):
            f.write(",")
Output:
this,is,a,sentence,spread,over,multiple,files,and,the end
Code and .txt files available at https://github.com/michaelhochleitner/stackoverflow.com-questions-57081411 .
Using Python 2.7.15+ and nltk 3.4.4. I had to move the .txt files to /home/mh/nltk_data/corpora/gutenberg.
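If the goal really is one row per file, as the question states, a hedged alternative (a Python 3 sketch) is the csv module, whose writer quotes embedded commas and newlines so each file's full text survives as a single field:
import csv
import nltk

filenames = ["246.txt", "276.txt", "286.txt"]  # same list as above, abbreviated

with open("result.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for filename in filenames:
        # One row per file; csv.writer handles the quoting.
        writer.writerow([nltk.corpus.gutenberg.raw(filename)])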

removing extra column in a csv file while exporting data using python3

I wrote a function in python3 which merges some files in the same directory and returns a CSV file as the output. The problem with the CSV file is that I get one extra column at the beginning which does not have a header, and the rows of that column are numbers starting from 0. Do you know how I can write the CSV file without the extra column?
You can split each line on ',' and then use slicing to remove the first element.
example:
original = """col1,col2,col3
0,val01,val02,val03
1,val11,val12,val13
2,val21,val22,val23
"""
original_lines = original.splitlines()
result = original_lines[:1] # copy header
for line in original_lines[1:]:
    result.append(','.join(line.split(',')[1:]))
print('\n'.join(result))
Output:
col1,col2,col3
val01,val02,val03
val11,val12,val13
val21,val22,val23
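That said, if the merged file is produced with pandas, the unnamed leading column of 0, 1, 2, ... is almost certainly the DataFrame index, and the cleaner fix is to suppress it when writing (a sketch with hypothetical input names):
import pandas as pd

df = pd.concat(pd.read_csv(p) for p in ["a.csv", "b.csv"])  # hypothetical inputs
df.to_csv("merged.csv", index=False)  # index=False drops the unnamed 0, 1, 2, ... column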

Rename images in file based on csv in Python

I have a folder with a couple thousand images named 10000.jpg, 10001.jpg, etc., and a CSV file with two columns: id and name.
The CSV id matches the images in the folder.
I need to rename the images as per the name column in the CSV (e.g. from 10000.jpg to name1.jpg).
I've been trying os.rename() inside a for loop, as per below.
import csv
import os

with open('train_labels.csv') as f:
    lines = csv.reader(f)
    for line in lines:
        os.rename(line[0], line[1])
This gives me an encoding error inside the loop.
Any idea what I'm missing in the logic?
I also tried another strategy (below), but got the error IndexError: list index out of range.
with open('train_labels.csv', 'rb') as csvfile:
    lines = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for line in lines:
        os.rename(line[0], line[1])
I also got the same error. When I opened the CSV file in Notepad, I found that there was no comma between id and name, so please check that. Otherwise, you can see the solutions in Renaming images in folder.
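A hedged working sketch, assuming a comma-separated file with id and name columns, the images sitting in an images/ folder (a hypothetical path), and that the .jpg extension has to be re-attached to both names:
import csv
import os

folder = "images"  # hypothetical folder holding 10000.jpg, 10001.jpg, ...

with open("train_labels.csv", newline="") as f:
    reader = csv.reader(f)  # default delimiter is ','
    next(reader)            # skip the header row, if there is one
    for image_id, name in reader:
        src = os.path.join(folder, image_id + ".jpg")
        dst = os.path.join(folder, name + ".jpg")
        if os.path.exists(src):  # avoid FileNotFoundError on stale rows
            os.rename(src, dst)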
