Importing only a specific part of the docx in Python - python-3.x

I am trying to extract the majority of my docx file when I am importing it to the Python. The best would be if I could tell my code which paragraphs I need or what part of the text I am going to use.
Can anyone help me with that?
I have tried this code:
import docx
doc = docx.Document('A.docx')
print(len(doc.paragraphs))
print (doc.paragraphs[2].text)
but the problem with this is that whenever I hit enter it thinks that a new paragraph has started.

Related

How to save the output of text from selenium chrome (Python)

I'm using Selenium for extracting comments of Youtube.
Everything went well. But when I print comment.text, the output is the last sentence.
I don't know who to save it for further analyze (cleaning and tokenization)
path = "/mnt/c/Users/xxx/chromedriver.exe"
This is the path that I saved and downloaded my chrome
chrome = webdriver.Chrome(path)
url = "https://www.youtube.com/watch?v=WPni755-Krg"
chrome.get(url)
chrome.maximize_window()
scrolldown
sleep = 5
chrome.execute_script('window.scrollTo(0, 500);'
time.sleep(sleep)
chrome.execute_script('window.scrollTo(0, 1080);')
time.sleep(sleep)
text_comment = chrome.find_element_by_xpath('//*[#id="contents"]')
comments = text_comment.find_elements_by_xpath('//*[#id="content-text"]')
comment_ids = []
Try this approach for getting the text of all comments. (the forloop part edited- there was no indention in the previous code.)
for comment in comments:
comment_ids.append(comment.get_attribute('id'))
print(comment.text)
when I print, i can see all the texts here. but how can i open it for further study. Should i always use for loop? I want to tokenize the texts but the output is only last sentence. Is there a way to save this .text file with the whole texts inside it and open it again? I googled it a lot but it wasn't successful.
So it sounds like you're just trying to store these comments to reference later. Your current solution is to append them to a string and use a token to create substrings? I'm not familiar with pythons data structures, but this sounds like a great job for an array or a list depending on how you plan to reference this data.

Vader Sentiment with multiple PDF

I have recently merged 20 pdf in 1 pdf via adobe. I have import the pdf in python with this code.
from PyPDF2 import PdfFileReader, PdfFileWriter
pdf_file = open ('/Users/cj/Desktop/PEI.pdf','rb')
newfile=open('rjtjj.txt','w')
pdf_reader= PdfFileReader (pdf_file)
pdf_writer= PdfFileWriter()
print(pdf_reader.numPages)
n=pdf_reader.getNumPages()
for i in range(0, n-1):
# pdf_writer.addPage(pdf_reader.getPage(i))
gft=pdf_reader.getPage(i)
newfile.write(gft.extractText())
pdf_file.close()
newfile.close()
I'm trying to use Vadersentiment to analyse the pdf. What i want to do is analyse individually the 20 pdf that are merged into 1.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
with open('rjtjj.txt', 'r') as f:
for line in f.read().split("\n"):
vs=analyzer.polarity_scores(line)
I know my code is wrong, because it only gives me the first line of the entire pdf. I am new to this, i would really appreciate your help.
Thank you
Your problem really isn't about Vader sentiment analysis -- it is about correct extraction of text from a PDF.
Postscript's forth interpreter is Turing-complete, so some PDF documents are "hard" to parse. You didn't post your PDF so we can only guess at the issue. You might try using poppler's pdftotext command line utility instead. Ubuntu calls the package "poppler-utils"; on mac you would use brew install poppler. Running through pdf2ps & ps2ascii will sometimes offer different, and helpful, results.
If you continue to find it difficult to retrieve proper text from the PDF, you may want to contact whoever produced the PDF and settle on supplying the same information in a revised format.

Autofill a word.docm file using python

I am trying to auto-fill a word.docm file using python.
I could find solution to automate word.docx filling. But with word.docm files, the code fails.
This is how the sample 'template.docm' file looks like.
I need to fill field1 and field2 from python.
I am attaching the code,that worked perfectly for docx files. Can anyone please suggest any edits or any other method that works for word.docm files using python?
Thanks in advance
from __future__ import print_function
from mailmerge import MailMerge
from datetime import date
template = "template.docm"
document = MailMerge(template)
print(document.get_merge_fields())
document.merge(field1='Name',field2='Address')
document.write('template-output.docm')

Extracting title from pdf using pypdf2 not working

I'm trying to extract the title of PDF files using pyPDF2. The output is either none or a wrong title. I tried using PDFminer as well, still the same result. I tried using 3 different pdf files. Is there a better way to extract the title with better accuracy?
This is the code I used:
from PyPDF2 import PdfFileReader
def get_pdf_title(pdf_file_path):
pdf_reader = PdfFileReader(open(pdf_file_path, "rb"))
return pdf_reader.getDocumentInfo().title
title = get_pdf_title('C:/PythonPrograms/Test.pdf')
print(title)
Your code is working, at least for me on python 3.5.2. Check in the PDF properties that he indeed has a title.
PDF's title is part of its metadata, that needs to be set. It is not mandatory, not related to its content (other than by the will of the person writing it), nor with its filename.
If you use your snippet on a file with no title, it's output will be an empty string.

Importing xlsx file with space in filename in Stata .do file

I am totally new to Stata and am wondering how to import .xlsx data in Stata. Let's say the data is in the subdirectory Data and has name "a b c.xlsx". So, from working directory, the data is in /Data
I am trying to do
import excel using "\Data\a b c.xlsx", sheet("a")
but it's not working
it's not working
is anything but a useful error report. For future questions, please report the exact error given by Stata.
Let's say the file is in the directory /home/roberto then
clear
set more off
import excel using "/home/roberto/a b c.xlsx"
list
should work.
If you are already in /home/roberto (which you can verify using display c(pwd)), then
import excel using "a b c.xlsx"
should work.
Using backslashes to refer to directories is not encouraged. See Stata tip 65: Beware the backstabbing backslash, by Nick Cox.
See also help cd.

Resources