I have a requirement to restrict users from editing specific paragraphs in a Word file.
I tried using Apache POI, but I can't figure out how to do it.
I found that the class org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPermStart can be used, but Eclipse says "The import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPermStart can't be resolved".
I'm using Apache POI version 3.9.
I have an issue using Tika for language detection (in Python). I first noticed that when I parse PDF files with parser.from_file(file), the language is not included in the metadata in most cases.
So I tried to detect the language explicitly, and in most cases I got "th" as the result, although my documents are in French. Then I copied the PDF content into a plain text file, and, strangely, the result was correct.
This is the code I used:
from tika import language
print(language.from_file(file))
Note that I installed tika with the command pip install tika, without any additional configuration. Is there anything wrong with the process I used?
From the documentation:
https://cwiki.apache.org/confluence/display/TIKA/TikaServer
"HTTP PUTs or POSTs a UTF-8 text file to the LanguageIdentifier to identify its language.
NOTE: This endpoint does not parse files. It runs detection on a UTF-8 string."
You should first parse the PDF and extract the text, then run the language identifier on that text:
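from tika import parser, language
# localhost_tika is assumed to hold the URL of a running Tika server
pdf = parser.from_file(file_path, localhost_tika)
text = pdf["content"]
detected_lang = language.from_buffer(text)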
I am trying to extract most of the text of my docx file when I import it into Python. Ideally, I could tell my code which paragraphs I need or which part of the text I am going to use.
Can anyone help me with that?
I have tried this code:
import docx
doc = docx.Document('A.docx')
print(len(doc.paragraphs))
print(doc.paragraphs[2].text)
The problem is that every place where I pressed Enter in the document is treated as the start of a new paragraph.
I am totally new to Stata and am wondering how to import .xlsx data. Let's say the data is in the subdirectory Data and is named "a b c.xlsx". So, relative to the working directory, the data is in /Data
I am trying to do
import excel using "\Data\a b c.xlsx", sheet("a")
but it's not working
it's not working
is anything but a useful error report. For future questions, please report the exact error given by Stata.
Let's say the file is in the directory /home/roberto. Then
clear
set more off
import excel using "/home/roberto/a b c.xlsx"
list
should work.
If you are already in /home/roberto (which you can verify using display c(pwd)), then
import excel using "a b c.xlsx"
should work.
Using backslashes to refer to directories is not encouraged. See Stata tip 65: Beware the backstabbing backslash, by Nick Cox.
See also help cd.
I want to calculate the distance between sensors deployed in a geographical area, using longitude and latitude, in a SPARQL query issued with Apache Jena 2.11. (Sensor descriptions and observations are stored as RDF triples in sensor.n3; I use Eclipse as the IDE on Fedora 19, with TDB as the triple store.)
I found that "Spatial searches with SPARQL" should help in this regard. But when I use the import given at http://jena.apache.org/documentation/query/spatial-query.html, import org.apache.jena.query.spatial.EntityDefinition, in Eclipse, I get the error "The import org.apache.jena.query cannot be resolved". When I browsed the ../apache-jena-2.11.1/javadoc-arq/org/apache/jena directory, it contained only
(atlas, common, web, riot); there is no query folder, which is why the import is highlighted in red.
I have one more question: does Apache Solr need to be installed (I have downloaded Solr 4.10.1), or is it enough to add the external jar to the build path?
You need to download jena-spatial separately. (Use Maven to manage your dependencies.) You can use Lucene instead of Solr. Again, Maven will load the dependencies. AndyS
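As a side note on the distance computation itself: if the query only returns each sensor's latitude and longitude, the great-circle distance can also be computed in application code afterwards. The sketch below uses the haversine formula and is independent of jena-spatial. The class name SensorDistance, the method name haversineKm, the sample coordinates, and the mean Earth radius of 6371 km are all assumptions made for illustration, and the result is only as accurate as a spherical Earth model allows.

```java
final class SensorDistance {
    // Assumed mean Earth radius in kilometres (spherical approximation).
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two points given in decimal degrees.
    public static double haversineKm(double lat1, double lon1,
                                     double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double sinLat = Math.sin(dLat / 2);
        double sinLon = Math.sin(dLon / 2);
        double a = sinLat * sinLat
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * sinLon * sinLon;
        // Clamp to guard against rounding slightly above 1 for near-antipodal points.
        a = Math.min(1.0, a);
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_KM * c;
    }

    public static void main(String[] args) {
        // Hypothetical sensor positions, for illustration only.
        double km = haversineKm(48.8566, 2.3522, 51.5074, -0.1278);
        System.out.println("Distance in km: " + km);
    }
}
```

This does not replace a spatial index: it only measures the distance between two known points, whereas jena-spatial is meant for searching within the SPARQL query itself.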
Hi, I'm using Apache POI 3.6.
I've already written some code:
XWPFDocument doc = new XWPFDocument(new FileInputStream(file));
wordxExtractor = new XWPFWordExtractor(doc);
text = wordxExtractor.getText();
System.out.println("adding docx " + file);
d.add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
Unfortunately, it produced an error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/dom4j/DocumentException
at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:149)
at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:136)
at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:54)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:98)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:199)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:178)
at org.apache.poi.util.PackageHelper.open(PackageHelper.java:53)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:98)
at org.apache.lucene.demo.Indexer.indexDocs(Indexer.java:153)
at org.apache.lucene.demo.Indexer.main(Indexer.java:88)
It seems that it used the constructor
XWPFWordExtractor(OPCPackage container)
rather than this one:
XWPFWordExtractor(XWPFDocument document)
Any idea why?
Or any idea how I can extract the text of the .docx and convert it into a String?
You need to add the dom4j library to your classpath or your project libraries.
It looks like you don't have all of the dependencies on your classpath.
If you look at http://poi.apache.org/overview.html you'll see that dom4j is a required library when working with the OOXML files. From the exception you got, it seems that you don't have it... If you look in the POI binary download, you should find it in the ooxml-libs subdirectory.
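One quick way to confirm this diagnosis is to ask the JVM whether the class named in the exception can be loaded at all. The following is a minimal sketch and not part of POI; the class name ClasspathCheck and the method isOnClasspath are made up for illustration.

```java
final class ClasspathCheck {
    // Returns true if the named class can be found by this class's class loader.
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class reported as missing in the stack trace above.
        String name = "org.dom4j.DocumentException";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```

Run it with the same classpath as the failing program. If it reports false for org.dom4j.DocumentException, the dom4j jar is missing from that classpath.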
You could try docx4j instead; see http://dev.plutext.org/svn/docx4j/trunk/docx4j/src/main/java/org/docx4j/TextUtils.java