Language detection from tika-python does not work - python-3.x

I have an issue with the use of Tika for language detection in Python. I first noticed that when I parse PDF files with parser.from_file(file), the language was not included in the metadata in most cases.
So I tried to detect the language explicitly, and in most cases I got "th" as the result, while my documents are in French. Then I copied the PDF file's content into a plain text file and, strangely, the result was correct.
This is the code I used:
from tika import language
print(language.from_file(file))
Note that I just installed Tika with the command pip install tika, without any additional configuration. Is there anything wrong with the process I used?

From the documentation:
https://cwiki.apache.org/confluence/display/TIKA/TikaServer
"HTTP PUTs or POSTs a UTF-8 text file to the LanguageIdentifier to identify its language.
NOTE: This endpoint does not parse files. It runs detection on a UTF-8 string."
You should first parse the PDF and extract the text, then run the language identifier:
from tika import parser, language

pdf = parser.from_file(file_path, localhost_tika)  # localhost_tika: URL of a running Tika server endpoint
text = pdf["content"]
detected_lang = language.from_buffer(text)
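You can also query the Tika server's REST API directly with the extracted text. A minimal sketch, assuming tika-python has started its server on the default address localhost:9998 and that the language endpoint is /language/string, as described in the wiki page quoted above; "document.pdf" is a placeholder path:
import requests
from tika import parser

pdf = parser.from_file("document.pdf")  # placeholder path
text = pdf["content"] or ""

# PUT the extracted UTF-8 text to the language identifier endpoint
resp = requests.put(
    "http://localhost:9998/language/string",  # assumed default server address
    data=text.encode("utf-8"),
)
print(resp.text)  # expected to print a language code such as "fr" for a French document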

Related

Js-Report not displaying other languages when deployed on Lambda

I have used js-report to generate a PDF containing Thai-language text.
My whole process is:
1. Get a CSV file from S3.
2. Read the data and convert the CSV to an object.
3. Send the object to jsreport.
I have already implemented the whole process locally and it works fine.
But when I deploy this project on Lambda (you can attach jsreport to Lambda; more details at this URL: jsreport-aws-lambda) and test it, the output does not display the Thai language (and I think probably other languages too).
At first I thought it was because of the ('base64') encoding; I tried to change it to UTF-8, but then the file was corrupted.
I have already set the meta content of the HTML file, but it still doesn't work.
What can I do to solve this? Please help me.
Thanks.

How to remove unicode characters without removing parts of text

I am trying to do an N-gram analysis on an ancient language that does not have a modern orthography, and I am running into an encoding problem.
The orthography looks like the following.
It is contained in a .docx document, and I use the following code to retrieve it:
Text = docx2txt.process(Corpus)
print(Text)
When I put it into a dictionary, it spits out the following:
"daniel.mahabi": {"xq\u2019xucubaquibms.xqui\u00e7i\ua72dih": 1.0},
I can partially resolve this with the following code:
Text = docx2txt.process(Corpus)
Text = Text.encode("ascii", "ignore")  # drops every byte outside the ASCII range
Text = Text.decode()
However, doing that also removes parts of the text.
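A minimal sketch of why that happens, using the escaped string from the dictionary output above: encoding with errors="ignore" silently discards every code point outside ASCII, which here is exactly the orthographic detail you want to keep.
sample = "xq\u2019xucubaquibms.xqui\u00e7i\ua72dih"

stripped = sample.encode("ascii", "ignore").decode()
print(sample)    # xq’xucubaquibms.xquiçiꜭih
print(stripped)  # xqxucubaquibms.xquiiih  <- the non-ASCII letters are simply gone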
What can I do to resolve this?

How to convert an asciidoc with Cyrillic symbols to PDF?

I have created an .adoc file and want to convert it to PDF.
I am using Debian Linux and GEdit for writing the .adoc. I followed all the steps to install asciidoctor-pdf, RVM, etc. from here. It actually works, but for some reason it doesn't want to convert my file.
I downloaded a README for the Asciidoctor PDF converter and tried to convert it to PDF to see if everything was working correctly, and it converted with no problem.
When I try to convert my own file, only the title is converted and that's it. When I convert it to HTML, I get a bit more - the line with the name of the author (my name). I can't understand what's wrong.
I even tried to rewrite the same file from scratch, with no result.
Here is a sample of my file:
= Как купить билет на сайте РЖД
Маркиев Владимир <grolribasi#gmail.com>
:hide-uri-scheme:
:imagesdir: img
ifdef::env-github[]
:importatnt-caption: :warning:
:source-highlighter: rouge
Инструкция:: Данная инструкция поможет вам приобрести билет най сайте hhtps://rzd.ru
. Наберите в адресной строке браузера rzd.ru, откроется главная страница сайта.
+
--
image::1.png[главная страница]
--
After some reading, I found out that this is due to the Cyrillic characters in my document. Asciidoctor now supports Cyrillic characters in documents, but you need to specify the font family as "Noto Serif" in the header:
base-font-family: Noto Serif
I added that line to set the font family, but it still doesn't convert to PDF.
I guess the main question now is: how do I use Cyrillic characters in AsciiDoc?
This ifdef
ifdef::env-github[]
is never closed.
Try closing it like this:
ifdef::env-github[]
:importatnt-caption: :warning:
:source-highlighter: rouge
endif::env-github[]

Python: Universal XML parser

I'm trying to make a simple Python 3 program that reads weather information from an XML web source, converts it into a Python-readable object (maybe a dictionary), and processes it (for example, visualizes multiple observations in a graph).
The source of the data is the national weather service's (direct translation) XML file at the link provided in the code.
What makes this different from the typical XML-parsing questions on Stack Overflow is that there are repeated tags without an in-tag identifier (the <station> tags in my example) and some with one (1st line, <observations timestamp="14568.....">). I would also like to parse it straight from the website, not from a local file; of course, I could create a temporary local file too.
What I have so far is simply a loading script that gives a string containing the XML for both the forecast and the latest weather observations.
from urllib.request import urlopen
#Read 4-day forecast
forecast= urlopen("http://www.ilmateenistus.ee/ilma_andmed/xml/forecast.php").read().decode("iso-8859-1")
#Get current weather
observ=urlopen("http://www.ilmateenistus.ee/ilma_andmed/xml/observations.php").read().decode("iso-8859-1")
In short, I'm looking for as universal a way as possible to parse XML into a Python-readable object (such as a dictionary/JSON or a list) while preserving all of the information in the XML file.
P.S. I would prefer a standard Python 3 module such as xml, which I couldn't make sense of.
Try the xmltodict package for a simple conversion of an XML structure to a Python dict: https://github.com/martinblech/xmltodict
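A minimal sketch of how that could look for the observations feed above; the keys "observations", "@timestamp" and "station" are assumptions based on the structure described in the question, not verified against the feed:
from urllib.request import urlopen

import xmltodict  # pip install xmltodict

xml_text = urlopen("http://www.ilmateenistus.ee/ilma_andmed/xml/observations.php").read().decode("iso-8859-1")
data = xmltodict.parse(xml_text)

# Attributes get an '@' prefix; repeated tags such as <station> become a list.
print(data["observations"]["@timestamp"])
for station in data["observations"]["station"]:
    print(station)  # each station is an OrderedDict of its child elements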

how to verify links in a PDF file

I have a PDF file and I want to verify whether the links in it are proper. Proper in the sense that all URLs specified are linked to web pages and nothing is broken. I am looking for a simple utility or a script which can do this easily.
Example:
$ testlinks my.pdf
There are 2348 links in this pdf.
2322 links are proper.
Remaining broken links, and the page numbers on which they appear, are logged in brokenlinks.txt
I have no idea whether something like that exists, so I googled and searched on Stack Overflow as well, but did not find anything useful yet. So I would like to know if anyone has any idea about it!
Updated: to make the question clear.
You can use pdf-link-checker
pdf-link-checker is a simple tool that parses a PDF document and checks for broken hyperlinks. It does this by sending simple HTTP requests to each link found in a given document.
To install it with pip:
pip install pdf-link-checker
Unfortunately, one dependency (pdfminer) is broken. To fix it:
pip uninstall pdfminer
pip install pdfminer==20110515
I suggest first using the Linux command-line utility pdftotext - you can find the man page here:
pdftotext man page
The utility is part of the Xpdf collection of PDF processing tools, available on most Linux distributions. See http://foolabs.com/xpdf/download.html.
Once installed, you can process the PDF file with pdftotext:
pdftotext file.pdf file.txt
Once processed, write a simple Perl script that searches the resulting text file for http URLs and retrieves them using LWP::Simple. LWP::Simple->get('http://...') will allow you to validate the URLs with a code snippet such as:
use LWP::Simple;
$content = get("http://www.sn.no/");
die "Couldn't get it!" unless defined $content;
That would accomplish what you want to do, I think. There are plenty of resources on how to write regular expressions to match http URLs, but a very simple one would look like this:
m/http[^\s]+/i
"http followed by one or more not-space characters" - assuming the URLs are property URL encoded.
There are two lines of enquiry in your question.
Are you looking for regex verification that the link contains key information such as http:// and valid TLD codes? If so, I'm sure a regex expert will drop by, or have a look at regexlib.com, which contains lots of existing regexes for dealing with URLs.
Or, if you want to verify that a website exists, then I would recommend Python + Requests, as you could script out checks to see whether websites exist and don't return error codes.
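A minimal sketch of that approach, assuming the URLs have already been extracted into a list (the urls variable here is a placeholder):
import requests

urls = ["http://www.sn.no/", "http://example.com/broken"]  # placeholder list of extracted URLs

for url in urls:
    try:
        # a HEAD request is usually enough to tell whether the page exists
        resp = requests.head(url, allow_redirects=True, timeout=10)
        ok = resp.status_code < 400
    except requests.RequestException:
        ok = False
    print(url, "OK" if ok else "BROKEN")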
It's a task I'm currently undertaking for pretty much the same purpose at work. We have about 54k links to process automatically.
Collect links by:
enumerating links using an API, dumping the document as text and linkifying the result, or saving it as HTML with PDFMiner.
Make requests to check them:
there is a plethora of options depending on your needs.
https://stackoverflow.com/a/42178474/1587329's advice was the inspiration for writing this simple tool (see gist):
'''loads pdf file in sys.argv[1], extracts URLs, tries to load each URL'''
import urllib
import sys
import PyPDF2

# credits to stackoverflow.com/questions/27744210

def extract_urls(filename):
    '''extracts all urls from filename'''
    PDFFile = open(filename, 'rb')
    PDF = PyPDF2.PdfFileReader(PDFFile)
    pages = PDF.getNumPages()
    key = '/Annots'
    uri = '/URI'
    ank = '/A'
    for page in range(pages):
        pageSliced = PDF.getPage(page)
        pageObject = pageSliced.getObject()
        if pageObject.has_key(key):
            ann = pageObject[key]
            for a in ann:
                u = a.getObject()
                if u[ank].has_key(uri):
                    yield u[ank][uri]

def check_http_url(url):
    urllib.urlopen(url)

if __name__ == "__main__":
    for url in extract_urls(sys.argv[1]):
        check_http_url(url)
Save it as filename.py and run it as python filename.py pdfname.pdf (note that this is Python 2 code: urllib.urlopen and dict.has_key do not exist in Python 3).
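For Python 3, here is a minimal sketch of the same idea; it assumes a current pypdf release (the successor of PyPDF2), so the class and method names differ from the snippet above:
'''Loads the PDF given in sys.argv[1], extracts URLs and tries to fetch each one.'''
import sys
import urllib.request

from pypdf import PdfReader  # pip install pypdf

def extract_urls(filename):
    '''Yields every /URI link annotation found in the PDF.'''
    reader = PdfReader(filename)
    for page in reader.pages:
        if "/Annots" not in page:
            continue
        for annot in page["/Annots"]:
            obj = annot.get_object()
            if "/A" in obj and "/URI" in obj["/A"]:
                yield obj["/A"]["/URI"]

def check_http_url(url):
    '''Raises urllib.error.URLError (or HTTPError) if the URL cannot be fetched.'''
    urllib.request.urlopen(url, timeout=10)

if __name__ == "__main__":
    for url in extract_urls(sys.argv[1]):
        check_http_url(url)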
