Efficient Way of Recording Page Numbers from a Search of a PDF

I have a list of ~1200 queries (part numbers) that appear somewhere in a 100-page PDF. What I need to do is record which pages each query appears on in the PDF. I can't think of a clever way of doing this. Doing it search by search would take me 5-20 hours, so if someone can give me a good idea before the 5-hour mark, that would be great!

Assuming you can determine programmatically, from the plain text, what a "query" is in your context (for example, by using regular expressions):
You could split your PDF into separate files (one file per page) using pdftk:
http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
Then convert those files to text with a pdf-to-text utility like this one:
http://www.fileguru.com/PDF-To-TXT-Converter/download
or this one
http://www.pdf2text.com/
Finally, write yourself a simple script in your favorite programming language to determine which of those files contain a "query" (whatever that looks like).
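As a rough sketch of that last step, assuming each page's text has already been read into a Python list (one string per page, in page order) and that a "query" is a literal part number: the function name pages_for_queries and the whole-token matching rule are illustrative assumptions, not part of any tool mentioned above.

```python
import re


def pages_for_queries(page_texts, queries):
    """Map each query to the 1-based page numbers on which it appears.

    page_texts: list of strings, one per page, in page order.
    queries:    list of literal part numbers to look for.
    """
    # Compile one pattern per query. The lookarounds are an assumption:
    # they stop "AB-100" from matching inside a longer token like "AB-1001".
    patterns = {
        query: re.compile(r"(?<![\w-])" + re.escape(query) + r"(?![\w-])")
        for query in queries
    }
    result = {query: [] for query in queries}
    for page_number, text in enumerate(page_texts, start=1):
        for query, pattern in patterns.items():
            if pattern.search(text):
                result[query].append(page_number)
    return result
```

With about 1200 part numbers and 100 pages this is on the order of 120,000 searches, which should finish in seconds. Part numbers that come back with an empty list are worth checking by hand, since text extraction can split or reformat a number.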

Related

Find a file with all given strings on Linux

This is similar to the question "How do I find all files containing specific text on Linux?", but I want all the files that contain several given strings (the strings are not necessarily next to each other or on the same line, just in the same file).
My use case is that I am looking at a UI and want to modify the file that controls a particular screen. The codebase is huge, though, and the file is difficult to locate. All I have to go on are some of the hardcoded strings on this screen, which I would like to search for. The strings are quite generic, though, such as 'Done', 'Close', 'View Details'... Searching for any of these strings individually, using the answer from the linked question above, brings back too many results, but I think searching for all of them together will narrow it down enough to find the file.

How can I edit a DXF in node.js?

I'd like to make a custom lasered label from a user's input on a website. I have a template DXF file and I'd like to replace placeholder text with the user input. My problem is that the DXF file format is very unreadable in its text form. Is there any way to make sense of the numeric data? If not, are there any other formats (SVG, etc.) that would be easier to work with?
EDIT: The reason I've found it unreadable in terms of text is that the program (SolidWorks) converted the text to curves. At this point I'm trying to figure out how to prevent that.

Autodesk was nice enough to document the DXF syntax in great detail. Spend a couple of hours with the documentation at the link below, and I think you will find DXF quite easy to parse and edit in code.
To just replace some placeholder text, it should be as simple as reading the DXF file into a string (a DXF file is no different from a text file), performing a text replace operation, and saving the result back to a file. Just make sure that your placeholder text is unique and does not appear in any of the keywords in the document below (otherwise your DXF file will get corrupted). Something like "PlaceHolderText" will do the trick.
http://images.autodesk.com/adsk/files/autocad_2012_pdf_dxf-reference_enu.pdf
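A minimal sketch of that replace step, written in Python rather than node.js and assuming the template really contains the placeholder as text (not as curves). The function name fill_placeholder and its checks are illustrative assumptions; the same few operations translate directly to any language.

```python
def fill_placeholder(dxf_text, placeholder, user_text):
    """Return the DXF text with every occurrence of placeholder replaced."""
    if placeholder not in dxf_text:
        # Usually means the text was exported as curves, or the name is wrong.
        raise ValueError("placeholder not found in the DXF text")
    # DXF stores each group code and each value on its own line, so a line
    # break inside the replacement would corrupt the file structure.
    if "\n" in user_text or "\r" in user_text:
        raise ValueError("replacement text must be a single line")
    return dxf_text.replace(placeholder, user_text)
```

Reading the template into a string and writing the result back out is ordinary file I/O; rejecting multi-line input is a cheap guard when the text comes from a website form.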
Edit: More Info
I do a lot of work with Autodesk Inventor, which is in direct competition with SolidWorks, so they are effectively the same tool. We faced a similar problem: we needed to place text onto sheet metal flat pattern DXFs that came out of Inventor in order to identify the part, but Inventor simply could not do it (see, exactly the same!). One of our developers had the idea of placing a very precise geometry punch onto the flat pattern. After the DXF was generated, code he wrote parsed the DXF file and replaced the geometry with a text entity. More specifically, we used a triangle with each side length defined to something like the 7th decimal place. You can then use one of the vertices of the triangle to position the text, including rotation. The process is automatic, so once you have written the code with the help of the document above (which won't take long), it will just work. If your engraver can handle text the way you want it, I'd say this is a very good solution. We generate hundreds of parts every day using this code. Hope this helps.

After using pdftotext: find page of string from txt

I am currently coding in Python and have managed to use pdftotext to extract the text from a PDF.
The resulting text is split into a list of strings. Using regular expressions, I am able to find the specific words I am interested in. The reason I split the text into a list is that I want to measure the distance between two specific words, and by distance I mean the number of words between the two.
However, after finding the positions of the words, I would like to be able to refer back to the original PDF. Specifically, I am interested in the page and maybe even the line (if PDF supports this kind of structure) where these words are located.
One idea I have is to do this process for each page of the PDF, so that when I find these words I know what page they were on. But this has the big disadvantage that page breaks do not necessarily fall at natural boundaries. That is, I would lose the ability to find the words if they happen to be separated by a page break.
Do you have any idea how to do this in a more sophisticated manner?

You'll need a more sophisticated library than the one you're using. The Datalogics PDF Java Toolkit has several classes that can extract text from a PDF file. The one you use depends on what you want to do with the text after extraction. The ReadingOrderTextExtractor will create a list of lists that allows you to extract the text and examine the content of paragraphs, the sentences within those paragraphs, and the words within those sentences. You'll be able to tell not only the distance between the words but also whether they are in the same sentence or paragraph. Once you've found a Word object, you can then find both its location on the page, allowing for highlighting, and the page number it's on.
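If staying with pdftotext is acceptable, a lighter alternative to the library above is to rely on the form feed character that pdftotext writes at the end of each page by default (it is suppressed by the -nopgbrk option). The sketch below is not the Datalogics approach; it keeps one continuous word list so that distances can span page breaks, while tagging every word with its page. The function names are illustrative assumptions.

```python
def words_with_pages(text):
    """Return a list of (word, page_number) pairs from pdftotext output.

    Pages are assumed to be separated by form feed characters ("\f").
    """
    tagged = []
    for page_number, page in enumerate(text.split("\f"), start=1):
        for word in page.split():
            tagged.append((word, page_number))
    return tagged


def words_between(tagged, first, second):
    """Count the words strictly between the first occurrence of `first`
    and the next occurrence of `second`, and report both page numbers.

    Raises ValueError if either word is missing.
    """
    words = [word for word, _ in tagged]
    i = words.index(first)
    j = words.index(second, i + 1)
    return j - i - 1, tagged[i][1], tagged[j][1]
```

Because the word list is built across the whole document, two words separated by a page break are still found, and the returned page numbers show that they sit on different pages. Line numbers could be tracked the same way by splitting each page on newlines first.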

Is it possible to find the page and/or line number where a given text was found using full-text search and FILESTREAM?

I just started using the FILESTREAM and full-text search technologies available in Microsoft SQL Server. I can index and search txt and PDF files. However, when I get the results I can't see the matching text, nor the page and/or line number where that text was found inside the PDF, for example. Is it possible to at least retrieve the text from the document when a search is made? I believe it's not possible to return a "region" of text, but maybe there is something I can use to look it up in the file afterwards?
I'm trying to figure out the advantages of doing a search like this if I can't see the text that was found.

After doing a lot of research, I concluded that it isn't possible to search for a given page in an indexed PDF, so I decided to use Solr instead and index the information the way I need to search it later.

Serialized Printing Method

I am looking for a method by which I can print one document and have a field that is incremented on each copy printed. I currently run Linux, so bash in concert with several programs might be the way to go, but I'm just not sure where to start.
I have a document used in our business that is currently hand-stamped for serialization... We would like to simply print the copies, but can't find a way to increment a specific field. I would like to use either a PDF or an ODF/ODT for the document.
Thanks for any help you can give!

How is the document produced in the first place?
If you control that process, you could certainly add serialization at that level. For instance, if you use LibreOffice, you could do it in LibreOffice. If you use a text formatter (like LaTeX, Lout, ...), just emit the formatting instructions (e.g. the .tex or .lout source file) with a unique counter in each copy (perhaps simplest to do in a scripting language like Python or OCaml).
Then run the relevant tool to get a .pdf file.
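A small sketch of the "emit the source with a counter" idea, assuming the document source is available as a text template containing a marker where the serial number belongs. The marker @SERIAL@, the function name render_copies, and the zero-padded format are all illustrative assumptions.

```python
def render_copies(template, first_serial, copies, width=6):
    """Return one filled-in source text per copy, with consecutive serials.

    template:     document source containing the marker "@SERIAL@"
    first_serial: number to put on the first copy
    copies:       how many copies to generate
    width:        serial numbers are zero-padded to this many digits
    """
    marker = "@SERIAL@"  # hypothetical marker; pick one your source never uses
    if marker not in template:
        raise ValueError("template does not contain the serial marker")
    return [
        template.replace(marker, str(number).zfill(width))
        for number in range(first_serial, first_serial + copies)
    ]
```

Each returned string can be written to its own source file and passed through the formatter. Storing the last serial used (for example in a small text file) lets the next print run continue from where the previous one stopped.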
