How to extract text from a PDF using MuPDF?

I want to extract text from a PDF and lay it out again.
My code is the following:
BOOL CTextEditorDoc::loadTxt()
{
if(m_strPDFPath.IsEmpty())
return FALSE;
#ifdef _DEBUG
DWORD dwTick = GetTickCount();
CString strLog;
#endif
CString strFile;
fz_context *ctx;
fz_document* doc;
fz_matrix ctm;
fz_page *page;
fz_device *dev;
fz_text_page *text;
fz_text_sheet *sheet;
int i,line,rotation,pagecount;
if(!gb2312toutf8(m_strPDFPath,strFile))
return FALSE;
ctx = fz_new_context(NULL, NULL, FZ_STORE_UNLIMITED);
fz_try(ctx){
doc = fz_open_document(ctx, strFile.GetBuffer(0));
}fz_catch(ctx){
fz_free_context(ctx);
return FALSE;
}
line = 0;
rotation = 0;
pagecount = 0;
pagecount = fz_count_pages(doc);
fz_rotate(&ctm, rotation);
fz_pre_scale(&ctm,1.0f,1.0f);
sheet = fz_new_text_sheet(ctx);
for(i=0;i<pagecount;i++){
page = fz_load_page(doc,i);
text = fz_new_text_page(ctx);
dev = fz_new_text_device(ctx, sheet, text);
#ifdef _DEBUG
dwTick = GetTickCount();
#endif
fz_run_page(doc, page, dev, &ctm, NULL);
#ifdef _DEBUG
strLog.Format("run page:%d ms\n",GetTickCount() - dwTick);
OutputDebugString(strLog);
dwTick = GetTickCount();
#endif
//m_linesInfoVector.push_back(line);
print_text_page(ctx,m_strContent,text,line);
#ifdef _DEBUG
strLog.Format("print text:%d ms\n",GetTickCount() - dwTick);
OutputDebugString(strLog);
dwTick = GetTickCount();
#endif
fz_free_device(dev);
fz_free_text_page(ctx,text);
fz_free_page(doc, page);
}
fz_free_text_sheet(ctx,sheet);
fz_close_document(doc);
fz_free_context(ctx);
return TRUE;
}
This code extracts all the text from the PDF, but it is too slow. How can I improve it?
Most of the time is spent in fz_run_page. If I only want to extract text from the PDF, can I avoid calling fz_run_page?

At a quick glance your code looks fine.
To extract text from a PDF you need to interpret the PDF operator streams. fz_run_page does this. It results in calls to whatever device you specify, in this case the structured text extraction device. This collates the randomly positioned glyphs from all over the page into a more structured form of words/lines/paragraphs/columns etc.
So, in short, you're doing the right thing.
There are currently no user-serviceable ways to improve the speed of this. It is possible that we could use a device hint to avoid reading images etc. in future versions. I will ponder this and discuss it with the other devs. But for now you're doing the right thing.
HTH.

No, the fz_run_page call is needed. You need to interpret the pages of the document to pull out the text, and that is what fz_run_page does.
Possibly you could create a simpler text device that avoids keeping track of the character positions, but I doubt that would make a real difference to performance.

Related

vkCreateWin32SurfaceKHR not writing to surface

I'm trying to get a simple test of Vulkan working. I've been following the LunarG tutorials, but ran into the problem that vkCreateWin32SurfaceKHR seems to do nothing. Namely, surface is not being written to. The function vkCreateWin32SurfaceKHR returns 0, so it isn't reporting a failure. Any help is appreciated.
// create window
sdlWindow = SDL_CreateWindow(APP_SHORT_NAME, SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED, width, height, 0);
struct SDL_SysWMinfo wmInfo;
SDL_VERSION(&wmInfo.version);
SDL_GetWindowWMInfo(sdlWindow, &wmInfo);
hWnd = wmInfo.info.win.window;
hInstance = GetModuleHandle(NULL);
// create a surface attached to the window
VkWin32SurfaceCreateInfoKHR surface_info = {};
surface_info.sType = VK_STRUCTURE_TYPE_WIN32_SURFACE_CREATE_INFO_KHR;
surface_info.pNext = NULL;
surface_info.hinstance = hInstance;
surface_info.hwnd = hWnd;
sanity(!vkCreateWin32SurfaceKHR(inst, &surface_info, NULL, &surface));

Sascha Willems correctly identified that I was not requesting the extensions necessary to create a surface. I changed my code to request extensions as shown below, and now everything works as expected.
// create an instance
vector<const char*> enabledInstanceExtensions; // string literals are const char*
enabledInstanceExtensions.push_back(VK_KHR_SURFACE_EXTENSION_NAME);
enabledInstanceExtensions.push_back(VK_KHR_WIN32_SURFACE_EXTENSION_NAME);
#ifdef VALIDATE_VULKAN
enabledInstanceExtensions.push_back("VK_EXT_debug_report");
#endif
vector<const char*> enabledInstanceLayers; // string literals are const char*
#ifdef VALIDATE_VULKAN
enabledInstanceLayers.push_back("VK_LAYER_LUNARG_standard_validation");
#endif
VkInstanceCreateInfo inst_info = {};
inst_info.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
inst_info.pNext = NULL;
inst_info.flags = 0;
inst_info.pApplicationInfo = &app_info;
inst_info.enabledExtensionCount = (uint32_t)enabledInstanceExtensions.size();
inst_info.ppEnabledExtensionNames = enabledInstanceExtensions.data();
inst_info.enabledLayerCount = (uint32_t)enabledInstanceLayers.size();
inst_info.ppEnabledLayerNames = enabledInstanceLayers.data();
sanity(!vkCreateInstance(&inst_info, NULL, &instance));

Besides what Joe added in his answer, I will also say that the call to vkCreateWin32SurfaceKHR() does not fail when given invalid arguments; it still returns VK_SUCCESS. I'm not sure whether this is also the case on other platforms.
By invalid arguments I mean the two most important members of the Vulkan structure VkWin32SurfaceCreateInfoKHR: hinstance and hwnd.
So pay close attention to those two arguments; they tricked me a few times.
I'm not sure why it returns VK_SUCCESS when given invalid arguments; there may be some internal reason that I don't know about.

How to wrap the IO functions in Lua to prevent the user from leaving X directory

How could you wrap the IO functions in Lua to prevent someone from leaving your top level directory.
You place them in "MyDoc" and they have full IO access to everything sub of MyDoc but couldn't for example .. back into the C drive or anywhere else.

Open up liolib.c and head over to these 3 functions:
static void opencheck (lua_State *L, const char *fname, const char *mode) {
LStream *p = newfile(L);
p->f = fopen(fname, mode);
if (p->f == NULL)
luaL_error(L, "cannot open file " LUA_QS " (%s)", fname, strerror(errno));
}
static int io_open (lua_State *L) {
const char *filename = luaL_checkstring(L, 1);
const char *mode = luaL_optstring(L, 2, "r");
LStream *p = newfile(L);
const char *md = mode; /* to traverse/check mode */
luaL_argcheck(L, lua_checkmode(md), 2, "invalid mode");
p->f = fopen(filename, mode);
return (p->f == NULL) ? luaL_fileresult(L, 0, filename) : 1;
}
static int io_popen (lua_State *L) {
const char *filename = luaL_checkstring(L, 1);
const char *mode = luaL_optstring(L, 2, "r");
LStream *p = newprefile(L);
p->f = lua_popen(L, filename, mode);
p->closef = &io_pclose;
return (p->f == NULL) ? luaL_fileresult(L, 0, filename) : 1;
}
These are the functions you want to edit.
The first one receives the file name as the parameter fname; the second and the third
read it from the Lua stack into the local variable filename.
Now all you need to do is:
1) get your own process path
2) canonicalize the given file path
3) compare them so that they are the same up until the last slash on both
4) if they are not the same, then in opencheck use luaL_error(L,"access denied to %s", fname);
in the other two return luaL_fileresult(L,0,filename);
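
Steps 2 and 3 above can be sketched in C++17 with std::filesystem. This is only an illustration under stated assumptions: is_inside_root is a hypothetical helper (it is not part of Lua or liolib.c), the root directory is passed in explicitly instead of being derived from the process path, and relative names are resolved against that root.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical helper: returns true if `candidate` resolves to a location
// inside `root` (or to `root` itself). Relative candidates are resolved
// against `root`; absolute candidates are taken as they are.
bool is_inside_root(const fs::path& root, const fs::path& candidate)
{
    // weakly_canonical resolves "." and ".." (and symlinks for the parts
    // that exist) without requiring the whole path to exist yet.
    const fs::path canon_root = fs::weakly_canonical(root);
    const fs::path canon_path = fs::weakly_canonical(canon_root / candidate);

    // Compare component by component: every component of the root must
    // match the start of the candidate path.
    const auto diff = std::mismatch(canon_root.begin(), canon_root.end(),
                                    canon_path.begin(), canon_path.end());
    return diff.first == canon_root.end();
}
```

A wrapper around fopen could call is_inside_root(root, filename) first and refuse to open the file when it returns false. Comparing whole path components avoids the prefix trap where "MyDocuments" would pass a plain string comparison against "MyDoc".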

Presumably you have sandboxed your user environment, so for instance they can't use the builtin "require" or "dofile" or "setmetatable"? Basically you have to limit the functions they can call to only what you want, and create your own versions of anything you want to control. There are several ways to do this, each with pros and cons, and nothing is unbreakable; all you can do is raise the bar of experience, effort and time required to break your "jail".
This means you have to work at the C API level, but I would not recommend modifying the source unless you are very familiar with it and can easily determine that your modifications aren't easily breakable. By staying at the C API level, at least other Lua users can help validate the solidity of the sandbox.
You have to figure out a way to enable your code to call the Lua builtins without allowing the user to call them. I believe you can store tables in the Lua registry, where only the C code can look. It's been a while. Or maybe, if you don't put getmetatable in the user environment, you can call the builtins via a metatable while the user can't get to them.
For example, from C:
1) load the builtins such as the io module and save the functions you will wrap (such as open) in a (meta)table;
2) delete the builtin table io from _G so the user only has access to the version you created; you've saved the functions you will need for later;
3) create a global table called io and set its metatable to what you created in step 1, so it defines only the functions you want to give access to, such as a function called "open".
In that function you do whatever filtering you need before calling the builtin you saved.
The details will make a big difference, and the implementation will be different for Lua 5.1 vs 5.2, but there are several good articles on sandboxing in Lua on the web (sorry, no time to find them). Take a look and come up with something, then maybe post on the Lua user mailing list or SO for pros/cons. ;)

Making duplicate of .pdf file using c++ code

I am trying to write a code which should be able to make duplicate copy of a file in any format. At the moment, I'm trying it for .pdf format. Here is the code that I have written:
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
ifstream in("a.pdf", ios::binary);
if(in.fail())
{
cout<<"\nThe file couldn't be opened\n";
exit(0);
}
ofstream out("b.pdf", ios::binary);
while(!in.eof())
{
char buf[1000];
in.read(buf, sizeof(buf));
out<<buf;
}
in.close();
out.close();
return 0;
}
Now the problem is that the duplicate file either gets corrupted or is smaller/larger than the original file. It also doesn't contain any text.
I am writing this code for my computer networks project, in which I have to send a file in any format from server to client.

I think this happens because you are using operator<< for output, which is designed to work with formatted strings, not binary data. std::ostream::write() is the counterpart of read() and is meant to be used with binary/unformatted data.
char buf[1000];
// read() fails on the last, shorter chunk, so also check gcount()
while(in.read(buf, sizeof(buf)) || in.gcount() > 0)
{
// write only the bytes that were actually read
out.write(buf, in.gcount());
}
operator<<(char*) treats the buffer as a C string, so it stops at the first zero byte (and reads past the end of the buffer if there is none). It also does not guarantee that the output matches the input even for text, because the output stream's formatting flags can alter it.
The ios::binary flag does not change the behaviour of operator<<. It only turns off platform-specific translation of characters such as newlines (for example, "\n" to "\r\n" on Windows).
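
Another way to copy the bytes unchanged, without a hand-written buffer loop, is to stream the input's buffer object straight into the output. A minimal sketch, assuming a hypothetical helper named copy_file_binary:

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: copies `src` to `dst` byte for byte. Returns false
// if either file cannot be opened or the copy fails.
bool copy_file_binary(const std::string& src, const std::string& dst)
{
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(dst, std::ios::binary);
    if (!in || !out)
        return false;

    // operator<< on a streambuf pointer transfers the raw characters with
    // no formatting. Note that it sets failbit on `out` when the source is
    // empty, so an empty file is reported as a failure by this sketch.
    out << in.rdbuf();
    out.flush();
    return static_cast<bool>(out);
}
```

For the question's example this would be called as copy_file_binary("a.pdf", "b.pdf"). Zero bytes inside the file are copied like any other byte because nothing is interpreted as a C string.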

Get certain parts out of a string using C

Evening everyone, I hope one of you gurus can help. I need to read data out of the string below by searching for tags such as IZTAG, UKPART, etc. However, the code I am using is no good, as it only stores the first part of the value: for UKPART, for example, it stores 12999 and misses the -0112. Is there a better way to search strings?
UPDATE SO FAR.
#include <stdio.h>
#include <string.h>
#include <windows.h>
int main ()
{
// in my application this comes from the handle and readfile
char buffer[255]="TEST999.UKPART=12999-0112...ISUE-125" ;
//
int i;
int codes[256];
char *pos = buffer;
size_t current = 0;
//
char buffer2[255];
if ((pos=strstr(pos, "UKPART")) != NULL) {
strcpy (buffer2, pos); // buffer2 <= "UKPART=12999-0112...ISUE-125"
}
printf("%s\n", buffer2);
system("pause");
return 0;
}
This now works, but it returns the whole rest of the string as output. I need to return just the UKPART value, for example. Thanks so far :-)

strstr() is absolutely the right way to search for the substring. Cool :)
It sounds like you want something other than sscanf() to copy the substring.
Q: Why not just use strcpy() instead?
EXAMPLE:
char buffer[255]="IZTAG-12345...UKPART=12999-0112...ISUE-125" ;
char buffer2[255] = ""; // stays empty if the tag is not found
char *pos = buffer;
if ((pos=strstr(pos, "UKPART")) != NULL) {
strcpy (buffer2, pos); // buffer2 <= "UKPART=12999-0112...ISUE-125"
}
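
To return only the value of one tag instead of the rest of the string, the copy has to stop at a delimiter. The sketch below uses std::string rather than the C string functions above, and it assumes that a value starts right after "TAG=" and ends at the next '.' (this is inferred from the sample string, not stated in the question). extract_tag_value is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the text between "<tag>=" and the next '.'
// in `input`, or an empty string when "<tag>=" does not occur.
std::string extract_tag_value(const std::string& input, const std::string& tag)
{
    const std::string key = tag + "=";
    const std::size_t start = input.find(key);
    if (start == std::string::npos)
        return "";

    const std::size_t value_begin = start + key.size();
    // find() returns npos when no '.' follows; substr() then clamps the
    // length and returns everything up to the end of the string.
    const std::size_t value_end = input.find('.', value_begin);
    return input.substr(value_begin, value_end - value_begin);
}
```

With the buffer from the question, extract_tag_value("TEST999.UKPART=12999-0112...ISUE-125", "UKPART") yields "12999-0112".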

How does one find the start of the "Central Directory" in zip files?

Wikipedia has an excellent description of the ZIP file format, but the "central directory" structure is confusing to me. Specifically this:
This ordering allows a ZIP file to be created in one pass, but it is usually decompressed by first reading the central directory at the end.
The problem is that even the trailing header for the central directory is variable length. How then, can someone get the start of the central directory to parse?
(Oh, and I did spend some time looking at APPNOTE.TXT in vain before coming here and asking :P)

My condolences. Reading the Wikipedia description gives me the very strong impression that you need to do a fair amount of guess-and-check work:
Hunt backwards from the end for the 0x06054b50 end-of-directory tag, look forward 16 bytes to find the offset for the start-of-directory tag 0x02014b50, and hope that is it. You could do some sanity checks like looking for the comment length and comment string tags after the end-of-directory tag, but it sure feels like Zip decoders work because people don't put funny characters into their zip comments, filenames, and so forth. Based entirely on the wikipedia page, anyhow.

I implemented zip archive support some time ago, and I search the last few kilobytes for the end of central directory signature (4 bytes). That works pretty well until somebody puts 50kb of text into the comment, which is unlikely to happen. To be absolutely sure, you can search the last 64kb plus a few bytes, since the comment size is a 16-bit field.
After that, I look for the zip64 end of central directory locator; that's easier since it has a fixed structure.
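
The backwards search described above can be sketched as follows. The sketch works on a byte buffer that is assumed to hold the whole archive (or at least its tail); find_eocd and the two constants are hypothetical names, while the sizes and field offsets follow the ZIP layout discussed in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The fixed part of the end of central directory (EOCD) record is 22 bytes,
// followed by a comment of at most 65535 bytes (16-bit length field).
constexpr std::size_t kEocdFixedSize = 22;
constexpr std::size_t kMaxCommentSize = 0xFFFF;

// Hypothetical helper: returns the index of the EOCD record in `data`,
// or -1 if no plausible record is found.
long find_eocd(const std::vector<std::uint8_t>& data)
{
    if (data.size() < kEocdFixedSize)
        return -1;

    const std::size_t last = data.size() - kEocdFixedSize;
    const std::size_t first = last > kMaxCommentSize ? last - kMaxCommentSize : 0;

    // Walk backwards from the last possible position down to `first`.
    for (std::size_t pos = last + 1; pos-- > first; )
    {
        // Signature 0x06054b50 is stored little-endian: 'P' 'K' 0x05 0x06.
        if (data[pos] == 0x50 && data[pos + 1] == 0x4b &&
            data[pos + 2] == 0x05 && data[pos + 3] == 0x06)
        {
            // Sanity check: the comment length (offset 20) must account
            // for exactly the bytes that follow the fixed record.
            const std::size_t comment_len =
                static_cast<std::size_t>(data[pos + 20]) |
                (static_cast<std::size_t>(data[pos + 21]) << 8);
            if (pos + kEocdFixedSize + comment_len == data.size())
                return static_cast<long>(pos);
        }
    }
    return -1;
}
```

The comment length check is one way to reject a signature that merely happens to appear inside the comment text. The 4-byte field at offset 16 of the record found this way holds the offset of the start of the central directory.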

Here is a solution I have just had to roll out, in case anybody needs it. It involves grabbing the central directory.
In my case I did not want any of the compression features offered by the existing zip solutions. I just wanted to know about the contents. The following code returns a ZipArchive holding a listing of every entry in the zip.
It also uses a minimal amount of file access and memory allocation.
TinyZip.cpp
#include "TinyZip.h"
#include <cstdio>
namespace TinyZip
{
#define VALID_ZIP_SIGNATURE 0x04034b50
#define CENTRAL_DIRECTORY_EOCD 0x06054b50 //signature
#define CENTRAL_DIRECTORY_ENTRY_SIGNATURE 0x02014b50
#define PTR_OFFS(type, mem, offs) *((type*)(mem + offs)) //SHOULD BE OK
typedef struct {
unsigned int signature : 32;
unsigned int number_of_disk : 16;
unsigned int disk_where_cd_starts : 16;
unsigned int number_of_cd_records : 16;
unsigned int total_number_of_cd_records : 16;
unsigned int size_of_cd : 32;
unsigned int offset_of_start : 32;
unsigned int comment_length : 16;
} ZipEOCD;
ZipArchive* ZipArchive::GetArchive(const char *filepath)
{
FILE *pFile = nullptr;
#ifdef WIN32
errno_t err;
if ((err = fopen_s(&pFile, filepath, "rb")) == 0)
#else
if ((pFile = fopen(filepath, "rb")) != NULL) // proceed only if the file opened
#endif
{
int fileSignature = 0;
//Seek to start and read zip header
fread(&fileSignature, sizeof(int), 1, pFile);
if (fileSignature != VALID_ZIP_SIGNATURE) { fclose(pFile); return nullptr; }
//Grab the file size
long fileSize = 0;
long currPos = 0;
fseek(pFile, 0L, SEEK_END);
fileSize = ftell(pFile);
fseek(pFile, 0L, SEEK_SET);
//Step back the size of the ZipEOCD
//If it doesn't have any comments, should get an instant signature match
currPos = fileSize;
int signature = 0;
while (currPos > 0)
{
fseek(pFile, currPos, SEEK_SET);
fread(&signature, sizeof(int), 1, pFile);
if (signature == CENTRAL_DIRECTORY_EOCD)
{
break;
}
currPos -= sizeof(char); //step back one byte
}
if (currPos != 0)
{
ZipEOCD zipOECD;
fseek(pFile, currPos, SEEK_SET);
fread(&zipOECD, sizeof(ZipEOCD), 1, pFile);
long memBlockSize = fileSize - zipOECD.offset_of_start;
//Allocate zip archive of size
ZipArchive *pArchive = new ZipArchive(memBlockSize);
//Read in the whole central directory (also includes the ZipEOCD...)
fseek(pFile, zipOECD.offset_of_start, SEEK_SET);
fread((void*)pArchive->m_MemBlock, memBlockSize - 10, 1, pFile);
long currMemBlockPos = 0;
long currNullTerminatorPos = -1;
while (currMemBlockPos < memBlockSize)
{
int sig = PTR_OFFS(int, pArchive->m_MemBlock, currMemBlockPos);
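if (sig != CENTRAL_DIRECTORY_ENTRY_SIGNATURE)
{
if (sig != CENTRAL_DIRECTORY_EOCD) return nullptr; //something went wrong
//terminate the last filename before returning
if (currNullTerminatorPos > 0) pArchive->m_MemBlock[currNullTerminatorPos] = '\0';
return pArchive;
}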
if (currNullTerminatorPos > 0)
{
pArchive->m_MemBlock[currNullTerminatorPos] = '\0';
currNullTerminatorPos = -1;
}
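const long offsToFilenameLen = 28;
const long offsToFieldLen = 30;
const long offsToCommentLen = 32;
const long offsetToFilename = 46;
//the length fields are 16 bits wide, so read them as unsigned short, not int
int filenameLength = PTR_OFFS(unsigned short, pArchive->m_MemBlock, currMemBlockPos + offsToFilenameLen);
int extraFieldLen = PTR_OFFS(unsigned short, pArchive->m_MemBlock, currMemBlockPos + offsToFieldLen);
int commentLen = PTR_OFFS(unsigned short, pArchive->m_MemBlock, currMemBlockPos + offsToCommentLen);
const char *pFilepath = &pArchive->m_MemBlock[currMemBlockPos + offsetToFilename];
currNullTerminatorPos = (currMemBlockPos + offsetToFilename) + filenameLength;
pArchive->m_Entries.push_back(pFilepath);
//skip the filename, the extra field and the per-file comment
currMemBlockPos += (offsetToFilename + filenameLength + extraFieldLen + commentLen);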
}
return pArchive;
}
}
return nullptr;
}
ZipArchive::ZipArchive(long size)
{
m_MemBlock = new char[size];
}
ZipArchive::~ZipArchive()
{
delete[] m_MemBlock;
}
const std::vector<const char*> &ZipArchive::GetEntries()
{
return m_Entries;
}
}
TinyZip.h
#ifndef __TinyZip__
#define __TinyZip__
#include <vector>
#include <string>
namespace TinyZip
{
class ZipArchive
{
public:
ZipArchive(long memBlockSize);
~ZipArchive();
static ZipArchive* GetArchive(const char *filepath);
const std::vector<const char*> &GetEntries();
private:
std::vector<const char*> m_Entries;
char *m_MemBlock;
};
}
#endif
Usage:
TinyZip::ZipArchive *pArchive = TinyZip::ZipArchive::GetArchive("Scripts_unencrypt.pak");
if (pArchive != nullptr)
{
const std::vector<const char*> entries = pArchive->GetEntries();
for (auto entry : entries)
{
//do stuff
}
}

In case someone out there is still struggling with this problem, have a look at the repository I host on GitHub; the project there could answer your questions.
Zip file reader
Basically, it downloads the central directory, which resides at the end of the .zip file.
Then it reads every file and folder name, with its path, from the bytes and prints it to the console.
I have commented the more complicated steps in my source code.
The program only works for .zip files up to about 4GB. Beyond that you will have to change the VM size and maybe more.
Enjoy :)

I recently encountered a similar use-case and figured I would share my solution for posterity since this post helped send me in the right direction.
Using the Zip file central directory offsets detailed on Wikipedia here, we can take the following approach to parse the central directory and retrieve a list of the contained files:
STEPS:
1) Find the end of central directory record (EOCDR) by scanning the zip file in binary mode for the EOCDR signature (0x06054b50), working backwards from the end of the file (with an ifstream, std::ios::ate opens the stream positioned at the end)
2) Use the offset stored in the EOCDR (16 bytes from the start of the EOCDR) to position the stream reader at the beginning of the central directory
3) Use the offset (46 bytes from the start of the central directory header) to position the stream reader at the file name and track its start position
4) Scan until either another central directory header (0x02014b50) or the EOCDR is found, and track the position
5) Reset the reader to the start of the file name and read until the tracked end position
6) Position the reader over the next header, or terminate if the EOCDR is found
The key point here is that the EOCDR is uniquely identified by a signature (0x06054b50) that occurs only one time. Using the 16 byte offset, we can position ourselves to the first occurrence of the central directory header (0x02014b50). Each record will have the same 0x02014b50 header signature, so you just need to loop through occurrences of the header signatures until you hit the EOCDR ending signature (0x06054b50) again.
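
The walk over the central directory records can be sketched as below. Note that this sketch swaps in a different technique from steps 3 to 5: instead of scanning for the next signature, it uses the length fields stored in each record (file name length at offset 28, extra field length at offset 30, comment length at offset 32) to find where the name ends and where the next record begins. list_entries and read_u16 are hypothetical names, and the data is assumed to be already loaded into a byte buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reads a little-endian 16-bit value from `p`.
static std::uint16_t read_u16(const std::uint8_t* p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// Hypothetical helper: collects the file names of the central directory
// records that start at `cd_offset` in `data`. Stops at the first record
// without the central directory header signature (e.g. the EOCDR).
std::vector<std::string> list_entries(const std::vector<std::uint8_t>& data,
                                      std::size_t cd_offset)
{
    constexpr std::size_t kHeaderSize = 46;  // fixed part of each record
    std::vector<std::string> names;
    std::size_t pos = cd_offset;

    while (pos + kHeaderSize <= data.size())
    {
        const std::uint8_t* rec = &data[pos];
        // Signature 0x02014b50 is stored little-endian: 'P' 'K' 0x01 0x02.
        if (!(rec[0] == 0x50 && rec[1] == 0x4b && rec[2] == 0x01 && rec[3] == 0x02))
            break;

        const std::size_t name_len    = read_u16(rec + 28);
        const std::size_t extra_len   = read_u16(rec + 30);
        const std::size_t comment_len = read_u16(rec + 32);
        if (pos + kHeaderSize + name_len > data.size())
            break;  // truncated record

        names.emplace_back(reinterpret_cast<const char*>(rec + kHeaderSize), name_len);
        // The next record follows the name, extra field and comment.
        pos += kHeaderSize + name_len + extra_len + comment_len;
    }
    return names;
}
```

Using the stored lengths means a file name or extra field that happens to contain the bytes of a signature cannot confuse the parser, which a pure signature scan cannot rule out.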
SUMMARY:
If you want to see a working example of the above steps, you can check out my minimal implementation (ZipReader) on GitHub here. The implementation can be used like this:
ZipReader zr;
if (zr.SetInput("blah.zip") == ZipReaderStatus::S_FAIL)
std::cout << "set input error" << std::endl;
std::vector<std::string> entries;
if (zr.GetEntries(entries) == ZipReaderStatus::S_FAIL)
std::cout << "get entries error" << std::endl;
for (auto entry : entries)
std::cout << entry << std::endl;
