I am trying to build a PDF crawler for the annual reports of corporations. These reports are PDF documents with a lot of text and also a lot of tables.
I don't have any trouble converting the PDFs into text, but my actual goal is to search for certain keywords (for example REVENUE or PROFIT) and extract data such as "Revenue 1.000.000.000€" into a data frame.
I tried different libraries, especially tabula-py and PyPDF2, but I couldn't find a smart way to do that. Can anyone please help with a strategy? It would be amazing!
Best Regards,
Robin
Extracting data from PDFs is tricky business. Although there are PDF standards, not all PDFs are created equal. If you can already extract the data you need in text form, you can use regular expressions (RegEx) to pull out the values you require.
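For example, a minimal RegEx sketch along these lines (the keyword list, the file name, and the number format are assumptions you would adapt to your reports):

```python
import re
import pandas as pd

# Hypothetical example: scan extracted text for keyword/value pairs such as
# "Revenue 1.000.000.000 €" and collect them into a DataFrame.
keywords = ["Revenue", "Profit"]          # adjust to your reports' wording
text = open("annual_report.txt", encoding="utf-8").read()

rows = []
for kw in keywords:
    # keyword, whitespace, then a number like 1.000.000.000 (dots as
    # thousands separators), optionally followed by a currency symbol
    pattern = rf"{kw}\s+([\d.,]+)\s*€?"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        rows.append({"keyword": kw, "value": match.group(1)})

df = pd.DataFrame(rows)
print(df)
```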
Amazon has a machine learning service called Textract, which you can use alongside their boto3 SDK in Python. However, it is a pay-per-use service. The main difference between Textract and regular expressions is that Textract can recognise and format key/value pairs and tables, which should mean that creating your 'crawler' is quicker and less prone to breaking if your PDFs change going forward.
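A rough sketch of a synchronous Textract call via boto3 might look like this (the file name and region are placeholders; multi-page PDFs need the asynchronous start_document_analysis API with the file stored in S3):

```python
import boto3

# Hedged sketch: synchronous Textract call for a single-page document.
client = boto3.client("textract", region_name="eu-west-1")

with open("report_page.pdf", "rb") as f:
    response = client.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES", "FORMS"],  # ask for table and key/value blocks
    )

# Print the detected lines of text; table cells and key/value pairs are also
# returned as blocks, with BlockType "CELL" and "KEY_VALUE_SET" respectively.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```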
There is also a Python package called textract, but it's not the same as the AWS service; rather, it's a wrapper that (for PDFs) uses pdftotext (the default) or pdfminer.six. It's worth checking out as it may yield your data in a better format.
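Something like the following should get you started (the file name is a placeholder, and the commented-out line only shows the alternative backend):

```python
import textract

# The textract Python package shells out to pdftotext by default; for PDFs it
# also documents a pdfminer backend. The result comes back as bytes.
raw = textract.process("annual_report.pdf")            # default: pdftotext
# raw = textract.process("annual_report.pdf", method="pdfminer")
text = raw.decode("utf-8")
print(text[:500])
```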
I have to extract data from XML files several hundred MB in size in a Google Cloud Function, and I was wondering if there are any best practices?
Since I am used to Node.js, I was looking at some popular libraries like fast-xml-parser, but it seems cumbersome if you only want specific data from a huge XML. I am also not sure if there are any performance issues when the XML is too big. Overall this does not feel like the best solution for parsing and extracting data from huge XMLs.
Then I was wondering if I could use BigQuery for this task, where I simply convert the XML to JSON and throw it into a dataset, where I can then use a query to retrieve the data I want.
Another solution could be to use Python for the job, since it is good at parsing and extracting data from XML, so even though I have no experience in Python I was wondering if this path could still be the best solution?
If anything above does not make sense, if one solution is preferable to the other, or if anyone can share any insights, I would highly appreciate it!
I suggest you check this article, in which they discuss how to load XML data into BigQuery using Python Dataflow. I think this approach may work in your situation.
Basically what they suggest is:
Parse the XML into a Python dictionary using the xmltodict package.
Specify a schema of the output table in BigQuery.
Use a Beam pipeline to read the XML file and populate the BigQuery table, roughly as in the sketch below.
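A rough sketch of such a pipeline, purely for illustration (the XML structure, the local file path, the table name, and the schema are all placeholders you would adapt to your own data):

```python
import apache_beam as beam
import xmltodict

# Assumed, illustrative names: adapt the project/dataset/table and schema.
TABLE = "my-project:my_dataset.my_table"
SCHEMA = "id:STRING,name:STRING"

def parse_xml(path):
    # Placeholder structure: a <records> root with repeated <record> children.
    with open(path, "rb") as f:
        doc = xmltodict.parse(f)
    for rec in doc["records"]["record"]:
        # yield one dict per record, matching the BigQuery schema above
        yield {"id": rec["id"], "name": rec["name"]}

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ListFiles" >> beam.Create(["data/huge_file.xml"])   # local path for the sketch
        | "ParseXML" >> beam.FlatMap(parse_xml)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            TABLE,
            schema=SCHEMA,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```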
How can I load a dataset for person re-identification? In my dataset there are two folders: train and test.
I wish I could provide comments, but I cannot yet. Therefore, I will "answer" your question to the best of my ability.
First, you should provide a general format or example content of the dataset. This would help me provide a less nebulous answer.
Second, from the nature of your question I am assuming that you are fairly new to Python in general. Forgive me if I'm wrong in my assumption. With that assumption: depending on what kind of data you are trying to load (i.e. text, numbers, or a mixture of text and numbers), there are various ways to load it, and some methods are easier than others. If you are strictly loading numbers, I suggest using numpy.loadtxt(<file name>). If you are using text, you could use the pandas package, or if it's in a CSV file you could use the built-in (built into Python, that is) csv module. Alternatively, if it's in a format that TensorFlow can read, you could use the provided data-loading functions.
Once you have loaded your data, you will need to separate it into input and output values. Considering that TensorFlow models accept either lists or NumPy arrays, you should be able to use these in your training and testing steps.
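For example, if your train and test folders contain one subfolder per person identity with image files inside (an assumption on my part, since you haven't shown the layout), a sketch with TensorFlow's built-in loader could look like this:

```python
import tensorflow as tf

# Hedged sketch: paths and image size are placeholders.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/train", image_size=(128, 64), batch_size=32
)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/test", image_size=(128, 64), batch_size=32
)

# Each batch is an (images, labels) pair, ready to pass to model.fit(...)
for images, labels in train_ds.take(1):
    print(images.shape, labels.shape)
```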
Check out the csv module (import csv), or load your dataset via open(filename, "r") or so. It might be easiest if you provide more context/info.
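For instance, a tiny csv sketch (the file name and the assumption that the label sits in the last column are placeholders):

```python
import csv

# Read a CSV file and split each row into features and a label.
with open("dataset/train/labels.csv", "r", newline="") as f:
    for row in csv.reader(f):
        features, label = row[:-1], row[-1]
        print(features, label)
```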
I am using turtle files containing biographical information for historical research. Those files are provided by a major library and most of the information in the files is not explicit. While people's professions, for instance, are sometimes stated alongside links to the library's URIs, I only have URIs in the majority of cases. This is why I will need to retrieve the information behind them at some point in my workflow, and I would appreciate some advice.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
I have also seen that there are ways to convert RDFs directly to CSV, but although CSV is nice to work with, I would get a lot of unwanted "background noise" by simply converting all the data.
What would you recommend?
RDFLib is all about working with RDF data. If you have RDF data, my suggestion is to do as much RDF-native work as you can and then only export to CSV if you want to do something like print tabular results or load them into pandas DataFrames. Of course, there is always more than one way to do things, so you could manipulate the data in CSV, but RDF, by design, carries far more information than a CSV file can, so when you're manipulating RDF data you have more to get hold of.
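As a small, hedged illustration (the file name is a placeholder, and you would replace the generic triple pattern with the library's actual properties):

```python
import pandas as pd
from rdflib import Graph

# Keep the data in the graph and only export a tabular query result at the end.
g = Graph()
g.parse("people.ttl", format="turtle")

result = g.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
df = pd.DataFrame([(str(s), str(p), str(o)) for s, p, o in result],
                  columns=["subject", "predicate", "object"])
df.to_csv("triples.csv", index=False)
```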
most of the information in the files is not explicit
Better phrased: most of the information is indicated with objects identified by URIs, not given as literal values.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
No! You should store the .ttl files you can get, and then you may indeed retrieve all the other data referred to by URI. But, presumably, that data is also in RDF form, so you should download it into the same graph you loaded the initial .ttl files into; then you have the full graph, with links and literal values in it, at your disposal to manipulate with SPARQL queries.
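A rough RDFLib sketch of that idea (the file name is a placeholder, and it assumes the object URIs actually resolve to RDF):

```python
from rdflib import Graph, URIRef

# Load the local Turtle data, then dereference object URIs into the same
# graph so that links and literal values end up side by side.
g = Graph()
g.parse("biographies.ttl", format="turtle")

for obj in {o for _, _, o in g if isinstance(o, URIRef)}:
    try:
        g.parse(obj)        # rdflib fetches and merges the remote RDF
    except Exception:
        pass                # skip URIs that do not resolve to parseable RDF

# The merged graph can now be queried with SPARQL.
for row in g.query("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"):
    print("triples in merged graph:", row[0])
```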
This question hopefully also serves as a reference for other general questions about using SDMX/JSON or other request types to pull data into Excel.
I am new to this sort of data request, but I was wondering how data could be pulled in a semi-automated way into Excel for various data fields from Statistics Sweden's website.
Their API details are given here: http://www.scb.se/en_/About-us/Open-data-API/API-for-the-Statistical-Database-/
I'm looking to understand how certain fields can be pulled into Excel.
For the SDMX XML format you may use some existing libraries (search GitHub for sdmx).
Example of data: http://share.scb.se/AA0102/data/ECOFIN_NAG_08_Q_SE1.xml
Once you have the data you can transform it to CSV format (also look for libraries), which can be imported into Excel.
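As a very rough illustration (the "Obs" element name and its attributes are assumptions about the SDMX flavour SCB uses, so verify them against the actual file):

```python
import csv
import xml.etree.ElementTree as ET
import requests

# Download the example SDMX XML and dump the observation elements to CSV,
# which Excel can then open directly.
url = "http://share.scb.se/AA0102/data/ECOFIN_NAG_08_Q_SE1.xml"
root = ET.fromstring(requests.get(url).content)

with open("scb_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    header_written = False
    for elem in root.iter():
        tag = elem.tag.split("}")[-1]      # strip the XML namespace
        if tag == "Obs":                   # assumed observation element
            if not header_written:
                writer.writerow(elem.attrib.keys())
                header_written = True
            writer.writerow(elem.attrib.values())
```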
If you are lucky you may even find a CSV export on that Swedish site.
I have some experience with Java, and I am a student doing my final-year project.
I need to work on a project in natural language processing. I am currently trying to work with the stanford-nlp libraries (but I am not locked to them; I can change my tool), so answers can be for any tool appropriate for my problem.
I have planned to work on Information Extraction (IE) and have seen some pages/PDFs that explain how it works with various NLP techniques. The data will be processed with NLP, and I need to perform Information Retrieval (IR) on the processed data.
My problem now is: what data structure or storage medium should I use to store the data I have retrieved using NLP techniques?
That data store must have the capacity to support queries.
XML and JSON do not look like ideal candidates (I could be wrong); if they can work, some help/guidance on the best way to do it would be helpful.
My current view is to convert/store the parse tree in a data format that can be read directly for querying. (A parse tree is a diagrammatic representation of the parsed structure of a sentence or string.)
A sample of the type of data that needs to be stored: for the text "My project is based on NLP." the dependencies would be as below
root(ROOT-0, based-4)
poss(project-2, My-1)
nsubjpass(based-4, project-2)
auxpass(based-4, is-3)
prep(based-4, on-5)
pobj(on-5, NLP-6)
Have you already extracted the information or are you trying to store the parse tree? If the former, this is still an open question in NLP. See, for example, the book by Jurafsky and Martin, which discusses many ways to do this.
Basically, we can't answer until we know what you're trying to store. If it's super simple information, you might be able to get away with a simple relational database.
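If you do go the relational route, a minimal sqlite3 sketch using your example dependencies might look like this (table and column names are just illustrative):

```python
import sqlite3

# Store each dependency as a (relation, governor, dependent) row so it can be
# queried later.
conn = sqlite3.connect("dependencies.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS dependency (
        sentence_id INTEGER,
        relation    TEXT,
        governor    TEXT,
        dependent   TEXT
    )
""")

deps = [
    (1, "root",      "ROOT-0",    "based-4"),
    (1, "poss",      "project-2", "My-1"),
    (1, "nsubjpass", "based-4",   "project-2"),
    (1, "auxpass",   "based-4",   "is-3"),
    (1, "prep",      "based-4",   "on-5"),
    (1, "pobj",      "on-5",      "NLP-6"),
]
conn.executemany("INSERT INTO dependency VALUES (?, ?, ?, ?)", deps)
conn.commit()

# Example query: which tokens depend on "based-4"?
for row in conn.execute(
        "SELECT relation, dependent FROM dependency WHERE governor = ?",
        ("based-4",)):
    print(row)
```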