Hopefully this question can also serve as a general reference for pulling data into Excel via SDMX/JSON or other request types.
I am new to this sort of data request and am wondering how various data fields from Statistics Sweden's website could be pulled into Excel in a semi-automated way.
Their API details are given here: http://www.scb.se/en_/About-us/Open-data-API/API-for-the-Statistical-Database-/
I'm looking to understand how certain fields can be pulled into Excel.
For the SDMX XML format you can use one of the existing libraries (search GitHub for "sdmx").
Example of data: http://share.scb.se/AA0102/data/ECOFIN_NAG_08_Q_SE1.xml
Once you have the data you can transform it to CSV format (again, look for libraries), which can be imported into Excel.
If you are lucky you may even find a CSV export on that Swedish site.
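As a rough illustration, here is a minimal Python sketch that downloads the example file above and flattens its observations into a CSV file Excel can open. The element and attribute names (`Obs` and its attributes) are assumptions about the SDMX-ML layout, so inspect the actual XML first and adjust them:

```python
import csv
import urllib.request
import xml.etree.ElementTree as ET

# Example data file from the question; the layout assumptions below may not
# match this particular dataset exactly.
URL = "http://share.scb.se/AA0102/data/ECOFIN_NAG_08_Q_SE1.xml"

with urllib.request.urlopen(URL) as resp:
    tree = ET.parse(resp)

rows = []
for elem in tree.iter():
    # Strip XML namespaces so we can match on local tag names only.
    tag = elem.tag.split("}")[-1]
    if tag == "Obs":
        # Collect every attribute of the observation (time period, value, ...).
        rows.append(dict(elem.attrib))

# Write whatever attributes were found as CSV columns.
if rows:
    fieldnames = sorted({k for r in rows for k in r})
    with open("scb_data.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

The resulting scb_data.csv can then be opened or imported directly in Excel.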
I am using turtle files containing biographical information for historical research. Those files are provided by a major library and most of the information in the files is not explicit. While people's professions, for instance, are sometimes stated alongside links to the library's URIs, I only have URIs in the majority of cases. This is why I will need to retrieve the information behind them at some point in my workflow, and I would appreciate some advice.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
I have also seen that there are ways to convert RDFs directly to CSV, but although CSV is nice to work with, I would get a lot of unwanted "background noise" by simply converting all the data.
What would you recommend?
RDFLib is all about working with RDF data. If you have RDF data, my suggestion is to do as much as you can RDF-natively and only export to CSV when you want to do something like print tabular results or load them into Pandas DataFrames. Of course there is always more than one way to do things, so you could manipulate the data as CSV, but RDF, by design, carries far more information than a CSV file can, so when you manipulate RDF data you have more to get hold of.
most of the information in the files is not explicit
Better phrased: most of the information is indicated with objects identified by URIs, not given as literal values.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
No! You should store the .ttl files you can get, and then you may indeed retrieve the other data referred to by URI. Presumably that data is also in RDF form, so you should download it into the same graph you loaded the initial .ttl files into; then you have the full graph, with both links and literal values, at your disposal to manipulate with SPARQL queries.
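A minimal sketch of that workflow with RDFLib follows. The file name and the profession predicate are placeholders; substitute whatever property URIs your library's data actually uses:

```python
from rdflib import Graph, URIRef

g = Graph()
g.parse("biographies.ttl", format="turtle")  # your local turtle file(s)

# Pull the remote descriptions of the objects you only have URIs for into
# the *same* graph. Many linked-data servers return RDF via content
# negotiation; skip URIs that don't dereference cleanly.
for obj in set(g.objects()):
    if isinstance(obj, URIRef):
        try:
            g.parse(str(obj))
        except Exception:
            pass  # not every URI resolves to parseable RDF

# Now query the combined graph. The profession predicate below is a
# placeholder -- replace it with the property the library actually uses.
results = g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?person ?professionLabel WHERE {
        ?person <http://example.org/hasProfession> ?prof .
        ?prof rdfs:label ?professionLabel .
    }
""")
for person, label in results:
    print(person, label)
```

From there you can write the SELECT results out as CSV, or feed them to Pandas, without having converted the whole graph wholesale.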
First of all, apologies for what might seem like an 'amateur' scenario & question...
Situation
I have many, many documents (100,000) that I need users to be able to search and browse via a web application we are building
This search functionality is just 1 of several other functions
I currently have around a dozen Excel spreadsheets that contain the 'metadata' (title, date, author, source, country etc.) of the documents
Each of the 100,000 'records' (or excel rows) has a unique identifier
The actual files (majority PDF but some Word & Excel) are stored in Dropbox using the corresponding unique identifier as the file name
Questions
Is Apache Solr the best tool to use in order to provide the search functionality?
What is the best design to facilitate this (e.g. Files in AWS S3 etc.)?
What is the best method to migrate from Excel/Dropbox to the proposed Apache Solr solution?
I very much appreciate any assistance as I have just been getting many different answers from paid consultants.
Regards
Mark
Your questions:
Q: Is Apache Solr the best tool to use in order to provide the search functionality?
A: In my opinion, Solr is an awesome option for things like this. However, as you've discovered, there's "some assembly required" (and that's putting it mildly).
Q: What is the best design to facilitate this (e.g. Files in AWS S3 etc.)?
A: If it were me, I'd use the filesystem. I think it's the easiest to debug.
Also, if it were me, I'd export the Excel sheets to CSV; I think that's a bit easier to work with. However, Solr does include the open-source Tika filters, which do support Excel, but they won't treat your multiple records as multiple documents; Tika will turn each Excel sheet into just one document.
Q: What is the best method to migrate from excel/Dropbox to the proposed Apache solr solution?
A: I'm a fan of the local filesystem. Dropbox does let you mirror your Dropbox files in a local directory. And as I said before, if you can get the Excel sheets exported to CSV in some automated or "macro" way, I think it'd make your life easier too. For example, Python can read and write CSV files and is a great tool for massaging data into its final form.
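If you do go the CSV route, the indexing step could be a small script along these lines. This is just a sketch using the third-party pysolr client; the core name, CSV column names, and Solr field names are made up and would have to match your actual export and schema:

```python
import csv
import pysolr  # third-party Solr client: pip install pysolr

# Hypothetical Solr core URL -- adjust host and core name to your setup.
solr = pysolr.Solr("http://localhost:8983/solr/documents", always_commit=True)

docs = []
with open("metadata_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        docs.append({
            "id": row["unique_id"],   # the identifier used as the Dropbox file name
            "title": row["title"],
            "author": row["author"],
            "country": row["country"],
            "date": row["date"],
        })

# Index in batches to keep individual requests small.
for i in range(0, len(docs), 1000):
    solr.add(docs[i:i + 1000])
```

Each Solr document carries the unique identifier, so the web application can link a search hit back to the corresponding PDF/Word/Excel file.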
If you don't mind commercial solutions, you might consider Lucidworks Fusion; it includes a bunch of connectors, including a filesystem datasource connector. Disclaimer: I happen to work for Lucid, but I'm listing this suggestion last, after the "free" answers. And I'd mention it anyway, even if I didn't work there.
I am trying to build a PDF crawler for corporate annual reports - these reports are PDF documents with a lot of text and also a lot of tables.
I don't have any trouble converting the PDF into a .txt, but my actual goal is to search for certain keywords (for example REVENUE, PROFIT) and extract data like "Revenue 1.000.000.000€" into a data frame.
I tried different libraries, especially tabula-py and PyPDF2, but I couldn't find a smart way to do this - can anyone please help with a strategy? It would be amazing!
Best Regards,
Robin
Extracting data from PDFs is tricky business. Although there are PDF standards, not all PDFs are created equal. If you can already extract the data you need in text form, you can use RegEx to pull the data you require.
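For example, once the text is extracted, a regular-expression pass might look like the sketch below. The pattern and file name are assumptions; real reports format their figures in many different ways, so expect to tune it:

```python
import re
import pandas as pd

# Text previously extracted from the PDF (e.g. with pdftotext or PyPDF2).
with open("annual_report.txt", encoding="utf-8") as f:
    text = f.read()

# Rough pattern: a keyword followed by a number like 1.000.000.000 and an
# optional currency marker. Adjust to how figures appear in your reports.
pattern = re.compile(
    r"(Revenue|Profit)\s*:?\s*([\d.,]+)\s*(€|EUR)?",
    re.IGNORECASE,
)

rows = [
    {"keyword": m.group(1), "value": m.group(2), "currency": m.group(3)}
    for m in pattern.finditer(text)
]
df = pd.DataFrame(rows)
print(df)
```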
Amazon has a machine learning tool called Textract, which you can use alongside their boto3 SDK in Python. However, it is a pay-per-use service. The main difference between Textract and regular expressions is that Textract can recognise and format data pairs and tables, which should mean that creating your 'crawler' is quicker and less prone to breaking if your PDFs change going forward.
There is a Python package called textract, but it's not the same as the one provided by AWS; rather, it's a wrapper that (for PDFs) uses pdftotext (default) or pdfminer.six. It's worth checking out, as it may yield your data in a better format.
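If you do try the AWS service, a bare-bones sketch of calling it through boto3 might look like this. It assumes AWS credentials are already configured; note that multi-page PDFs require the asynchronous start_document_analysis call with the file in S3 rather than the synchronous call shown here:

```python
import boto3

textract = boto3.client("textract")

# Synchronous call -- fine for single-page documents supplied as bytes.
with open("report_page.pdf", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES", "FORMS"],  # ask for table and key/value detection
    )

# The response is a flat list of "Blocks"; key/value pairs and table cells
# have to be stitched together via their Ids and Relationships. As a first
# look, just print the detected lines of text.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block.get("Text", ""))
```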
I have a question about how to approach a solution in Azure: how do I decide which technologies to use and find the best combination of them?
Let's suppose I have two data sets, both growing daily:
I have a CSV file which arrives daily in my ADL store and contains weather data for all possible latitude/longitude combinations and their zip codes, together with 50 different weather variables.
I have another dataset with POS (point-of-sale) data, which also arrives as a daily CSV file in my ADL storage. It contains sales data for all retail locations.
The desired output is to have the files "shredded" so that the data is prepared for Azure ML forecasting of sales based on weather, with the forecasting done per retail location and delivered to each one via a Power BI dashboard. A requirement is that no location may see the forecasts for any other location.
My questions are:
How do I choose the right set of technologies?
How do I append the incoming daily data?
How do I create separate ML forecasting results for each location?
Any general guidance on the architecture is appreciated, and more specific comparisons of suitable solutions are also welcome.
This is way too broad a question.
I will only answer your ADL-specific question #2 and give you a hint on #3 that is not related to Azure ML (since I don't know what that format is):
If you just use files, add date/time information to your file path name (either in a folder or in the filename). Then use U-SQL file sets to query the ranges you are interested in. If you use U-SQL tables, use PARTITIONED BY. For more details, look in the U-SQL reference documentation.
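As an illustration of that naming convention, here is a minimal local-filesystem sketch (the folder names are assumptions; the actual upload to ADLS would use whichever SDK or tooling you already have in place). Landing each daily drop under a date-based path lets a U-SQL file set pattern such as "/weather/{date:yyyy}/{date:MM}/{date:dd}/weather.csv" pick out arbitrary date ranges later:

```python
import shutil
from datetime import date
from pathlib import Path

# Hypothetical staging layout mirroring the ADLS folder structure.
today = date.today()
target_dir = Path("weather") / f"{today:%Y}" / f"{today:%m}" / f"{today:%d}"
target_dir.mkdir(parents=True, exist_ok=True)

# "Append" the daily drop by adding a new dated file rather than rewriting
# one ever-growing CSV; U-SQL file sets can then query any range of dates
# by matching the path pattern.
shutil.copy("incoming/weather_today.csv", target_dir / "weather.csv")
```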
If you need to create more than one file as output, you have two options:
a. If you know all the file names, write an OUTPUT statement for each file, selecting only the data relevant to it.
b. Otherwise, you have to dynamically generate a script and then execute it, similar to this.
I have a requirement to retrieve data from SharePoint (I guess it is 2010, but I will check with the admin if relevant) and generate an Excel report/chart. Say we have a bug tracking system in SharePoint. Currently, I can create a view and see some statistics, but I need to plot a graph of how the number of bugs changed historically (week by week). For example:
get the number of bugs filed in a specific week
do some grouping based on type/severity
based on classification get number of bugs solved that week etc.
If I can get the numbers based on a date range, I can use Excel to plot the graph.
After some reading, the SharePoint object model comes closest to what I used to work with (Oracle DB). I understand it may be entirely different from a traditional DB and its querying.
Please help me with
What is the best method to approach this?
Is there a good book/resource?
Thanks a lot,
bsr
The easiest approach would be to link to the SharePoint lists using Access 2007 or 2010 and then export the data to Excel for further processing. Of course, you could also write a program that uses a CAML query to access the data. Your requirement sounds straightforward; unless you need to automate the reporting process, the simplest approach would be to access the lists via an Access database.
You could also create a web service via REST that pulls the data directly into Excel.
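As a rough sketch of that REST route: SharePoint 2010 exposes lists through the ListData.svc OData endpoint, so a small script can pull the bug list items and hand them to Excel as CSV. The site URL, list name, field names, and NTLM credentials below are assumptions to adapt:

```python
import requests
from requests_ntlm import HttpNtlmAuth  # pip install requests-ntlm

# Hypothetical site URL and list name.
SITE = "http://sharepoint.example.com/sites/bugtracker"
LIST = "Bugs"

resp = requests.get(
    f"{SITE}/_vti_bin/ListData.svc/{LIST}",
    # OData filter: only items created on or after a given date.
    params={"$filter": "Created ge datetime'2012-01-01T00:00:00'"},
    headers={"Accept": "application/json"},
    auth=HttpNtlmAuth("DOMAIN\\user", "password"),
)
resp.raise_for_status()

# Verbose OData JSON wraps the items in d/results.
for item in resp.json()["d"]["results"]:
    # "Severity" is a hypothetical custom column; Title and Created are standard.
    print(item["Title"], item.get("Severity"), item["Created"])
```

From there you can group the rows by week and severity and write them to a CSV that Excel charts directly.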
SharePoint has its own query language, CAML query, and in theory that could be used to retrieve the list you seek.
And you should be prepared for "some" trial and error.
Tools I used:
http://www.u2u.be/res/tools/camlquerybuilder.aspx
http://spud.codeplex.com/
What I understand from this question is that you need to put the SharePoint data into an Excel file, and to do so from within the SharePoint site? It looks to me like you could just create a simple SharePoint web part that consists of one button, "generate excel file". When the user clicks the button, you would query your SPList object (SharePoint object model) and get all the necessary data from the list (SPListItems).
This is the way I would take. Mind you, this is of course custom SharePoint development (.NET C#). There are lots of books and blogs that describe how to create your own web part in SharePoint.