Using Apache Solr with 'metadata' in Excel and files in DropBox - excel

First of all, apologies for what might seem like an 'amateur' scenario & question...
Situation
I have many, many documents (100,000) that I need users to be able to search and browse via a web application we are building
This search functionality is just 1 of several other functions
I currently have around a dozen excel spreadsheets that contain the 'metadata' (title, date, author, source, country etc.) or document information
Each of the 100,000 'records' (or excel rows) has a unique identifier
The actual files (majority PDF but some Word & Excel) are stored in Dropbox using the corresponding unique identifier as the file name
Questions
Is Apache Solr the best tool to use in order to provide the search functionality?
What is the best design to facilitate this (e.g. Files in AWS S3 etc.)?
What is the best method to migrate from excel/Dropbox to the proposed Apache solr solution?
I very much appreciate any assistance as I have just been getting many different answers from paid consultants.
Regards
Mark

Your questions:
Q: Is Apache Solr the best tool to use in order to provide the search functionality?
A: In my opinion, Solr is an awesome option for things like this. However, as you've discovered, there's "some assembly required" (and that's putting it mildly)
Q: What is the best design to facilitate this (e.g. Files in AWS S3 etc.)?
A: If it were me, I'd use the filesystem. I think it's the easiest to debug.
Also, if it were me, I'd export the Excel sheets to CSV; I think they're a bit easier to work with that way. Solr does include open source Tika filters, which support Excel, but they won't treat your multiple records as multiple documents; Tika will turn each Excel sheet into just one document.
Q: What is the best method to migrate from excel/Dropbox to the proposed Apache solr solution?
A: I'm a fan of the local filesystem. Dropbox does let you mirror your Dropbox files in a local directory. And as I said before, if you can get the Excel sheets exported to CSV, in some automated or "macro" way, I think it'd make your life easier too. For example, Python can read and write CSV files and is a great tool for massaging data into its final form.
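As a rough illustration of the CSV route, here is a minimal Python sketch that turns each exported row into one Solr document and prepares a JSON update request. The column name `id`, the `file_stem` field, the `/data/dropbox` mirror directory, and the helper names are assumptions made up for this example, not anything from your spreadsheets.

```python
import csv
import io
import json
import urllib.request

# Hypothetical header of the unique-identifier column; use your own.
ID_COLUMN = "id"


def rows_to_docs(csv_text, file_root="/data/dropbox"):
    """Turn each CSV row into one Solr document (a plain dict)."""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Trim whitespace and drop empty cells so blanks are not indexed.
        doc = {k.strip(): v.strip() for k, v in row.items()
               if k and v and v.strip()}
        if ID_COLUMN not in doc:
            continue  # skip rows that lack the unique identifier
        # Assumed convention: the mirrored file is named after the identifier
        # (the extension is not known from the metadata alone).
        doc["file_stem"] = f"{file_root}/{doc[ID_COLUMN]}"
        docs.append(doc)
    return docs


def build_update_request(docs, solr_url):
    """Prepare (but do not send) a JSON POST for Solr's update handler."""
    data = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To actually send the batch you would pass the request to `urllib.request.urlopen`, pointing it at something like `http://localhost:8983/solr/documents/update?commit=true`, where `documents` is a made-up core name.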
If you don't mind commercial solutions, you might consider Lucidworks Fusion; it includes a bunch of connectors, including a filesystem datasource connector. Disclaimer: I happen to work for Lucid, which is why I'm listing this suggestion last, after the "free" answers. I'd mention it anyway, even if I didn't work there.

Related

How to use Solr for multiple data sources?

I am a newbie to Solr and am facing the challenges below.
I have two data sources: a portal and a CMS. I need to provide a Solr search solution for these two sources so that when a user searches in a custom portlet (on the portal), they see results from both sources in the same place, i.e. Solr should fetch results from both sources. The user should also be able to open a result by clicking on it.
What should I consider when implementing this use case? Should I use multiple Solr cores or a single core? Also, how can I achieve features like faceted search, search filters, stop words, etc.?
Regards.
It should be perfectly fine to go with single core (and it will also work faster).
To import data from multiple data sources check out Solr Data Import Handler configuration:
http://wiki.apache.org/solr/DataImportHandler
and set up two entities, one for each of your data sources.
You will probably need a field in each imported document that records which data source it came from.
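To make the "source field" idea concrete, here is a small Python sketch, independent of the Data Import Handler, that tags documents with their origin before indexing and builds a query string that facets on that field. The field name `source`, the id prefixing, and both helper names are assumptions for illustration; `q`, `fq`, `facet` and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def tag_documents(docs, source):
    """Return copies of docs marked with their origin.

    Prefixing the id avoids collisions when both systems use the
    same numeric keys in a single core.
    """
    tagged = []
    for doc in docs:
        copy = dict(doc)
        copy["source"] = source
        copy["id"] = f"{source}-{doc['id']}"
        tagged.append(copy)
    return tagged


def facet_query(text, source=None):
    """Build a Solr query string that facets on the source field."""
    params = [("q", text), ("facet", "true"), ("facet.field", "source")]
    if source:
        # A filter query restricts results to one data source.
        params.append(("fq", f"source:{source}"))
    return urlencode(params)
```

With this layout a single core can answer "everything matching X" and also show per-source counts, and clicking a facet value simply adds the filter query.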
Your question is a little too general to really answer. Go and experiment a bit with the documentation you have. It should not be very hard to get some basic search functionality working.
You can find a lot of info about configuring Solr on LucidWorks wiki:
http://docs.lucidworks.com/display/solr/Faceting
and on Solr wiki: http://wiki.apache.org/solr/
You may also try with some books. Ex: http://www.packtpub.com/apache-solr-4-cookbook/book
I figured out a way to do the same. We can use http://wiki.apache.org/solr/Solrj as a Java client for Solr. Alfresco content can be written to XML files, and these XML files can be loaded into Solr using SolrJ.

Is it possible to write a word add-in or something similar to provide a custom mailmerge datasource?

I've a backend system which I want to use as the datasource for MS Word 2010 mailmerge.
My idea is to write a Word add-in which provides the datasource and communicates with my backend over web services.
Is it generally possible to provide a custom datasource with a Word add-in?
Is there a better solution?
Is it possible to do this in C#?
Where can I find more information about this?
Thanks
The types of data source that Word's out-of-the-box Mailmerge feature can use are well known - e.g.
file-type data sources such as Word documents, plain text files, Excel workbooks and so on,
"DDE" sources (Access, Excel, and MS Query, which in turn expects to work with ODBC connections),
ODBC data sources, and
OLE DB data sources (these have to conform to some criteria which AFAIK are not documented, e.g. I think the provider has to implement iCommand)
To use the built-in Mailmerge, Word has to open the source, and you cannot override OpenDataSource.
There's no interface that lets you use an ADO.NET data source directly, or a disconnected recordset. If you want to do that, you have to "roll your own" mailmerge. One starting point you could consider is Eric White's series of articles, see e.g. here.
Otherwise, you have to ensure that your data source is one of the above things. In the case of ODBC sources, you also have to use a DSN - you can't connect using a DSN-less connection string. You can avoid using a "file" using a System or Machine DSN, but it may actually be easier to distribute a file DSN than create one on the target machine.
In the case of OLE DB data sources, you have to have a "connection file" of some kind, either a .udl or .odc
A final note: as far as Word is concerned, "ODBC" is "old" and "OLE DB" is newer. But Microsoft announced some time ago that it was deprecating OLE DB in favour of ODBC. It isn't clear whether that only applies to SQL Server, nor have I seen anything that tells us what Word will support/favour in future. Meanwhile, of course, a lot more documents and data are being stored in cloud locations which may not be accessible using some ODBC/OLE DB drivers/providers.
Yes, you can do it from C#, but it is more work than the way you were hoping to do it. If you could use a custom data source, this would probably give you the best bang for your buck. If you are comfortable with programming (you mentioned C#), then you would write a small program (hopefully small) to control the mail merge process. I've seen JODReports and Docmosis do mail-merge doc-to-doc in the Java space, but I don't know if they have a C# API. If those tools look useful, then you could definitely call them from scripts or the command line, which might be your next fastest return on effort.
You also have the option of processing DOCX files yourself: since they are zip files of XML, you can literally open them up and make modifications. That's a lot more programming work, though.
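To show what "zip files of XML" means in practice, here is a minimal Python sketch that rewrites the main document part of a DOCX. The `{{name}}` placeholder convention and the `fill_placeholders` helper are assumptions invented for this example (they are not Word merge fields), and the sketch only works when Word has kept each placeholder inside a single text run.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# The main body of a DOCX lives in this part of the zip package.
DOCUMENT_PART = "word/document.xml"


def fill_placeholders(docx_bytes, values):
    """Return a new DOCX with {{key}} placeholders replaced in the body."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == DOCUMENT_PART:
                xml = data.decode("utf-8")
                for key, value in values.items():
                    # Escape &, < and > so the part stays well-formed XML.
                    xml = xml.replace("{{" + key + "}}", escape(str(value)))
                data = xml.encode("utf-8")
            # Every other part is copied through unchanged.
            dst.writestr(item.filename, data)
    return out.getvalue()
```

Calling it once per record, with values fetched from the backend, gives a crude "roll your own" merge; handling placeholders split across runs, headers, and footers is where the extra programming work comes in.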
Good luck.

Website -> Excel -> Excel

A little background to my question. I work for a company that is charged with retrieving data from databases from all 50 states and DC. I take this data and reformat it in Excel. Once it's reformatted I use SQL Server to upload it to our website vetportal.agdata.net. While some states are not so bad, retrieving information from others is very painful to sort through.
I have 2 questions:
Can code be written so that a new database can be cross-checked against the old database (our records), updating the information in the old database while also excluding duplicate information?
Can code be written to take a number from an open Excel sheet, switch over to an open website, input the number, search for the individual, extract his/her information, update the Excel sheet with that information, and then move on to the next person? E.g., WA State's website is set up so that you can only look up one person at a time, which is very tedious when going through 1200+ individuals.
I have some experience with C++ and have written programs that draw code from other files, but mainly only equations or values which then get evaluated in my code so I know this is a bit different.
I guess if you have a repetitive technological problem you can solve it with some programming.
Your questions:
You can do that with a small app that uses SQL to read the information from the new database and check/update the information in the old database.
This one is a little more difficult, but I guess it can be done. In C++, I don't know if there is a library that can open Excel files, but in Java you have Apache POI. With it you can open your Excel file in the application and then, while iterating through the rows, request the website from the application, submit the form with your number, and parse the response.
If you want to do this in Java, I think it will not take you too much time if you know C++. The only exception is requesting the website from Java and parsing it, which will take more time to learn and do.
Hope it helps!
1) Yes. Depending on the databases, you may be able to do a db to db connection. You could then write a query using an INNER JOIN to update information in the old database and exclude duplicates.
2) A few ways to approach this problem. Depending on your language (mine is PHP) you could use an open source class such as PHPExcel to open the sheet and fetch & update website data (cURL). You could also write some VBA within Excel that could do similar functionality.
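For the first question, the update-or-insert logic can be sketched in a few lines. This example uses Python's built-in `sqlite3` purely as a stand-in for the real database; the `people` table, its columns, and the `merge_records` helper are hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical table keyed on a licence number; adapt to your schema.
SCHEMA = """CREATE TABLE IF NOT EXISTS people (
    license_no TEXT PRIMARY KEY,
    name TEXT,
    status TEXT)"""


def merge_records(conn, new_rows):
    """Update rows that already exist, insert the rest.

    new_rows is an iterable of (license_no, name, status) tuples.
    Returns a (updated, inserted) pair of counts.
    """
    conn.execute(SCHEMA)
    updated = inserted = 0
    for license_no, name, status in new_rows:
        cur = conn.execute(
            "UPDATE people SET name = ?, status = ? WHERE license_no = ?",
            (name, status, license_no),
        )
        if cur.rowcount:
            updated += 1  # key already present, so the row was refreshed
        else:
            conn.execute(
                "INSERT INTO people (license_no, name, status) VALUES (?, ?, ?)",
                (license_no, name, status),
            )
            inserted += 1
    conn.commit()
    return updated, inserted
```

Because every row is matched on the key first, a record that appears twice in the incoming data refreshes the existing row instead of creating a duplicate.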

Full-text indexing an archived file

Greetings,
in short, I have to find out whether I can implement a way to index zipped .rtf files via IFilter under SQL Server 2008 Express for full-text search.
Long version:
this question is mostly a theoretical one - I'm neither experienced nor knowledgeable enough to find out whether such a thing is possible on my own.
The problem is as follows. There's a limited-size SQL Server Express 2008 R2 database that's going to store large .rtf files, probably 2-10k of them, and index them for full-text search. Now, they probably won't fit within the 10 GB limit, so I'm wondering if they could be archived (zipped, for instance) and stored that way. Full-text search should still be doable on them in their zipped state.
My thought was to try to chain IFilters in some way to achieve this (I've no idea if that's doable), or there could be a different solution that I'm not seeing at the moment; I'd appreciate any input, as I'm kind of at a loss.
You may have a much easier time using something like Lucene. You could extract the text for the files and index it.
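The underlying idea, index the extracted text and keep the original bytes compressed, can be sketched without Lucene. The toy Python class below is only an illustration under assumptions: `CompressedStore` is a made-up name, the plain text is assumed to have been extracted from the RTF already by some other tool, and the in-memory inverted index stands in for a real search engine.

```python
import re
import zlib
from collections import defaultdict


class CompressedStore:
    """Keep originals compressed; search a separate index of their text."""

    def __init__(self):
        self.blobs = {}                 # doc_id -> compressed original bytes
        self.index = defaultdict(set)   # term -> set of doc_ids

    def add(self, doc_id, raw_bytes, plain_text):
        # The original file is stored compressed and never searched directly.
        self.blobs[doc_id] = zlib.compress(raw_bytes)
        for term in set(re.findall(r"\w+", plain_text.lower())):
            self.index[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term in the query."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        result = set(self.index.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.index.get(term, set())
        return result

    def fetch(self, doc_id):
        """Decompress and return the original bytes."""
        return zlib.decompress(self.blobs[doc_id])
```

The point of the split is that the search index never needs to read the archive: text is extracted once at ingest time, and decompression happens only when a user opens a hit.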

Using Excel as front end to Access database (with VBA)

I am building a small application for a friend and they'd like to be able to use Excel as the front end. (the UI will basically be userforms in Excel). They have a bunch of data in Excel that they would like to be able to query but I do not want to use excel as a database as I don't think it is fit for that purpose and am considering using Access. [BTW, I know Access has its shortcomings but there is zero budget available and Access already on friend's PC]
To summarise, I am considering dumping a bunch of data into Access and then using Excel as a front end to query the database and display results in a userform style environment.
Questions:
How easy is it to link to Access from Excel using ADO / DAO? Is it quite limited in terms of functionality or can I get creative?
Do I pay a performance penalty (vs.using forms in Access as the UI)?
Assuming that the database will always be updated using ADO / DAO commands from within Excel VBA, does that mean I can have multiple Excel users using that one single Access database and not run into any concurrency issues etc.?
Any other things I should be aware of?
I have strong Excel VBA skills and think I can overcome Access VBA quite quickly but never really done Excel / Access link before. I could shoehorn the data into Excel and use as a quasi-database but that just seems more pain than it is worth (and not a robust long term solution)
Any advice appreciated.
Alex
I'm sure you'll get a ton of "don't do this" answers, and I must say, there is good reason. This isn't an ideal solution....
That being said, I've gone down this road (and similar ones) before, mostly because the job specified it as a hard requirement and I couldn't talk around it.
Here are a few things to consider with this:
How easy is it to link to Access from Excel using ADO / DAO? Is it quite limited in terms of functionality or can I get creative?
It's fairly straightforward. You're more limited than you would be with other tools, since VBA and Excel forms are a bit more limiting than most full programming languages, but there isn't anything that will be a show stopper. It works; sometimes it's a bit ugly, but it does work. In my last company, I often had to do this, and occasionally was pulling data from Access and Oracle via VBA in Excel.
Do I pay a performance penalty (vs.using forms in Access as the UI)?
My experience is that there is definitely a perf. penalty in doing this. I never cared (in my use case, things were small enough that it was reasonable), but going Excel<->Access is a lot slower than just working in Access directly. Part of it depends on what you want to do....
In my case, the thing that seemed to be the absolute slowest (and most painful) was trying to fill in Excel spreadsheets based on Access data. This wasn't fun, and was often very slow. If you have to go down this road, make sure to do everything with Excel hidden/invisible, or the redrawing will absolutely kill you.
Assuming that the database will always be updated using ADO / DAO commands from within Excel VBA, does that mean I can have multiple Excel users using that one single Access database and not run into any concurrency issues etc.?
You're pretty much using Excel as a client - the same way you would use a WinForms application or any other tool. The ADO/DAO clients for Access are pretty good, so you probably won't run into any concurrency issues.
That being said, Access does NOT scale well. This works great if you have 2 or 3 (or even 10) users. If you are going to have 100, you'll probably run into problems. Also, I tended to find that Access needed regular maintenance in order to not have corruption issues. Regular backups of the Access DB are a must. Compacting the access database on a regular basis will help prevent database corruption, in my experience.
Any other things I should be aware of?
You're doing this the hard way. Using Excel to hit Access is going to be a lot more work than just using Access directly.
I'd recommend looking into the Access VBA API - most of it is the same as Excel, so you'll have a small learning curve. The parts that are different just make this easier. You'll also have all of the advantages of Access reporting and Forms, which are much more data-oriented than the ones in Excel. The reporting can be great for things like this, and having the Macros and Reports will make life easier in the long run. If the user's going to be using forms to manage everything, doing the forms in Access will be very, very similar to doing them in Excel, and will look nearly identical, but will make everything faster and smoother.
I do this all the time. If you're using ADO, you're not really using Access, but Jet, the underlying database. That means anybody with Excel can use the app - Access not required. Oh I should mention, the place I work bought a bunch of Office Small Business licenses - no Access. Prior to working here, I would have assumed that anyone who had Excel would also have Access. Not so.
I create one class for every table in Access. I very rarely run queries through ADO directly; instead I keep that logic in the class modules. I read in with a SELECT statement and write out with an UPDATE or INSERT using the Execute method of the ADODB.Connection object.
See http://www.dailydoseofexcel.com/archives/2008/12/21/vba-framework-ii/
if you want to see how I set up my code.
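The class-per-table pattern is not specific to VBA. As a language-neutral illustration, here is a small Python sketch in which `sqlite3` stands in for Jet/ADO; the `customers` table and the `CustomerTable` class are invented for this example and are not taken from the linked framework.

```python
import sqlite3


class CustomerTable:
    """One class per table: all SQL for 'customers' lives here."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS customers "
            "(id INTEGER PRIMARY KEY, name TEXT)"
        )

    def load(self):
        """Read every row in with a single SELECT."""
        rows = self.conn.execute("SELECT id, name FROM customers ORDER BY id")
        return {row[0]: row[1] for row in rows}

    def save(self, customer_id, name):
        """Write out with UPDATE, falling back to INSERT for new rows."""
        cur = self.conn.execute(
            "UPDATE customers SET name = ? WHERE id = ?", (name, customer_id)
        )
        if cur.rowcount == 0:
            self.conn.execute(
                "INSERT INTO customers (id, name) VALUES (?, ?)",
                (customer_id, name),
            )
        self.conn.commit()
```

The calling code (the user form, in the Excel case) only ever sees `load` and `save`, so the SQL can change without touching the UI.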
To answer your questions: It will be a small learning curve for you if you already know Excel VBA, but there will be some learning to do; you will pay a performance penalty over doing it all in Access, but it's not that bad and only you can decide if it's worth it; and you can have multiple people accessing the database.
Just skip the excel part - the excel user forms are just a poor man's version of the way more robust Access forms. Also Access VBA is identical to Excel VBA - you just have to learn Access' object model. With a simple application you won't need to write much VBA anyways because in Access you can wire things together quite easily.
If the end user has Access, it might be easier to develop the whole thing in Access. Access has some WYSIWYG form design tools built-in.
Unless there is a strong advantage to running your user form in Excel then I would go with a 100% Access solution that would export the reports and data to Excel on an ad-hoc basis.
From what you describe, Access seems the stronger contender as it is built for working with data:
you would have a lot more tools at your disposal to solve any data problems than have to go around the limitations of Excel and shoehorn it into becoming Access...
As for your questions:
Very easy. There have been some other questions on SO on that subject.
See for instance this one and that one.
Don't know, but I would guess that there could be a small penalty.
The biggest difficulty I see is trying to get all the functionalities that Access gives you and re-creating some of these in Excel.
Yes, you can have multiple Excel users and a single Access database.
Here again, using Access as a front-end and keeping the data in a linked Access database on your network would make more sense and it's easy as pie, there's even a wizard in Access to help you do that: it's just 1 click away.
Really, as most other people have said, take a tiny bit of time to get acquainted with Access, it will save you a lot of time and trouble.
You may know Excel better, but you've gone 80% of the way already if you know VBA and are familiar with the Office object model.
Other advantages of doing it in Access: the Access 2007 runtime is free, meaning that whether you deploy the app to 1 or 30 PCs it would cost you the same: nothing.
You only need one full version of Access for your development work (the Runtime doesn't have the designers).
It really depends on the application. For a normal project, I would recommend using only Access, but sometimes, the needs are specific and an Excel spreadsheet might be more appropriate.
For instance, in a project I had to develop for a former employer, the need was to give different people access to forms (pre-filled with some data, different for each person), have them complete the forms, and then re-import the data.
Since the form was using heavy number crunching, it made more sense to build it in Excel.
The Excel workbooks for the different persons were built from a template using VBA, then saved in a proper location, with the access rights on the folder.
All workbooks were attached as external tables to the Access database, using named ranges. I could then query the workbooks from the Access application. All administrative stuff was done from the db, but the end users only had access to their respective workbook.
Developing an Excel/Access application this way was a pleasant experience, and the UI was more user-friendly than it would have been using Access.
I have to say that in this case, it would have taken a lot more time doing it in Access than it took using Excel. Also, the Application object model seems better thought out in Excel than in Access.
If you plan to use Excel as a front-end, do not forget to lock all the cells except the editable ones, and don't be afraid to use masked rows and columns (to construct output tables for the Access database, to perform intermediate calculations, etc.).
You should also turn off autocalculation while importing data.
It's quite easy and efficient to use Excel as a reporting tool for Access data.
A quick "non programming" approach is to set a List or a Pivot Table, linked to your External Data source. But that's out of scope for Stackoverflow.
A programmatic approach can be very simple:
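' SourceFile, strSql and myDestRange are placeholders you define elsewhere.
' Requires a reference to the Microsoft ActiveX Data Objects library.
Dim cnn As ADODB.Connection, rst As ADODB.Recordset
Dim strProv As String
strProv = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & SourceFile & ";"
Set cnn = New ADODB.Connection
cnn.Open strProv
Set rst = New ADODB.Recordset
rst.Open strSql, cnn
myDestRange.CopyFromRecordset rst
rst.Close
cnn.Close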
That's it !
Given the ease of use of Access, I don't see a compelling reason to use Excel at all other than to export data for number crunching. Access is designed to easily build data forms and, in my opinion, will be orders of magnitude easier and less time-consuming than using Excel. A few hours to learn the Access object model will pay for itself many times over in terms of time and effort.
I did it in one project of mine. I used MDB to store the data about bills and used Excel to render them, giving the user the possibility to adapt it.
In this case the best solution is:
Not to use any ADO/DAO in Excel. I implemented everything as public functions in MDB modules and called them directly from Excel. You can even return complex data objects, like arrays of strings, by calling MDB functions with the necessary arguments. This is similar to the client/server architecture of modern web applications: your web application just does the rendering and user interaction, while the database and middle tier sit on the server side.
Use Excel forms for user interaction and for data visualisation.
I usually have a very last sheet with some named ranges for settings: the path to the MDB files and some other settings (current user, password if needed, etc.), so you can easily adapt your Excel implementation to a different location of your "back-end" data.
Connecting Excel to Access using VBA is very useful; I use it in my profession every day. The connection string I use follows the program found in the link below. The program can be automated to do multiple connections or tasks in one shot, but the basic connection code looks the same. Good luck!
http://vbaexcel.eu/vba-macro-code/database-connection-retrieve-data-from-database-querying-data-into-excel-using-vba-dao
It depends on how much functionality you expect from an Excel<->Access solution. In many cases where you don't have the budget for a complete application, these little utilities do work. If the scope of the project is limited, I would go for this solution, because Excel gives you the flexibility to design spreadsheets according to your needs, and you can then hand those predesigned sheets to users. Designing a spreadsheet-like form in Access is more time-consuming and difficult, and requires some ActiveX. If the objective is not only handling data but also presenting it in spreadsheet-like formats, then this solution should work within a limited scope.
You could try something like XLLoop. This lets you implement excel functions (UDFs) on an external server (server implementations in many different languages are provided).
For example you could use a MySQL database and Apache web server and then write the functions in PHP to serve up the data to your users.
BTW, I work on the project so let me know if you have any questions.

Resources