I am trying to migrate structured documents (i.e. documents that are mostly some metadata and one big table) to a database. When I move tabular data from Word to Excel, my main pain point is handling CRLFs within a Word table cell. Is there any solution for this?
Now, since I will be transferring from Word to Access:
What will be the default behaviour when I attempt to populate a field with a string that contains a CRLF?
What is the cheapest way to get Access to respect "rich text"? (mostly boldface and overstrike)
Thanks
It should just enter the two characters as any other two characters.
HTML is a pretty good solution.
For a more detailed answer, we should probably know how you are doing this "migration".
I am in the middle of applying NLP to a set of comments from my data. The comments are stored in one column. I have cleaned them all and stored them in a list; stop words, special characters, etc. have been removed. Now I want to create a summary from this text. What would be the best method to do that? I have already tried heapq without success, so I don't want a solution based on it.
My cleaned text is stored in a list named clean_text_summary, and it looks like this:
clean_text_summary = ['You are so bad - I hate your product','I am going to deregister','You are frauds'....]
As a summary, I need the most common things that people have talked about.
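One simple way to surface "the most common things" is to count recurring word sequences (n-grams) across all comments. The sketch below is only an illustration under that assumption, not a full summarizer; the function name most_common_terms and its parameters are hypothetical, and only the three comments visible in the question are used as sample data.

```python
from collections import Counter

def most_common_terms(comments, n=2, top=5):
    """Count the most frequent n-word phrases across all comments."""
    counts = Counter()
    for comment in comments:
        words = comment.lower().split()
        # Slide a window of n words over each comment.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(top)

# Sample data: the three comments shown in the question.
clean_text_summary = [
    "You are so bad - I hate your product",
    "I am going to deregister",
    "You are frauds",
]

print(most_common_terms(clean_text_summary, n=2, top=3))
```

With n=1 this degrades to plain word counts; larger n values pick out longer repeated phrases, at the cost of needing more data before anything repeats.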
I recently came across the following problem: when applying a topic model to a bunch of parsed PDF files, I discovered that the content of the references unfortunately also feeds into the model, i.e. words from the references appear in the tokenized list of words.
Is there any known "best-practice" to solve this problem?
I thought about a search strategy where the Python code automatically removes all content after the last mention of "references" or "bibliography". If I went by the first, or a random, mention of "references" or "bibliography" within the full text, the parser might cut off part of the actual content.
The input PDF are all from different journals and thus have a different page structure.
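The "last mention" strategy described above could be sketched roughly as follows. This assumes the heading sits on a line of its own; numbered headings such as "7. References" would need a looser pattern. The names HEADING and strip_references are hypothetical.

```python
import re

# Assumption: the heading is a line containing only "References" or
# "Bibliography" (any case). Extend the pattern for other journals.
HEADING = re.compile(
    r"^[ \t]*(?:references|bibliography)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def strip_references(text):
    """Drop everything from the last standalone heading line onward."""
    matches = list(HEADING.finditer(text))
    if not matches:
        return text  # no heading found, leave the text untouched
    return text[:matches[-1].start()]
```

Matching only whole lines avoids cutting at in-sentence uses of the word, such as "see references in section 2".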
The syntax is what makes a bibliography entry distinct from a regular sentence.
Test for the pattern that matches whichever reference style (or styles) you are trying to remove.
For example: date, unquoted string, string, page numbers in a certain format.
I'd spend some time searching for a tool that already recognizes bibliography before doing this, as it will be unique to each style (MLA etc.)
A couple of additional features to consider for detecting the start of the reference section:
Check if the mention of "references" or "bibliography" is in the last pages as opposed to earlier pages
Run entity recognition on some length of words (~50?) after the word and if a high number of tokens in the 50 are entities, that indicates journal names, author names, etc.
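A rough sketch of combining these two features: accept a heading only if it sits late in the document and the text right after it looks like citations. Counting four-digit years is used here as a cheap stand-in for real entity recognition, which is an assumption of this sketch and not what the answer proposes literally. The thresholds (min_position, window, min_years) and all names are hypothetical and would need tuning.

```python
import re

HEADING = re.compile(
    r"^[ \t]*(?:references|bibliography)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
# Four-digit years from 1900 to 2099, a crude proxy for citation entries.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def find_reference_start(text, min_position=0.5, window=500, min_years=2):
    """Return the offset of the last plausible reference heading, or None."""
    for m in reversed(list(HEADING.finditer(text))):
        # Feature 1: the heading should appear in the later part of the text.
        late_enough = m.start() >= min_position * len(text)
        # Feature 2: the text that follows should be dense with years.
        following = text[m.end():m.end() + window]
        enough_years = len(YEAR.findall(following)) >= min_years
        if late_enough and enough_years:
            return m.start()
    return None
```

Swapping the year count for a named-entity ratio from an NER library would follow the same structure: only the second feature changes.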
I know that it is possible to generate a word document from a list, but what if I need data in my word document from two lists?
I can create a linked data source which combines the two lists, but I have not seen any example on using the linked data source to create a word report. The list relationship is one-to-many, so the first list has header information, while the other has item information.
Can someone point me in the right direction for this? I'd like to do this without any libraries such as OpenXML etc.
Thanks,
gixen
I am a little bit confused regarding "I'd like to do this without any libraries such as OpenXML etc."
If you really need to create a solution without any library, then WordML might help. Or you could handle the docx file contents manually (a docx is a bunch of zipped files). However, this would be a really time-consuming task. If you need to create documents, I would insist on using the OpenXML SDK.
If "without any library" means you would like to avoid writing code you could try our product.
How about creating it in MS Excel?
You can have a VLOOKUP to get the one-to-many relationship addressed:
LIST ONE:
dog
cat
COMBINED LISTS ONE AND TWO:
dog fido
dog king
cat mittens
I have a document library set up with multiple different categories of document, and I'm using a metadata column to differentiate between them.
I want to be able to display two different document library web parts side by side on a page, one for each category of file. This is simple for one category: I just set up a list view filtered by the metadata column. But when I add a second web part alongside the first, it breaks the first one.
I have no idea why this is happening, but it seems like SharePoint isn't happy with pulling two sets of data from the same document library.
When I am editing the web parts, I can get them to both display the documents I want, but then when I click save, the first web part empties.
Not sure what other information would be useful for diagnosing or helping with the problem, so if I haven't given enough detail let me know. I am familiar with SPD as well as developing through the web interface, so if this needs a more complex solution that's fine with me!
Having spent some more time playing around with this, it struck me that I could probably achieve what I wanted using something other than a Document web part, and I was right.
Instead of using the somewhat inflexible document web part, I created a content query web part which only searched within the document library from my site, and filtered by the metadata column.
This way I can create as many queries as I like and they don't interact with each other in weird ways. It also has the advantage of being significantly easier to customise the output without needing to resort to SharePoint Designer.
Content Queries are the answer!
In SharePoint Designer I use some lists as sources and then link them together with an operation GetListItems (I fetch items from multiple lists on different site collections for rollup/aggregation):
alt text http://img151.imageshack.us/img151/1807/ss20090428101310.png
Now this partly works, as I managed to get a result: alt text http://img410.imageshack.us/img410/4835/ss20090428101013.png
But the strings that are prepended to the field results (6;#, 2;#) are... disturbing.
How can I get rid of those prepended strings? They are not attached to all fields, only to some (important ones):
alt text http://img168.imageshack.us/img168/1647/ss20090428100732.png
Ahh, well, that's usually what happens: you keep searching for an answer, then ask for help and find it yourself.
I used the XSL substring function to strip away those first characters. Messy if I want to add links to that table, but it works.
alt text http://img2.imageshack.us/img2/3117/ss20090428102714.png
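For illustration, here is the same stripping idea outside of XSL, as a small Python sketch. It assumes the prefix always has the form "<digits>;#" as in the examples above (6;#, 2;#); the function name and the sample value "Project A" are hypothetical. Note that a fixed-length substring breaks once an ID has two digits ("12;#" is four characters, not three), so in XSL the XPath function substring-after(value, ';#') may be a more robust choice than a fixed offset.

```python
import re

# Assumption: affected values look like "<id>;#<text>", e.g. "6;#Project A".
LOOKUP_PREFIX = re.compile(r"^\d+;#")

def strip_lookup_prefix(value):
    """Remove a leading '<digits>;#' marker if present."""
    return LOOKUP_PREFIX.sub("", value, count=1)

print(strip_lookup_prefix("6;#Project A"))
print(strip_lookup_prefix("12;#Project B"))
```

Values without the marker pass through unchanged, so the function can be applied to every field without checking first.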
By the way, the main question, how to roll up content from multiple site collections, has been a journey for me for several days already. If anyone is in the same situation, I recommend these (because I found my answer there):
How-To Rollup two lists in two site collections on a page
Or a better way to use for a single site collection: SharePoint Customisation Tricks: Use The SPDataSource, Luke! (Good links inside that article).
Something I didn't touch, because I didn't need such an advanced method, but maybe someone does: Populating data sources in code