Creating text summary using NLP - python-3.x

I am in the middle of applying NLP to a set of comments I have received on my data. These comments are stored in one column. I have cleaned them and stored them all in a list; there are no stop words, special characters, etc. Now I want to create a summary from this text. What would be the best method to do that? I have already failed with heapq, so I don't want any solution built around it.
My clean text is stored in a list named clean_text_summary, and it looks like this:
clean_text_summary = ['You are so bad - I hate your product','I am going to deregister','You are frauds'....]
I need to get the most common things that people have talked about as a summary.
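One simple direction, as a frequency-based sketch rather than a full summarizer: count term frequencies across the cleaned comments with collections.Counter and surface the comments containing the top terms. Only clean_text_summary comes from the question; everything else here is illustrative:

from collections import Counter
from itertools import chain

clean_text_summary = ['You are so bad - I hate your product',
                      'I am going to deregister',
                      'You are frauds']

# Tokenize each comment and count word frequencies across the whole corpus.
tokens = [comment.lower().split() for comment in clean_text_summary]
word_counts = Counter(chain.from_iterable(tokens))

# The most frequent terms give a rough picture of what people mention most.
top_terms = [word for word, _ in word_counts.most_common(5)]
print("Top terms:", top_terms)

# Surface the comments containing the top terms as crude summary candidates.
summary = [c for c in clean_text_summary
           if any(term in c.lower().split() for term in top_terms)]
print("Summary candidates:", summary)

From here you could swap the whitespace tokens for n-grams or noun phrases, or hand the corpus to a dedicated extractive summarizer, without touching heapq.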

Related

Find and save a specific string until a semicolon

I have a large dataset (~25 GB) and I would like to retrieve the data following 8 specific modifiers. For example, if I search for the "AC_afr" tag, I would also like to keep its data, "AC_afr=8855525;". I need a way to search for a tag and then keep everything after it up to the semicolon.
I would normally open it up in Excel, but the data is much too large.
I have looked online for grep options, but could not find a solution.
Example of the data:
AC_afr=0;AN_afr=8250;non_neuro_AN_eas_kor=2404;non_neuro_AF_eas_kor=0.00000e+00;non_neuro_nhomalt_eas_kor=0;non_cancer_AF_nfe_seu=0.00000e+00;non_cancer_nhomalt_nfe_seu=0;
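A streaming approach in Python (a sketch; the tag list and file names are placeholders) that scans the file line by line and keeps everything from each wanted tag up to the next semicolon, so the 25 GB file never has to fit in memory:

import re

tags = ["AC_afr", "AN_afr"]  # extend to all 8 modifiers of interest

# Match "tag=<anything up to a semicolon>;" for any wanted tag; the \b
# keeps "AC_afr" from matching inside longer tags such as "non_neuro_AC_afr".
pattern = re.compile(r"\b(?:%s)=[^;]*;" % "|".join(map(re.escape, tags)))

with open("data.txt") as infile, open("matches.txt", "w") as outfile:
    for line in infile:  # streamed, so memory use stays constant
        for match in pattern.findall(line):
            outfile.write(match + "\n")

The same pattern works with grep on the command line, e.g. grep -oE "AC_afr=[^;]*;" data.txt, which prints only the matching fragments.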

NLP Challenge: Automatically removing bibliography/references?

I recently came across the following problem: when applying a topic model to a bunch of parsed PDF files, I discovered that the content of the references unfortunately also counts toward the model, i.e. words from the references appear in the tokenized list of words.
Is there any known "best-practice" to solve this problem?
I thought about a search strategy where the Python code automatically removes all content after the last mention of "references" or "bibliography". If I went by the first, or a random, mention of "references" or "bibliography" within the full text, the parser might cut off genuine content.
The input PDFs are all from different journals and thus have different page structures.
The syntax is what makes a bibliography entry distinct from a regular sentence.
Test for the pattern that matches whichever reference style (or styles) you are trying to remove.
That is: a date, an unquoted string, a string, page numbers in a certain format.
I'd spend some time searching for a tool that already recognizes bibliographies before building this yourself, as the pattern will be unique to each style (MLA, etc.).
A couple of additional features to consider for detecting the start of the references section (a sketch combining these follows below):
Check whether the mention of "references" or "bibliography" occurs in the last pages rather than in earlier ones.
Run entity recognition on some stretch of words (~50?) after the keyword; if a high proportion of those tokens are entities, that indicates journal names, author names, etc.
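A minimal sketch of the last-mention heuristic combined with the position check above (the keyword list and the 0.5 cutoff are assumptions to tune per corpus):

def strip_references(text, keywords=("references", "bibliography"),
                     min_position=0.5):
    """Drop everything after the last reference keyword, but only if
    that keyword sits in the later part of the document."""
    lowered = text.lower()
    cut = max(lowered.rfind(kw) for kw in keywords)
    # Ignore early mentions of "references" in the body text.
    if cut != -1 and cut / len(text) >= min_position:
        return text[:cut]
    return text

sample = ("Introduction and a long discussion of the findings across many "
          "pages ... References\nSmith, J. (2019). A title. Journal.")
print(strip_references(sample))

The entity-recognition check could then be layered on top: only truncate when the tokens after the candidate cut point are dominated by named entities.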

Search Algorithm for a web application that needs to look for a specific value

I'm developing a webapp that will need to download the HTML from a website and then iterate through the code to find a specific but ever-changing value (in our case, the price of a product).
For this, I was thinking of asking the user (during installation and setup) to provide the system with a few lines of HTML from the page (the lines containing the price); from then on, every time we need to fetch the price, we would search for those lines and extract the price.
Now, I believe this is a horrible and slow way of doing it, but since there are no rules and the HTML can be totally different from one website to another (even the same website might change), I couldn't find a better way.
One improvement I thought of was to iterate through the page the first time and record the line at which we find the value. On subsequent runs we would start the search a few lines before the expected location. Any thoughts on how I can improve on this?
I posted this question on https://cstheory.stackexchange.com/ but they commented that it's not on topic and that I should post it here.
I have the code for the above and can post it if needed; I simply think there must be a better, faster way of doing this.
This is actually something I tried for a project recently (using BeautifulSoup and Python). The solution that worked for me was to work out CSS selectors (which map closely to jQuery selectors) targeting the elements that contained the values I was looking for. In my case I was able to narrow the full document down to just the elements containing what I wanted, but if you can't get exactly what you're after, you could combine this with some extra logic, like a test for whether the text looks like a price (via regex) or a check on what it sits next to.
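A sketch of that approach (the URL and the ".price" selector are placeholders; you would substitute the selector worked out for the target page):

import re
import requests
from bs4 import BeautifulSoup

PRICE_RE = re.compile(r"\d+(?:[.,]\d{2})?")  # crude "looks like a price" test

html = requests.get("https://example.com/product").text
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector, so each site only needs a selector string
# instead of raw lines of HTML, and small page changes are less likely to break it.
for element in soup.select(".price"):
    text = element.get_text(strip=True)
    match = PRICE_RE.search(text)
    if match:
        print("Price candidate:", match.group())

Storing one selector per site is also much cheaper than re-scanning the whole document for remembered lines.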

Saving to multiple lists from one SharePoint 2007 list form

I have a request form I'm working on that different departments need to be able to update. To minimize overlap and lost changes, I'd like to be able to submit data from the new form to different lists, but I cannot find a way to do this.
Does anyone have any experience trying to do anything similar?
If you're familiar with jQuery and SPServices, I can envisage a way to do this.
In EditForm.aspx, add the jQuery and SPServices libraries. In the $(document).ready function, I'd do a quick item update with SPServices, writing a column back to itself with the same data so that, in effect, no change appears to have taken place. I'd set the edit comments to something like "Pseudo checkout to [name], [date_time]".
Then allow the user to edit the form as normal, but in the code you've added, trap the PreSave action and check that the person trying to save is the same as the last modifier. If it is, save as normal; otherwise, return false from the PreSave and the save will be denied. When you do allow the save, set the edit comments to something sensible.
To complete this, check before doing the pseudo checkout that the last comments don't contain the pseudo-checkout phrase, so that you can prevent anyone opening/editing the form while somebody else is in the middle of an edit.
This gives a cheap and relatively easy-to-implement check-in/check-out for a list. Not perfect, of course, but it should work well in most scenarios (not in datasheet view, though, so you might need to prevent that type of edit).
If you have two lists, would you not then have the problem of potentially two requests for the same thing?
Does none of the version control options for the list solve the problem of potentially multiple concurrent editors?
While SPServices is certainly a solution, you will have to build a UI of your own.
Try writing an event receiver, which can copy the item over to another list as soon as it is created.
It would help if you could say why you really want a copy of the item in another list (e.g. for auditing purposes); with that, you could get a more precise solution here in the forum.

Moving data from Word to Access seamlessly

I am trying to migrate structured documents (i.e. documents that are mostly some metadata and one big table) to a database. When I try to move tabular data from Word to Excel, my main pain point is handling CRLFs within a cell in Word. Any solution for this?
Now, since I will be transferring from Word to Access:
What will be the default behaviour when I attempt to populate a field with a string that contains a CRLF?
What is the cheapest way to get Access to respect "rich text"? (mostly boldface and overstrike)
Thanks!
It should just enter the two characters as it would any other two characters.
HTML is a pretty good solution.
For a more detailed answer, we should probably know how you are doing this "migration".
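If the migration is scripted, a sketch using python-docx (an assumption; the question names no tool) shows one way to neutralize in-cell line breaks before the data reaches Access:

from docx import Document  # pip install python-docx

doc = Document("source.docx")  # placeholder file name

rows_out = []
for table in doc.tables:
    for row in table.rows:
        # cell.text joins a cell's paragraphs with "\n"; replacing those
        # in-cell breaks with a visible marker (or a space) keeps them from
        # acting as record separators during the import.
        rows_out.append([cell.text.replace("\n", " / ") for cell in row.cells])

for record in rows_out:
    print(record)

For the rich-text question, storing the content as HTML in a Memo field set to Rich Text is the usual route, which is presumably what the HTML suggestion above refers to.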
