Addressing Reliable Output in Newspaper3k - python-3.x

Current Behavior:
While attempting to use the news-aggregator package Newspaper3k, I am unable to produce consistent, reliable output.
System/Environment Setup:
Windows 10
Miniconda3 4.5.12
Python 3.7.1
Newspaper3k 0.2.8
Steps (Code) to Reproduce:
import newspaper
cnn_paper = newspaper.build('http://cnn.com')
print(cnn_paper.size())
Expected Behavior/Output (varies based on current links posted on cnn):
Produce a consistent number of posted links on cnn across consecutive runs.
Actual Behavior/Output:
Running the code the first time produces a different number of links than running it again immediately afterwards.
1st Run Print output: 94 (as of time of posting this question)
2nd Run Print output: 0
3rd Run Print output: 18
4th Run Print output: 7
Printing the actual links shows the same variation as the link counts above. I have tried a number of different news sources, and the same unexpected variance results. Do I need to change my User-Agent header? Is this a detection issue? How do I produce reliable results?
Any help would be much appreciated.
Thanks.

My issue was resolved by a better understanding of the default caching described under the heading 6.1.3 Article caching in the user documentation.
Apart from my general ignorance, my confusion came from the fact that the Read the Docs documentation listed the caching function as a TODO.
On closer reading, I discovered:
By default, newspaper caches all previously extracted articles
and eliminates any article which it has already extracted. This feature
exists to prevent duplicate articles and to increase extraction speed.
The return value of cbs_paper.size() changes from 1030 to 2 because
when we first crawled cbs we found 1030 articles. However, on our
second crawl, we eliminate all articles which have already been
crawled. This means 2 new articles have been published since our first
extraction.
You may opt out of this feature with the memoize_articles parameter.
You may also pass in the lower level Config objects as covered in the advanced section.
>>> import newspaper
>>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
>>> cbs_paper.size()
1030
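
To see why the counts in the question jump around (94, 0, 18, 7), here is a toy model of the caching described above. It is only a sketch of the behaviour the documentation describes, not newspaper's actual implementation: ToySource, its build() method, and the _seen set are hypothetical names invented for illustration.

```python
# Toy model of memoize_articles; NOT newspaper's real implementation.
class ToySource:
    def __init__(self, memoize_articles=True):
        self.memoize_articles = memoize_articles
        self._seen = set()  # stands in for the cache kept between builds

    def build(self, urls_on_page):
        """Return the article URLs that a build would report."""
        urls = list(dict.fromkeys(urls_on_page))  # de-duplicate, keep order
        if not self.memoize_articles:
            return urls
        fresh = [u for u in urls if u not in self._seen]
        self._seen.update(fresh)
        return fresh


page = ["/a", "/b", "/c"]

cached = ToySource(memoize_articles=True)
first = len(cached.build(page))           # everything is new
second = len(cached.build(page))          # everything was already seen
third = len(cached.build(page + ["/d"]))  # only the newly posted link

uncached = ToySource(memoize_articles=False)
repeat = [len(uncached.build(page)) for _ in range(2)]

print(first, second, third, repeat)
```

With memoization on, each build reports only links that earlier builds have not already seen, so the count depends on what was published in between. With memoize_articles=False the count reflects the page itself.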

Related

Reading a grib2 message into an Iris cube

I am currently exploring the notion of using iris in a project to read forecast grib2 files using python.
My aim is to load/convert a grib message into an iris cube based on a grib message key having a specific value.
I have experimented with iris-grib, which uses gribapi. Using iris-grib I have not been able to find the key in the grib2 file, although the key is visible with 'grib_ls -w...' via the cli.
gribapi does the job, but I am not sure how to interface it with iris (which is what, I assume, iris-grib is for).
I was wondering if anyone knew of a way to get a message into an iris cube based on a grib message key having a specific value. Thank you
You can get at anything that the gribapi understands through the low-level grib interface in iris-grib, which is the iris_grib.GribMessage class.
Typically you would use for msg in GribMessage.messages_from_filename(xxx): and then access it like e.g. msg.sections[4]['productDefinitionTemplateNumber']; msg.sections[4]['parameterNumber'] and so on.
You can use this to identify required messages, and then convert to cubes with iris_grib.load_pairs_from_fields().
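
A minimal sketch of that filter-then-convert approach might look like the following. The helper names message_matches and load_matching_cubes, and the idea of passing criteria as a (section number, key) mapping, are assumptions made for illustration. Only GribMessage.messages_from_filename, msg.sections and iris_grib.load_pairs_from_fields come from iris-grib itself.

```python
def message_matches(sections, criteria):
    """Check a message's sections against {(section_number, key): value}."""
    for (section, key), wanted in criteria.items():
        try:
            if sections[section][key] != wanted:
                return False
        except KeyError:
            # Section or key not present in this message.
            return False
    return True


def load_matching_cubes(path, criteria):
    """Convert only the messages that satisfy `criteria` into Iris cubes."""
    # Imported lazily so message_matches can be used without iris-grib.
    import iris_grib
    from iris_grib.message import GribMessage

    selected = (
        msg
        for msg in GribMessage.messages_from_filename(path)
        if message_matches(msg.sections, criteria)
    )
    # load_pairs_from_fields yields (cube, message) pairs.
    return [cube for cube, _ in iris_grib.load_pairs_from_fields(selected)]
```

A call such as load_matching_cubes(path, {(4, 'parameterNumber'): 2}) would then return cubes only for messages whose section 4 has that value; the value 2 here is arbitrary.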
However, iris-grib only knows how to translate specific encodings into cubes: it is quite strict about exactly what it recognises, and will fail on anything else. So if your data uses any unrecognised templates or data encodings it will definitely fail to load.
I'm just anticipating that you may have something unusual here, so that might be an issue?
You can possibly check your expected message contents against the translation code at iris_grib:_load_convert.py, starting at the convert() routine.
To get an Iris cube out of something not yet supported, you would either:
(a) extend the translation rules (i.e. a GitHub PR), or
(b) sometimes you can modify the message so that it looks like something that can be recognised.
Failing that, you can
(c) simply build an Iris cube yourself from the data found in your GribMessage. That can be a little simpler than using 'gribapi' directly (possibly not, depending on detail).
If you have a problem like that, you should definitely raise it as an issue on the GitHub project (iris-grib issues) and we will try to help.
P.S. as you have registered a Python3 interest, you may want to be aware that the newer "ecCodes" replacement for gribapi should shortly be available, making Python3 support for grib data possible at last.
However, the Python3 version is still in beta and we are presently experiencing some problems with it, now raised with ECMWF, so it is still almost-but-not-quite achievable.

If I interrupt sklearn grid_search.fit() before completion can I access the current .best_score_, .best_params_?

If I interrupt grid_search.fit() before completion will I lose everything it's done so far?
I got a little carried away with my grid search and provided an obscenely large search space. I can see scores that I'm happy with already but my stdout doesn't display which params led to those scores.
I've searched the docs: http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html
And there is a discussion from a couple of years ago about adding a feature for parallel search here: https://sourceforge.net/p/scikit-learn/mailman/message/31036457/
But nothing definitive. My search has been running for ~48hrs, so I don't want to lose what's been discovered, but I also don't want to continue.
Thanks!
Welcome to SO!
To my understanding there aren't any intermediate results returned by the grid_search function, only the resulting grid and its scores (see grid_search.py for more information).
So if you cancel it you might lose the work that's been done so far.
But a bit of advice: 48 hours is a long time (obviously this depends on the rows, columns and number of hyperparameters being tuned). You might want to start with a broader grid search first and then refine your parameter search from there.
That will benefit you in two ways:
1. Run time might end up being much shorter (see caveats above), meaning you don't have to wait so long and risk losing results.
2. You might find that your model's prediction score is only affected by one or two hyperparameters, letting you keep the other searches broader and focus your efforts on the parameters that influence your prediction accuracy most.
Hopefully by the time I've written this response your grid search has completed!!
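
One way to avoid losing work in future runs is to drive the search loop yourself and record each result as it arrives. The sketch below is not part of scikit-learn: interruptible_search and the evaluate callable are hypothetical, and evaluate is assumed to take a dict of parameters and return a score where higher is better (for example by wrapping sklearn.model_selection.cross_val_score).

```python
import itertools


def interruptible_search(param_grid, evaluate):
    """Try every combination in param_grid, keeping results as they arrive.

    `evaluate` is a hypothetical callable: params dict -> score.
    Returns (best_score, best_params, all_completed_results).
    """
    names = sorted(param_grid)
    results = []  # (score, params) for every completed combination
    try:
        for values in itertools.product(*(param_grid[n] for n in names)):
            params = dict(zip(names, values))
            results.append((evaluate(params), params))
    except KeyboardInterrupt:
        pass  # Ctrl-C: stop searching and report what has finished
    if not results:
        return None, None, results
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_score, best_params, results
```

Pressing Ctrl-C during the loop then returns the best parameters found so far instead of discarding them. Unlike GridSearchCV, this sketch runs serially and does not refit a final model.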

Node Js: Not getting the details using dataSources with datasets

I am trying to get the step count by date. I fetch the data from Google Fit using this
API:
https://www.googleapis.com/fitness/v1/users/me/dataSources/derived:com.google.step_count.delta:com.google.android.gms:estimated_steps/datasets/1457548200000000000-1457631000000000000&token=1111111111
I get only a limited step count, not all the steps for that date. Why does this kind of problem occur when getting Google Fit data?
Can anyone suggest a better way to get all the data from Google Fit?
Using the derived:com.google.step_count.delta:com.google.android.gms:estimated_steps data source
will give you varying results depending on the scenario. This is mainly caused by the sensors used, which may be why your results look limited.
estimated_steps also takes into account activity, and estimates steps
when there are none. For instance, assume the user walked for 30
minutes, but the hardware step counter only recorded 10 steps. We
know that number is inaccurate so instead we estimate, say 3000 steps
during that time.
This was noted and discussed in this SO post.
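
Separately from the data source, it can help to check that the dataset ID in the request covers the whole day. The sketch below builds the "<startNs>-<endNs>" part of the URL for one local calendar day. The function name day_dataset_id is hypothetical, and the UTC+5:30 offset used in the example is an assumption inferred from the start timestamp in the question.

```python
from datetime import datetime, timedelta, timezone


def day_dataset_id(year, month, day, utc_offset_minutes=0):
    """Return '<startNs>-<endNs>' covering one full local calendar day."""
    tz = timezone(timedelta(minutes=utc_offset_minutes))
    start = datetime(year, month, day, tzinfo=tz)
    end = start + timedelta(days=1)
    # Work in whole seconds, then scale to nanoseconds, to avoid float rounding.
    start_ns = int(start.timestamp()) * 10**9
    end_ns = int(end.timestamp()) * 10**9
    return f"{start_ns}-{end_ns}"


# Assumed offset of +5:30 (330 minutes); adjust to the user's time zone.
print(day_dataset_id(2016, 3, 10, utc_offset_minutes=330))
```

For that date and offset this gives 1457548200000000000-1457634600000000000. The start matches the URL in the question, while the range in the question ends at 1457631000000000000, which is 23 hours after its start.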

TFS Query results with a list of linked work item IDs in Excel?

How can I get a list of linked work item IDs for a set of work items?
Excel-hosted queries preferred. API Sample is acceptable.
Direct DB table query is acceptable (read-only and unsupported of course!)
Many thanks in advance! -Zephan
MORE INFORMATION
UPDATE: No answers for my original Q so broadening scope of acceptable answers as follows:
Answer for TFS2015 (migrating very shortly) or TFS2013 (potentially useful for TFS2015) is preferred over TFS2010
Coding acceptable if there are any APIs or PowerShell cmdlets (MS or community).
Connecting directly (read-only!) to TFS DB tables is acceptable (source tables and related relationship link table names). Yes, directly referencing TFS DB tables is VERY unsupported, read-only, and "AT YOUR OWN RISK." Still beats having to manually copy/paste data or reconstruct list of links in Excel.
ORIGINAL QUESTION & DETAILS
My team uses TFS2010 (soon 2013 or hoping 2015) and VS2010-2015. I need to support traceability reports and analyze/quantify our coverage of ~300 Test Case work items linked to ~400 Requirement work items. Direct Link and Tree queries are close but don't give me related links on the same row as parent work item. Many thanks in advance for your suggestions and any related code fragments.
Example:
3 test cases (Test1, Test2, Test3)
4 Requirements (Req1, Req2, Req3, Req4)
For simplicity let's just use TFS work item IDs to represent each TestN and ReqN. In actuality, I have a keyword to identify my validation requirements (separate from the 1,000's of other requirements in this Team Project). The only Test Case WI I care about for this problem are those linked to one or more Validation Requirement trace-ability.
Scenarios:
1:1 (simple) Test1 is linked to Req1
1:2 (1:n) Test2 is linked to Req2 and Req3
2:1 (n:1) Test3 (and Test2) are both linked to Req3
0:1 (Requirement missing Test coverage) Req4 has no test case links
I have a good coverage gap query by creating a Direct Link query for all Requirements then set "linking filters" to Only return items that do not have the specified links.
Desired output (all tests with list of related work items):
|Test1 | Req1 |
|Test2 | Req2, Req3 |
|Test3 | Req3 |
For row #2 I am OK with other separators or even entire list using same separator (.CSV or TAB delimited).
Skip right to answer now if you have a tidy answer. If not then I added considerable RELATED RESEARCH info below to help kick-start an idea that fits the need! Especially since this hasn't been discoverably solved in the last 5 years :-).
RELATED RESEARCH (loooong but may be useful)
1. Visual Studio Queries
Flat Queries should support a list of linked items out-of-the-box... but they do not. The RelatedLinkCount field is handy for knowing if there are any links to chase, but that's it for flat queries.
Direct Link queries give a list of all direct links, but the related IDs are on rows below the parent work item. I am seriously considering creating a formula to look on the next X rows to build a list of IDs, but this would be fragile especially when over 3 requirements are linked to same test. Still might solve 80% of my tracing needs.
Tree Queries also show links, but on different rows. Additionally they tend to follow just one link type. Ideally I will need list of User Requirements linked to Functional Requirements linked to Test Case(s).
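
If the Direct Link query results can be exported as one (source, target) pair per link, collapsing them into the desired one-row-per-test layout is a small grouping step. The sketch below assumes exactly that input shape; links_per_item is a hypothetical helper, it does not call any TFS API, and getting the pairs out of TFS is left open.

```python
from collections import defaultdict


def links_per_item(link_rows, separator=", "):
    """Collapse (source_id, target_id) rows into one row per source."""
    grouped = defaultdict(list)
    for source, target in link_rows:
        if target not in grouped[source]:  # skip duplicate links
            grouped[source].append(target)
    return [(source, separator.join(targets)) for source, targets in grouped.items()]


# The scenarios from the example: Test2 covers two requirements.
rows = [
    ("Test1", "Req1"),
    ("Test2", "Req2"),
    ("Test2", "Req3"),
    ("Test3", "Req3"),
]
for test, reqs in links_per_item(rows):
    print(f"|{test} | {reqs} |")
```

The separator argument covers the note about being OK with other delimiters, for example separator="\t" for tab-delimited output.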
2. Tools / Plug-ins
SmartExcel4TFS (eDEVTech, http://www.modernrequirements.com/smartexcel4tfs/) has 3 reports it supports, but none get me the core data I need in easily used format. At least it is FREE if you have an MSDN Premium subscription.
Requirements to Tests Trace Matrix is super-interesting. Alas, right now I need to go the other way (Requirements linked to a given test case). Also it merges cells and has sub-sections that I think are hard to manipulate. (I may revisit this option though.)
Intersection Traceability Matrix report is WAY too wide for a full 300 x 400 grid :-O.
Work Item Decomposition Matrix also didn't give me desired contents. (though frankly I've forgotten this report layout from when I checked ~1 month ago.)
3. TFS API calls
I have actually avoided this route in favor of native Excel solution... but if I can get an example of Excel VBA code (or other code with link to calling within Excel) I may go this route. At this point I don't have time to dig into rolling my own... but this would be cool assuming performance is acceptable.
Relevant API/code fragments:
Retrieving TFS Results from a Tree Query (Blogs.msdn.com 2012.02.22) - Looks like this would get me the data I need, but it is not in Excel so I'd need a bridge example of some sort calling this within Excel.
Retrieving work items and their linked work items in a single query using the TFS APIs (stackoverflow.com 2012.01.12) - Also looks very promising, but not connected to Excel. Gives hints for 2 level and 3 level nested links and performance consideration (don't make second call for each item returned!)
Retrieving work items using the Team Foundation Server API (pwee167.github.io 2012.09.18) - Excellently written introductory walkthrough blog posting to learn how to build an (ASP.Net MVC3) app that calls TFS APIs to run Flat or Tree queries. Start here if writing C# (which I could do but don't have time/justification unless easy example to integrate with Excel).
How can I query work items and their linked changesets in TFS? (stackoverflow.com 2011.05.10) - I don't need changesets but this has VB code to instantiate new TfsTeamProjectCollection which might work directly in Excel VBA (assuming proper reference is found and added)
var projectCollection = new TfsTeamProjectCollection(
    new Uri("http://localhost:8080/tfs"),
    new UICredentialsProvider());
OK, that's everything I have gathered on this problem. Please help contribute with the missing magic tool/snippet or follow the info above to build that last bit I have not had time to prototype & debug. Many thanks in advance!! -Zephan

is there a mechanism for capturing and comparing mvc-mini-profiler results?

The mvc-mini-profiler is a handy tool. ServiceStack has a forked version for use in services. I was thinking it would be dandy to capture the outputs of runs before and after a code change and compare the results.
I figure the steps are:
log the results to a data store or file instead of returning them in the result
compare the output of the various nodes
show the results side-by-side with diffs highlighted
bonus: configure tolerances for the difference in time spent in different areas, e.g. I may not care if time in sql varies by 300ms.
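
The compare step with tolerances could be sketched roughly as follows, once two runs have been logged. The tree layout used here (dicts with "name", "ms" and optional "children") is an assumed storage format chosen for illustration, not the mini-profiler's own output schema, and flatten and compare_runs are hypothetical helpers.

```python
def flatten(node, prefix=""):
    """Map 'root/child/...' paths to milliseconds for one timing tree."""
    path = f"{prefix}/{node['name']}" if prefix else node["name"]
    timings = {path: node["ms"]}
    for child in node.get("children", []):
        timings.update(flatten(child, path))
    return timings


def compare_runs(before, after, tolerances=None, default_tolerance=0.0):
    """Return {path: (before_ms, after_ms)} where the change exceeds tolerance.

    Steps present in only one run are always reported, with None for the
    missing side.
    """
    tolerances = tolerances or {}
    old_times, new_times = flatten(before), flatten(after)
    diffs = {}
    for path in sorted(set(old_times) | set(new_times)):
        old, new = old_times.get(path), new_times.get(path)
        allowed = tolerances.get(path, default_tolerance)
        if old is None or new is None or abs(new - old) > allowed:
            diffs[path] = (old, new)
    return diffs


before = {"name": "request", "ms": 500, "children": [
    {"name": "sql", "ms": 200}, {"name": "render", "ms": 100}]}
after = {"name": "request", "ms": 900, "children": [
    {"name": "sql", "ms": 450}, {"name": "render", "ms": 250}]}

# Allow sql to drift by up to 300 ms; everything else by 50 ms.
print(compare_runs(before, after, tolerances={"request/sql": 300}, default_tolerance=50))
```

Here the sql step changes by 250 ms and is ignored because of its 300 ms tolerance, while the request and render steps are reported. A side-by-side display could be built on top of the returned dict.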
I did a quick search and didn't see anything.
Thanks,
Drew
