How can the FrozenLake OpenAI-Gym environment be solved with no intermediate rewards? - openai-gym

I'm looking at the FrozenLake environments in openai-gym. In both of them, there are no rewards, not even negative rewards, until the agent reaches the goal. Even if the agent falls through the ice, there is no negative reward -- although the episode ends. Without rewards, there is nothing to learn! Each episode starts from scratch with no benefit from previous episodes.
This should be a simple breadth-first search. It doesn't need RL. But assuming you use RL, one approach would be a reward of -1 for a step to a frozen square (that isn't the goal) and a reward of -10 for a step into a hole. The -1 would allow the agent to learn not to repeat squares. The -10 would allow the agent to learn to avoid the holes. So I'm tempted to create my own negative rewards on the agent side. This would make it more like the CliffWalking environment.
What am I missing? How would RL solve this (except via random search) with no rewards?

The problem you are describing is often answered with Reward Shaping.
Like the frozen lake environment or Montezuma's Revenge, some problems have very sparse rewards. This means that any RL agent must spend a long time exploring the environment before it sees those rewards, which can be very frustrating for the humans who designed the task. So, as in the frozen lake environment, people often add extra information like you have suggested. This makes the reward function denser and (sometimes) allows for faster learning, provided the modified reward function actually reflects what the human wants the agent to do.
In order for the agent to solve these kinds of problems faster than random search and without human intervention, such as reward shaping or giving the agent a video of an expert playing the game, the agent needs some mechanism to explore the space in an intelligent way.
Some current research areas on this topic are Intrinsic Motivation, Curiosity, and Options and Option discovery.
Although promising, these research areas are still in their infancy, and sometimes it's just easier to say:
if agent_is_in_a_hole:
    return -10
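If you do decide to shape the reward on the agent side, a gym wrapper keeps the environment itself untouched. Below is a minimal sketch assuming the classic gym step() API that returns (observation, reward, done, info); the -1/-10 penalties are the illustrative values from the question, not anything built into FrozenLake:
import gym

class ShapedFrozenLake(gym.Wrapper):
    """Reward-shaping sketch: -1 per step on ice, -10 for ending the episode in a hole."""
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done and reward == 0:
            reward = -10.0  # episode ended without the goal reward: treat as a hole
        elif reward == 0:
            reward = -1.0   # ordinary frozen square: small step penalty
        return obs, reward, done, info

env = ShapedFrozenLake(gym.make("FrozenLake-v0"))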

I think the objective of this environment is to discover ways to balance exploration vs. exploitation, and that reward manipulation is neither required nor desirable.
Now, if you try to run Q-learning on the 8x8 environment, you may find that it does not converge.
The fix for this was given by JKCooper on the OpenAI forum. You can check out this page and scroll all the way to the bottom to see the comment: https://gym.openai.com/evaluations/eval_xSOlwrBsQDqUW7y6lJOevQ
There, he introduces the concept of an average terminal reward, which is then used to calibrate/tune the exploration.
At the beginning, the average terminal reward is undefined or null. On the very first "done" iteration, this variable is updated with the value of that reward.
On each subsequent iteration, if the current reward is greater than the existing value of the average terminal reward, then the epsilon value is "decayed", i.e. exploration is discouraged and exploitation is encouraged gradually.
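In rough pseudocode, the schedule described above might look like the sketch below (a hedged reconstruction, not JKCooper's actual code; the decay and averaging constants are assumptions):
avg_terminal_reward = None   # undefined until the first episode finishes
epsilon = 1.0                # exploration rate

def on_episode_end(terminal_reward, decay=0.995, mix=0.05):
    """Decay epsilon only when an episode ends better than the running average."""
    global avg_terminal_reward, epsilon
    if avg_terminal_reward is None:
        avg_terminal_reward = terminal_reward          # very first "done" iteration
        return
    if terminal_reward > avg_terminal_reward:
        epsilon *= decay                               # discourage exploration gradually
    avg_terminal_reward = (1 - mix) * avg_terminal_reward + mix * terminal_reward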
Using this technique, you can see that Q-learning converges.
The modified version on OpenAI is here (v0.0.2): https://gym.openai.com/evaluations/eval_FVrk7LAVS3zNHzzvissRQ/

Related

Are transformer-based language models overfitting on the paraphrase identification task? What tools overcome this?

I've been working on a sentence transformation task that involves paraphrase identification as a critical step: if we are confident enough that the state of the program (a sentence repeatedly modified) has become a paraphrase of a target sentence, stop transforming. The overall goal is actually to study potential reasoning in predictive models that can generate language prior to a target sentence. The approach is just one specific way of reaching that goal. Nevertheless, I've become interested in the paraphrase identification task itself, as it's received some boost from language models recently.
The problem I run into is when I manipulate sentences from examples or datasets. For example, in this HuggingFace example, if I negate either sequence or change the subject to Bloomberg, I still get a majority "is paraphrase" prediction. I started going through many examples in the MSRPC training set and negating one sentence in a positive example or making one sentence in a negative example a paraphrase of the other, especially when doing so would be a few word edit. I found to my surprise that various language models, like bert-base-cased-finetuned-mrpc and textattack/roberta-base-MRPC, don't change their confidences much on these sorts of changes. It's surprising as these models claim an f1 score of 0.918+. The dataset is clearly missing a focus on negative examples and small perturbative examples.
My question is, are there datasets, techniques, or models that deal well when given small edits? I know that this is an extremely generic question, much more than is typically asked on StackOverflow, but my concern is in finding practical tools. If there is a theoretical technique, then it might not be suitable as I'm in the category of "available tools define your approach" rather than vice-versa. So I hope that the community would have a recommendation on this.
Short answer to the question: yes, they are overfitting. Most of the important NLP data sets are not actually well-crafted enough to test what they claim to test, and instead test the ability of the model to find subtle (and not-so-subtle) patterns in the data.
The best tool I know for creating datasets that help deal with this is CheckList. The corresponding paper, "Beyond Accuracy: Behavioral Testing of NLP models with CheckList", is very readable and goes into depth on this type of issue. They have a very relevant table, but first some terms from the paper:
We prompt users to evaluate each capability with three different test types (when possible): Minimum Functionality tests, Invariance, and Directional Expectation tests... A Minimum Functionality test (MFT), is a collection of simple examples (and labels) to check a behavior within a capability. MFTs are similar to creating small and focused testing datasets, and are particularly useful for detecting when models use shortcuts to handle complex inputs without actually mastering the capability.
...An Invariance test (INV) is when we apply label-preserving perturbations to inputs and expect the model prediction to remain the same. A Directional Expectation test (DIR) is similar, except that the label is expected to change in a certain way. For example, we expect that sentiment will not become more positive if we add "You are lame." to the end of tweets directed at an airline (Figure 1C).
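To make the INV/DIR idea concrete for the paraphrase setting in the question, here is a small hedged probe (not part of CheckList itself) that compares an MRPC-finetuned model's "is paraphrase" confidence before and after a negation edit, assuming label index 1 is the "equivalent" class as in the MRPC checkpoints:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-cased-finetuned-mrpc"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def paraphrase_prob(sent_a, sent_b):
    """Probability assigned to the 'equivalent' (paraphrase) class."""
    inputs = tokenizer(sent_a, sent_b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

target   = "The company broadened its coverage of corporate bonds."
original = "Bloomberg expanded its coverage of corporate bonds."
negated  = "Bloomberg did not expand its coverage of corporate bonds."

print(paraphrase_prob(original, target))  # expected: high confidence
print(paraphrase_prob(negated, target))   # a robust model should drop sharply here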
I haven't been actively involved in NLG for long, so this answer will be a bit more anecdotal than SO's algorithms would like. To start with, in my corner of Europe the general sentiment is that peer-review requirements for any kind of NLG project are higher by several orders of magnitude than in other sciences, and likely not without reason.
This makes funding a bigger challenge, so wherever you are, I wish you luck on that front. I'm not sure of how big of a deal this site is in the niche, but [Ehud Reiter's Blog][1] is where I would start looking into your tooling ideas.
Maybe even reach out to them/him personally, because I can't think of another source that has an academic background and a strong propensity for practical applications of NLG, at least based on the kind of content they've been putting out over the years.
Your background, environment/funding, and the seniority level/control you have over the project will eventually make the decision for you. That's just how it goes on the bleeding edge of anything. What I will add, though, is not to limit yourself to a single language or technology in this phase, for precisely the reasons you've mentioned. I'd recommend the same in terms of potential open-source involvement, but if your profile information is accurate, that probably won't happen, no matter what you do and accomplish.
But yeah, in the grand scheme of things, your question is far from too broad, in my view. It identifies a rather unmistakable problem pattern that other branches of science approach far less lackadaisically than NLG-adjacent fields seem to right now. In that regard, it's not broad enough, and will need to be promulgated far and wide before community-driven tooling gives you serious options on a micro level.
Blasphemy, sure, but the performance odds are already stacked against you. As for the question potentially being too broad, I'd posit it is not broad enough, so long as we collectively remain in an "oh, I was waiting for you to start doing something about it" phase.
P.S. I'd eliminate any Rust and ECMAScript alternatives prior to looking into Python, blasphemous as this might sound to a 2021 data scientist, if only for performance reasons.
[1]: https://ehudreiter.com/2016/12/18/nlg-vs-templates/

Testing for heteroskedasticity and autocorrelation in large unbalanced panel data

I want to test for heteroskedasticity and autocorrelation in a large unbalanced panel dataset.
I do so using the following code:
* Heteroskedasticity test
// iterated GLS with only heteroskedasticity produces
// maximum-likelihood parameter estimates
xtgls adjusted_volume ibn.rounded_time i.id i.TRD_EVENT_DT, igls panels(heteroskedastic)
estimates store hetero
* Autocorrelation
findit xtserial
net sj 3-2 st0039
net install st0039
xtserial adjusted_volume ibn.rounded_time i.id i.TRD_EVENT_DT
Although I use the computing power of a high-performance computing center, because of the iteration method this procedure takes more than 15 hours.
What is the most efficient program to perform these tests using Stata?
This question is borderline off-topic and quite broad, but I suspect it is still of considerable interest to new users. As such, here I will try to consolidate our conversation in the comments into an answer.
I strongly advise you in the future to refrain from using highly subjective words such as 'best', which can mean different things to different people, or terms like 'efficient', which can have a different meaning in a different context. It is also difficult to provide specific advice regarding the use of commands when we know nothing about what you are trying to do.
In my view, the 'best' choice is the choice that gets the job done as accurately as possible given the available data. Speed is an important consideration nowadays, but accuracy is still the most fundamental one. As you continue to use Stata, you will see that it has a considerable number of commands, often with overlapping functionality. Depending on the use case, sometimes opting for one implementation over another can be 'better', in the sense that it may be more practical or faster in achieving the desired end result.
Case in point: your comment on your previous post that the noconstant option is unavailable in rreg. In that particular context you can get a reasonably good alternative using regress with the vce(robust) option. In fact, this alternative may often be adequate for several use cases.
In this particular example, xtgls will be considerably faster if the igls option is not used. This will be especially true with larger and more 'difficult' datasets. In cases where MLE is necessary, the iterate option will allow you to specify a fixed number of iterations, which could speed things up, but it can be a recipe for disaster if you don't know what you are doing and is thus not recommended; this option is usually used for other purposes. However, is xtgls the only command you could use? Read here why this may in fact not necessarily be the case.
Regarding speed, Stata is in general slow, at least when the ado language is used, because it is an interpreted language. The only realistic option for speed gains here is parallelisation, if you have Stata MP. Even then, whether any gains are achieved will depend on a number of factors, including which command you use.
Finally, xtserial is a community-contributed command, something which you fail to make clear in your question. It is customary and useful to provide this information right from the start, so others know that you do not refer to an official, built-in command.

If I interrupt sklearn grid_search.fit() before completion can I access the current .best_score_, .best_params_?

If I interrupt grid_search.fit() before completion, will I lose everything it's done so far?
I got a little carried away with my grid search and provided an obscenely large search space. I can see scores that I'm happy with already, but my stdout doesn't display which params led to those scores.
I've searched the docs: http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html
And there is a discussion from a couple years ago about adding a feature for parallel search here: https://sourceforge.net/p/scikit-learn/mailman/message/31036457/
But nothing definitive. My search has been running for ~48 hrs, so I don't want to lose what's been discovered, but I also don't want to continue.
Thanks!
Welcome to SO!
To my understanding, there aren't any intermediate variables returned from the grid_search function, only the resulting grid and the scores (see grid_search.py for more information).
So if you cancel it, you might lose the work that's been done so far.
But a bit of advice: 48 hours is a long time (obviously this depends on the rows, columns, and number of hyperparameters being tuned). You might want to start with a broader grid search first and then refine your parameter search from that (a sketch follows the list below).
That will benefit you two ways:
Run time might end up being much shorter (see caveats above), meaning you don't have to wait so long and risk losing results.
You might find that your model prediction score is only impacted by one or two hyperparameters, letting you keep the other searches broad and focusing your efforts on the parameters that influence your prediction accuracy most.
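As a rough illustration of the coarse-then-fine idea (using the newer sklearn.model_selection import path and a toy SVC/dataset rather than your estimator):
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Pass 1: cheap, wide sweep over orders of magnitude.
coarse = GridSearchCV(SVC(), {"C": [0.01, 1, 100], "gamma": [0.001, 0.1, 10]}, cv=3)
coarse.fit(X, y)
print(coarse.best_params_, coarse.best_score_)

# Pass 2: a narrow grid around whatever the coarse pass picked
# (the values here assume it chose C=1, gamma=0.1).
fine = GridSearchCV(SVC(), {"C": [0.3, 1, 3], "gamma": [0.03, 0.1, 0.3]}, cv=3)
fine.fit(X, y)
print(fine.best_params_, fine.best_score_)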
Hopefully by the time I've written this response your grid search has completed!!

Effect of randomness on search results

I am currently working on a search ranking algorithm that will be applied to Elasticsearch queries (domain: e-commerce). It assigns scores to the entities returned and finally sorts them by the assigned score.
My question is: has anyone ever tried to introduce a certain level of randomness into a search algorithm and experienced a positive effect from it? I am thinking that it might be useful for reducing bias and promoting lower-ranking items, giving them a chance to be seen more easily and become popular if they deserve it. I know that some machine learning algorithms introduce randomization to reduce bias, so I thought it might apply to search as well.
The closest I can get here is this, but it is not exactly what I am hoping to get answers for:
Randomness in Artificial Intelligence & Machine Learning
I don't see this mentioned in your post... Elasticsearch offers a random scoring feature: https://www.elastic.co/guide/en/elasticsearch/guide/master/random-scoring.html
As the owner of the website, you want to give your advertisers as much exposure as possible. With the current query, results with the same _score would be returned in the same order every time. It would be good to introduce some randomness here, to ensure that all documents in a single score level get a similar amount of exposure.
We want every user to see a different random order, but we want the same user to see the same order when clicking on page 2, 3, and so forth. This is what is meant by consistently random.
The random_score function, which outputs a number between 0 and 1, will produce consistently random results when it is provided with the same seed value, such as a user’s session ID
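A sketch of what that looks like as a query body (the field names and seed value below are made up; note that newer Elasticsearch versions also require a "field" such as _seq_no alongside the seed):
# Request body for a function_score query with consistently-random scoring,
# seeded per user session so each user sees a stable but distinct ordering.
query_body = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "running shoes"}},
            "functions": [{"random_score": {"seed": "user-session-42"}}],
            "boost_mode": "sum"   # add the random component to the relevance score
        }
    }
}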
Your intuition is right: randomization can help surface results that get a lower score than they deserve due to uncertainty in the estimation. Empirically, Google search ads seem to have been randomized at times; e.g. this paper hints at it (see Section 6).
This problem is an instance of a class of problems known as explore/exploit or multi-armed bandit problems; see e.g. http://en.wikipedia.org/wiki/Multi-armed_bandit. There is a large body of mathematical theory and algorithmic approaches. A general idea is to not always order by the expected "best" utility, but by an optimistic estimate that takes the degree of uncertainty into account. A readable, motivating blog post can be found here.
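As a minimal, self-contained illustration of that "optimistic estimate" idea, here is a UCB1-style ranking sketch with made-up click-through counts (the exploration constant is an assumption):
import math

def ucb_score(mean_reward, n_shown, n_total, c=1.4):
    """Mean reward plus an uncertainty bonus; never-shown items rank first."""
    if n_shown == 0:
        return float("inf")
    return mean_reward + c * math.sqrt(math.log(n_total) / n_shown)

items = {"A": (0.12, 900), "B": (0.10, 50), "C": (0.0, 0)}  # (mean CTR, impressions)
n_total = sum(n for _, n in items.values())
ranking = sorted(items, key=lambda k: ucb_score(*items[k], n_total), reverse=True)
print(ranking)  # C (never shown) first; B outranks A despite the lower mean CTR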

How do you measure if an interface change improved or reduced usability?

For an ecommerce website how do you measure if a change to your site actually improved usability? What kind of measurements should you gather and how would you set up a framework for making this testing part of development?
Multivariate testing and reporting is a great way to actually measure these kind of things.
It allows you to test what combination of page elements has the greatest conversion rate, providing continual improvement on your site design and usability.
Google Web Optimiser has support for this.
Use methods similar to those you used to identify the usability problems to begin with: usability testing. Typically you identify your use cases and then run a lab study evaluating how users go about accomplishing certain goals. Lab testing is typically good with 8-10 people.
A more informative methodology we have adopted to understand our users is anonymous data collection (you may need user permission; make your privacy policies clear, etc.). This simply means evaluating which buttons/navigation menus users click on, and how users delete something (e.g. when changing a quantity, are more users entering 0 and updating the quantity, or hitting X?). This is a bit more complex to set up: you have to develop an infrastructure to hold this data (which is actually just counters, i.e. "Times clicked X: 138838383, Times entered 0: 390393") and allow data points to be created as needed to plug into the design.
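The counter infrastructure can be as simple as named tallies keyed by event name; a hypothetical sketch:
from collections import Counter

ui_events = Counter()   # e.g. {"entered_zero_quantity": 390393, "clicked_x": 138838383}

def record(event_name):
    """Increment a named counter; new data points are created on first use."""
    ui_events[event_name] += 1

record("entered_zero_quantity")
record("clicked_update_quantity")
print(ui_events.most_common())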
To push the measurement of an improvement of a UI change up the stream from end-user (where the data gathering could take a while) to design or implementation, some simple heuristics can be used:
Is the number of actions it takes to perform a scenario less? (If yes, then it has improved). Measurement: # of steps reduced / added.
Does the change reduce the number of kinds of input devices to use (even if the # of steps is the same)? By this, I mean if you take something that relied on both the mouse and keyboard and changed it to rely only on the mouse or only on the keyboard, then you have improved usability. Measurement: Change in # of devices used.
Does the change make different parts of the website consistent? E.g. if one part of the e-commerce site loses changes made while you are not logged on and another part does not, this is inconsistent. Changing it so that they have the same behavior improves usability (preferably toward the more fault-tolerant behavior!). Measurement: Make a graph (a flow chart, really) mapping the ways a particular action could be done. Improvement is a reduction in the # of edges on the graph.
And so on... find some general UI tips, figure out some metrics like the above, and you can approximate usability improvement.
Once you have these design approximations of user improvement, and then gather longer term data, you can see if there is any predictive ability for the design-level usability improvements to the end-user reaction (like: Over the last 10 projects, we've seen an average of 1% quicker scenarios for each action removed, with a range of 0.25% and standard dev of 0.32%).
The first way can be fully subjective or partly quantified: user complaints and positive feedback. The problem with this is that you may have some strong biases when it comes to filtering that feedback, so you had better make it as quantitative as possible. Having a ticketing system to file every report from the users and gathering statistics about each version of the interface might be useful. Just get your statistics right.
The second way is to measure the difference in a questionnaire taken about the interface by end-users. Answers to each question should be a set of discrete values and then again you can gather statistics for each version of the interface.
The latter way may be much harder to set up (designing a questionnaire, possibly a controlled environment for it, and the guidelines to interpret the results is a craft in itself), but the former makes it unpleasantly easy to mess up the measurements. For example, you have to consider the fact that the number of tickets you get for each version depends on how long it is used, and that all time ranges are not equal (e.g. a whole class of critical issues may never be discovered before the third or fourth week of usage, or users might tend not to file tickets during the first days of use, even if they find issues, etc.).
Torial stole my answer. Although, if there is a measure of how long it takes to do a certain task: if the time is reduced and the task is still completed, then that's a good thing.
Also, if there is a way to record the number of cancels, then that would work too.
