Why do we need padding in seq2seq network - pytorch

To handle sequences of different lengths, I would like to know:
Why do we need to pad the sequences to the same length?
If the answer is "Yes, you need padding", can I put the padding token at another index? For example, if I have an index-to-word mapping like this:
{0:"<s>",1:"<e>",2:"AAA",3:"BBB",.......,500:"zzz"}
where <s> is the starting word of the sentence and <e> is the ending word of the sentence.
Can I assign the padding token to the last index?
{0:"<s>",1:"<e>",2:"AAA",3:"BBB",.......,500:"zzz",501:"<pad>"}

Why do we need to pad the sequences to the same length?
Because basically all layers with parameters perform some form of matrix multiplication (actually: tensor multiplication) at some point in their logic. Now, try it yourself: multiply matrices where not all rows or columns have the same length. E.g. what is this supposed to be?
| 1 2 3 |   | 1 |
| 4 5   | * | 2 | = ???
            | 3 |
It is simply not possible to do this unless you put some value in the gap. Some people may even argue that the thing on the left-hand side is not even a matrix.
Can I set the padding in other indexes? Can I set the padding flag to the last index?
Sure. You can take whatever value you want for padding. Ideally, you should use a value that has no other meaning in the context of your problem and thus cannot be confused with any "real" value.
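As a rough illustration of the point above, here is a minimal sketch in plain Python that pads index sequences to a common length. The name `pad_batch` and the constant `PAD_INDEX = 501` are assumptions made for this example (501 being the "<pad>" entry from the question's vocabulary); they are not part of any library.

```python
PAD_INDEX = 501  # assumed: the "<pad>" entry appended after index 500


def pad_batch(sequences, pad_index=PAD_INDEX):
    """Right-pad every sequence with pad_index up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_index] * (max_len - len(seq)) for seq in sequences]


# Two index sequences of different lengths (0 = "<s>", 1 = "<e>").
batch = [[0, 2, 3, 1], [0, 2, 1]]
padded = pad_batch(batch)  # every row now has length 4
```

In PyTorch itself, `torch.nn.utils.rnn.pad_sequence` accepts a `padding_value` argument, and `nn.Embedding` accepts a `padding_idx` so that the padding row is not updated during training, so any unused index should work.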

Related

Is there something like the SEQUENCE function in Excel in Libreoffice Calc?

In Excel, if I have two columns and I want the first one to be an ID (e.g. ID-1) and the second column to be a string, I can simply copy the sequence down. I can do the same thing in LibreOffice, no problem (like the example below).
ID-1 | String 1
ID-2 | String 2
ID-3 |
But Excel will also allow you to use the SEQUENCE function to only populate the sequence if another cell has a value. This is done like this:
=SEQUENCE(COUNTA(B0:B3))
Then I would get something like below, where the ID-3 entry isn't filled in because COUNTA only counts the non-empty cells in column B.
ID-1 | String 1
ID-2 | String 2
|
I cannot find the sequence function in Calc and I wondered if it exists under another name, or if I need to do some more complex thing with IF statements?

I did it with the following.
=IF(ISBLANK(B1),"",CONCAT("ID-",ROW(A1)))
This is suboptimal compared to the Excel function, though, so I might write a SEQUENCE() function myself to allow for a more generic sequence.

MMULT(MUNIT(SequenceLength);ROW(OFFSET($A$1;0;0;SequenceLength;1)))
or, respectively,
MMULT(COLUMN(OFFSET($A$1;0;0;1;SequenceLength));MUNIT(SequenceLength))
depending on whether you need a vertical or a horizontal vector.
(Replace "SequenceLength" with the required length, of course.)
The problem is that the usual matrix formulas treat any range-reference matrix as relative to the current cell. So I had to find a way to release the connection of a matrix to the cell range. Luckily, MMULT does the trick.

How to detect repeating "sequences of words" across too many texts?

The problem is to detect repeating sequences of words across a large number of text pieces. It is an approximation and efficiency problem, since the data I want to work with is huge. While indexing texts, I want to assign numbers to them if they have parts matching texts that are already indexed.
For example, suppose TextB, which I am indexing now, has a matching part with 2 other texts in the database. I want to assign a number p1 to it.
If that matching part were longer, I want it to be assigned p2 (p2 > p1).
If TextB has a matching part with only 1 other text, then it should get p3 (p3 < p1).
These two parameters (length of the sequence, size of the matching group) would have maximum values, meaning that once these maximum values have been surpassed, the number being assigned stops increasing.
I can think of a way to do this by brute force, but I need efficiency. My boss directed me to learn about NLP and search for solutions there, and I am planning to follow these Stanford video lectures.
But I have doubts about whether that is the right approach, so I wanted to ask your opinion.
Example:
Text 1:"I want to become an artist and travel the world."
Text 2:"I want to become a musician."
Text 3:"travel the world."
Text 4:"She wants to travel the world."
Given these texts, I want to have data that looks like this:
- "I want to become", 2 instances, [1,2]
- "travel the world", 3 instances, [1,3,4]
After having this data, I finally want to carry out this procedure (given the previous data, this may be trivial):
(A matrix called A has some values at necessary indexes. I will determine these after some trials.)
Match groups have numeric values, which they retrieve from matrix A.
Group 1 = A(4,2) % 4 words, 2 instances
Group 2 = A(3,3) % 3 words , 3 instances
Then I will assign each text to have a number, which is the sum of numbers of the groups they are inside of.
My problem is forming this dataset in an efficient manner.
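To make the target dataset concrete, here is a minimal sketch of one possible way to build it: an inverted index from word n-grams to the ids of the texts that contain them. The function name `shared_ngrams` and the crude regex tokenizer are assumptions made for this illustration, and nothing here is tuned for very large collections.

```python
import re
from collections import defaultdict


def shared_ngrams(texts, n):
    """Map each n-word sequence to the sorted ids of the texts containing it,
    keeping only sequences that occur in at least two texts."""
    index = defaultdict(set)
    for text_id, text in texts.items():
        # Crude tokenizer (assumption): lowercase words, punctuation dropped.
        words = re.findall(r"[a-z']+", text.lower())
        for i in range(len(words) - n + 1):
            index[" ".join(words[i:i + n])].add(text_id)
    return {gram: sorted(ids) for gram, ids in index.items() if len(ids) >= 2}


texts = {
    1: "I want to become an artist and travel the world.",
    2: "I want to become a musician.",
    3: "travel the world.",
    4: "She wants to travel the world.",
}
four_word = shared_ngrams(texts, 4)   # contains "i want to become"
three_word = shared_ngrams(texts, 3)  # contains "travel the world"
```

Note that shorter sub-sequences of a longer match (such as "want to become") also show up in the 3-word index; merging them into maximal matches is a separate step that this sketch leaves out.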

Filtering letter combinations

Hi – I’m looking for help for the following problem.
I have a utility that gives me all the combinations for a set of letters (or values). This is in the form of 8 choose n, i.e. there are 8 letters and I can produce all the combinations for sequences of no more than 4 letters. So n can be 2, 3, or 4.
Now here it gets a bit more complex: the 8 letters are made up of three lists or groups, namely A,B,C,D; E1,E2; F1,F2.
As I say, I can get all the 2-, 3- and 4-sequences without a problem. But I need to filter the result so that it only keeps combinations that contain (in the n=2 condition) at least one letter from A,B,C,D and one from either the E set or the F set.
So, as a few examples, where n=2
AE1 or DF2… is ok but AB or E1E2 or E1F1… is not ok
Where n=3 the rules alter slightly but it’s the same principle
ABE1, ABF1, BDF2 or BE2F1… is ok but ABC, ABD, AE1E2, DF1F2 or E1E2F1… is not ok.
Similarly, where n=4
ABE1F1, ABE1F2… is ok but ABCD, ABE1E2, CDF1F2 or E1E2F1F2… is not ok.
I’ve tried a few things using different formulas such as with Match and Countif but can’t quite figure it out. So would be very grateful for any help.
Jon

I've been trying to find an approach to this problem that takes some of the messiness out of it. There are two factors that make this a bit awkward to deal with:
(a) Combination of single letters and bigrams (digrams?)
(b) Possibility of several different letters / bigrams at each position in the string.
It's possible to deal with both of these issues by classifying the letters or bigrams into three groups or classes
(1) Letters A-D - let's call this group L
(2) First pair of bigrams E1 & E2 - let's call this group M
(3) Second pair of bigrams F1 & F2 - let's call this group N.
Then we can make a list of the allowed combinations of groups, which as far as I can work out is something like this:
For N=2
LM
LN
For N=3
LLM
LLN
LMN
For N=4
LLMN
(I don't know if LLLM etc. is allowed but these can be added)
I'm going to make a big assumption that the utility mentioned in OP doesn't generate strings like AAAA or E1E1E1E1 otherwise it would be pretty useless and you would be better off starting from scratch.
So you just need a substitute that looks like this
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"A","L"),"B","L"),"C","L"),"D","L"),"E1","M"),"E2","M"),"F1","N"),"F2","N")
And a lookup in the list of allowed patterns
=ISNUMBER(MATCH(B2,$D$2:$D$10,0))
and filter on the lookup value being TRUE.
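The same substitute-then-lookup idea can be sketched outside the spreadsheet. The snippet below is only an illustration under the assumptions stated in the answer (the group classes L, M, N and the listed allowed patterns); the names `to_pattern` and `is_allowed` are made up for this example, and it assumes the letters in each combination appear in group order, as they do in the question's examples.

```python
import re

# Group classes from the answer: A-D -> L, E1/E2 -> M, F1/F2 -> N.
TOKEN = re.compile(r"[A-D]|[EF][12]")
ALLOWED = {"LM", "LN", "LLM", "LLN", "LMN", "LLMN"}


def to_pattern(combo):
    """Replace every letter or bigram in combo with its group class."""
    def classify(token):
        if token[0] == "E":
            return "M"
        if token[0] == "F":
            return "N"
        return "L"
    return "".join(classify(t) for t in TOKEN.findall(combo))


def is_allowed(combo):
    """True when the combination's class pattern is in the allowed list."""
    return to_pattern(combo) in ALLOWED


candidates = ["AE1", "DF2", "AB", "E1E2", "ABE1", "BE2F1", "DF1F2", "ABE1F1", "ABE1E2"]
kept = [c for c in candidates if is_allowed(c)]
```

Extra patterns such as LLLM can be added to `ALLOWED` if they turn out to be valid.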

Lookup table with variable breakpoints

BACKGROUND
Normally when I am dealing with tables to find values to work with the tables are nicely laid out with something like:
and I rearrange them to make my lookup life easier by converting the table to look like:
I toss in the extra column so that someone maintaining the table in the future will have an easier time updating the table and the formulas that refer to it. The formula I was using for the table above was:
=INDEX(C3:I5,MATCH(MIN(B7:B8),A3:A5,1),MATCH(MAX(B7:B8),C2:I2,1))
Where B7:B8 had the dimensions I was referring to. Right now I am working with the assumption that large dimensions only come in set sizes. I would need to incorporate interpolation if it were permitted and arbitrary sizes were used.
CURRENT ISSUE
Now I just came across a table that is making me really think about what is the best approach to the situation. The initial table looks like:
d/b | Cs | Kls
<=1 | - | 1
>1 | <=10 | 1
>1 | >10 but <Ck | 1-0.3*(Cs/Ck)^4
>1 | >=Ck | (0.70*E05)/(Cs^2*fbu)
and my first go round at rearranging the table is:
d/b | Cs | Kls
0 | 1 | 0 | 0 | 1
1 | 1 | 0 | 10 | 1
1 | 1 | 10 | Ck | 1-0.3*(Cs/Ck)^4
1 | 1 | Ck | 9.99E+101 | (0.70*E05)/(Cs^2*fbu)
So I have two stumbling blocks. The first is that d/b can be any positive number, such as 1.00001 or 0.99999, so the d/b lookup has me a bit concerned for the moment, since in the background example the first column I am checking against holds >= values.
The second stumbling block is that the breakpoint for the second lookup value, Cs, is a variable for two of the breakpoint ranges. The approach I was planning on taking here was to simply calculate the value and pass it to the table when I do the lookup. This approach works great for me when there is only one item to check. I do not think it will work when I have to deal with multiple items to check. Right now the only things I can think of are:
- Each item would require its own table... which kind of defeats the purpose of the table.
- Convert the table to a nested IF function.
It's been a long day for me and the mind is a bit fried. I am wondering if anyone would care to share their insight on the approaches I am thinking of, or has an approach of their own for a table with a variable as a breakpoint?
Ahh fiddle sticks! I just reread the question and I never stated I was trying to pull the formula/result from the third column (i.e. Kls). Told you my mind was fried.
If I wind up going the nested if route I would use something like:
=IF(OR(E50<=1,E57<=10),1,IF(E57<E63,1-0.3*(E57/E63)^4,0.7*G63/(E57^2*D71)))
Where
- E50 is d/b
- E57 is Cs
- E63 is Ck
- G65 is E05
- D71 is fbu
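
For readers who prefer to check the branching outside the spreadsheet, the nested IF above can be sketched as a small function. The function name `kls` and its parameter names are assumptions chosen to mirror the cell legend; the sample values in the usage lines are made up purely to exercise each branch.

```python
def kls(d_over_b, cs, ck, e05, fbu):
    """Kls following the same branches as the nested IF formula."""
    if d_over_b <= 1 or cs <= 10:
        return 1.0
    if cs < ck:
        return 1 - 0.3 * (cs / ck) ** 4
    return 0.7 * e05 / (cs ** 2 * fbu)


# Made-up inputs, one per branch of the table.
stocky = kls(0.5, 20.0, 15.0, 1000.0, 10.0)        # d/b <= 1
intermediate = kls(2.0, 12.0, 24.0, 1000.0, 10.0)  # 10 < Cs < Ck
slender = kls(2.0, 30.0, 20.0, 1000.0, 10.0)       # Cs >= Ck
```

Because Ck is just another argument, each item to check can pass its own value without needing a separate table.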

I think a nested-IF would be better than rearranging the tables. You are losing some values by rearranging the cutoffs.
In your first example, the smaller dimension has >64 but <114.
When you change that, you have 0 to 64, 65 to 114. What about 64.5? This value is not part of your table. Nested IF statements would allow you to keep the original values of the table. Also, you can use a named range for variables which are the breakpoint for the second lookup value Cs.

Transform string from a1b2c3d4 to abcd1234

I am given a string which has numbers and letters. Numbers occupy all odd positions and letters all even positions. I need to transform this string such that all letters move to the front of the array and all numbers to the end.
The relative order of the letters and numbers needs to be preserved.
I need to do this in O(n) time and O(1) space.
e.g. a1b2c3d4 -> abcd1234, x3y4z6 -> xyz346
This previous question has an explanation of the algorithm, but no matter how hard I try, I can't get a hold of it.
I hope someone can explain this to me with an example test case.

The key is to think of the input array as a matrix like this:
a 1
b 2
c 3
d 4
and realize that you want the transpose of this matrix
a b c d
1 2 3 4
Remember, multi-dimensional arrays are really just single-dimensional arrays in disguise so you can do this.
But you need to do this in-place to satisfy the O(1) space requirement. Fortunately, this is a well-known problem complete with several possible approaches.
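To make the transpose view concrete, here is a minimal sketch that follows the permutation cycles of the n x 2 to 2 x n transpose in place. It is only one of the possible approaches and is not claimed to be the one the linked explanation uses: it needs O(1) extra space on a mutable list, but it re-walks cycles to find their smallest index, so it is not guaranteed to run in O(n) time. The names `transpose_in_place` and `letters_first` are made up for this example, and the string wrapper copies the input into a list because Python strings are immutable.

```python
def transpose_in_place(cells):
    """Rearrange [a, 1, b, 2, ...] into [a, b, ..., 1, 2, ...] in place.

    cells is a mutable list of even length, read as an n x 2 row-major matrix.
    """
    total = len(cells)
    n = total // 2

    def dest(i):
        # Row r, column c of the n x 2 matrix lands at row c, column r
        # of the 2 x n matrix, i.e. at index c * n + r.
        return (i % 2) * n + i // 2

    for start in range(total):
        # Walk the cycle; only rotate it when start is its smallest index,
        # so every cycle is handled exactly once without a visited set.
        j = dest(start)
        while j > start:
            j = dest(j)
        if j != start:
            continue
        carry = cells[start]
        j = dest(start)
        while j != start:
            cells[j], carry = carry, cells[j]
            j = dest(j)
        cells[start] = carry
    return cells


def letters_first(text):
    """Convenience wrapper for strings (copies into a list first)."""
    return "".join(transpose_in_place(list(text)))
```

For the test case from the question, index 1 ('1') moves to index 4, index 4 ('c') moves to index 2, and index 2 ('b') moves to index 1, which closes one cycle; the remaining elements form a second cycle in the same way.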
