Google Merchant: Can I specify variants by custom attributes?

On Google's Product Feed Specification, regarding variants, it says:
We define variants as a group of identical products that only differ by the attributes ‘color’, ‘material’, ‘pattern’, or ‘size’.
But what if I have products that actually differ by other attributes? For example, I could have variants that differ by "Color", "Surface", and "Volume". So there could be two variants with the same color but different surfaces and volumes. Would Google Merchant see these as duplicates?
From what it seems, I have no way of specifying variants like that...

variants cannot be specified by custom attributes.
surface might be a material or pattern depending on the item details --
generally, if the surface can be differentiated by sight, use pattern;
if surface can be differentiated by touch or what the item is made of,
use material.
volume can be a size.
the item's size can be complex -- condense the information whenever possible;
the item's pattern can be a graphic or pattern;
simply be certain the value is accurate, can be understood by users, and has
a different combination of variant values for all items in the variant group.
if the information does not apply, leave the value blank (empty).
for example --
size,material,pattern,color,item_group_id
1000 mm x 600 mm x 355 mm,stainless steel,,,gid1
1500 mm x 600 mm x 355 mm,stainless steel,,,gid1
1700 mm x 1200 mm,plastic,hemispheres,green,gid2
1700 mm x 1200 mm,plastic,spheres,green,gid2
42 gallons,steel,,black,gid3
200 mm x 200 mm,vinyl,notre dame,blue/gold,gid4
200 mm x 200 mm,vinyl,yankees,white/blue,gid4
5 mm,,,,gid5
6 mm,,,,gid5
a size value would be all that's needed if size alone is sufficient to
specify a unique combination for all items within that variant group.
each variant group must have an identical item_group_id value
and a unique combination of valid variant values per item --
otherwise, google will classify the items as duplicates, which is
grounds for removal of all items, disapproval, or suspension.
if only color is submitted and the variants are all the same color,
then the variants cannot be submitted without risking all items
being removed, disapproval, or suspension.
variants are not required to be submitted -- alternatively,
simply submit only the one item featured on the landing-page.
also, the website must have a landing-page with a price that matches the variant.
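the uniqueness rule above can be checked before submission. a minimal Python sketch, assuming a feed with exactly the columns of the example (the rows and the helper name find_duplicate_variants are hypothetical, not part of any Google tool):

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical feed snippet; column names follow the example above.
feed = """size,material,pattern,color,item_group_id
1000 mm x 600 mm x 355 mm,stainless steel,,,gid1
1500 mm x 600 mm x 355 mm,stainless steel,,,gid1
1700 mm x 1200 mm,plastic,hemispheres,green,gid2
1700 mm x 1200 mm,plastic,spheres,green,gid2
"""

def find_duplicate_variants(feed_text):
    """Return item_group_ids containing a repeated variant-value combination."""
    seen = defaultdict(set)   # item_group_id -> set of variant tuples
    dupes = set()
    for row in csv.DictReader(StringIO(feed_text)):
        combo = (row["size"], row["material"], row["pattern"], row["color"])
        gid = row["item_group_id"]
        if combo in seen[gid]:
            dupes.add(gid)
        seen[gid].add(combo)
    return dupes

print(find_duplicate_variants(feed))  # -> set(): all combinations are unique
```

any group id this returns would be at risk of being treated as duplicates.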
see also
https://support.google.com/merchants/answer/188494

Related

Application of a custom function to generate iterations across a distance range

Bit of a complex query but I will try to articulate it as best I can:
Essentially I have a list of objects for which I am trying to work out the probability of said items landing at particular points on the seabed, having been dropped at the surface. So there are two steps I need guidance on:
I need to define a custom function, ERF(a,b), where I need to refer to specified values dependent on the Item Number (see in the column names) in order to then use them as multipliers:
These multipliers can be found in a dictionary, Lat_Dev (note please ignore the dataframe column reference as this was a previous attempt at coding a solution, info is now found in a dictionary format).
The function then needs to be repeated for set iterations between a & b with the step size defined as 0.1m. This range is from 0-100m. a is the lower limit and b the upper limit (e.g. a = 0.1m & b = 0.2m). This is done for each column (i.e. Item_Num).
Hopefully the above is clear enough. Cheers,
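The iteration described above can be sketched as follows. Everything here is an assumption filled in for illustration: the Lat_Dev values are made up, and ERF(a, b) is interpreted as the probability mass of a zero-mean normal distribution between depths a and b, with the item's multiplier as its standard deviation:

```python
import math

# Hypothetical stand-ins for the names in the question: Lat_Dev maps each
# Item_Num to its multiplier (lateral deviation), used as the std deviation.
Lat_Dev = {"Item_1": 2.5, "Item_2": 4.0}

def erf_band(a, b, sigma):
    """Probability mass between a and b for a zero-mean normal with std sigma."""
    cdf = lambda x: 0.5 * (1 + math.erf(x / (sigma * math.sqrt(2))))
    return cdf(b) - cdf(a)

step = 0.1          # 0.1 m step size
results = {}
for item, sigma in Lat_Dev.items():
    # 0.1 m bands covering 0-100 m; each entry is that band's probability
    bands = []
    for i in range(int(100 / step)):
        a, b = i * step, (i + 1) * step
        bands.append(erf_band(a, b, sigma))
    results[item] = bands
```

Each item then has 1000 band probabilities; swap erf_band for whatever your actual ERF(a, b) is.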

How to apply a sklearn pipeline to a list of features depending on availability

I have a pandas dataframe with 10 features (e.g., all floats). Given the different characteristics of the features (e.g., mean), the dataframe can be broken into 4 subsets: mean <0, mean within range (0,1), mean within range (1,100), mean >=100
For each subset, a different pipeline will be applied, however, they may not always be available, for example, the dataset might only contain mean <0; or it may contain only mean <0 and mean (1,100); or it may contain all 4 subsets
The question is how to apply the pipelines depending on the availability of the subsets.
The problem is that there will be many different combinations in total:
all 4 subsets exist, only 3 exist, only 2 exist, or only 1 exists.
How can I assign different pipelines depending on the availability of the subsets without using a nested if/else (10 if/elses)?
if subset1 exists:
    make_column_transformer((pipeline1, subset1))
elif subset2 exists:
    make_column_transformer((pipeline2, subset2))
elif subset3 exists:
    make_column_transformer((pipeline3, subset3))
elif subset1 and subset2 exist:
    make_column_transformer((pipeline1, subset1), (pipeline2, subset2))
elif subset3 and subset2 exist:
    make_column_transformer((pipeline3, subset3), (pipeline2, subset2))
elif subset1 and subset3 exist:
    make_column_transformer((pipeline1, subset1), (pipeline3, subset3))
elif subset1 and subset2 and subset3 exist:
    make_column_transformer((pipeline1, subset1), (pipeline2, subset2), (pipeline3, subset3))
Is there a better way to avoid this nested if/else (considering that we might have 10 different subsets)?
The way to apply different transformations to different sets of features is ColumnTransformer [1]. You could then have lists with the column names, which can be filled based on the conditions you want. Then each transformer will take the columns listed in its list, for example cols_mean_lt0 = [...], etc.
Having said that, your approach doesn't look right to me. You probably want to scale the features so they all have the same mean and standard deviation. Depending on the algorithm you'll use, this may or may not be mandatory.
[1] https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
EDIT:
ColumnTransformer takes transformers, each of which is a (name, transformer, columns) tuple. What you want is to have multiple transformers, each of which processes different columns. The columns in the tuple can be indicated by 'string or int, array-like of string or int, slice, boolean mask array or callable'. This is where I suggest you pass a list of columns.
This way, you can have three transformers, one for each of your cases. To indicate which columns each transformer should process, create three lists, one per transformer. Each column will correspond to one of the lists. This is simple to do: in a loop, check each column's mean and append the column name to the list for the corresponding transformer.
Hope this helps!
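The loop described above can be sketched without any if/else ladder. This is a minimal illustration with made-up data and bucket names; in practice the values would be DataFrame columns, and only the non-empty buckets would be turned into ColumnTransformer entries:

```python
from statistics import mean

# Hypothetical data: column name -> values. In real code these would be
# DataFrame columns feeding sklearn's make_column_transformer.
data = {
    "a": [-1.0, -2.0, -3.0],
    "b": [0.2, 0.4, 0.6],
    "c": [5.0, 10.0, 15.0],
}

def bucket_columns(data):
    """Group column names by where their mean falls."""
    buckets = {"lt0": [], "0to1": [], "1to100": [], "ge100": []}
    for name, values in data.items():
        m = mean(values)
        if m < 0:
            buckets["lt0"].append(name)
        elif m < 1:
            buckets["0to1"].append(name)
        elif m < 100:
            buckets["1to100"].append(name)
        else:
            buckets["ge100"].append(name)
    return buckets

buckets = bucket_columns(data)
# Only non-empty buckets become transformers, so no if/else ladder is needed:
# transformers = [(pipelines[k], cols) for k, cols in buckets.items() if cols]
```

With 10 subsets you would just add entries to the buckets dict; the comprehension at the end stays the same.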

How to detect repeating "sequences of words" across too many texts?

The problem is to detect repeating sequences of words across a big number of text pieces. It is an approximation and efficiency problem, since the data I want to work with is huge. I want to assign numbers to texts while indexing them, if they have matching parts with texts that are already indexed.
For example, if a TextB which I am indexing now has a matching part with 2 other texts in the database, I want to assign a number to it, p1.
If that matching part were longer, then I want it to assign p2 (p2 > p1).
If TextB has a matching part with only 1 other text, then it should give p3 (p3 < p1).
These two parameters (length of the sequence, size of the matching group) would have maximum values; once those maxima are surpassed, the number being assigned stops increasing.
I can think of a way to do this by brute force, but I need efficiency. My boss directed me to learn about NLP and search for solutions there, and I am planning to follow these Stanford video lectures.
But I am having doubts about whether that is the right approach, so I wanted to ask your opinion.
Example:
Text 1:"I want to become an artist and travel the world."
Text 2:"I want to become a musician."
Text 3:"travel the world."
Text 4:"She wants to travel the world."
Having these texts, I want data that looks like this:
-"I want to become" , 2 instances , [1,2]
-"travel the world" , 3 instances , [1,3,4]
After having this data, finally, I want to do this procedure (given the previous data, this may be trivial):
(A matrix called A has some values at necessary indexes. I will determine these after some trials.)
Match groups have numeric values, which they retrieve from matrix A.
Group 1 = A(4,2) % 4 words, 2 instances
Group 2 = A(3,3) % 3 words , 3 instances
Then I will assign each text to have a number, which is the sum of numbers of the groups they are inside of.
My problem is forming this dataset in an efficient manner.
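The dataset in the example (sequence, instance count, text ids) can be built with an inverted index of word n-grams, which is one pass per text rather than all-pairs comparison. A minimal sketch over the four example texts, for a fixed sequence length of 3 words (for huge data you would hash the n-grams or move to techniques like shingling/MinHash, which this does not show):

```python
import re
from collections import defaultdict

texts = {
    1: "I want to become an artist and travel the world.",
    2: "I want to become a musician.",
    3: "travel the world.",
    4: "She wants to travel the world.",
}

def ngram_index(texts, n):
    """Map every n-word sequence to the set of text ids containing it."""
    index = defaultdict(set)
    for text_id, text in texts.items():
        words = re.findall(r"[a-z']+", text.lower())
        for i in range(len(words) - n + 1):
            index[tuple(words[i : i + n])].add(text_id)
    return index

# Sequences shared by at least two texts, as in the example above:
shared = {k: v for k, v in ngram_index(texts, 3).items() if len(v) >= 2}
```

Repeating this for each n up to your maximum sequence length gives both parameters (sequence length and matching-group size) needed to look up values in matrix A.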

List of items find almost duplicates

Within excel I have a list of artists, songs, edition.
This list contains over 15000 records.
The problem is the list does contain some "duplicate" records. I say "duplicate" because they aren't exact matches. Some have a few typos, and I'd like to fix this up and remove those records.
So for example some records:
ABBA - Mamma Mia - Party
ABBA - Mama Mia! - Official
Each dash indicates a separate column (so 3 columns A, B, C are filled in)
How would I mark them as duplicates within Excel?
I've found out about the tool Fuzzy Lookup, yet I'm working on a Mac, and since it's not available on Mac I'm stuck.
Any regex magic or VBA script that can help me out?
It'd also be alright to see how similar each row is (say 80% similar).
One of the common methods for fuzzy text matching is the Levenshtein (distance) algorithm. Several nice implementations of this exist here:
https://stackoverflow.com/a/4243652/1278553
From there, you can use the function directly in your spreadsheet to find similarities between instances.
You didn't ask, but a database would be really nice here. The reason is you can do a cartesian join (one of the very few valid uses for this) and compare every single record against every other record. For example:
select
s1.group, s2.group, s1.song, s2.song,
levenshtein (s1.group, s2.group) as group_match,
levenshtein (s1.song, s2.song) as song_match
from
songs s1
cross join songs s2
order by
group_match, song_match
Yes, this would be a very costly query, depending on the number of records (in your example 225,000,000 rows), but it would bubble the most likely duplicates/matches to the top. Not only that, but you can incorporate "reasonable" joins to eliminate obvious mismatches, for example limit it to cases where the group matches, nearly matches, or begins with the same letter, or pre-filter out pairs where the Levenshtein distance is greater than x.
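If neither a database nor VBA is available, the same pairwise comparison can be sketched outside Excel. A minimal Python example using the standard library's difflib (SequenceMatcher's ratio is a similarity score in [0, 1], related to but not identical to Levenshtein distance; the row strings here are hypothetical):

```python
from difflib import SequenceMatcher

# Hypothetical rows exported from the sheet, joined as "artist - song - edition".
rows = [
    "ABBA - Mamma Mia - Party",
    "ABBA - Mama Mia! - Official",
    "Queen - Bohemian Rhapsody - Live",
]

def near_duplicates(rows, threshold=0.8):
    """Return (i, j, ratio) for row pairs whose similarity meets the threshold."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            ratio = SequenceMatcher(None, rows[i].lower(), rows[j].lower()).ratio()
            if ratio >= threshold:
                pairs.append((i, j, round(ratio, 2)))
    return pairs
```

The right threshold depends on the data; typo-level duplicates like the ABBA pair may score somewhat below 0.8, so it is worth printing the ratios first and tuning from there.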
You could use an array formula, to indicate the duplicates, and you could modify the below to show the row numbers, this checks the rows beneath the entry for any possible 80% dupes, where 80% is taken as left to right, not total comparison. My data is a1:a15000
=IF(NOT(ISERROR(FIND(MID($A1,1,INT(LEN($A1)*0.8)),$A2:$A$15000))),1,0)
This way will also look back up the list, to indicate the ones found
=SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A1)*0.8)),$A3:$A$15000,1)),0,1))+SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A2)*0.8)),$A$1:$A1,1)),0,1))
The first entry i.e. row 1 is the first part of the formula, and the last row will need the last part after the +
Try this worksheet function in your loop:
=COUNTIF(Range,"*yourtexttofind*")

Search selection

For a C# program that I am writing, I need to compare similarities in two entities (can be documents, animals, or almost anything).
Based on certain properties, I calculate the similarities between the documents (or entities).
I put their similarities in a table as below
    X     Y     Z
A | 0.6 | 0.5 | 0.4
B | 0.6 | 0.4 | 0.2
C | 0.6 | 0.3 | 0.6
I want to find the best matching pairs (eg: AX, BY, CZ) based on the highest similarity score. High score indicates the higher similarity.
My problem arises when there is a tie between similarity values. For example, AX and CZ both have 0.6. How do I decide which pairs to select? Are there any procedures/theories for this kind of problem?
Thanks.
In general, tie-breaking methods are going to depend on the context of the problem. In some cases, you want to report all the tying results. In other situations, you can use an arbitrary means of selection such as which one is alphabetically first. Finally, you may choose to have a secondary characteristic which is only evaluated in the case of a tie in the primary characteristic.
Additionally, you can always report one or more and then alert the user that there was a tie to allow him or her to decide for him- or herself.
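One principled way to sidestep per-pair ties is to optimize the pairing as a whole rather than greedily: this is the assignment problem. A brute-force sketch over the table in the question (fine for tiny tables; for larger ones use the Hungarian algorithm, e.g. scipy.optimize.linear_sum_assignment, rather than permutations):

```python
from itertools import permutations

# Similarity table from the question: rows A, B, C vs columns X, Y, Z.
rows, cols = ["A", "B", "C"], ["X", "Y", "Z"]
sim = [
    [0.6, 0.5, 0.4],
    [0.6, 0.4, 0.2],
    [0.6, 0.3, 0.6],
]

def best_pairing(sim):
    """Brute-force the one-to-one pairing that maximizes total similarity."""
    n = len(sim)
    best_total, best_perm = -1.0, None
    for perm in permutations(range(n)):
        total = sum(sim[r][c] for r, c in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm

total, perm = best_pairing(sim)
pairs = [(rows[r], cols[c]) for r, c in enumerate(perm)]
# Note: greedily taking the single highest score first (e.g. AX = 0.6) is not
# optimal here; the best overall pairing is AY, BX, CZ with total 1.7.
```

If two complete pairings still tie on the total, the secondary-characteristic or report-all strategies above apply to the totals instead of the individual cells.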
In this case, the similarities you should be looking for are:
- Value
- Row
- Column
Objects which have any of the above in common are "similar". You could assign a weighting to each property, so that objects which have the same value are more similar than objects which are in the same column. Also, objects which have the same value and are in the same column are more similar than objects with just the same value.
Depending on whether there are any natural ranges occurring in your data, you could also consider comparing ranges. For example two numbers in the range 0-0.5 might be somewhat similar.