Un-nesting nested tuples to single terms - nested

I have written a UDF (extends EvalFunc<Tuple>) whose output is a tuple containing nested inner tuples.
For example the dump looks like:
(((photo,photos,photo)))
(((wedg,wedge),(audusd,audusd)))
(((quantum,quantum),(mind,mind)))
(((cassi,cassie),(cancion,canciones)))
(((calda,caldas),(nova,novas),(rodada,rodada)))
(((fingerprint,fingerprint),(craft,craft),(easter,easter)))
Now I want to process each of these terms, make them distinct, and give each one an id (RANK). To do this, I need to get rid of the brackets. A simple FLATTEN does not help in this case.
The final output should be like:
1 photo
2 photos
3 wedg
4 wedge
5 audusd
6 quantum
7 mind
....
My code (not the udf part and not the raw parsing):
tags = FOREACH raw GENERATE FLATTEN(tags) AS tag;
tags_distinct = DISTINCT tags;
tags_sorted = RANK tags_distinct BY tag;
DUMP tags_sorted;

I think your UDF's return type is not optimal for your workflow. Instead of returning a tuple with a variable number of fields (which are themselves tuples), it would be much more convenient to return a bag of tuples.
Instead of
(((wedg,wedge),(audusd,audusd)))
you will have
({(wedg,wedge),(audusd,audusd)})
and you will be able to FLATTEN that bag to:
1. make the DISTINCT
2. RANK the tags
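To do so, update your UDF like this:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
public class MyUDF extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        DataBag bag = BagFactory.getInstance().newDefaultBag(); // create DataBag, then add one tuple per term pair
        return bag;
    }
}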


Isolate lines that dont exist in txt file

I have two text files that list camera models. Not all models in one file are present in the other, so I want to find the missing models. One issue, though: some models have extra strings in their names, e.g.:
NIKON D610
D610
CANON POWERSHOT A1200
POWERSHOT A1200
"Nikon" and "Canon" do not appear in one of the files.
I've been scratching my head over this for two days.
First, this answer requires some assumptions:
Two strings describing the same model differ only in whether or not they contain a manufacturer string.
It is feasible to make a list of all possible manufacturer strings.
If these two assumptions are satisfied, one can ignore every string that is part of the manufacturer string list while comparing two model strings. This way only the rest of the model string is evaluated.
Here is an example in C#. The local strings aClean and bClean are used to not mess up the original strings.
List<string> manufacturers; // List of all possible manufacturer strings
List<string> modelsA; // List of all model strings from file A
List<string> modelsB; // List of all model strings from file B
foreach (string a in modelsA)
{
// Remove manufacturer name and spaces
string aClean = RemoveManufacturer(a).Replace(" ", "");
foreach (string b in modelsB)
{
// Remove manufacturer name and spaces
string bClean = RemoveManufacturer(b).Replace(" ", "");
// Now compare and process the strings.
// Store original strings a or b if required
...
}
}
string RemoveManufacturer(string model)
{
foreach (string manufacturer in manufacturers)
{
// Remove manufacturer from model if possible.
// Strings are immutable, so the result of Replace must be assigned back.
model = model.Replace(manufacturer, "");
}
return model;
}
This code is far from optimized. But it seems that your use case is not exactly performance sensitive anyways.
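As a rough sketch of the same idea in Java (the answer's own example is C#), the snippet below strips the manufacturer names and then reports the models from one file whose cleaned name does not appear in the other. The class name MissingModels, the method names, the upper-casing step, and the sample entry NIKON D750 are assumptions made for illustration; none of them come from the question.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class MissingModels {
    // Strip every known manufacturer name, then drop spaces.
    // Upper-casing first is an added assumption so that "Nikon" and "NIKON" match.
    static String clean(String model, List<String> manufacturers) {
        String result = model.toUpperCase(Locale.ROOT);
        for (String manufacturer : manufacturers) {
            result = result.replace(manufacturer.toUpperCase(Locale.ROOT), "");
        }
        return result.replace(" ", "");
    }

    // Return the entries of modelsA whose cleaned form does not occur in modelsB.
    static List<String> missingFrom(List<String> modelsA, List<String> modelsB,
                                    List<String> manufacturers) {
        Set<String> cleanedB = new HashSet<>();
        for (String b : modelsB) {
            cleanedB.add(clean(b, manufacturers));
        }
        List<String> missing = new ArrayList<>();
        for (String a : modelsA) {
            if (!cleanedB.contains(clean(a, manufacturers))) {
                missing.add(a); // keep the original, uncleaned string
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> manufacturers = List.of("NIKON", "CANON");
        // "NIKON D750" is a made-up entry that exists only in file A.
        List<String> fileA = List.of("NIKON D610", "CANON POWERSHOT A1200", "NIKON D750");
        List<String> fileB = List.of("D610", "POWERSHOT A1200");
        System.out.println(missingFrom(fileA, fileB, manufacturers)); // prints [NIKON D750]
    }
}
```

Collecting the cleaned names of one file into a set replaces the nested loop with a single lookup per model. To find models missing in the other direction, call missingFrom with the two lists swapped.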

JOOQ: fetchGroups() always returns list with only one element

I'm new to JOOQ and currently fail to map a joined query to Map<K, List<V>>: the list always only contains one element.
Here's my code:
DSL.using(...)
.select(ORDER.fields())
.select(ORDER_ITEM_ARTICLE.fields())
.from(ORDER)
.leftOuterJoin(ORDER_ITEM_ARTICLE).on(ORDER.ID.eq(ORDER_ITEM_ARTICLE.ORDER_ID))
// to Map<InOutOrder, List<OrderItemArticle>>
.fetchGroups(
r -> r.into(ORDER).into(InOutOrder.class),
r -> r.into(ORDER_ITEM_ARTICLE).into(OrderItemArticle.class)
)
// map to InOutOrder
.entrySet().stream().map( e -> {
// e.getValue() always returns list with only 1 element?!
e.getKey().articles = e.getValue();
return e.getKey();
})
.collect(Collectors.toList())
;
Say I have 1 row in ORDER and 2 corresponding rows in ORDER_ITEM_ARTICLE. Running the SQL returned by .getSQL() (after .fetchGroups()), returns me 2 rows as expected, so I assumed the fetchGroups() call will populate my list with two entries as well?!
What am I missing?
Thanks!
Update:
As requested, the InOutOrder class:
public class InOutOrder extends Order {
public List<OrderItemArticle> articles;
public List<OrderItemOther> others;
public List<OrderItemCost> costs;
public List<OrderContact> contacts;
public List<EmailJob> emailJobs;
}
So this is just an extension of the JOOQ POJO class and is used for JSON communication with the API clients...
fetchGroups() simply puts objects in a LinkedHashMap. You have to adhere to the usual Map contract, which means implementing equals() and hashCode(). Without it, each object you're creating (or which jOOQ is creating for you) will use identity comparison, so you get every "value" only once in the result.
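The effect can be reproduced without jOOQ. The sketch below groups two "joined rows" of the same order into a plain LinkedHashMap, once with a key class that relies on identity comparison and once with a key class that implements equals() and hashCode(). The class names IdentityOrder and ValueOrder and the article strings are hypothetical stand-ins, not jOOQ types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupingDemo {
    // Key without equals()/hashCode(): every new instance is a distinct map key.
    static class IdentityOrder {
        final int id;
        IdentityOrder(int id) { this.id = id; }
    }

    // Key with value-based equals()/hashCode(): instances with the same id share a key.
    static class ValueOrder {
        final int id;
        ValueOrder(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof ValueOrder && ((ValueOrder) o).id == id;
        }
        @Override public int hashCode() { return Integer.hashCode(id); }
    }

    // Two rows of order 1, each mapped to a freshly created key object.
    static int groupsWithIdentityKeys() {
        Map<IdentityOrder, List<String>> groups = new LinkedHashMap<>();
        for (String article : List.of("article-1", "article-2")) {
            groups.computeIfAbsent(new IdentityOrder(1), k -> new ArrayList<>()).add(article);
        }
        return groups.size();
    }

    static int groupsWithValueKeys() {
        Map<ValueOrder, List<String>> groups = new LinkedHashMap<>();
        for (String article : List.of("article-1", "article-2")) {
            groups.computeIfAbsent(new ValueOrder(1), k -> new ArrayList<>()).add(article);
        }
        return groups.size();
    }

    public static void main(String[] args) {
        System.out.println(groupsWithIdentityKeys()); // prints 2: one single-element list per row
        System.out.println(groupsWithValueKeys());    // prints 1: one list holding both articles
    }
}
```

In the question's code this would presumably mean overriding equals() and hashCode() in InOutOrder (or its Order base class), for example based on the order's id.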

Map to hold multiple sets of key and values

I have a map1 which holds the information as
[40256942,6] [60246792,5]
Now I want to prepare a map2 that holds information such as
itemNo, 40256942
qty, 6
itemNo, 60246792
qty, 5
in order to prepare the final information as JSON:
"partialArticlesInfo": [{itemNo:"40256942", availQty:"6"}, {itemNo:"60246792", availQty:"5"}]
I am trying to iterate over map1 to retrieve the values and set them against the keys, but I only get one entry, the last one. Is there any way to get a new map with entries like the ones shown above?
Map<String, String> partialArticlesInfo = new HashMap<String,String>();
Map<String, String> partialArticlesTempMap = null;
for (Map.Entry<String,String> entry : partialStockArticlesQtyMap.entrySet())
{
partialArticlesTempMap = new HashMap<String,String>();
partialArticlesTempMap.put("itemNo",entry.getKey());
partialArticlesTempMap.put("availQty",entry.getValue());
partialArticlesInfo.putAll(partialArticlesTempMap);
}
In Java (I'm assuming you're using Java, in the future it would be helpful to specify that) and every other language I know of, a map holds mappings between keys and values. Only one mapping is allowed per key. In your "map2", the keys are "itemNo" and "availQty". So what is happening is that your for loop sets the values for the first entry, and then is overwriting them with the data from the second entry, which is why that is the only one you see. Look at Java - Map and Map - Java 8 for more info.
I don't understand why you are trying to put the data into a map, you could just put it straight into JSON with something like this:
JSONArray partialArticlesInfo = new JSONArray();
for (Map.Entry<String,String> entry : partialStockArticlesQtyMap.entrySet()) {
JSONObject stockEntry = new JSONObject();
stockEntry.put("itemNo", entry.getKey());
stockEntry.put("availQty", entry.getValue());
partialArticlesInfo.put(stockEntry);
}
JSONObject root = new JSONObject();
root.put("partialArticlesInfo",partialArticlesInfo);
This will take "map1" (partialStockArticlesQtyMap in your code) and create a JSON object exactly like your example - no need to have map2 as an intermediate step. It loops over each entry in map1, creates a JSON object representing it and adds it to a JSON array, which is finally added to a root JSON object as "partialArticlesInfo".
The exact code may be slightly different depending on which JSON library you are using - check the docs for the specifics.
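If an intermediate Java structure is still wanted before serializing, one option is a list holding one small map per item, so that the "itemNo" and "availQty" keys are never overwritten. This is only a sketch using the standard library; the class name PartialArticles and the method name toItemList are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartialArticles {
    // Build one map per entry instead of merging every entry into a single map.
    static List<Map<String, String>> toItemList(Map<String, String> qtyByItemNo) {
        List<Map<String, String>> items = new ArrayList<>();
        for (Map.Entry<String, String> entry : qtyByItemNo.entrySet()) {
            Map<String, String> item = new LinkedHashMap<>();
            item.put("itemNo", entry.getKey());
            item.put("availQty", entry.getValue());
            items.add(item); // each item keeps its own "itemNo"/"availQty" pair
        }
        return items;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, which makes the output predictable.
        Map<String, String> map1 = new LinkedHashMap<>();
        map1.put("40256942", "6");
        map1.put("60246792", "5");
        System.out.println(toItemList(map1));
        // prints [{itemNo=40256942, availQty=6}, {itemNo=60246792, availQty=5}]
    }
}
```

Most JSON libraries can serialize a list of maps like this directly into an array of objects, but check the documentation of the library in use.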
I agree with Brendan. Another solution would be to store objects like the following in a Set or List:
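import java.util.Objects;

class Item {
    Long itemNo;
    int quantity;

    @Override
    public int hashCode() {
        return Objects.hash(itemNo, quantity);
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Item)) return false;
        Item that = (Item) other;
        // Compare the boxed Long by value, not by reference
        return Objects.equals(itemNo, that.itemNo) && quantity == that.quantity;
    }
}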
Then you can use the JSONArray approach he described to produce the JSON string.
This means that adding new fields to the object won't require any extra effort to generate the JSON.

How should I do a Parallel.ForEach on a sorted dictionary

I'd like to turn the following code into a Parallel.ForEach:
foreach (KeyValuePair<int, List<int>>entry in DataGroups)
{
// my code goes here (its not the problem).
}
DataGroups is neither edited nor returned; this routine updates another, external list, DataTotal. Each DataGroup contains unique indexes, and DataTotal contains a list of all possible indexes, so there is no risk of two threads writing to the same DataTotal entry.
My problem is that I'm unsure how to express this data structure, a sorted dictionary of <int, List<int>> (key and data pairs), inside a Parallel.ForEach. The following doesn't work:
Parallel.ForEach ( KeyValuePair entry in DataGroups =>
I think you got confused with the syntax. Enumerating dictionaries is not a special case. They are just another IEnumerable like any other:
Parallel.ForEach (DataGroups, kvp => { });

Hazelcast - query collections of Map values

Assume I have the following as the value in an IMap:
public class Employee{
public int empId;
public List<String> categories;
public List<String> getCategories(){
return this.categories;
}
}
I would like to find all employees that belong to category "Sales". Also, I would like to create an index on getCategories() so that the query returns fast. There seems to be no Predicate available to do this. How do I go about achieving this? It seems like I will have to write a Predicate to do this. Is there example code I can look at that would show me how to build a predicate that uses an index ?
The only way I currently see to do this is to denormalize the data model: use something like an IMap with the following as the value:
class EmployeeCategory { int employeeId; String category; }
And put an index on category.
Providing more advanced indexes that can do this out of the box is somewhere on the backlog.
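To make the denormalization concrete, here is a plain-Java sketch that does not use Hazelcast at all: it flattens each employee into one row per category and builds an ordinary map from category to employee ids as a stand-in for the index on category. The class name CategoryIndexSketch, the method names, and the sample data are assumptions for illustration; in a real setup the rows would be the values of the IMap, and the index would be whatever Hazelcast maintains on the category attribute.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CategoryIndexSketch {
    static class Employee {
        final int empId;
        final List<String> categories;
        Employee(int empId, List<String> categories) {
            this.empId = empId;
            this.categories = categories;
        }
    }

    // One flat row per (employee, category) pair.
    static class EmployeeCategory {
        final int employeeId;
        final String category;
        EmployeeCategory(int employeeId, String category) {
            this.employeeId = employeeId;
            this.category = category;
        }
    }

    // Flatten the nested category lists into rows.
    static List<EmployeeCategory> denormalize(List<Employee> employees) {
        List<EmployeeCategory> rows = new ArrayList<>();
        for (Employee employee : employees) {
            for (String category : employee.categories) {
                rows.add(new EmployeeCategory(employee.empId, category));
            }
        }
        return rows;
    }

    // Stand-in for an index on "category": category -> employee ids.
    static Map<String, List<Integer>> indexByCategory(List<EmployeeCategory> rows) {
        Map<String, List<Integer>> index = new LinkedHashMap<>();
        for (EmployeeCategory row : rows) {
            index.computeIfAbsent(row.category, k -> new ArrayList<>()).add(row.employeeId);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Employee> employees = List.of(
                new Employee(1, List.of("Sales", "HR")),
                new Employee(2, List.of("Sales")));
        Map<String, List<Integer>> index = indexByCategory(denormalize(employees));
        System.out.println(index.get("Sales")); // prints [1, 2]
    }
}
```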
I tried copying the List elements into a separate IMap and then querying that map in the client.
IMap<String,ArrayList< categories >> cache=hazelcastInstance.getMap("cache");
IMap<String, categories> cachemodified = hazelcastInstance.getMap("cachemodified") ;
int[] idx = { 0 };
xref.get("urkey").forEach(cachefelement ->{
cachemodified.put(String.valueOf(idx[0]++),cachefelement);
});
Predicate p = Predicates.equal("categoryId", "SearchValue");
Collection<categories> result = cachemodified.values(p);
