Log4j - how to tokenize logs like Elastic Search and group by values, filter and visualize them?

When you know the logging format/pattern, is there a way to visualize/classify/filter logs with certain thread name/client id/... without using Elastic Search? I want to show for example, "this thread has these lines, ordered by timestamp, with these keywords". I don't want to manually search by thread names anymore.
I can grep manually, of course, but I want to visualize the logs in a better way. I can search them if I know the thread names, but I want to list all of the thread names without knowing them in advance.
Also, ES marks session boundaries, but grep does not. You can see that one session/request is different from another because the session id has changed. A thread can handle several sessions over a period of time.
I have only seen that in ES so far. I think it could be called a log tokenizer and classifier tool, but for static log files that haven't been fed into ES.
Right now I am trying to do this in Excel: adding ; between fields, parsing them into columns, and then classifying/filtering. In the screenshot, the thread name column is filtered to a single value. But I found this very hard to do when there are many log lines, because splitting on spaces does not work well, so I need to manually add ; or another unique separator.

Log4j2 has a companion project to visualize logs: Chainsaw. It does not have a lot of features, but it might be enough for your needs.
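If a GUI tool is more than you need, the grouping described in the question can also be sketched in a few lines of plain Java. This is only an illustration: the line format (`<timestamp> [<thread>] <LEVEL> sid=<sessionId> - <message>`) and the class name `LogGrouper` are assumptions, so the regular expression has to be adapted to your actual pattern layout.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogGrouper {
    // Assumed line format: "<timestamp> [<thread>] <LEVEL> sid=<sessionId> - <message>"
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\[([^\\]]+)\\] (\\w+)\\s+sid=(\\S+) - (.*)$");

    record Entry(String timestamp, String thread, String level, String session, String message) {}

    // Returns null for lines that do not match the assumed format.
    static Entry parse(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches()
                ? new Entry(m.group(1), m.group(2), m.group(3), m.group(4), m.group(5))
                : null;
    }

    // Discovers thread names on the fly and collects each thread's lines, ordered by timestamp.
    static Map<String, List<Entry>> groupByThread(List<String> lines) {
        Map<String, List<Entry>> byThread = new LinkedHashMap<>();
        for (String line : lines) {
            Entry e = parse(line);
            if (e != null) {
                byThread.computeIfAbsent(e.thread(), k -> new ArrayList<>()).add(e);
            }
        }
        // Fixed-width timestamps sort correctly as plain strings.
        byThread.values().forEach(list -> list.sort(Comparator.comparing(Entry::timestamp)));
        return byThread;
    }

    // Renders one thread's lines and inserts a marker whenever the session id changes.
    static List<String> render(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String current = null;
        for (Entry e : entries) {
            if (!e.session().equals(current)) {
                out.add("---- session " + e.session() + " ----");
                current = e.session();
            }
            out.add(e.timestamp() + " " + e.level() + " " + e.message());
        }
        return out;
    }
}
```

Calling `LogGrouper.groupByThread(lines).keySet()` lists every thread name found in the file, and `render` applied to one thread's entries gives the "this thread has these lines, ordered by timestamp" view with session boundaries marked.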

Related

CQRS Read Model Projections: How complex is too complex a data transformation

I want to sanity check myself on a view projection: can an intermediary concept exist purely in the read model while providing a bridge between commands?
Let me use a contrived example to explain.
We place an order which raises an OrderPlaced event. The workflow then involves generating a picking slip, which is used to prepare a shipment.
A picking slip can be generated from an order (or group of orders) without any additional information being supplied from any external source or user. Is it acceptable then that the picking slip can be represented purely as a read model?
So:
PlaceOrderCommand -> OrderPlacedEvent
OrderPlacedEvent -> PickingSlipView
The warehouse manager can then view a picking slip, select the lines they would like to ship, and then perform a PrepareShipment command. A ShipmentPrepared event will then update the original order, and remove the relevant lines from the PickingSlipView.
I know it's a toy example, but I have a conceptually similar use case where a colleague believes the PickingSlip should be a domain entity/aggregate in its own right, as it's conceptually different to order. So you have PlaceOrder, GeneratePickingSlip, and PrepareShipment commands.
The GeneratePickingSlip command however simply takes an order number (identifier), transforms the order data into a picking slip entity, and persists the entity. You can't modify or remove a picking slip or perform any action on it, apart from using it to prepare a shipment.
This feels like introducing unnecessary overhead on the write model, for what is ultimately just a transformation of existing information to enable another command.
So (and without delving deeply into the problem space of warehouses and shipping)...
Is what I'm proposing a legitimate use case for a read model?
Acting as an intermediary between two commands, via transformation of some data into a different view. Or, as my colleague proposes, should every concept be represented in the write model in all cases?
I feel my approach is simpler and avoids unneeded complexity, but I'm new to CQRS, so perhaps I'm missing something.
Edit - Alternative Example
Providing another example to explore:
We have a book of record for categories, where each record is information about products and their location. The book of record is populated by an external system, and contains SKU numbers, mapped to available locations:
Book of Record (Electronics)
SKU#   Location1   Location2   Location3   ...   Location 10
XXXX   Introduce   Remove      Introduce   ...   N/A
YYYY   N/A         Introduce   Introduce   ...   Remove
Each book of record is an entity, and each line is a value object.
The book of record is used to generate different Tasks (which are grouped in a TaskPlan to be assigned to a person). The plan may only cover a subset of locations.
There are different types of Tasks. One type is for the individual at a location who adds or removes stock from shelves; call this an AllocateStock task. Another type is for a regional supervisor managing multiple locations, who checks that shelving properly follows store guidelines; call this a CheckDisplay task. For allocating stock, we are interested in both introduced and removed SKUs. For checking the displays, we're only interested in newly introduced SKUs, etc.
We are exploring two options:
Option 1
The person creating the tasks has a View (read model) that allows them to select Books of Record. Say they select Electronics and Fashion. They then select one or more locations. They could then submit a command like:
GenerateCheckDisplayTasks(TaskPlanId, List<BookOfRecordId>, List<Locations>)
The commands would then orchestrate going through the records, filtering out locations we don't need, processing only the 'Introduced' items, and creating the corresponding CheckDisplayTasks for each SKU in the TaskPlan.
Option 2
The other option is to shift the filtering to the read model before generating the tasks.
When a book of record is added, a view model is maintained for each type of task. The data might be transposed, and would only include relevant info, i.e. the CheckDisplayScopeView might project the book of record to:
Category                       SKU    Location
Electronics (BookOfRecordId)   XXXX   Location1
Electronics (BookOfRecordId)   XXXX   Location3
Electronics (BookOfRecordId)   YYYY   Location2
Electronics (BookOfRecordId)   YYYY   Location3
Fashion (BookOfRecordId)       ...    ...         etc
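To make the Option 2 projection concrete, here is a minimal sketch of how the book of record could be flattened into such rows. The class name `CheckDisplayProjection`, the `Row` record, and the nested-map representation of a book of record are assumptions made for illustration; they are not part of any real model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CheckDisplayProjection {
    // One row of the (hypothetical) CheckDisplayScopeView.
    record Row(String bookOfRecordId, String sku, String location) {}

    // book maps SKU -> (location -> status), where status is "Introduce", "Remove" or "N/A".
    // Only "Introduce" entries are relevant to CheckDisplay tasks, so everything else is dropped.
    static List<Row> project(String bookOfRecordId, Map<String, Map<String, String>> book) {
        List<Row> rows = new ArrayList<>();
        book.forEach((sku, statuses) -> statuses.forEach((location, status) -> {
            if ("Introduce".equals(status)) {
                rows.add(new Row(bookOfRecordId, sku, location));
            }
        }));
        return rows;
    }
}
```

With the Electronics data above (passed in as insertion-ordered maps), this yields the four Electronics rows shown in the table. The filtering rule lives in the projection, which is exactly the responsibility being questioned.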
When generating tasks, the view enables the user to select the category and locations they want to generate the tasks for. Perhaps they select the Electronics category and Location 1 and 3.
The command is now:
GenerateCheckDisplayTasks(TaskPlanId, List<BookOfRecordId, SKU, Location>)
The command is now no longer responsible for the logic needed to filter out the locations, the Removed and N/A items, etc.
So the command for the first option just submits the ID of the entity that is being converted to tasks, along with the filter options, and does all the work internally, likely utilizing domain services.
The second option offloads the filtering aspect to the view model, and now the command submits values that will generate the tasks.
Note: In terms of the guidance that Aggregates shouldn't appear out of thin air, the Task Plan aggregate will create the Tasks.
I'm trying to determine if option 2 is pushing too much responsibility onto the read model, or whether this filtering behavior is more applicable there.
Sorry, I attempted to use the PickingSlip example as I thought it would be a more recognizable problem space, but realize now that there are connotations that go along with the concept that may have muddied the waters.

The answer to your question, in my opinion, very much depends on how you design your domain, not how you implement CQRS. The way you present it, it seems that all these operations and aggregates are in the same Bounded Context but at first glance, I would think that there are 3 (naming is difficult!):
Order Management or Sales, where orders are placed
Warehouse Operations, where goods are packaged to be shipped
Shipments, where packages are put in trucks and leave
When an Order is Placed in Order Management, Warehouse reacts and starts the Packaging workflow. At this point, Warehouse should have all the data required to perform its logic, without needing the Order anymore.
"The warehouse manager can then view a picking slip, select the lines they would like to ship, and then perform a PrepareShipment command."
To me, this clearly indicates the need for an aggregate that will ensure the invariants are respected. You cannot select items not present in the picking slip, you cannot select more items than the quantities specified, you cannot select items that have already been packaged in a previous package and so on.
"A ShipmentPrepared event will then update the original order, and remove the relevant lines from the PickingSlipView."
I don't understand why you would modify the original order. Also, removing lines from a view is not a safe operation per se. You want to guarantee that concurrency doesn't cause a single item to be placed in multiple packages, for example. You guarantee that using an aggregate that contains all the items, generates the packaging instructions, and marks the items of each package safely and transactionally.
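As a rough illustration of these invariants, an aggregate could look like the sketch below. The `PickingSlip` class, its methods, and the use of `synchronized` (standing in for whatever transactional or optimistic-locking mechanism your persistence layer provides) are all assumptions, not a prescribed design.

```java
import java.util.HashMap;
import java.util.Map;

final class PickingSlip {
    // SKU -> quantity that has not yet been placed in a package.
    private final Map<String, Integer> remaining = new HashMap<>();

    PickingSlip(Map<String, Integer> orderedQuantities) {
        remaining.putAll(orderedQuantities);
    }

    // Handles a (hypothetical) PrepareShipment command.
    // The returned map stands in for the contents of a ShipmentPrepared event.
    synchronized Map<String, Integer> prepareShipment(Map<String, Integer> selection) {
        // Validate the whole selection first so a rejected command leaves the state untouched.
        for (Map.Entry<String, Integer> line : selection.entrySet()) {
            int left = remaining.getOrDefault(line.getKey(), 0);
            if (line.getValue() <= 0 || line.getValue() > left) {
                throw new IllegalArgumentException(
                        "Cannot ship " + line.getValue() + " of " + line.getKey()
                                + "; remaining: " + left);
            }
        }
        // Mark the selected quantities as packaged.
        selection.forEach((sku, qty) -> remaining.merge(sku, -qty, Integer::sum));
        return Map.copyOf(selection);
    }

    synchronized int remaining(String sku) {
        return remaining.getOrDefault(sku, 0);
    }
}
```

Selecting an item that is not on the slip, or more than the remaining quantity, is rejected before any state changes, which is the kind of guarantee a plain view cannot give you.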
"Acting as an intermediary between two commands"
Aggregates execute the commands, they are not in between.
Viewing it from another angle, an indication that you need that aggregate is that the PrepareShippingCommand needs to create an aggregate (Shipping), and according to Udi Dahan, you should not create aggregate roots (out of thin air). Instead, other aggregate roots create them. So, it seems fair to say that there needs to be some aggregate, which ensures that the policies to create shippings are applied.
As a final note, domain design is difficult and you need to know the domain very well, so it is very likely that my proposed solution is not correct, but I hope the considerations I made on each step are helpful to you to come up with the right solution.
UPDATE after question update
I read the updated question a couple of times and revised my answer several times, but each time I ended up with answers very specific to your example, and I'm most likely missing a lot of details needed to actually be helpful (I'd be happy to discuss it on another channel though). Therefore, I want to go back to the first sentence of your question to add an important comment that I missed:
"an intermediary concept can purely exist in the read model, while providing a bridge between commands."
In my opinion, read models are disposable. They are not a single source of truth. They are a representation of the data to easily fulfil the current query needs. When these query needs change, old read models are deleted and new ones are created based on the data from the write models.
So, based on this alone, I would recommend not preparing a read model to facilitate your command operations.
I think that your solution is here:
"When a book of record is added a view model for each type of task is maintained. The data might be transposed, and would only include relevant info."
If I understand it correctly, what you should do here is not create a view model, but create an Aggregate (or multiple). Then this aggregate can receive the commands, apply the business rules and mutate the state. So, instead of having a domain service reading data from "clever" read models and putting it all together, you have an aggregate which encapsulates the data it needs and the business logic.
I hope it makes sense. It's a broad topic and we could talk about it for hours probably.

Internal Search optimization for relevance

My team is using Solr and I have a question regarding it.
There are some search terms which don't give relevant results, or don't return results which should have been displayed. For example:
Searching for Macy's without the apostrophe, like "Macys", doesn't return any results for Macy's.
Searching for JPMorgan vs JP Morgan gives different results.
Searching for IBM doesn't show results which contain its full name, i.e. International Business Machines.
How can we improve and optimize such cases so that the fix applies to all terms, even the ones we didn't catch apart from these 3 above?
Any suggestions?

All these issues are related to how you process the incoming text for those fields. You'll have to create a filter chain for the field - and possibly use multiple fields for different use cases and prioritize those using qf - that processes the input values to do what you want.
Your first case can be solved by using a PatternReplaceFilter to remove any apostrophes - depending on your use case and tokenizer you might want to use the CharFilter version, as it processes the text before it's split into multiple tokens.
Your second case is a straightforward synonym filter or a WordDelimiterFilter, where you expand JPMorgan to "JP Morgan", or use the WordDelimiterFilter to expand case changes into separate tokens. That'll also allow you to search for JP and get JPMorgan related entries. These might have different effects on score; use debugQuery=true to see exactly how each term in your query contributes to the score.
The third case is in general the same as the second case. You'll have to create a decent synonym word list for the terms used, and this is usually something you build as you get feedback from your users, from existing dictionaries and from domain knowledge. There's also the option of preprocessing text using NLP, or in this case, something as primitive as indexing the initials of any capitalized words after each other could help.
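To show what two of these transformations do to the text, here is a small plain-Java sketch. It is not Solr code and does not use Solr's filter classes; in Solr the same steps would be configured in the field's analysis chain. The class and method names are made up for the example.

```java
import java.util.Locale;

final class TermNormalizer {
    // Same idea as a pattern-replace step that drops apostrophes (straight and typographic)
    // before tokenizing, followed by lowercasing.
    static String stripApostrophes(String text) {
        return text.replaceAll("['\u2019]", "").toLowerCase(Locale.ROOT);
    }

    // The "primitive" preprocessing mentioned above: concatenate the initials of all
    // capitalized words so an acronym can be indexed alongside the full name.
    static String initials(String text) {
        StringBuilder sb = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                sb.append(word.charAt(0));
            }
        }
        return sb.toString();
    }
}
```

Here `stripApostrophes("Macy's")` gives `macys`, so the indexed and queried forms meet, and `initials("International Business Machines")` gives `IBM`. The initials trick is crude (for example, "Bank of America" becomes `BA`), which is why a curated synonym list is usually still needed.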

What is the best method to extract relevant info from Email?

My friend has a small business where customers order services by email. He receives several emails a day and sorting through them is becoming cumbersome.
There are about 10 different kinds of tasks the customer can request, and for each there are one or two words that specify it. The other info present in the emails is the place where the service is to be delivered, the time, and the involved people's names. The email also contains an ID, a long number with a fairly standard format.
The emails are very unstructured, but all contain the key info above. My question is: what is the best method to sweep through these emails and extract the key info (such as type of service, place, people's names, the ID etc)?
I thought about some kind of pre-processing, then passing the text through AlchemyAPI and testing the Alchemy output using neural networks for each feature (key info). This can be supervised learning with a continuous feedback loop: once the info is entered, I can have someone validate it.
Any ideas? Thanks

I guess some parts (ID, task, time) can be captured by regular expressions and dictionary matching. Have a look at GATE's JAPE tool.
It should be fairly easy to assemble a dictionary and then use the lookups for the "task". You can also reuse the available JAPE rules for date/time and write a new one for the ID (a simple regex could also be fine).
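Outside of GATE, the regex and dictionary idea can be sketched in plain Java as below. Everything specific here is an assumption: the ID is taken to be a run of at least 8 digits, and the keyword dictionary is invented, so both need to be replaced with the real format and the real list of roughly 10 task types.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailFieldExtractor {
    // Assumption: the ID is a standalone run of 8 or more digits.
    private static final Pattern ID = Pattern.compile("\\b\\d{8,}\\b");

    // Hypothetical dictionary: keyword found in the email -> task type.
    private static final Map<String, String> TASKS = Map.of(
            "install", "INSTALLATION",
            "repair", "REPAIR",
            "inspect", "INSPECTION");

    // Returns the first ID-like number in the email body, if any.
    static Optional<String> findId(String body) {
        Matcher m = ID.matcher(body);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    // Returns a task type whose keyword occurs in the body (case-insensitive).
    // If several keywords match, the alphabetically first task type is returned
    // so the result is deterministic.
    static Optional<String> findTask(String body) {
        String lower = body.toLowerCase(Locale.ROOT);
        return TASKS.entrySet().stream()
                .filter(e -> lower.contains(e.getKey()))
                .map(Map.Entry::getValue)
                .sorted()
                .findFirst();
    }
}
```

An empty `Optional` is a useful signal in its own right: emails where the ID or task cannot be found can be routed to a person instead of being processed automatically.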
You should be careful when matching locations and people's names. OpenCalais and AlchemyAPI can give you good results if names and places are used in well-formed sentences, but they will probably make more mistakes with tabular or otherwise unusual formats. Also, you can never be sure you captured the place and person correctly, so don't rely on that for processing orders directly.
If you have more information about the structure of the mails or the expected names and places (e.g. you have a "clients" table with all possible names), you would probably want to do your own tagging; otherwise I'd stick to OpenCalais or AlchemyAPI plus some regular expressions.
P.S. I assume all mails are in English.

How can I create Log4j 2 appender connected with jTextPane?

I'm currently trying to make Log4j 2 log into a JTextPane. It should act like a STDERR or STDOUT in Netbeans IDE console (incl. text style - color).
I know that I need to create an appender and connect it with the JTextPane, but I don't know how to do it using Log4j 2.
Do you have any suggestions?
I appreciate your help,
marty

I have done this for Logback (with plain text only). The basic things you need to do are:
Implement your own Appender to receive the log events. Log4j 2 provides AbstractAppender, which will give you the baseline functionality.
Use an appropriate Layout to format the log event (this will depend on what type of Document you're using for your JTextPane).
Append the formatted text to the underlying Document for the JTextPane.
A couple of other points:
Things will be simpler if you log plain text only, in which case you should use a JTextArea.
Presumably you will want to cap the amount of text in the Document. You can do this by checking the length on each append and cutting out the first X% using Document.remove when it exceeds the maximum length.
If you have frequent log operations, you should limit the frequency at which you append to the document, and buffer the changes in between to reduce the Swing update/repaint overhead. I typically use 3 Hz. This is also advisable when you have multiple log producer threads, because although the Document.insertString method is thread-safe, it obtains a lock on the document before performing the update, which can result in quite a bit of contention.
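The capping step can be sketched as follows. It uses a PlainDocument so the example runs without any GUI, but a JTextPane's StyledDocument implements the same Document interface. The class name, the `maxChars` and `trimFraction` parameters, and the choice to wrap BadLocationException are assumptions for the example; the Log4j 2 appender and the buffering are not shown.

```java
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;

final class CappedDocumentWriter {
    // Appends text to the end of the document. If the document then exceeds maxChars,
    // removes the oldest text from the front: at least enough to get back under the cap,
    // and at least trimFraction of the current length, so trimming does not happen on
    // every single append.
    static void append(Document doc, String text, int maxChars, double trimFraction) {
        try {
            doc.insertString(doc.getLength(), text, null);
            int length = doc.getLength();
            if (length > maxChars) {
                int overflow = length - maxChars;
                int byFraction = (int) (length * trimFraction);
                doc.remove(0, Math.min(length, Math.max(overflow, byFraction)));
            }
        } catch (BadLocationException e) {
            // Offsets are derived from the document itself, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Convenience accessor for the full document text.
    static String text(Document doc) {
        try {
            return doc.getText(0, doc.getLength());
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A custom appender would format each event with its Layout and pass the result to `append`, ideally from a buffer that is flushed a few times per second rather than once per log event.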
I'd highly recommend referencing the documentation for this. I've never used Log4j 2, but the documentation looks quite straightforward. Similarly, the "Using Text Components" section of the Java Tutorials provides everything you need to know about the Swing side. Unfortunately I can't provide additional links here.

How do I create an array of resources using Jena?

I am using Jena and Java, and am reading a CSV file. For each line of the file there is a subject resource. Two subject resources, on adjacent lines, might share the same value for a field (e.g. both lines have the same process id). In this case, I need to combine the two subject resources, as each one represents a sub-process in production (for example).
My question is: how can I reference those two resources dynamically so that I can combine them? My idea is that when I find that they share the same property value, I store them in an array of subject resources. Is that the right approach?

This question would be a lot easier to answer if you could show some sample data. As it is, I think you're focusing on the wrong bit of the question. If you can decide clearly what it means to have two rows in your CSV with an identical process ID, and then decide how you're going to encode that meaning in your RDF model, then the question of how to write the code - as an array or whatever - will be much clearer.
For example (and I'm going to make up some data here - as I said, it would be easier if you show an actual example), suppose your CSV contains:
processId,startTime,endTime
123,15:22:00,15:23:00
123,16:22:00,16:25:00
So process 123 apparently has two start and end time pairs. If you model this naively in RDF, you'll end up with a confusing model:
process:process123
    a :Process;
    process:start "15:22:00"^^xsd:time;
    process:end "15:23:00"^^xsd:time;
    process:start "16:22:00"^^xsd:time;
    process:end "16:25:00"^^xsd:time;
.
which would suggest that one process had two start times (and two end times) which looks nonsensical. However, it might be that in reality you have a single process with multiple episodes, suggesting one way to model it, or a periodic process which occurs at different times, or, as you suggested, sub-processes of a parent process. Or something else entirely (I'm only guessing, I don't know your domain). Once you are clear what the data means, you can produce a suitable RDF model. For example, an episodic process might be:
process:process123
    a :Process;
    process:episode [
        a process:Episode;
        process:start "15:22:00"^^xsd:time;
        process:end "15:23:00"^^xsd:time;
    ];
    process:episode [
        a process:Episode;
        process:start "16:22:00"^^xsd:time;
        process:end "16:25:00"^^xsd:time;
    ]
.
Once the modelling is clear in your mind, I think you can see that the question of how to produce the desired RDF triples from Java code - and whether or not you need an array - is much clearer. Equally importantly, you can think in terms of the JUnit tests you would write to test whether your code is behaving correctly.
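Assuming the episodic interpretation turns out to be the right one, the grouping step might look like the sketch below. It deliberately uses only plain Java collections and no Jena calls; the `EpisodeGrouper` class and `Episode` record are invented for the example, and the CSV is assumed to have the made-up header and columns shown earlier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EpisodeGrouper {
    // One start/end pair taken from a CSV row.
    record Episode(String start, String end) {}

    // Groups rows of "processId,startTime,endTime" by process id, keeping file order.
    // The first line is assumed to be the header and is skipped.
    static Map<String, List<Episode>> group(List<String> csvLines) {
        Map<String, List<Episode>> byProcess = new LinkedHashMap<>();
        for (int i = 1; i < csvLines.size(); i++) {
            String[] fields = csvLines.get(i).split(",");
            byProcess.computeIfAbsent(fields[0], id -> new ArrayList<>())
                    .add(new Episode(fields[1], fields[2]));
        }
        return byProcess;
    }
}
```

Each map entry then corresponds to one process resource, and each `Episode` in its list to one episode node attached to it, so a map keyed by process id replaces the array of subject resources you were considering. The same grouping is also easy to cover with a JUnit test before any RDF is written.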
