log search using hadoop


We have huge log files (hundreds of gigabytes) on multiple web servers that need to be searched in real time. Different apps write to these log files several times per second. We recently installed a Hadoop cluster on some servers for this purpose. To implement search over these logs, I have come up with this design: a process running on each web server builds an inverted index of the logs, caches it in memory (on the web server itself), and pushes it to HDFS via Flume, to be stored in Hive, when the cache is full (much like an LRU cache). This helps in two ways when something is searched for: the most recent logs are returned quickly from the in-memory cache, and older logs are returned from disk. Since users want to see the latest logs first, this technique works. Can somebody verify whether this design will work and scale properly? Are there any better alternatives?
Thanks

You could store the inverted index in HBase to provide more real-time access to your older logs.
HBase would also likely be a viable alternative to your in-memory cache. You could do this if you wanted to unify the storage platform instead of having it split up. It will obviously be slower than memcached or redis.
A completely different approach could be using Lucene/Solr to index your logs. This has a lot of nice features out of the box for searching.
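
For reference, the design described in the question (an in-memory inverted index over the newest lines that hands older lines to slower storage once it is full) can be sketched roughly as below. This is only an illustration under assumptions: the class name, the whitespace tokenizer, and the `flush` callback (a stand-in for the Flume/HDFS hand-off) are hypothetical and not part of any Hadoop API.

```python
from collections import defaultdict


class RecentLogIndex:
    """In-memory inverted index over the newest log lines.

    When more than `capacity` lines are held, the oldest ones are handed
    to `flush` (a stand-in for the Flume/HDFS hand-off) and dropped.
    """

    def __init__(self, capacity, flush):
        self.capacity = capacity
        self.flush = flush                # hypothetical callback: list of lines -> None
        self.lines = {}                   # line id -> raw line, in insertion order
        self.postings = defaultdict(set)  # token -> ids of lines containing it
        self.next_id = 0

    def add(self, line):
        line_id = self.next_id
        self.next_id += 1
        self.lines[line_id] = line
        for token in set(line.lower().split()):
            self.postings[token].add(line_id)
        if len(self.lines) > self.capacity:
            self._evict_oldest(len(self.lines) - self.capacity)

    def _evict_oldest(self, count):
        oldest = list(self.lines)[:count]
        self.flush([self.lines[i] for i in oldest])
        for line_id in oldest:
            # Remove the evicted line from every posting list it appears in.
            for token in set(self.lines[line_id].lower().split()):
                self.postings[token].discard(line_id)
                if not self.postings[token]:
                    del self.postings[token]
            del self.lines[line_id]

    def search(self, token):
        """Return matching in-memory lines, newest first."""
        ids = sorted(self.postings.get(token.lower(), ()), reverse=True)
        return [self.lines[i] for i in ids]


archived = []
index = RecentLogIndex(capacity=2, flush=archived.extend)
for entry in ["GET /home 200", "GET /login 500", "POST /login 200"]:
    index.add(entry)
```

After the third `add`, the oldest line has been moved to `archived`, and `index.search("/login")` returns the two remaining lines with the newest first. A search that misses here would then fall through to the older data, wherever that is stored.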

Related

Shopware 6 partitioning

Has anyone had any experience with database partitioning? We already have a lot of data and queries on it are already starting to slow down. Maybe someone has some examples? These are tables related to orders.

Shopware, since version 6.4.12.0, allows the use of database clusters; see the relevant documentation. You will have to set up a number of read-only nodes first. The load of reading data will then be distributed among the read-only nodes, while write operations are restricted to the primary node.
Note that in a cluster setup you should also use a lock storage that complements the setup.
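
To illustrate the read/write split described above, here is a minimal sketch of the routing idea. It is not Shopware's implementation or configuration; the class, the verb list, and the node names are assumptions made for the example.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the primary and spread reads over the replicas."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE", "ALTER", "CREATE", "DROP")

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over the replicas; fall back to the primary if there are none.
        self._replicas = itertools.cycle(replicas or [primary])

    def route(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.primary
        return next(self._replicas)


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])  # hypothetical node names
```

With this router, consecutive `SELECT` statements alternate between the two replicas, while any statement starting with a write verb always goes to the primary.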

Besides using a DB cluster you can also try to reduce the load of the db server.
First, you should enable the HTTP cache; better still, additionally set up a reverse proxy cache like Varnish. This will greatly decrease the number of requests that hit your web server and thus your DB server as well.
Besides, all the measures explained here should improve the overall performance of your shop as well as decrease the load on the DB.
Additionally you could use Elasticsearch, so that costly search requests won't hit the Database. And use a "real" MessageQueue, so that the messages are not stored in the Database. And use Redis instead of the database for the storage of performance critical information as is documented in the articles in this category of the official docs.
The impact of all those measures probably depends on your concrete project setup. If the DB locks or slow queries hint at one of the points I mentioned previously, that is an indicator to start in that direction. E.g. if you see a lot of search-related queries, Elasticsearch would be a great start, but if you see a lot of DB load coming from writing/reading/deleting messages, then the message queue might be a better starting point.
All in all, when you use a DB cluster with a primary and multiple replicas and use the additional services I mentioned here, your shop should be able to scale quite well without the need for partitioning the actual DB.

What is the best way to resolve CouchDB document conflicts across 2 DB instances?

I have an application running on Node.js and I am trying to make it a distributed app. All write requests go to the Node application, which writes to CouchDB A and, on success, writes to CouchDB B. We read data through an ELB (which reads from the two DBs). This works fine.
But I recently ran into a problem: CouchDB B went down, and now that it is back up, there is a document _rev mismatch between the two instances.
What would be the best approach to resolve this scenario without any downtime?

If your CouchDB A & CouchDB B are in the same data centre, then @Flimzy's suggestion of using CouchDB 2.0 in a clustered deployment is a good one. You can have n CouchDB nodes configured in a cluster with a load balancer sitting above the cluster, delivering HTTP(s) traffic to any node that is "up".
If A & B are geographically separated, you can use CouchDB Replication to move data from A-->B and B-->A which would keep both instances perfectly in sync. A & B could each be clusters of 3 or more CouchDB 2.0 nodes, or single instances of CouchDB 1.7.
None of these solutions will "fix" the problem you are seeing when two copies of the database are modified in different ways at the same time. This "conflict" state is CouchDB's way of preventing data loss when two writes clash. Your app can resolve the conflict by picking a winning revision or writing a new one. It's not a fault condition, it's helping your application recover from a data loss during concurrent writes in a distributed system.
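
As a rough sketch of "picking a winning revision": once the app has fetched all conflicting revisions of a document (for example by requesting the document with `?conflicts=true` and then fetching each listed revision), it can keep one and delete the others. The helper below only builds the request body for CouchDB's `_bulk_docs` endpoint; the function name, the `pick_winner` rule, and the `updated_at` field are assumptions for illustration, and the HTTP calls are left out.

```python
def resolve_conflict(revisions, pick_winner):
    """Build a `_bulk_docs` body that deletes every losing revision.

    `revisions` holds all conflicting revisions of one document, each a
    dict with at least `_id` and `_rev`. `pick_winner` is the
    application's own rule (hypothetical here) for choosing which one
    to keep.
    """
    winner = pick_winner(revisions)
    deletions = [
        # A deletion stub for a losing revision removes it from the conflict set.
        {"_id": rev["_id"], "_rev": rev["_rev"], "_deleted": True}
        for rev in revisions
        if rev["_rev"] != winner["_rev"]
    ]
    return {"docs": deletions}


def latest_by_timestamp(revisions):
    # Assumes each doc carries an application-level `updated_at` field.
    return max(revisions, key=lambda rev: rev["updated_at"])
```

Posting the returned body to the database's `_bulk_docs` endpoint leaves the chosen revision as the only live one. If the app wants to merge fields instead of picking a revision, it would write a new revision on top of the winner in the same request.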
You can read more about document conflicts in this blog post series.

If both of your 1.6.x nodes are syncing buckets using standard replication, turning off one node shouldn't be an issue. When the node comes back up, it receives all updates without conflicts, because there was no way to create them while the node was down.
If you experience conflicts during normal operation, unfortunately there is no general way to resolve them automatically. However, in most cases you can find a strategy for marking the affected doc subtrees in a way that lets you determine which subversion is the most recent (or the most important).
To detect docs that have conflicts, you can use standard views: a doc received by a view function has the _conflicts property if conflicting revisions exist. Using an appropriate view, you can detect conflicts and merge docs. Regardless of how you detect conflicts, you need external code to resolve them.
If your conflicting data is numeric by nature, consider using CRDT structures and standard map/reduce to obtain the final value. If your data is text-like, you may also try to use CRDTs, but to obtain reasonable performance you need to use reducers written in Erlang.
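
To make the CRDT suggestion concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest numeric CRDTs. It is plain Python, independent of CouchDB; the node names are made up, and how the per-node counts would be stored in documents and summed in a reduce is left open.

```python
def increment(counter, node, amount=1):
    """Return a copy of `counter` with `node`'s own slot increased."""
    updated = dict(counter)
    updated[node] = updated.get(node, 0) + amount
    return updated


def merge(a, b):
    """Merge two replicas: take the per-node maximum."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def value(counter):
    """The counter's value is the sum over all nodes."""
    return sum(counter.values())


# Two replicas that diverged while one of them was unreachable.
on_a = {"A": 3}
on_b = {"A": 2, "B": 4}
merged = merge(on_a, on_b)
```

Because each node only ever increases its own slot, and the merge takes the per-node maximum, merging is commutative and idempotent: both replicas arrive at the same `merged` state (`{"A": 3, "B": 4}`, value 7) regardless of the order in which they exchange updates.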
As for 2.x: I do not recommend using 2.x for your case (or, actually, for any real case except experiments). First, using 2.x will not remove conflicts, so it does not solve your problem. Also, taking into account that 2.x requires a lot of poorly documented manual operations across nodes and is unable to rebalance, you will get more pain than value.
BTW, using any cluster solution makes very little sense for two nodes.
As for the above-mentioned CVE 12635 and CouchDB 1.6.x: you can use this patch https://markmail.org/message/kunbxk7ppzoehih6 to cover the vulnerability.

Need architecture hint: Data replication into the cloud + data cleansing

I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open source framework or cloud solution I can use for this. I took a look at Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint as to which frameworks you would use for such a task?

From my quick read on Apex, it requires Hadoop underneath, which couples you to more dependencies than you probably want early on.
Kafka, on the other hand, is used for transmitting messages (it has other APIs, such as Streams and Connect, which I'm not as familiar with).
I'm currently using Kafka to stream log files in real time from a client system. Out of the box, Kafka really only provides fire-and-forget semantics. I have had to add a bit to get exactly-once delivery semantics (Kafka 0.11.0 should solve this).
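
The answer does not say what was added to get exactly-once behaviour. One common pattern, shown here purely as an assumption and not as the author's method, is to let messages be redelivered and make the consumer idempotent by remembering the IDs it has already processed. The sketch uses no Kafka client library; the class name and the message ID are hypothetical.

```python
class IdempotentHandler:
    """Apply each message at most once, even if it is delivered repeatedly."""

    def __init__(self, apply):
        self.apply = apply   # hypothetical side-effecting callback
        self.seen = set()    # IDs already processed; unbounded in this sketch

    def handle(self, message_id, payload):
        if message_id in self.seen:
            return False     # duplicate delivery, skip it
        self.apply(payload)
        self.seen.add(message_id)
        return True


processed = []
handler = IdempotentHandler(processed.append)
handler.handle("log-0001", "first line")
handler.handle("log-0001", "first line")  # redelivery is ignored
```

In a real system the set of seen IDs would need to be persisted and bounded, for example by storing it together with the processed results.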
Overall, think of Kafka as a more low-level solution with logical message domains and queues, and, from what I skimmed, Apex as a heavier packaged library with a lot more things to explore.
Kafka would allow you to switch out the underlying analytical system for one of your choosing through its consumer API.

The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which in the cloud can quickly build up. Of course, the size of the data is also important.
These are a few things you should consider:
1) Batch vs. streaming: do the updates flow continuously, or is the process run on demand or periodically? (It sounds like the latter rather than the former.)
2) What latency is required? That is, what is the maximum time an update may take to propagate through the system? The answer to this question influences question 1).
3) How much data are we talking about? Are you at gigabyte, terabyte, or petabyte scale? Different tools have different "maximum altitudes".
4) And in what format? Do you have text files, or are you pulling from relational DBs?
5) Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning to use for that part? Depending on question 3) (data size), deduping usually requires a join by ID, which is done in constant time in a key-value store but requires a sort (generally O(n log n)) in most other data systems (Spark, Hadoop, etc.).
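
The difference between the two deduplication routes mentioned in the last point can be sketched in a few lines. This is a toy, in-memory illustration: the record layout and the `id` key are assumptions, and real customer deduplication usually also needs fuzzy matching, which is not covered here.

```python
from itertools import groupby
from operator import itemgetter


def dedupe_hashed(records, key):
    """Keep the first record per key using a dict: one O(1) lookup per record."""
    seen = {}
    for record in records:
        seen.setdefault(record[key], record)
    return list(seen.values())


def dedupe_sorted(records, key):
    """Same idea via sort + group, the O(n log n) route."""
    ordered = sorted(records, key=itemgetter(key))  # stable, so the first-seen record wins
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


customers = [
    {"id": 2, "name": "Bo"},
    {"id": 1, "name": "Al"},
    {"id": 2, "name": "Bo B."},
]
```

Both functions keep the first record seen for each ID. The hashed version returns them in first-seen order, the sorted version in key order.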
So, while you ponder all these questions, if you're not sure, I'd recommend you start your cloud work with an elastic solution, that is, pay as you go rather than setting up entire clusters in the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is Amazon Athena (https://aws.amazon.com/athena/). You can dump your data in S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Presto, so you could write the whole system using basically SQL.
Otherwise you could use Elastic MapReduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html) or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. Also, there are similar products from Google (BigQuery, etc.) and Microsoft (Azure).

Yes, you can use Apache Apex for your use case. Apache Apex is supported by Apache Malhar, which can help you build an application quickly: load the data using the JDBC input operator and then either store it in your cloud storage (maybe S3) or de-duplicate it before storing it in any sink. It also provides a Dedup operator for this kind of operation. But as mentioned in the previous reply, Apex does need Hadoop underneath to function.

Storing MP4 Files in Cassandra?

I am currently considering whether I should store media in an Apache Cassandra database. The use case is that the site will take uploads from users for insurance claims and will need to store the files so that they cannot be accessed without the correct permissions, and at the same time they need to be streamable. If I store them on a file system, I have to deal with redundancy, backups and so on using old file-system-based tech. I am not really interested in dealing with a CDN, because many of them are expensive, but also because the permission to view the content depends on information in the app, such as which adjuster is assigned to the case. In addition, I want to stream the files rather than require download-and-view, which would be the default mode with requests against a CDN. If I put them in Cassandra, it will handle the replication and storage, and I can stream the binary data out of the database to the user with integrated permissions. What I am concerned about is whether I will run into problems with Cassandra rows holding huge HD video files that are sometimes 1 to 2 hours long (testimony).
I am interested in the recommendations of Cassandra users on this issue. How would you solve the problem? Are there any lessons you have learned that I can benefit from? Would you suggest anything specific about the video tables if I go with Cassandra storage? Is there any CDN that will stream, not require download, allow me to plug in permissions and at the same time be open source?
Thanks a bunch.

Cassandra is definitely not designed to be, and should not be used as, an object store. I've worked on plenty of use cases where Cassandra was used as the metadata store alongside the object store/CDN, and it can complement them quite nicely.
Check out KillrVideo for inspiration: https://killrvideo.github.io/

This seems like a good key-value usecase for Streaming LOB support in Oracle NoSQL Database. You might want to look at this - http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/lobapi.html

Event logging with distributed database for node.js (MongoDB?)

I am looking for a system or library for node.js that can log information about client access on every remote server and automatically gather that information on a central log server for later analysis. The remote servers will have write-only access, while the central server will accumulate a lot of data to read.
I hope there is a solution using a distributed [NoSQL] database, like MongoDB.
However, I have not found out how to set it up.
For example, I hope that cleaning up old data can be initiated on the central log server (once the data has been processed) and that entries for old dates can be removed on the remote servers with little overhead.
Currently we log into files and use a Hadoop system for log analysis.
But I think we need to accumulate the data in a database.

Winston, currently the best logging framework for node.js, has an option to log into MongoDB or CouchDB.

Scribe could be what you're looking for. There are node packages for it too.
I have never checked it out so I'd be interested in reading your thoughts in the comments if you investigate it and find it good/bad, easy/hard to setup, etc.

MongoDB or any other distributed database will not solve the problem.
An in-house project must be created.
Some features of MongoDB for consideration:
Capped collections are actually a way to lose data. They may be good for a short history.
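
The behaviour behind that warning can be imitated with a fixed-size buffer: once it is full, every new entry silently pushes out the oldest one. The snippet below uses Python's `deque` as a stand-in; it does not talk to MongoDB, and the size of 3 is arbitrary.

```python
from collections import deque

capped = deque(maxlen=3)  # stand-in for a capped collection that holds 3 entries
for event in ["e1", "e2", "e3", "e4", "e5"]:
    capped.append(event)  # once full, each append drops the oldest entry

print(list(capped))       # ['e3', 'e4', 'e5']
```

If the central server has not collected `e1` and `e2` by the time they are overwritten, they are gone, which is why this only suits a short history.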
