I am currently considering whether I should store media in an Apache Cassandra database. The use case is a site that takes uploads from users for insurance claims and needs to store the files so that they cannot be accessed outside the correct permissions, while still being streamable. If I store them on a file system, I have to deal with redundancy, backups, and so on using file-system-based old tech. I am not really interested in dealing with a CDN, partly because many of them are expensive, but also because whether you can view the content depends on information in the app, such as which adjuster is assigned to the case. In addition, I want to stream the files rather than require a download-and-view flow, which would be the default mode for requests against a CDN. If I put them in Cassandra, it will handle the replication and storage, and I can stream the binary data out of the database to the user with integrated permissions. What I am concerned about is whether I will run into problems with Cassandra rows holding huge HD video files that are sometimes 1 to 2 hours long (testimony).
I am interested in the recommendations of Cassandra users concerning this issue. How would you solve the problem? Are there any lessons you have learned that I can benefit from? Would you suggest anything specific about the video tables if I go with Cassandra storage? Is there any CDN that will stream (not require download), allow me to plug in permissions, and at the same time be open source?
Thanks a bunch.
Cassandra is definitely not designed as, and should not be used as, an object store. I've worked on plenty of use cases where Cassandra was used as the metadata store alongside the object store/CDN, and it can complement them quite nicely.
Check out KillrVideo for inspiration: https://killrvideo.github.io/
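To make that split concrete, here is a rough sketch of keeping only claim-media metadata in Cassandra with the Node.js cassandra-driver, while the video bytes live in an object store referenced by key. The keyspace, table, and column names are mine, not anything standard, and the permission rule (adjuster assigned to the claim) is just an assumption from the question.

```typescript
// Sketch only: Cassandra stores claim-media metadata; the video bytes live in an
// object store and are referenced by `storage_key`. Names are hypothetical.
import { Client, types } from "cassandra-driver";

const client = new Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "claims",
});

// Assumed schema:
// CREATE TABLE claims.claim_media (
//   claim_id uuid, media_id timeuuid, adjuster_id uuid,
//   storage_key text, content_type text, size_bytes bigint,
//   PRIMARY KEY (claim_id, media_id));

async function recordUpload(claimId: string, adjusterId: string, storageKey: string,
                            contentType: string, sizeBytes: number): Promise<void> {
  const query =
    "INSERT INTO claim_media (claim_id, media_id, adjuster_id, storage_key, content_type, size_bytes) " +
    "VALUES (?, now(), ?, ?, ?, ?)";
  await client.execute(
    query,
    [claimId, adjusterId, storageKey, contentType, types.Long.fromNumber(sizeBytes)],
    { prepare: true }
  );
}

// The permission check happens in the app layer against the metadata row; only then
// does the app stream the object (or hand out a short-lived URL) to the user.
async function mediaForAdjuster(claimId: string, adjusterId: string) {
  const rs = await client.execute(
    "SELECT media_id, adjuster_id, storage_key, content_type FROM claim_media WHERE claim_id = ?",
    [claimId],
    { prepare: true }
  );
  return rs.rows.filter((row) => row["adjuster_id"].toString() === adjusterId);
}
```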
This seems like a good key-value use case for Streaming LOB support in Oracle NoSQL Database. You might want to look at this: http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/lobapi.html
I am loading some large JSON data from a 3rd-party API on server startup and writing it into .json files (around 150 MB each), then loading a file into an object whenever I need to use it.
The thing is, I am not sure this is the right or most efficient way to do it. Should I use a database instead? If yes, could you mention which one to use?
Thanks.
Glad to answer your question.
Modern databases are able to keep up with large file sizes, so in this case size would not be an issue.
However, performance still depends on the usage and purpose of the application.
For example, sometimes the application might require content caching; most databases already have this built in, but there are also applications where it won't apply.
This discussion also compares disk storage and database storage; there are lots of good answers there, and I hope it helps.
I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open source framework or cloud solution I can use for this. I took a look at Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint as to which frameworks you would use for such a task?
From my quick read on Apex, it requires Hadoop underneath, coupling you to more dependencies than you probably want early on.
Kafka, on the other hand, is used for transmitting messages (it has other APIs, such as Streams and Connect, which I'm not as familiar with).
I'm currently using Kafka to stream log files in real time from a client system. Out of the box, Kafka really only provides fire-and-forget semantics; I have had to add a bit to get exactly-once delivery semantics (Kafka 0.11.0 should solve this).
Overall, think of Kafka as a lower-level solution with logical message domains and queues, and, from what I skimmed, Apex as a more heavily packaged library with a lot more things to explore.
Kafka would allow you to swap in the underlying analytical system of your choosing via its consumer API.
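As a rough illustration of that consumer-API point, here is a minimal sketch using the kafkajs client (one of several Node clients; the topic, group, and function names are made up) that reads customer-update messages and hands them to whatever analytical sink you choose:

```typescript
// Minimal sketch with kafkajs and a made-up "customer-updates" topic. The point is
// only that the consumer API decouples ingestion from the analytics sink you pick.
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "customer-sync", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "analytics-loader" });

async function run(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topic: "customer-updates", fromBeginning: true });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const customer = JSON.parse(message.value?.toString() ?? "{}");
      // Swap this call out for whichever analytical system you choose
      // (warehouse loader, Spark job trigger, etc.).
      await loadIntoAnalyticsStore(customer);
    },
  });
}

async function loadIntoAnalyticsStore(customer: unknown): Promise<void> {
  console.log("would load:", customer); // placeholder sink
}

run().catch(console.error);
```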
The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which on the cloud can quickly build up. Of course, the size of the data is also important.
These are a few things you should consider:
1) Batch vs streaming: do the updates flow continuously, or is the process run on demand/periodically? (It sounds like the latter rather than the former.)
2) What's the latency required? That is, what's the maximum time that it would take an update to propagate through the system? The answer to this question influences question 1.
3) How much data are we talking about? Are you in the gigabyte, terabyte, or petabyte range? Different tools have different 'maximum altitudes'.
4) And what format? Do you have text files, or are you pulling from relational DBs?
5) Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning to use for that part? Depending on question 3 (data size), deduping usually requires a join by ID, which is done in constant time per record in a key-value store but requires a sort (generally O(n log n)) in most other data systems (Spark, Hadoop, etc.); see the sketch just below this list.
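To make the key-value point in item 5 concrete, here is a toy sketch (the record fields are invented) of deduping customer records by ID with a hash map, which is the constant-time-per-record lookup referred to above:

```typescript
// Toy sketch of dedup-by-ID with a hash map: each lookup/insert is O(1) on average,
// versus the sort-based approach most batch systems fall back to. Fields are invented.
interface CustomerRecord {
  customerId: string;
  updatedAt: number; // epoch millis
  payload: Record<string, unknown>;
}

function dedupeById(records: CustomerRecord[]): CustomerRecord[] {
  const latestById = new Map<string, CustomerRecord>();
  for (const rec of records) {
    const existing = latestById.get(rec.customerId);
    // Keep the most recently updated version of each customer.
    if (!existing || rec.updatedAt > existing.updatedAt) {
      latestById.set(rec.customerId, rec);
    }
  }
  return [...latestById.values()];
}
```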
So, while you ponder all these questions, if you're not sure, I'd recommend you start your cloud work with an elastic, pay-as-you-go solution rather than setting up entire clusters on the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is Amazon Athena (https://aws.amazon.com/athena/). You can dump your data into S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Apache Presto, so you could build the whole system using basically SQL.
Otherwise you could use Elastic MapReduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html) or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. There are also similar products from Google (BigQuery, etc.) and Microsoft (Azure).
Yes, you can use Apache Apex for your use case. Apache Apex is complemented by Apache Malhar, which can help you build an application quickly to load data using the JDBC input operator and then either store it to your cloud storage (maybe S3) or do de-duplication before storing it to any sink. It also provides a Dedup operator for exactly this kind of operation. But, as mentioned in the previous reply, Apex does need Hadoop underneath to function.
Simply, I need to build an app to store images for users. So each user can upload images and view them on the app.
I am using NodeJS and Mongo/Mongoose.
Is this a good approach to handle this case:
When the user uploads the image file, I will store it locally.
I will use Multer to store the file.
Each user will have a separate folder named after their username.
In the user schema, I will define a string array that records the file paths.
When the user needs to retrieve a file, I will look up its path and retrieve it from the local disk.
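For what it's worth, a minimal sketch of that flow might look like the following (Express + Multer + Mongoose; the route shapes, schema fields, and folder layout are just my assumptions about the setup described above):

```typescript
// Minimal sketch of the described flow: Multer writes the file to a per-user folder,
// the path is pushed into the user's Mongoose document, and retrieval serves from disk.
import express from "express";
import multer from "multer";
import mongoose from "mongoose";
import fs from "fs";
import path from "path";

mongoose.connect("mongodb://localhost:27017/imageapp").catch(console.error);

const userSchema = new mongoose.Schema({
  username: { type: String, required: true, unique: true },
  imagePaths: [String], // stored file paths, as described above
});
const User = mongoose.model("User", userSchema);

const storage = multer.diskStorage({
  destination: (req, _file, cb) => {
    const dir = path.join("uploads", req.params.username); // one folder per user
    fs.mkdirSync(dir, { recursive: true });
    cb(null, dir);
  },
  filename: (_req, file, cb) => cb(null, `${Date.now()}-${file.originalname}`),
});
const upload = multer({ storage });

const app = express();

app.post("/users/:username/images", upload.single("image"), async (req, res) => {
  // Record the stored path in the user's document so it can be looked up later.
  await User.updateOne(
    { username: req.params.username },
    { $push: { imagePaths: req.file!.path } }
  );
  res.status(201).json({ path: req.file!.path });
});

app.get("/users/:username/images/:index", async (req, res) => {
  // Retrieval: look up the stored path in the schema, then serve from local disk.
  const user = await User.findOne({ username: req.params.username });
  const filePath = user?.imagePaths[Number(req.params.index)];
  if (!filePath) return res.status(404).end();
  res.sendFile(path.resolve(filePath));
});

app.listen(3000);
```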
Now my questions are:
Is this a good approach (storing files on the local file system and storing the paths in the schema)?
Is there any reason to use GridFS, if the file sizes are small (<1MB)?
If I am planning to use S3 to store files later, is this a good strategy?
This is my first time building a DB application like this, so I'd very much appreciate some guidance.
Thank you.
1) Yes, storing the location within your database for use within your application, and the physical file elsewhere, is an appropriate solution. Depending on the data store and the number of files, storing them within a database can be detrimental, as it can impede processes like backup and replication when there are many large files.
2) I admit that I don't know GridFS, but the documentation says it is for files larger than 16 MB, so it sounds like you don't need it yet.
3) S3 is a fantastic product and enables edge caching, backup, and many other services. I think your choice needs to look at what AWS provides and whether you need it, e.g. global caching or replication to different countries and data centres. Different features come at different price points, but personally I find the S3 platform excellent and have around 500 GB loaded there for different purposes.
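On the S3 point, a common pattern that keeps the same "path in the schema" idea is to store the object key and hand out short-lived presigned URLs. A rough sketch with the AWS SDK v3 follows; the bucket name and key layout are placeholders, not anything from the question.

```typescript
// Rough sketch of the "store the key, presign on demand" pattern with AWS SDK v3.
// Bucket name and key layout are placeholders.
import { S3Client, PutObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";
import { readFile } from "fs/promises";

const s3 = new S3Client({ region: "us-east-1" });
const BUCKET = "my-user-images"; // placeholder

// Upload: the key (e.g. "<username>/<filename>") is what you store in the schema
// instead of a local disk path.
export async function uploadImage(username: string, localPath: string, filename: string) {
  const key = `${username}/${filename}`;
  await s3.send(new PutObjectCommand({
    Bucket: BUCKET,
    Key: key,
    Body: await readFile(localPath),
  }));
  return key;
}

// Retrieval: generate a short-lived URL so the app, not the bucket policy,
// decides who may see the image.
export async function imageUrl(key: string): Promise<string> {
  return getSignedUrl(s3, new GetObjectCommand({ Bucket: BUCKET, Key: key }), {
    expiresIn: 300, // seconds
  });
}
```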
I'd like to just use .json files to store data, rather than using a database. Many simple sites have little data, and reading/writing to a file (that can be added to version control) seems adequate, and eliminates the need for database versioning / deployment logistics.
npm: node-store
Here's one way to do it, yet I'd need to implement all kinds of query functionality.
I'm really unfamiliar with CouchDB. From the little I've read, it looks like it might use plain files to store the JSON data, but it might use some other kind of disk storage. Can someone shed some light on this?
Does CouchDB store its JSON in text-based files that can be added to version control (git)?
Does anyone know of another text-based storage system with some query functionality?
CouchDB is a full-fledged database. The value it gives you over simply using file-based storage is additional indexing. I.e., if you go file-based, then you can either do only key-based lookups (the file name) or build your own secondary indexing methodology (symlinks or whatever). Now you're in the database-building business instead of the app-building business, which is silly because your entire premise seems to be simplicity and focusing on your app.
Also, keep in mind that when you have many (even just 2) people causing writes to your file(s), then you're going to run into either file system locking problems or users overwriting one another.
You're correct, though: if you only have a few pieces of information, then a single JSON file - basically a config file - is far easier than a database, especially if people are only reading from the file.
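As a very small illustration of that "single JSON file as a config-style store" idea, assuming Node (the file name and record shape are arbitrary):

```typescript
// Tiny sketch of a single-JSON-file store, fine for small, mostly-read data.
// No locking, so concurrent writers will overwrite each other (the caveat raised above).
import { readFileSync, writeFileSync, existsSync } from "fs";

const STORE_PATH = "data.json";

type Store = Record<string, unknown>;

function load(): Store {
  return existsSync(STORE_PATH)
    ? (JSON.parse(readFileSync(STORE_PATH, "utf8")) as Store)
    : {};
}

function save(store: Store): void {
  // Pretty-printed so the file diffs nicely under version control.
  writeFileSync(STORE_PATH, JSON.stringify(store, null, 2));
}

// Usage: read-modify-write the whole file each time.
const store = load();
store["siteTitle"] = "My simple site";
save(store);
```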
Also, keep in mind that there are Database-as-a-Service solutions that remove the need for DIY install/configure/maintenance/administration. One of them is Cloudant, which is based on CouchDB, is API-compatible, contributes back, etc. (I work at Cloudant.)
Does anyone know of another text-based storage system with some query functionality?
You can use the ueberDB module with the Dirty file storage backend.
As far as I remember, this storage just appends your data to the same text file over and over again, so if you really have a small dataset, it'll work just fine.
If your data grows too much, you can always change the storage backend while still using the same module.
I am designing a system that's going to have about 10 million+ users, each with a photo of about 1-2 MB.
We are going to deploy both the database and the web app using Microsoft Azure.
I am wondering how I should store the photos; there are currently two options:
1. Store all photos using SQL Server FILESTREAM
2. Use a file server
I have no experience with such large-scale BLOB data using FILESTREAM.
Can anybody give me any suggestions? The pros and cons?
And input from anyone with Microsoft Azure experience concerning storing large volumes of photos would be really appreciated!
Thx
Ryan.
I vote for neither. Use Windows Azure Blob storage. Simple REST API, $0.15/GB/month. You can even serve the images directly from there, if you make them public (like <img src="http://myaccount.blob.core.windows.net/container/image.jpg" />), meaning you don't have to funnel them through your web app.
A database is almost always a horrible choice for any large-scale binary storage need. A database is best for relational data; instead, provide references in your database to the actual storage location. There are a few factors you should consider:
Cost - SQL Azure costs quite a lot per GB of storage and has small storage limits (50 GB per database), both of which make it a poor choice for binary data. Windows Azure Blob storage is vastly cheaper for serving up binary objects (it has a somewhat more complicated pricing system, but is still vastly cheaper per GB).
Throughput - SQL Azure has pretty good throughput, as it can scale well; however, Windows Azure Blob storage has even greater throughput, as it can scale to any number of nodes.
Content Delivery Network - A feature not available for SQL Azure (though a complex, custom wrapper could be created), but one that can be set up within minutes to piggyback on your Windows Azure Blob storage and provide limitless bandwidth to your end users, so you never have to worry about your binary objects being a bottleneck in your system. CDN costs are similar to those of Blob storage; you can find all of that here: http://www.microsoft.com/windowsazure/pricing/#windows
In other words, there's no reason not to go with Blob storage. It is simple to use, cost-effective, and will scale to any need.
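For reference, uploading to Blob storage and keeping only a reference in the database is just a few lines. The sketch below uses the current @azure/storage-blob SDK, which postdates this answer; the container and blob names are placeholders.

```typescript
// Minimal sketch with the current @azure/storage-blob SDK; container and blob names
// are placeholders. The idea is the one described above: put the bytes in Blob
// storage and keep only a reference in the database.
import { BlobServiceClient } from "@azure/storage-blob";

const blobService = BlobServiceClient.fromConnectionString(
  process.env.AZURE_STORAGE_CONNECTION_STRING!
);

export async function uploadPhoto(userId: string, data: Buffer): Promise<string> {
  const container = blobService.getContainerClient("photos"); // placeholder container
  await container.createIfNotExists();

  const blobName = `${userId}.jpg`;
  const blockBlob = container.getBlockBlobClient(blobName);
  await blockBlob.uploadData(data, {
    blobHTTPHeaders: { blobContentType: "image/jpeg" },
  });

  // Store this URL (or just the blob name) in your database row for the user.
  return blockBlob.url;
}
```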
I can't speak to anything Azure-related, but for my money the biggest advantage of using FILESTREAM is that the data gets backed up as part of the normal SQL Server backup process. The size of the data you are talking about also suggests that FILESTREAM may be a good choice.
I've worked on an SCM system with an RDBMS back end, and one of our big decisions was whether to store the file deltas on the file system or inside the DB itself. Because it was cross-RDBMS we had to cook up a generic non-FILESTREAM way of doing it, but the ability to do a single-shot backup sold us.
FILESTREAM is a horrible option for storing images. I'm surprised MS ever promoted it.
We're currently using it for the images on our website, mainly the user-generated images and any CMS-related content that admins create. The decision to use FILESTREAM was made before I started. The biggest issue is serving the images up. You had better have a CDN sitting in front; if not, plan on your system coming to a screeching halt. Of course, most sites have a CDN, but you don't want to be at the mercy of that service going down, because then your system will get overloaded. The amount of stress put on your SQL Server is the main problem here.
In terms of ease of backup, the tradeoff is that your DB is much, much larger and, therefore, the backup takes longer, potentially much longer, and the system runs slower during the backup. Not to mention that moving backups around takes longer (i.e., restoring prod data in a dev environment or on local machines for dev purposes). Don't use this as a deciding factor.
Most cloud services have automatic redundancy for any files you store on their systems (i.e., AWS's S3 and Azure's Blob storage). If you're on premises, just make sure you use a shared location for the images and make sure that location is backed up. I think the best option is to set it up so each image (and other UGC file types too) has an entry in your DB with a path to that file. Going one step further, separate the root path into a config setting and store only the remaining path with the entry. For example, the root path in config might be a base URL, a shared drive or virtual dir, or a blank entry; then your entry might be "/files/images/image.jpg". This way, if you move your file store, you can just update the root config. I would also suggest creating a FileStoreProvider interface (singleton) that can be used for managing (saving, deleting, updating) these files. This way, if you switch between AWS, Azure, or on premises, you can just create a new provider.
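A sketch of that FileStoreProvider idea might look like the following; the method names and shape are my own guess, not an established interface.

```typescript
// Sketch of the suggested FileStoreProvider abstraction. The method set is my own
// guess at the shape; the point is that callers never know which backend is in use.
import { promises as fs } from "fs";
import * as path from "path";

export interface FileStoreProvider {
  save(relativePath: string, data: Buffer): Promise<void>;
  delete(relativePath: string): Promise<void>;
  /** Resolve a stored relative path ("/files/images/image.jpg") to a servable URL or location. */
  urlFor(relativePath: string): string;
}

// Example backend: prepends the configured root, per the root-path-in-config advice above.
export class LocalFileStoreProvider implements FileStoreProvider {
  constructor(private rootPath: string) {}

  async save(relativePath: string, data: Buffer): Promise<void> {
    const full = path.join(this.rootPath, relativePath);
    await fs.mkdir(path.dirname(full), { recursive: true });
    await fs.writeFile(full, data);
  }

  async delete(relativePath: string): Promise<void> {
    await fs.unlink(path.join(this.rootPath, relativePath));
  }

  urlFor(relativePath: string): string {
    return this.rootPath + relativePath;
  }
}

// Switching to S3 or Azure Blob means writing another class that implements
// FileStoreProvider; the rest of the app keeps calling the same interface.
```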
I have a client-server DB; I manage many files (doc, txt, pdf, ...) and all of them go into a FILESTREAM BLOB. Customers have 50+ MB DBs. If you can do the same in Azure, go for it. Having everything in the DB is a wonderful thing. It is considered good policy for Postgres and MySQL as well.