Pulsar supports tiered storage (https://pulsar.apache.org/docs/en/cookbooks-tiered-storage/), which is great. I experimented with this feature and saw that it puts files onto AWS S3, apparently two files per segment, both with the same UUID:
5780710b-393a-49cb-aff7-282901d7e311-ledger-9
5780710b-393a-49cb-aff7-282901d7e311-ledger-9-index
I think this will become a mess later once there are many topics, since I cannot tell from the name which file belongs to which topic. And if I delete the topic using pulsar-admin, the files stay on AWS S3.
Is there a way to manage this?
I have a pipeline process which streams data from a DB, maps it, converts the mapped data into XML, and finally creates the files on Google Cloud.
Think of it this way:
pipeline(
[
db_stream(),
mapJsonStream(),
convertToXmlStream(),
checkIfNewFileIsNeeded(),
// option 1. Add a zipping stream here before writing to the cloud
writeToCloudStream()
callback() // cleanup stuff - option 2. I can re-stream files and zip them here and stream back to cloud bucket
]
)
This will create multiple XML files (there's a limit on each XML file's size, so our data needs to be split across different XML files) in the bucket.
Now I need all these files zipped. I searched for a way to do that directly in the cloud using the Google Node.js client, but it doesn't seem to support this kind of functionality.
I am a bit confused about how to proceed, because after searching a bit I found that the zip libraries I looked at need to build the archive in memory before streaming it, which defeats the purpose of the pipeline I already have (option 1).
Option 2 seems a bit expensive and unnecessary.
If you have any thoughts on this I would really appreciate it. Thanks in advance!
Posting here as Server Fault doesn't seem to have the detailed Azure knowledge.
I have an Azure storage account with a file share. The file share is connected to an Azure VM through a mapped drive. An FTP server on the VM accepts a stream of files and stores them directly in the file share.
There are no other connections. Only I have Azure admin access; a limited number of support people have access to the VM.
Last week, for unknown reasons, 16 million files, nested in many sub-folders (by origin and date), instantly moved into an unrelated subfolder three levels deep.
I'm baffled as to how this can happen. There is a clear, instant cut-off when the files moved.
As a result, I'm seeing increased costs on LRS, I assume because Azure Storage is internally replicating the change at my expense.
I have attempted to copy the files back using a VM and AzCopy. The process crashed midway through, leaving me with a half-completed copy operation. This failed attempt took days, which makes me confident it wasn't the support people dragging a folder by accident.
Questions:
Is it possible to just instantly move so many files (and how)?
Is there a solid way I can move the files back, taking the half-copied files into account? I mean an Azure backend operation rather than writing an app / PowerShell / AzCopy.
Is there a cost-efficient way of doing this (I'm on the Transaction Optimised tier)?
Do I have a case to get Microsoft to do something? We didn't move the files, so I assume something messed up internally.
Thanks
A tool that supports server-side copy (like AzCopy) can move the files quickly because only the metadata is updated, which would explain an instant move. If you want to investigate the root cause, your best bet is to open a support case; the Azure support team can help with this on a best-effort basis.
I programmed an API with Node.js and Express, like a million others out there, and it will go live in a few weeks. The API currently runs on a single Docker host with one volume for persistent data, containing images uploaded by users.
I'm now thinking about scalability and a high-availability setup, which is where the question about network volumes comes in. I've read a lot about NFS volumes and potentially the S3 driver for a Docker swarm.
From the information I gathered, I sorted out two possible solutions for the swarm setup:
Docker Volume Driver
I could connect each Docker host to either an S3 bucket or EFS storage via the compose file
The connection should keep working even if I move VPS providers
Better security if I put NFS storage on the private part of the network (no S3 or EFS)
API Multer S3 Middleware
No attached volume required, since the S3 connection is made from within the container
Easier swarm and Docker management
Things have to be re-programmed and a few files need to be migrated
On a GET request, the files will be served by AWS directly instead of by the API
Please tell me your opinion on this. Am I getting this right, or am I missing something? Which route should I take? Is there anything to consider regarding latency or permissions when mounting from different hosts?
Tips on S3 and EFS are definitely welcome, since I have no knowledge of them yet.
I would not recommend saving to disk; use the S3 API directly instead: create buckets and write objects in your app code.
If you're thinking of mounting a single S3 bucket as your drive, there are severe limitations with that: the 5 GB limit, and any time you modify contents in any way the driver re-uploads the entire file. If there's any contention, it has to retry. Years ago when I tried this, the FUSE drivers weren't stable enough to use as part of a production system; they'd crash and you'd have to remount. It was a nice idea, but only usable as an ad hoc thing on the command line.
As for NFS: for the love of god, don't do this to yourself; you'd be taking on responsibility for running it yourself.
On EFS I can't really comment; by the time it was available, most people had already learned to use S3, and S3 is cheaper.
I am trying to sync files from a local source to an S3 bucket, uploading each file after calculating its MD5 checksum and putting the checksum in the file's metadata. To avoid duplicate uploads, I also check the files already at the destination: I build a list of files to upload that match on neither name nor MD5. Fetching the metadata for the S3 files, computing the MD5 of the local files on the fly, and matching them takes a lot of time, as I have around 200,000 to 500,000 files to match.
Is there a better way to achieve this, either with multithreading or anything else? I don't have much idea how to do this in a multithreaded environment, as I eventually need one list, with multiple threads doing the processing and adding to that same list. Any code sample or help is much appreciated.
This Windows job application is written in C#, using the .NET Framework 4.6.1.
You could use the AWS Command-Line Interface (CLI), which has an aws s3 sync command that does something very similar to what you describe. However, with several hundred thousand files, it is going to be slow at the matching, too.
Or, you could use Amazon S3 Inventory to obtain a daily listing of the files in the S3 bucket (including the MD5 checksum) and then compare your files against that.
In my AWS S3 bucket I have thousands of MP3 files, and I want to modify the ID3 tags of those files. Please suggest the best way.
Sorry to give you the bad news, but the only way to do this is to download the files one by one, update the ID3 tags, and upload them back to the S3 bucket. You cannot edit files in place, because AWS S3 is object storage, meaning it keeps data as key-value pairs: the key is the folder/filename, the value is the file content. It's not suitable for a file system, databases, etc.
If you do it this way, one warning: check whether versioning is on or off for your bucket. It's sometimes nice to have versioning handled automatically by S3, but remember that each version adds to the storage space you're paying for.
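The retag step itself is simple once a file is local. As an illustration, a minimal sketch using the legacy ID3v1 format, which is just a fixed 128-byte block at the end of the file (in practice a tagging library would handle the richer ID3v2 format; `writeId3v1` is a made-up helper name):

```javascript
// Build a 128-byte ID3v1 tag (the layout is "TAG" + 30-byte title +
// 30-byte artist + 30-byte album + 4-byte year + 30-byte comment +
// 1-byte genre) and append it to an mp3 buffer, replacing any
// existing ID3v1 tag. Only the in-memory edit step is shown; the
// download/re-upload loop around S3 is omitted.
function writeId3v1(mp3, { title = '', artist = '', album = '' } = {}) {
  const pad = (s, n) => Buffer.from(s.slice(0, n).padEnd(n, '\0'), 'latin1');
  const tag = Buffer.concat([
    Buffer.from('TAG', 'latin1'),
    pad(title, 30), pad(artist, 30), pad(album, 30),
    pad('', 4), pad('', 30), Buffer.from([255]), // year, comment, genre (255 = unset)
  ]);
  // If an ID3v1 tag already exists, strip it before appending the new one.
  const hasTag = mp3.length >= 128 &&
    mp3.subarray(mp3.length - 128, mp3.length - 125).toString('latin1') === 'TAG';
  const body = hasTag ? mp3.subarray(0, mp3.length - 128) : mp3;
  return Buffer.concat([body, tag]);
}
```

Since the tag edit only touches bytes at the end of the file, the real cost in this workflow is the transfer: every object must round-trip through wherever this code runs.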
If you want to edit/modify your files every now and then, you can use AWS EBS or EFS instead. EBS is block storage and EFS is a network file system; you can attach them to EC2 instances and then edit/modify your files there. The main difference between them is that EFS can be attached to multiple EC2 instances at the same time, sharing the files between them, while an EBS volume attaches to a single instance.
One more thing about EBS and EFS, though: to reach your files, you need to attach the storage to an EC2 instance. There is no way to reach your files as easily as in S3.