I have a small application that I am running on a 'EMR' Cluster with 3 nodes. I have a few gigabytes of csv files that are split across multiple files. The application reads the csv files and then converts into '.orc' files. I have a small program that sequentially and synchronously sends limited (less than ten) files as input to the application.
My problem is, after sometime, the cluster is going down eventually without leaving any trace (or may be I am looking at wrong places). After trying to find out various options, I observed in 'ganglia' that the dfs.FSNameSystem.BlockCapacity is reducing eventually.
Is this because of the application or is it with the server configuration? Can someone please share if you have any exposure on this?
Related
Looking for a solid way to limit the size of Kismet's database files (*.kismet) through the conf files located in /etc/kismet/. The version of Kismet I'm currently using is 2021-08-R1.
The end state would be to limit the file size (10MB for example) or after X minutes of logging the database is written to and closed. Then, a new database is created, connected, and starts getting written to. This process would continue until Kismet is killed. This way, rather than having one large database, there will be multiple smaller ones.
In the kismet_logging.conf file there are some timeout options, but that's for expunging old entries in the logs. I want to preserve everything that's being captured, but break the logs into segments as the capture process is being performed.
I'd appreciate anyone's input on how to do this either through configuration settings (some that perhaps don't exist natively in the conf files by default?) or through plugins, or anything else. Thanks in advance!
Two interesting ways:
One could let the old entries be taken out, but reach in with SQL and extract what you wanted as a time-bound query.
A second way would be to automate the restarting of kismet... which is a little less elegant.. but seems to work.
https://magazine.odroid.com/article/home-assistant-tracking-people-with-wi-fi-using-kismet/
If you read that article carefully... there are lots of bits if interesting information here.
I am working with some legacy systems in investment banking domain, which are very unfriendly in the sense that, only way to extract data from them is through a file export/import. Lots of trading takes place and large number of transactions are stored on these system.
Q is how to read large number of large files on NFS and dump it on a system on which analytics can be done by something like Spark or Samza.
Back to issue. Due nature of legacy systems, we are extracting data and dumping into files. Each file is in hundreds of gigabyte size.
I feel next step is to read these and dump to Kafka or HDFS, or maybe even Cassandra or HBase. Reason being I need to run some financial analytics on this data. I have two questions:
How to efficiently read large number of large files which are located on one or numerous machines
Apparently you've discovered already that mainframes are good at writing large numbers of large files. They're good at reading them too. But that aside...
IBM has been pushing hard on Spark on z/OS recently. It's available for free, although if you want support, you have to pay for that. See: https://www-03.ibm.com/systems/z/os/zos/apache-spark.html My understanding is that z/OS can be a peer with other machines in a Spark cluster.
The z/OS Spark implementation comes with a piece that can read data directly from all sorts of mainframe sources: sequential, VSAM, DB2, etc. It might allow you to bypass the whole dump process and read the data directly from the source.
Apparently Hadoop is written in Java, so one would expect that it should be able to run on z/OS with little problem. However, watch out for ASCII vs. EBCDIC issues.
On the topic of using Hadoop with z/OS, there's a number of references out there, including a red piece: http://www.redbooks.ibm.com/redpapers/pdfs/redp5142.pdf
You'll note that in there they make mention of using the CO:z toolkit, which I believe is available for free.
However you mention "unfriendly". I'm not sure if that means "I don't understand this environment as it doesn't look like anything I've used before" or it means "the people I'm working with don't want to help me". I'll assume something like the latter since the former is simply a learning opportunity. Unfortunately, you're probably going to have a tough time getting the unfriendly people to get anything new up and running on z/OS.
But in the end, it may be best to try to make friends with those unfriendly z/OS admins as they likely can make your life easier.
Finally, I'm not sure what analytics you're planning on doing with the data. But in some cases it may be easier/better to move the analytics process to the data instead of moving the data to the analytics.
The simplest way to do it better is zconnector, a ibm product for data ingestion between mainframe to hadoop cluster.
I managed to find an answer. The biggest bottleneck is that reading files is essentially a serial operation.. that is the most efficient way to read from a disk. So for one file I am stuck with a single thread reading it from NFS and sending it to HDFS or Kafka via their APIs.
So it appears best way is to make sure that the source from where data is coming dumps files in multiple NFS folders. That point onward I can run multiple processes to load data to HDFS or Kafka since they are highly parallelized.
How to load? One good way is to mount NFS into Hadoop infrastructure and use distcp. There are other possiblities too which open up once we make sure files are available from large number of NFS. Otherwise remember, reading file is a serial operation. Thanks.
I need to write a Spark app that uses temporary files.
I need to download many many large files, read them with some legacy code, do some processing, delete the file, and write the results to a database.
The files are on S3 and take a long time to download. However, I can do many at once, so I want to download a large number in parallel. The legacy code reads from the file system.
I think I can not avoid creating temporary files. What are the rules about Spark code reading and writing local files?
This must be a common issue, but I haven't found any threads or docs that talk about it. Can someone give me a pointer?
Many thanks
P
I'm ingesting very large files into Cassandra 2.0 and I'm noticing that my ingest rate into Cassandra will be x3 slower than the rate at which I'm getting new files to ingest. Given that, and trying to avoid memory problems, what are my options for keeping up with ingest?
I was initially thinking that I could have multiple clients writing, possibly each to a different "seed" node in the cluster. If I am careful about not accessing the same file twice will that cause problems with the node I/O? What is the best way to go about doing this? Based on google searches I have seen things like batch driver statements can help, but I'm reading in CSV files which need to be cleaned first...
Hi i have a spec for fetch the files from server and predict the un-used files from the directory in this situation i am going to fetch the files from server it will return huge files, the problem is the cpu usage will increase while i am fetching large files, so i like to eliminate this scenario. can any one knows how to avoid this situation please share with me though it might help full for me.
Thanks
You can split your large file on server into several smaller pieces and fetch some metadata about amount of pieces, size etc. and than fetch them one by one from your client c# code and join pieces in binary mode to your larger file.