I'm a newbie with StreamSets and Kudu, and I'm trying several solutions to reach my goal:
I have a folder containing some Avro files, and these files need to be processed and then sent to a Kudu schema.
https://i.stack.imgur.com/l5Yf9.jpg
When using an Avro file containing a couple hundred records everything works fine, but when the number of records increases to 16k this error is shown:
Caused by: org.apache.kudu.client.NonRecoverableException:
MANUAL_FLUSH is enabled but the buffer is too big.
I've searched through all the available configurations in both StreamSets and Kudu, and the only solution I was able to apply was editing the Java source code and deleting a single line that switched from the default flush mode to the manual one. This works, but it's not an optimal solution because it requires editing and recompiling that file on every new machine I want to use it on.
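For reference, this is roughly what the flush-mode difference looks like in the Kudu Python client (a hedged sketch only, not the StreamSets destination code; the host, table and column names are placeholders, and the flush_mode argument of new_session is assumed from the kudu-python documentation):

import kudu

# Connect to the Kudu master (placeholder host/port).
client = kudu.connect(host="kudu-master.example.com", port=7051)
table = client.table("my_table")  # placeholder table name

# MANUAL_FLUSH buffers every write until flush() is called, so a 16k-row batch
# can overflow the session buffer. A background (auto) flush mode drains the
# buffer as it fills instead.
session = client.new_session(flush_mode="background")  # instead of "manual"

for i in range(16000):
    session.apply(table.new_insert({"id": i, "value": "row-%d" % i}))

session.flush()  # drain whatever is still buffered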
Does anyone know how to prevent this from happening?
Thanks in advance!
Looking for a solid way to limit the size of Kismet's database files (*.kismet) through the conf files located in /etc/kismet/. The version of Kismet I'm currently using is 2021-08-R1.
The end state would be to either limit the file size (to 10MB, for example) or have the database written to and closed after X minutes of logging. Then a new database is created, connected, and starts getting written to. This process would continue until Kismet is killed. This way, rather than having one large database, there will be multiple smaller ones.
In the kismet_logging.conf file there are some timeout options, but that's for expunging old entries in the logs. I want to preserve everything that's being captured, but break the logs into segments as the capture process is being performed.
I'd appreciate anyone's input on how to do this either through configuration settings (some that perhaps don't exist natively in the conf files by default?) or through plugins, or anything else. Thanks in advance!
Two interesting ways:
One could let the old entries be expired as usual, but reach in with SQL first and extract what you want as a time-bound query (see the sketch below).
A second way would be to automate the restarting of Kismet... which is a little less elegant, but seems to work.
https://magazine.odroid.com/article/home-assistant-tracking-people-with-wi-fi-using-kismet/
If you read that article carefully, there are lots of bits of interesting information in it.
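As a hedged sketch of the first approach, assuming the .kismet log is a SQLite database whose packets table carries a ts_sec column (which is what recent Kismet schemas use; adjust the path, table and column names for your version):

import sqlite3
import time

SRC = "/var/log/kismet/Kismet-example.kismet"  # placeholder path to a .kismet log
since = int(time.time()) - 10 * 60             # e.g. everything from the last 10 minutes

con = sqlite3.connect(SRC)
rows = con.execute(
    "SELECT ts_sec, sourcemac, destmac, signal FROM packets WHERE ts_sec >= ?",
    (since,),
).fetchall()
con.close()

print(len(rows), "packets captured since", since)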
I just ran this:
dbutils.fs.ls("dbfs:/FileStore/")
I see this result:
[FileInfo(path='dbfs:/FileStore/import-stage/', name='import-stage/', size=0),
FileInfo(path='dbfs:/FileStore/jars/', name='jars/', size=0),
FileInfo(path='dbfs:/FileStore/job-jars/', name='job-jars/', size=0),
FileInfo(path='dbfs:/FileStore/plots/', name='plots/', size=0),
FileInfo(path='dbfs:/FileStore/tables/', name='tables/', size=0)]
Shouldn't there be something in filestore? I have hundreds of GB of data in a lake. I am having all kinds of problems getting Databricks to find these files. When I use Azure Data Factory, everything works perfectly fine. It's starting to drive me crazy!
For instance, when I run this:
dbutils.fs.ls("/mnt/rawdata/2019/06/28/parent/")
I get this message:
java.io.FileNotFoundException: File /6199764716474501/mnt/rawdata/2019/06/28/parent does not exist.
I have tens of thousands of files in my lake! I can't understand why I can't get a list of these files!
In Azure Databricks, this is expected behaviour.
For files, it displays the actual file size.
For directories, it displays size=0.
Example: in dbfs:/FileStore/ I have three files and three folders (shown in white and blue respectively in the DBFS file browser). Checking their sizes using the Databricks CLI:
dbfs ls -l dbfs:/FileStore/
And when you check the same path using dbutils:
dbutils.fs.ls("dbfs:/FileStore/")
An important point to remember when reading files larger than 2GB:
The local file I/O APIs support only files less than 2GB in size. If you use local file I/O APIs to read or write files larger than 2GB you might see corrupted files. Instead, access files larger than 2GB using the DBFS CLI, dbutils.fs, or Spark APIs, or use the /dbfs/ml folder described in Local file APIs for deep learning.
If you write a file using the local file I/O APIs and then immediately try to access it using the DBFS CLI, dbutils.fs, or Spark APIs, you might encounter a FileNotFoundException, a file of size 0, or stale file contents. That is expected because the OS caches writes by default. To force those writes to be flushed to persistent storage (in our case DBFS), use the standard Unix system call sync.
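As a minimal illustration of that sync note (the path below is a placeholder):

import os
import subprocess

# Write through the /dbfs FUSE mount using local file I/O.
with open("/dbfs/tmp/example.txt", "w") as f:
    f.write("hello")
    f.flush()
    os.fsync(f.fileno())

# Flush OS-cached writes to persistent storage (DBFS) before reading back.
subprocess.run(["sync"], check=True)

dbutils.fs.head("dbfs:/tmp/example.txt")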
There are multiple ways to solve this issue. You may check out a similar SO thread answered by me.
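For the FileNotFoundException part of the question, one quick thing to check is whether /mnt/rawdata is actually mounted in this workspace. A hedged sketch (the storage account, container and secret names are placeholders):

mounts = [m.mountPoint for m in dbutils.fs.mounts()]
print(mounts)  # list the mount points this workspace knows about

if "/mnt/rawdata" not in mounts:
    dbutils.fs.mount(
        source="wasbs://<container>@<storage-account>.blob.core.windows.net",
        mount_point="/mnt/rawdata",
        extra_configs={
            "fs.azure.account.key.<storage-account>.blob.core.windows.net":
                dbutils.secrets.get(scope="<scope>", key="<key-name>")
        },
    )

dbutils.fs.ls("/mnt/rawdata/2019/06/28/parent/")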
Hope this helps.
A very simple question...
I have downloaded a very large .csv file (around 3.7 GB) and now wish to open it, but Excel can't seem to manage this.
How do I open this file, please?
Clearly I am missing a trick!
Please help!
There are a number of other Stack Overflow questions addressing this problem, such as:
Excel CSV. file with more than 1,048,576 rows of data
The bottom line is that you're getting into database territory with that sort of size. The best solution I've found is BigQuery on Google Cloud Platform. It's super cheap, astonishingly fast, and it automatically detects schemas on most CSVs. The downside is you'll have to learn SQL to do even the simplest things with the data.
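A hedged sketch of that route using the google-cloud-bigquery Python client (project, dataset, table and file names are placeholders, and it assumes credentials are already configured):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,       # let BigQuery infer the schema
    skip_leading_rows=1,   # skip the CSV header row
)

with open("big_file.csv", "rb") as f:
    load_job = client.load_table_from_file(
        f, "my-project.my_dataset.my_table", job_config=job_config
    )
load_job.result()  # wait for the load to finish

# Then query it with SQL.
rows = client.query(
    "SELECT COUNT(*) AS n FROM `my-project.my_dataset.my_table`"
).result()
print(list(rows))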
Can you not tell Excel to only "open" the file with the first 10 lines?
This would allow you to inspect the format and then use some database functions on the contents.
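If Excel won't cooperate, a quick alternative for peeking at the format is a few lines of Python with pandas (the filename is a placeholder):

import pandas as pd

preview = pd.read_csv("big_file.csv", nrows=10)  # read only the first 10 rows
print(preview)
print(preview.dtypes)  # see what column types were inferred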
Another thing that can affect whether you can open a large file like this is the resources and capacity of your computer. That's a huge file, and you need a lot of on-disk swap space (page file, in Windows terms) plus memory to open a file of that size. So one thing you can do is find another computer with more memory and resources, or increase the swap space on your own machine. If you're on Windows, just google how to increase your page file.
This is a common problem. The typical solutions are:
Insert your CSV file into a SQL database such as MySQL or PostgreSQL.
Process your data using Python or R.
Find a data hub for your data. For example, Acho Studio.
The problem with solution one is that you'll have to design a table schema and find a server to host the database, and you also need to write server-side code to maintain or change the database. The problem with Python or R is that processing GBs of data will put a lot of stress on your local computer (streaming the file in chunks helps; see the sketch below). A data hub is much easier, but its costs may vary.
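As a rough sketch of combining options 1 and 2 above, you can stream the CSV through pandas in chunks into a local SQLite database so nothing has to fit in memory at once (file and table names are placeholders):

import sqlite3
import pandas as pd

con = sqlite3.connect("big_file.db")
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    chunk.to_sql("data", con, if_exists="append", index=False)  # append each chunk
con.close()

# Afterwards, query it like any other database.
con = sqlite3.connect("big_file.db")
print(pd.read_sql_query("SELECT COUNT(*) AS row_count FROM data", con))
con.close()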
I have a huge 20GB CSV file to copy into Cassandra, and of course I need to handle the case of errors (if the server or the transfer/load application crashes).
I need to be able to restart the processing (on the same node or another one) and continue the transfer without starting the CSV file from the beginning.
What is the best and easiest way to do that?
Using the cqlsh COPY command? Using Flume or Sqoop? Or a native Java application, or Spark...?
Thanks a lot!
If it was me, I would split the file.
I would pick a preferred way to load any csv data in, ignoring the issues of huge file size and error handling. For example, I would use a python script and the native driver and test it with a few lines of csv to see that it can insert from a tiny csv file with real data.
Then I would write a script to split the file into manageable sized chunks, however you define it. I would try a few chunk sizes to get a file size that loads in about a minute. Maybe you will need hundreds of chunks for 20 GB, but probably not thousands.
Then I would split the whole file into chunks of that size and loop over the chunks, logging how it is going. On an error of any kind, fix the problem and just start loading again from the last chunk that loaded successfully as found in the log file.
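A hedged sketch of that chunk-and-resume idea using the DataStax Python driver (pip install cassandra-driver); it tracks progress in a log file rather than physically splitting the CSV, and the contact point, keyspace, table and column names are placeholders you would adapt to your schema:

import csv
from cassandra.cluster import Cluster

CHUNK_SIZE = 100_000            # rows per chunk; tune so one chunk loads in about a minute
PROGRESS_FILE = "progress.log"  # remembers the last fully loaded chunk

def last_completed_chunk():
    try:
        with open(PROGRESS_FILE) as f:
            return int(f.read().strip() or -1)
    except FileNotFoundError:
        return -1

cluster = Cluster(["127.0.0.1"])             # placeholder contact point
session = cluster.connect("my_keyspace")     # placeholder keyspace
insert = session.prepare(
    "INSERT INTO my_table (id, col1, col2) VALUES (?, ?, ?)"
)

start_chunk = last_completed_chunk() + 1
with open("huge.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                             # skip the header row
    chunk_no, rows_in_chunk = 0, 0
    for row in reader:
        if chunk_no >= start_chunk:          # skip chunks already loaded on a previous run
            session.execute(insert, (row[0], row[1], row[2]))
        rows_in_chunk += 1
        if rows_in_chunk == CHUNK_SIZE:
            if chunk_no >= start_chunk:
                with open(PROGRESS_FILE, "w") as p:
                    p.write(str(chunk_no))   # chunk finished; record it
            chunk_no, rows_in_chunk = chunk_no + 1, 0
cluster.shutdown()

Since Cassandra writes are upserts, replaying part of an unfinished chunk after a crash is harmless as long as the CSV carries the primary key.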
Here are two considerations that I would try first, since they are simple and well contained:
cqlsh COPY has been vastly improved in 2.1.13, 2.2.5, 3.0.3 and 3.2+. If you do consider using it, make sure to be at one of those versions or newer.
Another option is to use Brian Hess' cassandra-loader, which is an effective way of bulk loading to and from CSV files.
I think cqlsh COPY doesn't handle the case of an application crash, so why not use both of the solutions described above: split the file into several manageable chunks and use the cqlsh COPY command to import the data?
Could anyone tell me the maximum size (number of rows or file size) of a CSV file that can be loaded efficiently into Cassandra using the COPY command? Is there a limit? If so, is it a good idea to break the file down into multiple smaller files and load those, or is there a better option? Many thanks.
I've run into this issue before... At least for me, there was no clear statement of the maximum size in any DataStax or Apache documentation. Basically, it may just be limited by your PC/server/cluster resources (e.g. CPU and memory).
However, in an article by jgong found here it is stated that you can import up to 10MB. For me it was something around 8.5MB. In the docs for Cassandra 1.2 here it's stated that you can import a few million rows and that you should use the bulk loader for anything heavier.
All in all, I suggest importing via multiple CSV files (just don't make them so small that you're constantly opening/closing files) so that you can keep a handle on the data being imported and find errors more easily. It can happen that after waiting an hour for a file to load, it fails and you have to start over, whereas with multiple files you don't need to start over on the ones that have already been imported successfully. Not to mention duplicate key errors.
Check out CASSANDRA-9303 and CASSANDRA-9302,
and check out Brian's cassandra-loader:
https://github.com/brianmhess/cassandra-loader