ArangoDB - slow cursors

Hi, I've got a simple collection with 40k records in it. It's just an import of a CSV (c. 4 MB), so it has a consistent object shape per document, and it's for an Open Data portal.
I need to be able to offer a full download of the data as well as the capabilities of AQL for querying, grouping, aggregating etc.
If I set batchSize to the full dataset, it takes around 50 seconds to return and is unsurprisingly about 12 MB, due to the repeated column names.
eg
{"query":"for x in dataset return x","batchSize":50000}
I've tried caching and balancing between a larger batchSize and using the cursor to build the whole dataset, but I can't get the response time down very much.
Today I came across the ATTRIBUTES and VALUES functions and created this AQL statement:
{"query":"return union(
  (for x in dataset limit 1 return attributes(x,true)),
  (for x in dataset return values(x,true)))","batchSize":50000}
It will mean I have to unparse the object, but I use PapaParse so that should be no issue (not yet proven).
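To make the round-trip concrete, here is a minimal Python sketch (the thread's client would be JavaScript with PapaParse; Python and the sample data are purely illustrative). It assumes every document has the same attribute order, which holds here because the collection is a consistent CSV import:

```python
import csv
import io

# Hypothetical cursor result for the union query: the first element is the
# attribute names (from ATTRIBUTES), the rest are per-document value arrays
# (from VALUES).
result = [
    ["city", "population", "year"],
    ["Leeds", 793139, 2019],
    ["York", 210618, 2019],
]

header, rows = result[0], result[1:]

# Rebuild the original documents...
records = [dict(zip(header, row)) for row in rows]

# ...or write the CSV download directly.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
writer.writerows(rows)
csv_text = buf.getvalue()
```

Note that this only stays correct as long as VALUES returns values in the same order for every document, so a schema-consistent collection is a prerequisite.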
Is this the best / only way to have an option to output the full csv and still have a response that performs well?
I am trying to avoid storing the data multiple times, e.g. once as raw CSV and once as data in a collection. I guess there may be datasets too big to cope with this approach, but this is one of our bigger ones.
Thanks

Related

Ran out of memory when trying to process CSV files using PySpark and Python

I don't know which part of the code I should share, since what I do is basically as below (I'll share a simplified outline of the code for reference):
Task: I need to search for File A and then match the values in File A with column values in File B (File B is actually more than 100 CSV files, each containing more than 1 million rows); after matching, combine the results into a single CSV.
Extract the column values from File A and turn them into a list of values.
Load File B in PySpark, then use .isin to match against File A's list of values.
Concatenate the results into a single CSV file.
import glob
import pandas as pd

first = pd.read_excel("fileA.xlsx")
list_values = first[first["columnA"].apply(isinstance, args=(int,))]["columnA"].values.tolist()

combine = []
for file in glob.glob("directory/*.csv"):  # this will loop at least 100 times
    second = spark.read.csv(file, header=True)
    # keep only rows whose columnB appears in the reference list;
    # more than hundreds of thousands of rows are expected to match
    second = second.filter(second["columnB"].isin(list_values))
    combine.append(second.toPandas())
total = pd.concat(combine)
Error after 30 hours of running time:
UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
Is there a way to perform this task better? Currently, it takes more than 30 hours just to run the code, and it ends in failure with the above error. Is there something like parallel programming that could speed up the process, or a way to clear the above error?
Also, when I test it by running only 2 CSV files, it takes less than a minute to complete, but when I loop over the whole folder of 100 files, it takes more than 30 hours.
There are several things I think you can try to optimize, given that your configuration and resources are unchanged:
Repartition when you read your CSV. I haven't studied the source code of how Spark reads CSV, but based on my experience / cases on SO, when you use Spark to read a CSV, all the data may end up in a single partition, which can cause a Java OOM error and also doesn't fully utilize your resources. Check the partitioning of the data and make sure there is no data skew before you do any transformations or actions.
Rethink how to filter based on another dataframe's column values. In your current approach, you use a Python list to collect and store the reference values, and then use .isin() to check whether the main dataframe's column contains a value from this reference list. If your reference list is very long, having EACH ROW scan the whole reference list is definitely a high cost. Instead, try a left-semi .join() to achieve the same goal. And since the reference dataset is small and you want to prevent data shuffling, you can use broadcast to copy the reference dataframe to every single node.
If you can achieve it in Spark SQL, don't do it in pandas. In your last step, you're trying to concat all the data after filtering. You can achieve the same goal with .unionAll() or .unionByName(). Even if you call pd.concat() inside the Spark session, all the pandas operations run on the driver node and are not distributed; therefore, it can cause a Java OOM error and degrade performance too.

Azure search index storage size stops at 8MB

I am trying to ingest around 13k JSON documents into Azure Search, but the index stops at around 6k documents without any error from the indexer; the index storage size is 7.96 MB and it doesn't surpass this limit no matter what.
I have tried using smaller batches of 3k per indexer run, and after that 1k per run, but I got the same result.
In my JSON I have around 10 simple fields and 20 complex fields (which have other nested complex fields, up to 5 levels deep).
Do you have any idea if there is a size limit per index? And where can I set it up?
As for the SLA, I think we are using the S1 plan (based on the limits we have - 50 indexers, and so on).
Thanks
Really hard to help without seeing it, but I remember facing a problem like this in the past. In my case, it was a problem of duplicated values in the key field.
I also recommend smaller batches (~500 documents).
PS: Check whether your nested JSONs are too big (in case they're marked as retrievable).
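Two small Python sketches of the advice above (field names are hypothetical; "id" is assumed to be the index's key field). The first shows why duplicate keys can make an ingest of 13k documents settle at a smaller count: documents sharing a key overwrite each other. The second splits documents into push batches of ~500 using the payload shape of the Azure Search REST "index documents" operation:

```python
def dedupe_by_key(docs, key="id"):
    # Later documents with the same key silently replace earlier ones in the
    # index, so counting unique keys shows how many documents will survive.
    return list({d[key]: d for d in docs}.values())

def make_batches(docs, batch_size=500):
    # Each yielded dict is one request body for the "index documents" call.
    for i in range(0, len(docs), batch_size):
        yield {
            "value": [
                {"@search.action": "mergeOrUpload", **doc}
                for doc in docs[i:i + batch_size]
            ]
        }

# 1200 documents but only 1000 distinct keys - 200 would be overwritten.
docs = [{"id": str(n % 1000), "title": f"doc {n}"} for n in range(1200)]
unique = dedupe_by_key(docs)
batches = list(make_batches(unique))
```

If len(unique) comes out well below your document count, the "stuck" index size is explained before you touch batch sizes at all.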

Spark: How collect large amount of data without out of memory

I have the following issue:
I do a sql query over a set of parquet files on HDFS and then I do a collect in order to get the result.
The problem is that when there are many rows I get an out of memory error.
This query requires shuffling, so I cannot run it on each file separately.
One solution could be to iterate over the values of a column and save the result on disk:
df = sql('original query goes here')
// data = collect(df)  <- out of memory
createOrReplaceTempView(df, 't')
for each c in cities:
    x = collect(sql("select * from t where city = c"))
    append x to file
As far as I know, this will make the program take too much time, because the query will be executed once per city.
What is the best way of doing this?
If it's running out of memory, that means the output data is really very large, so
you can write the results into a file, such as a Parquet file.
If you want to perform further operations on this collected data, you can read the data back from that file.
For large datasets we should not use collect(); instead, use take(100) or take(some_integer) to check that some values are correct.
As @cricket_007 said, I would not collect() your data from Spark to append it to a file in R.
Additionally, it doesn't make sense to iterate over a list of SparkR::distinct() cities and then select everything from those tables just to append them to some output dataset. The only time you would want to do that is if you are trying to do another operation within each group based on some conditional logic, or to apply a function to each group that is NOT available in SparkR.
I think you are trying to get a data frame (either Spark or R) with observations grouped in a way so that when you look at them, everything is pretty. To do that, add a GROUP BY city clause to your first SQL query. From there, just write the data back out to HDFS or some other output directory. From what I understand about your question, maybe doing something like this will help:
sdf <- SparkR::sql('SELECT SOME GREAT QUERY FROM TABLE GROUP BY city')
SparkR::write.parquet(sdf, path="path/to/desired/output/location", mode="append")
This will give you all your data in one file, and it should be grouped by city, which is what I think you are trying to get with your second query in your question.
You can confirm the output is what you want via:
newsdf <- SparkR::read.parquet(path="path/to/desired/output/location/")
View(head(newsdf, num=200))
Good luck, hopefully this helps.
Since your data is huge, it is no longer feasible to collect() it. Instead, you can sample the data and learn from the sampled data.
import numpy as np
arr = np.array(sdf.select("col_name").sample(False, 0.5, seed=42).collect())
Here you are sampling 50% of the data and just a single column.

ArangoDB - arangoimp on csv files is very slow on large datasets

I am new to ArangoDB. I'm trying to import some of my data from Neo4j into ArangoDB.
I am trying to add millions of nodes and edges to store playlist data for various people. I have the CSV files exported from Neo4j. I ran a script to change the format of the node CSV files to have a _key attribute, and the edge files to have _to and _from attributes.
When I tried this on a very small dataset, things worked perfectly and I could see the graph on the UI and perform queries. Bingo!
Now, I am trying to add millions of rows of data (each arangoimp batch imports a CSV with about 100,000 rows). Each batch covers 5 collections (a different CSV file for each).
After about 7-8 batches of such data, the system all of a sudden gets very slow, unresponsive and throws the following errors:
ERROR error message: failed with error: corrupted collection
This just randomly comes up for any batch, even though the format of the data is exactly the same as in the previous batches.
ERROR Could not connect to endpoint 'tcp://127.0.0.1:8529', database: '_system', username: 'root'
FATAL got error from server: HTTP 401 (Unauthorized)'
Otherwise it just keeps processing for hours with barely any progress
I'm guessing all of this has to do with the large number of imports. Some posts said that maybe I have too many file descriptors, but I'm not sure how to handle it.
Another thing I notice is that the biggest of the 5 collections is the one that mostly gets the errors (although the others do too). Are file descriptors specific to a certain collection, even across different import statements?
Could someone please point me in the right direction? I'm not sure how to begin debugging the problem.
Thank you in advance
The problem here is that the server must not be overrun in terms of available disk I/O. The situation may benefit from more available RAM.
The system also has to maintain indices while importing, which increases complexity with the number of documents in the collections.
With ArangoDB 3.4 we have improved arangoimp to maximize throughput without maxing out the server, which should resolve this situation and remove the necessity to split the import data into chunks.
However, as before, the CSV format needs to be prepared; JSONL is also supported.
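As a small illustration of that preparation step (column names are hypothetical), converting an exported node CSV to JSONL while adding the _key attribute needs only the standard library:

```python
import csv
import io
import json

# Hypothetical node CSV exported from Neo4j; the first column becomes _key.
node_csv = """id,name
n1,Alice
n2,Bob
"""

lines = []
for row in csv.DictReader(io.StringIO(node_csv)):
    # Promote the exported id to ArangoDB's _key attribute.
    doc = {"_key": row.pop("id"), **row}
    lines.append(json.dumps(doc))

jsonl = "\n".join(lines)
```

Edge files would be handled the same way, mapping the exported start/end ids to _from and _to values of the form collection/_key.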

Why is search performance slow for about 1M documents - how to scale the application?

I have created a search project based on Lucene 4.5.1.
There are about 1 million documents, each a few KB, and I index them with the fields docname (stored), lastmodified, and content. The overall size of the index folder is about 1.7 GB.
I used one document (the original one) as a sample and queried the content of that document against the index. The problem now is that each query result comes back slowly. After some tests, I found that my queries are too large even though I removed stopwords, but I have no idea how to reduce the query string size. Plus, the smaller the query string is, the less accurate the results become.
This is not limited to a specific file; I also tested with other original files, and search performance is relatively slow (often 1-8 seconds).
Also, I tried copying the entire index directory to a RAMDirectory for searching; that didn't help.
In addition, I have only one IndexSearcher shared across multiple threads, but in testing I used only one thread as a benchmark; the expected response time should be a few ms.
So, how can I improve search performance in this case?
Hint: I'm retrieving the top 1000 hits.
If the number of fields is large, a nice solution is to not store them individually, and instead serialize the whole object into a single binary field.
The plus is that when projecting the object back out after a query, it's a single field rather than many. getField(name) iterates over the entire field set, so lookups average O(n/2) each, and then you still have to read the values and set the fields. With one field, you just deserialize.
Second, it might be worth looking at something like a MoreLikeThis query. See https://stackoverflow.com/a/7657757/277700
