PostgreSQL - IN clause optimization for more than 3000 values - Excel

I have an application where the user uploads an Excel file (.xlsx or .csv) with more than 10,000 rows and a single column, "partId", containing the values to look up in the database.
I read the Excel values, store them in a list, and pass that list as a parameter to the Spring Boot JPA repository find method that builds the IN clause query internally:
// Read the uploaded Excel file with Apache POI
InputStream stream = new ByteArrayInputStream(file.getBytes());
Workbook wb = WorkbookFactory.create(stream);
org.apache.poi.ss.usermodel.Sheet sheet = wb.getSheetAt(wb.getActiveSheetIndex());
List<String> vinList = new ArrayList<>();
Iterator<Row> rowIterator = sheet.rowIterator();
while (rowIterator.hasNext()) {
    Row row = rowIterator.next();
    Cell cell = row.getCell(0);                       // the single "partId" column
    System.out.println(cell.getStringCellValue());
    vinList.add(cell.getStringCellValue());
}
//JPA repository method that I used
findByPartIdInAndSecondaryId(List<String> partIds);
I have read in many articles, and experienced in the case above, that an IN query is inefficient for a huge list of values.
How can I optimize the above scenario or write a new optimized query?
Also, please let me know if there is a more efficient way of reading an Excel file than the code snippet above.
It would be very helpful! Thanks in advance!

If the list is truly huge, you will never be lightning fast.
I see several options:
1. Send a query with a large IN list, as you mention in your question.
2. Construct a statement that is a join with a large VALUES clause (a JdbcTemplate sketch for building this from Java follows further down):
SELECT ... FROM mytable
JOIN (VALUES (42), (101), (43), ...) AS tmp(col)
ON mytable.id = tmp.col;
3. Create a temporary table with the values and join with that:
BEGIN;
CREATE TEMP TABLE tmp(col bigint) ON COMMIT DROP;
Then either
COPY tmp FROM STDIN; -- if Spring supports COPY
or
INSERT INTO tmp VALUES (42), (101), (43), ...; -- if not
Then
ANALYZE tmp; -- for good statistics
SELECT ... FROM mytable
JOIN tmp ON mytable.id = tmp.col;
COMMIT; -- drops the temporary table
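From a Spring Boot application, option 3 could be wired up roughly like the sketch below with plain JDBC. This is a sketch under assumptions: the lookup table and column are called mytable and part_id (not from your code), the temp table column is text to match your String part ids, and batched INSERTs stand in for COPY (which would need the PostgreSQL JDBC driver's CopyManager).
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

public class TempTableLookup {

    // Option 3: temp table + batched inserts + ANALYZE + join, all in one
    // transaction so that ON COMMIT DROP cleans up the temp table.
    public List<String> findExistingPartIds(DataSource dataSource, List<String> partIds) throws Exception {
        try (Connection con = dataSource.getConnection()) {
            con.setAutoCommit(false);
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TEMP TABLE tmp(col text) ON COMMIT DROP");
            }
            try (PreparedStatement ps = con.prepareStatement("INSERT INTO tmp VALUES (?)")) {
                for (String id : partIds) {              // batched INSERTs instead of COPY
                    ps.setString(1, id);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            List<String> found = new ArrayList<>();
            try (Statement st = con.createStatement()) {
                st.execute("ANALYZE tmp");               // good statistics for the planner
                try (ResultSet rs = st.executeQuery(
                        "SELECT m.part_id FROM mytable m JOIN tmp ON m.part_id = tmp.col")) {
                    while (rs.next()) {
                        found.add(rs.getString(1));
                    }
                }
            }
            con.commit();                                // drops the temporary table
            return found;
        }
    }
}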
Which of these is fastest is best determined by trial and error for your case; I don't think that it can be said that one of the methods will always beat the others.
Some considerations:
Solutions 1. and 2. may result in very large statements, while solution 3. can be split into smaller chunks.
Solution 3. will very likely be slower unless the list is truly large.
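Going back to option 2, the VALUES join can be built from the uploaded list with Spring's JdbcTemplate. Again, mytable and part_id are assumed names; each id becomes a bind parameter rather than being concatenated into the SQL. Note that the PostgreSQL JDBC driver caps bind parameters per statement at 32767, so extremely long lists would need chunking.
import java.util.List;
import java.util.stream.Collectors;
import org.springframework.jdbc.core.JdbcTemplate;

public class ValuesJoinLookup {

    private final JdbcTemplate jdbcTemplate;     // injected by Spring

    public ValuesJoinLookup(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Builds JOIN (VALUES (?), (?), ...) with one placeholder per part id.
    public List<String> findExistingPartIds(List<String> partIds) {
        String placeholders = partIds.stream()
                .map(id -> "(?)")
                .collect(Collectors.joining(", "));
        String sql = "SELECT m.part_id FROM mytable m "
                + "JOIN (VALUES " + placeholders + ") AS tmp(col) ON m.part_id = tmp.col";
        return jdbcTemplate.queryForList(sql, String.class, partIds.toArray());
    }
}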

Related

Iceberg: How to quickly traverse a very large table

I'm new to Iceberg, and I have a question about querying a big table.
We have a Hive table with a total of 3.6 million records and 120 fields per record, and we want to transfer all the records in this table to other databases, such as PostgreSQL, Kafka, etc.
Currently we do it like this:
Dataset<Row> dataset = connection.client.read().format("iceberg").load("default.table");
// it gets stuck here for a very long time
dataset.foreachPartition(par -> {
    par.forEachRemaining(row -> {
        // ... process the row ...
    });
});
but it can get stuck for a long time in the foreach process.
I also tried the following method; the process does not stay stuck for long, but the traversal speed is very slow, only about 50 records/second.
HiveCatalog hiveCatalog = createHiveCatalog(props);
Table table = hiveCatalog.loadTable(TableIdentifier.of("default.table"));
CloseableIterable<Record> records = IcebergGenerics.read(table).build();
records.forEach(record -> {
    // ... process the record ...
});
Neither of these two ways meets our needs. I would like to ask whether my code needs to be modified, or is there a better way to traverse all the records? Thanks!
In addition to reading row by row, here is another idea.
If your target database can import files directly, try retrieving files from Iceberg and importing them directly to the database.
Example code is as follows:
Iterable<DataFile> files = FindFiles.in(table)
    .inPartition(table.spec(), StaticDataTask.Row.of(1))
    .inPartition(table.spec(), StaticDataTask.Row.of(2))
    .collect();
You can get the file path and format from the DataFile.
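For example, a rough sketch that works on the files iterable from the snippet above, using the DataFile accessors path(), format() and recordCount():
import org.apache.iceberg.DataFile;

public class DataFileLister {
    // Print the location, format and row count of each data file; these can
    // then be handed to the target database's bulk/file import tooling.
    public static void printDataFiles(Iterable<DataFile> files) {
        for (DataFile dataFile : files) {
            System.out.println(dataFile.path()      // physical file location
                    + " (" + dataFile.format()      // PARQUET, ORC or AVRO
                    + ", " + dataFile.recordCount() + " rows)");
        }
    }
}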

Delta table merge on multiple columns

I have a table whose primary key consists of multiple columns, so I need to perform the merge logic on multiple columns:
DeltaTable.forPath(spark, "path")
  .as("data")
  .merge(
    finalDf1.as("updates"),
    "data.column1 = updates.column1 AND data.column2 = updates.column2 AND data.column3 = updates.column3 AND data.column4 = updates.column4 AND data.column5 = updates.column5")
  .whenMatched
  .updateAll()
  .whenNotMatched
  .insertAll()
  .execute()
When I check the data counts, it is not updating as expected.
Could someone help me with this?
Please also try an approach like the one in this example: https://docs.databricks.com/_static/notebooks/merge-in-cdc.html
Create a changes table with additional columns in which you note whether a row is:
new (to be inserted),
old (primary key exists) and nothing has changed, or
old (primary key exists) but other fields need an update,
and then use additional conditions on the merge, for example:
.whenMatched("s.new = true")
.insert()
.whenMatched("s.updated = true")
.updateExpr(Map("key" -> "s.key", "value" -> "s.newValue"))
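Put together with a merge, it could look like this hypothetical sketch (shown with the Java API; the Scala builder is the same). The single key column and the new/updated flag columns are just illustrative placeholders:
import io.delta.tables.DeltaTable;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConditionalMerge {
    public static void merge(SparkSession spark, Dataset<Row> changes) {
        Map<String, String> updateCols = new HashMap<>();
        updateCols.put("key", "s.key");                    // hypothetical column names
        updateCols.put("value", "s.newValue");

        DeltaTable.forPath(spark, "path").as("t")
                .merge(changes.as("s"), "t.key = s.key")   // hypothetical single-column key
                .whenMatched("s.updated = true")           // existing key, fields changed
                .updateExpr(updateCols)
                .whenNotMatched("s.new = true")            // brand-new key
                .insertAll()
                .execute();
    }
}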
How are you counting your rows?
One thing to keep in mind is that directly reading and counting from the parquet files produced by Delta Lake will potentially give you a different result than reading the rows through the delta table interface. Remember that delta keeps a log and supports time travel so it does store copies of rows as they change over time.
Here's a way to accurately count the current rows in a delta table:
deltaTable = DeltaTable.forPath(spark,<path to your delta table>)
deltaTable.toDF().count()

How can I grab the columns of many tables efficiently in Spark?

I want to find all columns in some Hive tables that meet a certain criteria. However, the code I've written to do this is very slow, since Spark isn't a particularly big fan of looping:
matches = {}
for table in table_list:
    matching_cols = [c for c in spark.read.table(table).columns if substring in c]
    if matching_cols:
        matches[table] = matching_cols
I want something like:
matches = {'table1': ['column1', 'column2'], 'table2': ['column2']}
How can I more efficiently achieve the same result?
A colleague just figured it out. This is the revised solution:
matches = {}
for table in table_list:
    matching_cols = spark.sql("describe {}".format(table)) \
        .where(col('col_name').rlike(substring)) \
        .collect()
    if matching_cols:
        matches[table] = [c.col_name for c in matching_cols]
The key difference here is that Spark seems to be caching partition information in my prior example, hence why it was getting more and more bogged down with each loop. Accessing the metadata to scrape columns, rather than the table itself, bypasses that issue.
If the table fields have comments, the above code will run into issues because of the extra info (the comment). As a side note, HBase-linked tables will be an issue too...
Example:
CREATE TABLE deck_test (
    COLOR string COMMENT 'COLOR Address',
    SUIT string COMMENT '4 type Suits',
    PIP string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

describe deck_test;
-- color  string  COLOR Address
-- suit   string  4 type Suits
-- pip    string
To handle the comments issue, a small change may help:
matches = {}
for table in table_list:
    matching_cols = spark.sql("show columns in {}".format(table)) \
        .where(col('col_name').rlike(substring)) \
        .collect()
    if matching_cols:
        matches[table] = [c.col_name for c in matching_cols]

Cassandra read from large dataset

I need to get a count from a very large dataset in Cassandra, 100 million plus rows. I am worried about the memory hit Cassandra would take if I just ran the following query.
select count(*) from conv_org where org_id = 'TEST_ORG'
I was told I could use cassandra Automatic Paging to do this? Does this seem like a good option?
Would the syntax look something like this?
Statement stmt = new SimpleStatement("select count(*) from conv_org where org_id = 'TEST_ORG'");
stmt.setFetchSize(1000);
ResultSet rs = session.execute(stmt);
I am unsure the above code will work as I do not need a result set back I just need a count.
Here is the data model.
CREATE TABLE ts.conv_org (
    org_id text,
    create_time timestamp,
    test_id text,
    org_type int,
    PRIMARY KEY (org_id, create_time, conv_id)
)
If org_id isn't your primary key, counting in Cassandra is in general not a fast operation: it can easily lead to a full scan of all SSTables in your cluster and therefore be painfully slow.
In Java, for example, you can do something like this:
ResultSet rs = session.execute(...);
Iterator<Row> iter = rs.iterator();
while (iter.hasNext()) {
    if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
        rs.fetchMoreResults();                // prefetch the next page before this one runs out
    Row row = iter.next();
    // ... process the row ...
}
https://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/ResultSet.html
You could select a small column and count yourself. The int getAvailableWithoutFetching() and isFullyFetched() methods could help you with that.
In general, if you really need a count, maintain it yourself.
On the other hand, if you have really many rows in one partition, you can also run into other performance problems.
But that's hard to say without knowing the data model.
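For example, a rough sketch of counting yourself with the Java driver, selecting only a small column and letting the driver page through the partition (the fetch size is a page size, not a limit):
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PartitionCounter {
    // Count the rows of one partition by paging through a single small column.
    public static long countRows(Session session) {
        Statement stmt = new SimpleStatement(
                "SELECT org_id FROM ts.conv_org WHERE org_id = 'TEST_ORG'");
        stmt.setFetchSize(1000);                 // page size, fetched lazily by the driver
        ResultSet rs = session.execute(stmt);
        long count = 0;
        for (Row row : rs) {                     // iteration transparently fetches more pages
            count++;
        }
        return count;
    }
}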
Maybe you want to use a "counter table" in addition to your dataset.
Pros: you get the count fast.
Cons: you need to maintain that table.
Reference:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCountersConcept.html
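For illustration, here is a minimal sketch of the counter-table idea (the table and column names are made up): bump the counter whenever you write to conv_org, and reading the count becomes a single-row query.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CounterTableExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // One counter row per org_id (hypothetical table name)
            session.execute("CREATE TABLE IF NOT EXISTS ts.conv_org_count "
                    + "(org_id text PRIMARY KEY, row_count counter)");

            // Maintain it yourself: increment on every row you add to conv_org
            session.execute("UPDATE ts.conv_org_count SET row_count = row_count + 1 "
                    + "WHERE org_id = 'TEST_ORG'");

            // Reading the count is now a cheap single-partition, single-row query
            Row row = session.execute("SELECT row_count FROM ts.conv_org_count "
                    + "WHERE org_id = 'TEST_ORG'").one();
            System.out.println(row.getLong("row_count"));
        }
    }
}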

Storing time ranges in cassandra

I'm looking for a good way to store data associated with a time range, in order to be able to efficiently retrieve it later.
Each entry of data can be simplified as (start time, end time, value). I will need to later retrieve all the entries which fall inside a (x, y) range. In SQL, the query would be something like
SELECT value FROM data WHERE starttime <= x AND endtime >= y
Can you suggest a structure for the data in Cassandra which would allow me to perform such queries efficiently?
This is an oddly difficult thing to model efficiently.
I think using Cassandra's secondary indexes (along with a dummy indexed value which is unfortunately still needed at the moment) is your best option. You'll need to use one row per event with at least three columns: 'start', 'end', and 'dummy'. Create a secondary index on each of these. The first two can be LongType and the last can be BytesType. See this post on using secondary indexes for more details. Since you have to use an EQ expression on at least one column for a secondary index query (the unfortunate requirement I mentioned), the EQ will be on 'dummy', which can always be set to 0. (This means that the EQ index expression will match every row and essentially be a no-op.) You can store the rest of the event data in the row alongside start, end, and dummy.
In pycassa, a Python Cassandra client, your query would look like this:
from pycassa.index import *
start_time = 12312312000
end_time = 12312312300
start_exp = create_index_expression('start', start_time, GT)
end_exp = create_index_expression('end', end_time, LT)
dummy_exp = create_index_expression('dummy', 0, EQ)
clause = create_index_clause([start_exp, end_exp, dummy_exp], count=1000)
for result in entries.get_indexed_slices(clause):
    # do stuff with result
There should be something similar in other clients.
The alternative that I considered first involved OrderPreservingPartitioner, which is almost always a Bad Thing. For the index, you would use the start time as the row key and the finish time as the column name. You could then perform a range slice with start_key=start_time and column_finish=finish_time. This would scan every row after the start time and only return those with columns before the finish_time. Not very efficient, and you have to do a big multiget, etc. The built-in secondary index approach is better because nodes will only index local data and most of the boilerplate indexing code is handled for you.
