Google Cloud Search CSV Connector hits an upper limit when indexing - google-cloud-search

We are using Google's CSV connector to index a CSV file with 600k+ records. In the Test datasource, the number of records indexed tops out at around 8k. The Prod datasource hits a different upper bound, around 130k. The connector keeps running, but no additional records are indexed. Is there a datasource limit or some other limiting factor? Below are some of our tuning parameters from the config file:
connector.runOnce=false
traverse.threadPoolSize=1000
traverse.partitionSize=4000
batch.batchSize=20
batch.maxQueueLength=8000
batch.maxActiveBatches=250
batch.maxBatchDelaySeconds=20
batch.readTimeoutSeconds=120
batch.connectTimeoutSeconds=300
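
One way to confirm where indexing actually stops is to page through the datasource with the Cloud Search indexing API and count the items it contains. Below is a minimal Python sketch, assuming the google-api-python-client library and a service account that has access to the datasource; the key file path and datasource ID are placeholders:

from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholders: your service account key file and datasource ID.
KEY_FILE = "service-account-key.json"
DATASOURCE = "datasources/your-datasource-id"

creds = service_account.Credentials.from_service_account_file(
    KEY_FILE, scopes=["https://www.googleapis.com/auth/cloud_search"])
service = build("cloudsearch", "v1", credentials=creds)

# Page through items.list and count what was actually indexed.
count, page_token = 0, None
while True:
    resp = service.indexing().datasources().items().list(
        name=DATASOURCE, brief=True, pageSize=1000, pageToken=page_token
    ).execute()
    count += len(resp.get("items", []))
    page_token = resp.get("nextPageToken")
    if not page_token:
        break
print("items indexed:", count)

Comparing this count against what the Test and Prod datasources report in search can help narrow down whether the cap sits in the connector's batching or on the indexing side.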

Related

What is the max writeBatchSize for REST as a sink in Azure Data Factory

We are using Azure Data Factory to copy data from an on-premises SQL table to a REST endpoint, for example Google Cloud Storage. Our source table has more than 3 million rows. Based on the document https://learn.microsoft.com/en-us/azure/data-factory/connector-rest#copy-activity-properties, the default value for writeBatchSize (the number of records written to the REST sink per batch) is 10000. I tried increasing the size to 5,000,000 and 1,000,000, and noticed the final file size is the same. It shows that not all of the 3M records were written to GCS. Does anyone know what the max size for writeBatchSize is? The pagination setting seems to apply only when REST is used as a source. I wonder if there is any workaround for my case?

Is there a way to increase the request size limit when inserting data into cosmosdb?

I have a requirement to read multiple files (105 files) from ADLS (Azure Data Lake Storage), parse them, and then add the parsed data directly to multiple collections in Azure Cosmos DB for MongoDB API. All of this needs to be done in one request. The average file size is 120 KB.
The issue is that after multiple documents are added, an error is raised: "Request size limit too large".
Please let me know if someone has any inputs on this.
It's unclear how you're performing multi-document inserts but... you can't increase the maximum request size. You'll need to perform individual inserts, or insert in smaller batches.
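
A minimal sketch of the "smaller batches" approach with pymongo, assuming the parsed documents are already held in memory as a list of dicts; the connection string, database, and collection names are placeholders:

from pymongo import MongoClient

# Placeholder connection string for a Cosmos DB for MongoDB API account.
client = MongoClient("mongodb://ACCOUNT:KEY@ACCOUNT.mongo.cosmos.azure.com:10255/?ssl=true")
collection = client["mydb"]["parsed_files"]

def insert_in_batches(docs, batch_size=100):
    # Split one large insert into smaller requests so each stays under the request size limit.
    for i in range(0, len(docs), batch_size):
        collection.insert_many(docs[i:i + batch_size], ordered=False)

# Example: documents produced by your file parser.
docs = [{"file": "file-%d.json" % n, "payload": "..."} for n in range(1000)]
insert_in_batches(docs)

If a batch still trips the limit, lowering batch_size (or batching by accumulated byte size rather than document count) is the usual next step.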

PrestoDB v0.125 SELECT only returns a subset of Cassandra records

SELECT statements in PrestoDB v0.125 with a Cassandra connector to a DataStax Cassandra cluster only return 200 rows, even where the table contains many more rows than that. Aggregate queries like SELECT COUNT(*) over the same table also return a result of just 200.
(This behaviour is identical when querying with the pyhive connector and with the base Presto CLI.)
The documentation isn't much help, but I am guessing that the issue is pagination and a need to set environment variables (which the documentation doesn't explain):
https://prestodb.io/docs/current/installation/cli.html
Does anyone know how I can remove this limit of 200 rows returned? What specific environment variable setting do I need?
For those who come after - the solution is in the cassandra.properties connector configuration for Presto. The key setting is:
cassandra.limit-for-partition-key-select
This needs to be set higher than the total number of rows in the table you are querying, otherwise select queries will respond with only a fraction of the stored data (not having located all of the partition keys).
Complete copy of my config file (which may help!):
connector.name=cassandra
# Comma separated list of contact points
cassandra.contact-points=host1,host2
# Port running the native Cassandra protocol
cassandra.native-protocol-port=9042
# Limit of rows to read for finding all partition keys.
cassandra.limit-for-partition-key-select=2000000000
# maximum number of schema cache refresh threads, i.e. maximum number of parallel requests
cassandra.max-schema-refresh-threads=10
# schema cache time to live
cassandra.schema-cache-ttl=1h
# schema refresh interval
cassandra.schema-refresh-interval=2m
# Consistency level used for Cassandra queries (ONE, TWO, QUORUM, ...)
cassandra.consistency-level=ONE
# fetch size used for Cassandra queries
cassandra.fetch-size=5000
# fetch size used for partition key select query
cassandra.fetch-size-for-partition-key-select=20000
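
Since the question mentions querying through pyhive, here is a small sketch for verifying the fix after updating cassandra.properties and restarting Presto; the coordinator host, keyspace, and table name are placeholders:

from pyhive import presto

# Placeholders: Presto coordinator host, Cassandra catalog/keyspace, and table.
conn = presto.connect(host="presto-coordinator", port=8080,
                      catalog="cassandra", schema="my_keyspace")
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM my_table")
print(cursor.fetchone())  # should now report the full row count rather than 200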

Search a large amount of data in DynamoDB

I have a use case in which my data is stored in DynamoDB with the hash key as UniqueID and the range key as Date. The same data is also present in Amazon Simple Storage Service (S3). I want to search all the data based on a time range, and I want this to be fast. I can think of the following possible approaches:
- Scan the full S3 dataset and sort it based on time (this does not satisfy my latency requirements).
- Using DynamoDB scan filters will not help, as they scan the whole table. Consider the data to be of a large amount.
Requirements: fast (can get the result in less than 1 minute),
do not access a large amount of data,
can't use any other DB source.
I think AWS Elasticsearch might be the answer to your problem. DynamoDB is now integrated with Elasticsearch, enabling you to perform full-text queries on your data.
Elasticsearch is a popular open source search and analytics engine designed to simplify real-time search and big data analytics.
Elasticsearch integration is easy with the new Amazon DynamoDB Logstash Plugin.
You should use Query instead of Scan, see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html.
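
A hedged sketch of the Query approach with boto3, assuming the schema from the question (hash key UniqueID, range key Date stored as an ISO date string); the table name, ID, and date range are placeholders:

import boto3
from boto3.dynamodb.conditions import Key

# Placeholders: table name, UniqueID value, and date strings are illustrative only.
table = boto3.resource("dynamodb").Table("MyTable")

response = table.query(
    KeyConditionExpression=Key("UniqueID").eq("item-123")
    & Key("Date").between("2015-01-01", "2015-01-31")
)
for item in response["Items"]:
    print(item)

Note that Query still needs a concrete hash key; a time-range search across all UniqueIDs would need a different key design, such as a global secondary index whose hash key can be fixed per time window.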

Limited results when using Excel BigQuery connector

I've pulled a data extract from BigQuery using the Excel connector but my results have been limited to 230,000 records.
Is this a limitation of the connector or something I have not done properly?
BigQuery does have a maximum response size of 64MB (compressed). So, depending on the size of your rows, it's quite possible that 230,000 rows is the largest response BigQuery can return.
See more info on quotas here:
https://developers.google.com/bigquery/docs/quota-policy
What's the use case -- and how many rows are you expecting to be returned? Generally, BigQuery is used for large aggregate analysis rather than queries that return tons of unaggregated rows. You can dump the entire table as a CSV into Google Cloud Storage if you're looking for your raw dataset too (see the export sketch below).
Also, you may want to try running the query in the UI at:
https://bigquery.cloud.google.com/
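
If the raw dataset is the goal, a minimal sketch of that table export using the google-cloud-bigquery Python client; the project, dataset, table, and bucket names are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# Export the whole table to Cloud Storage as sharded CSV files,
# sidestepping the query response size limit entirely.
extract_job = client.extract_table(
    "my-project.my_dataset.my_table",   # placeholder table reference
    "gs://my-bucket/my_table-*.csv",    # placeholder destination bucket
)
extract_job.result()  # wait for the export job to finish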
