I've pulled a data extract from BigQuery using the Excel connector but my results have been limited to 230,000 records.
Is this a limitation of the connector or something I have not done properly?
BigQuery does have a maximum response size of 64MB (compressed). So, depending on the size of your rows, it's quite possible that 230,000 rows is the most BigQuery can return in a single response.
See more info on quotas here:
https://developers.google.com/bigquery/docs/quota-policy
What's the use case -- and how many rows are you expecting to be returned? Generally BigQuery is used for large aggregate analysis, rather than for queries that return tons of unaggregated rows. If you're after the raw dataset, you can also dump the entire table as a CSV into Google Cloud Storage.
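If you do want the raw table, something along these lines should work with the Python client (a rough sketch; the project, dataset, table and bucket names are placeholders):

from google.cloud import bigquery

# Sketch only -- "my-project.my_dataset.my_table" and the bucket are placeholders.
client = bigquery.Client()

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",     # source table to dump
    "gs://my-bucket/my_table-*.csv",      # the * lets BigQuery shard a large export
)
extract_job.result()  # block until the export job finishes

From there you can download the CSV shards from Cloud Storage rather than pulling everything through the connector.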
Also, you may want to try running the query in the UI at:
https://bigquery.cloud.google.com/
I am looking into ways of comparing records from the same table across different databases. I just need to compare them and find the missing records.
I have tried a few methods.
Loading the records into a pandas DataFrame with read_sql: it takes a lot of time and memory to complete the load, and if the record count is large I get a memory error.
I also tried setting up a standalone Spark cluster and running the comparison there, but it throws a Java heap space error, and tuning the configuration has not helped.
Please let me know if there are other ways to handle this huge record comparison.
--update
Is there a tool readily available for cross-data-source comparison?
If your data size is huge, you can use cloud services to run your Spark job and get the results. For example, AWS Glue is serverless and billed as you go.
If your data is not that large and this is a one-time job, you can use Google Colab, which is free, and run your comparison there.
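Alternatively, if you only need to find missing records, one way to keep memory under control is to pull just the key column in chunks and compare the key sets, rather than loading full rows. A minimal sketch, assuming SQLAlchemy engines and a single key column (connection strings, table and column names are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection strings -- point these at the two databases being compared.
engine_a = create_engine("postgresql://user:pass@host-a/db")
engine_b = create_engine("postgresql://user:pass@host-b/db")

def load_keys(engine, table, key="id", chunksize=100_000):
    """Stream only the key column in chunks so full rows never sit in memory."""
    keys = set()
    for chunk in pd.read_sql(f"SELECT {key} FROM {table}", engine, chunksize=chunksize):
        keys.update(chunk[key].tolist())
    return keys

keys_a = load_keys(engine_a, "my_table")
keys_b = load_keys(engine_b, "my_table")

print("missing in B:", len(keys_a - keys_b))
print("missing in A:", len(keys_b - keys_a))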
I have a requirement to read multiple files (105 files) from ADLS (Azure Data Lake Storage), parse them, and then add the parsed data directly to multiple collections in Azure Cosmos DB for MongoDB API. All of this needs to be done in one request. The average file size is 120 KB.
The issue is that after multiple documents are added, an error is raised: "Request size limit too large".
Please let me know if anyone has any input on this.
It's unclear how you're performing multi-document inserts but... You can't increase maximum request size. You'll need to perform individual inserts, or insert in smaller batches.
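For example, with pymongo you could split the parsed documents into small batches per insert_many call (a sketch; the connection string, database, collection and batch size are placeholders to tune against the limit):

from pymongo import MongoClient

# Placeholder connection string and names -- only the batching matters here.
client = MongoClient("<your-cosmos-mongodb-connection-string>")
collection = client["mydb"]["mycollection"]

def insert_in_batches(docs, batch_size=100):
    """Insert in small batches so no single request exceeds the request size limit."""
    for i in range(0, len(docs), batch_size):
        collection.insert_many(docs[i:i + batch_size], ordered=False)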
Hi, I've got a simple collection with 40k records in it. It's just an import of a CSV (c. 4 MB), so it has a consistent object per document, and it's for an Open Data portal.
I need to be able to offer a full download of the data as well as the capabilities of AQL for querying, grouping, aggregating etc.
If I set batchSize to the full dataset then it takes around 50 seconds to return and is, unsurprisingly, about 12 MB, largely because the column names are repeated in every document.
eg
{"query":"for x in dataset return x","batchSize":50000}
I've tried things like caching and balancing a larger batchSize against using the cursor to build up the whole dataset, but I can't get the response time down very much.
Today I came across the attributes and values functions and created this AQL statement.
{"query":"return union(
for x in dataset limit 1 return attributes(x,true),
for x in dataset return values(x,true))","batchSize":50000}
It will mean I have to reassemble the objects client-side, but I use PapaParse so that should be no issue (not proven yet).
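For what it's worth, the reassembly I have in mind looks roughly like this (a Python sketch just to illustrate the idea; my actual client uses PapaParse, and it assumes every document has the same attributes in the same order, which should hold for a flat CSV import):

import csv

def union_result_to_csv(result, out_path):
    """result is the single array returned by the union() query:
    result[0]  -> attribute names taken from the first document
    result[1:] -> one array of values per document, in the same order"""
    header, value_rows = result[0], result[1:]
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(value_rows)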
Is this the best / only way to have an option to output the full csv and still have a response that performs well?
I am trying to avoid having to store the data multiple times, e.g. once as the raw CSV and again in a collection. I guess there may be datasets too big for this approach to cope with, but this is one of our bigger ones.
Thanks
I am querying Google Cloud data using BigQuery.
When I run the query it returns about 8 million rows.
But it throws this error:
Response too large to return
How can I get all 8 million records? Can anybody help?
1. What is the maximum size of a BigQuery response?
As mentioned in the Quota policy, the maximum response size for queries is 128 MB compressed (unlimited when returning large query results).
2. How do we get all the records in the query response, rather than via an export?
If you plan to run a query that might return larger results, you can set allowLargeResults to true in your job configuration.
Queries that return large results will take longer to execute, even if the result set is small, and are subject to additional limitations:
- You must specify a destination table.
- You can't specify a top-level ORDER BY, TOP or LIMIT clause. Doing so negates the benefit of using allowLargeResults, because the query output can no longer be computed in parallel.
- Window functions can return large query results only if they are used in conjunction with a PARTITION BY clause.
Read more about how to paginate through the results here, and also read the BigQuery Analytics book, starting around page 200, where it is explained how Jobs::getQueryResults works together with the maxResults parameter and its blocking mode.
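With the Python client, the job configuration might look roughly like this (a sketch; the project, dataset and table names are placeholders, and allowLargeResults applies to legacy SQL queries):

from google.cloud import bigquery

# Sketch only -- project/dataset/table names are placeholders.
client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    allow_large_results=True,
    destination="my-project.my_dataset.query_results",  # a destination table is required
    use_legacy_sql=True,                                 # allowLargeResults is a legacy SQL option
)

query_job = client.query(
    "SELECT * FROM [my-project:my_dataset.big_table]",
    job_config=job_config,
)
query_job.result()  # rows land in the destination table; page through or export from there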
Update:
Query Result Size Limitations - When you run a normal query in BigQuery, the response size is limited to 128 MB of compressed data. Sometimes it is hard to know what 128 MB of compressed data means. Does it get compressed 2x? 10x? The results are compressed within their respective columns, which means the compression ratio tends to be very good. For example, if you have one column that is the name of a country, there will likely be only a few different values. When you have only a few distinct values, there isn't a lot of unique information, and the column will generally compress well. If you return encrypted blobs of data, they will likely not compress well, because they will be mostly random. (This is explained in the book linked above, on page 220.)
Try this:
Under the query window, there is a button 'Show Options'. Click it and you will see some options:
- select or create a new destination table;
- tick 'Allow Large Results';
- run your query, and see whether it works.
I have a use case in which my data is stored in DynamoDB with UniqueID as the hash key and Date as the range key. The same data is also present in Amazon Simple Storage Service (S3). I want to search all the data based on a time range, and I need it to be fast. I can think of the following possible approaches:
- Scan all of S3 and sort the results by time (this does not satisfy my latency requirements).
- Use DynamoDB scan filters; this will not help either, as they scan the whole table. Consider the data to be large.
Requirements: fast (results in less than 1 minute), do not access a large amount of data, and no other DB source can be used.
I think AWS Elasticsearch might be the answer to your problem. DynamoDB is now integrated with Elasticsearch, enabling you to perform full-text queries on your data.
Elasticsearch is a popular open source search and analytics engine designed to simplify real-time search and big data analytics.
Elasticsearch integration is easy with the new Amazon DynamoDB Logstash Plugin.
You should use Query instead of Scan, see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html.
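A minimal boto3 sketch of such a Query (the table name and key values are placeholders). Note that Query still needs an equality condition on the hash key, so a date-only range across all items would need a global secondary index or a different key design:

import boto3
from boto3.dynamodb.conditions import Key

# Placeholder table name and key values.
table = boto3.resource("dynamodb").Table("my_table")

response = table.query(
    KeyConditionExpression=Key("UniqueID").eq("some-id")
    & Key("Date").between("2015-01-01", "2015-01-31")
)
items = response["Items"]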