I am trying to run a script that creates tables in Athena for an application. All the small tables are created successfully, but one large table with 4,152 columns fails with the error:
org.apache.hadoop.hive.ql.metadata.HiveException: InvalidObjectException
Message: Entity size has exceeded the maximum allowed size; service: AWS Glue; status code: 400; error code: InvalidInputException
I have worked out that this is happening because the schema size is about 305 KB, which exceeds the Athena limit of 262 KB. From Databricks the table also gets created when the number of columns is smaller (around 3,000).
How do I fix this problem? I cannot split this table into two tables.
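For reference, this is roughly the kind of check I used to estimate the schema size (an illustrative sketch only; the column names and types below are placeholders, not the real schema):

# Rough, illustrative estimate of how large the column definitions are,
# compared against the ~262 KB limit mentioned above.
columns = [("col_%04d" % i, "string") for i in range(4152)]  # placeholder schema

ddl_body = ",\n".join("`%s` %s" % (name, col_type) for name, col_type in columns)
schema_kb = len(ddl_body.encode("utf-8")) / 1024.0

print("approximate schema size: %.1f KB (limit ~262 KB)" % schema_kb)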
We are connecting to Teradata from Spark SQL with the API below:
Dataset<Row> jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, connectionProperties);
We are facing an issue when we execute the above logic on a large table with a million rows: every time, we see the extra query below being executed, and it is causing a performance hit on the database. We got this information from the DBA; we don't have any logs on the Spark SQL side.
SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
1
1
1
1
1
1
1
1
1
Can you please clarify why this query is being executed? Is there any chance that this type of query is issued by our own code while checking the row count of the DataFrame?
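For reference, here is a minimal PySpark sketch of what I suspect (connection details are placeholders; this is only my guess at how such a query could arise): a count() needs no columns from the source, so the pushed-down JDBC query can end up with an empty projection that looks like SELECT 1 FROM <table>.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-count-check").getOrCreate()

connection_url = "jdbc:teradata://host/database=mydb"               # placeholder URL
table_query = "ONE_MILLION_ROWS_TABLE"                              # placeholder table/query
connection_properties = {"user": "user", "password": "password"}    # placeholders

jdbc_df = spark.read.jdbc(connection_url, table_query, properties=connection_properties)

# A count needs no columns, so the JDBC source may issue something like
# "SELECT 1 FROM ONE_MILLION_ROWS_TABLE" against the database.
row_count = jdbc_df.count()
print(row_count)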
Please provide your inputs on this. I have searched various mailing lists and even created a Spark question.
I have created a Lambda function that uses the awswrangler Data API to read data from an RDS Serverless Aurora PostgreSQL database with a query. The query contains a conditional that is a list of IDs. If the query has fewer than 1K IDs it works great; if over 1K, I get this message:
Maximum BadRequestException retries reached for query
An example query is:
"""select * from serverlessDB where column_name in %s""" % ids_list
I adjusted the serverless RDS instance to force scaling and also increased the concurrency of the Lambda function. Is there a way to fix this issue?
With PostgreSQL, 1000+ items in a WHERE IN clause should be just fine.
I believe you are running into a Data API limit.
There isn't a fixed upper limit on the number of parameter sets. However, the maximum size of the HTTP request submitted through the Data API is 4 MiB. If the request exceeds this limit, the Data API returns an error and doesn't process the request. This 4 MiB limit includes the size of the HTTP headers and the JSON notation in the request. Thus, the number of parameter sets that you can include depends on a combination of factors, such as the size of the SQL statement and the size of each parameter set.
The response size limit is 1 MiB. If the call returns more than 1 MiB of response data, the call is terminated.
The maximum number of requests per second is 1,000.
Either your request exceeds 4 MiB or the response to your query exceeds 1 MiB. I suggest you split your query into multiple smaller queries, as in the sketch below.
Reference: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/data-api.html
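If it helps, here is a rough sketch of that chunked approach, assuming awswrangler's data_api.rds module and numeric IDs; the ARNs, table name, and column name are placeholders:

import awswrangler as wr
import pandas as pd

con = wr.data_api.rds.connect(
    resource_arn="arn:aws:rds:us-east-1:123456789012:cluster:my-cluster",          # placeholder
    database="mydb",                                                                # placeholder
    secret_arn="arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret",    # placeholder
)

def query_in_chunks(ids, chunk_size=500):
    """Run the IN (...) query in chunks so each request/response stays under the Data API limits."""
    frames = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        id_list = ", ".join(str(int(i)) for i in chunk)  # numeric IDs assumed
        sql = "SELECT * FROM my_table WHERE column_name IN (%s)" % id_list
        frames.append(wr.data_api.rds.read_sql_query(sql, con))
    return pd.concat(frames, ignore_index=True)

# ids_list is the same list of IDs used in the original query
df = query_in_chunks(ids_list)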
I have a vast database comprising ~2.4 million JSON files, each of which contains several records. I've created a simple apache-beam data pipeline (shown below) that follows these steps:
Read data from a GCS bucket using a glob pattern.
Extract records from JSON data.
Transform data: convert dictionaries to JSON strings, parse timestamps, among others.
Write to BigQuery.
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToBigQuery
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# Pipeline (pipeline_args, known_args, files_pattern, save_main_session,
# ExtractRecordsFn and TransformDataFn are defined elsewhere in the script)
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
p = beam.Pipeline(options=pipeline_options)

# Read
files = p | 'get_data' >> ReadFromText(files_pattern)

# Transform
output = (files
          | 'extract_records' >> beam.ParDo(ExtractRecordsFn())
          | 'transform_data' >> beam.ParDo(TransformDataFn()))

# Write
output | 'write_data' >> WriteToBigQuery(
    table=known_args.table,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
    insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
    temp_file_format='NEWLINE_DELIMITED_JSON')

# Run
result = p.run()
result.wait_until_finish()
I've tested this pipeline with a minimal sample dataset and it works as expected. But I'm doubtful about the optimal use of BigQuery resources and quotas. The batch load quotas are very restrictive, and because of the massive number of files to parse and load, I want to know whether I'm missing some settings that could guarantee the pipeline will respect the quotas and run optimally. I don't want to exceed the quotas, as I am running other loads to BigQuery in the same project.
I haven't fully understood some parameters of the WriteToBigQuery() transform, specifically batch_size, max_file_size, and max_files_per_bundle, or whether they could help optimize the load jobs to BigQuery. Could you help me with this?
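For context, this is roughly where those parameters plug in (an illustrative sketch only, reusing the names from the pipeline above; the values are arbitrary and the comments paraphrase my reading of the Beam docs):

output | 'write_data_tuned' >> WriteToBigQuery(
    table=known_args.table,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
    insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    batch_size=500,            # rows per request when using streaming inserts
    max_file_size=1 << 30,     # upper bound on each temp file staged for a load job
    max_files_per_bundle=20)   # upper bound on files written per worker bundle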
Update
I'm not only concerned about BigQuery quotas; the GCP quotas of other resources used by this pipeline are also a concern.
I tried to run my simple pipeline over the target data (~2.4 million files), but I received the following warning message:
Project [my-project] has insufficient quota(s) to execute this workflow with 1 instances in region us-central1. Quota summary (required/available): 1/16 instances, 1/16 CPUs, 250/2096 disk GB, 0/500 SSD disk GB, 1/99 instance groups, 1/49 managed instance groups, 1/99 instance templates, 1/0 in-use IP addresses. Please see https://cloud.google.com/compute/docs/resource-quotas about requesting more quota.
I don't understand that message completely. The process activated 8 workers successfully and is using 8 of the 8 available in-use IP addresses. Is this a problem? How can I fix it?
If you're worried about load job quotas, you can try streaming data into BigQuery, which comes with a less restrictive quota policy.
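A minimal sketch of what that switch could look like, reusing the names from your pipeline (my own illustration, not tested; streaming inserts usually pair with WRITE_APPEND rather than WRITE_EMPTY):

output | 'write_data_streaming' >> WriteToBigQuery(
    table=known_args.table,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,  # stream rows instead of load jobs
    insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR')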
To achieve what you want to do, you can try the Google-provided templates or just refer to their code.
Cloud Storage Text to BigQuery (Stream) [code]
Cloud Storage Text to BigQuery (Batch)
And last but not least, more detailed information can be found in the Google BigQuery I/O connector documentation.
I am using an ADF copy activity to copy files from Azure Blob to Azure PostgreSQL. I'm doing a recursive copy, i.e. there are multiple files within the folder. The total size of the 5 files I have to copy is around 6 GB. The activity fails after 30-60 minutes of running. I used a write batch size from 100 to 500, but it still fails.
I used 4, 8, or auto DIUs, and similarly tried 1, 2, 4, 8, or auto parallel connections to PostgreSQL; normally it seems to use 1 per source file. The Azure PostgreSQL server has 8 cores and a temp buffer size of 8192 KB (the maximum allowed is around 16000 KB); I even tried using that, but I have been constantly getting the two errors below. The MS support team suggested using the retry option; I'm still awaiting a response from their PG team, but below are the errors.
{
'errorCode': '2200',
'message': ''Type=Npgsql.NpgsqlException,Message=Exception while reading from stream,Source=Npgsql,''Type=System.IO.IOException,Message=Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.,Source=System,''Type=System.Net.Sockets.SocketException,Message=An existing connection was forcibly closed by the remote host,Source=System,'',
'failureType': 'UserError',
'target': 'csv to pg staging data migration',
'details': []
}
or
Operation on target csv to pg staging data migration failed: 'Type=Npgsql.NpgsqlException,Message=Exception while flushing stream,Source=Npgsql,''Type=System.IO.IOException,Message=Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host.,Source=System,''Type=System.Net.Sockets.SocketException,Message=An existing connection was forcibly closed by the remote host,Source=System
I was also facing this issue recently and contacted our Microsoft rep, who got back to me with the following update on 2020-01-16:
“This is another issue we found in the driver, we just finished our deployment yesterday to fix this issue by upgrading driver version. Now customer can have up to 32767 columns data in one batch size (which is the limitation in PostgreSQL, we can’t exceed that). Please let customer make sure that (Write batch size * column size) < 32767 as I mentioned, otherwise they will face the limitation.”
In other words, "column size" refers to the number of columns in the table, and the product (write batch size * column count) cannot be greater than 32,767.
I was able to change my ADF write batch size on the copy activity to a dynamic formula that ensures an optimum batch size per table:
@div(32766, length(pipeline().parameters.config))
pipeline().parameters.config refers to an array containing information about the table's columns; the length of the array equals the number of columns in the table.
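As a quick worked example of the same arithmetic (numbers are illustrative only):

# The write batch size must satisfy: batch_size * column_count < 32767.
column_count = 120                       # e.g. a table with 120 columns
max_batch_size = 32766 // column_count   # same idea as @div(32766, length(...)) in ADF
print(max_batch_size)                    # -> 273 rows per batch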
Hope this helps! I was able to populate the database (albeit slowly) via ADF; I would much prefer a COPY-based method for better performance.
I have a workstation with 250 GB of RAM and a 4 TB SSD. MemSQL has a table that contains 1 billion records, each with 44 columns, about 500 GB of data in total. When I run the following query on that table
SELECT count(*) ct,name,age FROM research.all_data group by name having count(*) >100 order by ct desc
I got the following error
MemSQL code generation has failed
I restarted the server, and after that I got another error:
Not enough memory available to complete the current request. The request was not processed
I gave the server a maximum memory of 220 GB and a max_table_memory of 190 GB.
Why could that error happen?
Why is MemSQL consuming 140 GB of memory even though I am using a columnstore?
For "MemSQL code generation has failed", check the tracelog (http://docs.memsql.com/docs/trace-log) on the MemSQL node where the error was hit for more details - this can mean a lot of different things.
MemSQL needs memory to process query results, hold some metadata, etc. even though columnstore data lives on disk. Check memsql status info to see what is using memory - https://knowledgebase.memsql.com/hc/en-us/articles/208759276-What-is-using-memory-on-my-leaves-.