I use Spark SQL v2.4 with the SQL API. I have a SQL query that fails when I run the job in Spark; it fails with this error:
WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints
(spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes).
This may impact query planning performance.
ERROR TransportClient: Failed to send RPC RPC 8371705265602543276 to xx.xxx.xxx.xx:52790:java.nio.channels.ClosedChannelException
The issue occurs when I trigger the write command to save the output of the query to a Parquet file on S3.
The query is:
create temp view last_run_dt
as
select dt,
to_date(last_day(add_months(to_date('[${last_run_date}]','yyyy-MM-dd'), -1)), 'yyyy-MM-dd') as dt_lst_day_prv_mth
from date_dim
where dt = add_months(to_date('[${last_run_date}]','yyyy-MM-dd'), -1);
create temp view get_plcy
as
select plcy_no, cust_id
from (select
plcy_no,
cust_id,
eff_date,
row_number() over (partition by plcy_no order by eff_date desc) AS row_num
from plcy_mstr pm
cross join last_run_dt lrd
on pm.curr_pur_dt <= lrd.dt_lst_day_prv_mth
and pm.fund_type NOT IN (27, 36, 52)
and pm.fifo_time <= '2022-02-12 01:25:00'
and pm.plcy_no is not null
)
where row_num = 1;
I am writing the output as:
df.coalesce(10).write.parquet('s3://some/dir/data', mode="overwrite", compression="snappy")
The "plcy_mstr" table in the above query is a big table of 500 GB size and is partitioned on eff_dt column. Partitioned by every date.
I have tried to increase the executor memory by applying the following configurations, but the job still fails.
set spark.driver.memory=20g;
set spark.executor.memory=20g;
set spark.executor.cores=3;
set spark.executor.instances=30;
set spark.memory.fraction=0.75;
set spark.driver.maxResultSize=0;
The cluster contains 20 nodes with 8 cores each and 64GB of memory.
Can anyone please help me identify the issue and fix the job? Any help is appreciated.
Happy to provide more information if required.
Thanks
A third-party application connects to a Databricks general-purpose cluster and fires some SQL queries. Everything worked fine on Databricks runtime 7.3LTS, but after we upgraded the cluster to runtime 9.1LTS, the WHERE clause in the query suddenly no longer contains quotes around the string value.
This is the incoming query with the 9.1LTS runtime:
22/01/26 08:16:52 INFO SparkExecuteStatementOperation: Submitting query 'SELECT * FROM (SELECT `TimeColumn`,`ValueColumn` FROM `database`.`table1` WHERE `TimeColumn` >= { ts '2022-12-26 09:14:55' } AND `TimeColumn` <= { ts '2022-01-26 09:14:55' } AND **`CounterName`=28STO0004** ORDER BY `TagTimeStamp`) LIMIT_ZERO LIMIT 0'
This is the incoming query with the 7.3LTS runtime:
22/01/26 08:28:48 INFO SparkExecuteStatementOperation: Submitting query 'SELECT * FROM (SELECT C_79736572615f79736572615f74616774696d6576616c7565.`TimeColumn` AS C_0, C_79736572615f79736572615f74616774696d6576616c7565.`ValueColumn` AS C_43 FROM `database`.`table1` C_79736572615f79736572615f74616774696d6576616c7565 WHERE (**(C_79736572615f79736572615f74616774696d6576616c7565.`CounterName` = '28STO0004')** AND (C_79736572615f79736572615f74616774696d6576616c7565.`TimeColumn` >= TIMESTAMP '2022-12-26 09:14:55') AND (C_79736572615f79736572615f74616774696d6576616c7565.`TimeColumn` <= TIMESTAMP '2022-01-26 09:14:55')) ORDER BY C_0 ASC ) LIMIT_ZERO LIMIT 0'
This image shows the error we receive in the third-party application.
No changes were made to the third-party application. We also have no configuration options on the application, nor query diagnostics.
When we set up a Databricks SQL endpoint, we received the same error as with the Databricks runtime 9.1LTS cluster.
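Stripped of the ODBC { ts ... } escapes, the difference between the two runtimes comes down to the highlighted predicate. A minimal illustration (table and column names as in the logged queries; the exact error text is in the image):

-- 9.1LTS: the value arrives without quotes, so 28STO0004 is not read as a string literal
SELECT `TimeColumn`, `ValueColumn`
FROM `database`.`table1`
WHERE `CounterName` = 28STO0004;

-- 7.3LTS: the value arrived as a quoted string literal and the filter worked
SELECT `TimeColumn`, `ValueColumn`
FROM `database`.`table1`
WHERE `CounterName` = '28STO0004';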
I am new to Azure Data Factory and am looking to copy CSV data into a table that has foreign key relationships. Here are my tables:
Customer table
CREATE TABLE [dbo].[Customer]
(
[Id] UNIQUEIDENTIFIER NOT NULL PRIMARY KEY, -- Primary Key column
[CustomerNumber] NVARCHAR(50) NOT NULL,
[FirstName] NVARCHAR(50) NOT NULL,
[LastName] NVARCHAR(50) NOT NULL,
[CreatedOn] datetime,
[CreatedBy] NVARCHAR(255),
[ModifiedOn] datetime,
[ModifiedBy] NVARCHAR(255)
);
GO
-- Insert rows into table 'Customer' in schema '[dbo]'
INSERT INTO [dbo].[Customer]
VALUES
(
NEWID(),'Tom123', 'Tom', 'Shehu',GETDATE(),'test',GETDATE(),'admin'
),
(
NEWID(),'Harol234', 'Harold', 'Haoxa',GETDATE(),'test',GETDATE(),'admin'
),
(
NEWID(),'Peter345', 'Peter', 'Begu',GETDATE(),'test',GETDATE(),'admin'
),
(
NEWID(),'Marlin09', 'Marlin', 'Hysi',GETDATE(),'test',GETDATE(),'admin'
)
GO
Product Table
CREATE TABLE [dbo].[Product]
(
[Id] UNIQUEIDENTIFIER NOT NULL PRIMARY KEY, -- Primary Key column
[Name] NVARCHAR(50) NOT NULL,
[ErpNumber] NVARCHAR(50) NOT NULL,
[Description] NVARCHAR(50) NOT NULL,
[CreatedOn] datetime,
[CreatedBy] NVARCHAR(255),
[ModifiedOn] datetime,
[ModifiedBy] NVARCHAR(255)
);
GO
-- Insert rows into table 'Product' in schema '[dbo]'
INSERT INTO [dbo].[Product]
VALUES
(
NEWID(), 'EI500CMZ', 'EI500CMZ','7-Day test product',GETDATE(),'Tom',GETDATE(),'Tom'
),
(
NEWID(), 'ST0SMX', 'ST0SMX','7-Day heavy duty product',GETDATE(),'Tom',GETDATE(),'Tom'
),
(
NEWID(), 'EH30MZ', 'EH30MZ','Electronic water test product',GETDATE(),'Tom',GETDATE(),'Tom'
)
CustomerProduct table
CREATE TABLE [dbo].[CustomerProduct]
(
[Id] UNIQUEIDENTIFIER NOT NULL PRIMARY KEY, -- Primary Key column
[CustomerId] UNIQUEIDENTIFIER NOT NULL,
[ProductId] UNIQUEIDENTIFIER NOT NULL,
[Name] NVARCHAR(255) NOT NULL,
[CreatedOn] datetime,
FOREIGN KEY(CustomerId) REFERENCES Customer(Id),
FOREIGN KEY(ProductId) REFERENCES Product(Id)
);
GO
Below is my CSV file data:
CustomerNumber,ErpNumber,Name
Tom123,EI500CMZ,EI500CMZ2340
Harol234,ST0SMX,ST0SMX74770
Peter345,EH30MZ,EH30MZ00234
Now I want to insert data into my third table, CustomerProduct, but I do not understand how the CustomerId, ProductId and Name values will get inserted.
In the CSV data above I get CustomerNumber and ErpNumber, but on insert the corresponding CustomerId and ProductId have to go into the table.
I do not understand how to do this.
So far I have done the following in Azure Data Factory:
Created a blob storage account, added a container to it and uploaded my CSV file.
Created a linked service of type Azure Blob Storage called "CustomerProductInputService" that talks to blob storage.
Created a linked service of type Azure SQL Database called "CustomerProductOutputService" that communicates with the "CustomerProduct" table.
Created a dataset of type Azure Blob. This receives the data from "CustomerProductInputService".
Created a dataset of type Azure SQL Database.
Now I am stuck at the Copy activity. I do not understand how to create a pipeline for this scenario and insert the data into the CustomerProduct table.
As explained, I get CustomerNumber and ErpNumber in the CSV file, but I want to insert CustomerId and ProductId into my CustomerProduct table.
Can anybody help me?
You can insert the CustomerProduct data from the CSV into the table using a Data Flow activity, with Lookup transformations to get the CustomerId and ProductId from the Customer and Product tables respectively.
Source:
Add 3 source transformations in the data flow: one for the CSV source file, one for the Customer table, and one for the Product table.
a) Source1 (CSV): create a CSV dataset for Source1 to read the input file data.
b) Source2 (CustomerTable): connect to the Customer table and read all of its existing data.
• As we only need the Id and CustomerNumber columns from the Customer table, add a Select transformation (Customer) after Source2 to keep only those columns.
c) Source3 (ProductTable): connect Source3 to the Product table to pull all the existing data from dbo.Product.
• Add a Select transformation (Product) after Source3 to keep only the required columns Id and ErpNumber.
Add a Lookup transformation to Source1 (CSV), with the primary stream set to the CSV source, the lookup stream set to Customer (the Select after Source2), and the lookup condition CSV column "CustomerNumber" equals (==) Customer table column "CustomerNumber".
Because the Lookup behaves like a left join here, the output includes all columns from Source1 plus the lookup columns from Source2 (which produces duplicate columns).
a) So use a Select transformation (CustomerSelectList) to keep only the required columns in the output, and rename the Id column pulled from the Customer table to CustomerId to match the sink table.
Add another Lookup transformation after the Select (CustomerSelectList) to get the data from the Product table.
a) Set the primary stream to CustomerSelectList (the Select transformation) and the lookup stream to Product (the Select after Source3).
b) Use the lookup condition CSV source column "ErpNumber" equals (==) Product table column "ErpNumber".
Again, use a Select transformation to drop the other columns and keep only the required ones, renaming the Id column from the Product table to ProductId.
Add a Derived Column transformation after the Select (CustomerProductSelectList) to add the new columns Id and CreatedOn.
a) Id: as this is a UNIQUEIDENTIFIER in the sink table, we can add an expression that generates the id using UUID() in the derived column.
b) CreatedOn: add an expression that supplies the current timestamp for the sink table.
Finally, add a Sink transformation to insert the data into the CustomerProduct table.
Add this data flow to a pipeline and run the pipeline to insert the data.
Output:
First, you need to identify a key connection between Customer and Product. Next, create a pipeline in Data Factory, create two sources, "Product" and "Customer", apply the ADF Join and Alter Row transformations, and sink the result to dbo.CustomerProduct.
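For reference, the lookup/join logic described in the two answers above is equivalent to the following T-SQL, shown only to illustrate how the keys get resolved (a sketch; dbo.CustomerProductStaging is a hypothetical staging table holding the raw CSV rows):

-- Illustrative only: resolve CustomerId and ProductId from the natural keys in the CSV.
-- dbo.CustomerProductStaging (CustomerNumber, ErpNumber, Name) is a hypothetical table
-- containing the raw CSV rows.
INSERT INTO dbo.CustomerProduct (Id, CustomerId, ProductId, Name, CreatedOn)
SELECT
    NEWID(),      -- new surrogate key, like UUID() in the derived column step
    c.Id,         -- CustomerId looked up via CustomerNumber
    p.Id,         -- ProductId looked up via ErpNumber
    s.Name,
    GETDATE()     -- current timestamp, like the CreatedOn derived column
FROM dbo.CustomerProductStaging AS s
JOIN dbo.Customer AS c ON c.CustomerNumber = s.CustomerNumber
JOIN dbo.Product  AS p ON p.ErpNumber = s.ErpNumber;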
In my Cassandra Java driver code, I create a query and then print the consistency level of the query:
val whereClause = whereConditions(tablename, id)
cassandraRepositoryLogger.trace("getRowsByPartitionKeyId: looking in table "+tablename+" with partition key "+partitionKeyColumns +" and values "+whereClause +" fetch size "+fetchSize)
cassandraRepositoryLogger.trace("where clause is "+whereClause)
cassandraRepositoryLogger.trace(s"consistency level ${whereClause.getConsistencyLevel}")
But the print shows that the consistency level is null. Why? Shouldn't it be ONE by default?
2020-06-10 07:16:44,146 [TRACE] from repository.UsersRepository in scala-execution-context-global-115 - where clause is SELECT * FROM users WHERE bucket=109 AND email='manu.chadha#hotmail.com';
2020-06-10 07:16:44,146 [TRACE] from repository.UsersRepository in scala-execution-context-global-115 - getOneRowByPartitionKeyId: looking in table users with partition key List(bucket, email) and values SELECT * FROM users WHERE bucket=109 AND email='manu.chadha#hotmail.com';
2020-06-10 07:16:44,146 [TRACE] from repository.UsersRepository in scala-execution-context-global-115 - consistency level null <-- Why is this null?
The query is built as follows:
def whereConditions(tableName:String,id: UserKeys):Where= {
QueryBuilder.select().from(tableName).where(QueryBuilder.eq("bucket", id.bucket))
.and(QueryBuilder.eq("email", id.email))
}
This is how the getConsistencyLevel method is implemented: it returns the consistency level of the query, or null if no consistency level has been set with setConsistencyLevel. In other words, the per-statement setting stays null until you call setConsistencyLevel on the statement; the driver only applies its configured default consistency level when the statement is actually executed.
I need to populate a custom column (User Defined Cost) on SOLine with the unit cost of a serialized item from Purchase Receipts that has the same Lot/Serial number (screenshot 1). If the item has split Lot/Serial numbers (screenshot 2), then I have to read the respective unit cost based on the Lot/Serial number the user enters on the SOLine item.
I have already written a SOLine_RowPersisting event handler for the case where the item is not split, but I am not sure how to detect whether there are split serialized items. Below is the code for the SOLine_RowPersisting event. Please suggest.
protected virtual void SOLine_RowPersisting(PXCache sender, PXRowPersistingEventArgs e)
{
    SOLine row = (SOLine)e.Row;
    if (row == null)
        return;

    if (!string.IsNullOrEmpty(row.LotSerialNbr))
    {
        SOOrderEntry graph = PXGraph.CreateInstance<SOOrderEntry>();

        //select UnitCost, * from POReceiptLine where CompanyID = 2 and ReceiptNbr = 'PR004082' and InventoryID = '8502' and LotSerialNbr = 'SUB1703210365'
        //select LotSerialNbr, * from POReceiptLineSplit where CompanyID = 2 and InventoryID = '8502' and LotSerialNbr = 'SUB1704270366'
        //TODO : How to get it from POReceiptLineSplit also
        POReceiptLine poRow = PXSelect<POReceiptLine,
            Where<POReceiptLine.inventoryID, Equal<Required<POReceiptLine.inventoryID>>,
                And<POReceiptLine.lotSerialNbr, Equal<Required<POReceiptLine.lotSerialNbr>>,
                And<POReceiptLine.pOType, Equal<Required<POReceiptLine.pOType>>>>>>
            .Select(graph, row.InventoryID, row.LotSerialNbr, "RO");

        if (poRow != null) // guard against the case where no matching receipt line is found
        {
            SOLineExtension ext = PXCache<SOLine>.GetExtension<SOLineExtension>(row);
            ext.UsrUserDefinedCost = poRow.UnitCost;
        }
    }
}
Screenshot 1:-
Screenshot 2:-
You can iterate over the POReceiptLineSplit records through the 'splits' DataView of the base graph.
To find it, open the grid in Acumatica, hold Control+Alt and click on it. This brings up a popup with the DAC name of the records contained in the grid.
From there, select your customization project and click on 'Grid: splits'; the DataMember property of the grid is 'splits', which is the name of the DataView in the base class.
With that information, you can iterate over the content of the grid from the graph extension. Note that we use the Base prefix because we reference the base graph from within the extension.
foreach (POReceiptLineSplit split in Base.splits.Select())
{
PXTrace.WriteInformation("ReceiptNbr: {0}{1}LineNbr: {2}{3}SplitLineNbr: {4}",
split.ReceiptNbr, Environment.NewLine,
split.LineNbr, Environment.NewLine,
split.SplitLineNbr);
}
It is the SOLineSplit_LotSerialNbr_FieldUpdated event that needs to be extended in the SOOrder extension, and POReceiptLine_RowSelected in the POReceiptEntry extension. The code below will help get the unit cost from the purchase receipt.
POReceiptLine poRow = PXSelectJoin<POReceiptLine,
    LeftJoin<POReceiptLineSplit,
        On<POReceiptLine.receiptNbr, Equal<POReceiptLineSplit.receiptNbr>,
            And<POReceiptLine.inventoryID, Equal<POReceiptLineSplit.inventoryID>,
            And<POReceiptLine.lineNbr, Equal<POReceiptLineSplit.lineNbr>>>>>,
    Where<POReceiptLine.inventoryID, Equal<Required<POReceiptLine.inventoryID>>,
        And<POReceiptLineSplit.lotSerialNbr, Equal<Required<POReceiptLineSplit.lotSerialNbr>>,
        And<POReceiptLine.receiptType, Equal<Required<POReceiptLine.receiptType>>>>>>
    .Select(graph, row.InventoryID, row.LotSerialNbr, "RT");
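For reference, the BQL above corresponds roughly to the following SQL, in the spirit of the commented queries in the original question (a sketch; @InventoryID and @LotSerialNbr stand for the values taken from the SOLine row):

-- Rough SQL equivalent of the PXSelectJoin above, for illustration only
SELECT prl.UnitCost
FROM POReceiptLine prl
LEFT JOIN POReceiptLineSplit prls
    ON  prls.ReceiptNbr  = prl.ReceiptNbr
    AND prls.InventoryID = prl.InventoryID
    AND prls.LineNbr     = prl.LineNbr
WHERE prl.InventoryID   = @InventoryID
  AND prls.LotSerialNbr = @LotSerialNbr
  AND prl.ReceiptType   = 'RT';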