Is it possible to change the delta-import command to also delete unwanted documents based on a criterion each time delta-import runs?

I have dataImportScheduler configured, which posts an HTTP request to import the increments or changes into the index. What I want is that each time delta-import runs, it should also run a delete query with some criterion, e.g. documenttype:deleted, to delete the unwanted data in the index.
The delta-import URL I am using is:
http://address:8080/solr-multicore/dataimport?command=delta-import&clean=false&commit=true

You can use deletedPkQuery to clean up records which have been deleted from the source.
deletedPkQuery: used only in delta-import.
Example:
<entity name="album" query="SELECT * from albums" deletedPkQuery="SELECT deleted_id as id FROM deletes WHERE deleted_at > '${dataimporter.last_index_time}'">
This lets the delta-import delete those records from the index as well; the condition does not have to use the timestamp.

Yes, it is possible.
If you want to do deletes only, you can remove both deltaQuery and deltaImportQuery and use only deletedPkQuery, as in:
SELECT id FROM db WHERE deletion = 1 AND solrsync_date > '${dataimporter.db.last_index_time}'
Note: the condition for the delete can be anything.
Then run:
http://host:8983/solr/core/dataimport?command=delta-import
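For reference, here is a minimal Python sketch (host, core and parameter values are taken from the URLs above, so treat them as placeholders) of firing that delta-import from a scheduled job; deletedPkQuery is evaluated as part of this run:

import requests

# Trigger the DataImportHandler delta-import; deletedPkQuery runs during it.
params = {"command": "delta-import", "clean": "false", "commit": "true"}
resp = requests.get("http://host:8983/solr/core/dataimport", params=params)
resp.raise_for_status()
print(resp.text)  # DIH status response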

Related

Databricks Delta Live Tables - Apply Changes from delta table

I am working with Databricks Delta Live Tables, but have some problems with upserting some tables upstream. I know it is quite a long text below, but I tried to describe my problem as clearly as possible. Let me know if some parts are not clear.
I have the following tables and flow:
Landing_zone -> This is a folder in which JSON files are added that contain data of inserted or updated records.
Raw_table -> This is the data from the JSON files, but in table format. This table is in delta format. No transformations are done, except transforming the JSON structure into a tabular structure (I did an explode and then created columns from the JSON keys).
Intermediate_table -> This is the raw_table, but with some extra columns (depending on other column values).
To go from my landing zone to the raw table I have the following Pyspark code:
cloudfile = {"cloudFiles.format":"JSON",
"cloudFiles.schemaLocation": sourceschemalocation,
"cloudFiles.inferColumnTypes": True}
#dlt.view('landing_view')
def inc_view():
df = (spark
.readStream
.format('cloudFiles')
.options(**cloudFilesOptions)
.load(filpath_to_landing)
<Some transformations to go from JSON to tabular (explode, ...)>
return df
dlt.create_target_table('raw_table',
table_properties = {'delta.enableChangeDataFeed': 'true'})
dlt.apply_changes(target='raw_table',
source='landing_view',
keys=['id'],
sequence_by='updated_at')
This code works as expected: I run it, add a changes.JSON file to the landing zone, rerun the pipeline, and the upserts are correctly applied to the 'raw_table'.
(However, each time a new parquet file with all the data is created in the delta folder. I would expect that only a parquet file with the inserted and updated rows would be added, and that some information about the current version would be kept in the delta logs. Not sure if this is relevant to my problem. I have already changed the table_properties of the 'raw_table' to enableChangeDataFeed = true, and the readStream for 'intermediate_table' then has option('readChangeFeed', 'true').)
Then I have the following code to go from my 'raw_table' to my 'intermediate_table':
@dlt.table(name='V_raw_table', table_properties={'delta.enableChangeDataFeed': 'true'})
def raw_table():
    df = (spark.readStream
          .format('delta')
          .option('readChangeFeed', 'true')
          .table('LIVE.raw_table'))
    df = df.withColumn('ExtraCol', <Transformation>)
    return df
dlt.create_target_table('intermediate_table')

dlt.apply_changes(target='intermediate_table',
                  source='V_raw_table',
                  keys=['id'],
                  sequence_by='updated_at')
Unfortunately, when I run this, I get the error:
'Detected a data update (for example part-00000-7127bd29-6820-406c-a5a1-e76fc7126150-c000.snappy.parquet) in the source table at version 2. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.'
I looked into 'ignoreChanges', but I don't think this is what I want. I would expect the autoloader to be able to detect the changes in the delta table and pass them through the flow.
I am aware that readStream only works with appends, but that is why I would expect that after the 'raw_table' is updated, a new parquet file with only the inserts and updates would be added to the delta folder. This added parquet file would then be detected by the autoloader and could be used to apply the changes to the 'intermediate_table'.
Am I doing this the wrong way? Or am I overlooking something? Thanks in advance!
As readStream only works with appends, any change in the source files will create issues downstream. The assumption that an update on "raw_table" will only insert a new parquet file is incorrect. Based on settings like "optimized writes", or even without them, apply_changes can add or remove files. You can find this information in your "raw_table/_delta_log/xxx.json" under "numTargetFilesAdded" and "numTargetFilesRemoved".
Basically, "Databricks recommends you use Auto Loader to ingest only immutable files".
When you change the settings to include the option .option('readChangeFeed', 'true'), you should start with a full refresh (there is a dropdown near the Start button). Doing this will resolve the 'Detected a data update xxx' error, and your code should then work for the incremental updates.
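To see this for yourself, here is a small Python sketch (the log path and commit version are placeholders) that reads one commit entry from the table's _delta_log and prints the numTargetFilesAdded / numTargetFilesRemoved metrics mentioned above:

import json

# Placeholder path to one commit file of the raw_table's transaction log.
log_path = "/dbfs/path/to/raw_table/_delta_log/00000000000000000002.json"

with open(log_path) as f:
    for line in f:
        entry = json.loads(line)
        commit = entry.get("commitInfo")
        if commit:  # the commitInfo action carries the operation metrics
            metrics = commit.get("operationMetrics", {})
            print(commit.get("operation"),
                  "numTargetFilesAdded:", metrics.get("numTargetFilesAdded"),
                  "numTargetFilesRemoved:", metrics.get("numTargetFilesRemoved"))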

Get data from between temporary table create and drop

I have optimized the performance of one of my SQLite scripts by adding a temporary table, so now it looks like this:
create temp table temp.cache as select * from (...);
--- Complex query.
select * from (...);
drop table temp.cache;
The issue with this solution is that I can no longer use Pandas' pd.read_sql_query, because it doesn't return any result and throws an exception saying I'm allowed to execute only a single statement.
What would you say is the preferable solution? I can think of two:
Plan-A: there is some trick to extract the data anyway, or
Plan-B: I need to call Python's SQLite execute function before and after using Pandas, to handle the temporary table myself (sketched below).
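A minimal sketch of Plan-B, assuming a SQLite file and placeholder queries: the temporary table lives on the connection, so the same sqlite3 connection has to be reused for all three steps, with only the complex query going through Pandas.

import sqlite3
import pandas as pd

conn = sqlite3.connect("my_database.sqlite")  # placeholder database file
try:
    # Step 1: build the cache outside Pandas (placeholder source query).
    conn.execute("CREATE TEMP TABLE temp.cache AS SELECT * FROM source_table")
    # Step 2: run the complex query as a single statement through Pandas.
    df = pd.read_sql_query("SELECT * FROM temp.cache", conn)
finally:
    # Step 3: clean up the temporary table and the connection.
    conn.execute("DROP TABLE temp.cache")
    conn.close()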

How to manage mangled data when importing from your source in sqoop or pyspark

I have been working on a project to import the Danish 2.5 million ATM transactions data set to derive some visualizations.
The data is hosted on a MySQL server provided by the university. The objective is to import the data using Sqoop and then apply a few transformations to it using pyspark.
Link to the dataset here : https://www.kaggle.com/sparnord/danish-atm-transactions
The SQL server that hosts this information has a few rows which are intentionally or unintentionally mangled.
So I have a very basic sqoop command which gets the details from the source database. However, I run into an issue with values that contain a double quote ("), especially in the column message_text.
Sqoop command:
sqoop import --connect jdbc:mysql:{source-connection-string} --table SRC_ATM_TRANS --username {username} --password {password} --target-dir /user/root/etl_project --fields-terminated-by '|' --lines-terminated-by "\n" -m 1
Here is a sample row as it is imported:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction|0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds
However, the expected output should be:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction,0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds|Cloudy
At first I was okay with this hoping that pyspark would handle the mangled data since the delimiters are specified.
But now I run into issues when populating my dataframe.
transactions = spark.read.option("sep", "|").csv("/user/root/etl_project/part-m-00000", header=False, schema=transaction_schema)
However when I inspect my rows I see that the mangled data has caused the dataframe to put these affected values into a single column!
transactions.filter(transactions.message_code == "4017").collect()
Row(year=2017, month=u'January', day=1, weekday=u'Sunday', hour=17, atm_status=u'Active', atm_id=u'35', atm_manufacturer=u'NCR', atm_location=u'Aabybro', atm_streetname=u'\xc3\u0192\xcb\u0153stergade', atm_street_number=6, atm_zipcode=9440, atm_lat=57.162, atm_lon=9.73, currency=u'DKK', card_type=u'MasterCard', transaction_amount=7387, service=u'Withdrawal', message_code=u'4017', message_text=u'Suspected malfunction|0.000|57.158|10|2625037|0.000|276|1021|83|4|319.000|0|0|800|Clear', weather_lat=None, weather_lon=None, weather_city_id=None, weather_city_name=None, temp=None, pressure=None, humidity=None, wind_speed=None, wind_deg=None, rain_3h=None, clouds_all=None, weather_id=None, weather_main=None, weather_description=None)
At this point I am not sure what to do.
Do I go ahead and create temporary columns to manage this, and use a regex replacement to fill in these values?
Or is there a better way to import the data and manage these mangled values, either in Sqoop or in pyspark?
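One possible direction (a sketch under the assumption that the stray double quotes are the only source of the mangling, reusing the transaction_schema from above; not a verified fix): tell Spark's CSV reader not to treat the double quote as a quote character, so an unbalanced quote in message_text no longer swallows the following '|' delimiters.

transactions = (spark.read
    .option("sep", "|")
    .option("quote", "\u0000")  # disable quote handling; '"' becomes ordinary data
    .csv("/user/root/etl_project/part-m-00000",
         header=False,
         schema=transaction_schema))

Alternatively, the quotes could be escaped or stripped on the Sqoop side (for example with its --escaped-by option), but that depends on how the source column is stored.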

Spark Datasource Hudi table read using instant time

I'm reading a Hudi table using spark.read.format("hudi") and want to understand how the option hoodie.datasource.read.begin.instanttime works.
Is it similar to Hudi's hoodie_commit_ts column available in the parquet files?
I'm not able to get the same count between an external table on top of the same Hudi path filtered on the hoodie_commit_ts column and the approach below.
Sample code is here:
beginTime = '20201201194517'

incremental_read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': beginTime
}

Incremental_DF = (spark.read.format("org.apache.hudi")
                  .options(**incremental_read_options)
                  .load())
Incremental queries and hoodie.datasource.read.begin.instanttime are based on _hoodie_commit_time data from the metadata table.
What you are accomplishing with this is an incremental read starting from beginTime up to the most recent data upsert. If you pass the exact time of a commit as beginTime, your query won't contain that commit; you would have to pass (beginTime - 1) to include it.
You can also run point-in-time queries by additionally passing the option hoodie.datasource.read.end.instanttime to restrict your query to the window between beginTime and endTime (also exclusive).
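For illustration, a hedged sketch of such a point-in-time read (the end instant and the table path are placeholder values, using the same begin instant as the sample code above):

point_in_time_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': '20201201194517',
    'hoodie.datasource.read.end.instanttime': '20210101000000',  # placeholder end instant
}

point_in_time_df = (spark.read.format("org.apache.hudi")
                    .options(**point_in_time_options)
                    .load(base_path))  # base_path: path to the Hudi table (placeholder)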

Combine Hybris Impex Remove with Flexible Search

I would like to remove some items from a table using Impex. The following example throws no errors, but nothing is removed.
REMOVE ProductReference;pk[unique=true]
"#% impex.exportItemsFlexibleSearch(""select {pk} from {ProductReference as pr} where {pr.referenceType}=( {{select {pk} from {ProductReferenceTypeEnum as prte} where {prte.code} = 'CROSSELLING'}})"");"
The query produces results as expected. Is REMOVE not compatible with flexible search, or am I missing something?
The problem is that I am running an import over the hotfolder and I want to remove all existing items beforehand. Alternative solutions are welcome.
Importing the query-
REMOVE ProductReference;pk[unique=true]
"#% impex.exportItemsFlexibleSearch(""select {pk} from {ProductReference as pr} where {pr.referenceType}=( {{select {pk} from {ProductReferenceTypeEnum as prte} where {prte.code} = 'CROSSELLING'}})"");"
is not working because you have not selected the Enable code execution checkbox. Also, as suggested by B.M, replacing the script with impex.includeSQLData() and #% impex.initDatabase() would not have any effect if the checkbox is not selected.
However, selecting the checkbox and running the above script will give an error, because there is no method named exportItemsFlexibleSearch in the class MyImpExImportReader (which is called when running an import). The method exportItemsFlexibleSearch is available in DeprecatedExporter (which is called when running an export, not an import).
Now, running this impex script as an export will execute successfully without any error, but it won't remove anything. Instead, it will create a zip file with an impex file and a script file. The script file contains the impex header for removing the items returned by the query. Using this zip file we can delete the items conditionally.
Go to HMC -> Cronjobs -> Create a new cronjob of type Impex import job -> Upload the zip file in media attribute -> Create -> Run the impex.
This would delete the items returned by the query. There is another way of deleting the items selected by the query:
We need to export the script to generate a zip file containing the import script and the media file. The resultant zip file then needs to be imported with Enable code execution checked.
Alternatively, a Groovy script can be executed; an example:
import de.hybris.platform.servicelayer.search.FlexibleSearchQuery

flexibleSearchService = spring.getBean("flexibleSearchService")
modelService = spring.getBean("modelService")

query = "select {pk} from {trigger}"
flexibleSearchService.search(query).result.each {
    modelService.remove(it)
}
Use HAC -> SQL Query console to delete using a direct SQL command.
Enable the Commit option.
After running the update, go to the Monitoring tab and clear the cache.
