Speeding up DeltaTable .history() - delta-lake

Is there a way to speed up running .history() on a DeltaTable when the operation name is known and I only want the most recent occurrence of that operation? For instance: given a Delta table dt, I want to find the most recent "DELETE" in dt.history() and grab the value from operationMetrics. Right now it seems the only way to do this is to grab the entire history, filter it down to the "DELETE" operations, and return the topmost row from that. This is extremely slow. Is there another way to go about this, or possibly a way to step through the history manually without grabbing the entire history?
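One direction that may help (a sketch, assuming the delta-spark Python API, whose DeltaTable.history() accepts a limit on how many of the most recent commits it returns; the doubling-window helper below is hypothetical, not a built-in): page through the history in small batches instead of materializing all of it, which stays cheap whenever the DELETE happened recently.

from delta.tables import DeltaTable
from pyspark.sql.functions import col

def last_delete_metrics(spark, table_path, max_commits=1024):
    dt = DeltaTable.forPath(spark, table_path)
    window = 8
    while window <= max_commits:
        recent = dt.history(window)  # DataFrame of the `window` newest commits
        match = (recent
                 .filter(col("operation") == "DELETE")
                 .orderBy(col("version").desc())
                 .first())
        if match is not None:
            return match["operationMetrics"]
        window *= 2  # nothing found yet: widen the window and look further back
    return None

Each pass re-reads from the newest commit, so in the worst case it still touches the whole history, but when the DELETE is recent only a handful of commits are read.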

Related

Select last row from csv in Azure Data Factory

I'm pulling in a small (less than 100 KB) dataset as CSV. All I want to do is select the last row of that data and sink it into a different location.
I cannot seem to find a simple way to do this.
I have tried a wrangling data flow, but the "keep rows" M function is not supported - though you can select it, it just results in an error. That's annoying because it does exactly what I need in one fell swoop.
I sort of get it working using a last() function on each field, but that is a lot of messing around and it's slow.
Surely there is a better way to do this simple task?
Would greatly appreciate any assistance.
Thanks
Mapping Data Flows: add a Surrogate Key, take the Aggregate (max) of that key, then Filter to the row whose key equals the max.

How to overwrite fields when you COPY FROM data?

Is it possible to somehow overwrite existing counter fields when you COPY FROM data (from CSV), or to completely delete rows from the database? When I COPY FROM data into existing rows, the counters are summed. I can't completely DELETE these rows either: although the rows appear to be deleted, when you re-COPY FROM the data from the CSV, the counter fields continue to increase.
You can't set counters to a specific value - the only supported operations on them are increment and decrement. To set a counter to a specific value you either need to decrement it by its current value and then increment it to the desired value, which requires reading the current value first, or you need to delete the corresponding cells (or the whole row) and then perform an increment with the desired number.
The second approach is easier to implement, but requires that you first generate a file of CQL DELETE commands based on the content of your CSV file and then use COPY FROM - if nobody has incremented the values since the deletion, the counters will end up with the correct values.
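A minimal sketch of that second approach; the table my_ks.my_counters, its partition key id, and the CSV layout (id in the first column) are hypothetical and should be adjusted to the real schema:

import csv

# Emit one DELETE per CSV row so the counters reset before re-import.
with open("counters.csv", newline="") as src, open("deletes.cql", "w") as out:
    for row in csv.reader(src):
        row_id = row[0]
        out.write("DELETE FROM my_ks.my_counters WHERE id = '{}';\n".format(row_id))

# Then, for example:
#   cqlsh -f deletes.cql
#   cqlsh -e "COPY my_ks.my_counters FROM 'counters.csv'"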

How does Apache Spark Structured Streaming 2.3.0 let the sink know that a new row is an update of an existing row?

How does Spark Structured Streaming let the sink know that a new row is an update of an existing row when run in update mode? Does it look at all the values of all columns of the new row and an existing row for an equality match, or does it compute some sort of hash?
Reading the documentation, we see some interesting information about update mode:
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
So, to use update mode there needs to be some kind of aggregation; otherwise all data will simply be appended to the end of the result table. In turn, to use aggregation the data needs to use one or more columns as a key. Since a key is needed, it is easy to know whether a row has been updated or not - simply compare the values with the previous iteration of the table (the key tells you which row to compare with). In aggregations that contain a groupBy, the columns being grouped on are the keys.
Simple aggregations that return a single value do not require a key. However, since only a single value is returned, it will be updated whenever that value changes. An example here would be taking the sum of a column (without a groupBy).
The documentation contains a picture that gives a good understanding of this, see the "Model of the Quick Example" from the link above.
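For illustration, a minimal sketch of a keyed aggregation running in update mode (the socket source on localhost:9999 is just a stand-in input). Here word is the key, so only words whose count changed since the last trigger are written out each time:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("update-mode-demo").getOrCreate()

# Read lines of text from a socket source.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# groupBy("word") makes `word` the key of the aggregation.
counts = (lines
          .select(explode(split(lines.value, " ")).alias("word"))
          .groupBy("word")
          .count())

# In update mode, only rows whose count changed since the last trigger are emitted.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()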

What is the best way to create a new Spark dataframe column based on an existing column that requires an external API call?

I have a dataframe that I am working with in a Python based Jupyter notebook. I want to add an additional column based on the content of an existing column, where the content of the new column is derived from running an external API call on the original column.
The solution I attempted was to use a Python based UDF. The first cell contains something like this:
from pyspark.sql.functions import udf

def analysis(old_column):
    new_column = myapi.analyze(text=old_column)
    return new_column

analysis_udf = udf(analysis)
and the second cell this:
df2 = df1.withColumn("col2", analysis_udf('col1'))
df2.select('col2').show(n=5)
My dataframe is relatively large, with some 70,000 rows, and col1 can contain 100 to 10,000+ characters of text. When I ran the code above in cell 2, it actually seemed to run fairly quickly (minutes) and dumped out the 5 rows of the df2 dataframe. So I thought I was in business. However, my next cell had the following code:
df2.cache()
df2.filter(col('col2').isNull()).count()
The intent of this code is to cache the contents of the new dataframe to improve access time to the DF, and then count how many of the entries in the dataframe have null values generated by the UDF. This surprisingly (to me) took many hours to run, and eventually provided an output of 6. It's not clear to me why the second cell ran quickly and the third was slow. I would have thought that the df2.select('col2').show(n=5) call would have caused the UDF to run on all of the rows, so that call would have been slow and subsequent calls to access the new column of the dataframe would be quick. But that wasn't the case, so I then supposed that the cache call was the one actually causing the UDF to run on all of the rows, and that any subsequent calls should now be quick. So I added another cell with:
df2.show(n=5)
I assumed it would run quickly, but again it took much longer than I expected, and it seems like perhaps the UDF was running again. (?)
My questions are:
Which Spark API calls actually cause the UDF to run (or re-run), and how can I structure the calls so that the UDF runs only once and the new column is created with the text output by the UDF's Python function?
I have read that Python UDFs should be avoided because they are slow (which seems correct), so what alternatives do I have when I need to use an API call to generate the new column?
I would have thought that the df2.select('col2').show(n=5) call would have caused the UDF to run on all of the rows
That is not a correct assumption. Spark will evaluate as little data as possible, given the limitations of the API. Because you use a Python udf, it will evaluate the minimum number of partitions required to collect 5 rows.
Which Spark API calls actually cause the UDF to run (or re-run), and how can I structure the calls so that the UDF runs only once and the new column is created with the text output by the UDF's Python function?
Any evaluation, if the data is no longer cached (i.e., it has been evicted from memory).
Possibly any usage of the resulting column, unless the udf is marked as non-deterministic.
I have read that Python UDFs should be avoided because they are slow (which seems correct), so what alternatives do I have when I need to use an API call to generate the new column?
Unless you want to switch to Scala or the RDD API, the only alternative is pandas_udf, which is somewhat more efficient but supports only a limited subset of types.
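A sketch of both points combined, assuming Spark 3.x with PyArrow installed and reusing the question's myapi placeholder: a pandas_udf for the API call, plus a persist-then-materialize step so the column is computed once and later actions read from the cache.

import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("string")
def analysis_pandas_udf(texts: pd.Series) -> pd.Series:
    # Still one external call per value, but rows move between the JVM and
    # Python in Arrow batches instead of one at a time.
    return texts.map(lambda t: myapi.analyze(text=t))

df2 = df1.withColumn("col2", analysis_pandas_udf(col("col1")))

df2.persist()                                 # keep the computed column around
df2.count()                                   # forces every partition to evaluate once
df2.filter(col("col2").isNull()).count()      # now served from the cached data
df2.show(n=5)                                 # also served from cache

As noted above, cached partitions can still be evicted and the UDF can then re-run; writing df2 out (for example to Parquet) and reading it back is the sturdier way to guarantee a single evaluation.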

How to retrieve a very big Cassandra table and delete some unused data from it?

I have created a Cassandra table with 20 million records. Now I want to delete the expired data, determined by one non-primary-key column, but Cassandra doesn't support that operation on the column. So I tried to retrieve the table and go through the data line by line to delete it. Unfortunately, the table is too huge to retrieve. I also can't just delete the whole table, so how can I achieve my goal?
Your question is really about how to get the data from the table in chunks (also called pagination).
You can do that by selecting different slices of your primary key: for example, if your primary key is some sort of ID, select a range of IDs each time, process the results and do whatever you want with them, then get the next range, and so on.
Another way, which depends on the driver you're working with, is to use fetch_size. You can see a Python example here and a Java example here.
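A minimal sketch of the fetch_size approach with the Python cassandra-driver; the keyspace, table, and column names (my_ks, events, id, expires_at) are hypothetical:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from datetime import datetime

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_ks")

# Driver-level paging: rows arrive 1000 at a time as you iterate the result.
select_stmt = SimpleStatement("SELECT id, expires_at FROM events", fetch_size=1000)
delete_stmt = session.prepare("DELETE FROM events WHERE id = ?")

# The driver returns timestamps as naive UTC datetimes by default.
cutoff = datetime.utcnow()
for row in session.execute(select_stmt):
    if row.expires_at is not None and row.expires_at < cutoff:
        session.execute(delete_stmt, [row.id])

cluster.shutdown()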
