I'm looking for an example dataset (or several) to test Delta Lake and Apache Iceberg, but I couldn't find any.
I want to test the MERGE operation of both and compare them, but a small example is not enough to measure performance and decide which one is better.
I would like a dataset with primary keys that provides the first version of the table, plus multiple datasets (small or large) containing the changes, so that I could test MERGE.
If anyone can help me, I'd appreciate it.
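In case it is useful while you search, one workaround is to generate a synthetic base table plus change batches yourself. Below is a minimal PySpark sketch for the Delta Lake side, assuming the delta-spark package is installed and configured; the paths, row counts and column names are arbitrary placeholders, and the Iceberg side would use its MERGE INTO SQL instead:

```python
from pyspark.sql import SparkSession, functions as F

# Exact setup depends on your deployment; these are the documented Delta configs.
spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

n_rows = 10_000_000     # size of the initial snapshot (placeholder)
n_changes = 1_000_000   # size of each change batch (placeholder)

# Version 0: primary key `id` plus a couple of payload columns.
base = (spark.range(n_rows)
        .withColumn("value", F.rand(seed=1))
        .withColumn("updated_at", F.current_timestamp()))
base.write.format("delta").mode("overwrite").save("/tmp/bench/base")

# A change batch: half updates to existing keys, half brand-new keys.
changes = (spark.range(n_rows - n_changes // 2, n_rows + n_changes // 2)
           .withColumn("value", F.rand(seed=2))
           .withColumn("updated_at", F.current_timestamp()))

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/bench/base")
(target.alias("t")
 .merge(changes.alias("s"), "t.id = s.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```

Timing several such MERGE batches of different sizes against both formats would give the comparison you describe.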
I am new to PySpark.
I need to analyze big CSV files.
At the beginning of the analysis I have to sort the data by ID and TIME.
I tried to use Dask for this, but I found that it gives wrong answers and often gets stuck partway through. So Dask is not good at sorting values, as mentioned in the link below; apparently this is because it works in a parallel way.
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.sort_values.html
My question is: how does PySpark handle this?
Does it give good and reliable results?
If the answer is yes, I would like to know how Spark sorts data in parallel and why this is difficult for Dask.
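For what it's worth, a minimal PySpark sketch of that sort could look like the following (the path and column names are placeholders). Spark's orderBy produces a total ordering by sampling the sort keys, range-partitioning the rows so each partition holds a contiguous key range, and then sorting each partition locally:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-csv").getOrCreate()

# Placeholder path and column names; adjust to your file.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("/path/to/big.csv"))

# orderBy gives a globally sorted result: partition 0 holds the smallest
# (ID, TIME) range, partition 1 the next range, and so on.
sorted_df = df.orderBy("ID", "TIME")

sorted_df.write.mode("overwrite").parquet("/path/to/sorted")
```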
I need to process a pretty huge .csv file (at least 10 million rows, hundreds of columns) with Python. I'd like:
To filter the content based on several criteria (mostly strings, maybe some regular expressions)
To consolidate the filtered data. For instance, grouping it by date and, for each date, counting occurrences based on a specific criterion. Pretty similar to what a pivot table does.
To have user-friendly access to that consolidated data
To generate charts (mostly basic line charts)
Processing must be fast AND light, because the computers at work cannot handle much and we're always in a hurry.
Given these prerequisites, could you please suggest some ideas? I thought about using pandas. I also thought about dumping the CSV into a SQLite database (because it may be easier to query if I build a user interface). But this is really my first foray into this world, so I don't know where to start. I don't have much time, but I'd be very glad if you could offer some advice, some good (and fresh) things to read, interesting libs and so forth. Sorry if Stack Overflow is not the best place to ask for this kind of help; I'll delete the post if needed. Regards.
Give xsv a shot. It is quite convenient, with decent speed, and it fits the Unix philosophy. However, if the dataset will be used more than ten times, I'd suggest converting the CSV to some binary format, and ClickHouse is a good choice for that.
There are two rather different situations:
Your reports (charts, pivot tables) use a limited number of columns from the original CSV, so you can pre-aggregate your large CSV file just once to get a much smaller dataset. This one-time processing can take some time (minutes), and there is no need to load the whole CSV into memory, because it can be processed as a data stream (row by row); a sketch of this approach follows the list. After that you can use the small dataset for fast processing (filtering, grouping, etc.).
You don't know which columns of the original CSV may be used for grouping and filtering, so pre-aggregation is not possible. In other words, all 10M rows need to be processed in real time (very fast); this is the OLAP use case. That is possible if you load the CSV data into memory once and then iterate over the 10M rows quickly whenever needed; if that is not possible, the only option is to import it into a database. SQLite is a good lightweight DB, and you can easily import a CSV with the sqlite3 command-line tool. Note that SQL queries over 10M rows might not be very fast, and you'll possibly need to add some indexes.
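For the first situation, a rough sketch of the one-time streaming pre-aggregation with pandas might look like this (the "date" and "status" columns and the filter value are made-up placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt

csv_path = "big.csv"          # placeholder path
daily_parts = []

# Stream the file in chunks so the full 10M rows never sit in memory at once.
for chunk in pd.read_csv(csv_path, chunksize=500_000, parse_dates=["date"]):
    filtered = chunk[chunk["status"] == "ERROR"]      # example string filter
    daily_parts.append(filtered.groupby(filtered["date"].dt.date).size())

# Combine the per-chunk counts into one small, reusable daily series.
daily_counts = pd.concat(daily_parts).groupby(level=0).sum()
daily_counts.to_csv("daily_counts.csv")               # the small pre-aggregated dataset

daily_counts.plot(kind="line")                        # basic line chart
plt.show()
```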
Another option might be a specialized OLAP database like Yandex ClickHouse: you can use it to query the CSV file directly with SQL (via the File table engine) or import the CSV into its column store. This database is lightning fast with GROUP BY queries (it can process 10M rows in under a second).
I am pretty new to Spark and would like some advice on how to approach the following problem.
I have candle data (high, low, open, close) for every minute of the trading day, spread across a year. This represents about 360,000 data points.
What I want to do is run some simulations across that data (possibly for every data point): for a given data point, get the previous (or next) x data points and then run some code across that window to produce a result.
Ideally this would be a map-style function, but you cannot do a nested operation in Spark. The only way I can think of doing it is to create a Dataset keyed by candle with the related data denormalised alongside it, or to partition it on every key; either way seems inefficient.
Ideally I am looking for something that does (Candle, List) -> Double or something similar.
I am sure there is a better approach.
I am using Spark 2.1.0 with YARN as the scheduling engine.
I've done a fair bit of time series processing in Spark, and have spent some time thinking about exactly the same problem.
Unfortunately, in my opinion, there is no nice way to process all of the data in the way you want without structuring it as you suggested. I think we just have to accept that this kind of thing is an expensive operation, whether we use Spark, pandas or Postgres.
You may hide the code complexity by using Spark SQL window functions (look at rangeBetween / RANGE BETWEEN), but the essence of what you are doing cannot be escaped.
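As a rough illustration of the window-function route, here is a PySpark sketch (the equivalent Window API exists in Scala/Java); the column names, the look-back size x, and the per-row computation are placeholders for whatever your simulation actually does:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy candle data: one row per minute (placeholder values).
candles = spark.createDataFrame(
    [(1, 1.10, 1.12, 1.09, 1.11),
     (2, 1.11, 1.13, 1.10, 1.12),
     (3, 1.12, 1.14, 1.11, 1.13),
     (4, 1.13, 1.15, 1.12, 1.14)],
    ["minute", "open", "high", "low", "close"],
)

x = 2  # look-back size; whatever your simulation needs

# Window over the previous x rows, excluding the current one. With no
# partitionBy Spark warns about using a single partition; for ~360k rows
# that is usually acceptable, or you can partition by trading day.
w = Window.orderBy("minute").rowsBetween(-x, -1)

result = (
    candles
    # The raw list of the previous closes, if the simulation needs it...
    .withColumn("prev_closes", F.collect_list("close").over(w))
    # ...or a direct aggregate standing in for your (Candle, List) -> Double.
    .withColumn("prev_avg_close", F.avg("close").over(w))
)
result.show(truncate=False)
```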
Protip: map the data to features->label once and write it to disk to make dev/testing faster!
I have two tables with a similar schema in the same cluster.
I want to compare the data in the two tables and generate a report. Is this possible using only HQL?
Or would you suggest a better approach?
Thanks.
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns) and shows you on a web page the differences that might appear: https://github.com/bolcom/hive_compared_bq
It doesn't currently give you a "full report"; it just pinpoints some of the differences found (the tool is intended more for the development cycle, to check whether code is correct), but I guess you could extend the final part of the program for that.
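If you do want a plain-HQL starting point, one common trick is a symmetric-difference query over a UNION ALL of the two tables: it lists rows that exist in only one of them, assuming neither table contains exact duplicate rows. A rough sketch (database, table and column names are placeholders), run here through PySpark only so it is a self-contained script; the query itself is ordinary HQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Rows that appear in exactly one of the two tables (placeholder names/columns).
diff = spark.sql("""
    SELECT MIN(src) AS only_in, id, col_a, col_b
    FROM (
        SELECT 'table_1' AS src, id, col_a, col_b FROM db.table_1
        UNION ALL
        SELECT 'table_2' AS src, id, col_a, col_b FROM db.table_2
    ) u
    GROUP BY id, col_a, col_b
    HAVING COUNT(*) = 1
""")

diff.show(50, truncate=False)
```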
I have a MySQL database with a single table containing about 100 million records (~25GB, ~5 columns). Using Apache Spark, I extract this data via a JDBC connector and store it in a DataFrame.
From here, I do some pre-processing of the data (e.g. replacing the NULL values), so I absolutely need to go through each record.
Then I would like to perform dimensionality reduction and feature selection (e.g. using PCA), perform clustering (e.g. K-Means) and later on do the testing of the model on new data.
I have implemented this with Spark's Java API, but it is too slow (for my purposes) because I do a lot of copying of the data from a DataFrame into a java.util.Vector and java.util.List (to be able to iterate over all records and do the pre-processing), and later back to a DataFrame (since PCA in Spark expects a DataFrame as input).
I have tried extracting information from the database into a org.apache.spark.sql.Column but cannot find a way to iterate over it.
I also tried avoiding the use of Java data structures (such as List and Vector) by using org.apache.spark.mllib.linalg.{DenseVector, SparseVector}, but couldn't get that to work either.
Finally, I also considered using JavaRDD (by creating it from a DataFrame and a custom schema), but couldn't work it out entirely.
After this lengthy description, my question is: is there a way to do all the steps mentioned in the first paragraph without copying all the data into a Java data structure?
Maybe one of the options I tried could actually work, but I just can't figure out how, as the docs and literature on Spark are a bit scarce.
From the wording of your question, it seems there is some confusion about the stages of Spark processing.
First, we tell Spark what to do by specifying inputs and transformations. At this point, the only things that are known are (a) the number of partitions at various stages of processing and (b) the schema of the data. org.apache.spark.sql.Column is used at this stage to identify the metadata associated with a column. However, it doesn't contain any of the data. In fact, there is no data at all at this stage.
Second, we tell Spark to execute an action on a dataframe/dataset. This is what kicks off processing. The input is read and flows through the various transformations and into the final action operation, be it collect or save or something else.
So, that explains why you cannot "extract information from the database into" a Column.
As for the core of your question, it's hard to comment without seeing your code and knowing exactly what you are trying to accomplish, but it is safe to say that all that migrating between types is a bad idea.
Here are a couple of questions that might help guide you to a better outcome:
Why can't you perform the data transformations you need by operating directly on the Row instances?
Would it be convenient to wrap some of your transformation code into a UDF or UDAF?
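To make the first point concrete, here is a rough PySpark sketch of the whole flow staying inside the DataFrame/ML APIs, with no copying into language-level collections; the same DataFrameReader, na.fill, VectorAssembler, PCA and KMeans pieces exist in the Java API. Connection details, column names and parameter values are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, PCA
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()

# Hypothetical JDBC read; url/table/credentials are placeholders.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")
      .option("dbtable", "my_table")
      .option("user", "user")
      .option("password", "password")
      .load())

# Pre-processing stays inside the DataFrame API: no copy to Java lists.
clean = df.na.fill(0.0)  # e.g. replace NULLs in numeric columns with 0.0

# Assemble feature columns into the single vector column that PCA expects.
feature_cols = ["col1", "col2", "col3"]  # assumed numeric columns
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
features = assembler.transform(clean)

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
reduced = pca.fit(features).transform(features)

kmeans = KMeans(k=5, featuresCol="pca_features", predictionCol="cluster")
clustered = kmeans.fit(reduced).transform(reduced)
clustered.select("pca_features", "cluster").show(5)
```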
Hope this helps.