Apache Spark search inside a dataset - apache-spark

I am new to Spark and trying to achieve the following, but I am not able to find the best way to do it. Please suggest.
I am using Spark 2.0 and its Dataset API, Cassandra 3.7, and the Cassandra Java connector.
I have one column family with a partition key and 2 clustering keys. For example:
myksyspace.myTable (some_id, col1, col2, col3, col4, col5, PRIMARY KEY (some_id, col1, col2))
I can load the data of myksyspace.myTable into a Dataset, myTableDataset. The table has a large number of rows (maybe 200,000).
Every hour I receive updated data from another source, which may contain new rows that are not yet in my database and that I want to save.
The data I receive from the other source contains updated values but no value for “col2”. I get the rest of the data in a list, “dataListWithSomeNewData”, in my Java code.
Now I want to match the items in the list with the rows in myTableDataset, copy col2 from the dataset into “dataListWithSomeNewData”, then build a new dataset and save it back to the database, so that my existing data is updated. For list items that have no match in the database, I want to insert them with a newly generated unique value for col2. How do I achieve this?
I want to avoid collectAsList() on the dataset so that I don't run out of memory, since I may load a large amount of data. With collectAsList() the code works, but only with a small amount of data.
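Roughly, this is the kind of join-based approach I am trying to express (a sketch only, written in Scala for brevity; the bean class, the keyspace/table option values, and the unique-value generator are placeholders, not tested code):

// Sketch: "incoming" is the hourly list converted to a Dataset with columns
// some_id, col1, col3, col4, col5 and no col2. Names not taken from the
// question (MyRowBean, "myks"/"mytable", the UUID generator) are placeholders.
import java.util.UUID
import org.apache.spark.sql.functions._

val incoming = spark.createDataFrame(dataListWithSomeNewData, classOf[MyRowBean])

// col2 values already stored for the existing keys
val existing = myTableDataset.select("some_id", "col1", "col2")

// placeholder generator for col2 of genuinely new rows
val genCol2 = udf(() => UUID.randomUUID().toString)

val merged = incoming
  .join(existing, Seq("some_id", "col1"), "left_outer")   // distributed join, no collectAsList()
  .withColumn("col2", coalesce(col("col2"), genCol2()))   // reuse existing col2, otherwise generate one

merged.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "myksyspace", "table" -> "mytable"))
  .mode("append")
  .save()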
Any suggestion or idea on how I can achieve this?
Thank you in advance

Related

Spark parquet partitioning removes the partition column

If I am using df.write.partitionBy("col1").parquet(path),
the partition column is removed from the written data.
How do I avoid that?
You can duplicate col1 before writing:
df.withColumn("partition_col", col("col1")).write.partitionBy("partition_col").parquet(path)
Note that this step is not really necessary, because whenever you read a Parquet file in a partitioned directory structure, Spark will automatically add that as a new column to the dataframe.
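For illustration, a small sketch of that round trip (the output path is made up):

// Write partitioned by col1, then read the directory back.
// Spark infers the partition column from the directory names
// (e.g. .../col1=A/part-...parquet) and adds it back to the schema.
df.write.partitionBy("col1").parquet("/tmp/partitioned_output")   // illustrative path

val readBack = spark.read.parquet("/tmp/partitioned_output")
readBack.printSchema()   // col1 shows up as a column again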
Actually, Spark does not remove the column; it uses that column to organize the files, so that when you read them back it adds it as a column again and displays it to you in the table output. If you check the schema of the table or the schema of the dataframe, you will still see it as a column.
Also, since you are partitioning your data, you presumably know how that table is queried most frequently, and based on that information you decided to partition the data so that your reads become faster and more efficient.

Incremental load without date or primary key column using azure data factory

I have a source, let's say a SQL Server or an Oracle database, and I want to pull the table data into an Azure SQL database. The problem is that the table has neither a date column recording when data is inserted nor a primary key column. Is there any other way to perform this operation?
One way of doing it semi-incrementally is to partition the table by a fairly stable column in the source table; you can then use a mapping data flow to compare the partitions (this can be done with row counts, aggregations, hashbytes, etc.). On each load you store the comparison output in the partition metadata somewhere, so that you can compare against it the next time you load. That way you only reload the partitions that changed since your last load.
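Outside of ADF, the comparison idea itself can be sketched in Spark, purely for illustration (the connection string, table, column, and path names below are all made up):

// Compute one fingerprint per bucket of a fairly stable column, then compare
// it with the fingerprints stored during the previous load.
import org.apache.spark.sql.functions._

val source = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://source-host;databaseName=src")   // illustrative
  .option("dbtable", "dbo.big_table")
  .load()

// One row per partition value: row count plus a cheap content hash.
val current = source
  .groupBy("region")                                    // the "fairly stable" column
  .agg(count(lit(1)).as("rows"),
       sum(hash(source.columns.map(col): _*)).as("fingerprint"))

// Fingerprints persisted by the previous run (assumed to have the same layout).
val previous = spark.read.parquet("/metadata/fingerprints")
  .withColumnRenamed("rows", "prev_rows")
  .withColumnRenamed("fingerprint", "prev_fingerprint")

// Partition values that are new or whose fingerprint changed need to be reloaded.
val changed = current.join(previous, Seq("region"), "left_outer")
  .where(col("prev_fingerprint").isNull || col("fingerprint") =!= col("prev_fingerprint"))
  .select("region")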

Sorting Results by time in Cassandra

I'm trying to get some time series data from Cassandra
My table is shown in the picture below, and when I query it I get the data in the following order:
first I see all the false rows, regardless of when I inserted them into Cassandra, and then I see all the true rows.
My question is: how can I sort or order the data by insertion time, so that I can reliably get the rows back in the order in which I inserted them?
When I try "select c1 from table1 order by c2", I get the following error "ORDER BY is only supported when the partition key is restricted by an EQ or an IN."
Thank you
[Image: my boolean table]
Assuming that your schema is something like the following (the question does not show column types, so boolean and timestamp are assumed here):
CREATE TABLE table1 (
    c1 boolean,
    c2 timestamp,
    PRIMARY KEY (c1)
);
This will result in 2 partitions in your table (c1 = true and c1=false). Each partition will be managed by a single node.
Your initial query retrieves data from your table across all partitions: it goes to the first partition and retrieves all of its rows, then to the second and retrieves all of its rows, which is why you're seeing the results you do.
Cassandra is optimised for retrieving data across one partition only, so you should look at adjusting your schema to allow that - to use ORDER BY in the query, you need to be retrieving data across one partition only.
Depending on your use case, you could look at bucketing your data or performing the sorting in your application.
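As an illustration of the second option (sorting in the application), here is a sketch using Spark with the Cassandra connector; the keyspace name is a placeholder, and this assumes the table is small enough to read in full:

// Read the table into Spark and sort by the time column in the application
// rather than in Cassandra. "myks" is a placeholder keyspace name.
val rows = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "myks", "table" -> "table1"))
  .load()

rows.orderBy("c2").show()   // rows ordered by c2 across both partitions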

Cassandra data modelling - Select subset of rows in large table for batch Spark processing

I am working on a project in which we ingest large volumes of structured raw data into a Cassandra table and then transform it to a target schema using Spark.
Since our raw data table is getting pretty large we would like to process it in batches. That means Spark has to look at our raw data table to identify not yet processed rows (by partition key) and then load the subset of these rows into Spark.
Being new to Cassandra, I am now wondering how to implement this. Using Spark, I can quite efficiently load the raw data keys and compare them with the keys in the transformed table to identify the subset (sketched after the example below). But what is the best strategy for loading this subset of rows from my raw data table?
Here is an example. If my schema looks like this:
CREATE TABLE raw_data (
dataset_id text PRIMARY KEY,
some_json_data text
);
...and if I have dataset_ids 1,2,3,4,5 in my table and know that I now need to process the rows with ids 4, 5, how can I efficiently select those rows knowing that the list of ids can be pretty long in practice?
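For context, the key-comparison step mentioned above might look roughly like this (a sketch; the keyspace name and the existence of a transformed_data table holding the processed keys are assumptions):

// Load only the keys from the raw table and from the already-transformed table,
// then keep the dataset_ids that have not been processed yet.
val rawKeys = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "myks", "table" -> "raw_data"))
  .load()
  .select("dataset_id")

val processedKeys = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "myks", "table" -> "transformed_data"))
  .load()
  .select("dataset_id")

// dataset_ids present in raw_data but not yet in the transformed table
val toProcess = rawKeys.except(processedKeys)

How to then pull only those rows back out of raw_data efficiently is the open question.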

Dataframe column with two different names

I want to know whether a Spark data frame can have two different names for a column.
I know that by using "withColumn" I can add a new column, but I do not want to add a new column to the data frame; I just want to have an alias name for an existing column.
For example, suppose there is a data frame with 3 columns, "Col1, Col2, Col3".
Can anyone please let me know if I can give an alias name to Col3, so that I can retrieve the data of 'Col3' under the name "Col4" as well?
EDIT: possible duplicate: Usage of spark DataFrame "as" method
It looks like there are several ways, depending on the Spark library and client library you're using.
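For instance, with the Scala DataFrame API one common way is to select the column under an alias (a sketch using the column names from the question; "df" is an illustrative variable name):

import org.apache.spark.sql.functions.col

// Retrieve the data of Col3 under the name Col4, leaving df itself unchanged.
df.select(col("Col1"), col("Col2"), col("Col3"), col("Col3").alias("Col4")).show()

// Or, if only the aliased view is needed:
df.select(col("Col3").alias("Col4")).show()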
