How to create a DataFrame from a relational database with sparse columns in Spark? - apache-spark

I have to read some data from a relational database to do some machine learning in Spark. However, the table I have to read has some sparse columns. Also, it has a column called "SpecialPurposeColumns" which contains non-zero data in XML format, like:
<Age>76</Age><ID>78</ID><Income>87000</Income> ... <ZIP>96733</ZIP>
What is a good way to create a DataFrame in Spark from this data?
Thanks in advance
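One possible approach, shown here only as a minimal sketch (the JDBC URL, table name, and XML tag names below are placeholder assumptions, not from the original post), is to read the table over JDBC and extract individual fields from the SpecialPurposeColumns XML with regexp_extract:
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col

spark = SparkSession.builder.appName("sparse-table-example").getOrCreate()

# Hypothetical JDBC connection details -- replace with your own.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")
      .option("dbtable", "dbo.MyTable")
      .option("user", "user")
      .option("password", "password")
      .load())

# Pull a few fields out of the XML column; regexp_extract returns "" when a
# tag is absent, and the cast turns that into null for the sparse rows.
parsed = (df
          .withColumn("Age", regexp_extract(col("SpecialPurposeColumns"), r"<Age>(\d+)</Age>", 1).cast("int"))
          .withColumn("Income", regexp_extract(col("SpecialPurposeColumns"), r"<Income>(\d+)</Income>", 1).cast("int")))
parsed.show()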

Related

Spark Parquet partitioning removes the partition column

If I use df.write.partitionBy(col1).parquet(path),
the partition column is removed from the written data.
How can I avoid that?
You can duplicate col1 before writing:
from pyspark.sql.functions import col
df.withColumn("partition_col", col("col1")).write.partitionBy("partition_col").parquet(path)
Note that this step is not really necessary, because whenever you read a Parquet file in a partitioned directory structure, Spark will automatically add that as a new column to the dataframe.
Actually, Spark does not remove the column; it uses that column to organize the files on disk, and when you read the files back it adds it as a column again and displays it to you in table format. If you check the schema of the table or of the dataframe, you will still see it as a column.
Also, since you chose to partition your data, you presumably know how that table is queried most frequently, and partitioning on that column makes those reads faster and more efficient.
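As a quick illustration of the point above (a small sketch with made-up column names and an arbitrary output path), the partition column is not stored inside the individual Parquet files, but it reappears in the schema as soon as the partitioned directory is read back:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("US", 10), ("DE", 20), ("US", 30)],
    ["country", "amount"],
)

# Each country value becomes a subdirectory such as .../country=US/
df.write.mode("overwrite").partitionBy("country").parquet("/tmp/partitioned_demo")

# Reading the partitioned directory restores "country" as a regular column.
restored = spark.read.parquet("/tmp/partitioned_demo")
restored.printSchema()  # shows both country and amount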

How does table data get loaded into a dataframe in Databricks? Row by row or in bulk?

I am new to Databricks notebooks and dataframes. I have a requirement to load a few columns (out of many) from a table of around 14 million records into a dataframe. Once the table is loaded, I need to create a new column based on values present in two of the columns.
I want to write the logic for the new column along with the select command while loading the table into the dataframe.
Ex:
df = (spark.read.table(tableName)
      .select(columnsList)
      .withColumn('newColumnName', 'logic'))
Will it have any performance impact? Is it better to first load just those few columns of the table into the df and then perform the column manipulation on the loaded df?
Does the table data get loaded all at once or row by row into the df? If row by row, then by including the column manipulation logic while reading the table, am I causing any performance degradation?
Thanks in advance!!
This really depends on the underlying format of the table - is it backed by Parquet or Delta, or is it an interface to an actual database, etc. In general, Spark tries to read only the necessary data, and if, for example, Parquet (or Delta) is used, this is easier because it is a column-oriented file format, so the data for each column is placed together.
Regarding the question on reading - Spark is lazy by default, so even if you assign df = spark.read.table(....) to a separate variable, then add .select, and then add .withColumn, it won't do anything until you call some action, for example .count, or write your results. Until that time, Spark will just check that the table exists, that your operations are correct, etc. You can always call .explain on the resulting dataframe to see how Spark will perform the operations.
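A small sketch of that laziness (the table and column names below are made up for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Nothing is read yet -- these calls only build a logical plan.
df = (spark.read.table("my_schema.employees")            # hypothetical table
        .select("emp_id", "base_salary", "bonus")
        .withColumn("total_comp", col("base_salary") + col("bonus")))

# Inspect the plan Spark intends to execute (still no data read).
df.explain()

# Only an action such as count() or a write triggers the actual scan, and with
# a columnar source only the three selected columns are read.
df.count()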
P.S. I recommend grabbing a free copy of Learning Spark, 2nd edition, which is provided by Databricks - it will give you a foundation for developing code for Spark/Databricks.

Data is not getting written in sorted order to the target Oracle table through Spark

I have a table in Hive with the below schema:
emp_id:int
emp_name:string
I have created a dataframe from the above Hive table:
df = sql_context.sql('SELECT * FROM employee ORDER by emp_id')
df.show()
After the above code runs, I see that the data is sorted properly on emp_id.
I am trying to write the data to an Oracle table through the below code:
df.write.jdbc(url=url, table='target_table', properties=properties, mode="overwrite")
As per my understanding, this is happening because multiple executor processes run at the same time, one per data partition: the sorting applied through the query holds within each partition, but when multiple processes write data to Oracle at the same time, the ordering in the result table is distorted.
I further tried repartitioning the data to just one partition (which is not an ideal solution), and after writing the data to Oracle the sorting worked properly.
Is there any way to write sorted data to an RDBMS from Spark?
TL;DR When working with relational systems you should never depend on the insert order. Spark is not really relevant here.
Relational databases, including Oracle, don't guarantee any intrinsic order of the stored data. The exact order of the stored records is an implementation detail and can change during the lifetime of the data.
The sole exception in Oracle is Index-Organized Tables, where:
data for an index-organized table is stored in a B-tree index structure in a primary key sorted manner.
This of course requires a primary key which can reliably determine order.
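In practice that means requesting the order at read time rather than relying on insert order. A minimal sketch (not part of the answer above; the url, properties, and the assumption of a default single-partition JDBC read are taken from the question):
# Write without worrying about order; partitions are written concurrently.
df.write.jdbc(url=url, table='target_table', properties=properties, mode='overwrite')

# When order matters, ask the database for it when reading the data back.
# With the default single-partition JDBC read, the ORDER BY in the subquery
# should be reflected in the resulting dataframe.
ordered = spark.read.jdbc(
    url=url,
    table='(SELECT * FROM target_table ORDER BY emp_id) t',
    properties=properties)
ordered.show()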

Cassandra data modelling - Select subset of rows in large table for batch Spark processing

I am working on a project in which we ingest large volumes of structured raw data into a Cassandra table and then transform it to a target schema using Spark.
Since our raw data table is getting pretty large, we would like to process it in batches. That means Spark has to look at our raw data table to identify the not-yet-processed rows (by partition key) and then load that subset of rows into Spark.
Being new to Cassandra, I am now wondering how to implement this. Using Spark, I can quite efficiently load the raw data keys and compare them with the keys in the transformed table to identify the subset. But what is the best strategy for loading that subset of rows from my raw data table?
Here is an example. If my schema looks like this:
CREATE TABLE raw_data (
dataset_id text PRIMARY KEY,
some_json_data text
);
...and if I have dataset_ids 1, 2, 3, 4, 5 in my table and know that I now need to process the rows with ids 4 and 5, how can I efficiently select those rows, knowing that the list of ids can be pretty long in practice?
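One possible approach, sketched here only as an illustration (the keyspace name is assumed, and the pushdown behaviour should be verified against your connector version), is to load the table through the Spark Cassandra connector and filter on the partition key, which the connector can often push down to Cassandra for IN lists on partition keys:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Keys identified as not yet processed (in practice this list could be long or
# come from another DataFrame built by comparing raw and transformed keys).
ids_to_process = ["4", "5"]

raw = (spark.read
       .format("org.apache.spark.sql.cassandra")
       .options(table="raw_data", keyspace="my_keyspace")   # keyspace is assumed
       .load())

# Filter on the partition key; check subset.explain() to confirm the predicate
# is pushed down to Cassandra rather than applied after a full scan.
subset = raw.where(col("dataset_id").isin(ids_to_process))
subset.show()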

Transfer time series data from PySpark to Cassandra

I have a Spark cluster and a Cassandra cluster. In pyspark I read a csv file and then transform it to an RDD. I then go through every row in my RDD and use a mapper and a reducer function. I end up getting the following output (I've shortened this list for demonstration purposes):
[(u'20170115', u'JM', u'COP'), (u'20170115', u'JM', u'GOV'), (u'20170115', u'BM', u'REB'), (u'20170115', u'OC', u'POL'), (u'20170114', u'BA', u'EDU')]
I want to go through each row in the array above and store each tuple in one table in Cassandra, and I want the unique key to be the date. Now I know that I can turn this array into a dataframe and then store it in Cassandra (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15_python.md#saving-a-dataframe-in-python-to-cassandra). If I turn the list into a dataframe and then store it in Cassandra, will Cassandra still be able to handle it? I guess I'm not fully understanding how Cassandra stores values. In my array the dates are repeated, but the other values are different.
What is the best way for me to store the data above in Cassandra? Is there a way for me to store data directly from Spark to Cassandra using Python?
Earlier versions of DSE 4.x supported RDDs, but the current connector for DSE and open source Cassandra is "limited to DataFrame only operations."
PySpark with Data Frames
You stated "I want the unique key to be the date". I assume you mean the partition key, since the date is not unique in your example. It's OK to use the date as the partition key (assuming the partitions will not be too large), but your primary key needs to be unique.
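A small sketch of what that could look like (the column names, keyspace, and table name are hypothetical, and the target table is assumed to already exist in Cassandra with the CQL shown in the comment):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [(u'20170115', u'JM', u'COP'), (u'20170115', u'JM', u'GOV'),
        (u'20170115', u'BM', u'REB'), (u'20170115', u'OC', u'POL'),
        (u'20170114', u'BA', u'EDU')]

# "event_date" alone is not unique, so the other columns become clustering
# columns to make the primary key unique, e.g. (hypothetical table):
#   CREATE TABLE my_keyspace.events (
#       event_date text, source text, category text,
#       PRIMARY KEY (event_date, source, category));
df = spark.createDataFrame(rows, ["event_date", "source", "category"])

(df.write
   .format("org.apache.spark.sql.cassandra")
   .mode("append")
   .options(table="events", keyspace="my_keyspace")
   .save())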
