How to insert a data frame into Cassandra in Scala - apache-spark

I have a data frame as shown below and want to insert this data into a Cassandra table:
+---------+------+-----------+
| name | id | city |
+---------+------+-----------+
| sam | 123 | Atlanta |
| John | 456 | Texas |
+---------+------+-----------+
I am using the code below, but it inserts only the last row.
df.write.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "tablename", "keyspace" -> "keyspace"))
  .mode("append")
  .save()
How to insert a data frame into Cassandra in Scala?

The code you provided should insert all of the rows into Cassandra. There are a few reasons why it might not:
1) That really is all the data there is: df for some reason only contains a single row.
2) There are multiple rows, but they share the same partition key, so subsequent writes overwrite the initial write.
3) An exception is being thrown; this should be obvious in the logs.
A quick check for the first two cases is sketched below.
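To check the first two cases, count the rows and look for duplicate partition-key values before writing. This is a minimal sketch, shown in PySpark for brevity (the Scala DataFrame API calls are the same); it assumes the partition key of the target Cassandra table is the id column, which may not match your actual schema:

# df is the DataFrame from the question
print(df.count())   # is there really more than one row?

# per the reasoning above, rows that share the same key will
# overwrite each other when written to Cassandra
df.groupBy("id").count().filter("count > 1").show()   # assumption: "id" is the partition key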

Related

How to add rows to an existing partition in Spark?

I have to update historical data. By update, I mean adding new rows and sometimes new columns to an existing partition on S3.
The current partitioning is implemented by date: created_year={}/created_month={}/created_day={}. In order to avoid too many objects per partition, I do the following to maintain a single object per partition:
def save_repartitioned_dataframe(bucket_name, df):
    dest_path = form_path_string(bucket_name, repartitioned_data=True)
    print('Trying to save repartitioned data at: {}'.format(dest_path))
    df.repartition(1, "created_year", "created_month", "created_day").write.partitionBy(
        "created_year", "created_month", "created_day").parquet(dest_path)
    print('Data repartitioning complete at the following location: ')
    print(dest_path)
    _, count, distinct_count, num_partitions = read_dataframe_from_bucket(bucket_name, repartitioned_data=True)
    return count, distinct_count, num_partitions
A scenario exists where I have to add certain rows that have these columnar values:
created_year | created_month | created_day
2019         | 10            | 27
This means that the file(S3 object) at this path: created_year=2019/created_month=10/created_day=27/some_random_name.parquet will be appended with the new rows.
If there is a change in the schema, then all the objects will have to implement that change.
I tried looking into how this works generally; there are two write modes of interest: overwrite and append.
The first one will just add the current data and delete the rest, which I do not want. The second one will append, but may end up creating more objects, which I do not want either. I also read that DataFrames are immutable in Spark.
So, how do I append the new data as it arrives to existing partitions while maintaining one object per day?
Based on your question, I understand that you need to add new rows to the existing data while not increasing the number of parquet files. This can be achieved by doing operations on specific partition folders. There are three cases to consider.
1) New partition
This means the incoming data has a new value in the partition columns. In your case, this can be like:
Existing data
| year | month | day |
| ---- | ----- | --- |
| 2020 | 1 | 1 |
New data
| year | month | day |
| ---- | ----- | --- |
| 2020 | 1 | 2 |
So, in this case, you can just create a new partition folder for the incoming data and save it as you did.
partition_path = "/path/to/data/year=2020/month=1/day=2"
new_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
2) Existing partition, new data
This is where you want to append new rows to the existing data. It could be like:
Existing data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 1 |
New data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | b | 1 |
Here we have a new record for the same partition. You can use "append" mode, but you want a single parquet file in each partition folder. That's why you should read the existing partition first, union it with the new data, then write it back.
partition_path = "/path/to/data/year=2020/month=1/day=1"
old_data = spark.read.parquet(partition_path)
write_data = old_data.unionByName(new_data)
write_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
3) Existing partition, existing data
What if the incoming data is an UPDATE, rather than an INSERT? In this case, you should update a row instead of inserting a new one. Imagine this:
Existing data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 1 |
New data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 2 |
"a" had a value of 1 before, now we want it to be 2. So, in this case, you should read existing data and update existing records. This could be achieved like the following.
from pyspark.sql import functions as F

partition_path = "/path/to/data/year=2020/month=1/day=1"
old_data = spark.read.parquet(partition_path)
write_data = old_data.join(new_data, ["year", "month", "day", "key"], "outer")
write_data = write_data.select(
    "year", "month", "day", "key",
    # keep the new value when present, otherwise fall back to the old one
    F.coalesce(new_data["value"], old_data["value"]).alias("value")
)
write_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
When we outer join the old data with the new data, there can be four cases:
1) Both sides have the same value: it doesn't matter which one we take.
2) The two sides have different values: take the new value.
3) The old data doesn't have a value but the new data does: take the new one.
4) The new data doesn't have a value but the old data does: take the old one.
To fulfill what we desire here, coalesce from pyspark.sql.functions will do the work.
Note that this solution covers the second case as well.
About schema change
Spark supports schema merging for the Parquet file format. This means you can add columns to, or remove columns from, your data. As you add or remove columns, you will notice that some columns are not present when reading the data from the top-level path. This is because Spark disables schema merging by default. From the documentation:
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
To be able to read all columns, you need to set the mergeSchema option to true.
df = spark.read.option("mergeSchema", "true").parquet(path)
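As a hypothetical illustration of that option (the /tmp/merge_demo path and the column names are invented for the example), following the merging pattern described in the Spark documentation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# two "partitions" written with different but compatible schemas
spark.createDataFrame([(1, 10)], ["id", "a"]).write.parquet("/tmp/merge_demo/key=1")
spark.createDataFrame([(2, 20)], ["id", "b"]).write.parquet("/tmp/merge_demo/key=2")

# without mergeSchema only one file's schema is used; with it you see id, a, b (plus the key partition column)
merged = spark.read.option("mergeSchema", "true").parquet("/tmp/merge_demo")
merged.printSchema()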

Can Spark SQL not count correctly or can I not write SQL correctly?

In a Python notebook on Databricks "Community Edition", I'm experimenting with the City of San Francisco open data about emergency calls to 911 requesting firefighters. (The old 2016 copy of the data used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube) and made available on S3 for that tutorial.)
After mounting the data and reading it with the explicitly defined schema into a DataFrame fire_service_calls_df, I aliased that DataFrame as an SQL table:
sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")
With that and the DataFrame API, I can count the call types that occurred:
fire_service_calls_df.select('CallType').distinct().count()
Out[n]: 34
... or with SQL in Python:
spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()
+------------------------+
|count(DISTINCT CallType)|
+------------------------+
| 33|
+------------------------+
... or with an SQL cell:
%sql
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
Why do I get two different count results? (It seems like 34 is the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)
To answer the question
Can Spark SQL not count correctly or can I not write SQL correctly?
from the title: I can't write SQL correctly.
Rule <insert number> of writing SQL: Think about NULL and UNDEFINED.
%sql
SELECT count(*)
FROM (
SELECT DISTINCT CallType
FROM fireServiceCalls
)
34
Also, I apparently can't read:
pault suggested in a comment
With only 30 something values, you could just sort and print all the distinct items to see where the difference is.
Well, I actually thought of that myself (minus the sorting). Except there wasn't any difference: there were always 34 call types in the output, whether I generated it with SQL or DataFrame queries. I simply didn't notice that one of them was ominously named null:
+--------------------------------------------+
|CallType |
+--------------------------------------------+
|Elevator / Escalator Rescue |
|Marine Fire |
|Aircraft Emergency |
|Confined Space / Structure Collapse |
|Administrative |
|Alarms |
|Odor (Strange / Unknown) |
|Lightning Strike (Investigation) |
|null |
|Citizen Assist / Service Call |
|HazMat |
|Watercraft in Distress |
|Explosion |
|Oil Spill |
|Vehicle Fire |
|Suspicious Package |
|Train / Rail Fire |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other |
|Transfer |
|Outside Fire |
|Traffic Collision |
|Assist Police |
|Gas Leak (Natural and LP Gases) |
|Water Rescue |
|Electrical Hazard |
|High Angle Rescue |
|Structure Fire |
|Industrial Accidents |
|Medical Incident |
|Mutual Aid / Assist Outside Agency |
|Fuel Spill |
|Smoke Investigation (Outside) |
|Train / Rail Incident |
+--------------------------------------------+
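The difference comes down to NULL handling: the DataFrame's distinct().count() counts the null as a distinct value, while SQL's count(DISTINCT CallType) ignores NULLs. A minimal PySpark sketch (the column name and sample values are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alarms",), ("Marine Fire",), (None,)], ["CallType"])

print(df.select("CallType").distinct().count())   # 3 -- null counts as its own distinct value

df.createOrReplaceTempView("calls")
spark.sql("SELECT count(DISTINCT CallType) FROM calls").show()   # 2 -- NULL is ignored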

Spark create multiple Data frames from one Data frame

I am using Spark 2.1 with Cassandra (3.9) as the data source. C* has a big table with 50 columns, which is not a good data model for my use case, so I created split tables for each of those sensors, along with the partition key and clustering key columns.
All sensor table
-----------------------------------------------------
| Device | Time | Sensor1 | Sensor2 | Sensor3 |
| dev1 | 1507436000 | 50.3 | 1 | 1 |
| dev2 | 1507436100 | 90.2 | 0 | 1 |
| dev1 | 1507436100 | 28.1 | 1 | 1 |
-----------------------------------------------------
Sensor1 table
-------------------------------
| Device | Time | value |
| dev1 | 1507436000 | 50.3 |
| dev2 | 1507436100 | 90.2 |
| dev1 | 1507436100 | 28.1 |
-------------------------------
Now I am using spark to copy data from old table to new ones.
df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="allsensortables", keyspace="dataks") \
    .load().cache()
df.createOrReplaceTempView("data")
query = ('''select device, time, sensor1 as value from data''')
vgDF = spark.sql(query)
vgDF.write \
    .format("org.apache.spark.sql.cassandra") \
    .mode('append') \
    .options(table="sensor1", keyspace="dataks") \
    .save()
Copying the data one by one is taking a lot of time (2.1 hours) for a single table. Is there any way I can select * once and create multiple DataFrames, one per sensor, and save them all at once (or even sequentially)?
One issue in the code is the cache:
df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="allsensortables", keyspace="dataks") \
    .load().cache()
Here I don't see df being used multiple times apart from the save, so the cache is counterproductive. You are reading the data, filtering it, and saving it to a separate Cassandra table. The only action happening on the DataFrame is the save and nothing else.
So there is no benefit from caching the data here. Removing the cache will give you some speed-up.
To create the multiple tables sequentially, I would suggest using partitionBy to first write the data to HDFS, partitioned by sensor, and then write it back to Cassandra. A rough sketch of that idea follows.
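This is a hypothetical sketch, not code from the answer: it unpivots the wide table with stack so there is a sensor column to partition by, stages the result on HDFS, and then appends each sensor's slice to its own Cassandra table. The staging path is made up, and it assumes all sensor values can be cast to double; adjust to your schema.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

wide_df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="allsensortables", keyspace="dataks") \
    .load()

# unpivot into (device, time, sensor, value); casting to double is an assumption
long_df = wide_df.select(
    "device", "time",
    F.expr("stack(3, "
           "'sensor1', cast(sensor1 as double), "
           "'sensor2', cast(sensor2 as double), "
           "'sensor3', cast(sensor3 as double)) as (sensor, value)"))

# single pass over the source table, staged on HDFS partitioned by sensor
staging_path = "hdfs:///tmp/sensors_staging"   # made-up path
long_df.write.mode("overwrite").partitionBy("sensor").parquet(staging_path)

# each per-sensor slice is then appended to its own Cassandra table
for sensor in ["sensor1", "sensor2", "sensor3"]:
    (spark.read.parquet("{}/sensor={}".format(staging_path, sensor))
        .select("device", "time", "value")
        .write
        .format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(table=sensor, keyspace="dataks")
        .save())

This way the source table is read only once, and each per-sensor write scans only its own partition directory on HDFS.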

Cassandra COPY command to Update records

I want to update certain columns in a table using the Cassandra COPY feature, but COPY inserts a new record even when the row is not found. I want to restrict the COPY command so that the columns in the CSV file are applied only when the PRIMARY KEY row is found. A sample table and COPY command are shown below.
CREATE TABLE Orders(
    Ord_Id Text PRIMARY KEY,
    Ord_Date Int,
    Ord_Acct Text,
    Ord_Comp_Dt Int,
    Ord_Status Text);
Sample Data:
Ord_Id | Ord_Date | Ord_Acct | Ord_Comp_Dt | Ord_Status
ORD001 | 20170602 | A001 | 20170615 | InProgress
ORD002 | 20170603 | A002 | 20170607 | Failed
ORD003 | 20170604 | A003 | 20170616 | InProgress
ORD004 | 20170605 | A003 | 20170617 | InProgress
The above table gets a row entry when an order is placed, with an initial Ord_Status = 'InProgress'. On order completion, the network provides data with Ord_Id and Ord_Status.
Network Data
ORD_ID,ORD_STATUS
ORD001,Failed
ORD003,Success
ORD004,Rejected
ORD005,DataIncomplete
The COPY command is provided below:
COPY ord_schema.Orders(Ord_Id,Ord_Status) FROM 'NW170610.csv'
Table SnapShot after executing COPY command
Sample Data:
Ord_Id | Ord_Date | Ord_Acct | Ord_Comp_Dt | Ord_Status
ORD001 | 20170602 | A001 | 20170615 | Failed
ORD002 | 20170603 | A002 | 20170607 | Failed
ORD003 | 20170604 | A003 | 20170616 | InProgress
ORD004 | 20170605 | A003 | 20170617 | Rejected
ORD005 | Null | Null | Null | DataIncomplete
ORD005 should not be inserted when its primary key is not found.
Is there any way to check that the data exists before inserting, or to prevent the entry when the data does not exist?
Cassandra does an UPSERT, which means it will insert the row if there is none (based on the primary key).
What I'd suggest is adding another column, maybe Ord_Acct (something that brings uniqueness to the data), as a clustering/composite key. Then, if Ord_Acct is null, the insert won't happen. So, to summarize, I'd suggest changing the data model to meet your requirements.
It is not possible to check for the existence of a row before inserting with the COPY command.
The COPY command only parses the CSV and inserts directly. So you have to write your own code that reads the CSV NW170610 and, for each record, checks existence with a SELECT query; if the row exists, then insert.
Alternatively, dump the IDs of the Orders table using the COPY TO command:
COPY orders (ord_id) TO 'orders_id.csv';
Now, for each record of NW170610, check whether the ID is present in orders_id.csv; if yes, write the record to another file, complete_order.csv (a sketch of this filtering step is shown below).
Then just load the complete_order.csv file using the COPY FROM command.
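A minimal Python sketch of that filtering step, assuming both files are plain CSVs with the Ord_Id in the first column (file names taken from the question; add header handling if your files have header rows):

import csv

# primary keys that already exist in the Orders table (dumped with COPY TO above)
with open("orders_id.csv", newline="") as f:
    existing_ids = {row[0] for row in csv.reader(f) if row}

# keep only the network records whose Ord_Id is already present
with open("NW170610.csv", newline="") as src, \
        open("complete_order.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if row and row[0] in existing_ids:
            writer.writerow(row)

The resulting complete_order.csv can then be loaded with COPY FROM, and only existing rows will be touched.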

Better way to refresh imported columns?

I have a table in Spotfire with a couple of columns imported from another table as a lookup. As an example, Col2 is used as the match column for importing ImportedCol:
+------+------+-------------+
| Col1 | Col2 | ImportedCol |
+------+------+-------------+
| 1 | A | Val1 |
| 2 | B | Val2 |
| 3 | A | Val1 |
| 4 | C | Val3 |
| 5 | B | Val2 |
| 6 | A | Val1 |
| 7 | D | Val4 |
+------+------+-------------+
However, the data in Col2 is subject to change. In that event, I need ImportedCol to change with it; however, Spotfire seems to just keep the old imported data. Right now I've been deleting the imported column and re-adding it to refresh the link. Is there a way to dynamically import the data as the document loads, or with any refresh of the information links?
I have found that this happens sometimes, although I'm not exactly sure how to explain why. My workaround is to create "virtual" data tables based on your existing ones.
Consider your linked table as A and your embedded table as B. Start from a default state, that is, before importing any columns.
Add a new data table. The source for this table should be "From Current Analysis", using A. We will consider this one as C; it becomes your main data table, and C will update when any changes are made to A or B.
I found the issue.
It turns out that pivoting on data in the same table creates a circular reference, which overrides the embed/link setting on that table. My workaround was to make the pivot its own information link, then have the table join the original link and the new pivot one.
