Spark: create multiple DataFrames from one DataFrame - apache-spark

I am using Spark 2.1 with Cassandra (3.9) as the data source. Cassandra has a big table with 50 columns, which is not a good data model for my use case, so I created a separate table for each of those sensors, along with the partition key and clustering key columns.
All sensor table
-----------------------------------------------------
| Device | Time       | Sensor1 | Sensor2 | Sensor3 |
| dev1   | 1507436000 | 50.3    | 1       | 1       |
| dev2   | 1507436100 | 90.2    | 0       | 1       |
| dev1   | 1507436100 | 28.1    | 1       | 1       |
-----------------------------------------------------
Sensor1 table
-------------------------------
| Device | Time       | value |
| dev1   | 1507436000 | 50.3  |
| dev2   | 1507436100 | 90.2  |
| dev1   | 1507436100 | 28.1  |
-------------------------------
Now I am using Spark to copy data from the old table to the new ones.
df = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="allsensortables", keyspace="dataks")\
    .load().cache()
df.createOrReplaceTempView("data")
query = ('''select device, time, sensor1 as value from data''')
vgDF = spark.sql(query)
vgDF.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="sensor1", keyspace="dataks")\
    .save()
Copying the data table by table like this is taking a lot of time (2.1 hours for a single table). Is there any way I can select * once, create multiple DataFrames, one per sensor, and save them all at once (or even sequentially)?

One issue in the code is the cache:
df = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="allsensortables", keyspace="dataks")\
    .load().cache()
Here I don't see df being used more than once apart from the save, so the cache is counterproductive. You are reading the data, filtering it, and saving it to a separate Cassandra table; the only action happening on the DataFrame is the save and nothing else.
So there is no benefit from caching the data here, and removing the cache will give you some speedup.
To create the multiple tables sequentially, I would suggest using partitionBy: first write the data to HDFS partitioned by sensor, and then write it back to Cassandra, as in the sketch below.
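A minimal sketch of that idea. It assumes the wide table has just the three sensor columns shown above and that an HDFS staging path like hdfs:///tmp/sensor_staging is available; both are assumptions, not taken from the question. The sensor columns first have to be unpivoted into rows so that partitionBy("sensor") has something to partition on, and the casts are only there to give stack a single common value type (adjust to your real column types):

df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="allsensortables", keyspace="dataks") \
    .load()  # no cache: the data is only passed through once

# Unpivot the wide sensor columns into long form: (device, time, sensor, value)
long_df = df.selectExpr(
    "device", "time",
    "stack(3, 'sensor1', cast(sensor1 as double), "
    "'sensor2', cast(sensor2 as double), "
    "'sensor3', cast(sensor3 as double)) as (sensor, value)"
)

# Stage once to HDFS, partitioned by sensor (hypothetical path)
staging_path = "hdfs:///tmp/sensor_staging"
long_df.write.mode("overwrite").partitionBy("sensor").parquet(staging_path)

# Load each sensor partition back and append it to its own Cassandra table
for sensor in ["sensor1", "sensor2", "sensor3"]:
    spark.read.parquet("{}/sensor={}".format(staging_path, sensor)) \
        .write \
        .format("org.apache.spark.sql.cassandra") \
        .mode("append") \
        .options(table=sensor, keyspace="dataks") \
        .save()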

Related

Multiple metrics over large dataset in spark

I have a big dataset, grouped by a certain field, on which I need to run descriptive statistics for each column.
Let's say the dataset is 200M+ records and there are about 15 stat functions that I need to run - sum/avg/min/max/stddev etc. The problem is that it's very hard to scale that task since there's no clear way to partition the dataset.
Example dataset:
+------------+----------+-------+-----------+------------+
| Department | PartName | Price | UnitsSold | PartNumber |
+------------+----------+-------+-----------+------------+
| Texas | Gadget1 | 5 | 100 | 5943 |
| Florida | Gadget3 | 484 | 2400 | 4233 |
| Alaska | Gadget34 | 44 | 200 | 4235 |
+------------+----------+-------+-----------+------------+
Right now I am doing it this way (example):
columns_to_profile = ['Price', 'UnitsSold', 'PartNumber']
functions = [
    Function(F.mean, 'mean'),
    Function(F.min, 'min_value'),
    Function(F.max, 'max_value'),
    Function(F.variance, 'variance'),
    Function(F.kurtosis, 'kurtosis'),
    Function(F.stddev, 'std'),
    Function(F.skewness, 'skewness'),
    Function(count_zeros, 'n_zeros'),
    Function(F.sum, 'sum'),
    Function(num_hist, "hist_data"),
]
functions_to_apply = [f.function(c).alias(f'{c}${f.alias}')
                      for c in columns_to_profile for f in get_functions(column_types, c)]
df.groupby('Department').agg(*functions_to_apply).toPandas()
The problem here is that the actual list of functions is bigger than this (about 16-20), each applied to every column, yet the cluster spends most of its time shuffling and CPU load is only around 5-10%.
How should I partition this data, or is my approach incorrect?
If the departments are skewed (i.e. Texas has 90% of the volume), what should my approach be?
This is my Spark DAG for this job:

Efficiently update rows of a postgres table from another table in another database based on a condition in a common column

I have two pandas DataFrames:
df1 from database A with connection parameters {"host":"hostname_a","port": "5432", "dbname":"database_a", "user": "user_a", "password": "secret_a"}. The column key is the primary key.
df1:
| | key | create_date | update_date |
|---:|------:|:-------------|:--------------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 |
| 1 | 57248 | | 2018-01-21 |
| 2 | 57249 | 1992-12-22 | 2016-01-31 |
| 3 | 57250 | | 2015-01-21 |
| 4 | 57251 | 1991-12-23 | 2015-01-21 |
| 5 | 57262 | | 2015-01-21 |
| 6 | 57263 | | 2014-01-21 |
df2 from database B with connection parameters {"host": "hostname_b","port": "5433", "dbname":"database_b", "user": "user_b", "password": "secret_b"}. The column id is the primary key (these values are originally the same as those in the key column of df1; it's only a renaming of the primary key column of df1).
df2:
| | id | create_date | update_date | user |
|---:|------:|:-------------|:--------------|:------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 | |
| 1 | 57248 | | 2018-01-21 | |
| 2 | 57249 | 1992-12-24 | 2020-10-11 | klm |
| 3 | 57250 | 2001-07-14 | 2019-21-11 | ptl |
| 4 | 57251 | 1991-12-23 | 2015-01-21 | |
| 5 | 57262 | | 2015-01-21 | |
| 6 | 57263 | | 2014-01-21 | |
Notice that row[2] and row[3] in df2 have more recent update_date values (2020-10-11 and 2019-21-11 respectively) than their counterparts in df1 (where id = key), because their create_date values have been modified (by the given users).
I would like to update the rows of df1 (in concrete terms, the create_date and update_date values) where update_date in df2 is more recent than its original value in df1 (for the same primary keys).
This is how I'm tackling this for the moment, using sqlalchemy and psycopg2 + the .to_sql() method of pandas' DataFrame:
import psycopg2
from sqlalchemy import create_engine

connector = lambda: psycopg2.connect(**database_parameters_dictionary)
engine = create_engine('postgresql+psycopg2://', creator=connector)
df1.update(df2)  # 1) maybe there is something better to do here?
with engine.connect() as connection:
    df1.to_sql(
        name="database_table_name",
        con=connection,
        schema="public",
        if_exists="replace",  # 2) maybe there is also something better to do here?
        index=True
    )
The problem I have is that, according to the documentation, the if_exists argument can only do three things:
if_exists{‘fail’, ‘replace’, ‘append’}, default ‘fail’
Therefore, to update these two rows, I have to:
1) use the .update() method on df1 with df2 as an argument, together with
2) replacing the whole table via the .to_sql() method, which means "drop + recreate".
As the tables are really large (more than 500'000 entries), I have the feeling that this will involve a lot of unnecessary work!
How could I efficiently update only those two newly updated rows? Do I have to generate custom SQL queries that compare the dates for each row and only take the ones that have really changed? But here again, I have the intuition that looping through all rows to compare the update dates will take "a lot" of time. What is the most efficient way to do that? (It would have been easier in pure SQL if the two tables were on the same host/database, but unfortunately that's not the case.)
Pandas can't do partial updates of a table, no. There is a longstanding open bug for supporting sub-whole-table-granularity updates in .to_sql(), but you can see from the discussion there that it's a very complex feature to support in the general case.
However, limiting it to just your situation, I think there's a reasonable approach you could take.
Instead of using df1.update(df2), put together an expression that yields only the changed records with their new values (I don't use pandas often so I don't know this offhand); then iterate over the resulting dataframe and build the UPDATE statements yourself (or with the SQLAlchemy expression layer, if you're using that). Then, use the connection to DB A to issue all the UPDATEs as one transaction. With an indexed PK, it should be as fast as this would ever be expected to be.
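A rough sketch of that approach, assuming the target table in database A is named database_table_name with columns key, create_date and update_date as in the question; the merge logic, the suffixes, and the connection URL (built from the question's connection parameters) are illustrative, not a definitive implementation:

from sqlalchemy import create_engine, text

# Connection to database A, built from the question's connection parameters
engine_a = create_engine("postgresql+psycopg2://user_a:secret_a@hostname_a:5432/database_a")

# Keep only the rows of df2 whose update_date is strictly newer than df1's
merged = df1.merge(df2, left_on="key", right_on="id", suffixes=("_old", "_new"))
changed = merged[merged["update_date_new"] > merged["update_date_old"]]

update_stmt = text(
    "UPDATE database_table_name "
    "SET create_date = :create_date, update_date = :update_date "
    "WHERE key = :key"
)

# Issue all UPDATEs inside a single transaction against database A
with engine_a.begin() as connection:
    for row in changed.itertuples(index=False):
        connection.execute(update_stmt, {
            "create_date": row.create_date_new,
            "update_date": row.update_date_new,
            "key": row.key,
        })

With an index on key, each UPDATE is a cheap single-row write, and only the genuinely changed rows are touched.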
BTW, I don't think df1.update(df2) is exactly correct - from my reading, that would update all rows with any differing fields, not just those where update_date in df2 is more recent than update_date in df1. But it's a moot point if update_date in df2 is only ever more recent than the one in df1.

How to add rows to an existing partition in Spark?

I have to update historical data. By update, I mean adding new rows and sometimes new columns to an existing partition on S3.
The current partitioning is implemented by date: created_year={}/created_month={}/created_day={}. In order to avoid too many objects per partition, I do the following to maintain single object/partition:
def save_repartitioned_dataframe(bucket_name, df):
    dest_path = form_path_string(bucket_name, repartitioned_data=True)
    print('Trying to save repartitioned data at: {}'.format(dest_path))
    df.repartition(1, "created_year", "created_month", "created_day").write.partitionBy(
        "created_year", "created_month", "created_day").parquet(dest_path)
    print('Data repartitioning complete at the following location:')
    print(dest_path)
    _, count, distinct_count, num_partitions = read_dataframe_from_bucket(bucket_name, repartitioned_data=True)
    return count, distinct_count, num_partitions
A scenario exists where I have to add certain rows that have these columnar values:
created_year | created_month | created_day
2019         | 10            | 27
This means that the file(S3 object) at this path: created_year=2019/created_month=10/created_day=27/some_random_name.parquet will be appended with the new rows.
If there is a change in the schema, then all the objects will have to implement that change.
I tried looking into how this works generally; there are two write modes of interest: overwrite and append.
The first one will write just the incoming data and delete everything that already exists. I do not want that situation. The second one will append, but may end up creating more objects. I do not want that situation either. I also read that DataFrames are immutable in Spark.
So, how do I achieve appending the new data as it arrives to existing partitions and maintaining one object per day?
Based on your question I understand that you need to add new rows to the existing data while not increasing the number of parquet files. This can be achieved by doing operations on specific partition folders. There are three cases to consider.
1) New partition
This means the incoming data has a new value in the partition columns. In your case, this can be like:
Existing data
| year | month | day |
| ---- | ----- | --- |
| 2020 | 1 | 1 |
New data
| year | month | day |
| ---- | ----- | --- |
| 2020 | 1 | 2 |
So, in this case, you can just create a new partition folder for the incoming data and save it as you did.
partition_path = "/path/to/data/year=2020/month=1/day=2"
new_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
2) Existing partition, new data
This is where you want to append new rows to the existing data. It could be like:
Existing data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 1 |
New data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | b | 1 |
Here we have a new record for the same partition. You can use the "append mode" but you want a single parquet file in each partition folder. That's why you should read the existing partition first, union it with the new data, then write it back.
partition_path = "/path/to/data/year=2020/month=1/day=1"
old_data = spark.read.parquet(partition_path)
write_data = old_data.unionByName(new_data)
write_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
3) Existing partition, existing data
What if the incoming data is an UPDATE, rather than an INSERT? In this case, you should update a row instead of inserting a new one. Imagine this:
Existing data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 1 |
New data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 2 |
"a" had a value of 1 before, now we want it to be 2. So, in this case, you should read existing data and update existing records. This could be achieved like the following.
from pyspark.sql import functions as F

partition_path = "/path/to/data/year=2020/month=1/day=1"
old_data = spark.read.parquet(partition_path)
write_data = old_data.join(new_data, ["year", "month", "day", "key"], "outer")
write_data = write_data.select(
    "year", "month", "day", "key",
    F.coalesce(new_data["value"], old_data["value"]).alias("value")
)
write_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
When we outer join the old data with the new, there can be four cases:
both sides have the same value; it doesn't matter which one we take
the two sides have different values; take the new value
the old data doesn't have the value but the new data does; take the new one
the new data doesn't have the value but the old data does; take the old one
To achieve what we want here, coalesce from pyspark.sql.functions does the work: it returns the first non-null value among its arguments, so the new value wins whenever it exists.
Note that this solution covers the second case as well.
About schema change
Spark supports schema merging for the Parquet file format. This means you can add columns to or remove columns from your data. As you add or remove columns, you will notice that some columns are not present when reading the data from the top-level path. This is because Spark disables schema merging by default. From the documentation:
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
To be able to read all columns, you need to set the mergeSchema option to true.
df = spark.read.option("mergeSchema", "true").parquet(path)

PySpark: Sending each partition to same worker node for processing

I have a PySpark DataFrame with 500+ clusters in the following format:
-------------------------------
| Cluster_num | text          |
-------------------------------
| 1           | some_text_1   |
| 1           | some_text_2   |
| 2           | some_text_3   |
| 2           | some_text_4   |
-------------------------------
I want to apply gensim.summarization to the text of each cluster to get a per-cluster summary. The dataset is huge and I want to parallelize the work to the maximum extent. Gensim is installed on all worker nodes.
Is there a way to apply this function such that all the text of the same cluster goes to the same worker node, where the summary function can be applied to it?
I was trying to convert the DataFrame to an RDD and then use partitionBy and reduceByKey, but couldn't make it work.
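A sketch of the kind of RDD approach the question describes: key each row by its cluster, pull all text for a cluster together with groupByKey, and run the summarizer on the concatenated text. The summarize import assumes gensim < 4.0 (where gensim.summarization still exists); treat the whole thing as illustrative rather than a tested answer.

from gensim.summarization import summarize  # available in gensim < 4.0

def summarize_cluster(pair):
    cluster_num, texts = pair
    joined = ". ".join(texts)  # all text belonging to one cluster
    return (cluster_num, summarize(joined))

summaries = (
    df.rdd
      .map(lambda row: (row["Cluster_num"], row["text"]))
      .groupByKey()  # all rows of one cluster end up in the same task
      .map(summarize_cluster)
)

summary_df = summaries.toDF(["Cluster_num", "summary"])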

Can Spark SQL not count correctly or can I not write SQL correctly?

In a Python notebook on Databricks "Community Edition", I'm experimenting with the City of San Francisco open data about emergency calls to 911 requesting firefighters. (The old 2016 copy of the data used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube) and made available on S3 for that tutorial.)
After mounting the data and reading it with the explicitly defined schema into a DataFrame fire_service_calls_df, I aliased that DataFrame as an SQL table:
sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")
With that and the DataFrame API, I can count the call types that occurred:
fire_service_calls_df.select('CallType').distinct().count()
Out[n]: 34
... or with SQL in Python:
spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()
+------------------------+
|count(DISTINCT CallType)|
+------------------------+
| 33|
+------------------------+
... or with an SQL cell:
%sql
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
Why do I get two different count results? (It seems like 34 is the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)
To answer the question
Can Spark SQL not count correctly or can I not write SQL correctly?
from the title: I can't write SQL correctly.
Rule <insert number> of writing SQL: Think about NULL and UNDEFINED.
%sql
SELECT count(*)
FROM (
SELECT DISTINCT CallType
FROM fireServiceCalls
)
34
The difference: count(DISTINCT CallType) skips NULLs, so it reports 33, while SELECT DISTINCT keeps NULL as its own row and the outer count(*) then counts it, giving 34.
Also, I apparently can't read:
pault suggested in a comment
With only 30 something values, you could just sort and print all the distinct items to see where the difference is.
Well, I actually thought of that myself (minus the sorting). Except there wasn't any difference: there were always 34 call types in the output, whether I generated it with SQL or DataFrame queries. I simply didn't notice that one of them was ominously named null:
+--------------------------------------------+
|CallType |
+--------------------------------------------+
|Elevator / Escalator Rescue |
|Marine Fire |
|Aircraft Emergency |
|Confined Space / Structure Collapse |
|Administrative |
|Alarms |
|Odor (Strange / Unknown) |
|Lightning Strike (Investigation) |
|null |
|Citizen Assist / Service Call |
|HazMat |
|Watercraft in Distress |
|Explosion |
|Oil Spill |
|Vehicle Fire |
|Suspicious Package |
|Train / Rail Fire |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other |
|Transfer |
|Outside Fire |
|Traffic Collision |
|Assist Police |
|Gas Leak (Natural and LP Gases) |
|Water Rescue |
|Electrical Hazard |
|High Angle Rescue |
|Structure Fire |
|Industrial Accidents |
|Medical Incident |
|Mutual Aid / Assist Outside Agency |
|Fuel Spill |
|Smoke Investigation (Outside) |
|Train / Rail Incident |
+--------------------------------------------+
