All,
I am trying to read a file with multiple record types in Spark, but have no clue how to do it. Can someone point out whether there is a way to do it, an existing package, or a user-contributed package on GitHub?
The example below shows a text file with two separate record types (there could be more than two):
00X - record_ind | First_name| Last_name
0-3 record_ind
4-10 firstname
11-16 lastname
============================
00Y - record_ind | Account_#| STATE | country
0-3 record_ind
4-8 Account #
9-10 STATE
11-15 country
input.txt
------------
00XAtun Varma
00Y00235ILUSA
00XDivya Reddy
00Y00234FLCANDA
sample output/data frame
output.txt
record_ind | x_First_name | x_Last_name | y_Account | y_STATE | y_country
---------------------------------------------------------------------------
00x | Atun | Varma | null | null | null
00y | null | null | 00235 | IL | USA
00x | Divya | Reddy | null | null | null
00y | null | null | 00234 | FL | CANDA
One way to achieve this is to load the data as text. The complete row is loaded into a single column named 'value'. Then call a UDF that parses each row based on its record indicator and transforms the data so that every row follows the same schema.
Finally, use that schema to create the required dataframe and save it to the database.
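A minimal sketch of that idea in PySpark (the output column names, the input path, and the slice positions below are illustrative and must be adjusted to your actual fixed-width layout):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# one wide schema that every parsed row will follow
common_schema = StructType([
    StructField("record_ind", StringType()),
    StructField("x_first_name", StringType()),
    StructField("x_last_name", StringType()),
    StructField("y_account", StringType()),
    StructField("y_state", StringType()),
    StructField("y_country", StringType()),
])

def parse_line(value):
    ind = value[0:3]
    if ind == "00X":
        # name fields: adjust the slicing/splitting to your fixed-width spec
        parts = value[3:].split()
        first = parts[0] if parts else None
        last = parts[1] if len(parts) > 1 else None
        return (ind, first, last, None, None, None)
    # assume everything else is a 00Y record: account, state, country
    return (ind, None, None, value[3:8], value[8:10], value[10:].strip())

parse_udf = udf(parse_line, common_schema)

raw = spark.read.text("input.txt")   # whole row in a single column named 'value'
df = raw.select(parse_udf("value").alias("rec")).select("rec.*")
df.show()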
I have to update historical data. By update, I mean adding new rows and sometimes new columns to an existing partition on S3.
The current partitioning is implemented by date: created_year={}/created_month={}/created_day={}. In order to avoid too many objects per partition, I do the following to maintain a single object per partition:
def save_repartitioned_dataframe(bucket_name, df):
    dest_path = form_path_string(bucket_name, repartitioned_data=True)
    print('Trying to save repartitioned data at: {}'.format(dest_path))
    df.repartition(1, "created_year", "created_month", "created_day").write.partitionBy(
        "created_year", "created_month", "created_day").parquet(dest_path)
    print('Data repartitioning complete at the following location:')
    print(dest_path)
    _, count, distinct_count, num_partitions = read_dataframe_from_bucket(bucket_name, repartitioned_data=True)
    return count, distinct_count, num_partitions
A scenario exists where I have to add rows that have these column values:
created_year | created_month | created_day
2019 | 10 | 27
This means that the file (S3 object) at this path: created_year=2019/created_month=10/created_day=27/some_random_name.parquet needs to have the new rows appended to it.
If there is a change in the schema, then all the objects will have to implement that change.
I looked into how this generally works; there are two save modes of interest: overwrite and append.
The first one writes the current data and deletes everything else. I do not want that. The second one appends, but may end up creating more objects. I do not want that either. I also read that dataframes are immutable in Spark.
So, how do I append new data as it arrives to existing partitions while maintaining one object per day?
Based on your question, I understand that you need to add new rows to the existing data without increasing the number of parquet files. This can be achieved by operating on specific partition folders. There are three cases to consider.
1) New partition
This means the incoming data has a new value in the partition columns. In your case, this can be like:
Existing data
| year | month | day |
| ---- | ----- | --- |
| 2020 | 1 | 1 |
New data
| year | month | day |
| ---- | ----- | --- |
| 2020 | 1 | 2 |
So, in this case, you can just create a new partition folder for the incoming data and save it as you did.
partition_path = "/path/to/data/year=2020/month=1/day=2"
new_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
2) Existing partition, new data
This is where you want to append new rows to the existing data. It could be like:
Existing data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 1 |
New data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | b | 1 |
Here we have a new record for the same partition. You could use append mode, but since you want a single parquet file in each partition folder, you should read the existing partition first, union it with the new data, and then write it back.
partition_path = "/path/to/data/year=2020/month=1/day=1"
# basePath keeps year/month/day visible as columns even though we read a single leaf folder
old_data = spark.read.option("basePath", "/path/to/data").parquet(partition_path)
write_data = old_data.unionByName(new_data)
# materialize first: Spark refuses to overwrite a path it is still reading from
write_data = write_data.localCheckpoint()
write_data.repartition(1, "year", "month", "day").write.mode("overwrite").parquet(partition_path)
3) Existing partition, existing data
What if the incoming data is an UPDATE, rather than an INSERT? In this case, you should update a row instead of inserting a new one. Imagine this:
Existing data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 1 |
New data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 2 |
"a" had a value of 1 before, now we want it to be 2. So, in this case, you should read existing data and update existing records. This could be achieved like the following.
from pyspark.sql import functions as F

partition_path = "/path/to/data/year=2020/month=1/day=1"
# basePath keeps year/month/day visible as columns even though we read a single leaf folder
old_data = spark.read.option("basePath", "/path/to/data").parquet(partition_path)
write_data = old_data.join(new_data, ["year", "month", "day", "key"], "outer")
write_data = write_data.select(
    "year", "month", "day", "key",
    # take the incoming value when present, otherwise keep the existing one
    F.coalesce(new_data["value"], old_data["value"]).alias("value")
)
# materialize first: Spark refuses to overwrite a path it is still reading from
write_data = write_data.localCheckpoint()
write_data.repartition(1, "year", "month", "day").write.mode("overwrite").parquet(partition_path)
When we outer join the old data with the new data, there are four possible outcomes:
- both sides have the same value: it does not matter which one we take
- the two sides have different values: take the new value
- only the new data has a value: take the new value
- only the old data has a value: take the old value
To get exactly this behavior, coalesce from pyspark.sql.functions does the work, as the short example below shows.
Note that this solution covers the second case as well.
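As a quick toy illustration (the in-memory dataframes, keys, and values below are made up for the demo), the outer join plus coalesce resolves all of these outcomes:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

old = spark.createDataFrame([("a", 1), ("b", 5)], ["key", "value"])
new = spark.createDataFrame([("a", 2), ("c", 7)], ["key", "value"])

joined = old.join(new, ["key"], "outer")
result = joined.select("key", F.coalesce(new["value"], old["value"]).alias("value"))
result.orderBy("key").show()
# a -> 2 (updated from new), b -> 5 (kept from old), c -> 7 (inserted from new)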
About schema change
Spark supports schema merging for the parquet file format. This means you can add columns to or remove columns from your data. As you add or remove columns, you will notice that some columns are not present when you read the data from the top-level path. This is because Spark disables schema merging by default. From the documentation:
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
To be able to read all columns, you need to set the mergeSchema option to true.
df = spark.read.option("mergeSchema", "true").parquet(path)
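For instance (the /tmp/merge_demo path and the column names here are made up for the sketch), two batches of files written with different schemas can be read together once mergeSchema is on:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# first batch written with the original schema
spark.createDataFrame([(1, "a")], ["id", "key"]).write.parquet("/tmp/merge_demo/day=1")

# a later batch written with an extra 'score' column
spark.createDataFrame([(2, "b", 9.5)], ["id", "key", "score"]).write.parquet("/tmp/merge_demo/day=2")

# without mergeSchema the 'score' column may be missing; with it, all columns appear (null where absent)
merged = spark.read.option("mergeSchema", "true").parquet("/tmp/merge_demo")
merged.printSchema()
merged.show()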
I have a table like this:
user | sg_message_id | event | datetime
--------------------------------------------------------------------------------
john player | ekjf939e9313140_34k | delivered | 04/13/2018 12:56:30
john player | ekjf939e9313140_34k | opened | 04/15/2018 16:05:00
cristian dior | dsfsk0340344030fkjkj | delivered | 04/12/2018 18:45:21
cristian dior | dsfsk0340344030fkjkj | opened | 04/13/2018 15:40:17
For each user and each unique sg_message_id, how do I create an Excel pivot table that displays the hours elapsed between when an email was delivered and when it was opened?
You can use Calculated Item... in your pivot table if your data set is not that big. Follow these steps:
You can use Power Query. It is an add-in developed by Microsoft for Excel 2010 and later (built into Excel 2016, where it is known as Get & Transform). With it you can connect directly to any type of data source and edit the data as you want.
Here is an example for your case:
I am currently querying an Informix SQL table with a datetime field that only holds a time, such as "07:30:44". I need to be able to pull this data and have send_time show what the actual time was, not "0:00:00".
Example:
Informix Table Data
--------------------------
| send_date | 02/09/2016 | --datetime field
--------------------------
| send_time | 07:30:44 | --datetime field
--------------------------
When I query through Excel via the Microsoft Query editor, I can see the correct value, "07:30:44", in the preview; however, what is returned to my sheet is "0:00:00". Normally I would change the formatting on the cell, but the literal value for the entire column is "12:00:00 AM".
Excel pulls/displays
--------------------------
| send_date | 02/09/2016 |
--------------------------
| send_time | 12:00:00 AM| --displayed is "0:00:00"
--------------------------
I have cast the field as char to see if it would return correctly, and it does! However, you can't use parameters or build quick reports with this method.
Desired Excel output
--------------------------
| send_date | 02/09/2016 |
--------------------------
| send_time | 07:30:44 |
--------------------------
Is there another way to resolve this? I'd love to give the users an easy way to build queries without having to cast anything.
Here is my Sample Data
Date | Name | Result
01-09-14 | John | Fail
01-09-14 | John | PASS
01-09-14 | Raja | Pending
01-09-14 | Raja | Pending
01-09-14 | Natraj | No Response
01-09-14 | Natraj | PASS
02-09-14 | John | PASS
02-09-14 | John | No Response
02-09-14 | Raja | Fail
02-09-14 | Raja | Pending
02-09-14 | Natraj | No Response
02-09-14 | Natraj | PASS
02-09-14 | Natraj | Fail
02-09-14 | Natraj | Fail
I need to create a pivot chart for a table like this, where I count the number of each Result for a particular date and a particular Name.
Example:
The chart should produce a result something like this:
Date| Name | Pass | Fail | Pending | No Response
01-09-14| John | 1 | 1 | 0 | 0
01-09-14| Raja | 0 | 0 | 2 | 0
-------------------------------------------
I tried adding another sheet with a table that counts the Pass, Fail, Pending, ... values separately, and from there I created a pivot chart, but the actual table keeps growing.
For example, a new user can be added, or a new date will be added in the future, so the chart also has to respond to new users and new dates; that is where I failed.
So, is there any way to make the pivot chart responsive, so that when a new user or a new date is added, the chart is rebuilt automatically and shows the count results for each particular date and name?
It should look something like this: