Apache Cassandra COPY FROM, datetime entered incorrectly

I am seeing a strange issue with the COPY FROM command in Cassandra with datetime values.
My timezone and my server's timezone are the same: IST (GMT+5:30).
First I tried inserting a value with an INSERT query.
INSERT INTO activity (home_id, datetime, event, code_used) VALUES ('H01474777', '2014-05-21 07:32:16', 'alarm set', '5599');
It gave me the row below.
home_id | datetime | code_used | event
-----------+---------------------------------+-----------+-----------
H01474777 | 2014-05-21 02:02:16.000000+0000 | 5599 | alarm set
Here Cassandra shows the time value in GMT by subtracting the +5:30 offset.
But when I inserted the rows below via the COPY FROM command, you can see that it instead added +5:30 when converting to GMT; it is as if adding the row added 11 hours to the time.
See the file, query and output below, respectively.
home_id|datetime|event|code_used
H02257222|2014-05-21 05:29:47|alarm set|1566
H01474777|2014-05-21 07:32:16|alarm set|5599
Query:
COPY activity (home_id, datetime, event, code_used) FROM '/home/cass/events.csv' WITH HEADER = TRUE AND DELIMITER = '|';
Result:
home_id | datetime | code_used | event
-----------+---------------------------------+-----------+-----------
H01474777 | 2014-05-21 13:02:16.000000+0000 | 5599 | alarm set
H01474777 | 2014-05-21 02:02:16.000000+0000 | 5599 | alarm set --Old row from insert query.
H02257222 | 2014-05-21 10:59:47.000000+0000 | 1566 | alarm set
Here the first 2 rows contain the same data, and the first 2 columns of the table form the primary key, yet another row has been created where there should have been only 2 rows.

I was able to replicate the scenario you mentioned.
My server timezone is EST.
I ran the INSERT you provided and used the file you provided to load the data with the COPY command.
INSERT INTO activity (home_id, datetime, event, code_used) VALUES ('H01474777', '2014-05-21 07:32:16', 'alarm set', 5599);
home_id | datetime | code_used | event
-----------+---------------------------------+-----------+-----------
H01474777 | 2014-05-21 07:32:16.000000+0000 | 5599 | alarm set
COPY activity (home_id, datetime, event, code_used) FROM 'temp.csv' WITH HEADER = TRUE AND DELIMITER = '|'
home_id | datetime | code_used | event
-----------+---------------------------------+-----------+-----------
H01474777 | 2014-05-21 02:32:16.000000+0000 | 5599 | alarm set
H01474777 | 2014-05-21 07:32:16.000000+0000 | 5599 | alarm set
H02257222 | 2014-05-21 00:29:47.000000+0000 | 1566 | alarm set
In the file you did not specify any timezone for the datetime column, so COPY treats the values as being in the local timezone. In my case that is EST, so -0500; in your case it is IST, so +0530.
Your file is effectively interpreted as:
home_id|datetime|event|code_used
H02257222|2014-05-21 05:29:47.000+0530|alarm set|1566
H01474777|2014-05-21 07:32:16.000+0530|alarm set|5599
If you modify your CSV file as below, then you will only see one row for H01474777.
home_id|datetime|event|code_used
H02257222|2014-05-21 05:29:47.000+0000|alarm set|1566
H01474777|2014-05-21 07:32:16.000+0000|alarm set|5599
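If the CSV is generated programmatically and the values are meant to be IST wall-clock times (matching the behaviour of the original INSERT), one option is to rewrite the naive timestamps with an explicit UTC offset before running COPY. Below is a minimal Python sketch of that idea; the fixed IST offset and the output file name are assumptions for illustration.
import csv
from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))

with open('/home/cass/events.csv') as src, open('/home/cass/events_utc.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src, delimiter='|')
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, delimiter='|')
    writer.writeheader()
    for row in reader:
        # interpret the naive local time as IST, then emit it in UTC with an explicit offset
        local = datetime.strptime(row['datetime'], '%Y-%m-%d %H:%M:%S').replace(tzinfo=IST)
        row['datetime'] = local.astimezone(timezone.utc).strftime('%Y-%m-%d %H:%M:%S.%f')[:-3] + '+0000'
        writer.writerow(row)
Pointing COPY FROM at the rewritten file then stores the same values as the original INSERT did (e.g. 2014-05-21 02:02:16+0000 for H01474777).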
Hope this helps! Let me know if you have any questions.

Related

Fixed length file reading in Spark with multiple record formats in one file

All,
I am trying to read a file with multiple record types in Spark, but have no clue how to do it. Can someone point out if there is a way to do it, an existing package, or a user GitHub package?
In the example below we have a text file with 2 separate record types (it could be more than 2):
00X - record_ind | First_name| Last_name
0-3 record_ind
4-10 firstname
11-16 lastname
============================
00Y - record_ind | Account_#| STATE | country
0-3 record_ind
4-8 Account #
9-10 STATE
11-15 country
input.txt
------------
00XAtun Varma
00Y00235ILUSA
00XDivya Reddy
00Y00234FLCANDA
sample output/data frame
output.txt
record_ind | x_First_name | x_Last_name | y_Account | y_STATE | y_country
---------------------------------------------------------------------------
00x | Atun | Varma | null | null | null
00y | null | null | 00235 | IL | USA
00x | Divya | Reddy | null | null | null
00y | null | null | 00234 | FL | CANDA
One way to achieve this is to load the data as 'text'. Each complete row will be loaded into a single column named 'value'. Then call a UDF that modifies each row based on its record type, transforming the data so that all rows follow the same schema.
Finally, use that schema to create the required dataframe and save it to the database.
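A minimal PySpark sketch of that approach, assuming string columns throughout; the UDF, schema and slice positions are illustrative and should be adjusted to the real fixed widths:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# common schema that both record types map onto (missing fields become null)
schema = StructType([
    StructField("record_ind", StringType()),
    StructField("x_First_name", StringType()),
    StructField("x_Last_name", StringType()),
    StructField("y_Account", StringType()),
    StructField("y_STATE", StringType()),
    StructField("y_country", StringType()),
])

def parse_line(line):
    ind = line[0:3]
    if ind == "00X":
        return (ind, line[3:10].strip(), line[10:16].strip(), None, None, None)
    return (ind, None, None, line[3:8], line[8:10], line[10:15])

parse_udf = F.udf(parse_line, schema)

raw = spark.read.text("input.txt")    # whole row loaded into one column named "value"
parsed = raw.select(parse_udf("value").alias("r")).select("r.*")
parsed.show()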

How to add rows to an existing partition in Spark?

I have to update historical data. By update, I mean adding new rows and sometimes new columns to an existing partition on S3.
The current partitioning is implemented by date: created_year={}/created_month={}/created_day={}. In order to avoid too many objects per partition, I do the following to maintain a single object per partition:
def save_repartitioned_dataframe(bucket_name, df):
    dest_path = form_path_string(bucket_name, repartitioned_data=True)
    print('Trying to save repartitioned data at: {}'.format(dest_path))
    df.repartition(1, "created_year", "created_month", "created_day").write.partitionBy(
        "created_year", "created_month", "created_day").parquet(dest_path)
    print('Data repartitioning complete at the following location:')
    print(dest_path)
    _, count, distinct_count, num_partitions = read_dataframe_from_bucket(bucket_name, repartitioned_data=True)
    return count, distinct_count, num_partitions
A scenario exists where I have to add rows that have these column values:
created_year | created_month | created_day
2019         | 10            | 27
This means that the file (S3 object) at the path created_year=2019/created_month=10/created_day=27/some_random_name.parquet will have the new rows appended to it.
If there is a change in the schema, then all the objects will have to implement that change.
I looked into how this generally works; there are two modes of interest: overwrite and append.
The first one will just add the current data and delete the rest. I do not want that situation. The second one will append but may end up creating more objects. I do not want that situation either. I also read that dataframes are immutable in Spark.
So, how do I append new data as it arrives to existing partitions while maintaining one object per day?
Based on your question, I understand that you need to add new rows to the existing data without increasing the number of parquet files. This can be achieved by operating on specific partition folders. There are three cases to consider.
1) New partition
This means the incoming data has a new value in the partition columns. In your case, this can be like:
Existing data
| year | month | day |
| ---- | ----- | --- |
| 2020 | 1 | 1 |
New data
| year | month | day |
| ---- | ----- | --- |
| 2020 | 1 | 2 |
So, in this case, you can just create a new partition folder for the incoming data and save it as you did.
partition_path = "/path/to/data/year=2020/month=1/day=2"
new_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
2) Existing partition, new data
This is where you want to append new rows to the existing data. It could be like:
Existing data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 1 |
New data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | b | 1 |
Here we have a new record for the same partition. You can use the "append mode" but you want a single parquet file in each partition folder. That's why you should read the existing partition first, union it with the new data, then write it back.
partition_path = "/path/to/data/year=2020/month=1/day=1"
old_data = spark.read.parquet(partition_path)
write_data = old_data.unionByName(new_data)
write_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
3) Existing partition, existing data
What if the incoming data is an UPDATE, rather than an INSERT? In this case, you should update a row instead of inserting a new one. Imagine this:
Existing data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 1 |
New data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 2 |
"a" had a value of 1 before, now we want it to be 2. So, in this case, you should read existing data and update existing records. This could be achieved like the following.
partition_path = "/path/to/data/year=2020/month=1/day=1"
old_data = spark.read.parquet(partition_path)
write_data = old_data.join(new_data, ["year", "month", "day", "key"], "outer")
write_data = write_data.select(
"year", "month", "day", "key",
F.coalesce(new_data["value"], old_data["value"]).alias("value")
)
write_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
When we outer join the old data with the new, there can be four outcomes:
both sides have the same value, so it does not matter which one we take
the two sides have different values, so we take the new value
the old data does not have the value but the new data does, so we take the new one
the new data does not have the value but the old data does, so we take the old one
To get the behaviour we want here, coalesce from pyspark.sql.functions does the work.
Note that this solution covers the second case as well.
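As a quick sanity check of the coalesce logic, here is a toy run over the example rows above (the SparkSession setup and dataframe construction are only for illustration):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
old_data = spark.createDataFrame([(2020, 1, 1, "a", 1)],
                                 ["year", "month", "day", "key", "value"])
new_data = spark.createDataFrame([(2020, 1, 1, "a", 2), (2020, 1, 1, "b", 1)],
                                 ["year", "month", "day", "key", "value"])

joined = old_data.join(new_data, ["year", "month", "day", "key"], "outer")
result = joined.select(
    "year", "month", "day", "key",
    F.coalesce(new_data["value"], old_data["value"]).alias("value")
)
result.show()
# key "a" takes the updated value 2, key "b" comes through from the new data with value 1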
About schema change
Spark supports schema merging for the parquet file format. This means you can add columns to or remove columns from your data. As you add or remove columns, you will notice that some columns are not present when reading the data from the top level. This is because Spark disables schema merging by default. From the documentation:
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
To be able to read all columns, you need to set the mergeSchema option to true.
df = spark.read.option("mergeSchema", "true").parquet(path)

Calculate hours between 2 timestamps using Excel pivot table

I have a table like this:
user | sg_message_id | event | datetime
--------------------------------------------------------------------------------
john player | ekjf939e9313140_34k | delivered | 04/13/2018 12:56:30
john player | ekjf939e9313140_34k | opened | 04/15/2018 16:05:00
cristian dior | dsfsk0340344030fkjkj | delivered | 04/12/2018 18:45:21
cristian dior | dsfsk0340344030fkjkj | opened | 04/13/2018 15:40:17
For a user, for each unique sg_message_id, how do I create an Excel pivot table that displays the hours elapsed between when an email was delivered and when it was opened?
You can use Calculated Item... in your pivot table if your data set is not that big.
You can also use Power Query. It is an add-in developed by Microsoft since Excel 2010+ (built into Excel 2016 as Get & Transform). There you can connect directly to any type of data source and edit it as you want.
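This is not the Power Query recipe itself, but if the elapsed-hours calculation is the unclear part, the same logic can be sketched in pandas for comparison (column names are taken from the table above; the file name is illustrative):
import pandas as pd

df = pd.read_csv("emails.csv", parse_dates=["datetime"])

# one row per (user, sg_message_id) with delivered and opened timestamps side by side
wide = df.pivot_table(index=["user", "sg_message_id"], columns="event",
                      values="datetime", aggfunc="min")

# hours elapsed between delivery and open
wide["hours_to_open"] = (wide["opened"] - wide["delivered"]).dt.total_seconds() / 3600
print(wide.reset_index())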

How to return datetime in Excel when SQL table has only time?

I am currently querying an Informix SQL table with a column that is a datetime but only holds a time, such as "07:30:44". I need to be able to pull this data and have send_time show what the actual time was, not "0:00:00".
Example:
Informix Table Data
--------------------------
| send_date | 02/09/2016 | --datetime field
--------------------------
| send_time | 07:30:44 | --datetime field
--------------------------
When I query through Excel via the Microsoft Query editor, I can see the correct value, "07:30:44", in the preview; however, what is returned in my sheet is "0:00:00". Normally I would change the cell formatting, but the literal value for the entire column is "12:00:00 AM".
Excel pulls/displays
--------------------------
| send_date | 02/09/2016 |
--------------------------
| send_time | 12:00:00 AM| --displayed is "0:00:00"
--------------------------
I have cast the field as char to see if it would return correctly, and it does! However, you can't use parameters or build quick reports with this method.
Desired Excel output
--------------------------
| send_date | 02/09/2016 |
--------------------------
| send_time | 07:30:44 |
--------------------------
Is there another way to resolve this? I'd love to give the users an easy way to build queries without having to cast anything.

Pivot Chart Table in Excel To Calculate the Count and Display the Chart

Here is my Sample Data
Date | Name | Result
01-09-14 | John | Fail
01-09-14 | John | PASS
01-09-14 | Raja | Pending
01-09-14 | Raja | Pending
01-09-14 | Natraj | No Response
01-09-14 | Natraj | PASS
02-09-14 | John | PASS
02-09-14 | John | No Response
02-09-14 | Raja | Fail
02-09-14 | Raja | Pending
02-09-14 | Natraj | No Response
02-09-14 | Natraj | PASS
02-09-14 | Natraj | Fail
02-09-14 | Natraj | Fail
I need to create a pivot chart for a table like this, counting the number of each Result for a particular date and a particular Name.
Example:
The chart should produce a result something like this:
Date| Name | Pass | Fail | Pending | No Response
01-09-14| John | 1 | 1 | 0 | 0
01-09-14| Raja | 0 | 0 | 2 | 0
-------------------------------------------
I tried adding another sheet and table to count the pass, fail, pending... values separately and created a pivot chart from there, but the actual table keeps growing.
For example, a new user can be added and new dates will be added in the future, so the chart also has to respond to new users and new dates; that is where I failed.
So is there any way to make the pivot chart responsive, so that even when a new user or new date is added the chart is built automatically and shows the count of results for a particular date and a particular name?
