Database: MariaDB
Platform: CentOS
Need to do:
Import data from a text file into a table. Problem with DATETIME format.
Original date/time format in the text file: mmddYYYYHHiissmmm
Database default format: YYYYmmddHHiiss
LOAD DATA LOCAL
INFILE '/home/test.txt'
INTO TABLE cdr FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
(ID, APARTY, BPARTY, @T1, ENDTIME, DURATION, INTG, OUTTG, INRC, OUTRC)
SET STARTTIME = STR_TO_DATE(@T1, '%m-%d-%Y %H:%i:%s:%f');
After importing, the values show up as NULL.
Assuming your example of 'mmddYYYYHHiissmmm' is correct, change '%m-%d-%Y %H:%i:%s:%f' to '%m%d%Y%H%i%s%f'. Here's a test:
mysql> SELECT STR_TO_DATE('12312015235959123', '%m%d%Y%H%i%s%f');
+----------------------------------------------------+
| STR_TO_DATE('12312015235959123', '%m%d%Y%H%i%s%f') |
+----------------------------------------------------+
| 2015-12-31 23:59:59.123000                          |
+----------------------------------------------------+
1 row in set (0.00 sec)
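Applied to the load statement in the question, the corrected statement would look roughly like this (a sketch only, assuming the same table and column list, with the field captured into the user variable @T1):
LOAD DATA LOCAL
INFILE '/home/test.txt'
INTO TABLE cdr FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
(ID, APARTY, BPARTY, @T1, ENDTIME, DURATION, INTG, OUTTG, INRC, OUTRC)
SET STARTTIME = STR_TO_DATE(@T1, '%m%d%Y%H%i%s%f');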
You cannot change the internal representation of DATETIME or TIMESTAMP.
I'm trying to load a CSV from Google Cloud Storage into BigQuery using schema autodetect.
However, I'm getting stumped by a parsing error on one of my columns. I'm perplexed as to why BigQuery can't parse the field. According to the documentation, it should be able to parse fields that look like YYYY-MM-DD HH:MM:SS.SSSSSS (which is exactly what my BQInsertTimeUTC column is).
Here's my code:
from google.cloud import bigquery
from google.oauth2 import service_account

project_id = "<my_project_id>"
table_name = "<my_table_name>"
gs_link = "gs://<my_bucket_id>/my_file.csv"

# gcs_creds is the service account info dict, defined elsewhere
creds = service_account.Credentials.from_service_account_info(gcs_creds)
bq_client = bigquery.Client(project=project_id, credentials=creds)
dataset_ref = bq_client.dataset("<my_dataset_id>")

# create job_config object
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    skip_leading_rows=1,
    source_format="CSV",
    write_disposition="WRITE_TRUNCATE",
)

# prepare the load_job
load_job = bq_client.load_table_from_uri(
    gs_link,
    dataset_ref.table(table_name),
    job_config=job_config,
)

# execute the load_job
result = load_job.result()
Error Message:
Could not parse '2021-07-07 23:10:47.989155' as TIMESTAMP for field BQInsertTimeUTC (position 4) starting at location 64 with message 'Failed to parse input string "2021-07-07 23:10:47.989155"'
And here's the CSV file that lives in GCS:
first_name, last_name, date, number_col, BQInsertTimeUTC, ModifiedBy
lisa, simpson, 1/2/2020T12:00:00, 2, 2021-07-07 23:10:47.989155, tim
bart, simpson, 1/2/2020T12:00:00, 3, 2021-07-07 23:10:47.989155, tim
maggie, simpson, 1/2/2020T12:00:00, 4, 2021-07-07 23:10:47.989155, tim
marge, simpson, 1/2/2020T12:00:00, 5, 2021-07-07 23:10:47.989155, tim
homer, simpson, 1/3/2020T12:00:00, 6, 2021-07-07 23:10:47.989155, tim
Loading CSV files into BigQuery assumes that all the timestamp fields follow the same format. In your CSV file, the first timestamp value is "1/2/2020T12:00:00", so BigQuery takes the timestamp format used by the file to be [M]M/[D]D/YYYYT[H]H:[M]M:[S]S[.F][time zone].
Therefore, it complains that the value "2021-07-07 23:10:47.989155" could not be parsed. If you change "2021-07-07 23:10:47.989155" to "7/7/2021T23:10:47.989155", it will work.
To fix this, you can either:
Create a table in which the date and BQInsertTimeUTC columns are typed as STRING, load the CSV into it, and then expose a view that has the expected TIMESTAMP column types for date and BQInsertTimeUTC, using SQL to transform the data from the base table; or
Open the CSV file and transform either the "date" values or the "BQInsertTimeUTC" values so that their formats are consistent.
By the way, the CSV sample you pasted here has an extra space after the delimiter ",".
Working version:
first_name,last_name,date,number_col,BQInsertTimeUTC,ModifiedBy
lisa,simpson,1/2/2020T12:00:00,2,7/7/2021T23:10:47.989155,tim
bart,simpson,1/2/2020T12:00:00,3,7/7/2021T23:10:47.989155,tim
maggie,simpson,1/2/2020T12:00:00,4,7/7/2021T23:10:47.989155,tim
marge,simpson,1/2/2020T12:00:00,5,7/7/2021T23:10:47.989155,tim
homer,simpson,1/3/2020T12:00:00,6,7/7/2021T23:10:47.989155,tim
As per the limitations mentioned here:
When you load JSON or CSV data, values in TIMESTAMP columns must use a dash - separator for the date portion of the timestamp, and the date must be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon : separator.
So can you try passing BQInsertTimeUTC as 2021-07-07 23:10:47, without the milliseconds, instead of 2021-07-07 23:10:47.989155?
If you still want to use different date formats, you can do the following:
Load the CSV file as-is into BigQuery (i.e. your schema should be modified to BQInsertTimeUTC:STRING).
Create a BigQuery view that transforms the loaded field from a string into a recognized timestamp format.
Do a PARSE_TIMESTAMP (or PARSE_DATE, if only the date part is needed) on BQInsertTimeUTC and use that view for your analysis, as in the sketch below.
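A minimal sketch of such a view (my_dataset, my_table_raw, and my_table_view are placeholder names; it assumes BQInsertTimeUTC was loaded as STRING):
CREATE OR REPLACE VIEW my_dataset.my_table_view AS
SELECT
  * EXCEPT (BQInsertTimeUTC),
  PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%E*S', BQInsertTimeUTC) AS BQInsertTimeUTC
FROM my_dataset.my_table_raw;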
I have stored dates in a Postgres database, but when I show a date in the browser it is displayed with a timezone, converted from UTC. For example, I stored the date as 2020-07-16, but when I display it, it becomes 2020-07-15T18:00:00.000Z. I have tried select mydate::DATE from table to get only the date, but it still shows the date with a timezone. I am using the node-postgres module in my Node app. I suspect it's some configuration in the node-postgres module? From their docs:
node-postgres converts DATE and TIMESTAMP columns into the local time
of the node process set at process.env.TZ
Is there any way I can configure it to parse only the date? If I query like this: SELECT TO_CHAR(mydate :: DATE, 'yyyy-mm-dd') from table, I get 2020-07-16, but that's a lot of work just to get a date.
You can make your own date and time type parser:
const pg = require('pg');

// 1114 is TIMESTAMP WITHOUT TIME ZONE
pg.types.setTypeParser(1114, function(stringValue) {
  return stringValue;
});

// 1082 is DATE
pg.types.setTypeParser(1082, function(stringValue) {
  return stringValue;
});
The type id can be found in the file: node_modules/pg-types/lib/textParsers.js
It is spelled out here:
https://node-postgres.com/features/types
date / timestamp / timestamptz — the docs show what node returns versus what is actually stored:
console.log(result.rows)
// {
// date_col: 2017-05-29T05:00:00.000Z,
// timestamp_col: 2017-05-29T23:18:13.263Z,
// timestamptz_col: 2017-05-29T23:18:13.263Z
// }
bmc=# select * from dates;
date_col | timestamp_col | timestamptz_col
------------+-------------------------+----------------------------
2017-05-29 | 2017-05-29 18:18:13.263 | 2017-05-29 18:18:13.263-05
(1 row)
I have a UNIX epoch time (in seconds, without milliseconds) stored in a data set I'm trying to import, in rows like 1006,785,1502054277,8 (third entry). I noticed I can only store this in Cassandra as a timestamp. However, when I try to convert the time when querying, it comes out as follows using this query:
select player_id, server_id, dateof(mintimeuuid(last_login)) as timestamp, sessions from servers_by_user where server_id = 440 and player_id = 217442
player_id | server_id | timestamp | sessions
-----------+-----------+---------------------------------+---------------
217442 | 440 | 1970-01-18 06:38:03.382000+0000 | 1
That's obviously not right because that epoch time is actually 2017-08-06T21:17:57+00:00.
I tried to store the data as timeuuid, but then I get this error, presumably because it is not a 13-digit epoch time: Failed to import 1 rows: ParseError - Failed to parse 1502054277 : badly formed hexadecimal UUID string.
What would be the best way to store a 10-digit UNIX epoch time and to query it back into something that is human-readable?
The problem you notice is that Unix timestamps are seconds since the epoch, but timestamps in Cassandra are stored as milliseconds since the epoch instead.
The first row is what you actually stored; the second one is what you want:
cqlsh:demo> SELECT id, blobAsBigint(timestampAsBlob(ts)) FROM demo3;
id | system.blobasbigint(system.timestampasblob(ts))
--------------------------------------+-------------------------------------------------
b7bac930-7b3e-11e7-a5b3-73178ecf2b4e | 1502054277
bfb37f10-7b3e-11e7-a5b3-73178ecf2b4e | 1502054277000
(2 rows)
cqlsh:demo> SELECT id, dateof(mintimeuuid(blobAsBigint(timestampAsBlob(ts)))) FROM demo3;
id | system.dateof(system.mintimeuuid(system.blobasbigint(system.timestampasblob(ts))))
--------------------------------------+------------------------------------------------------------------------------------
b7bac930-7b3e-11e7-a5b3-73178ecf2b4e | 1970-01-18 09:14:14+0000
bfb37f10-7b3e-11e7-a5b3-73178ecf2b4e | 2017-08-06 21:17:57+0000
(2 rows)
cqlsh:demo>
(Using something like timestampasblob() in regular code is not a good idea; it is only used here as a demo to see what's going on under the hood.)
If you can, do not store Unix timestamps in Cassandra; use timestamps if you want the 'magic'. Of course you can deal with the conversion from seconds to timestamps in your code, but using timestamps directly is much more convenient.
Since you note you are importing some data, simply multiply the values by 1000 before importing and you are done (see the sketch below).
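For example, assuming the CSV row 1006,785,1502054277,8 maps to player_id, server_id, last_login, sessions in the servers_by_user table from the question, the pre-multiplied value would be written like this:
INSERT INTO servers_by_user (player_id, server_id, last_login, sessions)
VALUES (1006, 785, 1502054277000, 8);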
I can't try it on my cluster right now, but with Cassandra 3.x you can use user-defined functions (UDFs) to do conversions; they need to be enabled in cassandra.yaml on your cluster, and can be written in Java or JavaScript (others such as Python are possible). See https://docs.datastax.com/en/cql/latest/cql/cql_using/useCreateUDF.html.
CREATE FUNCTION IF NOT EXISTS toMilliseconds(input bigint)
CALLED ON NULL INPUT
RETURNS bigint
LANGUAGE java AS '
return input == null ? null : input * 1000L;
';
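Once created, it could be used along the lines of the demo above, e.g. (a sketch only; it assumes last_login is kept as a bigint column holding the raw 10-digit seconds value rather than as a timestamp):
SELECT player_id, server_id,
       dateof(mintimeuuid(toMilliseconds(last_login))) AS login_time,
       sessions
FROM servers_by_user
WHERE server_id = 440 AND player_id = 217442;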
Or just convert directly to a timestamp. A blog post from DataStax has more examples: https://www.datastax.com/dev/blog/user-defined-functions-in-cassandra-3-0
Using Cassandra 2.2.8, Driver 3, Spark 2.
I have a timestamp column in Cassandra and I need to query it by the date portion only. If I query by date, .where("TRAN_DATE= ?", "2012-01-21"), it does not return any results. If I include the time portion, it says Invalid Date. My data (as I can read it in cqlsh) is: 2012-01-21 08:01:00+0000
param: "2012-01-21" > No error but no result
param: "2012-01-21 08:01:00" > Error : Invalid Date
param: "2012-01-21 08:01:00+0000" > Error : Invalid Date
SimpleDateFormat DATE_FORMAT = new SimpleDateFormat("yyyy/mm/dd");
TRAN_DATE = DATE_FORMAT.parse("1/19/2012");
Have used the bulk loader/SSLoader to load the table
Data in table:
tran_date | id
--------------------------+-------
2012-01-14 08:01:00+0000 | ABC
2012-01-24 08:01:00+0000 | ABC
2012-01-23 08:01:00+0000 | ALM
2012-01-29 08:01:00+0000 | ALM
2012-01-13 08:01:00+0000 | ATC
2012-01-15 08:01:00+0000 | ATI
2012-01-18 08:01:00+0000 | FKT
2012-01-05 08:01:00+0000 | NYC
2012-01-11 08:01:00+0000 | JDU
2012-01-04 08:01:00+0000 | LST
How do I solve this?
Thanks
If you insert data into a timestamp column without providing a timezone, like this:
INSERT INTO timestamp_test (tran_date , id ) VALUES ('2016-12-19','TMP')
Cassandra will use the coordinator's timezone:
If no time zone is specified, the time zone of the Cassandra coordinator node handling the write request is used. For accuracy, DataStax recommends specifying the time zone rather than relying on the time zone configured on the Cassandra nodes.
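For example, the same insert with an explicit UTC offset (using the timestamp_test table from above) removes that ambiguity:
INSERT INTO timestamp_test (tran_date, id) VALUES ('2016-12-19 00:00:00+0000', 'TMP');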
If you execute the select with the DataStax driver, you need to convert the String date into a java.util.Date and set the time zone of the coordinator node. In my case it was GMT+6:
DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
dateFormat.setTimeZone(TimeZone.getTimeZone("GMT+6")); // Change this time zone; set it before parsing
Date date = dateFormat.parse("2012-01-21");
Now you can query with:
QueryBuilder.eq("TRAN_DATE", date)
Here is a complete demo:
try (Cluster cluster = Cluster.builder()
        .addContactPoints("127.0.0.1")
        .withCredentials("username", "password")
        .build();
     Session session = cluster.connect("tests")) {

    session.execute("INSERT INTO timestamp_test (tran_date, id) VALUES ('2016-12-19', 'TMP')");

    DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
    dateFormat.setTimeZone(TimeZone.getTimeZone("GMT+6"));
    Date date = dateFormat.parse("2016-12-19");
    System.out.println(date);

    for (Row row : session.execute(QueryBuilder.select().from("timestamp_test")
            .where(QueryBuilder.eq("tran_date", date)))) {
        System.out.println(row);
    }
}
Source: https://docs.datastax.com/en/cql/3.0/cql/cql_reference/timestamp_type_r.html
I know it can be done in the traditional way, but if I were to use Cassandra, is there an easy/quick and agile way to add a CSV to the DB as a set of key-value pairs?
The ability to add time-series data coming via a CSV file is my prime requirement. I am OK with switching to another database such as MongoDB or Riak if it is conveniently doable there.
Edit 2 (Dec 02, 2017)
Please use port 9042. Cassandra access has changed to CQL, with 9042 as the default port; 9160 was the default port for Thrift.
Edit 1
There is a better way to do this without any coding. Look at this answer https://stackoverflow.com/a/18110080/298455
However, if you want to pre-process the data or do something custom, you may want to do it yourself. Here is a lengthy method:
Create a column family.
cqlsh> create keyspace mykeyspace
with strategy_class = 'SimpleStrategy'
and strategy_options:replication_factor = 1;
cqlsh> use mykeyspace;
cqlsh:mykeyspace> create table stackoverflow_question
(id text primary key, name text, class text);
Assuming your CSV is like this:
$ cat data.csv
id,name,class
1,hello,10
2,world,20
Write some simple Python code to read the file and dump it into your CF, something like this:
import csv
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('mykeyspace', ['localhost:9160'])
cf = ColumnFamily(pool, "stackoverflow_question")

with open('data.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print str(row)
        key = row['id']
        del row['id']
        cf.insert(key, row)

pool.dispose()
Execute this:
$ python loadcsv.py
{'class': '10', 'id': '1', 'name': 'hello'}
{'class': '20', 'id': '2', 'name': 'world'}
Look at the data:
cqlsh:mykeyspace> select * from stackoverflow_question;
id | class | name
----+-------+-------
2 | 20 | world
1 | 10 | hello
See also:
a. Beware of DictReader.
b. Look at Pycassa.
c. Google for an existing CSV loader for Cassandra; I guess there are some.
d. There may be a simpler way using the CQL driver; I do not know.
e. Use appropriate data types. I just wrapped them all as text. Not good.
HTH
I did not see the time-series requirement. Here is how you do it for time series.
This is your data
$ cat data.csv
id,1383799600,1383799601,1383799605,1383799621,1383799714
1,sensor-on,sensor-ready,flow-out,flow-interrupt,sensor-killAll
Create a traditional wide row. (CQL suggests not using COMPACT STORAGE, but this is just to get you going quickly.)
cqlsh:mykeyspace> create table timeseries
(id text, timestamp text, data text, primary key (id, timestamp))
with compact storage;
This is the altered code:
import csv
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('mykeyspace', ['localhost:9160'])
cf = ColumnFamily(pool, "timeseries")

with open('data.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print str(row)
        key = row['id']
        del row['id']
        for (timestamp, data) in row.iteritems():
            cf.insert(key, {timestamp: data})

pool.dispose()
This is your time series:
cqlsh:mykeyspace> select * from timeseries;
id | timestamp | data
----+------------+----------------
1 | 1383799600 | sensor-on
1 | 1383799601 | sensor-ready
1 | 1383799605 | flow-out
1 | 1383799621 | flow-interrupt
1 | 1383799714 | sensor-killAll
Let's say your CSV looks like this:
'P38-Lightning', 'Lockheed', 1937, '.7'
Open cqlsh to your DB and create the table:
CREATE TABLE airplanes (
name text PRIMARY KEY,
manufacturer ascii,
year int,
mach float
);
Then load it:
COPY airplanes (name, manufacturer, year, mach) FROM '/classpath/temp.csv';
Refer: http://www.datastax.com/docs/1.1/references/cql/COPY
To take a backup:
./cqlsh -e"copy <keyspace>.<table> to '../data/table.csv';"
To restore from the backup:
./cqlsh -e"copy <keyspace>.<table> from '../data/table.csv';"
There is now an open-source program for bulk-loading data (local or remote) into Cassandra from multiple files (CSVs or JSONs) called DataStax Bulk Loader (see docs, source, examples):
dsbulk load -url ~/my_data_folder -k keyspace1 -t table1 -header true