I am trying to export and re-import data from a Cassandra table in order to change a timestamp column to a Unix epoch column (i.e. type timestamp to bigint).
I tried exporting the data to CSV using the command below:
COPY raw_data(raw_data_field_id, toUnixTimestamp(dt_timestamp), value) TO 'raw_data_3_feb_19.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;
but I get the error: Improper COPY command.
How can I fix this issue or is there a better way to achieve this?
from
raw_data_field_id | dt_timestamp | value
-------------------+---------------------------------+-------
23 | 2018-06-12 07:15:00.656000+0000 | 131.3
to
raw_data_field_id | dt_unix_timestamp_epoch | value
-------------------+---------------------------------+-------
23 | 1528787700656 | 131.3
The COPY command does not support adding extra functions to process the output.
I would say you have several solutions:
export the data to CSV using COPY, convert the timestamp values (using shell commands or a higher-level language; see the sketch after this list) and import them into a new table
export using echo "select raw_data_field_id, toUnixTimestamp(dt_timestamp), value from raw.raw_data;" | ccm node1 cqlsh > output.csv, change the csv so it has a proper format and import it to a new table (this solution is from here)
write your own conversion tool using one of the Cassandra drivers (Python, Java, etc.).
maybe you could try something with a UDF, but I haven't tested this.
You should be aware that COPY FROM supports datasets that have fewer than 2 million rows.
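For the first option, here is a minimal sketch of the conversion step in Python. It assumes the exported CSV has the three columns in the order shown above and that the timestamp format matches the sample output; adjust the file names and the strptime pattern to your actual export.
import csv
from datetime import datetime

with open("raw_data_3_feb_19.csv", newline="") as src, open("raw_data_unix.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for raw_data_field_id, dt_timestamp, value in reader:
        # e.g. '2018-06-12 07:15:00.656000+0000' -> 1528787700656
        dt = datetime.strptime(dt_timestamp, "%Y-%m-%d %H:%M:%S.%f%z")
        writer.writerow([raw_data_field_id, int(round(dt.timestamp() * 1000)), value])
The resulting file can then be loaded into the new table with COPY FROM.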
I have been working on a project to import the Danish 2.5 million ATM transaction dataset to derive some visualizations.
The data is hosted on a MySQL server provided by the university. The objective is to import the data using Sqoop and then apply a few transformations to it using PySpark.
Link to the dataset: https://www.kaggle.com/sparnord/danish-atm-transactions
The SQL server that hosts this information has a few rows which are intentionally or unintentionally mangled.
So I have a very basic Sqoop command which gets the details from the source database. However, I run into an issue where there are values containing a double quote ("), especially in the column message_text.
Sqoop Command :
sqoop import --connect jdbc:mysql:{source-connection-string} --table SRC_ATM_TRANS --username {username} --password {password} --target-dir /user/root/etl_project --fields-terminated-by '|' --lines-terminated-by "\n" -m 1
Here is a sample row as it gets imported:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction|0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds
However, the expected output should be:
2017|January|1|Sunday|21|Active|85|Diebold Nixdorf|København|Regnbuepladsen|5|1550|55.676|12.571|DKK|MasterCard|4531|Withdrawal|4017|"Suspected malfunction,0.000|55.676|13|2618425|0.000|277|1010|93|3|280.000|0|75|803|Clouds|Cloudy
At first I was okay with this, hoping that PySpark would handle the mangled data since the delimiters are specified.
But now I run into issues when populating my dataframe.
transactions = spark.read.option("sep", "|").csv("/user/root/etl_project/part-m-00000", header=False, schema=transaction_schema)
However when I inspect my rows I see that the mangled data has caused the dataframe to put these affected values into a single column!
transactions.filter(transactions.message_code == "4017").collect()
Row(year=2017, month=u'January', day=1, weekday=u'Sunday', hour=17, atm_status=u'Active', atm_id=u'35', atm_manufacturer=u'NCR', atm_location=u'Aabybro', atm_streetname=u'\xc3\u0192\xcb\u0153stergade', atm_street_number=6, atm_zipcode=9440, atm_lat=57.162, atm_lon=9.73, currency=u'DKK', card_type=u'MasterCard', transaction_amount=7387, service=u'Withdrawal', message_code=u'4017', message_text=u'Suspected malfunction|0.000|57.158|10|2625037|0.000|276|1021|83|4|319.000|0|0|800|Clear', weather_lat=None, weather_lon=None, weather_city_id=None, weather_city_name=None, temp=None, pressure=None, humidity=None, wind_speed=None, wind_deg=None, rain_3h=None, clouds_all=None, weather_id=None, weather_main=None, weather_description=None)
At this point I am not sure what to do.
Do I go ahead and create temporary columns to manage this and use a regex replacement to fill in these values ?
Or is there any better way I can import the data and manage these mangled values either in sqoop or in pyspark ?
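One workaround to try on the PySpark side (a sketch, not a definitive fix) is to disable quote handling entirely, so a stray double quote inside message_text is treated as a literal character rather than the start of a quoted field. transaction_schema is assumed to be the schema you already defined.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

transactions = (
    spark.read
    .option("sep", "|")
    .option("quote", "")  # an empty quote character turns off quote handling in Spark's CSV reader
    .schema(transaction_schema)
    .csv("/user/root/etl_project/part-m-00000", header=False)
)
This only helps with unbalanced quotes; rows that are missing fields entirely would still need to be cleaned up separately.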
I am new to Azure; I was using SAS before and now we are moving to Azure Synapse.
In the current environment, I want to extract an XML tag value stored in column C (varchar(max)) as a variable.
Dataset screenshot: https://i.stack.imgur.com/tbSIF.png
The XML below is saved in column C (PKDATA):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:DataSet xmlns:ns2="http://www.test.com/t/cn/el">
<EnumObject>
<name>Inpatient</name>
<value>262784067</value>
<radiobutton>false</radiobutton>
</EnumObject>
<StringObject>
<name>xxx</name>
<prompt></prompt>
<value>/widget.jsp</value>
<width>99</width>
</StringObject>
</ns2:DataSet>
If name is Inpatient, then 262784067 should be returned as the Inpatient type.
Expected output:
| A | B | Inpatient type |
| 11212 | 2587140 | 262784067 |
I used the following code:
select a,b,
pkdata.value('/EnumObject/name') as Inpatient type
from dbo.extdata
I get the following error:
Cannot find either column "pkkddata" or the user-defined function or aggregate "pkdata.value", or the name is ambiguous.
I tried using the following query, but it gives me the error: Msg 104220, Level 16, State 1, Line 26: Cannot find data type 'xml'.
SELECT a, b, (pkdata).value('(/EnumObject/name/text())[1]', 'varchar(100)')
FROM [dbo].extdata
CROSS APPLY (SELECT CAST(pkdata AS xml)) AS x(pkdata)
I get the following error when I use the code below: The XMLDT method 'nodes' can only be invoked on columns of type xml.
I also tried the following, but get an 'Incorrect syntax near passing' error:
select x.*
from [dbo].[EXTDATA] rt
cross join xmltable(
    '/EnumObject/name' passing xmltype(rt.pkdata)
    columns name number path 'name/#value'
) x
Not sure how to proceed
Azure SQL version
Microsoft Azure SQL Data Warehouse - 10.0.16003.0 Apr 28 2021 04:55:16 Copyright (c) Microsoft Corporation
Azure Synapse Analytics, specifically dedicated SQL pools, does not support the XML data type or any of its accompanying functions, including FOR XML, .nodes, .value, .query, .modify, etc.
If you need this type of processing, you can either use traditional SQL Server (e.g. SQL Server 2019) or Azure SQL DB. One option would be to use Synapse Pipelines to move the data there. As an alternative you could look at using Synapse Notebooks and some custom Python / Scala / C# code, but I have only done a simple test of this.
Simple example in Scala:
Cell 1
// Get the table with the XML column from the database and expose as temp view
val df = spark.read.synapsesql("yourPool.dbo.someXMLTable")
df.createOrReplaceTempView("someXMLTable")
Cell 2
%%sql
-- Use SparkSQL to interrogate the XML
-- https://spark.apache.org/docs/2.3.0/api/sql/index.html#xpath
SELECT
colA,
colB,
xpath_string(pkData,'/DataSet/EnumObject[name="Inpatient"]/value') xvalue
FROM someXMLTable
Cell 3
val df2 = spark.sql("""
SELECT
colA,
colB,
xpath_string(pkData,'/DataSet/EnumObject[name="Inpatient"]/value') xvalue
FROM someXMLTable
""")
df2.show
Cell 4
// Write that dataframe back to the dedicated SQL pool
df2.write.synapsesql("yourPool.dbo.someXMLTable_processed", Constants.INTERNAL)
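If you would rather work in Python, roughly the same query can be run from a PySpark cell. This is only a sketch that reuses the temp view and the column names assumed in the Scala cells above.
df2_py = spark.sql("""
    SELECT
        colA,
        colB,
        xpath_string(pkData, '/DataSet/EnumObject[name="Inpatient"]/value') AS xvalue
    FROM someXMLTable
""")
df2_py.show()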
XML is kind of old-fashioned these days - have you thought about switching to JSON? Also, if your data volumes aren't that big, it would be a lot cheaper to just use Azure SQL DB rather than Synapse.
I want to run a simple SQL SELECT of timestamp fields from my data using Spark SQL (PySpark).
However, all the timestamp fields appear as 1970-01-19 10:45:37.009.
So looks like I have some conversion incompatibility between timestamp in Glue and in Spark.
I'm running with PySpark, and I have the Glue catalog configuration so I get my database schema from Glue. In both Glue and the Spark SQL DataFrame these columns appear with the timestamp type.
However, it looks like when I read the parquet files from s3 path, the event_time column (for example) is of type long and when I get its data, I get a correct event_time as epoch in milliseconds = 1593938489000. So I can convert it and get the actual datetime.
But when I run spark.sql, the event_time column gets the timestamp type, yet the value isn't useful and is missing precision. So I get this: 1970-01-19 10:45:37.009.
When I run the same sql query in Athena, the timestamp field looks fine so my schema in Glue looks correct.
Is there a way to overcome it?
I didn't manage to find any spark.sql configurations that solved it.
You are getting 1970 due to an incorrect way of formatting. Please try the code below to convert the long value to a UTC timestamp:
from pyspark.sql import types as T
from pyspark.sql import functions as F

# example epoch value in milliseconds, as a literal column
df = df.withColumn('timestamp_col_original', F.lit('1593938489000'))
# divide by 1000 to get seconds, then cast to a timestamp
df = df.withColumn('timestamp_col', (F.col('timestamp_col_original') / 1000).cast(T.TimestampType()))
df.show()
While converting 1593938489000, I got the output below:
+----------------------+-------------------+
|timestamp_col_original|      timestamp_col|
+----------------------+-------------------+
|         1593938489000|2020-07-05 08:41:29|
|         1593938489000|2020-07-05 08:41:29|
|         1593938489000|2020-07-05 08:41:29|
|         1593938489000|2020-07-05 08:41:29|
+----------------------+-------------------+
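If you read the Parquet files directly from S3 (where event_time comes back as a long holding epoch milliseconds), the same conversion applies. A minimal sketch, assuming the existing spark session, a column named event_time and a placeholder S3 path:
from pyspark.sql import functions as F
from pyspark.sql import types as T

raw = spark.read.parquet("s3://your-bucket/your-table/")
# epoch milliseconds -> seconds -> timestamp
raw = raw.withColumn("event_time_ts", (F.col("event_time") / 1000).cast(T.TimestampType()))
raw.select("event_time", "event_time_ts").show(5, truncate=False)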
There is NO documentation regarding how to convert PCollections into the PCollections necessary for input into CoGroupByKey().
Context
Essentially I have two large PCollections and I need to be able to find differences between the two, for Type II ETL changes (if a record doesn't exist in pColl1, then add it to a nested field found in pColl2), so that I am able to retain history of these records from BigQuery.
Pipeline Architecture:
Read BQ Tables into 2 pCollections: dwsku and product.
Apply a CoGroupByKey() to the two sets to return --> Results
Parse results to find and nest all changes in dwsku into product.
Any help would be appreciated. I found a Java link on SO that does the same thing I need to accomplish (but there's nothing on the Python SDK):
Convert from PCollection<TableRow> to PCollection<KV<K,V>>
Is there a documentation / support for Apache Beam, especially Python SDK?
In order to get CoGroupByKey() working, you need to have PCollections of tuples, in which the first element is the key and the second is the data.
In your case, you said that you have BigQuerySource, which in the current version of Apache Beam outputs a PCollection of dictionaries (code), in which every entry represents a row in the table which was read. You need to map these PCollections to tuples as stated above. This is easy to do using ParDo:
import apache_beam as beam

# DoFn that turns each BigQuery row (a dict) into a (key, row) tuple
class MapBigQueryRow(beam.DoFn):
    def process(self, element, key_column):
        key = element.get(key_column)
        yield key, element

data1 = (p
         | "Read #1 BigQuery table" >> beam.io.Read(beam.io.BigQuerySource(query="your query #1"))
         | "Map #1 to KV" >> beam.ParDo(MapBigQueryRow(), key_column="KEY_COLUMN_IN_TABLE_1"))

data2 = (p
         | "Read #2 BigQuery table" >> beam.io.Read(beam.io.BigQuerySource(query="your query #2"))
         | "Map #2 to KV" >> beam.ParDo(MapBigQueryRow(), key_column="KEY_COLUMN_IN_TABLE_2"))

co_grouped = ({"data1": data1, "data2": data2} | beam.CoGroupByKey())

# do your processing with co_grouped here
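For example, a minimal sketch of consuming the grouped output; the function name and the "missing in product" logic are only illustrative:
def find_new_keys(element):
    key, grouped = element
    dwsku_rows = list(grouped["data1"])    # rows from the first table for this key
    product_rows = list(grouped["data2"])  # rows from the second table for this key
    # e.g. emit keys that exist in dwsku but have no match in product
    if dwsku_rows and not product_rows:
        yield key, dwsku_rows

changes = co_grouped | "Find differences" >> beam.FlatMap(find_new_keys)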
BTW, documentation of Python SDK for Apache Beam can be found here.
I am trying to import a CSV file into Cassandra. The CSV file was generated from Postgres and it contains some null values.
Cassandra version:
[cqlsh 5.0.1 | Cassandra 3.5 | CQL spec 3.4.0 | Native protocol v4]
I am using this query to import:
copy reports
(id,name,user_id,user_name,template_id,gen_epoch,exp_epoch,file_name,format,refile_size,is_sch,job_id,status,status_msg)
from '/home/reports.csv' with NULL='' and header=true and DELIMITER =
',';
I keep on receiving this error:
Failed to import 66 rows: ParseError - invalid literal for int() with
base 10: '', given up without retries
However, when I changed all the null values to some random value, I was able to import those rows using the same command. I have already tried all the solutions I found on the internet.
Please advise.
You can just put an empty field in the CSV.
E.g., if I want to set a value in the DB to null, I can do something like:
cqlsh$> copy my_table(id,value_column1,value_column2) from 'myimport.csv';
And in myimport.csv, there will be
1234,,3
This way, value_column1 will have a null value.
This looks like a Cassandra bug (see https://issues.apache.org/jira/browse/CASSANDRA-11549). I haven't been able to find a way to get Cassandra to accept the nulls. You may have to stick with a workaround for now: substitute some sentinel value for the nulls.
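If you go the sentinel route, a small preprocessing script can do the substitution before the COPY FROM. A minimal sketch, assuming hypothetical file names and zero-based positions for the int columns that can be null:
import csv

SENTINEL = "-1"
INT_COLUMNS = {5, 6, 10}  # hypothetical positions of the int columns containing nulls

with open("reports.csv", newline="") as src, open("reports_clean.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        writer.writerow(
            SENTINEL if i in INT_COLUMNS and field == "" else field
            for i, field in enumerate(row)
        )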