While studying the pycassa API I downloaded the sample project Twissandra.
I configured it with Cassandra, and after logging in, when I add a tweet the following error occurs:
Environment:
Request Method: POST
Request URL: http://127.0.0.1:8000/
Django Version: 1.3.1
Python Version: 2.7.2
Installed Applications:
['django.contrib.sessions', 'tweets', 'users']
Installed Middleware:
('django.middleware.common.CommonMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
'users.middleware.UserMiddleware')
Traceback:
File "C:\Python27\lib\site-packages\django\core\handlers\base.py" in get_response
111. response = callback(request, *callback_args, **callback_kwargs)
File "C:\Users\Muhammad Umair\workspace\Twissandra\src\Twissandra\tweets\views.py" in
timeline
20. 'body': form.cleaned_data['body'],
File "C:\Users\Muhammad Umair\workspace\Twissandra\src\Twissandra\cass.py" in
save_tweet
216. USERLINE.insert(str(username), {ts: str(tweet_id)})
File "C:\Python27\lib\site-packages\pycassa-1.3.0-py2.7.egg\pycassa\columnfamily.py" in insert
860. colval = self._pack_value(columns.values()[0], colname)
File "C:\Python27\lib\site-packages\pycassa-1.3.0-py2.7.egg\pycassa\columnfamily.py" in _pack_value
428. return packer(value)
File "C:\Python27\lib\site-packages\pycassa-1.3.0-py2.7.egg\pycassa\marshal.py" in pack_uuid
202. randomize=True)
File "C:\Python27\lib\site-packages\pycassa-1.3.0-py2.7.egg\pycassa\util.py" in convert_time_to_uuid
66. 'neither a UUID, a datetime, or a number')
Exception Type: ValueError at /
Exception Value: Argument for a v1 UUID column name or value was neither a UUID, a datetime, or a number
Did you modify the Cassandra column families or create them yourself? Maybe you're using an old version of Twissandra?
This particular stacktrace shows that pycassa is expecting a UUID for a column value, but in recent versions of Twissandra, the column values are all BytesType (untyped).
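If you want to double-check that, you can inspect the column family definitions with pycassa's SystemManager. This is only a diagnostic sketch; the keyspace name 'Twissandra' and the Thrift address are assumptions based on a default Twissandra setup:
from pycassa.system_manager import SystemManager

# Look at each column family's default validation class; a TimeUUIDType value
# validator on Userline would explain the pack_uuid error above.
sys_mgr = SystemManager('127.0.0.1:9160')
cf_defs = sys_mgr.get_keyspace_column_families('Twissandra')
for name, cf_def in cf_defs.items():
    print(name, cf_def.default_validation_class)
sys_mgr.close()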
I'm trying to load a CSV from Google Cloud Storage into BigQuery using schema autodetect.
However, I'm stumped by a parsing error on one of my columns, and I'm perplexed as to why BigQuery can't parse the field. According to the documentation, it should be able to parse fields that look like YYYY-MM-DD HH:MM:SS.SSSSSS (which is exactly what my BQInsertTimeUTC column contains).
Here's my code:
from google.cloud import bigquery
from google.oauth2 import service_account
project_id = "<my_project_id>"
table_name = "<my_table_name>"
gs_link = "gs://<my_bucket_id>/my_file.csv"
# gcs_creds is a dict holding the service-account key info (loaded elsewhere)
creds = service_account.Credentials.from_service_account_info(gcs_creds)
bq_client = bigquery.Client(project=project_id, credentials=creds)
dataset_ref = bq_client.dataset("<my_dataset_id>")
# create job_config object
job_config = bigquery.LoadJobConfig(
autodetect=True,
skip_leading_rows=1,
source_format="CSV",
write_disposition="WRITE_TRUNCATE",
)
# prepare the load_job
load_job = bq_client.load_table_from_uri(
gs_link,
dataset_ref.table(table_name),
job_config=job_config,
)
# execute the load_job
result = load_job.result()
Error Message:
Could not parse '2021-07-07 23:10:47.989155' as TIMESTAMP for field BQInsertTimeUTC (position 4) starting at location 64 with message 'Failed to parse input string "2021-07-07 23:10:47.989155"'
And here's the csv file that is living in GCS:
first_name, last_name, date, number_col, BQInsertTimeUTC, ModifiedBy
lisa, simpson, 1/2/2020T12:00:00, 2, 2021-07-07 23:10:47.989155, tim
bart, simpson, 1/2/2020T12:00:00, 3, 2021-07-07 23:10:47.989155, tim
maggie, simpson, 1/2/2020T12:00:00, 4, 2021-07-07 23:10:47.989155, tim
marge, simpson, 1/2/2020T12:00:00, 5, 2021-07-07 23:10:47.989155, tim
homer, simpson, 1/3/2020T12:00:00, 6, 2021-07-07 23:10:47.989155, tim
When BigQuery loads a CSV file, it assumes that all timestamp fields follow the same format. In your CSV file, the first timestamp value it encounters is "1/2/2020T12:00:00", so it decides that the file's timestamp format is [M]M/[D]D/YYYYT[H]H:[M]M:[S]S[.F][time zone].
Therefore, it complains that the value "2021-07-07 23:10:47.989155" cannot be parsed. If you change "2021-07-07 23:10:47.989155" to "7/7/2021T23:10:47.989155", it will work.
To fix this, you can either:
Create a table in which the date and BQInsertTimeUTC columns are typed as STRING, load the CSV into it, and then expose a view that presents the expected TIMESTAMP types for date and BQInsertTimeUTC, using SQL to transform the data from the base table.
Or open the CSV file and transform either the "date" values or the "BQInsertTimeUTC" values so that their formats are consistent (a sketch of this is shown right after).
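A minimal sketch of the second option in plain Python, rewriting the "date" column into the canonical YYYY-MM-DD HH:MM:SS form before loading; the file names are placeholders and the input format is taken from the 1/2/2020T12:00:00 values in the sample above:
import csv
from datetime import datetime

# Rewrite the "date" column (e.g. 1/2/2020T12:00:00 -> 2020-01-02 12:00:00)
# so every timestamp column in the file uses the same canonical format.
with open("my_file.csv", newline="") as src, open("my_file_fixed.csv", "w", newline="") as dst:
    reader = csv.DictReader(src, skipinitialspace=True)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        parsed = datetime.strptime(row["date"], "%m/%d/%YT%H:%M:%S")
        row["date"] = parsed.strftime("%Y-%m-%d %H:%M:%S")
        writer.writerow(row)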
By the way, the CSV sample you pasted here has an extra space after the delimiter ",".
Working version:
first_name,last_name,date,number_col,BQInsertTimeUTC,ModifiedBy
lisa,simpson,1/2/2020T12:00:00,2,7/7/2021T23:10:47.989155,tim
bart,simpson,1/2/2020T12:00:00,3,7/7/2021T23:10:47.989155,tim
maggie,simpson,1/2/2020T12:00:00,4,7/7/2021T23:10:47.989155,tim
marge,simpson,1/2/2020T12:00:00,5,7/7/2021T23:10:47.989155,tim
homer,simpson,1/3/2020T12:00:00,6,7/7/2021T23:10:47.989155,tim
As per the limitations mentioned here,
When you load JSON or CSV data, values in TIMESTAMP columns must use a dash - separator for the date portion of the timestamp, and the date must be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon : separator.
So can you try passing BQInsertTimeUTC as 2021-07-07 23:10:47, without the milliseconds, instead of 2021-07-07 23:10:47.989155?
If you still want to use different date formats, you can do the following (sketched after this list):
Load the CSV file as-is into BigQuery (i.e. your schema should be modified so that BQInsertTimeUTC is STRING).
Create a BigQuery view that transforms the loaded field from a string to a recognized timestamp format.
Use PARSE_TIMESTAMP (or PARSE_DATE, if you only need the date part) on BQInsertTimeUTC and use that view for your analysis.
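A minimal sketch of that approach with the Python client from the question; the project, dataset, table, and view names are placeholders, and the PARSE_TIMESTAMP format strings assume the value shapes shown in the sample CSV, so they may need adjusting:
from google.cloud import bigquery

bq_client = bigquery.Client(project="<my_project_id>")

# The base table was loaded with date and BQInsertTimeUTC as STRING; the view
# exposes proper TIMESTAMP columns by parsing each format explicitly.
create_view_sql = """
CREATE OR REPLACE VIEW `<my_project_id>.<my_dataset_id>.<my_table_name>_v` AS
SELECT
  first_name,
  last_name,
  PARSE_TIMESTAMP('%m/%d/%YT%H:%M:%S', date) AS date,
  number_col,
  PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%E6S', BQInsertTimeUTC) AS BQInsertTimeUTC,
  ModifiedBy
FROM `<my_project_id>.<my_dataset_id>.<my_table_name>`
"""
bq_client.query(create_view_sql).result()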
I am new to Spark Streaming and pandas UDFs. I am working on a PySpark consumer for Kafka; the payload is in XML format, and I am trying to parse the incoming XML by applying a pandas UDF:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("col1 string, col2 string", PandasUDFType.GROUPED_MAP)
def test_udf(df):
    import xmltodict
    from collections import MutableMapping  # imported but not used below
    # Parse the first XML payload in the group and pick out the needed fields
    xml_str = df.iloc[0, 0]
    df_col = ['col1', 'col2']
    doc = xmltodict.parse(xml_str, dict_constructor=dict)
    extract_needed_fields = {k: doc[k] for k in df_col}
    return pd.DataFrame([{'col1': 'abc', 'col2': 'def'}], index=[0], dtype="string")

# df is the streaming DataFrame read from Kafka
data = df.selectExpr("CAST(value AS STRING) AS value")
data.groupby("value").apply(test_udf).writeStream.format("console").start()
I get the below error
File "pyarrow/array.pxi", line 859, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 215, in pyarrow.lib.array
File "pyarrow/array.pxi", line 104, in pyarrow.lib._handle_arrow_array_protocol
ValueError: Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
Is this the right approach? What am I doing wrong?
It looks as if this is more of an undocumented limitation than a bug.
You cannot use any pandas dtype that is stored in an array object with an __arrow_array__ method, because PySpark always passes a mask. The string dtype you used is stored in a StringArray, which is exactly such a case. After I converted the string dtype to object, the error went away (see the sketch below).
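A minimal sketch of that change applied to the UDF from the question; only the dtype of the returned DataFrame is different, everything else stays as in the original:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("col1 string, col2 string", PandasUDFType.GROUPED_MAP)
def test_udf(df):
    # Returning plain object-dtype columns avoids pandas' StringArray,
    # whose __arrow_array__ protocol conflicts with the mask PySpark passes.
    return pd.DataFrame([{'col1': 'abc', 'col2': 'def'}], index=[0], dtype="object")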
While converting a pandas DataFrame to a PySpark one, I stumbled upon this error as well:
Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol
My pandas DataFrame had datetime-like values that I tried to convert to "string". I initially used the astype("string") method, which looked like this:
df["time"] = (df["datetime"].dt.time).astype("string")
When I checked the info of this DataFrame, it seemed like it had indeed been converted to a string dtype:
df.info(verbose=True)
> ...
> # Column Non-Null Count Dtype
> ...
> 6 time 295452 non-null string
But the error kept coming back.
Solution
To avoid it, I instead used the apply(str) method:
df["time"] = (df["datetime"].dt.time).apply(str)
Which gave me a dtype of object:
df.info(verbose=True)
> ...
> # Column Non-Null Count Dtype
> ...
> 6 time 295452 non-null object
After that, the conversion was successful
spark.createDataFrame(df)
# DataFrame[datetime: string, date: string, year: bigint, month: bigint, day: bigint, day_name: string, time: string, hour: bigint, minute: bigint]
I have created a table in the AWS Glue Catalog pointing to an S3 location, and I am using AWS Glue ETL to read any new file in that location. A data file has its first record as a header. However, sometimes an empty file with no data and no headers is dropped in S3. Because such a file has no header information either, my ETL job fails with 'Cannot resolve given input columns'.
My question: is there a way to NOT read the schema from file headers but only from the AWS Glue Catalog? I have already defined the schema in the catalog. I would still want to skip the first line of data files while reading, just not treat it as a header.
Below is the code I am trying -
from pyspark.sql.functions import col

datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "test", transformation_ctx = "datasource1")
datasource1DF = datasource1.toDF()
datasource1DF.select(col('updatedtimestamppdt')).show()
Error -
Fail to execute line 1:
datasource1DF.orderBy(col('updatedtimestamppdt'), ascending=False).select(col('updatedtimestamppdt')).distinct().show()
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o1616.sort.
: org.apache.spark.sql.AnalysisException: cannot resolve 'updatedtimestamppdt' given input columns: [];;
Have you tried the following (as long as you've checked the box that exposes your Glue Catalog as a Hive metastore)?
df = spark.sql('select * from yourgluedatabase.yourgluetable')
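A minimal sketch of how that could look end to end, rerunning the query from the question against the catalog-backed table; the database and table names are placeholders, and it assumes the Glue Data Catalog is enabled as the Hive metastore for the Spark session:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Assumes "Use Glue Data Catalog as the Hive metastore" is enabled for the job.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The schema comes from the Glue Catalog rather than from file headers,
# so an empty file no longer breaks column resolution.
df = spark.sql("select * from yourgluedatabase.yourgluetable")
df.orderBy(col("updatedtimestamppdt"), ascending=False) \
  .select(col("updatedtimestamppdt")).distinct().show()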
My Ignite instance (2.7.5) is up and working; I can connect to it with DBeaver and create tables, store and retrieve data, etc.
I am now trying to connect from a Python 3 (3.6.8) script using pyodbc. I have compiled and installed the Apache Ignite ODBC driver from the source code provided in the ${IGNITE_HOME}/platforms/cpp directory. The script is able to create a table with two columns, one int and one varchar, but when I try to insert a string value into the varchar column an exception is thrown:
Traceback (most recent call last):
File "/path/Projects/test_ignite/main.py", line 27, in <module>
main()
File "/path/Projects/test_ignite/main.py", line 23, in main
create_table(conn)
File "/path/Projects/test_ignite/main.py", line 16, in create_table
cursor.execute(sql, (row_counter, col_1))
pyodbc.Error: ('HYC00', '[HYC00] Data type is not supported. [typeId=-9] (0) (SQLBindParameter)')
Changing the data type of the second column works as expected.
The sample script is below:
import pyodbc


def create_table(conn):
    sql = 'CREATE TABLE IF NOT EXISTS sample (key int, col_1 varchar, PRIMARY KEY(key))'
    cursor = conn.cursor()
    cursor.execute(sql)

    sql = 'insert into sample (key, col_1) values (?, ?)'
    num_rows = 10
    row_counter = 0
    while row_counter < num_rows:
        row_counter = row_counter + 1
        col_1 = 'Foo'
        cursor.execute(sql, (row_counter, col_1))  # Exception thrown here


def main():
    conn = pyodbc.connect('DRIVER={Apache Ignite};' +
                          'SERVER=10.0.1.48;' +
                          'PORT=10800;')
    create_table(conn)


if __name__ == '__main__':
    main()
pyodbc seems to use the WVARCHAR type for string parameters, and Ignite's ODBC driver does not seem to support it. I would recommend using the JDBC bindings or the Python thin client instead (a sketch of the thin-client route follows).
I have filed an issue against Apache Ignite JIRA: IGNITE-12175
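A minimal sketch of the thin-client route with the pyignite package, mirroring the script from the question; the server address and table are taken from the question, and the exact pyignite API details should be verified against its documentation:
# pip install pyignite  (Apache Ignite Python thin client)
from pyignite import Client

client = Client()
client.connect('10.0.1.48', 10800)

client.sql('CREATE TABLE IF NOT EXISTS sample (key int, col_1 varchar, PRIMARY KEY(key))')

# Parameterized inserts go through the binary protocol, so the ODBC
# WVARCHAR limitation does not apply here.
for row_counter in range(1, 11):
    client.sql('INSERT INTO sample (key, col_1) VALUES (?, ?)',
               query_args=[row_counter, 'Foo'])

client.close()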
I am trying to make a connection between a Jupyter Notebook and a Neo4j server graph. I have looked at different methods to achieve this, but none of them work for me; each gives me the same error.
from py2neo import Graph
graph = Graph(host="neo4j#bolt://63.35.194.218:7687", auth=("neo4j", "neo4j"))
%reload_ext cypher
query= "MATCH (a)-[]-(b) RETURN a.id, b.id limit 1"
data = graph.cypher.execute(query)
data
This gives me an AttributeError:
AttributeError Traceback (most recent call last)
<ipython-input-10-5bbea41de85c> in <module>
3 get_ipython().run_line_magic('reload_ext', 'cypher')
4 query= "MATCH (a)-[]-(b) RETURN a.id, b.id limit 1"
----> 5 data = graph.cypher.execute(query)
6 data
AttributeError: 'Graph' object has no attribute 'cypher'
I expect to establish a connection between the two applications and have the ids of the nodes returned.
In your example you are mixing the use of the Cypher extension for Jupyter with the use of a pure Python script (but that's not your main problem).
So you have to choose between the two approaches below:
pip install py2neo
from py2neo import Graph
graph = Graph(host="neo4j#bolt://63.35.194.218:7687", auth=("neo4j", "neo4j"))
graph.run("MATCH (a)-[]-(b) RETURN a.id, b.id limit 1").data()
In this example I'm using graph.run and not graph.cypher.run; graph.cypher was removed in version 3 of py2neo. (A variant that passes the Bolt URI directly is sketched below.)
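A hedged variant of the same snippet that passes the Bolt URI as the first argument instead of the host keyword, which is the more common py2neo form; the address and credentials are copied from the question and may need adjusting:
from py2neo import Graph

# Pass the Bolt URI directly; auth is the (user, password) pair.
graph = Graph("bolt://63.35.194.218:7687", auth=("neo4j", "neo4j"))

# .data() returns a list of dicts, one per result record.
result = graph.run("MATCH (a)-[]-(b) RETURN a.id AS a_id, b.id AS b_id LIMIT 1").data()
print(result)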
Or, using the Cypher extension:
pip install ipython-cypher
%load_ext cypher
%cypher MATCH (a)-[]-(b) RETURN a.id, b.id limit 1