Existing column can't be found by DataFrame#filter in PySpark

I am using PySpark to perform SparkSQL on my Hive tables.
records = sqlContext.sql("SELECT * FROM my_table")
which retrieves the contents of the table.
When I use the filter argument as a string, it works okay:
records.filter("field_i = 3")
However, when I try to use the filter method, as documented here
records.filter(records.field_i == 3)
I am encountering this error
py4j.protocol.Py4JJavaError: An error occurred while calling o19.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes field_i missing from field_1,field_2,...,field_i,...field_n
even though this field_i column clearly exists in the DataFrame object.
I prefer to use the second way because I need to use Python functions to perform record and field manipulations.
I am using Spark 1.3.0 in Cloudera Quickstart CDH-5.4.0 and Python 2.6.

From Spark DataFrame documentation
In Python it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.
It seems that the name of your field can be a reserved word, try with:
records.filter(records['field_i'] == 3)
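The collision the documentation describes can be illustrated without Spark. The sketch below uses a toy `MiniFrame` class (hypothetical, not PySpark's implementation) whose columns are reachable both by attribute and by indexing. A column named `count` is shadowed by the method of the same name under attribute access, but not under indexing. Whether this is what caused the error in the question is not established.

```python
class MiniFrame:
    """Toy stand-in for a DataFrame; NOT PySpark's actual implementation."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def count(self):
        # A regular method, comparable to DataFrame.count() in PySpark.
        return len(next(iter(self._columns.values()), []))

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real
        # methods such as count() always win over column names.
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, name):
        # Indexing consults the columns only, so it never collides.
        return self._columns[name]


df = MiniFrame({"field_i": [3, 4, 3], "count": [10, 20, 30]})

by_attr = df.field_i      # ordinary column name: attribute access works
by_index = df["count"]    # column lookup, unaffected by the method
shadowed = df.count       # the bound method, not the column
```

Here `df.count` yields the method while `df["count"]` yields the column, which is why the indexing form is the safer habit.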

I upgraded Spark from 1.3.0 to 1.4.0 in Cloudera QuickStart CDH-5.4.0, and the second filtering form now works, although I still can't explain why 1.3.0 has problems with it.

Related

Example for CREATE TABLE on TRINO using HUDI

I am using Spark Structured Streaming (3.1.1) to read data from Kafka and use HUDI (0.8.0) as the storage system on S3 partitioning the data by date. (no problems with this section)
I am looking to use Trino (355) to query that data. As a precursor, I have already placed hudi-presto-bundle-0.8.0.jar in /data/trino/hive/
I created a table with the following schema
CREATE TABLE table_new (
columns, dt
) WITH (
partitioned_by = ARRAY['dt'],
external_location = 's3a://bucket/location/',
format = 'parquet'
);
Even after calling the procedure below, Trino is unable to discover any partitions:
CALL system.sync_partition_metadata('schema', 'table_new', 'ALL')
My assessment is that I am unable to create a table under trino using hudi largely due to the fact that I am not able to pass the right values under WITH Options.
I am also unable to find a create table example under documentation for HUDI.
I would really appreciate it if anyone could give me an example, or point me in the right direction in case I've missed anything.
Really appreciate the help
Small Update:
Tried Adding
connector = 'hudi'
but this throws the error:
Catalog 'hive' does not support table property 'connector'
Have you tried below?
Reference: https://hudi.apache.org/docs/next/querying_data/#trino
https://hudi.apache.org/docs/query_engine_setup/#PrestoDB

Spark wrongly casting integers as `struct<int:int,long:bigint>`

In a spark job, I am using
.withColumn("year", year(to_timestamp(lit(col("timestamp")))))
This code used to work. But now I get the error :
"cannot resolve 'CAST(`timestamp` AS TIMESTAMP)' due to data type mismatch: cannot cast struct<int:int,long:bigint> to timestamp;"
It looks like Spark is reading my timestamp column as a struct<int:int,long:bigint> instead of an int.
How can I prevent that ?
Context the initial data is in jsonline. I read it using AWS GLUE glueContext.create_dynamic_frame.from_catalog. In the GLUE catalog the timestamp column is typed int.
Finally I solved it this way :
GF_resolved = ResolveChoice.apply(
    frame=GF_raw,
    specs=[("timestamp", "cast:int")],
    transformation_ctx="resolve timestamp type",
)
ResolveChoice is a transform available for AWS Glue DynamicFrame objects.
The short answer is that you cannot prevent it if creating a dynamic frame from catalog because, as the name suggests, the schema is dynamic. See this SO for more information.
Alternative approach that is a little more compact is...
gf_resolved = gf_raw.resolveChoice(specs = [('timestamp','cast:int')])
Official documentation for the resolve choice class can be found here.
AWS Resolve Choice

Invalid date:Error while import CSV to Cassandra using pySpark

I'm using a Jupyter notebook to run PySpark code that imports a CSV file into Cassandra v3.11.3, and I get the error below.
... 1 more
[image: error output]
I have attached the PySpark code as a picture:
[image: pyspark_code]
Any inputs...
Without the full trace it's hard to know exactly where this is failing. The method you pasted is just the py4j wrapper method, and we would really need to see the underlying Java exception.
From what I can tell it looks like you are attempting to also use some options on the C* write that are unsupported. For example "MODE" - "DROPMALFORMED" is not a valid C* connector option. DataFrame Writer and Reader options are source specific so you are unfortunately unable to mix and match.
This makes me think that the data being written actually has a malformed date string or two and this code is dying when attempting to write the broken record. One way around this would be to attempt to do the date casting on CSV read which I believe does support DROPMALFORMED style parsing options.
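To check the malformed-date theory before involving Spark or Cassandra at all, the CSV can be scanned in plain Python. This is a rough sketch, not the Spark reader's DROPMALFORMED mode: the column name `event_date` and the `%Y-%m-%d` layout are assumptions, since the original code was only posted as an image.

```python
import csv
import io
from datetime import datetime

DATE_COLUMN = "event_date"   # hypothetical column name
DATE_FORMAT = "%Y-%m-%d"     # assumed date layout


def split_by_date_validity(lines, column=DATE_COLUMN, fmt=DATE_FORMAT):
    """Return (good_rows, bad_rows) depending on whether the date parses."""
    good, bad = [], []
    for row in csv.DictReader(lines):
        try:
            datetime.strptime(row[column], fmt)
        except (ValueError, TypeError, KeyError):
            # Unparseable, missing, or absent date value.
            bad.append(row)
        else:
            good.append(row)
    return good, bad


# Small in-memory sample standing in for the real CSV file.
sample = io.StringIO(
    "id,event_date\n"
    "1,2018-09-01\n"
    "2,not-a-date\n"
    "3,2018-13-40\n"
)
good_rows, bad_rows = split_by_date_validity(sample)
```

If `bad_rows` is non-empty for the real file, those records are likely candidates for the "Invalid date" failure on write.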

How can i extract values from cassandra output using python?

I'm trying to connect to a Cassandra database from Python using the Cassandra driver, and the connection succeeds without any problem. When I fetch values from Cassandra, however, the output is formatted like Row(values).
python version 3.6
package : cassandra
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect('employee')
k=session.execute("select count(*) from users")
print(k[0])
Output :
Row(count=11)
Expected :
11
From documentation:
By default, each row in the result set will be a named tuple. Each row will have a matching attribute for each column defined in the schema, such as name, age, and so on. You can also treat them as normal tuples by unpacking them or accessing fields by position.
So you can access your data by name as k[0].count, or by position as k[0][0]
Please read Getting started document from driver's documentation - it will answer most of your questions.
The driver builds every result row with a row factory, which by default produces named tuples.
In your case, to access the output you should access k[0].count.
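The access patterns can be tried without a cluster. In this sketch a `collections.namedtuple` stands in for the row the driver returns; the `Row` type and the list `k` are built by hand here purely for illustration, whereas in the real program they come from `session.execute(...)`.

```python
from collections import namedtuple

# Stand-in for the driver's default row type; real rows are created
# by the driver, not by user code.
Row = namedtuple("Row", ["count"])
k = [Row(count=11)]

by_name = k[0].count      # access by column name
by_position = k[0][0]     # access by position
(unpacked,) = k[0]        # tuple unpacking of a one-column row

print(by_name)
```

All three forms give the bare integer rather than the `Row(...)` wrapper.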

Unable to read column types from amazon redshift using psycopg2

I'm trying to access the types of columns in a table in redshift using psycopg2.
I'm doing this by running a simple query on pg_table_def as follows:
SELECT * FROM pg_table_def;
This returns the traceback:
psycopg2.NotSupportedError: Column "schemaname" has unsupported type "name"
So it seems like the types of the columns that store schema (and other similar information on further queries) are not supported by psycopg2.
Has anyone run into this issue or a similar one and is aware of a workaround? My primary goal in this is to be able to return the types of columns in the table. For the purposes of what I'm doing, I can't use another postgresql adapter.
Using:
python- 3.6.2
psycopg2- 2.7.4
pandas- 0.17.1
You could do something like the following and return the result to the calling service.
cur.execute("select * from pg_table_def where tablename='sales'")
results = cur.fetchall()
for row in results:
    print("ColumnName=>" + row[2] + ",DataType=>" + row[3] + ",encoding=>" + row[4])
I'm not sure about the exception; if all the permissions are fine, it should work and print something like the following.
ColumnName=>salesid,DataType=>integer,encoding=>lzo
ColumnName=>commission,DataType=>numeric(8,2),encoding=>lzo
ColumnName=>saledate,DataType=>date,encoding=>lzo
ColumnName=>description,DataType=>character varying(255),encoding=>lzo
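The same lookup can be wrapped in a small helper that selects only the needed columns and passes the table name as a query parameter. This is a sketch under assumptions: `column_types` and `FakeCursor` are hypothetical names, the stub cursor exists only so the example runs without a Redshift connection, and it is not confirmed that this avoids the NotSupportedError from the question.

```python
def column_types(cursor, table_name):
    """Map column name -> (type, encoding) for one table via pg_table_def."""
    # "column" is quoted because it is a reserved word.
    cursor.execute(
        'SELECT "column", type, encoding FROM pg_table_def '
        "WHERE tablename = %s",
        (table_name,),
    )
    return {name: (col_type, enc) for name, col_type, enc in cursor.fetchall()}


class FakeCursor:
    """Hypothetical stub mimicking the two DB-API calls used above."""

    def __init__(self, rows):
        self._rows = rows
        self.last_params = None

    def execute(self, sql, params=None):
        # Record the parameters instead of talking to a database.
        self.last_params = params

    def fetchall(self):
        return list(self._rows)


cur = FakeCursor([
    ("salesid", "integer", "lzo"),
    ("saledate", "date", "lzo"),
])
types = column_types(cur, "sales")
```

With a real connection, `cur` would be the cursor obtained from psycopg2 and the helper would be called the same way.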
