ParseException Error using Regex in pyspark 2.4 - apache-spark

I am trying to get only those rows where colADD contains a non-alphanumeric character.
Code:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Test") \
.getOrCreate()
data = spark.read.csv("Customers");
data.registerTempTable("data");
spark.sql("SELECT colADD from data WHERE colADD REGEXP '^[A-Za-z0-9]+$'; ");
Error:
pyspark.sql.utils.ParseException: u"\nextraneous input ';'
expecting <EOF>(line 1, pos 56)\n\n== SQL ==\nSELECT CNME from data WHERE CNME REGEXP '^[A-Za-z0-9]+$';
Please help, am I missing something?

With Spark, I used this:
spark.sql("SELECT col2 from test WHERE col2 REGEXP '^[A-Za-z0-9]*\\-' ").show

I have not used pyspark, but how about just removing the ';'? It does not seem to be needed.
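For completeness, a minimal pyspark sketch without the trailing semicolon (table and column names are taken from the question; header=True is an assumption so that colADD resolves as a column name, and the second query is an assumption about the stated intent, since '^[A-Za-z0-9]+$' matches values that are entirely alphanumeric):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test").getOrCreate()
data = spark.read.csv("Customers", header=True)  # header=True assumed so colADD exists as a column name
data.createOrReplaceTempView("data")  # current name for registerTempTable

# no trailing ';' inside the SQL string
spark.sql("SELECT colADD FROM data WHERE colADD REGEXP '^[A-Za-z0-9]+$'").show()

# if the goal is rows that contain a non-alphanumeric character, invert the character class
spark.sql("SELECT colADD FROM data WHERE colADD REGEXP '[^A-Za-z0-9]'").show()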

Related

Spark NumberFormatException when using spark.read.jdbc from Hive by JDBC protocol

The problem occurs when using spark.read.jdbc():
spark = SparkSession.builder \
.[some options...]\
.config("spark.sql.warehouse.dir", "hdfs://ip:port/path") \
.config("hive.metastore.uris", "thrift://ip:port") \
.enableHiveSupport() \
.getOrCreate()
spark.read.jdbc(url="jdbc:hive2://ip:port", table="db.table", properties={...})
java.sql.SQLException: Cannot convert column 1 to integer
java.lang.NumberFormatException: For input string: "c1"
"c1" is the 1st column of table with simply values (1, 2, 3, 4)
However, if I read data with
spark.read.table("db.table")
↑ it does successfully work
Is there any suggestions that could make me resolve this problem?
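No definitive fix here, but a commonly reported cause of this exact error is that Spark's generic JDBC source has no Hive dialect, so it double-quotes column names in the generated query and HiveServer2 returns the literal string "c1" for every row, which then fails the cast to integer. Below is a sketch of the path the question itself reports working, reading through the Hive metastore instead of the hive2 JDBC URL (the ip/port values are placeholders from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", "hdfs://ip:port/path") \
    .config("hive.metastore.uris", "thrift://ip:port") \
    .enableHiveSupport() \
    .getOrCreate()

# read via the Hive metastore rather than spark.read.jdbc on a hive2 URL
df = spark.read.table("db.table")   # or: spark.sql("SELECT * FROM db.table")
df.show()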

Can PySpark write array of strings to a database through a JDBC driver?

I am working with PySpark and I want to insert an array of strings into my database, which has a JDBC driver, but I am getting the following error:
IllegalArgumentException: Can't get JDBC type for array<string>
This error happens when I have an ArrayType(StringType()) column produced by a UDF. And when I try to override the column type:
.option("createTableColumnTypes", "col1 ARRAY, col2 ARRAY, col3 ARRAY, col4 ARRAY")
I get:
DataType array is not supported.(line 1, pos 18)
This makes me wonder whether the problem is within Spark 3.1.2, where there is no mapping for array and I have to convert it into a string, or whether it is coming from the driver that I am using.
For reference, I am using CrateDB as the database. Here is its driver: crate.io/docs/jdbc/en/latest
Switching to the PostgreSQL JDBC driver with CrateDB, instead of crate-jdbc, could probably solve your issue.
Sample PySpark program tested with CrateDB 4.6.1 and postgresql 42.2.23:
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()  # create a session if one is not already running
df = spark.createDataFrame([
    Row(a=[1, 2]),
    Row(a=[3, 4])
])
df.show()
df.write \
.format("jdbc") \
.option("url", "jdbc:postgresql://<url-to-server>:5432/?sslmode=require") \
.option("driver", "org.postgresql.Driver") \
.option("dbtable", "<tableName>") \
.option("user", "<username>") \
.option("password", "<password>") \
.save()
Could you maybe try adding the data type for the array, i.e. ARRAY(TEXT)?
.option("createTableColumnTypes", "col1 ARRAY(TEXT), col2 ARRAY(TEXT), col3 ARRAY(TEXT), col4 ARRAY(TEXT)")
SELECT ['Hello']::ARRAY;
--> SQLParseException[line 1:25: no viable alternative at input 'SELECT ['Hello']::ARRAY limit']
SELECT ['Hello']::ARRAY(TEXT);
--> SELECT OK, 1 record returned (0.002 seconds)
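Building on that sample, here is a sketch of the write with array<string> columns like the ones in the question (untested against CrateDB; the connection details, table name, and column names are placeholders, and the type mapping is left to Spark's PostgreSQL dialect rather than createTableColumnTypes):
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("crate-array-write").getOrCreate()

# array<string> columns, as in the question
df = spark.createDataFrame([
    Row(col1=["a", "b"], col2=["c"]),
    Row(col1=["d"], col2=["e", "f"])
])

df.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://<url-to-server>:5432/?sslmode=require") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "<tableName>") \
    .option("user", "<username>") \
    .option("password", "<password>") \
    .mode("append") \
    .save()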

not null not working in Pyspark when using impala jdbc driver

I am new to PySpark. I am using the Impala JDBC driver ImpalaJDBC41.jar. In my pyspark code, I use the below:
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:impala://<instance>:21051") \
.option("query", "select dst_val,node_name,trunc(starttime,'SS') as starttime from def.tbl_dst where node_name is not null and trunc(starttime,'HH') >= trunc(hours_add(now(),-1),'HH') and trunc(starttime,'HH') < trunc(now(),'HH')") \
.option("user", "") \
.option("password", "") \
.load()
But the above does not work: the "node_name is not null" condition is not applied, and trunc(starttime,'SS') is not working either. Any help would be appreciated.
sample input data :
dst_val,node_name,starttime
BCD098,,2021-03-26 15:42:06.890000000
BCD043,HKR_NODEF,2021-03-26 20:31:09
BCD038,BCF_NODEK,2021-03-26 21:29:10
Expected output :
dst_val,node_name,starttime
BCD043,HKR_NODEF,2021-03-26 20:31:09
BCD038,BCF_NODEK,2021-03-26 21:29:10
For debugging, I am trying df.show(), but it is of no use: it is still showing the record with null. The data type of node_name is STRING.
Can you please try this?
select dst_val,node_name,cast( from_timestamp(starttime,'SSS') as bigint) as starttime from def.tbl_dst where (node_name is not null and node_name<>'' ) and trunc(starttime,'HH') >= trunc(hours_add(now(),-1),'HH') and trunc(starttime,'HH') < trunc(now(),'HH')
I think node_name has an empty string in it, and the above SQL (I added and node_name<>'') will take care of it.
Now, if you have some non-printable character, then we may have to check accordingly.
EDIT: Since not null is working in Impala itself, I think this may be a Spark issue.
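If the predicate keeps being lost in the pushed-down query, one option is to filter on the Spark side instead. A sketch, assuming the same SparkSession and connection details as in the question; the trim() is an assumption in case node_name contains whitespace rather than an empty string:
from pyspark.sql import functions as F

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:impala://<instance>:21051") \
    .option("query", "select dst_val, node_name, starttime from def.tbl_dst") \
    .option("user", "") \
    .option("password", "") \
    .load()

# drop null and blank node_name rows in Spark rather than in the pushed-down SQL
filtered = df.filter(F.col("node_name").isNotNull() & (F.trim(F.col("node_name")) != ""))
filtered.show()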

Pyspark - How can I remove the leading and trailing whitespace in my dataframe?

I have a Spark dataframe with 10 columns that I am writing to a table in HDFS. I am having issues with leading and trailing whitespace in the columns (all fields and all rows).
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Networks').getOrCreate()
dataset = spark.read.csv('Networks_arin_db_2-20-2019_parsed.csv', header=True, inferSchema=True)
#dataset.show(5)
I use the following options that I have found searching around:
dataset.write \
.option("parserLib","univocity") \
.option("ignoreLeadingWhiteSpace","false") \
.option("ignoreTrailingWhiteSpace","false") \
.mode("append") \
.option("path", "/user/hive/warehouse/analytics.db/arin_network") \
.saveAsTable("analytics.arin_network")
But I am still getting whitespace in my tables in HDFS.
Most of the examples I can find are in Scala. Is there a way I can successfully accomplish this using pyspark? My version of Spark is 2.2.0.
When I query with 5 spaces after the first quote, I get a result.
I wanted to keep it pyspark, so I went back to the Python code and added a line that removes all trailing and leading whitespace (applymap here is the pandas method):
df_out = df_out.applymap(lambda x: x.strip() if isinstance(x, str) else x)
This took care of the problem and I was able to import into an HDFS table with no issues.
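For a pure PySpark alternative (a sketch, assuming the dataset read above and that only the string columns need cleaning), pyspark.sql.functions.trim can be applied to every string column before the write:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# trim leading/trailing whitespace on every string column
for field in dataset.schema.fields:
    if isinstance(field.dataType, StringType):
        dataset = dataset.withColumn(field.name, F.trim(F.col(field.name)))

dataset.write \
    .mode("append") \
    .option("path", "/user/hive/warehouse/analytics.db/arin_network") \
    .saveAsTable("analytics.arin_network")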

PySpark: How to read a CSV into a Dataframe and manipulate it

I'm quite new to pyspark and am trying to use it to process a large dataset which is saved as a csv file.
I'd like to read the CSV file into a Spark dataframe, drop some columns, and add new columns.
How should I do that?
I am having trouble getting this data into a dataframe. This is a stripped down version of what I have so far:
def make_dataframe(data_portion, schema, sql):
    fields = data_portion.split(",")
    return sql.createDataFrame([(fields[0], fields[1])], schema=schema)

if __name__ == "__main__":
    sc = SparkContext(appName="Test")
    sql = SQLContext(sc)
    ...
    big_frame = data.flatMap(lambda line: make_dataframe(line, schema, sql)) \
        .reduce(lambda a, b: a.union(b))
    big_frame.write \
        .format("com.databricks.spark.redshift") \
        .option("url", "jdbc:redshift://<...>") \
        .option("dbtable", "my_table_copy") \
        .option("tempdir", "s3n://path/for/temp/data") \
        .mode("append") \
        .save()
    sc.stop()
This produces an error TypeError: 'JavaPackage' object is not callable at the reduce step.
Is it possible to do this? The idea with reducing to a dataframe is to be able to write the resulting data to a database (Redshift, using the spark-redshift package).
I have also tried using unionAll() and map() with partial(), but can't get it to work.
I am running this on Amazon's EMR, with spark-redshift_2.10:2.0.0, and Amazon's JDBC driver RedshiftJDBC41-1.1.17.1017.jar.
Update - also answering your question in the comments:
Read data from CSV into a dataframe:
It seems that you are only trying to read a CSV file into a spark dataframe.
If so, my answer here covers this: https://stackoverflow.com/a/37640154/5088142
The following code should read a CSV into a spark dataframe:
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()
sql = SQLContext(sc)
df = (sql.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load("/path/to_csv.csv"))
# these lines are equivalent in Spark 2.0 - using SparkSession
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
spark.read.format("csv").option("header", "true").load("/path/to_csv.csv")
spark.read.option("header", "true").csv("/path/to_csv.csv")
drop column
You can drop a column using drop(col):
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
drop(col)
Returns a new DataFrame that drops the specified column.
Parameters: col – a string name of the column to drop, or a Column to drop.
>>> df.drop('age').collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.drop(df.age).collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
[Row(age=5, height=85, name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df2.name).collect()
[Row(age=5, name=u'Bob', height=85)]
add column
You can use withColumn:
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
withColumn(colName, col)
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
Parameters:
colName – string, name of the new column.
col – a Column expression for the new column.
>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
Note: spark has a lot of other functions which can be used (e.g. you can use "select" instead of "drop")
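Putting the pieces together, a short sketch of the flow the question asks for - read the CSV, drop a column, add a column (Spark 2.x API; the file path and the unwanted_col/some_numeric_col column names are placeholders, not from the question):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-example").getOrCreate()

# read the CSV into a dataframe
df = spark.read.option("header", "true").csv("/path/to_csv.csv")

# drop a column and add a new one (placeholder column names)
df = df.drop("unwanted_col") \
    .withColumn("doubled", F.col("some_numeric_col").cast("double") * 2)

df.show()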
