I am trying to get only those rows where colADD contains a non-alphanumeric character.
Code:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Test") \
.getOrCreate()
data = spark.read.csv("Customers");
data.registerTempTable("data");
spark.sql("SELECT colADD from data WHERE colADD REGEXP '^[A-Za-z0-9]+$'; ");
Error:
pyspark.sql.utils.ParseException:
extraneous input ';' expecting <EOF> (line 1, pos 56)

== SQL ==
SELECT CNME from data WHERE CNME REGEXP '^[A-Za-z0-9]+$';
Please help, am I missing something?
In Spark (Scala) I used this:
spark.sql("SELECT col2 from test WHERE col2 REGEXP '^[A-Za-z0-9]*\\-' ").show
I have not used PySpark, but how about just removing the ;? It does not seem to be needed.
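In PySpark the same fix applies. A minimal sketch with the semicolon removed (header handling is an assumption, and since the goal is rows containing a non-alphanumeric character, the regex is negated here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test").getOrCreate()

# Read the CSV and register it as a temp view (createOrReplaceTempView replaces
# the deprecated registerTempTable).
data = spark.read.csv("Customers", header=True)
data.createOrReplaceTempView("data")

# No trailing ';' inside the SQL string. '^[A-Za-z0-9]+$' matches purely
# alphanumeric values, so it is negated to keep the rows with at least one
# non-alphanumeric character.
result = spark.sql("SELECT colADD FROM data WHERE NOT (colADD REGEXP '^[A-Za-z0-9]+$')")
result.show()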
The problem occurs when using spark.read.jdbc(), as shown below:
spark = SparkSession.builder \
.[some options...]\
.config("spark.sql.warehouse.dir", "hdfs://ip:port/path") \
.config("hive.metastore.uris", "thrift://ip:port") \
.enableHiveSupport() \
.getOrCreate()
spark.read.jdbc(url="jdbc:hive2://ip:port", table="db.table", properties={...})
java.sql.SQLException: Cannot convert column 1 to integer
java.lang.NumberFormatException: For input string: "c1"
"c1" is the 1st column of table with simply values (1, 2, 3, 4)
However, if I read the data with
spark.read.table("db.table")
it works successfully.
Are there any suggestions that could help me resolve this problem?
I am working with PySpark and I want to insert an array of strings into my database through its JDBC driver, but I am getting the following error:
IllegalArgumentException: Can't get JDBC type for array<string>
This error happens when I have an ArrayType(StringType()) column produced by a UDF. When I try to override the column type with:
.option("createTableColumnTypes", "col1 ARRAY, col2 ARRAY, col3 ARRAY, col4 ARRAY")
I get:
DataType array is not supported.(line 1, pos 18)
This makes me wonder: is the problem within Spark 3.1.2, where there is no mapping for arrays and I have to convert them into strings, or is it coming from the driver that I am using?
For reference, I am using CrateDB as the database. Here is its driver: crate.io/docs/jdbc/en/latest
Switching to the PostgreSQL JDBC driver with CrateDB, instead of crate-jdbc, could probably solve your issue.
Sample PySpark program, tested with CrateDB 4.6.1 and PostgreSQL JDBC 42.2.23:
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(a=[1, 2]),
    Row(a=[3, 4])
])
df.show()
df.write \
.format("jdbc") \
.option("url", "jdbc:postgresql://<url-to-server>:5432/?sslmode=require") \
.option("driver", "org.postgresql.Driver") \
.option("dbtable", "<tableName>") \
.option("user", "<username>") \
.option("password", "<password>") \
.save()
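For context, CrateDB is compatible with the PostgreSQL wire protocol, which is why the stock org.postgresql.Driver can connect to it directly here.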
Could you maybe try adding the datatype for the array, i.e. ARRAY(TEXT)?
.option("createTableColumnTypes", "col1 ARRAY(TEXT), col2 ARRAY(TEXT), col3 ARRAY(TEXT), col4 ARRAY(TEXT)")
SELECT ['Hello']::ARRAY;
--> SQLParseException[line 1:25: no viable alternative at input 'SELECT ['Hello']::ARRAY limit']
SELECT ['Hello']::ARRAY(TEXT);
--> SELECT OK, 1 record returned (0.002 seconds)
I am new to PySpark. I am using the Impala JDBC driver ImpalaJDBC41.jar. In my PySpark code, I use the below:
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:impala://<instance>:21051") \
.option("query", "select dst_val,node_name,trunc(starttime,'SS') as starttime from def.tbl_dst where node_name is not null and trunc(starttime,'HH') >= trunc(hours_add(now(),-1),'HH') and trunc(starttime,'HH') < trunc(now(),'HH')") \
.option("user", "") \
.option("password", "") \
.load()
But the above does not work: the "node_name is not null" filter is not applied, and trunc(starttime,'SS') is not working either. Any help would be appreciated.
Sample input data:
dst_val,node_name,starttime
BCD098,,2021-03-26 15:42:06.890000000
BCD043,HKR_NODEF,2021-03-26 20:31:09
BCD038,BCF_NODEK,2021-03-26 21:29:10
Expected output:
dst_val,node_name,starttime
BCD043,HKR_NODEF,2021-03-26 20:31:09
BCD038,BCF_NODEK,2021-03-26 21:29:10
For debugging, I am trying to print with df.show(), but it is of no use: it still shows the record with the null node_name. The datatype of node_name is "STRING".
Can you please try this?
select dst_val,node_name,cast(from_timestamp(starttime,'SSS') as bigint) as starttime from def.tbl_dst where (node_name is not null and node_name<>'') and trunc(starttime,'HH') >= trunc(hours_add(now(),-1),'HH') and trunc(starttime,'HH') < trunc(now(),'HH')
I think node_name has empty values in it, and the SQL above (I added and node_name<>'') will take care of them.
If there are also some non-printable characters, then we may have to handle those separately.
EDIT: Since the not null filter works when run directly in Impala, I think this may be a Spark issue.
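If the predicate really is getting lost between Spark and the JDBC source, one workaround is to re-apply the filter on the Spark side after the load. A minimal sketch, assuming the df loaded as in the question:

from pyspark.sql import functions as F

# Re-apply the null/empty-string filter on the Spark side.
df_clean = df.filter(F.col("node_name").isNotNull() & (F.trim(F.col("node_name")) != ""))
df_clean.show(truncate=False)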
I have a Spark dataframe with 10 columns that I am writing to a table in HDFS. I am having issues with leading and trailing whitespace in the columns (all fields and all rows).
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Networks').getOrCreate()
dataset = spark.read.csv('Networks_arin_db_2-20-2019_parsed.csv', header=True, inferSchema=True)
#dataset.show(5)
I use the following options that I have found searching around:
dataset.write \
.option("parserLib","univocity") \
.option("ignoreLeadingWhiteSpace","false") \
.option("ignoreTrailingWhiteSpace","false") \
.mode("append") \
.option("path", "/user/hive/warehouse/analytics.db/arin_network") \
.saveAsTable("analytics.arin_network")
But I am still getting whitespace in my tables in HDFS.
Most of the examples I can find are in Scala. Is there a way I can accomplish this using PySpark? My version of Spark is 2.2.0.
When I query with 5 spaces after the first quote, I get a result.
I wanted to keep it in Python, so I went back to the Python code and added a line that removes all leading and trailing whitespace:
df_out = df_out.applymap(lambda x: x.strip() if isinstance(x, str) else x)
This took care of the problem and I was able to import into an HDFS table with no issues.
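For a PySpark-only route, a minimal sketch that trims every string column with pyspark.sql.functions.trim before writing, assuming the dataset and target table from the question:

from pyspark.sql import functions as F

# Trim leading and trailing whitespace on every string column; leave other types untouched.
trimmed = dataset.select([
    F.trim(F.col(name)).alias(name) if dtype == "string" else F.col(name)
    for name, dtype in dataset.dtypes
])

trimmed.write \
    .mode("append") \
    .option("path", "/user/hive/warehouse/analytics.db/arin_network") \
    .saveAsTable("analytics.arin_network")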
I'm quite new to PySpark and am trying to use it to process a large dataset which is saved as a CSV file.
I'd like to read the CSV file into a Spark dataframe, drop some columns, and add new columns.
How should I do that?
I am having trouble getting this data into a dataframe. This is a stripped down version of what I have so far:
from pyspark import SparkContext
from pyspark.sql import SQLContext

def make_dataframe(data_portion, schema, sql):
    fields = data_portion.split(",")
    return sql.createDataFrame([(fields[0], fields[1])], schema=schema)

if __name__ == "__main__":
    sc = SparkContext(appName="Test")
    sql = SQLContext(sc)
    ...
    big_frame = data.flatMap(lambda line: make_dataframe(line, schema, sql)) \
        .reduce(lambda a, b: a.union(b))

    big_frame.write \
        .format("com.databricks.spark.redshift") \
        .option("url", "jdbc:redshift://<...>") \
        .option("dbtable", "my_table_copy") \
        .option("tempdir", "s3n://path/for/temp/data") \
        .mode("append") \
        .save()

    sc.stop()
This produces an error TypeError: 'JavaPackage' object is not callable at the reduce step.
Is it possible to do this? The idea with reducing to a dataframe is to be able to write the resulting data to a database (Redshift, using the spark-redshift package).
I have also tried using unionAll() and map() with partial(), but can't get it to work.
I am running this on Amazon's EMR, with spark-redshift_2.10:2.0.0, and Amazon's JDBC driver RedshiftJDBC41-1.1.17.1017.jar.
Update, also answering your question from the comments:
Read data from CSV into a dataframe:
It seems that you are only trying to read a CSV file into a Spark dataframe.
If so, my answer here covers this: https://stackoverflow.com/a/37640154/5088142
The following code should read the CSV into a Spark dataframe:
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()
sql = SQLContext(sc)
df = (sql.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/path/to_csv.csv"))
These lines are equivalent in Spark 2.0, using SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
spark.read.format("csv").option("header", "true").load("/path/to_csv.csv")
spark.read.option("header", "true").csv("/path/to_csv.csv")
Drop column
You can drop a column using drop(col):
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
drop(col)
Returns a new DataFrame that drops the specified column.
Parameters: col – a string name of the column to drop, or a Column to drop.
>>> df.drop('age').collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.drop(df.age).collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
[Row(age=5, height=85, name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df2.name).collect()
[Row(age=5, name=u'Bob', height=85)]
Add column
You can use withColumn(colName, col):
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
withColumn(colName, col)
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
Parameters:
colName – string, name of the new column.
col – a Column expression for the new column.
>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
Note: Spark has a lot of other functions which can be used (e.g. you can use select instead of drop).
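Putting the pieces together for the original question, a short sketch (the column names unwanted_col and price are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-example").getOrCreate()

# Read the CSV (Spark 2.0+ API).
df = spark.read.option("header", "true").csv("/path/to_csv.csv")

# Drop a column and add a derived column (names are hypothetical).
df = (df
      .drop("unwanted_col")
      .withColumn("price_with_tax", F.col("price") * 1.2))

df.show()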