I'm trying to use Spark SQL DataFrames to read some data in and apply a series of text clean-up functions to each row.
import langid
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
from pyspark.sql import HiveContext
hsC = HiveContext(sc)
df = hsC.sql("select * from sometable")
def check_lang(data_str):
    language = langid.classify(data_str)
    # only english
    record = ''
    if language[0] == 'en':
        # probability of correctly id'ing the language greater than 90%
        if language[1] > 0.9:
            record = data_str
    return record
check_lang_udf = udf(lambda x: check_lang(x), StringType())
clean_df = df.select("Field1", check_lang_udf("TextField"))
However, when I attempt to run this I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.select.
: java.lang.AssertionError: assertion failed: Unable to evaluate PythonUDF. Missing input attributes
I've spent a good deal of time trying to gather more information on this, but I can't find anything.
As a side note, I know the code below works, but I'd like to stick with DataFrames.
removeNonEn = data.map(lambda record: (record[0], check_lang(record[1])))
I haven't tried this code, but the API docs suggest something like this should work:
hsC.registerFunction("check_lang", check_lang)
clean_df = df.selectExpr("Field1", "check_lang('TextField')")
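One untested caveat about the snippet above: selectExpr parses SQL expressions, so the quoted 'TextField' would reach the UDF as a literal string rather than as the column. Presumably the call needs to reference the column unquoted, something along these lines (the TextField_en alias is only illustrative):
# Sketch only: same HiveContext, table, and UDF as above; not verified against the original error
hsC.registerFunction("check_lang", check_lang, StringType())
clean_df = df.selectExpr("Field1", "check_lang(TextField) as TextField_en")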
Related
I was trying to handle null values with Imputer, but I have a problem.
I have some null values, and to fill them I used Imputer. Then I wanted to check the new values produced by the Imputer, so I used the filter function and SQL to inspect them, but both methods return 0 rows.
My question is: why can't I see these rows?
from pyspark.ml.feature import Imputer
imputer = Imputer()
imputer.setInputCols(["Life_expectancy","Adult_Mortality","Hepatitis_B","GDP","BMI","Polio"]) \
.setOutputCols(["Life_expectancy_Mean","Adult_Mortality_Mean","Hepatitis_B_Mean","GDP_Mean","BMI_Mean","Polio_Mean"])
model = imputer.fit(df)
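# Note (editor's assumption, not shown in the question): df2 used below is never defined;
# presumably it is the imputed output of the fitted model, e.g. something like:
df2 = model.transform(df).toPandas()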
from pyspark.sql import SQLContext
sqlContext = SQLContext(spark)
spark_df = spark.createDataFrame(df2)
spark_df.createTempView("table")
spark.sql("""
select * from table where Country = "Bahamas"
order by "Bahamas"
""").toPandas().head() # pyspark saying 0 raw 24 columns when ı use pyspark on spark
spark_df.filter(col("Country") == "Bahamas").toPandas().head(50) # pyspark saying 0 raw 28 columns when ı use filter
[![code screenshots](https://i.stack.imgur.com/Bs1AK.png)](https://i.stack.imgur.com/jYqQh.png)
I tried using both a pandas DataFrame and a Spark DataFrame, but it didn't work. I couldn't find any solution to this problem.
I am facing a strange issue. I am trying to display values from my JSON object; it works fine with select(), but it doesn't work with selectExpr(), where I get a weird error. Following is my implementation:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("JsonPractice").getOrCreate()
my_json_df = spark.range(1).selectExpr(
"""'{"sample_json":{"sample_json1":["1st_vale","2nd_val"]}}' as my_json_column""")
my_json_df.selectExpr(get_json_object(col("my_json_column"), "$.sample_json.sample_json1[1]")).show(2)
my_select_expr = get_json_object(col('my_json_column'), '$.sample_json.sample_json1')
my_json_df.selectExpr(my_select_expr).show()
I am getting the following error:
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
We don't need to wrap the column in col() while using selectExpr; pass the whole expression as a SQL string instead:
my_select_expr = "get_json_object(my_json_column, '$.sample_json.sample_json1')"
my_json_df.selectExpr(my_select_expr).show(10,False)
#or
my_json_df.selectExpr("get_json_object(my_json_column,'$.sample_json.sample_json1')").show(10,False)
#+-----------------------------------------------------------+
#|get_json_object(my_json_column, $.sample_json.sample_json1)|
#+-----------------------------------------------------------+
#|["1st_vale","2nd_val"] |
#+-----------------------------------------------------------+
UPDATE:
from pyspark.sql.functions import *
my_select_expr=get_json_object(col('my_json_column'),'$.sample_json.sample_json1')
my_json_df.select(my_select_expr).show(10,False)
#+-----------------------------------------------------------+
#|get_json_object(my_json_column, $.sample_json.sample_json1)|
#+-----------------------------------------------------------+
#|["1st_vale","2nd_val"] |
#+-----------------------------------------------------------+
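Equivalently (a variant I haven't run against this exact example), the SQL string can be wrapped in expr() and passed to select():
from pyspark.sql.functions import expr

my_json_df.select(expr("get_json_object(my_json_column, '$.sample_json.sample_json1')")).show(10, False)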
I want to add a column with a default date ('1901-01-01') to an existing dataframe using PySpark.
I used the below code snippet:
from pyspark.sql import functions as F
strRecordStartTime="1970-01-01"
recrodStartTime = hashNonKeyData.withColumn(
    "RECORD_START_DATE_TIME",
    lit(strRecordStartTime).cast("timestamp")
)
It gives me the following error:
org.apache.spark.sql.AnalysisException: cannot resolve '1970-01-01'
Any pointers are appreciated.
Try using a Python native datetime with lit; sorry, I don't have access to a machine right now.
import datetime
recrodStartTime = hashNonKeyData.withColumn('RECORD_START_DATE_TIME', lit(datetime.datetime(1970, 1, 1)))
I have created one spark dataframe:
from pyspark.sql.types import StringType
df1 = spark.createDataFrame(["Ravi","Gaurav","Ketan","Mahesh"], StringType()).toDF("Name")
Now let's add a new column to the existing dataframe:
from pyspark.sql.functions import lit
import dateutil.parser
yourdate = dateutil.parser.parse('1901-01-01')
df2 = df1.withColumn('Age', lit(yourdate))  # addition of new column
df2.show()  # print the dataframe
You can validate your schema by using the below command.
df2.printSchema()
Hope that helps.
from pyspark.sql import functions as F
strRecordStartTime = "1970-01-01"
recrodStartTime = hashNonKeyData.withColumn("RECORD_START_DATE_TIME", F.to_date(F.lit(strRecordStartTime)))
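If the column really needs to be a timestamp rather than a date, as the original cast suggested, a hedged variant (assuming Spark 2.2+, where to_timestamp is available):
recrodStartTime = hashNonKeyData.withColumn(
    "RECORD_START_DATE_TIME",
    F.to_timestamp(F.lit(strRecordStartTime))  # requires Spark 2.2+
)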
I am trying to read a schema stored in a text file in HDFS and use it while creating a DataFrame.
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", TimestampType(), True),
    StructField("col4",
        StructType([
            StructField("col5", StringType(), True),
            StructField("col6",
            .... and so on
jsonDF = spark.read.schema(schema).json('/path/test.json')
Since the schema is too big, I don't want to define it inside the code. Can anyone please suggest the best way to do this?
I tried the below ways, but they don't work.
schema = sc.wholeTextFiles("hdfs://path/sample.schema")
schema = spark.read.text('/path/sample.schema')
I figured out how to do this.
1. Define the schema of json file
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", TimestampType(), True),
    StructField("col4",
        StructType([
            StructField("col5", StringType(), True),
            StructField("col6",
2. Print the JSON representation of the schema
print(schema.json())
3. Copy paste the above output to file sample.schema
4. In the code, recreate the schema as below
import json
from pyspark.sql.types import StructType

schema_file = 'path/sample.schema'
schema_json = spark.read.text(schema_file).first()[0]
schema = StructType.fromJson(json.loads(schema_json))
5. Create a DataFrame using the above schema
jsonDF = spark.read.schema(schema).json('/path/test.json')
6. Insert the data from DF into Hive table
jsonDF.write.mode("append").insertInto("hivetable")
Referred to the article - https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/
I haven't tested it with HDFS, but I assume it is similar to reading from a local file. The idea is to store the file as a dict and then parse it to create the desired schema. I have taken inspiration from here. Currently it lacks support for nullable, and I have not tested it with deeper levels of nested structs.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import json

spark = SparkSession.builder.appName('myPython').getOrCreate()

f = open("/path/schema_file", "r")
dictString = f.read()

derived_schema = StructType([])
jdata = json.loads(dictString)

def get_type(v):
    if v == "StringType":
        return StringType()
    if v == "TimestampType":
        return TimestampType()
    if v == "IntegerType":
        return IntegerType()

def generate_schema(jdata, derived_schema):
    for k, v in sorted(jdata.items()):
        if isinstance(v, str):
            derived_schema.add(StructField(k, get_type(v), True))
        else:
            added_schema = StructType([])
            added_schema = generate_schema(v, added_schema)
            derived_schema.add(StructField(k, added_schema, True))
    return derived_schema

generate_schema(jdata, derived_schema)

from datetime import datetime
data = [("first", "the", datetime.utcnow(), ["as", 1])]
input_df = spark.createDataFrame(data, derived_schema)
input_df.printSchema()
With the file being:
{
  "col1": "StringType",
  "col2": "StringType",
  "col3": "TimestampType",
  "col4": {
    "col5": "StringType",
    "col6": "IntegerType"
  }
}
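If the schema file lives on HDFS rather than on the local filesystem, a sketch of the same read via wholeTextFiles (untested here; the hdfs:///path below is only a placeholder):
# wholeTextFiles yields (path, content) pairs; take the content of the single schema file
dictString = spark.sparkContext.wholeTextFiles("hdfs:///path/schema_file").first()[1]
jdata = json.loads(dictString)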
I'm trying to use Spark 1.4 window functions in PySpark 1.4.1, but I'm getting mostly errors or unexpected results.
Here is a very simple example that I think should work:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
l = [(1,101),(2,202),(3,303),(4,404),(5,505)]
df = sqlContext.createDataFrame(l,["a","b"])
wSpec = Window.orderBy(df.a).rowsBetween(-1,1)
df.select(df.a, func.rank().over(wSpec).alias("rank"))
==> Failure org.apache.spark.sql.AnalysisException: Window function rank does not take a frame specification.
df.select(df.a, func.lag(df.b,1).over(wSpec).alias("prev"), df.b, func.lead(df.b,1).over(wSpec).alias("next"))
===> org.apache.spark.sql.AnalysisException: Window function lag does not take a frame specification.;
wSpec = Window.orderBy(df.a)
df.select(df.a, func.rank().over(wSpec).alias("rank"))
===> org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: One or more arguments are expected.
df.select(df.a, func.lag(df.b,1).over(wSpec).alias("prev"), df.b, func.lead(df.b,1).over(wSpec).alias("next")).collect()
[Row(a=1, prev=None, b=101, next=None), Row(a=2, prev=None, b=202, next=None), Row(a=3, prev=None, b=303, next=None)]
As you can see, if I add the rowsBetween frame specification, neither rank() nor lag()/lead() recognizes it: "Window function does not take a frame specification".
If I omit the rowsBetween frame specification, at least lag()/lead() do not throw exceptions, but they return an unexpected (to me) result: always None. And rank() still doesn't work, with a different exception.
Can anybody help me to get my window functions right?
UPDATE
All right, this is starting to look like a PySpark bug.
I have prepared the same test in pure Spark (Scala, spark-shell):
import sqlContext.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val l: List[Tuple2[Int,Int]] = List((1,101),(2,202),(3,303),(4,404),(5,505))
val rdd = sc.parallelize(l).map(i => Row(i._1,i._2))
val schemaString = "a b"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, IntegerType, true)))
val df = sqlContext.createDataFrame(rdd, schema)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val wSpec = Window.orderBy("a").rowsBetween(-1,1)
df.select(df("a"), rank().over(wSpec).alias("rank"))
==> org.apache.spark.sql.AnalysisException: Window function rank does not take a frame specification.;
df.select(df("a"), lag(df("b"),1).over(wSpec).alias("prev"), df("b"), lead(df("b"),1).over(wSpec).alias("next"))
===> org.apache.spark.sql.AnalysisException: Window function lag does not take a frame specification.;
val wSpec = Window.orderBy("a")
df.select(df("a"), rank().over(wSpec).alias("rank")).collect()
====> res10: Array[org.apache.spark.sql.Row] = Array([1,1], [2,2], [3,3], [4,4], [5,5])
df.select(df("a"), lag(df("b"),1).over(wSpec).alias("prev"), df("b"), lead(df("b"),1).over(wSpec).alias("next"))
====> res12: Array[org.apache.spark.sql.Row] = Array([1,null,101,202], [2,101,202,303], [3,202,303,404], [4,303,404,505], [5,404,505,null])
Even though the rowsBetween frame specification cannot be applied in Scala either, both rank() and lag()/lead() work as I expect when rowsBetween is omitted.
As far as I can tell, there are two different problems. Window frame definition is simply not supported by Hive GenericUDAFRank, GenericUDAFLag, and GenericUDAFLead, so the errors you see are expected behavior.
Regarding the issue with the following PySpark code:
wSpec = Window.orderBy(df.a)
df.select(df.a, func.rank().over(wSpec).alias("rank"))
it looks like it is related to my question https://stackoverflow.com/q/31948194/1560062 and should be addressed by SPARK-9978. For now, you can make it work by changing the window definition to this:
wSpec = Window.partitionBy().orderBy(df.a)
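For reference, a minimal sketch of the corrected PySpark snippet, reusing the df, func, and Window names from the question (an empty partitionBy() puts every row into a single partition, which is fine for this toy example but will not scale):
wSpec = Window.partitionBy().orderBy(df.a)
df.select(df.a,
          func.rank().over(wSpec).alias("rank"),
          func.lag(df.b, 1).over(wSpec).alias("prev"),
          df.b,
          func.lead(df.b, 1).over(wSpec).alias("next")).show()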