Weird Spark SQL behavior while selecting column [duplicate] - apache-spark

A pyspark dataframe with a column name containing a dot (e.g. "id.orig_h") will not allow a groupBy on that column unless it is first renamed with withColumnRenamed. Is there a workaround? Backtick-quoting ("`a.b`") doesn't seem to solve it.

In my pyspark shell, the following snippets are working:
from pyspark.sql.functions import col
myCol = col("`id.orig_h`")
result = df.groupBy(myCol).agg(...)
and
myCol = df["`id.orig_h`"]
result = df.groupBy(myCol).agg(...)
I hope it helps.

Related

Pyspark sql count returns different number of rows than pure sql

I've started using pyspark in one of my projects. I was testing different commands to explore functionalities of the library and I found something that I don't understand.
Take this code:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.dataframe import DataFrame
sc = SparkContext()
hc = HiveContext(sc)
hc.sql("use test_schema")
hc.table("diamonds").count()
the last count() operation returns 53941 records. If I instead run a select count(*) from diamonds in Hive, I get 53940.
Is that pyspark count including the header?
I've tried to look into:
df = hc.sql("select * from diamonds").collect()
df[0]
df[1]
to see if header was included:
df[0] --> Row(carat=None, cut='cut', color='color', clarity='clarity', depth=None, table=None, price=None, x=None, y=None, z=None)
df[1] --> Row(carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55, price=326, x=3.95, y=3.98, z=2.43)
The 0th element doesn't look like the header.
Anyone has an explanation for this?
Thanks!
Ale
Hive can give incorrect counts when stale statistics are used to speed up calculations. To see if this is the problem, in Hive try:
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM diamonds;
Alternatively, refresh the statistics. If your table is not partitioned:
ANALYZE TABLE diamonds COMPUTE STATISTICS;
SELECT COUNT(*) FROM diamonds;
If it is partitioned:
ANALYZE TABLE diamonds PARTITION(partition_column) COMPUTE STATISTICS;
Also take another look at your first row (df[0] in your question). It does look like an improperly formatted header row: the string columns hold the literal column names, while the numeric columns are all None.
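That pattern is exactly what lenient numeric coercion produces when a header line is read as data: string fields keep the header text, numeric fields fail to parse and become None. A minimal pure-Python sketch of the effect (the field list and coercion function here are illustrative, not Spark's actual CSV reader):

```python
def coerce(value, numeric):
    """Mimic lenient numeric coercion: return None when parsing fails."""
    if not numeric:
        return value
    try:
        return float(value)
    except ValueError:
        return None

header = ["carat", "cut", "color", "clarity", "depth"]
numeric_fields = {"carat", "depth"}

# Running the header line itself through the coercion:
row = {name: coerce(name, name in numeric_fields) for name in header}
print(row)
# {'carat': None, 'cut': 'cut', 'color': 'color', 'clarity': 'clarity', 'depth': None}
```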

User defined function over all rows in window

I have a set of timestamped location data with a set of string feature ids attached to each location. I'd like to use a Window in Spark to pull together an array of all of these feature id strings across the previous N and next N rows, à la:
import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func
windowSpec = Window \
    .partitionBy(df['userid']) \
    .orderBy(df['timestamp']) \
    .rowsBetween(-50, 50)
dataFrame = sqlContext.table("locations")
featureIds = featuresCollector(dataFrame['featureId']).over(windowSpec)
dataFrame.select(
    dataFrame['product'],
    dataFrame['category'],
    dataFrame['revenue'],
    featureIds.alias("allFeatureIds"))
Is this possible with Spark and if so, how do I write a function like featuresCollector that can collect all the feature ids in the window?
Spark UDFs cannot be used for aggregations. Spark provides a number of tools (UserDefinedAggregateFunctions, Aggregators, AggregateExpressions) which can be used for custom aggregations, and some of these can be used with windowing, but none of them can be defined in Python.
If all you want is to collect records, collect_list should do the trick. Keep in mind that it is a very expensive operation.
from pyspark.sql.functions import collect_list
featureIds = collect_list('featureId').over(windowSpec)
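Conceptually, rowsBetween(-N, N) gives each row the values from the N rows before it through the N rows after it, within its partition's ordering, and collect_list gathers those values into an array. A plain-Python emulation of the result, shown with N=1 for brevity (illustrative only; Spark computes this per partition and distributedly):

```python
def collect_over_window(values, n):
    """For each index i, collect values[max(0, i-n) : i+n+1], clipped at the
    edges, mimicking collect_list(col).over(Window.rowsBetween(-n, n))."""
    return [values[max(0, i - n):i + n + 1] for i in range(len(values))]

feature_ids = ["a", "b", "c", "d"]
print(collect_over_window(feature_ids, 1))
# [['a', 'b'], ['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd']]
```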

Timestamp parsing in pyspark

df1:
Timestamp:
1995-08-01T00:00:01.000+0000
Is there a way to extract the day of the month from the timestamp column of the data frame using pyspark? I am not able to provide code; I am new to Spark and do not have a clue how to proceed.
You can parse this timestamp using unix_timestamp:
from pyspark.sql import functions as F
format = "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
df2 = df1.withColumn('Timestamp2', F.unix_timestamp('Timestamp', format).cast('timestamp'))
Then, you can use dayofmonth in the new Timestamp column:
df2.select(F.dayofmonth('Timestamp2'))
More details about these functions can be found in the pyspark functions documentation.
Code:
from pyspark.sql.functions import dayofmonth
df1.select(dayofmonth('Timestamp').alias('day'))
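As a sanity check outside Spark, the same timestamp string can be parsed with plain Python's datetime. Note the pattern syntax differs: Python uses %-directives, while Spark's unix_timestamp takes Java SimpleDateFormat patterns like the one above.

```python
from datetime import datetime

# %z accepts the "+0000" offset; %f covers the ".000" fractional-second part.
ts = datetime.strptime("1995-08-01T00:00:01.000+0000", "%Y-%m-%dT%H:%M:%S.%f%z")
print(ts.day)  # -> 1
```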

Apache SPARK with SQLContext:: IndexError

I am trying to execute a basic example provided in the Inferring the Schema Using Reflection section of the Apache Spark documentation.
I'm doing this on a Cloudera Quickstart VM (CDH5).
The example I'm trying to execute is as below:
# sc is an existing SparkContext.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# Load a text file and convert each line to a Row.
lines = sc.textFile("/user/cloudera/analytics/book6_sample.csv")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
# Infer the schema, and register the DataFrame as a table.
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")
# SQL can be run over DataFrames that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
# The results of SQL queries are RDDs and support all the normal RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
print(teenName)
I ran the code exactly as shown as above but always getting the error "IndexError: list index out of range" when I execute the last command(the for loop).
The input file book6_sample is available at
book6_sample.csv.
Please suggest pointers on where I'm going wrong.
Thanks in advance.
Regards,
Sri
Your file has one empty line at the end, which is causing this error: splitting the empty line on "," yields a single-element list, so p[1] raises IndexError. Open your file in a text editor and remove that line; it should then work.
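Alternatively, blank lines can be filtered out before the split, so the job tolerates a trailing empty line. A pure-Python sketch of the same logic (in the Spark code, the same predicate would go into an RDD filter before the map; the sample data is made up):

```python
lines = ["Alice,30", "Bob,19", ""]  # the last entry is the empty trailing line

# Without the filter, "".split(",") gives [''] and p[1] raises IndexError.
parts = [l.split(",") for l in lines if l.strip()]
people = [{"name": p[0], "age": int(p[1])} for p in parts]
print(people)
# -> [{'name': 'Alice', 'age': 30}, {'name': 'Bob', 'age': 19}]
```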

pyspark access column of dataframe with a dot '.'

