pyspark dataframe column : Hive column - apache-spark

I have a Hive table defined as follows:
hive> describe stock_quote;
OK
tickerid string
tradeday string
tradetime string
openprice string
highprice string
lowprice string
closeprice string
volume string
The following Spark code reads CSV files and tries to insert the records into the Hive table:
sc = spark.sparkContext
lines = sc.textFile('file:///<File Location>')
rows = lines.map(lambda line : line.split(','))
rows_map = rows.map(lambda row : Row(TickerId = row[0], TradeDay = row[1], TradeTime = row[2], OpenPrice = row[3], HighPrice = row[4], LowPrice = row[5], ClosePrice = row[6], Volume = row[7]))
rows_df = spark.createDataFrame(rows_map)
rows_df.write.mode('append').insertInto('default.stock_quote')
The problem is that when I call show() on the DataFrame, it prints the columns in alphabetical order, like this:
|ClosePrice|HighPrice|LowPrice|OpenPrice|TickerId|TradeDay|TradeTime|Volume|
As a result, the insert puts the value of ClosePrice (the first column in the DataFrame) into the TickerId column (the first column in the Hive table), the value of HighPrice into the TradeDay column, and so on.
I tried calling select() on the DataFrame, but it didn't help.
I also tried passing a list of column names, as follows:
rows_df = spark.createDataFrame(rows_map, ["TickerId", "TradeDay", "TradeTime", "OpenPrice", "HighPrice", "LowPrice", "ClosePrice", "Volume"])
This changes the order of the column names, but the values stay in the same positions, which is even more incorrect.
Any help would really be appreciated.

You can also use saveAsTable instead of insertInto
From the docs:
Unlike insertInto, saveAsTable will use the column names to find the correct column positions

You should use a namedtuple instead of Row, because Row sorts the column names when it is built from keyword arguments. The sorted column names therefore did not match the column order of the default.stock_quote table. See "What is the Scala case class equivalent in PySpark?" for more details.
So you should do:
from collections import namedtuple
table = namedtuple('table', ['TickerId', 'TradeDay', 'TradeTime', 'OpenPrice', 'HighPrice', 'LowPrice', 'ClosePrice', 'Volume'])
rows_map = rows.map(lambda row : table(row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7]))
And as @user6910411 suggested, "a normal tuple would do as well":
rows_map = rows.map(lambda row : (row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7]))
rows_df = spark.createDataFrame(rows_map, ['TickerId', 'TradeDay', 'TradeTime', 'OpenPrice', 'HighPrice', 'LowPrice', 'ClosePrice', 'Volume'])
Now insertInto should work.
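To see why the field order matters, here is a minimal plain-Python sketch that needs no Spark. The helper legacy_row_fields is hypothetical: it only imitates the alphabetical sorting that the answer above attributes to Row with keyword arguments, while the namedtuple keeps its declared order. As far as I know, Spark 3.0 and later no longer sort Row fields on Python 3.6+, so this mainly concerns older versions.

```python
from collections import namedtuple

# Column order of the Hive table, using the names from the question.
HIVE_ORDER = ["TickerId", "TradeDay", "TradeTime", "OpenPrice",
              "HighPrice", "LowPrice", "ClosePrice", "Volume"]


def legacy_row_fields(**kwargs):
    """Hypothetical stand-in: imitate a Row that sorts keyword names alphabetically."""
    return sorted(kwargs)


# Field order as it would look after alphabetical sorting.
sorted_fields = legacy_row_fields(**{name: None for name in HIVE_ORDER})

# A namedtuple keeps exactly the order it was declared with.
Quote = namedtuple("Quote", HIVE_ORDER)
named_fields = list(Quote._fields)

print(sorted_fields)
print(named_fields)
```

The sorted list matches the header printed by show() in the question, which is why a position-based insert lands the values in the wrong Hive columns.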

How did the columns end up sorted in alphabetical order? Is that the order in the CSV file?
Anyway, I would do it in the following steps:
Select the columns from your table.
Rearrange the DataFrame based on the columns from the table.
# pyspark below
list_columns = spark.sql('select * from table').columns # there might be simpler way
dataframe.select(*list_columns)
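One detail the snippet above leaves open is that Hive reports lowercase column names while the DataFrame in the question uses mixed case. The following is a plain-Python sketch of the name matching only; align_columns is a hypothetical helper, and the two lists are copied from the question rather than read from Spark.

```python
def align_columns(df_columns, table_columns):
    """Return the DataFrame's column names arranged in the table's column order.

    Names are matched case-insensitively. A KeyError is raised if a table
    column has no counterpart among the DataFrame columns.
    """
    by_lower = {name.lower(): name for name in df_columns}
    return [by_lower[name.lower()] for name in table_columns]


# Alphabetical order, as printed by show() in the question.
df_columns = ["ClosePrice", "HighPrice", "LowPrice", "OpenPrice",
              "TickerId", "TradeDay", "TradeTime", "Volume"]
# Order reported by "describe stock_quote".
table_columns = ["tickerid", "tradeday", "tradetime", "openprice",
                 "highprice", "lowprice", "closeprice", "volume"]

ordered = align_columns(df_columns, table_columns)
print(ordered)
# With Spark, this list could then be passed on, e.g. rows_df.select(*ordered).
```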

Related

How to make my identity column consecutive on delta table in Azure Databricks?

I am trying to create a delta table with a consecutive identity column. The goal is for our clients to see if there is some data they did not receive from us.
It looks like the generated identity column is not consecutive, which makes the "INCREMENT BY 1" quite misleading.
store_visitor_type_name = ["apple","peach","banana","mango","ananas"]
card_type_name = ["door","desk","light","coach","sink"]
store_visitor_type_desc = ["monday","tuesday","wednesday","thursday","friday"]
colnames = ["column2","column3","column4"]
data_frame = spark.createDataFrame(zip(store_visitor_type_name,card_type_name,store_visitor_type_desc),colnames)
data_frame.createOrReplaceTempView('vw_increment')
data_frame.display()
%sql
CREATE or REPLACE TABLE TEST(
`column1SK` BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1)
,`column2` STRING
,`column3` STRING
,`column4` STRING
,`inserted_timestamp` TIMESTAMP
,`modified_timestamp` TIMESTAMP
)
USING delta
LOCATION '/mnt/Marketing/Sales';
MERGE INTO TEST as target
USING vw_increment as source
ON target.`column2` = source.`column2`
WHEN MATCHED
AND (target.`column3` <> source.`column3`
OR target.`column4` <> source.`column4`)
THEN
UPDATE SET
`column2` = source.`column2`
,`modified_timestamp` = current_timestamp()
WHEN NOT MATCHED THEN
INSERT (
`column2`
,`column3`
,`column4`
,`modified_timestamp`
,`inserted_timestamp`
) VALUES (
source.`column2`
,source.`column3`
,source.`column4`
,current_timestamp()
,current_timestamp()
)
I'm getting the following results. You can see that the values are not sequential. What is also very confusing is that they do not start at 1, although this is explicitly stated in the query.
I can see in the documentation (https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters) :
The automatically assigned values start with start and increment by
step. Assigned values are unique but are not guaranteed to be
contiguous. Both parameters are optional, and the default value is 1.
step cannot be 0.
Is there a workaround to make this identity column consecutive ?
I guess I could have another column and do a ROW_NUMBER operation after the MERGE, but it looks expensive.
You can use PySpark to achieve this instead of the row_number() function.
I read the TEST table as a Spark DataFrame and converted it to a pandas-on-Spark DataFrame. On that DataFrame, reset_index() creates a new index column.
Then I converted it back to a Spark DataFrame and added 1 to the index column values, since the index starts at 0.
df = spark.sql("select * from test")
pdf = df.to_pandas_on_spark()
#to create new index column.
pdf.reset_index(inplace=True)
final_df = pdf.to_spark()
#Since index starts from 0, I have added 1 to it.
final_df.withColumn('index',final_df['index']+1).show()
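Both the ROW_NUMBER idea from the question and the index approach above boil down to the same renumbering step. A minimal plain-Python sketch of that logic follows; renumber is a hypothetical helper and the identity values are made up for illustration. Note that the resulting numbers depend on the set of rows present when the mapping is computed.

```python
def renumber(identity_values, start=1):
    """Map each unique, possibly gappy identity value to a consecutive number."""
    ordered = sorted(identity_values)
    return {value: position for position, value in enumerate(ordered, start=start)}


# Made-up identity values with gaps, similar to what the question describes.
gappy_ids = [2, 4, 7, 8, 12]
mapping = renumber(gappy_ids)
print(mapping)
```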

Spark SQL - Get Column Names of a Hive Table in a String

I'm trying to get the column names of a Hive table in a comma separated String. This is what I'm doing
val colNameDF = spark.sql("show columns in hive_table")
val colNameStr = colNameDF.select("col_name").collect.mkString(", ")
And the output I get is
res0: String = [col_1], [col_2], [col_3]
But what I want is col_1, col_2, col_3. I can remove [ and ] from the String, but I'm curious as to whether we can get the column names without the brackets in the first place.
Edit: The column names in the Hive table don't contain [ ]
Instead of show columns, try the approach below, as it is faster than yours.
val colNameDF = spark.sql("select * from hive_table").limit(0)
Or
val colNameDF = spark.table("hive_table").limit(0)
val colNameStr = colNameDF.columns.mkString(", ")
collect returns an array of Row objects, and each Row is printed as a bracketed list of its values, so you need to extract the string from each Row like this:
val colNameDF = spark.sql("show columns in hive_table")
val colNameStr = colNameDF.select("col_name").collect.map(r=>r.getString(0)).mkString(", ")
Building on @Srinivas' answer above, here is the equivalent Python code. It is very fast:
colNameStr = ", ".join(spark.table("hive_table").limit(0).columns)

Scala How to Find All Unique Values from a Specific Column in a CSV?

I am using Scala to read from a csv file. The file has 3 columns, each separated by a \t character. The first 2 columns are unimportant, and the third column contains a list of comma-separated identifiers stored as strings. Below is a sample of what the input csv looks like:
0002ba73 US 6o7,6on,6qc,6qj,6nw,6ov,6oj,6oi,15me,6pb,6p9
002f50e4 US 6om,6pb,6p8,15m9,6ok,6ov,6qc,6oo,15me
004b5edc US 6oj,6nz,6on,6om,6qc,6ql,6p6,15me
005cc990 US 6pb,6qf,15me,6og,6nx,6qc,6om,6ok
005fe1ea US 15me,6p0,6ql,6ok,6ox,6ol,6o5,6qj
00777555 US 6pb,15me,6nw,6rk,6qc,6ov,6qj,6o0,6oj,6ok,6on,6p6,6nx,15m9
00cbcc7d US 6oj,6qc,6qg,6pb,6ol,6p6,6ov,15me
010254a6 US 6qc,6pb,6nw,6nx,15me,6o0,6ok,6p8
011b905c US 6oj,6nw,6ov,15me,6qc,6ow,6ql,6on,6qi,6qe
011fffa6 US 15me,6ok,6oj,6p6,6pb,6on,6qc,6ov,6oo,6nw,6oc
I want to read in the csv, get rid of the first two columns, and create a List that contains one instance of each unique identifier code found in the third column, so running the code on the above data should return the result List(6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6p8, 15m9, 6ok, 6oo, 6nz, 6om, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc)
I have the following code which returns a List containing every distinct value found anywhere in the csv file:
val in_file = new File("input_file.csv")
val source = scala.io.Source.fromFile(in_file, "utf-8")
val labels = try source.getLines.mkString("\t") finally source.close()
val labelsList: List[String] = labels.split("[,\t]").map(_.trim).toList.distinct
Using the above input, my code returns labelsList with a value of List(0002ba73-e60c-4ffb-9131-c1612b904658, US, 6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 002f50e4-48cc-4b14-bb80-0502068b6161, 6om, 6p8, 15m9, 6ok, 6oo, 004b5edc-c0cc-4ffd-bef3-980bd92b92e6, 6nz, 6ql, 6p6, 005cc990-83dc-4e63-a4b6-58f38241e8fd, 6qf, 6og, 6nx, 005fe1ea-b918-48a3-a495-1f8ac12935ba, 6p0, 6ox, 6ol, 6o5, 00777555-83d4-401e-861b-5892f3aa3e1c, 6rk, 6o0, 00cbcc7d-1b48-4c5c-8141-8fc8f62b7b07, 6qg, 010254a6-2ef0-4a24-aa4d-3cc6656a55de, 011b905c-fbf3-441a-8912-a94cc0fe8a1d, 6ow, 6qi, 6qe, 011fffa6-0b9f-4d88-8ced-ce1cc864984f, 6oc)
How can I get my code to run properly and ignore anything contained within the first 2 columns of the csv?
You can ignore the first two columns and then split the third by the comma.
Finally a toSet will get rid of the duplicate identifiers.
import scala.io.Source
val f = Source.fromFile("input_file.csv")
val lastColumns = f.getLines().map(_.split("\t")(2))
val uniques = lastColumns.flatMap(_.split(",")).toSet
uniques foreach println
Using Scala 2.13 resource management.
util.Using(io.Source.fromFile("input_file.csv")) {
  _.getLines()
    .foldLeft(Array.empty[String]) {
      _ ++ _.split("\t")(2).split(",")
    }.distinct.toList
}
//res0: scala.util.Try[List[String]] =
// Success(List(6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6om, 6p8, 15m9, 6ok, 6oo, 6nz, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc))
The .toList can be dropped if an Array result is acceptable.
Here is what you can do. I am using a sample DataFrame, which you can replace with your own:
val Df = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
val reqCols = Seq(2)
val finalDf = Df.select(reqCols map Df.columns map col: _*)
finalDf.show
Note: the index is 0-based, so pass 2 to get the third column.
If you want the distinct values from the desired column, you can use distinct along with mkString:
val Df = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
val reqCols = Seq(2)
val distinctValues = Df.select(reqCols map Df.columns map col: _*).distinct.collect.mkString(",").filterNot("[]".toSet)
println(distinctValues)
The dates are duplicated in the sample data, and the code above removes the duplicates.
Another method using regex
val data = scala.io.Source.fromFile("source.txt").getLines()
data.toList.flatMap { line =>
  """\S+\s+\S+\s+(\S+)""".r.findAllMatchIn(line).map(x => x.group(1).split(",").toList)
}.flatten.distinct
// res0: List[String] = List(6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6om, 6p8, 15m9, 6ok, 6oo, 6nz, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc)
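For comparison, the same idea (take the third tab-separated column, split it on commas, keep the first occurrence of each identifier) can be sketched in plain Python. The sample below is an abbreviated version of the first three rows from the question, and unique_ids is a hypothetical helper name.

```python
import io

# Abbreviated rows from the question: id, country, comma-separated identifiers.
SAMPLE = (
    "0002ba73\tUS\t6o7,6on,6qc,6qj\n"
    "002f50e4\tUS\t6om,6pb,6qc,6on\n"
    "004b5edc\tUS\t6oj,6nz,6on,6om\n"
)


def unique_ids(lines):
    """Collect identifiers from the third tab-separated column in first-seen order."""
    seen = {}  # dicts preserve insertion order, so this acts as an ordered set
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        third_column = line.split("\t")[2]
        for identifier in third_column.split(","):
            seen.setdefault(identifier.strip(), None)
    return list(seen)


result = unique_ids(io.StringIO(SAMPLE))
print(result)
```

Passing an open file object instead of the io.StringIO wrapper should work the same way, since the function only iterates over lines.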

Splitting Multiple values inside a Pandas Column into Separate Columns

I have a dataframe with a column whose values each contain two different fields together with their names, as follows:
How do I transform it into separate columns?
So far, I have tried the following:
Using df[col].apply(pd.Series): it didn't work, since the data in the column is not in dictionary format.
Splitting the column on the semicolon (";"): this is not a good idea, since the dataframe might contain any number of fields, depending on the response.
EDIT:
Data in plain text format:
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
How about:
df2 = (df["ClusterName"]
.str.replace("Date:", "")
.str.replace("Bucket:", "")
.str.split(";", expand=True))
df2.columns = ["Date", "Bucket"]
EDIT:
Without hardcoding the variable names, here's a quick hack. You can clean it up (and make less silly variable names):
df_temp = df.ClusterName.str.split(";", expand=True)
cols = []
for col in df_temp:
    df_temptemp = df_temp[col].str.split(":", expand=True)
    df_temp[col] = df_temptemp[1]
    cols.append(df_temptemp.iloc[0, 0])
df_temp.columns = cols
Maybe something like this.
Set up the data frame:
import pandas as pd
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
df = pd.DataFrame(data=d)
df
Iterate over the data frame, splitting each value on the semicolon and then on the colon:
ls = []
for index, row in df.iterrows():
    splits = row['ClusterName'].split(';')
    print(splits[0].split(':')[1],splits[1].split(':')[1])
    ls.append([splits[0].split(':')[1],splits[1].split(':')[1]])
df = pd.DataFrame(ls, columns =['Date', 'Bucket'])
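The loop above hardcodes the two field names. A minimal plain-Python sketch that reads the names from the data instead is shown below; parse_pairs is a hypothetical helper, and it assumes every value has the form Key:Value;Key:Value, as in the sample.

```python
def parse_pairs(text, pair_sep=";", key_sep=":"):
    """Split 'Key:Value;Key:Value' into a dict without hardcoding the key names."""
    result = {}
    for pair in text.split(pair_sep):
        # partition splits on the first separator only, so values may contain ':'
        key, _, value = pair.partition(key_sep)
        result[key] = value
    return result


cluster_names = ["Date:20191010;Bucket:All",
                 "Date:20191010;Bucket:some",
                 "Date:20191010;Bucket:All"]
records = [parse_pairs(name) for name in cluster_names]
print(records)
# If pandas is available, pd.DataFrame(records) would give one column per key.
```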

subtract mean from pyspark dataframe

I'm trying to calculate the average for each column in a dataframe and subtract from each element in the column. I've created a function that attempts to do that, but when I try to implement it using a UDF, I get an error: 'float' object has no attribute 'map'. Any ideas on how I can create such a function? Thanks!
def normalize(data):
    average=data.map(lambda x: x[0]).sum()/data.count()
    out=data.map(lambda x: (x-average))
    return out
mapSTD=udf(normalize,IntegerType())
dats = data.withColumn('Normalized', mapSTD('Fare'))
The problem in your example is that a UDF cannot be applied to a whole DataFrame. A UDF is called with the values of a single row, so inside normalize the argument data is a single float, which explains the error. Spark also supports UDAFs (User Defined Aggregate Functions), which work on the whole DataFrame.
To solve your problem, you can use the function below:
from pyspark.sql.functions import mean
def normalize(df, column):
    average = df.agg(mean(df[column]).alias("mean")).collect()[0]["mean"]
    return df.select(df[column] - average)
Use it like this:
normalize(df, "Fare")
Please note that the function above only works on a single column, but it is possible to implement something more generic:
def normalize(df, columns):
    selectExpr = []
    for column in columns:
        average = df.agg(mean(df[column]).alias("mean")).collect()[0]["mean"]
        selectExpr.append(df[column] - average)
    return df.select(selectExpr)
Use it like this:
normalize(df, ["col1", "col2"])
This works, but it runs one aggregation per column, so performance could become an issue with many columns. It is possible to compute all the means with a single aggregate expression instead:
def normalize(df, columns):
    aggExpr = []
    for column in columns:
        aggExpr.append(mean(df[column]).alias(column))
    averages = df.agg(*aggExpr).collect()[0]
    selectExpr = []
    for column in columns:
        selectExpr.append(df[column] - averages[column])
    return df.select(selectExpr)
Adding onto Piotr's answer. If you need to keep the existing dataframe and add normalized columns with aliases, the function can be modified as:
def normalize(df, columns):
    aggExpr = []
    for column in columns:
        aggExpr.append(mean(df[column]).alias(column))
    averages = df.agg(*aggExpr).collect()[0]
    selectExpr = ['*']
    for column in columns:
        selectExpr.append((df[column] - averages[column]).alias('normalized_'+column))
    return df.select(selectExpr)
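The two-phase logic used above (first compute every mean, then subtract it from each value) can be checked on a tiny example without Spark. This is only a plain-Python sketch: center_columns is a hypothetical helper, the rows are made up, and it assumes a non-empty list of rows with numeric values.

```python
def center_columns(rows, columns):
    """Return new rows with a 'normalized_<column>' key holding value minus mean."""
    # Phase 1: one pass over the data accumulates the sum of every column.
    totals = {column: 0.0 for column in columns}
    for row in rows:
        for column in columns:
            totals[column] += row[column]
    means = {column: totals[column] / len(rows) for column in columns}
    # Phase 2: keep the original keys and add the centered values next to them.
    return [
        {**row, **{"normalized_" + c: row[c] - means[c] for c in columns}}
        for row in rows
    ]


# Made-up sample rows; the column means are 20.0 for Fare and 30.0 for Age.
fares = [
    {"Fare": 10.0, "Age": 20.0},
    {"Fare": 20.0, "Age": 40.0},
    {"Fare": 30.0, "Age": 30.0},
]
centered = center_columns(fares, ["Fare", "Age"])
print(centered)
```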

Resources