Using PySpark, when importing data from a data file into an Azure SQL DB table, I am getting the following error. The error itself is self-explanatory, but the data file and target table have about 100 columns, 75 of them string columns, and the error does not specify which column it is on. Question: in PySpark, how can we determine which column the error is on?
Error:
com.microsoft.sqlserver.jdbc.SQLServerException: The given value of type VARCHAR(56) from the data source cannot be converted to type varchar(45) of the specified target column.
Code:
df = spark.read.csv(".../Test/MyFile.csv", header="true", inferSchema="false")
.............
#write to Azure SQL table. Error occurs here
df.write(...)
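For context, a hedged sketch of a typical JDBC write; the server, database, table, and credentials below are placeholders, not values from the question:
# placeholders only -- substitute your own connection details
df.write \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>") \
    .option("dbtable", "dbo.MyTable") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .mode("append") \
    .save()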
You can check the max length of each column before writing to the SQL DB:
from pyspark.sql import functions as F

# Replace every value with its string length, then take the max of each column
lengths = df.select([F.length(c).alias(c) for c in df.columns])
lengths.groupBy().max().show()
Then you can write a UDF/function, or use pandas, to transpose the resulting row into a column.
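For the transpose step, a minimal pandas sketch (assuming the lengths DataFrame computed above): the single row of maxima becomes one row per column, so any column exceeding the target's varchar(45) stands out.
max_lengths = lengths.groupBy().max().toPandas().T
max_lengths.columns = ['max_length']
print(max_lengths.sort_values('max_length', ascending=False))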
I have an RDD dataset of flights and I have to select specific columns from it.
I have to select column numbers 9,4,5,8,17 and then create a SQL DataFrame with the results. The data is an RDD.
I tried the following but I get an error in the map.
q9 = data.map(lambda x: [x[i] for i in [9,4,5,8,17]])
sqlContext.createDataFrame(q9, ['Flight Num', 'DepTime', 'CRSDepTime', 'UniqueCarrier', 'Dest']).show(n=20)
What would you do? thanks!
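If the error comes from indexing into x, the records are likely still raw lines; a hedged sketch that splits them first (assuming comma-delimited data):
cols = [9, 4, 5, 8, 17]
q9 = data.map(lambda line: line.split(',')).map(lambda x: [x[i] for i in cols])
sqlContext.createDataFrame(q9, ['Flight Num', 'DepTime', 'CRSDepTime', 'UniqueCarrier', 'Dest']).show(n=20)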
I am new to Azure Databricks. I am trying to write a DataFrame output to a Delta table that contains a TIMESTAMP column, but strangely the TIMESTAMP pattern changes after writing to the Delta table.
My DataFrame output column holds the value in this format: 2022-05-13 17:52:09.771
But after writing it to the table, the column value is populated as
2022-05-13T17:52:09.771+0000
I am using the function below to generate this DataFrame output:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.{lit, to_timestamp}

val pretsUTCText = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
val tsUTCText: String = pretsUTCText.format(ts)
val tsUTCCol = lit(tsUTCText)
val df = df2.withColumn("tsUTC", to_timestamp(tsUTCCol, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")) // withColumn needs a column name; "tsUTC" is a placeholder
The DataFrame output returns 2022-05-13 17:52:09.771 as the TIMESTAMP pattern, but after writing it to the Delta table I see the same value populated as 2022-05-13T17:52:09.771+0000.
Thanks in advance. I could not find any solution.
I have just found the same behaviour on Databricks as you, and it differs from what the Databricks documentation describes. It seems that after some version, Databricks shows the timezone by default, which is why you see the additional +0000. You can use the date_format function when you populate the data if you don't want it. Also, you don't need 'Z' in the format text, as it stands for the timezone.
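For illustration, a minimal PySpark sketch of date_format (the question's code is Scala, where the same function exists; the DataFrame df and column name ts are assumptions):
from pyspark.sql import functions as F

# Render the timestamp as plain text without the timezone suffix
df_text = df.withColumn("ts_text", F.date_format("ts", "yyyy-MM-dd HH:mm:ss.SSS"))
df_text.show(truncate=False)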
I was trying to execute the Spark SQL code below in Databricks, which does an INSERT OVERWRITE into another table that has the same number of columns with the same names.
INSERT OVERWRITE TABLE cs_br_prov
SELECT NAMED_STRUCT('IND_ID',stg.IND_ID,'CUST_NBR',stg.CUST_NBR,'SRC_ID',stg.SRC_ID,
'SRC_SYS_CD',stg.SRC_SYS_CD,'OUTBOUND_ID',stg.OUTBOUND_ID,'OPP_ID',stg.OPP_ID,
'CAMPAIGN_CD',stg.CAMPAIGN_CD,'TREAT_KEY',stg.TREAT_KEY,'PROV_KEY',stg.PROV_KEY,
'INSERTDATE',stg.INSERTDATE,'UPDATEDATE',stg.UPDATEDATE,'CONTACT_KEY',stg.CONTACT_KEY) AS key,
stg.MEM_KEY,
stg.INDV_ID,
stg.MBR_ID,
stg.OPP_DT,
stg.SEG_ID,
stg.MODA,
stg.E_KEY,
stg.TREAT_RUNDATETIME
FROM cs_br_prov_stg stg
The error I am getting is:
AnalysisException: Cannot write to 'delta.`path`', not enough data columns;
target table has 20 column(s) but the inserted data has 9 column(s)
The reason is, as the exception says, that the SELECT subquery creates a logical plan with just 9 columns, not the 20 the cs_br_prov table expects: NAMED_STRUCT collapses the 12 struct fields into the single key column, leaving key plus 8 plain columns.
Unless the table uses generated columns, the exception is perfectly expected.
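A quick way to confirm the mismatch before the INSERT (a hedged sketch; query is assumed to hold the SELECT text from above):
# Compare how many columns the query produces with how many the target expects
print(len(spark.sql(query).columns))           # 9 -- the struct counts as one column
print(len(spark.table("cs_br_prov").columns))  # 20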
I pulled some stock data from a financial API and created a DataFrame with it. Columns were 'date', 'data1', 'data2', 'data3'. Then, I converted that DataFrame into a CSV with 'date' column as index:
df.to_csv('data.csv', index_label='date')
In a second script, I read that CSV and attempted to slice the resulting DataFrame between two dates:
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
df = df['2020-03-28':'2020-04-28']
When I attempt to do this, I get the following TypeError:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [2020-03-28] of <class 'str'>
So clearly, the problem is that I'm trying to use a str to slice a datetime object. But here's the confusing part! If, in the first step, I save the DataFrame to a CSV and DO NOT set 'date' as the index:
df.to_csv('data.csv')
In my second script, I no longer get the TypeError:
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
df = df['2020-03-28':'2020-04-28']
Now it works just fine. The only problem is I have the default Pandas index column to deal with.
Why do I get a TypeError when I set the 'date' column as index in my CSV...but I do NOT get a TypeError when I don't set any index in the CSV?
It seems that in your "first" instance of df, the date column was an ordinary column (not the index) and this DataFrame had a default index of consecutive integers (its name is not important).
In this situation, running df.to_csv('data.csv', index_label='date')
causes the output file to contain:
date,date,data1,data2,data3
0,2020-03-27,10.5,12.3,13.2
1,2020-03-28,10.6,12.9,14.7
i.e.:
the index column (integers) was given the name date, which you passed in the index_label parameter,
the next column, which in df was named date, was also given the name date.
Then if you read it by running
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date'), then:
the first date column (integers) is read as date and set as the index,
the second date column (dates) is read as date.1 and is an ordinary column.
Now when you run df['2020-03-28':'2020-04-28'], you attempt to find rows with the index in the given range. But the index is of Int64Index type (check this in your installation), hence the exception you saw is thrown.
Things look different when you run df.to_csv('data.csv').
Now this file contains:
,date,data1,data2,data3
0,2020-03-27,10.5,12.3,13.2
1,2020-03-28,10.6,12.9,14.7
i.e.:
the first column (which in df was the index) has no name and holds int values,
the only column named date is the second column and contains dates.
Now when you read it, the result is:
date (converted to a DatetimeIndex) is the index,
the original index column got the name Unnamed: 0, no surprise, since in the source file it had no name.
And now, when you run df['2020-03-28':'2020-04-28'] everything is OK.
The thing to learn for the future:
Running df.to_csv('data.csv', index_label='date') does not set this column as the index. It only saves the current index column under the given name, without checking whether any other column already has the same name.
The result is that two columns can end up with the same name.
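A minimal sketch of that lesson applied (assuming the original df with a date column): set date as the real index before saving, so the round trip yields a DatetimeIndex.
import pandas as pd

df = df.set_index('date')  # make 'date' the actual index instead of a duplicate column
df.to_csv('data.csv')      # the index is now written under its own name
df2 = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
print(df2.loc['2020-03-28':'2020-04-28'])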
I'm trying to load multiple csv files into a single dataframe df while:
adding column names
adding and populating a new column (Station)
excluding one of the columns (QD)
All of this works fine until I attempt to exclude a column with usecols, which throws the error Too many columns specified: expected 5 and found 4.
Is it possible to create a new column and pass usecols at the same time?
The reason I'm creating and populating a new 'Station' column during read_csv is that my dataframe will contain data from multiple stations. I can work around the error by doing read_csv in one statement and dropping the QD column in the next with df.drop('QD', axis=1, inplace=True), but I want to make sure I understand how to do this in the most pandas-idiomatic way possible.
Here's the code that throws the error:
df = pd.concat(pd.read_csv("http://lgdc.uml.edu/common/DIDBGetValues?ursiCode=" + row['StationCode'] + "&charName=MUFD&DMUF=3000",
skiprows=17,
delim_whitespace=True,
parse_dates=[0],
usecols=['Time','CS','MUFD','Station'],
names=['Time','CS','MUFD','QD','Station']
).fillna(row['StationCode']
).set_index(['Time', 'Station'])
for index, row in stationdf.iterrows())
Example StationCode from stationdf: BC840.
Data sample: 2016-09-19T00:00:05.000Z 100 19.34 //
You can create the new column using operator chaining with assign:
df = pd.read_csv(...).assign(Station=row['StationCode'])
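Applied to the loop above, a hedged sketch (same URL and stationdf assumed): name only the four columns that exist in the file, exclude QD via usecols, then add Station with assign:
df = pd.concat(
    pd.read_csv("http://lgdc.uml.edu/common/DIDBGetValues?ursiCode=" + row['StationCode'] + "&charName=MUFD&DMUF=3000",
                skiprows=17,
                delim_whitespace=True,
                parse_dates=[0],
                names=['Time', 'CS', 'MUFD', 'QD'],  # only the columns present in the file
                usecols=['Time', 'CS', 'MUFD']       # exclude QD at read time
    ).assign(Station=row['StationCode'])
     .set_index(['Time', 'Station'])
    for index, row in stationdf.iterrows())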