I have a CSV file, and when I load it into Databricks using PySpark it loses its structure. I tried delimiting it with a pipe and setting header = True, but nothing works. Below is an example of what I am facing:
I have written the following code:
df = spark.read.csv(df_path, header = True, sep = "|")
and the result is:
----------------------------------
region, sub-region, country, owner
----------------------------------
new_jersey, daffodil, USA, Anker
Dubai, Bahamas, UAE, Nikon
All the values end up inside a single column, separated by ','.
How do I convert this into a properly structured DataFrame?
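For what it's worth, the sample output suggests the file may actually be comma-delimited rather than pipe-delimited. A minimal sketch, assuming df_path points at the same file:
# Hedged sketch: assumes the file is really comma-separated, not pipe-separated
df = spark.read.csv(df_path, header=True, sep=",")
df.show()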
Related
Given the data below, how can I use Python to get the value of the Headers column corresponding to a given input for the DB and Table columns?
DB Table Headers
Oracle Cust Id,Name,Mail,Phone,City,County
Oracle Cli Cid,shopNo,State
Oracle Addr Street,Area,City,Country
SqlSer Usr Name,Id,Addr
SqlSer Log LogId,Env,Stg
MySql Loc Flat,Add,Pin,Country
MySql Data Id,Txt,TaskId,No
Output: If I pass Oracle and Cli as parameters, it should return the value "Cid,shopNo,State" as a list.
I tried using a Python dictionary, but it only takes two values, a key and a value, and I have three values. How can I do this?
Looks like your data is in some sort of tabular format. In that case I would recommend using the pandas package, which is very convenient if you are working with tabular data.
pandas can read data from a CSV file into a DataFrame using pandas.read_csv. You can then filter this DataFrame using the column names and the required values.
In the example below I assume that your data is tab (\t) separated. I read in the data from a string using io.StringIO. Normally you would just use pandas.read_csv('filename.csv').
import pandas as pd
import io
data = """DB\tTable\tHeaders
Oracle\tCust\tId,Name,Mail,Phone,City,County
Oracle\tCli\tCid,shopNo,State
Oracle\tAddr\tStreet,Area,City,Country
SqlSer\tUsr\tName,Id,Addr
SqlSer\tLog\tLogId,Env,Stg
MySql\tLoc\tFlat,Add,Pin,Country
MySql\tData\tId,Txt,TaskId,No"""
dataframe = pd.read_csv(io.StringIO(data), sep='\t')
db_is_oracle = dataframe['DB'] == 'Oracle'
table_is_cli = dataframe['Table'] == 'Cli'
filtered_dataframe = dataframe[db_is_oracle & table_is_cli]
print(filtered_dataframe)
This will result in:
DB Table Headers
1 Oracle Cli Cid,shopNo,State
Or to get the actual headers of the first match:
print(filtered_dataframe['Headers'].iloc[0])
>>> Cid,shopNo,State
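If you want this wrapped up as a function that returns a list (as the question asks), here is a minimal sketch building on the dataframe above; get_headers is just a hypothetical helper name:
def get_headers(dataframe, db, table):
    # Filter on both columns and take the Headers value of the first match
    match = dataframe[(dataframe['DB'] == db) & (dataframe['Table'] == table)]
    if match.empty:
        return []
    # Split the comma-separated header string into a list
    return match['Headers'].iloc[0].split(',')

print(get_headers(dataframe, 'Oracle', 'Cli'))
# ['Cid', 'shopNo', 'State']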
I unloaded a Snowflake table and created a DataFrame.
This table has columns of various data types.
I tried to save it as a text file but got an error:
Text data source does not support Decimal(10,0).
To resolve the error, I cast the columns in my select query, converting them all to the string data type.
Then I got the below error:
Text data source supports only single column, and you have 5 columns.
My requirement is to create a text file as follows:
"column1value column2value column3value and so on"
You can use a CSV output with a space delimiter:
import pyspark.sql.functions as F
df.select([F.col(c).cast('string') for c in df.columns]).write.csv('output', sep=' ')
If you want only 1 output file, you can add .coalesce(1) before .write.
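For example, the single-file variant of the snippet above would look roughly like this:
# Sketch of the single-file variant: coalesce(1) forces a single partition, hence one output file
df.select([F.col(c).cast('string') for c in df.columns]) \
    .coalesce(1) \
    .write.csv('output', sep=' ')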
You need to have a single column if you want to write using spark.write.text. You can use CSV instead, as suggested in mck's answer, or you can concatenate all columns into one before you write:
import org.apache.spark.sql.functions.{col, concat_ws}

df.select(
  concat_ws(" ", df.columns.map(c => col(c).cast("string")): _*).as("value")
).write
  .text("output")
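If you need the same thing in PySpark, a rough equivalent would be the following sketch (not tested against the asker's data):
from pyspark.sql.functions import col, concat_ws

# Cast every column to string and join them with a single space into one column
df.select(
    concat_ws(" ", *[col(c).cast("string") for c in df.columns]).alias("value")
).write.text("output")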
I have a df with multiple country codes in a column (US, CA, MX, AU...) and want to split this one df into multiple ones based on these country code values, but without aggregating it.
I've tried a for loop but was only able to get one df and it was aggregated with groupby().
I gave up trying to figure it out, so I split them based on str.match and wrote one line for each country code. Is there a nice for loop that could achieve the same as the code below? If it could also write a CSV file for each new df, that would be fantastic.
us = df[df['country_code'].str.match("US")]
mx = df[df['country_code'].str.match("MX")]
ca = df[df['country_code'].str.match("CA")]
au = df[df['country_code'].str.match("AU")]
.
.
.
We can write a for loop which takes each code and uses query to get the matching part of the data, then writes it to CSV with to_csv, using an f-string for the file name:
codes = ['US', 'MX', 'CA', 'AU']
for code in codes:
    temp = df.query(f'country_code.str.match("{code}")')
    temp.to_csv(f'df_{code}.csv')
note: f-strings only work on Python >= 3.6
To keep the dataframes:
codes = ['US', 'MX', 'CA', 'AU']
dfs = []
for code in codes:
    temp = df.query(f'country_code.str.match("{code}")')
    dfs.append(temp)
    temp.to_csv(f'df_{code}.csv')
Then you can access them by index, for example: print(dfs[0]) or print(dfs[1]).
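If positional indexes feel fragile, a small variation is to keep the split DataFrames in a dictionary keyed by country code, using the same str.match filtering as in the question:
codes = ['US', 'MX', 'CA', 'AU']
dfs_by_code = {}
for code in codes:
    # Same filter as in the question, just parameterised by the loop variable
    part = df[df['country_code'].str.match(code)]
    dfs_by_code[code] = part
    part.to_csv(f'df_{code}.csv')

print(dfs_by_code['US'])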
I have a problem saving a large pandas DataFrame into a CSV file.
Below is just a snapshot of the first 3 rows of the pandas DataFrame.
parent_sernum parent_pid created_date sernum pid pn
0 FCH21467XBN UXXXX-XXX-XXXX 2017-12-20 11:02:00.177 SSGQA20741370EA85A,SSGQA... UXXXX, 22RV-A,Uxxx-xx... 15-104065-01,15-104065-0...
1 FCH21467XBN Uxxxx-xxx-xx 2017-12-20 11:38:45.373 SSGQA20741370EA85A,SSGQA... Uxx-xxx-xxxx-A,Uxx-xx-... 15-104065-01,15-104065-0...
2 FCH2145V0UW Uxxx-xxxx-M4S 2017-12-02 11:01:26.993 SSH8A2071935A2ACDE,SSH8A... Uxx-xx-1X324RV-A,UCS-ML-... 15-104064-01,15-104064-0...
When it comes to saving this DataFrame into a CSV file, it only captures the first letter and drops the rest for most columns, as below:
parent_sernum,parent_pid,created_date,sernum,pid,pn
F,U,2017-12-20 11:02:00.177,S,U,1
F,U,2017-12-20 11:38:45.373,S,U,1
F,U,2017-12-02 11:01:26.993,S,U,1
Below is my code for saving the data (df = DataFrame).
Is there any option I need to set to save everything from the DataFrame?
Or is there a limitation on CSV file size, so that it automatically captures only a fraction of the data to meet the size restriction?
df.to_csv('sample.csv', index = False, header = True)
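There is no CSV size limit in to_csv that would truncate values to a single character, so before saving it is worth checking what the cells actually contain. A small diagnostic sketch, assuming df is the DataFrame shown above:
# Inspect the real cell contents and types; if these already show single
# characters, the truncation happened before to_csv was called
print(df.dtypes)
print(repr(df['parent_sernum'].iloc[0]))
print(df['parent_sernum'].str.len().head())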
I'm processing events using Dataframes converted from a stream of JSON events which eventually gets written out as Parquet format.
However, some of the JSON events contain spaces in their keys. I want to log such events and filter/drop them from the DataFrame before converting it to Parquet, because ,;{}()\n\t= and space are considered special characters in the Parquet schema (CatalystSchemaConverter), as listed in [1] below, and thus are not allowed in column names.
How can I do such validation of the column names in the DataFrame and drop such an event altogether, without erroring out the Spark Streaming job?
[1]
Spark's CatalystSchemaConverter
def checkFieldName(name: String): Unit = {
  // ,;{}()\n\t= and space are special characters in Parquet schema
  checkConversionRequirement(
    !name.matches(".*[ ,;{}()\n\t=].*"),
    s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
       |Please use alias to rename it.
     """.stripMargin.split("\n").mkString(" ").trim
  )
}
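One way to do that validation on the DataFrame side in PySpark is sketched below; INVALID_CHARS mirrors the character class from CatalystSchemaConverter above, and how you log or route the bad batch is an assumption about your pipeline, not the only option:
import re

# Same character class that Spark's CatalystSchemaConverter rejects
INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")

bad_columns = [c for c in df.columns if INVALID_CHARS.search(c)]
if bad_columns:
    # Log the offending names and skip this batch instead of letting
    # the Parquet write fail the streaming job
    print(f"Skipping batch, invalid column names: {bad_columns}")
else:
    df.write.parquet("output")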
For everyone experiencing this in pyspark: this even happened to me after renaming the columns. One way I could get this to work after some iterations is this:
file = "/opt/myfile.parquet"
df = spark.read.parquet(file)
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", ""))
df = spark.read.schema(df.schema).parquet(file)
You can use a regex to replace all invalid characters with an underscore before you write into parquet. Additionally, strip accents from the column names too.
Here's a normalize function that does this, for both Scala and Python:
Scala
/**
 * Normalize column name by replacing invalid characters with underscore
 * and strips accents
 *
 * @param columns dataframe column names list
 * @return the list of normalized column names
 */
def normalize(columns: Seq[String]): Seq[String] = {
  columns.map { c =>
    org.apache.commons.lang3.StringUtils.stripAccents(c.replaceAll("[ ,;{}()\n\t=]+", "_"))
  }
}
// using the function
val df2 = df.toDF(normalize(df.columns):_*)
Python
import unicodedata
import re
def normalize(column: str) -> str:
    """
    Normalize column name by replacing invalid characters with underscore,
    strips accents and makes lowercase
    :param column: column name
    :return: normalized column name
    """
    n = re.sub(r"[ ,;{}()\n\t=]+", '_', column.lower())
    return unicodedata.normalize('NFKD', n).encode('ASCII', 'ignore').decode()
# using the function
df = df.toDF(*map(normalize, df.columns))
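As a quick sanity check, the Python version turns an invented column name like "Prénom Client (2020)" into something Parquet accepts:
print(normalize("Prénom Client (2020)"))
# prenom_client_2020_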
This is my solution using Regex in order to rename all the dataframe's columns following the parquet convention:
df.columns.foldLeft(df) {
  case (currentDf, oldColumnName) =>
    currentDf.withColumnRenamed(oldColumnName, oldColumnName.replaceAll("[ ,;{}()\n\t=]", ""))
}
I hope it helps,
I had the same problem with column names containing spaces.
The first part of the solution was to put the names in backquotes.
The second part of the solution was to replace the spaces with underscores.
Sorry but I have only the pyspark code ready:
from pyspark.sql import functions as F
df_tmp.select(*(F.col("`" + c + "`").alias(c.replace(' ', '_')) for c in df_tmp.columns))
Use alias to change your field names so that they no longer contain those special characters.
I have encountered this error: "Error in SQL statement: AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'. For more details, refer to https://learn.microsoft.com/azure/databricks/delta/delta-column-mapping Or you can use alias to rename it."
The issue was that I used MAX(COLUM_NAME) when creating a table based on a Parquet/Delta table; the resulting column was named "MAX(COLUM_NAME)" because I forgot to use an alias, and Parquet files don't support brackets '()'.
Solved by using aliases (removing the brackets)
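For example (the table and column names here are made up, not from the original post), giving the aggregate an alias keeps the brackets out of the column name:
# Hypothetical example: alias the aggregate so the resulting column name has no parentheses
df_max = spark.sql("SELECT MAX(order_total) AS max_order_total FROM sales")
df_max.write.format("delta").saveAsTable("sales_summary")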
This was fixed in the Spark 3.3.0 release, at least for Parquet files (I tested); it might work with JSON as well.