In Spark 1.6, how to read a CSV file with a duplicated column name - apache-spark

I am unable to find a solution for reading a CSV file which has a column name repeated twice; reading it gives an error complaining about duplicate column names.
Is there a way to handle this in Spark without altering the CSV file?
My CSV data looks like this, delimited by tab (\t), with some extra spaces in each column.
col1 col2 col3
2020 100 sometext

You can also try using the textFile method to read the CSV files and then convert them to a DataFrame, or use them as RDDs after splitting and mapping them!
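A minimal sketch of that approach for the tab-delimited data above, assuming a SparkContext sc and SQLContext sqlContext are available and using a hypothetical file path:

# Read raw lines, split on tabs, and strip the extra spaces in each field
lines = sc.textFile("/path/to/data.csv")   # hypothetical path
rows = lines.map(lambda line: [c.strip() for c in line.split("\t")])

header = rows.first()
data = rows.filter(lambda r: r != header)

# De-duplicate column names by appending a suffix, e.g. col1, col1 -> col1, col1_2
seen = {}
names = []
for name in header:
    seen[name] = seen.get(name, 0) + 1
    names.append(name if seen[name] == 1 else "{}_{}".format(name, seen[name]))

df = sqlContext.createDataFrame(data, names)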
Hope this works!

Related

How to ingest multiple csv files into a Spark dataframe?

I am trying to ingest 2 csv files into a single spark dataframe. However, the schema of these 2 datasets is very different, and when I perform the below operation, I get back only the schema of the second csv, as if the first one doesn't exist. How can I solve this? My final goal is to count the total number of words.
paths = ["abfss://lmne.dfs.core.windows.net/csvs/MachineLearning_reddit.csv", "abfss://test1#lmne.dfs.core.windows.net/csvs/bbc_news.csv"]
df0_spark=spark.read.format("csv").option("header","false").load(paths)
df0_spark.write.mode("overwrite").saveAsTable("ML_reddit2")
df0_spark.show()
I tried to load both of the files into a single spark dataframe, but it only gives me back one of the tables.
I have reproduced the above and got the below results.
As a sample, I have two CSV files in DBFS with different schemas. When I execute the above code, I get the same result.
To get the desired schema, enable mergeSchema and header while reading the files.
Code:
df0_spark=spark.read.format("csv").option("mergeSchema","true").option("header","true").load(paths)
df0_spark.show()
If you want to combine the two files without nulls, you need a common identity column; read the files individually and use an inner join on it.
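A minimal sketch of that approach, assuming a hypothetical shared column named id (not from the question):

df1 = spark.read.option("header", "true").csv(paths[0])
df2 = spark.read.option("header", "true").csv(paths[1])

# The inner join on the shared identity column keeps only rows present in both files,
# so no null-padded columns appear
combined = df1.join(df2, on="id", how="inner")
combined.show()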
The solution that has worked for me the best in such cases was to read all distinct files separately, and then union them after they have been put into DataFrames. So your code could look something like this:
paths = ["abfss://lmne.dfs.core.windows.net/csvs/MachineLearning_reddit.csv", "abfss://test1#lmne.dfs.core.windows.net/csvs/bbc_news.csv"]
# Load all distinct CSV files
df1 = spark.read.option("header", False).csv(paths[0])
df2 = spark.read.option("header", False).csv(paths[1])
# Union DataFrames
combined_df = df1.unionByName(df2, allowMissingColumns=True)
Note: if the column names differ between the files, then all columns from the first file that are not present in the second one will contain null values. If the schemas should match, you can always rename the columns before the unionByName step.
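A minimal sketch of that renaming step (the column names here are placeholders, not from the question):

# Hypothetical mapping from df2's column names to df1's; adjust to your schemas
rename_map = {"old_name": "new_name"}
for old, new in rename_map.items():
    df2 = df2.withColumnRenamed(old, new)

combined_df = df1.unionByName(df2, allowMissingColumns=True)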

Keeping Special Characters in Spark Table Column Name

Is there any way to keep special characters for a column in a spark 3.0 table?
I need to do something like
CREATE TABLE schema.table
AS
SELECT id=abc
FROM tbl1
I was reading that in Hadoop you would put backticks around the column name, but this does not work in Spark.
If there is a way to do this in PySpark, that would work as well.
It turns out the parquet and delta formats do not accept special characters under any circumstance. You must use ROW FORMAT DELIMITED:
spark.sql("""CREATE TABLE schema.test
ROW FORMAT DELIMITED
SELECT 1 AS `brand=one` """)

How to fetch the column count from dat files in Azure Data Lake Analytics

I have different DAT and CSV files. They contain more than 255 columns, with '|' and tab as delimiters. How do I fetch the column count? Please share sample U-SQL code.
I know this was down voted, so I hope it is still OK to supply an answer (although I'm not including a code sample).
Extract just the first row in your file (using FETCH 1 ROWS) into a single column rowset. You should then be able to use String.Split to get a column count.

Is the first row of a Dataset<Row> created from a csv file equal to the first row in the file?

I'm trying to remove the header from a Dataset<Row> which is created from the data in a csv file. There are a bunch of ways to do it.
So, I'm wondering whether the first row in the Dataset<Row> is always equal to the first row in the file (from which the Dataset<Row> is created)?
When you read the files, the records in the RDD/DataFrame/Dataset are in the same order as they were in the files. But if you perform any operation that requires shuffling, the order changes.
So you can remove the first row right after reading the file, and before any operation that requires shuffling.
The best option would be to use the csv data source:
spark.read.option("header", true).csv(path)
This will take the first row as the header and use it as the column names.
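If you do need to drop the header row yourself, a minimal PySpark sketch (assuming the header line's text does not also appear as a data row) would be:

rdd = spark.sparkContext.textFile(path)
header = rdd.first()                              # first line of the file
data = rdd.filter(lambda line: line != header)    # drop it before any shuffling operation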

Compile a dataframe from multiple CSVs using list of dfs

I am trying to create a single dataframe from 50 CSV files. I need to use only two columns of the CSV files, namely 'Date' and 'Close'. I tried using the df.join function inside the for loop, but it eats up a lot of memory and I get the error "Killed:9" after processing almost 22-23 CSV files.
So now I am trying to create a list of dataframes with only 2 columns using the for loop, and then concat the dfs outside the loop.
I have the following issues to be resolved:
(i) Though most of the CSV files have a start date of 2000-01-01, there are a few CSVs which have later start dates. So I want the main dataframe to have all the dates, with NaN or empty fields for the CSVs with later start dates.
(ii) I want to concat them with Date as the index.
My code is :-
def compileData(symbol):
    with open("nifty50.pickle","rb") as f:
        symbols=pickle.load(f)

    dfList=[]
    main_df=pd.DataFrame()
    for symbol in symbols:
        df=pd.read_csv('/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/{}.csv'.format(symbol),infer_datetime_format=True,usecols=['Date','Close'],index_col=None,header=0)
        df.rename(columns={'Close':symbol}, inplace=True)
        dfList.append(df)

    main_df=pd.concat(dfList,axis=1,ignore_index=True,join='outer')
    print(main_df.head())
You can use index_col=0 in the read_csv, or dfList.append(df.set_index('Date')), to put your Date column in the index of each dataframe. Then, using pd.concat with axis=1, Pandas will use intrinsic data alignment to align all dataframes based on the index.
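A minimal sketch of that approach, reusing the path and pickle file from the question (the parse_dates flag and dropping ignore_index are assumptions on my part):

import pickle
import pandas as pd

with open("nifty50.pickle", "rb") as f:
    symbols = pickle.load(f)

dfList = []
for symbol in symbols:
    df = pd.read_csv(
        '/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/{}.csv'.format(symbol),
        usecols=['Date', 'Close'],
        index_col='Date',        # Date goes into the index so concat can align on it
        parse_dates=True,
    )
    df.rename(columns={'Close': symbol}, inplace=True)
    dfList.append(df)

# axis=1 + outer join aligns on the Date index; dates missing from a file become NaN.
# ignore_index is dropped so the symbol column names are preserved.
main_df = pd.concat(dfList, axis=1, join='outer')
print(main_df.head())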
