How to ingest multiple csv files into a Spark dataframe? - azure

I am trying to ingest 2 csv files into a single spark dataframe. However, the schema of these 2 datasets is very different, and when I perform the below operation, I get back only the schema of the second csv, as if the first one doesn't exist. How can I solve this? My final goal is to count the total number of words.
paths = ["abfss://lmne.dfs.core.windows.net/csvs/MachineLearning_reddit.csv", "abfss://test1#lmne.dfs.core.windows.net/csvs/bbc_news.csv"]
df0_spark=spark.read.format("csv").option("header","false").load(paths)
df0_spark.write.mode("overwrite").saveAsTable("ML_reddit2")
df0_spark.show()
I tried to load both of the files into a single spark dataframe, but it only gives me back one of the tables.

I have reproduced the above and got the results below.
As a sample, I used two csv files in DBFS with different schemas. When I executed the above code, I got the same result.
To get the desired schema, enable mergeSchema and header while reading the files.
Code:
df0_spark=spark.read.format("csv").option("mergeSchema","true").option("header","true").load(paths)
df0_spark.show()
If you want to combine the two files without nulls, the files need a common identity column: read them individually and combine them with an inner join on that column.

The solution that has worked for me the best in such cases was to read all distinct files separately, and then union them after they have been put into DataFrames. So your code could look something like this:
paths = ["abfss://lmne.dfs.core.windows.net/csvs/MachineLearning_reddit.csv", "abfss://test1#lmne.dfs.core.windows.net/csvs/bbc_news.csv"]
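# Load each CSV file separately.
# header=True reads the first row as column names; with header=False the
# columns would be named _c0, _c1, ... in both files.
df1 = spark.read.option("header", True).csv(paths[0])
df2 = spark.read.option("header", True).csv(paths[1])
# Union the DataFrames by column name, filling missing columns with nulls
combined_df = df1.unionByName(df2, allowMissingColumns=True)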
Note: if the column names differ between the files, every column that exists in only one file will hold null values for the rows coming from the other file. If the schemas are supposed to match, you can always rename the columns before the unionByName step. The allowMissingColumns parameter is available from Spark 3.1 onward.
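To make the null-filling behaviour concrete, here is a minimal plain-Python sketch of what a union by name with missing columns allowed does. It is only a model, not Spark code: the helper union_by_name, the column names and the sample rows are all made up for illustration, and None stands in for Spark's null. The last lines count words in one shared column, since that was the stated goal.

```python
# Plain-Python model of unionByName(..., allowMissingColumns=True).
# Rows are dicts keyed by column name; None stands in for Spark's null.

def union_by_name(left_rows, left_cols, right_rows, right_cols):
    """Union two tables by column name, filling absent columns with None."""
    # Result schema: left columns first, then right-only columns appended.
    all_cols = list(left_cols) + [c for c in right_cols if c not in left_cols]
    combined = []
    for row in list(left_rows) + list(right_rows):
        combined.append({c: row.get(c) for c in all_cols})
    return all_cols, combined


# Hypothetical sample data standing in for the two CSV files.
reddit_cols = ["title", "score"]
reddit_rows = [{"title": "New optimizer released", "score": 12}]
news_cols = ["title", "pubDate"]
news_rows = [{"title": "Markets rally", "pubDate": "2023-01-01"}]

cols, rows = union_by_name(reddit_rows, reddit_cols, news_rows, news_cols)
for r in rows:
    print(r)

# Word count over the shared "title" column, skipping missing values.
total_words = sum(len(r["title"].split()) for r in rows if r["title"])
print(total_words)
```

In Spark itself the same count would be computed on combined_df rather than on Python dicts; the sketch only shows why the nulls appear and that they have to be skipped when counting.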

Related

Parquet Format - split columns in different files

The Parquet documentation explicitly mentions that the design supports splitting the metadata and data into different files, including the possibility that different column groups can be stored in different files.
However, I could not find any instructions on how to achieve that. In my use case I would like to store the metadata in one file, the data for columns 1-100 in one file, and the data for columns 101-200 in a second file.
Any idea how to achieve this?
If you are using PySpark, it's as easy as this:
df = spark.createDataFrame(...)
df.write.parquet('file_name.parquet')
and it will create a folder called file_name.parquet in the default location in HDFS. You can just create two dataframes, one with columns 1-100, and the other dataframe with columns 101-200 and save them separately. It automatically will save the metadata, if you mean the data frame schema.
You can select a range of columns like this:
df_first_hundred = df.select(df.columns[:100])
df_second_hundred = df.select(df.columns[100:])
Save them as separate files:
df_first_hundred.write.parquet('df_first_hundred')
df_second_hundred.write.parquet('df_second_hundred')
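If there are more than two column groups, the slicing above generalizes to a small helper. The sketch below is plain Python and uses a made-up list of column names in place of df.columns; chunk_columns is a hypothetical helper name, and the write call is shown only as a comment.

```python
def chunk_columns(columns, size):
    """Split a list of column names into consecutive groups of at most `size`."""
    return [columns[i:i + size] for i in range(0, len(columns), size)]


# Hypothetical column names standing in for df.columns.
columns = [f"col{i}" for i in range(1, 201)]
groups = chunk_columns(columns, 100)

for idx, group in enumerate(groups):
    # With a real DataFrame each group would be written separately, e.g.
    # df.select(group).write.parquet(f"df_part_{idx}")
    print(idx, group[0], group[-1], len(group))
```

Keep in mind that the separate outputs carry no row identifier, so if the groups ever need to be joined back together, a key column should probably be included in every group.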

Levenshtein distance algorithm on Spark

I'm just starting with the Hadoop ecosystem, I have some questions, and I need your help.
I have two HDFS files and need to compute the Levenshtein distance between a group of columns of the first one and another group of columns of the second one.
This process will run every day on a considerable amount of data (150M rows in the first file vs. 11M rows in the second one).
I would appreciate some guidance (code example, references, etc.) on how I can read my two files from HDFS, compute the Levenshtein distance (using Spark?) as described, and save the results to a third HDFS file.
Thank you very much in advance.
I guess you have CSV files, so you can read them directly into a DataFrame:
val df1 = spark.read.option("header","true").csv("hdfs:///pathtoyourfile_1")
The org.apache.spark.sql.functions object contains the function def levenshtein(l: Column, r: Column): Column, so you need to pass DataFrame columns of String type as parameters. If you want to pass a group of columns, you can use the concat('col1,'col2,..) function to concatenate multiple columns and pass the result to levenshtein. If you have 2 or more DataFrames, you have to join them into one DataFrame and then perform the distance calculation. Finally, you can save your results to CSV using df.write.csv("path")
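For reference, this is what the distance itself measures. The following is a plain-Python sketch of the classic dynamic-programming edit distance, not Spark's implementation, and the sample column values are made up. It also mirrors the idea of concatenating a group of columns into one string before comparing.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum single-character inserts, deletes, substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Concatenate a group of columns into one string per side, then compare.
# The values are hypothetical stand-ins for two rows being matched.
left = "".join(["John", "Smith"])
right = "".join(["Jon", "Smyth"])
print(levenshtein(left, right))
```

At 150M x 11M rows, comparing every pair is very expensive, so it is probably worth joining on some cheaper key first and computing the distance only for the candidate pairs that remain.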

Exclude column/list of columns while loading into dataframe

I have a weird issue: my parquet file has around 1200 columns, a few of which are corrupt for some reason. I can't edit the files, so I want to exclude those columns from the data frame I want to create. I can't find any feature that would let me do so.
The idea of creating a data frame first and then dropping the columns did cross my mind, but that can't be done because it fails with
Failed to merge incompatible data types string and bigint
So any way to exclude the columns beforehand, even before loading, is welcome. Declaring a schema would be a tedious job that I want to avoid at any cost, because I want to keep the process as dynamic as possible.
Any leads or ideas are more than welcome.
Thanks
Ashit
Can you provide some more information?
Do all files have the same schema, except for some "corrupted" columns that seem to have a different type?
If that is the case, I would recommend providing a schema rather than avoiding it. You can do something like this (this is the Scala version):
import org.apache.spark.sql.functions.col
val df = spark.read.parquet("path to one file that has a good schema")
val allColumns = df.columns.toSet // get all column names
val goodColumns = allColumns - "badColumn"
val allData = spark.read.parquet("path to all data").select(goodColumns.toSeq.map(col): _*) // expand the names into Column arguments
Spark should project only the needed columns and skip the corrupted ones.
You can also do the same by getting the schema (a StructType) from one file, dropping the unneeded columns, and providing that schema to the reader.

spark reading missing columns in parquet

I have parquet files which I need to read from spark. Some files have few columns missing which are present in new files.
Since I do not know which files have column missing, I need to read all the files in spark. I have list of columns that I need to read. It may also be the case that all the files may have some column missing. I need to put a null in those columns which are missing.
When I try to do a
sqlContext.sql('query') it gives me error saying that columns are missing
If I define the schema and do a
sqlContext.read.parquet('s3://....').schema(parquet_schema)
It gives me the same error.
Help me here
You need to use Parquet schema merging (schema evolution) to address this situation.
As described in the Spark documentation:
Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
All you need to do is
val mergedDF = spark.read.option("mergeSchema", "true").parquet("s3://....")
This will give you the parquet data with the complete schema.
Pain point
If your schemas are incompatible, for example one parquet file has col1 with DataType String and another parquet file has col1 with DataType Long, then the schema merge will fail.
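The difference between "different but mutually compatible" and incompatible schemas can be illustrated with a small plain-Python model. This is a deliberately simplified sketch, not how Spark implements the merge, and its actual rules may differ in detail: schemas are plain dicts of column name to type name, and merge_schemas is a hypothetical helper.

```python
def merge_schemas(schemas):
    """Merge {column: type} dicts; fail when a column's type conflicts."""
    merged = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if name in merged and merged[name] != dtype:
                raise ValueError(
                    f"Failed to merge column {name!r}: {merged[name]} vs {dtype}"
                )
            merged.setdefault(name, dtype)
    return merged


# Compatible: the newer file only adds a column.
old_file = {"col1": "string", "col2": "long"}
new_file = {"col1": "string", "col2": "long", "col3": "double"}
print(merge_schemas([old_file, new_file]))

# Incompatible: col1 is a string in one file and a long in another.
conflicting = {"col1": "long"}
try:
    merge_schemas([old_file, conflicting])
except ValueError as err:
    print(err)
```

In the compatible case, rows from the older file would simply get nulls for col3; in the conflicting case there is no single type to pick, which is why the merge stops.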

Spark Python: Converting multiple lines from inside a loop into a dataframe

I have a loop that is going to create multiple rows of data which I want to convert into a dataframe.
Currently I am creating a CSV-format string and, inside the loop, I keep appending rows to it, separated by newlines. I build it as CSV so that I can also save it as a text file for other processing.
File Header:
output_str="Col1,Col2,Col3,Col4\n"
Inside for loop:
output_str += "Val1,Val2,Val3,Val4\n"
I then create an RDD by splitting it on the newline and then convert it into a dataframe as follows.
output_rdd = sc.parallelize(output_str.split("\n"))
output_df = output_rdd.map(lambda x: (x, )).toDF()
It creates a dataframe but only has 1 column. I know that is because of the map function where I am making it into a list with only 1 item in the set. What I need is a list with multiple items. So perhaps I should be calling split() function on every line to get a list. But I am getting a feeling that there should be a much more straight-forward way. Appreciate any help. Thanks.
Edit: To give more information, using Spark SQL I have filtered my dataset to those rows that contain the problem. However the rows contain information in following format (separated by '|'). And I need to extract those values from column 3 which has corresponding flag set to 1 in column 4 (Here it is 0xcd)
Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00
So I am collecting the output at the driver and then parsing the last 2 columns after which I am left with regular strings that I want to put back in a dataframe. I am not sure if I can achieve the same using Spark SQL to parse the output in the manner I want.
Yes, indeed, your current approach seems a little too complicated... Creating a large string in the Spark driver and then parallelizing it with Spark is not really performant.
First of all, where are you getting your input data from? In my opinion you should use one of the existing Spark readers to read it. For example you can use:
CSV -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
jdbc -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.jdbc
json -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
parquet -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet
unstructured text file -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.SparkContext.textFile
In next step you can preprocess it using Spark DataFrame or RDD API depending on your use case.
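A bit late, but currently you're applying a map that creates a one-element tuple for each row, with the whole string as its only element. Instead of this, you probably want to split the string on the comma delimiter, which can easily be done inside the map step. Assuming all of your rows have the same number of elements you can replace:
output_df = output_rdd.map(lambda x: (x, )).toDF()
with
output_df = output_rdd.filter(lambda x: x).map(lambda x: x.split(",")).toDF()
Note that split(",") is needed because a bare split() splits on whitespace, and the filter drops the empty string left over from the trailing newline.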
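As for the '|'-separated rows mentioned in the question's edit, the parsing itself is a short function. The sketch below is plain Python and assumes the flags are hex strings exactly as in the sample row; flagged_values is a hypothetical helper name.

```python
def flagged_values(line: str) -> list:
    """Return column-3 values whose matching column-4 flag equals 1."""
    fields = line.split("|")
    values = fields[2].split(",")
    flags = fields[3].split(",")
    # int(flag, 16) parses hex strings such as "0x01".
    return [v for v, f in zip(values, flags) if int(f, 16) == 1]


sample = "Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00"
print(flagged_values(sample))
```

A function like this could probably be applied inside an RDD map or wrapped as a UDF, so the rows would not have to be collected at the driver and turned back into a dataframe afterwards.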
