I have a DataFrame with two columns - filepath (a wasbs file path for blobs) and string - and I want to write each string to a separate blob with that file name. How can I do this?
You can only write to one wasb container at a time - not sure if this is part of your question, but I want to clarify either way. In addition, Spark writes files to directories, not single files. If you want to accomplish exactly what you're asking for, you'll have to repartition to 1 partition and partition by filepath.
After that step you'll need to use the Azure SDK to rename the files and move them up to the parent directory.
UPDATED ANSWER:
I found a much simpler way of accomplishing this using dbutils.fs.put. You would need to loop through each row of your DataFrame, calling dbutils.fs.put() for each row.
Assuming your input file (assumed CSV) with two columns looks something like:
filepath, stringValue
wasbs://container@myaccount.blob.core.windows.net/demo1.txt,"demo string 1"
wasbs://container@myaccount.blob.core.windows.net/demo2.txt,"demo string 2"
wasbs://container@myaccount.blob.core.windows.net/demo3.txt,"demo string 3"
wasbs://container@myaccount.blob.core.windows.net/demo4.txt,"demo string 4"
wasbs://container@myaccount.blob.core.windows.net/demo5.txt,"demo string 5"
You can use the following to loop through each row in your input DataFrame:
df = spark.read.option("header", True).csv("wasbs://container@myaccount.blob.core.windows.net/demo-data.csv")
rowList = df.rdd.collect()
for row in rowList:
    dbutils.fs.put(str(row[0]), str(row[1]), True)  # (file path, contents, overwrite)
The put method writes a given string out to a file, encoded in UTF-8, so you can loop through each record in your DataFrame, passing the first column in as the file path and the second as the string contents to write to the file.
This also has the benefit of writing the string to a single file, so you don't need to go through the process of renaming and moving files.
OLD ANSWER:
Due to the distributed nature of Spark, writing a DataFrame to files results in a directory being created which will contain multiple files. You can use coalesce to force the processing to a single worker and file, whose name will start with part-0000.
DISCLAIMER: This is recommended only for small files, as larger data files can lead to out of memory exceptions.
To accomplish what you are attempting, you would need to loop through each row of your DataFrame, creating a new DataFrame for each row which contains only the string value you want written to the file.
Assuming your input file (assumed CSV) with two columns looks something like:
filepath, stringValue
wasbs://container@myaccount.blob.core.windows.net/demo1,"demo string 1"
wasbs://container@myaccount.blob.core.windows.net/demo2,"demo string 2"
wasbs://container@myaccount.blob.core.windows.net/demo3,"demo string 3"
wasbs://container@myaccount.blob.core.windows.net/demo4,"demo string 4"
wasbs://container@myaccount.blob.core.windows.net/demo5,"demo string 5"
You can use the following to loop through each row in your input DataFrame:
from pyspark.sql.types import StringType
df = spark.read.option("header", True).csv("wasbs://container@myaccount.blob.core.windows.net/demo-data.csv")
rowList = df.rdd.collect()
for row in rowList:
    # Wrap the string value in a single-column DataFrame and write it to the path from column 0
    dfRow = spark.createDataFrame([str(row[1])], StringType())
    dfRow.coalesce(1).write.mode("overwrite").text(row[0])
This will result in directories being created in your Blob Storage account container named demo1, demo2, demo3, demo4, and demo5. Each of those will contain multiple files. The file within each directory whose name begins with part-0000 is the file that will contain your string value.
If you need those files to have different names, and be in a different location, you can then use dbutils.fs methods to handle moving the files and doing the renames. You can also use this to do any cleanup of the directories that were created, if desired.
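For example, a minimal sketch of that cleanup step using dbutils.fs, assuming the directories are the values that were in the filepath column (row[0]):
for row in rowList:
    dir_path = row[0]  # e.g. wasbs://container@myaccount.blob.core.windows.net/demo1
    part_file = [f for f in dbutils.fs.ls(dir_path) if f.name.startswith("part-")][0]
    dbutils.fs.mv(part_file.path, dir_path + ".txt")  # move the data file up and rename it
    dbutils.fs.rm(dir_path, True)  # remove the directory and its _SUCCESS/metadata files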
Related
I am trying to ingest 2 CSV files into a single Spark DataFrame. However, the schemas of these 2 datasets are very different, and when I perform the operation below, I get back only the schema of the second CSV, as if the first one doesn't exist. How can I solve this? My final goal is to count the total number of words.
paths = ["abfss://lmne.dfs.core.windows.net/csvs/MachineLearning_reddit.csv", "abfss://test1#lmne.dfs.core.windows.net/csvs/bbc_news.csv"]
df0_spark=spark.read.format("csv").option("header","false").load(paths)
df0_spark.write.mode("overwrite").saveAsTable("ML_reddit2")
df0_spark.show()
I tried to load both of the files into a single spark dataframe, but it only gives me back one of the tables.
I have reproduced the above. As a sample, I have two CSV files in DBFS with different schemas, and when I execute the above code I get the same result.
To get the desired schema, enable the mergeSchema and header options while reading the files.
Code:
df0_spark=spark.read.format("csv").option("mergeSchema","true").option("header","true").load(paths)
df0_spark.show()
If you want to combine the two files without nulls, we should have a common identity column and we have to read the files individually and use inner join for that.
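For example, a rough sketch of that approach, where "id" is a hypothetical common identity column present in both files:
df_reddit = spark.read.option("header", True).csv(paths[0])
df_bbc = spark.read.option("header", True).csv(paths[1])
combined = df_reddit.join(df_bbc, on="id", how="inner")  # keeps only rows present in both files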
The solution that has worked for me the best in such cases was to read all distinct files separately, and then union them after they have been put into DataFrames. So your code could look something like this:
paths = ["abfss://lmne.dfs.core.windows.net/csvs/MachineLearning_reddit.csv", "abfss://test1#lmne.dfs.core.windows.net/csvs/bbc_news.csv"]
# Load all distinct CSV files
df1 = spark.read.option("header", False).csv(paths[0])
df2 = spark.read.option("header", False).csv(paths[1])
# Union DataFrames
combined_df = df1.unionByName(df2, allowMissingColumns=True)
Note: if the column names differ between the files, then for all columns from the first file that are not present in the second one you will have null values. If the schemas should match, you can always rename the columns before the unionByName step, as sketched below.
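A small sketch of that renaming step, where "body" and "text" are purely hypothetical column names:
df2_aligned = df2.withColumnRenamed("body", "text")  # make the second file's column match the first
combined_df = df1.unionByName(df2_aligned, allowMissingColumns=True)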
I have to write a Spark dataframe to a path of the format: base_path/{year}/{month}/{day}/{hour}/
If I do something like below:
pc = ["year", "month", "day", "hour"]
df.write.partitionBy(*pc).parquet("base_path/", mode = 'append')
It creates the location as: base_path/year=2022/month=04/day=25/hour=10/.
I do not want the column names like year, month, day and hour to be the part of path but something like: base_path/2022/04/25/10/. Any solution for this?
The column names are written as part of the path because they are not stored in the data files themselves, so you need the column name in the path in order to be able to read the data back (following the Hive-style partitioning convention).
If you would still want to write the data with the above path you can use multiple write commands with the explicit path and filter according to the partition values.
The logic for determining the partition path is part of Spark's internal file-writing code, and there doesn't seem to be a way to replace it in a pluggable way (you could technically load a different implementation in the JVM or write your own writer implementation, but I would not recommend that).
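A rough sketch of that multiple-writes approach, assuming the number of distinct year/month/day/hour combinations is small enough to loop over:
combos = df.select("year", "month", "day", "hour").distinct().collect()
for c in combos:
    (df.filter((df.year == c.year) & (df.month == c.month) & (df.day == c.day) & (df.hour == c.hour))
       .drop("year", "month", "day", "hour")  # the values are now encoded in the path instead
       .write.mode("append")
       .parquet(f"base_path/{c.year}/{c.month}/{c.day}/{c.hour}/"))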
The Parquet documentation explicitly mentions that the design supports splitting the metadata and data into different files, including the possibility that different column groups can be stored in different files.
However, I could not find any instructions on how to achieve that. In my use case I would like to store the metadata in one file, store the data for columns 1-100 in one file, and columns 101-200 in a second file.
Any idea how to achieve this ?
If you are using PySpark, it's as easy as this:
df = spark.createDataFrame(...)
df.write.parquet('file_name.parquet')
and it will create a folder called file_name.parquet in the default location in HDFS. You can just create two dataframes, one with columns 1-100 and the other with columns 101-200, and save them separately. It will automatically save the metadata, if by metadata you mean the DataFrame schema.
You can select a range of columns like this:
df_first_hundred = df.select(df.columns[:100])
df_second_hundred = df.select(df.columns[100:])
Save them as separate files:
df_first_hundred.write.parquet('df_first_hundred')
df_second_hundred.write.parquet('df_second_hundred')
I'm trying to remove the header from a Dataset<Row> which is created from the data in a CSV file. There are a bunch of ways to do it.
So I'm wondering: is the first row in the Dataset<Row> always equal to the first row in the file (from which the Dataset<Row> is created)?
When you read the files, the records in the RDD/DataFrame/Dataset are in the same order as they were in the files. But if you perform any operation that requires shuffling, the order changes.
So you can remove the first row as soon as you read the file, before any operation that requires shuffling.
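For example, a minimal PySpark sketch of that manual approach (the input path is illustrative):
rdd = spark.sparkContext.textFile("/path/to/file.csv")  # illustrative path
header = rdd.first()  # the first record is the header as long as no shuffle has happened yet
data = rdd.filter(lambda line: line != header)  # drops the header (and any lines identical to it)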
The best option would be using csv data source as
spark.read.option("header", true).csv(path)
This will take the first row as the header and use it for the column names.
I have a loop that is going to create multiple rows of data which I want to convert into a dataframe.
Currently I am creating a CSV-format string and, inside the loop, keep appending to it with each row separated by a newline. I am creating a CSV file so that I can also save it as a text file for other processing.
File Header:
output_str="Col1,Col2,Col3,Col4\n"
Inside for loop:
output_str += "Val1,Val2,Val3,Val4\n"
I then create an RDD by splitting it on the newline and then convert it into the dataframe as follows.
output_rdd = sc.parallelize(output_str.split("\n"))
output_df = output_rdd.map(lambda x: (x, )).toDF()
It creates a dataframe but only has 1 column. I know that is because of the map function, where I am wrapping each line in a tuple with only one item. What I need is a list with multiple items. So perhaps I should be calling the split() function on every line to get a list. But I have a feeling that there should be a much more straightforward way. Appreciate any help. Thanks.
Edit: To give more information, using Spark SQL I have filtered my dataset to those rows that contain the problem. However, the rows contain information in the following format (separated by '|'), and I need to extract the values from column 3 whose corresponding flag in column 4 is set to 1 (here it is 0xcd).
Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00
So I am collecting the output at the driver and then parsing the last 2 columns after which I am left with regular strings that I want to put back in a dataframe. I am not sure if I can achieve the same using Spark SQL to parse the output in the manner I want.
Yes, indeed your current approach seems a little too complicated... Creating a large string in the Spark driver and then parallelizing it with Spark is not really performant.
First of all, where are you getting your input data from? In my opinion you should use one of the existing Spark readers to read it. For example you can use:
CSV -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
jdbc -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.jdbc
json -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
parquet -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet
not structured text file -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.SparkContext.textFile
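For instance, if the pipe-delimited rows shown above live in a file, a hedged sketch of reading them directly with the CSV reader and a custom separator (the path is illustrative):
df = spark.read.option("sep", "|").option("header", False).csv("/path/to/input.txt")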
In the next step you can preprocess it using the Spark DataFrame or RDD API, depending on your use case.
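As an illustration, a rough sketch of doing that parsing inside Spark rather than at the driver, keeping the values from the 3rd field whose flag in the 4th field is 0x01 (the input path and single-column layout are assumptions):
def extract(line):
    fields = line.split("|")
    vals = fields[2].split(",")
    flags = fields[3].split(",")
    return [v for v, f in zip(vals, flags) if f == "0x01"]
lines = spark.read.text("/path/to/input.txt")  # one row per line, column named "value"
extracted = lines.rdd.map(lambda r: extract(r.value))  # yields ["0xcd"] for the sample row above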
A bit late, but currently you're applying a map to create a tuple for each row containing the string as the first element. Instead of this, you probably want to split the string, which can easily be done inside the map step. Assuming all of your rows have the same number of elements you can replace:
output_df = output_rdd.map(lambda x: (x, )).toDF()
with
output_df = output_rdd.map(lambda x: x.split(",")).toDF()  # split on commas rather than whitespace
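Note that with this the header line (and any trailing empty line from the final "\n") also becomes a row; a hedged variant that skips them and names the columns after the header from the question:
lines = [line for line in output_str.split("\n")[1:] if line]  # drop the header and any empty lines
output_rdd = sc.parallelize(lines)
output_df = output_rdd.map(lambda x: x.split(",")).toDF(["Col1", "Col2", "Col3", "Col4"])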