How to create Dataframes on partitioned files - python-3.x

I have 1000+ parquet files in a folder, which is a partitioned folder.
We now have a requirement to perform some transformations on those files.
I need to create a DataFrame from those parquet files. Any suggestions?

Try the code below:
DF = sqlContext.read.parquet(r"<folderpath>/*")
The * indicates all files present under the specified folder.
DF will be a DataFrame containing the data from all the parquet files inside <folderpath>. You can then perform your transformations on DF.
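As a side note, with a modern SparkSession you can usually point the reader at the base folder itself; Spark then discovers the partition subdirectories and exposes the partition columns. A minimal sketch, with a hypothetical base path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-partitioned-parquet").getOrCreate()

# Pointing the reader at the base folder (rather than "<folderpath>/*") lets
# Spark discover the partition subdirectories and surface the partition
# columns in the DataFrame.
df = spark.read.parquet("/data/my_table")  # hypothetical base path

df.printSchema()   # partition columns show up in the schema
print(df.count())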

Related

How to import random part files in a Hive GCS bucket into a dataframe

I have a Hive table whose storage location is a GCS bucket, with the table name as the folder name, date partitions as child folders, and part files in ORC format inside each partition folder. I need to sample the table with representation across partitions. Access to the Hive metastore and database is not provided.
I have imported every single part file into a dataframe and sampled from that dataframe, but this approach creates performance issues since it imports a table with 20 million rows.
What I want is a function that goes into random partition folders and imports random part files from each folder until the required number of sample rows is reached.
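There is no built-in helper for this, but one rough sketch of the approach described above is to shuffle the partition folders, read them one at a time, and stop once enough rows have been collected. The bucket name, paths, and row target below are hypothetical, and the folder list is assumed to have been obtained beforehand:

import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample-orc-partitions").getOrCreate()

# Hypothetical: partition folders listed beforehand (e.g. via gsutil ls or the
# Hadoop FileSystem API); paths and the sample size are placeholders.
partition_folders = [
    "gs://my-bucket/my_table/dt=2023-01-01",
    "gs://my-bucket/my_table/dt=2023-01-02",
]
target_rows = 100_000

random.shuffle(partition_folders)
sample_df, rows_so_far = None, 0
for folder in partition_folders:
    # Read one randomly chosen partition folder at a time; listing and picking
    # individual part files inside it would follow the same pattern.
    part_df = spark.read.orc(folder)
    sample_df = part_df if sample_df is None else sample_df.unionByName(part_df)
    rows_so_far += part_df.count()
    if rows_so_far >= target_rows:
        break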

Writing Dataframe to a parquet file but no headers are being written

I have the following code:
print(df.show(3))
print(df.columns)
df.select('port', 'key', 'return_b', 'return_a', 'return_c', 'return_d', 'return_g').write.format("parquet").save("qwe.parquet")
For some reason this doesn't write the DataFrame to the parquet file with the headers. The print statements above show that those columns exist, but the parquet file doesn't have those headers.
I have also tried:
df.write.option("header", "true").mode("overwrite").parquet(write_folder)
You may find df.to_parquet(...) more convenient. If you wish to project down to selected columns, do that first, and then write to parquet.
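For context, parquet stores the schema (including column names) inside the file itself rather than as a CSV-style header row, so the names normally come back when the file is read again. A minimal sketch of projecting first and then writing, assuming `df` and `spark` as in the question and a hypothetical output path:

# Project down to the needed columns first, then write to parquet.
cols = ["port", "key", "return_b", "return_a", "return_c", "return_d", "return_g"]
df.select(*cols).write.mode("overwrite").parquet("/tmp/qwe.parquet")  # hypothetical path

# Parquet embeds the schema (column names and types) in the file itself,
# so the names come back on read -- there is no separate header row.
spark.read.parquet("/tmp/qwe.parquet").printSchema()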

create different dataframe based on field value in Spark/Scala

I have a dataframe in the format below with 2 fields. One field contains a code and the other contains XML.
EventCd|XML_VALUE
1.3.6.10|<nt:SNMP>
<nt:var id="1.3.0" type="STRING"> MESSAGE </nt:var>
<nt:var id="1.3.9" type="STRING">AB-CD-EF</nt:var>
</nt:SNMP>
1.3.6.11|<nt:SNMP>
<nt:var id="1.3.1" type="STRING"> CALL </nt:var>
<nt:var id="1.3.2" type="STRING">XX-AC-EF</nt:var>
</nt:SNMP>
Based on the value in the code field, I want to conditionally create different dataframes and place the data in the corresponding HDFS folders.
If the code is 1.3.6.10, it should create a message dataframe and place the files under the ../message/ HDFS folder; if the code is 1.3.6.11, it should create a call dataframe and write the data into the call HDFS folder, e.g. ../call/.
I am able to create the dataframes using multiple filter operations, but is there a way to use only one dataframe and the corresponding HDFS write command?
Can someone suggest how I can do this in Spark/Scala?
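The question asks for Scala, but the DataFrame calls are identical in both APIs; here is a rough PySpark sketch of the filter-per-code approach the question mentions, plus a single partitioned write as an alternative. The code-to-folder mapping and HDFS paths are hypothetical, and `df` is assumed to be the two-column (EventCd, XML_VALUE) dataframe from the question:

# Hypothetical mapping from event code to target HDFS folder.
code_to_folder = {
    "1.3.6.10": "hdfs:///data/message/",
    "1.3.6.11": "hdfs:///data/call/",
}

for code, folder in code_to_folder.items():
    # One filtered dataframe per event code, written to its own folder.
    df.filter(df.EventCd == code).write.mode("overwrite").parquet(folder)

# Alternative: one write, letting Spark split by code into subfolders
# such as .../EventCd=1.3.6.10/ and .../EventCd=1.3.6.11/.
df.write.partitionBy("EventCd").mode("overwrite").parquet("hdfs:///data/events/")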

Load single column from csv file

I have a CSV file that contains a large number of columns. I want to load just one column from that file using Spark.
I know that we can use a select statement to filter down to a column, but I want the read operation itself to load just that one column.
That way I should be able to avoid the extra memory used by the other columns. Is there any way to do this?
Spark will load the complete file and parse it for columns. As you mentioned, you can use select to restrict the columns in the dataframe, so the dataframe will have only one column.
Spark still loads the complete file into memory and then filters down to the column you want with the select statement you mentioned, because every read operation in Spark reads and scans the whole file as a distributed stream reader is created (the reader gets instantiated on every node where the data is stored).
If your problem is to read the data column-wise, then you can store the file in parquet format and read that file instead. Parquet is columnar storage and is meant exactly for this type of use case (you can verify it using explain).
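A short sketch of both points above, with a hypothetical file path and column name: the CSV read still scans whole rows even though only one column survives the select, whereas a parquet read can prune down to that column at the storage level.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-column-read").getOrCreate()

# CSV: the whole file is still scanned; select() only restricts the result.
csv_col = spark.read.csv("/data/input.csv", header=True).select("col_a")  # hypothetical
csv_col.explain()

# Parquet: columnar storage, so only the requested column is read from disk.
pq_col = spark.read.parquet("/data/input.parquet").select("col_a")        # hypothetical
pq_col.explain()  # the plan's ReadSchema is limited to col_a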

Refresh Dataframe created by writing to Parquet files in append mode

Is there a way to refresh a dataframe that is created from parquet files being appended to in PySpark?
Basically, I am writing data that I get daily to parquet in append mode.
If I want to check the parquet files created, I load them up in PySpark and do a count of the data. However, if new data is appended to the parquet and I try a count again on the dataframe without reloading it, I do not get the updated count. Basically, I have to create a new dataframe every time there are any changes to my parquet files. Is there a way in Spark for the changes to be loaded automatically once my parquet is updated?
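For illustration, a minimal sketch of the workflow described above, with a hypothetical path and a stand-in daily batch. The file listing behind a DataFrame is captured when spark.read.parquet is called, so rows appended later are normally only visible after the path is refreshed or re-read:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("append-and-recount").getOrCreate()

# Stand-in for one day's batch; path and data are hypothetical.
daily_df = spark.createDataFrame([(1, "2024-01-01")], ["id", "load_date"])
daily_df.write.mode("append").parquet("/data/history")

df = spark.read.parquet("/data/history")
print(df.count())  # counts the files that existed when df was created

# After a later append, the old df keeps its original file listing, so
# refresh the path and re-read to pick up the new rows.
spark.catalog.refreshByPath("/data/history")
df = spark.read.parquet("/data/history")
print(df.count())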
