How to read multiple CSV (leaving out specific ones) from a nested directory in PySpark? - apache-spark

Let's say I have a directory called 'all_data', and inside it several other directories named after the date of the data they contain. These directories are named date_2020_11_01 through date_2020_11_30, and each one contains CSV files which I intend to read into a single DataFrame.
But I don't want to read the data for date_2020_11_15 and date_2020_11_16. How do I do it?

I'm not sure how to exclude certain paths outright, but you can specify ranges of names using the glob syntax (alternation with {} and character classes with []) that Spark's file readers accept. The pattern below selects every day's directory except 11_15 and 11_16:
spark.read.csv("all_data/date_2020_11_{0[1-9],1[0-4],1[7-9],2[0-9],30}/*.csv")

df = spark.read.format("csv").option("header", "true").load(paths)
where paths is a list of all the directories where the data is present. This worked for me.
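A minimal sketch of building that paths list, assuming the all_data layout from the question and an existing SparkSession named spark:
# Build the list of daily directories, skipping the two excluded dates.
excluded = {15, 16}
paths = [f"all_data/date_2020_11_{d:02d}" for d in range(1, 31) if d not in excluded]

df = spark.read.format("csv").option("header", "true").load(paths)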

A simple method is to read the whole data directory as-is and apply a filter condition:
df.filter("dataColumn NOT IN ('date_2020_11_15', 'date_2020_11_16')")
Otherwise you can use the os module to list the directory and iterate over that list, eliminating those date directories with a condition.
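The filter above assumes some column (the hypothetical dataColumn) already records the date. If nothing in the data does, a sketch using input_file_name() can filter on the source path instead:
from pyspark.sql.functions import col, input_file_name

# Read everything, then drop rows that came from the two excluded folders.
df = spark.read.option("header", "true").csv("all_data/*/*.csv")
df = (df.withColumn("source_file", input_file_name())
        .filter(~col("source_file").rlike("date_2020_11_1[56]"))
        .drop("source_file"))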

Related

Python 3.x: how to combine gzipped dictionary log files that exist in two different directories into another directory

I have the below structure of directories containing .log.gz files. These log files hold string dictionaries. Either of these directories may not exist.
xyz1/
2022-08-08T01:31Z.log.gz
2022-08-08T01:33Z.log.gz
xyz2/
2022-08-08T01:30Z.log.gz
2022-08-08T01:33Z.log.gz
I want to create another directory and combine the above files:
xyz/
2022-08-08T01:30Z.log.gz
2022-08-08T01:31Z.log.gz
2022-08-08T01:33Z.log.gz
Conditions:
Either one of xyz1 and xyz2 may exist, or both can exist.
If a file with the same name exists in both directories, combine the two into the third directory "xyz".
While combining, the string dictionary format should be retained.
The solution I opted for:
Check whether one directory exists and iterate over its files.
For each file, check whether the same file exists in the other directory.
If yes, decompress both files, combine them, and gzip the result into the xyz directory.
If not, copy it into xyz as-is.
Is there any better way to perform the above operation? Below is how I combine two log files.
import gzip, json

# Later updates overwrite earlier keys when both files share a key.
combinefile = {}
combinefile.update(json.load(gzip.open("xyz1/file1.log.gz", "rt")))
combinefile.update(json.load(gzip.open("xyz2/file1.log.gz", "rt")))
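A minimal sketch of the full merge, assuming the layout above (the directory and file names come from the question; the pathlib-based traversal is my own):
import gzip
import json
from pathlib import Path

src_dirs = [Path("xyz1"), Path("xyz2")]
out_dir = Path("xyz")
out_dir.mkdir(exist_ok=True)

# Every filename that appears in at least one existing source directory.
names = {p.name for d in src_dirs if d.is_dir() for p in d.glob("*.log.gz")}

for name in names:
    merged = {}
    for d in src_dirs:
        f = d / name
        if f.exists():
            with gzip.open(f, "rt") as fh:
                merged.update(json.load(fh))  # later directories win on duplicate keys
    with gzip.open(out_dir / name, "wt") as fh:
        json.dump(merged, fh)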

Copy a set of files using ADF

I have 10 files in a folder and want to move 4 of them to a different location.
I tried 2 approaches to achieve this:
using a Lookup to retrieve the filenames from a JSON file, then feeding it to a ForEach iterator
using Get Metadata to get the file names from the source folder and then adding an If Condition inside a ForEach to copy the files.
But in both cases, all the files in the source folder get copied.
Any help would be appreciated.
Thanks!!
There are 3 ways you can consider selecting your files, depending on the requirement or blockers.
Check out the official MS doc: Copy activity properties
1. Dynamic content for FilePath property in Source Dataset.
2. You can use wildcard characters in the source folder and file path in the source dataset.
Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character); use ^ to escape if your actual folder name has a wildcard or this escape char inside. See more examples in Folder and file filter examples.
3. List of Files
Point to a text file that includes a list of files you want to copy, one file per line, where each line is the relative path to the path configured in the dataset. When using this option, do not specify a file name in the dataset. See more examples in File list examples.
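For instance, a hypothetical filelist.txt for moving 4 of the 10 files (the filenames are placeholders; paths are relative to the folder configured in the dataset):
file1.csv
file2.csv
file5.csv
file8.csv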
Example:
Parameterize the source dataset and set the source file name to whichever value passes the expression evaluation in the If Condition activity.

Azure Data Factory removing spaces from column names of csv file

I'm a bit new to Azure Data Factory, so apologies if I'm missing anything obvious. I've done several searches and I can't find anything that quite fits.
The situation is that we have an existing pipeline that takes the path to a CSV file and passes it in as a delimited dataset. As a sink it uses a Parquet dataset. This is a generic process that we can pass any delimited file into, and it will output it as Parquet.
This has been working well, but now we have started receiving files with spaces and special characters in the header, which causes the output to Parquet to fail. Unfortunately we don't have control over the format of the files we receive, so I can't handle this at the source.
What I would like to do is, on ingestion of the file, replace any spaces and other special characters in the header with an underscore. If I were doing this on premises I could quickly create a PowerShell script to do it. I had thought about creating a custom task in ADF to call a PowerShell script to do this in blob storage, but that seems more complicated than it should be. Is there something else I can do to get this process working while keeping it generic?
As Joel Cochran mentioned, you can use the below expression in a Select transformation to replace spaces and special characters in the header.
regexReplace($$,'[^a-zA-Z]','_')
Note that this pattern also replaces digits; use '[^a-zA-Z0-9]' if the headers should keep them.
In the Select transformation, remove the auto mappings and add a new rule-based mapping that uses this expression.
You can't change the output filename directly in the Copy activity, assuming you are using that activity.
The workaround is to use a parameter for the output filename that you can clean up.
You can use the Get Metadata activity to get all filenames from the source CSV files.
Then loop over these files with a ForEach activity.
Within the ForEach activity you can set the output filename to the new, cleaned value.
The function could look like this:
@replace(item().name, ' ', '_')
More information on the replace function

ADF Azure Data Factory loop over folder syntax - wildcard?

I'm trying to loop over different country folders that each have a fixed subfolder named survey (i.e. Spain/survey, USA/survey).
Where and how do I need to define a wildcard/parameter for the countries so I can loop over all the files in the survey folders?
What is the right wildcard syntax (the equivalent of LIKE 'survey%' in SQL)?
I tried several ways to define it with no success and I would be happy to get some help on this. Thanks!
If the list of paths is static, you can create a parameter, or add the paths to a SQL database and get the result from a Lookup activity.
Pass the output to a ForEach activity, and within the ForEach use a Copy activity.
You can parameterize the input dataset to get the file paths; that way you need not think about wildcard characters at all but use the actual paths themselves.
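If you do want the wildcard route instead (the equivalent of LIKE 'survey%' in SQL), a hypothetical setup based on the wildcard rules quoted earlier would be a wildcard folder path of
*/survey
with a wildcard file name of *, so every file under each country's survey subfolder is picked up.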
Hope this is helpful.

Run IBM DataStage job with different files in the same job

I created a job to load Excel data into a database. I need the job to be reusable for different versions of the Excel file. The columns of the Excel will be the same; only the values will change, so it's like inserting the newest Excel values into the database.
For example, the files sales_report_january.xlsx and sales_report_february.xlsx both have the same columns; only the row values differ. I need the job to be able to process both files without changing anything except the file path, because recreating a separate job that is identical in everything (except the file path) for the same task seems inefficient.
Is it possible to do this in IBM DataStage, or do I need to remap everything even though nothing else needs to change? I already tried changing the file path manually, but it raised an error.
In a word: Parameter
Construct your job using a job parameter for the pathname of the Excel workbook.
Whichever stage you are using to read the worksheet will have the workbook name set up as reference(s) to that parameter.
Tip: Use two parameters; one for the dirname part of the pathname and one for the actual name of the workbook. This is a more flexible design in the long run.
I can think of at least four ways to do this. Usually, if the files are all in the same directory, we use looping in the sequence job to process a list of the file names obtained through an appropriate command (such as ls -m pattern for UNIX/Linux). Capture the output, convert the newlines to a delimiter such as a comma if necessary, and use that list in the StartLoop activity.
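For example, with hypothetical filenames, ls -m sales_report_*.xlsx would return something like
sales_report_february.xlsx, sales_report_january.xlsx
which, once any newlines are stripped, is already a comma-delimited list suitable for the StartLoop activity.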
