How to load all config files in a Pythonic way? - linux

Just want to know whether there is a proper way to load multiple config files into Python scripts.
The directory structure is as below.
dau
|-APPS
|---kafka
|---brokers
|-ENVS
As per the above, my base directory is dau. I'm planning to keep the scripts in the kafka and brokers directories. All global environment files are stored in the ENVS directory in ".ini" format. I want to load those ini files into all the scripts without adding them one by one, because we may have to add more environment files in the future; in that case we don't want to have to add them manually to each and every script.
Sample env.ini
[DEV]
SERVER_NAME = dev123.abcd.net
I was trying to use the answer from the link below, but we still have to add the files manually, and if the parent path of the dau directory changes, we have to edit the code.
Stack-flow-answer

Hi, I came up with the solution below. Thanks for the support.
The code below gathers all the .ini files as a list and returns it.
import os

def All_env_files():
    try:
        BASE_PATH = os.path.abspath(os.path.join(__file__, "../.."))
        ENV_INI_FILES = [os.path.join(BASE_PATH + '/ENVS/', each)
                         for each in os.listdir(BASE_PATH + '/ENVS')
                         if each.endswith('.ini')]
        return ENV_INI_FILES
    except ValueError:
        raise ValueError('Issue with Gathering Files from ENVS Directory')
The code below takes the list of ini files and provides it to ConfigParser.
import ConfigParser, sys, os

"""
This is for the kafka broker status check
"""

# Get the base path
Base_PATH = os.path.abspath(os.path.join(__file__, "../../.."))
sys.path.insert(0, Base_PATH)

# Import the Configs python module from ../Configs.py
import Configs, edpCMD

# Take all the ENVS ini files as a list
List_ENVS = Configs.All_env_files()
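For completeness, a minimal sketch of feeding that list into ConfigParser and reading a value back; the section and key come from the sample env.ini above, and the rest is hypothetical usage, not part of the original scripts.
# Read every discovered .ini file into one parser; read() accepts a list of paths.
config = ConfigParser.ConfigParser()
config.read(List_ENVS)
dev_server = config.get('DEV', 'SERVER_NAME')
print(dev_server)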
Feel free to suggest a shorter way to do this.

Related

Loop through multiple folders and subfolders using Pyspark in Azure Blob container (ADLS Gen2)

I am trying to loop through multiple folders and subfolders in an Azure Blob container and read multiple XML files.
E.g.: I have files in YYYY/MM/DD/HH/123.xml format.
Similarly, I have multiple subfolders under month, date, and hour, and multiple XML files at the end.
My intention is to loop through all these folders and read the XML files. I have tried a few Pythonic approaches which did not give me the intended result. Can you please help me with any ideas for implementing this?
import glob, os

for filename in glob.iglob('2022/08/18/08/225.xml'):
    if os.path.isfile(filename):  # code does not enter the for loop
        print(filename)
import os

dir = '2022/08/19/08/'
r = []
for root, dirs, files in os.walk(dir):  # code not moving past this for loop, no exception
    for name in files:
        filepath = root + os.sep + name
        if filepath.endswith(".xml"):
            r.append(os.path.join(root, name))
return r
glob is a Python function and it won't recognize the blob folder paths directly when the code runs in PySpark; we have to give the path from the root. Also, make sure to specify recursive=True.
For example, I checked the above PySpark code in Databricks, and the os code as well, and got no results, because we need to give the absolute root, i.e. the path starting from the root folder.
glob code:
import glob, os

for file in glob.iglob('/path_from_root_to_folder/**/*.xml', recursive=True):
    print(file)
For me, in Databricks, the root to access is /dbfs (I used csv files in my repro).
Using os, rooted the same way, the blob files from the folders and subfolders are listed.
I used Databricks for my repro after mounting the container. Wherever you are running this code in PySpark, make sure you give the root of the folder in the path, and when using glob, set recursive=True as well.
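A minimal sketch of that os.walk variant; the mount point /dbfs/mnt/your_container is an assumption, so substitute your own mounted path.
import os

r = []
# Walk the mounted container starting from the root; the path below is a placeholder
for root, dirs, files in os.walk('/dbfs/mnt/your_container/2022/'):
    for name in files:
        if name.endswith('.xml'):
            r.append(os.path.join(root, name))
print(r)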
There is an easier way to solve this problem with PySpark!
The tough part is that all the files have to have the same format. In the Azure Databricks sample directory there is a /cs100 folder that has a bunch of files that can be read in as text (line by line).
The trick is the option called "recursiveFileLookup". It assumes the directories were created by Spark; you cannot mix and match files.
I added the name of the input file to the dataframe, and, last but not least, I converted the dataframe to a temporary view.
Looking at a simple aggregate query, we have 10 unique files; the biggest has a little more than 1M records.
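A minimal sketch of this approach, assuming the spark session that Databricks provides; the sample path under /databricks-datasets and the view name are assumptions, not from the original answer.
from pyspark.sql import functions as F

# Recursively load text files and tag each row with its source file
df = (spark.read
      .format("text")
      .option("recursiveFileLookup", "true")
      .load("/databricks-datasets/cs100/lab2/data-001/")  # hypothetical sample path
      .withColumn("input_file", F.input_file_name()))

# Expose the dataframe as a temporary view and run a simple aggregate query
df.createOrReplaceTempView("raw_files")
spark.sql("select input_file, count(*) as cnt from raw_files group by input_file").show()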
If you need to cherry-pick files from a mixed directory, this method will not work. However, I think that is an organizational cleanup task rather than a reading one.
Last but not least, use the correct formatter to read XML:
spark.read.format("com.databricks.spark.xml")

Import Python custom package, with Args, into another py File from a network location directory

I need to build a solution for a use case, and I am still a bit of a novice on Python 3.9.9's capabilities.
Use Case:
User Billy wants to run a script against a Snowflake Azure database, call it sandbox, using his own Python script on his local machine.
Billy's Python script, to keep connection settings secure, needs to call a snowflake_conn.py script, which is located in another network folder location (\abs\here\is\snowflake_conn.py), and pass arguments for DB & Schema.
The call will return a connection to Snowflake Billy can use to run his SQL script.
I am envisioning something like:
import pandas as pd
import snowflake_conn # I need to know how to find this in a network folder, not local.
# and then call the custom conn function
snowflake_connect('database','schema')
# where it returns the snowflake.connector.connect.cursor() as sfconn
conn1 = sfconn.conn()
qry = r'select * from tablename where 1=1'
conn1.execute(qry)
df = conn1.fetch_pandas_all()
I saw something like this, but that was from back in 2016 and likely prior to 3.9.9.
import sys
sys.path.insert(0, "/network/modules/location") # OR "\\abs\here\is\" ??
import snowflake_conn
That snowflake_conn.py file uses a configparser.ConfigParser().read() call to open a config.ini file in the same folder as the snowflake_conn.py script.
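Not from the original post, but a minimal sketch of what snowflake_conn.py might look like so that it finds config.ini next to itself no matter where it is imported from; the section and key names are assumptions.
import os
import configparser

import snowflake.connector


def snowflake_connect(database, schema):
    # Resolve config.ini relative to this module, not the caller's working directory
    cfg_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'config.ini')
    parser = configparser.ConfigParser()
    parser.read(cfg_path)
    creds = parser['snowflake']  # hypothetical section name
    conn = snowflake.connector.connect(
        account=creds['account'],
        user=creds['user'],
        password=creds['password'],
        database=database,
        schema=schema,
    )
    return conn.cursor()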
I am following the instructions in another Stack Overflow question (linked below, which is 4 years old) to help get the config.ini setup completed.
import my database connection with python
I also found this link, which seems to point only to a local folder structure, not a network folder.
https://blog.finxter.com/python-how-to-import-modules-from-another-folder/
Eventually I want to try to encrypt the .ini file to protect its contents for increased security, but I'm not sure where to start on that yet.

How to write a batch of data to Django's sqlite db from a custom-written file?

For a pet project I am working on, I need to import a list of people into the sqlite db. I have a 'Staff' model, as well as a users.csv file with a list of users. Here is how I am doing it:
import csv
from staff.models import Staff

with open('users.csv') as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        firstname = row['firstname']
        lastname = row['lastname']
        email = row['email']
        staff = Staff(firstname=firstname, lastname=lastname, email=email)
        staff.save()
csv_file.close()
However, I am getting below error message:
raise ImproperlyConfigured(
django.core.exceptions.ImproperlyConfigured: Requested setting INSTALLED_APPS, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.
Is what I am doing correct? If yes, what am I missing here?
Django needs some environment variables when it is being bootstrapped to run. DJANGO_SETTINGS_MODULE is one of these, and it is used to configure Django from your settings. Typically many developers don't even notice, because if you stay in Django-land it isn't a big deal. Take a look at manage.py and you'll notice it sets that variable there.
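For reference, this is what the error means by defining the variable yourself: a minimal sketch of bootstrapping Django from a standalone script, where the settings module name myproject.settings is an assumption.
import os

import django

# Point Django at your settings module, then initialise the app registry
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')  # hypothetical name
django.setup()

from staff.models import Staff  # import models only after django.setup()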
The simplest thing is to stay in Django-land and run your script in its framework. I recommend creating a management command. Perhaps a more proper way is to create a data migration and put the data in a storage place like S3, if this is something many people need to do for their local databases, but a management command seems like the way to go for you. Another option (and the simplest if this is really just a one-time thing) is to just run this from the Django shell. I'll put that at the bottom.
It's very simple and you can drop in your code almost as you have it. Here are the docs :) https://docs.djangoproject.com/en/3.2/howto/custom-management-commands/
For you it might look something like this:
/app/management/commands/load_people.py <-- the file name here is what manage.py will use to run the command later.
from django.core.management.base import BaseCommand, CommandError
import csv
from staff.models import Staff


class Command(BaseCommand):
    help = 'load people from csv'

    def handle(self, *args, **options):
        with open('users.csv') as csv_file:
            csv_reader = csv.DictReader(csv_file, delimiter=',')
            line_count = 0
            for row in csv_reader:
                firstname = row['firstname']
                lastname = row['lastname']
                email = row['email']
                staff = Staff(firstname=firstname, lastname=lastname, email=email)
                staff.save()
            # csv_file.close()  # you don't need this since you used `with`
which you would call like this:
python manage.py load_people
Finally, the simplest solution is to just run the code in the Django shell.
python manage.py shell
will open up an interactive shell with everything loaded properly. You can execute your code there and it should work.

Automate the Script whenever a new folder/file is added in directory in Python

I have multiple folders in a directory and each folder has multiple files. I have code which checks for a specific file in each folder and does some data preprocessing and analysis if that specific file is present.
A snippet of it is given below.
import pandas as pd
import json
import os

rootdir = os.path.abspath(os.getcwd())
df_list = []
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file.startswith("StudyParticipants") and file.endswith(".csv"):
            temp = pd.read_csv(os.path.join(subdir, file))
            .....
            .....
            'some analysis'
Merged_df.to_excel(path + '\Processed Data Files\Study_Participants_Merged.xlsx')
Now, I want to automate this process. I want this script to be executed whenever a new folder is added. This is my first time exploring an automation process and I have been stuck on this for quite a while without major progress.
I am using windows system and Jupyter notebook to create these dataframes and perform analysis.
Any help is greatly appreciated.
Thanks.
I've written a script which you only need to run once and it will work.
Please note:
1.) This solution does not take into account which folder was created. If this information is required I can rewrite the answer.
2.) This solution assumes folders won't be deleted from the main folder. If this isn't the case, I can rewrite the answer as well.
import time
import os


def DoSomething():
    pass


if __name__ == '__main__':
    # go to the folder of interest
    os.chdir('/home/somefolders/.../A1')
    # get the current number of folders inside it
    N = len(os.listdir())
    while True:
        time.sleep(5)  # sleep for 5 secs
        if N != len(os.listdir()):
            print('New folder added! Doing something useful...')
            DoSomething()
            N = len(os.listdir())  # update N
Take a look at watchdog (a minimal sketch using it follows after the steps below).
http://thepythoncorner.com/dev/how-to-create-a-watchdog-in-python-to-look-for-filesystem-changes/
You could also code a very simple watchdog service on your own:
list all files in the directory you want to observe
wait a time span you define, say every few seconds
make a list of the filesystem again
compare the two lists and take their difference
the resulting list from this difference is your set of filesystem changes
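A minimal sketch using the watchdog package (pip install watchdog); the watched path '.' is a placeholder for your directory of interest.
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class NewFolderHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Fires for anything created in the watched path; react only to new folders
        if event.is_directory:
            print(f'New folder added: {event.src_path} - running analysis...')
            # call your processing function here


if __name__ == '__main__':
    observer = Observer()
    observer.schedule(NewFolderHandler(), path='.', recursive=False)  # '.' is a placeholder
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()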
Best greetings

pandas : read_csv not accepting relative path

I have Python code in a Jupyter notebook and the accompanying data in the same folder. I will be bundling both the code and data into a zip file and submitting it for evaluation. I am trying to read the data inside the notebook using pandas.read_csv with a relative path and that's not working; the API doesn't seem to work with a relative path. What is the correct way to handle this?
Update:
My findings so far seem to suggest that I should be using os.chdir() to set the current working directory. But I wouldn't know where the zip file will get extracted. The code is supposed to be read-only, so I cannot expect the receiver to update the path as appropriate.
You could join the current working directory with the relative path to avoid the problem, like so:
import os
import pandas as pd

BASE_DIR = os.getcwd()
csv_path = "csvname.csv"
df = pd.read_csv(os.path.join(BASE_DIR, csv_path))
where csv_path is the relative path.
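An equivalent sketch using pathlib; the filename csvname.csv is just the placeholder from the snippet above.
from pathlib import Path

import pandas as pd

# Resolve the file against the current working directory, then read it
csv_path = Path.cwd() / "csvname.csv"
df = pd.read_csv(csv_path)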
I think first of all you should unzip the file, then you can run the code. You may use the code below to unzip the file:
from zipfile import ZipFile

file_name = "folder_name.zip"
with ZipFile(file_name, 'r') as zip:
    zip.extractall()
    print("Done !")
