Calculate size of data in each directory using PySpark - Azure

I am using the following code snippet to calculate the size of all the folders in each directory. I can pass the file path as a parameter in the form of a widget, and I can meet the requirement by giving the directory names one after the other. However, the requirement is to get the size of the folders recursively.
For example, the input paths are as below:
/mnt/stoREC/datamart/export/
/mnt/stoREC/datamart/export//BRedem/
/mnt/stoREC/datamart/export/Sell/
/mnt/ADLS/Prepared/ModelExecution/gen/
/mnt/ADLS/Prepared/ModelExecution/hhp/
The expected output is:
/mnt/stoREC/datamart/export/ 457783298
/mnt/stoREC/datamart/export//BRedem/ 846262827
/mnt/stoREC/datamart/export/Sell/ 88736291
/mnt/ADLS/Prepared/ModelExecution/gen/ 346727682
/mnt/ADLS/Prepared/ModelExecution/hhp/ 52781528
Below is the code I am using:
def dirsize(path):
    total = 0
    dir_files = dbutils.fs.ls(path)
    for file in dir_files:
        if file.isDir():
            total += dirsize(file.path)
        else:
            total = file.size
    # print(path)
    return total
path = dbutils.fs.ls("/mnt/stoREC/datamart/export")
path2 = []
for i in path:
    path1 = i[0]
    # print(path1)
    x = path1.replace('dbfs:', '')
    path2 = path1[i]
    path1[i] = path2[i]
    path2[i] = path2[]
    print(x)
I believe I am getting the logic wrong somewhere, since I am trying to use swapping logic.

I modified the code and used it to find the size of the folders that each path contains. The following is a demonstration of the same.
The following is the list of input paths for which I need to calculate the size.
print(mypaths)
#output (my sample paths for demo)
['/mnt/repro', '/mnt/repro2', '/mnt/repro/a', 'mnt/repro2/b']
I have made slight changes to the dirsize() function code that you have provided.
def dirsize(path):
    total = 0
    dir_files = dbutils.fs.ls(path)
    for file in dir_files:
        # print(file)
        if file.isDir():
            # print(total)
            total += dirsize(file.path)
        else:
            total += file.size
            # print(file.path + "/ " + str(total))
    return total
Now you can iterate through your input paths (mypaths) to get the total size of the files/folders that each one holds.
for path in mypaths:
    print(path + " : " + str(dirsize(path)))
#output
/mnt/repro : 3445
/mnt/repro2 : 254
/mnt/repro/a : 2292
mnt/repro2/b : 254
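If the requirement is to cover every folder under a mount point rather than a fixed list, you can build mypaths automatically first. Below is a minimal sketch, assuming it runs in a Databricks notebook where dbutils is available and reusing dirsize() from above; the root path is the one from the question.

def list_folders(path):
    # collect every subfolder under path, recursively, using dbutils.fs.ls
    folders = []
    for item in dbutils.fs.ls(path):
        if item.isDir():
            folders.append(item.path)
            folders.extend(list_folders(item.path))
    return folders

root = "/mnt/stoREC/datamart/export/"
mypaths = [root] + list_folders(root)
for p in mypaths:
    # strip the dbfs: prefix so the output matches the expected format
    print(p.replace("dbfs:", "") + " " + str(dirsize(p)))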

Related

Reading images from pdf and extract Text from it

Problem statement: I have a PDF which contains n pages, and each page has one image whose text I need to read and then perform some operations on.
What I tried: I have to do this in Python, and the library I found with the best results is pytesseract.
I am pasting the sample code which I tried:
fn = kw['fn'] = self.env.context.get('wfg_pg', kw['fn'])
zoom, zoom_config = self.get_zoom_for_doc(index), ' -c tessedit_do_invert=0'
if 3.3 < zoom < 3.5:
    zoom_config += ' --oem 3 --psm 4'
elif 0 != page_number_list[0]:
    zoom_config += ' --psm 6'
full_text, page_length = '', kw['doc'].pageCount
if recursion and index >= 10:
    return fn.get('most_correct') or fn.get(page_number_list[0])
mat = fitz.Matrix(zoom, zoom)  # increase resolution
for page_no in page_number_list:
    page = kw['doc'].loadPage(page_no)  # number of page
    pix = page.getPixmap(matrix=mat)
    with Image.open(io.BytesIO(pix.getImageData())) as img:
        text_of_each_page = str(pytesseract.image_to_string(img, config='%s' % zoom_config)).strip()
        fn[page_no] = text_of_each_page
        full_text = '\n'.join((full_text, text_of_each_page, '\n'))
_logger.critical(f"full text in load immage {full_text}")
args = (full_text, page_number_list)
load = recursion and self.run_recursion_to_load_new_image_to_text(*args, **kw)
if recursion and load:
    return self.load_image
return full_text
The issue: my PDF has dates like 1/13 and 1/7, but the library reads them as 143 and 1n, and in some places it reads 17 as 1). Also, after the text it outputs random symbols like { & . , = even though these are not in the PDF at all.
To improve accuracy:
1. I tried converting the image to .tiff format, but it didn't work for me.
2. I tried adjusting the resolution of the image.
You can use the pdftoppm tool to convert your images really fast, as it supports multi-threading by simply passing thread_count=(number of threads).
You can refer to this link for more info on the tool. Better input images also increase the accuracy of Tesseract.
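For illustration, here is a hedged sketch of that suggestion using the pdf2image wrapper around pdftoppm together with pytesseract; the file name sample.pdf, dpi=300 and thread_count=4 are assumptions, not values from the question.

import pytesseract
from pdf2image import convert_from_path

# convert_from_path drives pdftoppm and can render pages on several threads
pages = convert_from_path("sample.pdf", dpi=300, thread_count=4)

# OCR each rendered page and join the results
full_text = "\n".join(
    pytesseract.image_to_string(page, config="--oem 3 --psm 6").strip()
    for page in pages
)
print(full_text)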

pytorch-kaldi timit tutorial error shared_list[0] IndexError: list index out of range

# ***** Reading the Data ********
if processed_first:
    # Reading all the features and labels for this chunk
    shared_list = []
    p = threading.Thread(target=read_lab_fea, args=(cfg_file, is_production, shared_list, output_folder))
    p.start()
    p.join()
    data_name = shared_list[0]
    data_end_index = shared_list[1]
    fea_dict = shared_list[2]
    lab_dict = shared_list[3]
    arch_dict = shared_list[4]
    data_set = shared_list[5]
First I ran Kaldi's run.sh file. Before that, I edited cmd.sh, changing the original "call queue.pl" to "call run.pl", because I hit a bug when running the original source.
Reference: https://groups.google.com/g/kaldi-help/c/tokwXTLdGFY?pli=1
I found that fea_dict and lab_dict in data_io.py have no shared element. How can I get the TIMIT tutorial experiments to progress?
I'm running the experiment with the cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg file, only correcting the absolute Linux directories.
I referred to https://github.com/mravanelli/pytorch-kaldi/issues/185 and ran copy-feats, and then saw kaldierror::KaldiFatalError.

snakemake - replacing wildcards in input directive by anonymous function

I am writing a Snakemake workflow that will run a bioinformatics pipeline for several input samples. These input files (two for each analysis, one with the partial string match R1 and the second with the partial string match R2) start with a pattern and end with the extension .fastq.gz. Eventually I want to perform multiple operations, but for this example I just want to align the fastq reads against a reference genome using bwa mem. So for this example my input file is NIPT-N2002394-LL_S19_R1_001.fastq.gz and I want to generate NIPT-N2002394-LL.bam (see the code below specifying the directories where input and output are).
My config.yaml file looks like so:
# Run_ID
run: "200311_A00154_0454_AHHHKMDRXX"
# Base directory: the analysis directory from which I will fetch the samples
bd: "/nexusb/nipt/"
# Define the prefix
# will be used to subset the folders in bd
prefix: "NIPT"
# Reference:
ref: "/nexus/bhinckel/19/ONT_projects/PGD_breakpoint/ref_hg19_local/hg19_chr1-y.fasta"
And below is my snakefile
import os
import re

#############
# config file
#############
configfile: "config.yaml"

#######################################
# Parsing variables from config.yaml
#######################################
RUN = config['run']
BD = config['bd']
PREFIX = config['prefix']
FQDIR = f'/nexusb/Novaseq/{RUN}/Unaligned/'
BASEDIR = BD + RUN
SAMPLES = [sample for sample in os.listdir(BASEDIR) if sample.startswith(PREFIX)]
# explanation: in BASEDIR I have multiple subdirectories. The names of the subdirectories starting with PREFIX will be the names of the elements I want in the list SAMPLES, which eventually shall be my {sample} wildcard

#############
# RULES
#############
rule all:
    input:
        expand("aligned/{sample}.bam", sample = SAMPLES)

rule bwa_map:
    input:
        REF = config['ref'],
        R1 = FQDIR + "{sample}_S{s}_R1_001.fastq.gz",
        R2 = FQDIR + "{sample}_S{s}_R2_001.fastq.gz"
    output:
        "aligned/{sample}.bam"
    shell:
        "bwa mem {input.REF} {input.R1} {input.R2} | samtools view -Sb - > {output}"
But I am getting:
Building DAG of jobs...
WildcardError in line 55 of /nexusb/nipt/200311_A00154_0454_AHHHKMDRXX/testMetrics/snakemake/Snakefile:
Wildcards in input files cannot be determined from output files:
's'
When calling snakemake -np
I believe my error lies in the definitions of R1 and R2 in the input directive. I find it puzzling because according to the official documentation snakemake should interpret any wildcard as the regex .+. But it is not doing that for sample NIPT-PearlPPlasma-05-PPx, whose R1 and R2 should be NIPT-PearlPPlasma-05-PPx_S5_R1_001.fastq.gz and NIPT-PearlPPlasma-05-PPx_S5_R2_001.fastq.gz, respectively.
Take another look at the Snakemake tutorial on how input is inferred from output; in any case, I think the problem lies in this piece of code:
output:
    expand("aligned/{sample}.bam", sample = SAMPLES)
And needs to be changed into
output:
    "aligned/{sample}.bam"
What you had didn't work because expand("aligned/{sample}.bam", sample = SAMPLES) is evaluated up front into a concrete list like ["aligned/sample0.bam", "aligned/sample1.bam"]. When you remove the expand, you only give a "description" of what the output should look like, and thus Snakemake can infer the wildcards and the input.
edit:
It's difficult to test since I don't have the actual files, but you should do something like this. It won't work if multiple S-numbers exist for the same sample.
def get_reads(wildcards):
    R1 = FQDIR + f"{wildcards.sample}_S{{s}}_R1_001.fastq.gz"
    R2 = FQDIR + f"{wildcards.sample}_S{{s}}_R2_001.fastq.gz"
    globbed = glob_wildcards(R1)
    R1, R2 = expand([R1, R2], s=globbed.s)
    return {"R1": R1, "R2": R2}
rule bwa_map:
    input:
        unpack(get_reads),
        REF = config['ref']
    output:
        "aligned/{sample}.bam"
    shell:
        "bwa mem {input.REF} {input.R1} {input.R2} | samtools view -Sb - > {output}"
The problem is here:
rule bwa_map:
    input:
        REF = config['ref'],
        R1 = FQDIR + "{sample}_S{s}_R1_001.fastq.gz",
        R2 = FQDIR + "{sample}_S{s}_R2_001.fastq.gz"
    output:
        "aligned/{sample}.bam"
Your output clearly defines a pattern where {sample} is a wildcard. When Snakemake builds the DAG and finds that another rule requires a file that matches this pattern, it assigns a concrete value to the wildcard sample. At that moment all the inputs should be determined, but you are introducing one more level of indirection: the wildcard {s}, which is not defined.
The value of {s} has to be inferable from the output. If you can do that at design time, substitute it with the concrete values; otherwise you may use the checkpoint feature of Snakemake.
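As a hedged sketch of the design-time option, assuming exactly one S-number exists per sample in FQDIR, you could resolve {s} once with glob_wildcards when the Snakefile is parsed and use plain input functions:

# resolve the S-number per sample up front (assumes one match per sample)
S_NUMBER = {
    sample: glob_wildcards(FQDIR + sample + "_S{s}_R1_001.fastq.gz").s[0]
    for sample in SAMPLES
}

rule bwa_map:
    input:
        REF = config['ref'],
        R1 = lambda wc: FQDIR + f"{wc.sample}_S{S_NUMBER[wc.sample]}_R1_001.fastq.gz",
        R2 = lambda wc: FQDIR + f"{wc.sample}_S{S_NUMBER[wc.sample]}_R2_001.fastq.gz"
    output:
        "aligned/{sample}.bam"
    shell:
        "bwa mem {input.REF} {input.R1} {input.R2} | samtools view -Sb - > {output}"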

Run code on specific files in a directory separately (by the name of file)

I have N files in the same folder with different index numbers like
Fe_1Sec_1_.txt
Fe_1Sec_2_.txt
Fe_1Sec_3_.txt
Fe_2Sec_1_.txt
Fe_2Sec_2_.txt
Fe_2Sec_3_.txt
.
.
.
and so on
For example, if I need to run my code only on the files with time = 1 Sec, I can do it manually as follows:
path = "input/*_1Sec_*.txt"
files = glob.glob(path)
print(files)
which gave me:
Out[103]: ['input\\Fe_1Sec_1_.txt', 'input\\Fe_1Sec_2_.txt', 'input\\Fe_1Sec_3_.txt']
In case I need to run my code for all files separately (grouped by the measurement time in seconds, i.e. by the file name), I tried this code to get the path for each measurement time:
time = 0
while time < 4:
    time += 1
    t = str(time)
    path = ('"input/*_' + t + 'Sec_*.txt"')
which gives me:
"input/*_1Sec_*.txt"
"input/*_2Sec_*.txt"
"input/*_3Sec_*.txt"
"input/*_4Sec_*.txt"
After that I tried to use this path as follows:
files = glob.glob(path)
print(files)
But it doesn't pick up the wanted files and gives me:
"input/*_1Sec_*.txt"
[]
"input/*_2Sec_*.txt"
[]
"input/*_3Sec_*.txt"
[]
"input/*_4Sec_*.txt"
[]
Any suggestions, please??
I think the best way would be to simply do
for time in range(1, 5):  # 1,2,3,4
    glob_path = 'input/*_{}Sec_*.txt'.format(time)
    for file_path in glob.glob(glob_path):
        do_something(file_path, measurement)  # or whatever
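If the time values are not known in advance, a hedged alternative sketch is to group all matching files by the number in front of Sec with a regular expression; the input/ folder and file naming follow the question, and do_something stands in for the caller's own processing function.

import glob
import os
import re
from collections import defaultdict

# map each time value (in seconds) to the list of files carrying it in their name
groups = defaultdict(list)
for file_path in glob.glob('input/*Sec_*.txt'):
    match = re.search(r'_(\d+)Sec_', os.path.basename(file_path))
    if match:
        groups[int(match.group(1))].append(file_path)

for time, files in sorted(groups.items()):
    for file_path in files:
        do_something(file_path)  # process each time group separately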

Error in using os.path.walk() correctly

So I created the folder C:\TempFiles to test-run the following code snippet.
Inside this folder I had two files, nd1.txt and nd2.txt, and a folder C:\TempFiles\Temp2, which contained only one file, nd3.txt.
Now when I execute this code:
import os, file, storage

database = file.dictionary()
tools = storage.misc()
lui = -1  # last used file index
fileIndex = 1

def sendWord(wrd, findex):  # where findex is the file index
    global lui
    if findex != lui:
        tools.refreshRecentList()
        lui = findex
    if tools.mustIgnore(wrd) == 0 and tools.toRecentList(wrd) == 1:
        database.addWord(wrd, findex)  # else there's no point adding the word to the database, because it's either trivial, or has recently been added

def showPostingsList():
    print("\nPOSTING's LIST")
    database.display()

def parseFile(nfile, findex):
    for line in nfile:
        pl = line.split()
        for word in pl:
            sendWord(word.lower(), findex)

def parseDirectory(dirname):
    global fileIndex
    for root, dirs, files in os.walk(dirname):
        for name in dirs:
            parseDirectory(os.path.join(root, name))
        for filename in files:
            nf = open(os.path.join(root, filename), 'r')
            parseFile(nf, fileIndex)
            print(" --> " + nf.name)
            fileIndex += 1
            nf.close()

def main():
    dirname = input("Enter the base directory :-\n")
    print("\nParsing Files...")
    parseDirectory(dirname)
    print("\nPostings List has Been successfully created.\n", database.entries(), " word(s) sent to database")
    choice = ""
    while choice != 'y' and choice != 'n':
        choice = str(input("View List?\n(Y)es\n(N)o\n -> ")).lower()
        if choice != 'y' and choice != 'n':
            print("Invalid Entry. Re-enter\n")
    if choice == 'y':
        showPostingsList()

main()
Now I should traverse each of the three files only once, and I put a print of the file name to test that, but apparently I am traversing the inner folder twice:
Enter the base directory :-
C:\TempFiles
Parsing Files...
--> C:\TempFiles\Temp2\nd3.txt
--> C:\TempFiles\nd1.txt
--> C:\TempFiles\nd2.txt
--> C:\TempFiles\Temp2\nd3.txt
Postings List has Been successfully created.
34 word(s) sent to database
View List?
(Y)es
(N)o
-> n
Can anyone tell me how to modify the os.path.walk() usage to avoid this?
It's not that my output is incorrect, but it traverses one entire folder twice, and that's not very efficient.
Your issue isn't specific to Python 3, it's how os.walk() works - iterating already does the recursion to subfolders, so you can take out your recursive call:
def parseDirectory(dirname):
    global fileIndex
    for root, dirs, files in os.walk(dirname):
        for filename in files:
            nf = open(os.path.join(root, filename), 'r')
            parseFile(nf, fileIndex)
            print(" --> " + nf.name)
            fileIndex += 1
            nf.close()
By calling parseDirectory() for the dirs, you were starting another, independent walk of your only subfolder.
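As a quick check, this minimal sketch (using the hypothetical C:\TempFiles layout from the question) shows that a single os.walk() call already descends into every subfolder, so each file is visited exactly once:

import os

for root, dirs, files in os.walk(r"C:\TempFiles"):
    for filename in files:
        # every file, including those in Temp2, is reached by this one walk
        print(os.path.join(root, filename))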
