I created a script which takes file names from a CSV file and copies the image with the corresponding file name to another folder.
I found out that on my PC, if I use imwrite with a single backslash in the directory path it won't work, but when I load the directories the path gets converted to a single slash, and hence img becomes None. I have included a screenshot as well.
import os
import pandas as pd
import cv2

file = pd.read_csv("csv4.csv")
path = os.getcwd()
img_path = path + "\\kaggle_35000"
kaggle_folders = os.listdir(img_path)
length_cv = len(file)
for x in kaggle_folders:
    img_in_each_fldr = os.listdir(img_path + "\\" + x)
    print(">>>>>>>>>>>>>>>The primary folder now is: " + str(x) + "<<<<<<<<<<<<<<<<<<<<<<")
    for z in img_in_each_fldr:
        print("The image from folder " + str(x) + " is being checked; the file is " + str(z))
        # range(0, length_cv - 1) skipped the last row of the CSV
        for q in range(length_cv):
            x1 = str(file['16_left'][q]) + ".jpeg"
            if z == x1:
                # a backslash was missing between the folder name and the
                # file name, so imread returned None for every image
                write_path = img_path + "\\" + x + "\\" + x1
                img = cv2.imread(write_path, 1)
                destination = "D:\\image_extract_python\\result\\kaggle_stage0\\" + str(z)
                cv2.imwrite(destination, img)
                print('file written: ' + str(z))
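A way to sidestep the backslash handling entirely is to build every path with os.path.join, which picks the right separator for the OS. A minimal sketch of the lookup, with hypothetical folder and file names standing in for x and x1 above:

import os
import cv2

img_path = os.path.join(os.getcwd(), "kaggle_35000")
destination_dir = os.path.join("D:\\", "image_extract_python", "result", "kaggle_stage0")

read_path = os.path.join(img_path, "folder_name", "image_name.jpeg")  # hypothetical names
img = cv2.imread(read_path, 1)
if img is not None:  # imread returns None instead of raising on a bad path
    cv2.imwrite(os.path.join(destination_dir, "image_name.jpeg"), img)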
So first off, what I'm trying to do: create a PDF parser that will take ONLY tables out of any given PDF. I currently have some PDFs for parts manuals which contain an image of the part and then a table with details of the parts, and I want to scrape and parse the table data from the PDF into a CSV or similar Excel-style file (CSV, XLS, etc.).
What I've tried / am trying: I am currently using Python 3 and tabula (I have no preference for either of these and am open to other options), in which I have a .py program that is able to scrape all the data from any PDF or directory of PDFs. However, it takes EVERYTHING, including the image file code, which is a bunch of 0, 1, NaN (adding examples at the bottom). I was thinking of writing a filter function that removes these, but that feels like overkill, and I was wondering/hoping there is a way to filter out the images with tabula or another library? (Side note: I've also attempted camelot, but the module does not import correctly even though it is in my pip freeze; this happens on both my Mac M1 and Mac M2, so I'm assuming there is no ARM support.)
If anyone could help me, or guide me toward a library or method for iterating through all pages in a PDF and grabbing JUST the tables for export to CSV, that would be AMAZING!
current main file:
from tabula.io import read_pdf
import pandas as pd
from tabulate import tabulate
import os

def parser(fileName, count):
    print("\nFile Number: ", count, "\nNow parsing file: ", fileName)
    df = read_pdf(fileName, pages="all")  # returns a list of DataFrames, one per detected table
    for i in range(len(df)):
        df[i].to_excel("./output/test" + str(i) + ".xlsx")
        print(tabulate(df[i]))
    # traceback.print_tb was dropped: it expects a traceback object, not a list of DataFrames

def reader(type):
    filecount = 1
    if type == 'f':
        file = input("\nFile (f) type selected\nPlease enter the full file name with path (ex. Users/Name/directory1/filename.pdf): ")
        parser(file, filecount)
    elif type == 'd':
        # directory selected
        location = input("\nPlease enter the directory path; if in the same folder, just enter a period (.): ")
        print("Opening directory: ", location)
        # loop through and parse the directory
        for filename in os.listdir(location):
            f = os.path.join(location, filename)
            # checking if it is a file
            if os.path.isfile(f):
                parser(f, filecount)
                filecount += 1  # was "filecount + 1", which discarded the result
            else:
                print('\n\n ERROR, path given does not contain a file or is not a directory type..')
    else:
        print("Error: please select directory (d) or file (f)")

fileType = input("\n-----> Hello!\n----> Would you like to parse a directory (d) or file (f)? ").lower()
reader(fileType)
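One way to realize the filter function the question considers is to drop any extracted DataFrame whose cells are mostly NaN or 0/1 values (the image residue tabula picks up). A minimal sketch; looks_like_table, the 0.5 threshold, and manual.pdf are all assumptions to tune per document:

import pandas as pd
from tabula.io import read_pdf

def looks_like_table(df: pd.DataFrame, max_noise_ratio: float = 0.5) -> bool:
    # treat a frame as image residue if most of its cells are NaN, 0, or 1
    if df.empty:
        return False
    cells = df.astype(str).values.ravel()
    noise = sum(c in ("nan", "0", "1", "0.0", "1.0") for c in cells)
    return noise / len(cells) < max_noise_ratio

tables = [t for t in read_pdf("manual.pdf", pages="all") if looks_like_table(t)]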
Hello everyone, I am fairly new to using Python for data analysis, so apologies for silly questions:
IDE: PyCharm
What I have: a massive .xyz file (with 4 columns) which is a combination of several datasets. Each dataset can be identified by the third column of the file, which runs from 10,000 to -10,000 (passing through 0, with 100 as spacing) and then repeats, so every 201 rows is one dataset.
What I want to do: split the massive file into its individual datasets (201 rows each) and save each one under a different name.
What I have done so far:
# Import packages
import os
import pandas as pd
import numpy as np  # For next steps
import math  # For next steps

# Check and change directory
path = 'C:/Clayton/lines/profiles_aufmod'
os.chdir(path)
print(os.getcwd())  # Correct path is printed

# split the xyz file into different files for each profile
main_xyz = 'bathy_SPO_1984_50x50_profile.xyz'
number_lines = sum(1 for row in (open(main_xyz)))
print(number_lines)  # 10854 is the output
rowsize = 201
for i in range(number_lines, rowsize):
    profile_raw_df = pd.read_csv(main_xyz, delimiter=',', header=None, nrows=rowsize,
                                 skiprows=i)
    out_xyz = 'Profile' + str(i) + '.xyz'
    profile_raw_df.to_csv(out_xyz, index=False,
                          header=False, mode='a')
Problems I am facing:
The for loop was at first producing output files, as seen in the image (see "Proof of output"), but now it does not produce any outputs, and it is not rewriting the previous files either. The other mystery is that I am not getting an error either (see "Code executed without error").
What I tried to fix the issue:
I updated all the packages and restarted PyCharm.
I ran each line of code one by one, and everything works until the for loop.
While counting the number of rows in
number_lines = sum(1 for row in (open(main_xyz)))
you have exhausted the iterator that loops over the lines of the file, and you never close the file. However, this should not prevent pandas from reading the same file again.
A better idiom would be
with open(main_xyz) as fh:
    number_lines = sum(1 for row in fh)
Your for loop as it stands does not do what you probably want. I guess you want:
for i in range(0, number_lines, rowsize):
Here, rowsize is the step size of the range, not its end value.
If you want to number the output files by dataset, keep a count of the dataset, like this:
data_set = 0
for i in range(0, number_lines, rowsize):
    data_set += 1
    ...
    out_xyz = f"Profile{data_set}.xyz"
    ...
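Putting both fixes together, a minimal sketch of the corrected loop (file name and rowsize taken from the question; mode='a' is no longer needed because each dataset gets its own file):

import pandas as pd

main_xyz = 'bathy_SPO_1984_50x50_profile.xyz'
rowsize = 201

with open(main_xyz) as fh:
    number_lines = sum(1 for row in fh)

data_set = 0
for i in range(0, number_lines, rowsize):
    data_set += 1
    profile_raw_df = pd.read_csv(main_xyz, delimiter=',', header=None,
                                 nrows=rowsize, skiprows=i)
    out_xyz = f"Profile{data_set}.xyz"
    profile_raw_df.to_csv(out_xyz, index=False, header=False)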
I want to remove a file's last characters. The file name is some digits plus .py plus .BR, like 0001.py.BR or 0005.py.BR, and I want to remove the .BR from the string.
I tried this code
import os

x = input("")
os.rename(x, x[:7])
but it sometimes doesn't work for files whose names are longer, like 00001.py.BR: it renames them to 00001.p. So is there a way to just do something like x - ".BR"?
If you are talking about a file path, then use os.path.splitext():
>>> import os
>>> os.path.splitext('00001.py.BR')[0]
'00001.py'
>>>
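Combined with os.rename, a minimal sketch of the whole operation:

import os

x = input("")
os.rename(x, os.path.splitext(x)[0])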
You can use the string's split method like this:
import os

x = input("")
x_new = x.split(".BR")[0]
os.rename(x, x_new)
If you're using Python 3, check the standard pathlib module:
from pathlib import Path

old_path = Path(input(""))
if old_path.suffix == '.BR':
    # with_suffix('') drops the final suffix but keeps the parent directory;
    # renaming to old_path.stem alone would move the file into the current directory
    old_path.rename(old_path.with_suffix(''))
else:
    print('this is not a .BR file')
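For reference, a quick interactive check of what with_suffix('') produces (shown on a POSIX system):

>>> from pathlib import Path
>>> Path('00001.py.BR').with_suffix('')
PosixPath('00001.py')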
Normally I don't ask questions, because I find answers on this forum. This place is a goldmine.
I am trying to move some files from a legacy storage system (CIFS share) to Box using the Python SDK. It works fine as long as the file path is less than 255 characters.
I am using os.walk, passing the share name in Unix format, to list the files in the directory.
Here is the file name:
//dalnsphnas1.mydomain.com/c$/fs/hdrive/home/abcvodopivec/ENV Resources/New Regulation Review/Regulation Reviews and Comment Letters/Stormwater General Permits/CT S.W. Gen Permit/PRMT0012_FLPR Comment Letter on Proposed Stormwater Regulations - 06-30-2009.pdf
I also tried to escape the file name, but I still get FileNotFoundError, even though the file is there:
//dalnsphnas1.mydomain.com/c$/fs/hdrive/home/abcvodopivec/ENV Resources/New Regulation Review/Regulation Reviews and Comment Letters/Stormwater General Permits/CT S.W. Gen Permit/PRMT0012_FLPR\ Comment\ Letter\ on\ Proposed\ Stormwater\ Regulations\ -\ 06-30-2009.pdf
So I tried to shorten the path using win32api.GetShortPathName, but it throws the same FileNotFoundError. It works fine on files with a path length of less than 255 characters.
I also tried to copy the file using copyfile(src, dst) to another destination folder to work around the issue, and I still get the same error.
import os, sys
import argparse
import win32api
from shutil import copyfile  # copyfile was used below but never imported

parser = argparse.ArgumentParser(
    description='Migration Script',
)
parser.add_argument('-p', '--home_path', required=True, help='Home Drive Path')
args = vars(parser.parse_args())

if not args['home_path']:  # with required=True argparse already enforces this
    print("Usage : script.py -p <path>")
    print("-p <directory path>/")
    sys.exit()

dst = args['home_path'] + '/' + 'long_file_path_dir'
for dirname, dirnames, filenames in os.walk(args['home_path']):
    for filename in filenames:
        file_path = dirname + '/' + filename
        if len(file_path) > 255:
            #short_path = win32api.GetShortPathName(file_path)
            # copyfile needs a full destination file name, not a directory
            copyfile(file_path, os.path.join(dst, filename), follow_symlinks=True)
After a lot of trial and error, I figured out the solution (thanks to the Stack Overflow forum):
I switched from the Unix format to a UNC path.
Then I prefix each file path generated through os.walk with r'\\?\UNC', like below. A UNC path starts with two backslashes, so I have to drop one of them to make the prefix work:
file_path = (r'\\?\UNC' + file_path[1:])
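For completeness, a minimal sketch of that step, assuming file_path starts as a Unix-style //server/share/... path like the ones produced above (the trailing path components here are hypothetical). The \\?\ long-path prefix only accepts backslashes, so the forward slashes are normalized first:

# hypothetical Unix-style path as built by the walk above
file_path = '//dalnsphnas1.mydomain.com/c$/fs/hdrive/home/abcvodopivec/some/long/path/file.pdf'
file_path = file_path.replace('/', '\\')  # \\server\share\...\file.pdf
long_path = r'\\?\UNC' + file_path[1:]    # \\?\UNC\server\share\...\file.pdf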
Thanks again to everyone who responded.
Shynee
I have to find all files older than y days in an archival folder and move those files to some other folder. I have found the files older than y days in the archival folder and tried moving them to the other folder; I have written the code below in Python. While running the code I'm getting this error: "java.io.FileNotFoundException: /dbfs/FileStore/Archival/testparquet.parquet". I have checked, and the file exists in DBFS. Can someone please help me with this?
import os, time, sys

vFilePath = "/dbfs/FileStore/"
path = "/dbfs/FileStore/Archival/"
path1 = "dbfs:/FileStore/Archival/"
##### FOR dbutils path ###
vDbuPath = "/FileStore/Archival/"
deleteFullPath = "FileStore/Deleted/"

now = time.time()
print(now)

for f in os.listdir(path):
    Filename = str(f)  # was str(print(f)), which always yields "None"
    print(Filename)
    f = os.path.join(path, f)
    print(os.stat(f).st_mtime)
    if os.stat(f).st_mtime < now - 1 * 86400:
        print("f value: " + f)
        # os.path(f) is not callable; build the dbfs:/ URI that dbutils expects
        filename = path1 + Filename
        print("dbutilspath: " + filename)
        # os.path.exists needs the /dbfs/ FUSE path, not a dbfs:/ URI
        if not os.path.exists("/dbfs/" + deleteFullPath + Filename):
            dbutils.fs.mv(filename, "dbfs:/" + deleteFullPath + "testparquet.parquet", recurse=True)
One way to do this is by using the Hadoop FileSystem API. With the code below you get a list of dictionaries with the file dates and names.
You can then do your magic to move the files if they are old enough; a sketch of that step follows the listing.
import time
from time import mktime
from datetime import datetime

source_dir = "dbfs:/FileStore/Archival/"  # assumed: the directory to scan, from the question

list_of_files = []
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
path_exists = fs.exists(spark._jvm.org.apache.hadoop.fs.Path(source_dir))
if path_exists:
    file_list = fs.listFiles(spark._jvm.org.apache.hadoop.fs.Path(source_dir), True)
    while file_list.hasNext():
        file = file_list.next()
        # getModificationTime() returns milliseconds; strip the last three
        # digits to get seconds before converting to a datetime
        mod_time = int(str(file.getModificationTime())[:-3])
        list_of_files.append({'filedate': datetime.fromtimestamp(mktime(time.localtime(mod_time))),
                              'filename': str(file.getPath())})
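A minimal sketch of the moving step on Databricks, assuming a cutoff of y days and a destination directory dest_dir (both hypothetical names; dbutils is available inside notebooks):

from datetime import datetime, timedelta

y = 30                                 # hypothetical age threshold in days
dest_dir = "dbfs:/FileStore/Deleted/"  # hypothetical destination directory
cutoff = datetime.now() - timedelta(days=y)

for entry in list_of_files:
    if entry['filedate'] < cutoff:
        name = entry['filename'].rsplit('/', 1)[-1]
        dbutils.fs.mv(entry['filename'], dest_dir + name, recurse=True)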