How to read files from folder based on column value of dataframe

I have a column of numbers. For each number, I want to check whether it matches a file name in a folder: if it does, read that file; if not, move on to the next number.
df=pd.DataFrame({'x':['2000','5000','10000']})
files_folder:
P2000.csv
P4000.csv
P5000.csv
P6000.csv
P4000.csv
result:
read files :
P2000.csv
P5000.csv

Use glob and test each value as a substring of the file path with any():
import glob
import pandas as pd
df = pd.DataFrame({'x': ['2000', '5000', '10000']})
for f in glob.glob('files_folder/*.csv'):
    # keep the file if any value from column x occurs in its path
    if any(x in f for x in df['x']):
        print(f)
Output:
files_folder\P2000.csv
files_folder\P5000.csv
The same filter as a list comprehension:
files = [f for f in glob.glob('files_folder/*.csv') if any(x in f for x in df['x'])]
print(files)
Output:
['files_folder\\P2000.csv', 'files_folder\\P5000.csv']
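The substring test above would also accept a file such as P20000.csv for the value '2000'. A minimal sketch of an exact-name match follows, assuming every file is named "P" + value + ".csv" as in the question. The function name matching_files and its prefix and suffix parameters are illustrative, not part of the original answer.

```python
from pathlib import Path


def matching_files(folder, values, prefix="P", suffix=".csv"):
    """Return paths in `folder` named exactly prefix + value + suffix."""
    folder = Path(folder)
    # Build the expected file name for each value, keeping the order of `values`.
    wanted = [f"{prefix}{v}{suffix}" for v in values]
    # Keep only the names that actually exist as files on disk.
    return [folder / name for name in wanted if (folder / name).is_file()]
```

With the question's data, matching_files('files_folder', df['x']) should return the paths for P2000.csv and P5000.csv, and each path can then be passed to pd.read_csv().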

You can use glob.glob() to list all CSV files in your files_folder, then use apply() to check whether each value in x appears in that list of file names.
import glob
import numpy as np
files = glob.glob("/path/to/*.csv")
# map each value to its file name if a matching file exists, otherwise NaN, then drop the NaNs
files = df['x'].apply(lambda x: f'P{x}.csv' if any(f'P{x}' in k for k in files) else np.nan).dropna().tolist()
print(files)
Output:
['P2000.csv', 'P5000.csv']

Related

python3: How to append a list to an empty list in an external python file?

There is a list in the main file:
#main_file.py
f = ['foo','bar']
that I want to append to an empty list in an external Python file.
The external file already has an empty list g:
#external_file.py
g = []
So far I have tried this in the main file:
#main_file.py
h = open("external_file.py", "a")
g.append(f)
h.close()
But it didn't work.

You can do:
import re
with open("external_file.py", "r+") as h:
    stmt = h.read()
    exec(stmt)  # run the text "g = []" so that g exists as a real list
    val_name = stmt[:stmt.find("=")].strip()  # the variable name left of "="
    eval(val_name).extend(f)
    h.seek(0)
    h.write(re.sub(r"\[.*\]", str(eval(val_name)), stmt))
Because you are reading from a file, the statement "g = []" is just text, so you need to exec() it to turn it into an actual assignment g = [].
After that, extend g with the list f, replace the original "[]" in the text with the extended list, and write the result back to the file.

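You can try importing external_file.py first and then appending the list f from main_file.py:
#main_file.py
import external_file as externalFile
f = ['foo','bar']
externalFile.g.append(f)
#print(externalFile.g)
Note that this changes g only in memory for the running program; external_file.py on disk is not modified. Also, append(f) adds f as a single nested element, while extend(f) adds its items one by one.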
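If the goal is to change external_file.py on disk without executing its contents, one option is to parse it with the ast module. The sketch below assumes the file contains a top-level assignment of a plain list literal, such as g = [], and that the assignment text occurs only once in the file. The function name extend_list_in_file is hypothetical.

```python
import ast
from pathlib import Path


def extend_list_in_file(path, name, new_items):
    """Rewrite the top-level `name = [...]` in `path` with new_items added."""
    source = Path(path).read_text()
    for node in ast.parse(source).body:
        # Look for a simple top-level assignment to `name`, e.g. "g = []".
        if (isinstance(node, ast.Assign)
                and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)
                and node.targets[0].id == name):
            old_text = ast.get_source_segment(source, node)
            # literal_eval only accepts literals, so no code from the file runs.
            updated = list(ast.literal_eval(node.value)) + list(new_items)
            # Assumption: the first occurrence of the old text is the assignment.
            new_source = source.replace(old_text, f"{name} = {updated!r}", 1)
            Path(path).write_text(new_source)
            return updated
    raise ValueError(f"no top-level assignment to {name!r} found in {path}")
```

Called as extend_list_in_file("external_file.py", "g", f), this should leave the file reading g = ['foo', 'bar'].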

Compare by NAME only, and not by NAME + EXTENSION using existing code; Python 3.x

The Python 3.x code listed below does a great job of comparing files from two different directories (Input_1 and Input_2) and finding the files that are the same in both. Is there a way I can alter the existing code to find files that match BY NAME ONLY between the two directories (i.e. match by name only, not name + extension)?
comparison = filecmp.dircmp(Input_1, Input_2) #Specifying which directories to compare
common_files = ', '.join(comparison.common) #Finding the common files between the directories
TextFile.write("Common Files: " + common_files + '\n') # Writing the common files to a new text file
Example:
Directory 1 contains: Tacoma.xlsx, Prius.txt, Landcruiser.txt
Directory 2 contains: Tacoma.doc, Avalon.xlsx, Rav4.doc
"TACOMA" are two different files (different extensions). Could I use basename or splitext somehow to compare files by name only and have it return "TACOMA" as a matching file?

To get the file name without its extension, try:
from os import path
fil = r'..\file.doc'  # raw string, so that \f is not read as a form-feed escape
fil_name = path.splitext(fil)[0].split('\\')[-1]
This stores file in fil_name. So to compare files, run:
from os import listdir, path
from os.path import isfile, join
def compare(dir1, dir2):
    # list only regular files, skipping subdirectories
    files1 = [f for f in listdir(dir1) if isfile(join(dir1, f))]
    files2 = [f for f in listdir(dir2) if isfile(join(dir2, f))]
    common_files = []
    for i in files1:
        for j in files2:
            if path.splitext(i)[0] == path.splitext(j)[0]:  # compare names without extensions
                common_files.append(i)
    return common_files
Now just call it:
common_files = compare(dir1, dir2)
String comparison in Python is case-sensitive. If you want to match names regardless of upper or lower case, then instead of:
if path.splitext(i)[0] == path.splitext(j)[0]:
use:
if path.splitext(i)[0].lower() == path.splitext(j)[0].lower():
Your code worked very well! Thank you again, Infinity TM! The final version of the code is as follows, for anyone else to look at. (Note that Input_3 and Input_4 are the directories.)
def Compare():
    Input_3 = #Your directory here
    Input_4 = #Your directory here
    files1 = [f for f in listdir(Input_3) if isfile(join(Input_3, f))]
    files2 = [f for f in listdir(Input_4) if isfile(join(Input_4, f))]
    common_files = []
    for i in files1:
        for j in files2:
            if path.splitext(i)[0].lower() == path.splitext(j)[0].lower():
                common_files.append(path.splitext(i)[0])
    return common_files
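The nested loops above compare every file in one directory with every file in the other. A shorter sketch of the same idea uses set intersection on lower-cased names without extensions. The function name common_stems is an assumption, and pathlib stands in for the os.path calls used in the answer.

```python
from pathlib import Path


def common_stems(dir1, dir2):
    """Return the sorted, lower-cased file names (without extension) found in both directories."""
    def stems(directory):
        # Path.stem drops the last extension, like os.path.splitext(...)[0].
        return {p.stem.lower() for p in Path(directory).iterdir() if p.is_file()}
    return sorted(stems(dir1) & stems(dir2))
```

For the example directories in the question, this should return ['tacoma']. Because it works on sets, each name appears once even if several files share it.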

os.walk help - processing data in chunks - python3

I have some files that are scattered in many different folders within a directory, and I was wondering if there were a way to iterate through these folders in chunks.
Here's a picture of my directory tree
I'd want to go through all the files in my 2010A folder, then 2010B folder, then move to 2011A and 2011B etc..
My goal is to amend my current script, which only works for a single folder, so that it flows like this:
Start: ROOT FOLDER >
2010 > 2010A >
output to csv> re-start loop >
2010B > append csv after the last row
re-start loop > 2011 > 2011A >
append csv after the last row > and so on...
Is this possible?
Here's my code, it currently works if I run it on a single folder containing my txt files, e.g., for the 2010A folder:
import re
import pandas as pd
import os
from collections import Counter
#get file list in the folder
folder = r'root_folder\2010\2010A'
filelist = os.listdir(folder)
dict1 = {}
#open and read files, store into dictionary
for file in filelist:
    with open(os.path.join(folder, file)) as f:
        items = f.read()
    dict1[file] = items
#create filter for specific words
filter = [ "cat", "dog", "elephant", "fowl"]
dict2 = {}
# count occurrence of words in each file
for k, v in dict1.items():
    new = []
    for i in filter:
        new.extend(re.findall(r"{}".format(i),v))
    dict2[k] = dict(Counter(new))
dict3 ={}
# total length of each file's text (len(v) counts characters), store in separate dictionary
dict3 = {k: {'total':len(v)} for k,v in dict1.items()}
join_dict = {}
#join both dictionaries
join_dict = {k:{**dict2[k], **dict3[k]} for k in dict1}
#convert to pandas dataframe
df = pd.DataFrame.from_dict(join_dict, orient='index').fillna(0).astype(int)
#output to csv
df.to_csv(r'path\output.csv',index = True, header=True)
I have a feeling I need to replace:
for file in filelist:
with for (root,dirs,files) in os.walk(r'root_folder', topdown=True):
But I'm not exactly sure how, since I'm quite new to coding and python in general.

You can use glob to get the list of files in all subfolders like this (recursive=True only takes effect when the pattern contains **):
import glob
files = glob.glob('root_folder\\**\\*.txt', recursive=True)
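For the folder-by-folder flow described in the question, here is a minimal os.walk sketch. It assumes the same four filter words, writes with the standard csv module instead of pandas to stay dependency-free, and uses illustrative function names (count_folder, walk_and_append) and an illustrative column layout that are not from the original script.

```python
import csv
import os
import re

WORDS = ["cat", "dog", "elephant", "fowl"]  # the filter words from the question


def count_folder(folder, filenames):
    """Return one row per .txt file: path, count of each word, text length."""
    rows = []
    for name in sorted(filenames):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(folder, name)
        with open(path) as fh:
            text = fh.read()
        counts = [len(re.findall(re.escape(w), text)) for w in WORDS]
        rows.append([path] + counts + [len(text)])
    return rows


def walk_and_append(root, out_csv):
    """Visit each folder under `root` in sorted order, appending its rows to out_csv."""
    with open(out_csv, "w", newline="") as out:
        csv.writer(out).writerow(["file"] + WORDS + ["total"])
    for folder, dirs, files in os.walk(root, topdown=True):
        dirs.sort()  # sorting in place makes os.walk visit 2010A before 2010B, and so on
        rows = count_folder(folder, files)
        if rows:
            # Append this folder's rows after the last row already written.
            with open(out_csv, "a", newline="") as out:
                csv.writer(out).writerows(rows)
```

Each folder is finished and written before the next one is opened, so only one folder's text is held in memory at a time. The output file is best kept outside root_folder, or at least given a name that does not end in .txt.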

Parse filename information into multiple columns in the concatenated csv file

I have multiple csv files in a folder and each has a unique file name such as W10N1_RTO_T0_1294_TL_IV_Curve.csv. I would like to concatenate all files together and create multiple columns based on the filename information. For example, W10N1 is one column called DieID.
I am a beginner at programming and Python, and I couldn't figure out how to do this easily.
import os
import glob
import pandas as pd
import csv
os.chdir('filepath')
extension='csv'
all_filenames=[i for i in glob.glob('*.{}'.format(extension))]
combined_csv=pd.concat([pd.read_csv(f) for f in all_filenames])
combined_csv.to_csv('combined_csv.csv',index=False)

import os
os.listdir("your_target_directory")
will return a list of all files and directories in "your_target_directory".
Then it is just string manipulation, e.g.
>>> x = 'blue_red_green'
>>> x.split("_")
['blue', 'red', 'green']
>>>
>>> a, b, c = x.split("_")
>>> a
'blue'
>>> b
'red'
>>> c
'green'
Also split on "." first to remove the .csv extension.
Finally, create a CSV file, which can use any separator you want:
f = open("yourfancyname.csv", "w+")
# i is assumed to be a counter defined elsewhere, e.g. a loop index
f.write("DieID You_fancy_other_IDs also_if_u_want_variable_use_this_%d\r\n" % (i+1))
f.close()
EZ as A B C
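The split-based idea above can be wrapped in a small helper that turns a file name into labelled fields. This is only a sketch: the question names just the first field (DieID), so any further labels passed in names are the caller's own choice, and the function name fields_from_filename is hypothetical.

```python
from pathlib import Path


def fields_from_filename(filename, names=("DieID",)):
    """Split the file name (without extension) on "_" and label the leading parts."""
    parts = Path(filename).stem.split("_")
    # zip stops at the shorter sequence, so unlabelled trailing parts are ignored.
    return dict(zip(names, parts))
```

For W10N1_RTO_T0_1294_TL_IV_Curve.csv this gives {'DieID': 'W10N1'}. In the question's pandas code, pd.read_csv(f).assign(**fields_from_filename(f)) inside the list comprehension would likely be enough to attach these values as columns before pd.concat.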

Extracting file names from a list using single line for loop - python

I am trying to see if I can extract the file names from an os.listdir() output, omitting the '.csv' part, in a single-line for loop.
For example, my list of file names looks like this:
files = ['OPS020.csv','OPS340.csv','OPS230.csv','OPS349.csv']
All I could come up with was this:
file_names = [f.split('.') for f in files]
file_names = [f[0] for f in file_names]
Is there a more elegant and shorter way to do this?
The output I'm expecting is:
file_names : ['OPS020','OPS340','OPS230','OPS349']

I guess, something like this would work.
from os import path
files = ['OPS020.csv','OPS340.csv','OPS230.csv','OPS349.csv']
filenames = [path.splitext(x)[0] for x in files]
Docs: os.path.splitext
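An equivalent one-liner, offered here as an alternative rather than as part of the original answer, uses pathlib.Path.stem:

```python
from pathlib import Path

files = ['OPS020.csv', 'OPS340.csv', 'OPS230.csv', 'OPS349.csv']
# Path.stem is the final path component without its last suffix.
file_names = [Path(f).stem for f in files]
print(file_names)  # ['OPS020', 'OPS340', 'OPS230', 'OPS349']
```

Both splitext and stem remove only the last extension, so a name like archive.tar.gz becomes archive.tar, whereas f.split('.')[0] would cut it down to archive.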
