Adding attributes to large datasets using command line tools

I have an extremely large dataset (approx. 150 MB; 500 targets; 700,000+ attributes). I need to add one attribute to the end of each file.
The data file that I am working with has the following structure:
#relation 'filename'
#attribute "place" string
#attribute "institution" string
#attribute "food" string
#attribute "book" string
#data
3.8,6,0,0,church
86.3,0,63.1,0,man
0,0,0,37,woman
I need to add one attribute of information to each of the rows following #data.
However, due to the sheer number of attributes, I cannot open and modify the data in a text editor.
The attribute that I need to include is in a separate tab-separated file that has the following structure:
church 1
man 1
woman 0
The desired result would have the data set look like this:
#relation 'filename'
#attribute "place" string
#attribute "institution" string
#attribute "food" string
#attribute "book" string
#data
3.8,6,0,0,church,1
86.3,0,63.1,0,man,1
0,0,0,37,woman,0
The command would need to match the last field of each line after #data against the first column of the second file and, when there is a match, append the corresponding 0 or 1.
I have been searching for a solution to this, and my searches have mostly turned up answers pointing toward using a text editor. As I mentioned earlier, the problem with text editors is not necessarily opening the file (UltraEdit, for instance, can for the most part handle a file of this size); it is manually inserting one attribute after more than 700,000 attributes, which is an extremely time-consuming task.
So, I ask the community: is what I need to do possible using a command-line tool (awk, grep, etc.)?

Python is great because it's installed by default on a lot of POSIX-based systems :)
Now some caveats:
this is simple Python intended for you to learn as you go, so it could be much more optimized
this will read the entire file into memory while processing, so if your file runs into the GB range, it's going to tax your machine a bit (a streaming variant is sketched after the example output below)
I recommend throwing in some print statements, or stepping through the program with the Python debugger, if you want to know what's going on.
Here's what I came up with:
lookup = {}
output_list = []

# build a lookup table from the lookup file
with open('lookup.csv', 'r') as lookup_file:
    rows = lookup_file.readlines()
    for row in rows:
        key, value = row.split()
        lookup[key] = value

# loop through the big file and add the values
with open('input-big-data.txt', 'r') as input_file:
    rows = input_file.readlines()
    target_zone = False
    for row in rows:
        # keep a copy of every row
        output_for_this_row = row
        # skip the normal attribute rows
        if row.startswith('#'):
            target_zone = False
        # check to see if we are in the 'target zone'
        if row.startswith('#data'):
            target_zone = True
        # start parsing the rows, but not if they have the attribute flag
        if target_zone and not row.startswith('#'):
            # do your data processing here
            # strip to clobber the newline, then break it into pieces
            row_list = row.strip().split(',')
            # grab the last item
            lookup_key = row_list[-1].strip()
            # append the looked-up value for that last item
            row_list.append(lookup[lookup_key])
            # put the row back in its original shape
            output_for_this_row = ",".join(row_list) + "\n"
        output_list.append(output_for_this_row)

with open('output-big-data.txt', 'w') as output_file:
    for line in output_list:
        output_file.write(line)
I've commented pretty thoroughly throughout, so it should be pretty self-explanatory.
From the files in your question, I've named them in order: input-big-data.txt, lookup.csv, and output-big-data.txt.
Here's the output from my example:
#relation 'filename'
#attribute "place" string
#attribute "institution" string
#attribute "food" string
#attribute "book" string
#data
3.8,6,0,0,church,1
86.3,0,63.1,0,man,1
0,0,0,37,woman,0
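If memory becomes a concern, the same logic can be streamed line by line instead of buffering both files in lists; a minimal sketch of that variant (same file names as above):
lookup = {}
with open('lookup.csv', 'r') as lookup_file:
    for row in lookup_file:
        key, value = row.split()
        lookup[key] = value

with open('input-big-data.txt', 'r') as input_file, \
        open('output-big-data.txt', 'w') as output_file:
    target_zone = False
    for row in input_file:
        if row.startswith('#'):
            # header/attribute lines pass through untouched
            target_zone = row.startswith('#data')
            output_file.write(row)
            continue
        if target_zone:
            # append the looked-up value for the last field
            row_list = row.strip().split(',')
            row_list.append(lookup[row_list[-1]])
            row = ",".join(row_list) + "\n"
        output_file.write(row)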
Hth,
Aaron

As noted in the comments, Python can solve this problem quite simply, as demonstrated by the solution that I found and used from this blog: http://margerytech.blogspot.it/2011/03/python-appending-column-to-end-of-tab.html.
It is not a command-line solution (as I indicated I wanted in the question), but it solves the problem all the same.

Related

Getting KeyError for pandas df column name that exists

I have
data_combined = pd.read_csv("/path/to/creole_data/data_combined.csv", sep=";", encoding='cp1252')
So, when I try to access these rows:
data_combined = data_combined[(data_combined["wals_code"]=="abk") &(data_combined["wals_code"]=="aco")]
I get a KeyError 'wals_code'. I then checked my list of col names with
print(data_combined.columns.tolist())
and saw the col name 'wals_code' in the list. Here are the first few items from the printout.
[',"wals_code","Order of subject, object and verb","Order of genitive and noun","Order of adjective and noun","Order of adposition and NP","Order of demonstrative and noun","Order of numeral and noun","Order of RC and noun","Order of degree word and adjective"]
Anyone have a clue what is wrong with my file?
The problem is the delimiter you're using when reading the CSV file. With sep=';', you instruct read_csv to use semicolons (;) as the separator for columns (cells and column headers), but it appears from your columns printout that your CSV file actually uses commas (,).
If you look carefully, you'll notice that your columns printout actually displays a list containing one long string, not a list of individual strings representing the column names.
So, use sep=',' instead of sep=';' (or just omit it entirely, as ',' is the default value for sep):
data_combined = pd.read_csv("/path/to/creole_data/data_combined.csv", encoding='cp1252')
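To see the symptom in isolation: with the wrong separator, the entire header line is parsed as a single column name, so looking up any real column raises a KeyError. A small self-contained demonstration (made-up data, not your file):
import io
import pandas as pd

csv_text = "wals_code,feature\nabk,SOV\naco,VSO\n"

# wrong separator: the whole header becomes one column name
df_bad = pd.read_csv(io.StringIO(csv_text), sep=';')
print(df_bad.columns.tolist())    # ['wals_code,feature']

# default (comma) separator: two columns, as expected
df_good = pd.read_csv(io.StringIO(csv_text))
print(df_good.columns.tolist())   # ['wals_code', 'feature']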

I have one person in a dataframe that keeps showing up as \ufeff when I print to console

I have Python code that loads a group of exam results. Each exam is saved in its own CSV file.
import glob
import pandas as pd

files = glob.glob('Exam *.csv')
frame = []
files1 = glob.glob('Exam 1*.csv')
for file in files:
    frame.append(pd.read_csv(file, index_col=[0], encoding='utf-8-sig'))
for file in files1:
    frame.append(pd.read_csv(file, index_col=[0], encoding='utf-8-sig'))
There is one person in the whole dataframe whose name column shows up as
\ufeffStudents Name
It happens for every single exam. I tried using the encoding argument but that's not fixing the issue. I am out of ideas. Anyone else have anything?
That character is the BOM, or "Byte Order Mark."
There are several ways to resolve it.
First, I suggest adding the engine parameter (for example, engine='python') to pd.read_csv() when reading the CSV files:
pd.read_csv(file, index_col=[0], engine='python', encoding='utf-8-sig')
Secondly, you can simply remove it by replacing it with an empty string (''):
df['student_name'] = df['student_name'].apply(lambda x: x.replace("\ufeff", ""))
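A vectorized alternative to the apply call, using pandas' built-in string methods (same hypothetical column name as above):
# strip the BOM from every value in the column at once
df['student_name'] = df['student_name'].str.replace('\ufeff', '', regex=False)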

Automating The Boring Stuff With Python - Chapter 8 - Exercise - Regex Search

I'm trying to complete the exercise for Chapter 8, which takes a user-supplied regular expression and uses it to search each string in each text file in a folder.
I keep getting the error:
AttributeError: 'NoneType' object has no attribute 'group'
The code is here:
import os, glob, re

os.chdir("C:\Automating The Boring Stuff With Python\Chapter 8 - \
Reading and Writing Files\Practice Projects\RegexSearchTextFiles")

userRegex = re.compile(input('Enter your Regex expression :'))

for textFile in glob.glob("*.txt"):
    currentFile = open(textFile)  # open the text file and assign it to a file object
    textCurrentFile = currentFile.read()  # read the contents of the text file and assign to a variable
    print(textCurrentFile)
    #print(type(textCurrentFile))
    searchedText = userRegex.search(textCurrentFile)
    searchedText.group()
When I try this individually in the IDLE shell it works:
textCurrentFile = "What is life like for those left behind when the last foreign troops flew out of Afghanistan? Four people from cities and provinces around the country told the BBC they had lost basic freedoms and were struggling to survive."
>>> userRegex = re.compile(input('Enter the your Regex expression :'))
Enter the your Regex expression :troops
>>> searchedText = userRegex.search(textCurrentFile)
>>> searchedText.group()
'troops'
But I can't seem to make it work in the code when I run it. I'm really confused.
Thanks
Since you are just looping across all the .txt files, there could be files that don't have the word "troops" in them. To prove this, don't call .group(); just perform:
print(textFile, textCurrentFile, searchedText)
If you see that searchedText is None, then that means the contents of textFile (which is textCurrentFile) don't contain the word "troops".
You could either:
Add the word troops to all the .txt files.
Only select the target .txt files, not all of them.
Check first if the match is found before accessing .group():
print(searchedText.group() if searchedText else None)
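Putting that guard into the original loop, a minimal sketch (also using with so each file is closed after reading):
import glob, re

userRegex = re.compile(input('Enter your Regex expression :'))
for textFile in glob.glob("*.txt"):
    with open(textFile) as currentFile:
        textCurrentFile = currentFile.read()
    searchedText = userRegex.search(textCurrentFile)
    # only call .group() when a match was actually found
    if searchedText:
        print(textFile, searchedText.group())
    else:
        print(textFile, None)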

Iterate over images with pattern

I have thousands of images which are labeled IMG_####_0 where the first image is IMG_0001_0.png the 22nd is IMG_0022_0.png, the 100th is IMG_0100_0.png etc. I want to perform some tasks by iterating over them.
I used fnames = ['IMG_{}_0.png'.format(i) for i in range(150)] to iterate over the first 150 images, but I get the error FileNotFoundError: [Errno 2] No such file or directory: '/Users/me/images/IMG_0_0.png', which suggests that this is not the correct way to do it. Any ideas on how to capture this pattern while iterating over the specified number of images, i.e. in my case from IMG_0001_0.png to IMG_0150_0.png?
fnames = ['IMG_{0:04d}_0.png'.format(i) for i in range(1, 151)]
print(fnames)
for fn in fnames:
    try:
        with open(fn, "r") as reader:
            # do smth here
            pass
    except (FileNotFoundError, OSError) as err:
        print(err)
Output:
['IMG_0001_0.png', 'IMG_0002_0.png', ..., 'IMG_0149_0.png', 'IMG_0150_0.png']
Documentation: str.format() and the Format Specification Mini-Language (https://docs.python.org/3/library/string.html#format-specification-mini-language).
'{:04d}' # format the given parameter as a decimal integer, zero-filled to 4 digits
The other way to do it would be to create a normal string and fill it with 0:
print(str(22).zfill(10))
Output:
0000000022
But for your case, the format mini-language makes more sense.
You need to use a format pattern to get the format you're looking for. You don't just want the integer converted to a string; you specifically want it to always be a string with four digits, using leading 0s to fill in any empty space. The best way to do this is:
'IMG_{:04d}_0.png'.format(i)
instead of your current format string. The result looks like this:
In [2]: 'IMG_{:04d}_0.png'.format(3)
Out[2]: 'IMG_0003_0.png'
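On Python 3.6+, the same format specification also works inside an f-string:
fnames = [f'IMG_{i:04d}_0.png' for i in range(1, 151)]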
Generating a list of possible names and testing whether each one exists is a slow and clumsy way to iterate over files.
Take a look at https://docs.python.org/3/library/glob.html
So, something like:
from glob import iglob
filenames = iglob("/path/to/folder/IMG_*_0.png")
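Note that glob and iglob return files in arbitrary, OS-dependent order; if the images need to be processed in sequence, sort the result:
from glob import glob

# sorted() gives a stable lexicographic order, which matches the numeric
# order here because the numbers are zero-padded to 4 digits
filenames = sorted(glob("/path/to/folder/IMG_*_0.png"))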

str.format places last variable first in print

The purpose of this script is to parse a text file (sys.argv[1]), extract certain strings, and print them in columns. I start by printing the header. Then I open the file, and scan through it, line by line. I make sure that the line has a specific start or contains a specific string, then I use regex to extract the specific value.
The matching and extraction work fine.
My final print statement doesn't work properly.
import re
import sys

print("{}\t{}\t{}\t{}\t{}".format("#query", "target", "e-value",
                                  "identity(%)", "score"))
with open(sys.argv[1], 'r') as blastR:
    for line in blastR:
        if line.startswith("Query="):
            queryIDMatch = re.match('Query= (([^ ])+)', line)
            queryID = queryIDMatch.group(1)
            queryID.rstrip
        if line[0] == '>':
            targetMatch = re.match('> (([^ ])+)', line)
            target = targetMatch.group(1)
            target.rstrip
        if "Score = " in line:
            eValue = re.search(r'Expect = (([^ ])+)', line)
            trueEvalue = eValue.group(1)
            trueEvalue = trueEvalue[:-1]
            trueEvalue.rstrip()
            print('{0}\t{1}\t{2}'.format(queryID, target, trueEvalue), end='')
The problem occurs when I try to print the columns. When I print the first 2 columns, it works as expected (except that it's still printing new lines):
#query target e-value identity(%) score
YAL002W Paxin1_129011
YAL003W Paxin1_167503
YAL005C Paxin1_162475
YAL005C Paxin1_167442
The 3rd column is a number in scientific notation, like 2e-34.
But when I add the 3rd column, eValue, it breaks down:
#query target e-value identity(%) score
YAL002W Paxin1_129011
4e-43YAL003W Paxin1_167503
1e-55YAL005C Paxin1_162475
0.0YAL005C Paxin1_167442
0.0YAL005C Paxin1_73182
I have removed all newlines, as far as I know, using the rstrip() method.
At least three problems:
1) queryID.rstrip and target.rstrip are missing the closing ().
2) Something like trueEvalue.rstrip() doesn't mutate the string; you would need
trueEvalue = trueEvalue.rstrip()
if you want to keep the change.
3) This might be a problem, but without seeing your data I can't be 100% sure. The r in rstrip stands for "right". If trueEvalue is 4e-43\n, then it is true that trueEvalue.rstrip() would be free of newlines. But the problem is that your values seem to be something like \n4e-43. If you simply use .strip(), then newlines will be removed from either side.
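Putting those fixes together, the three extraction sites would look like this (a sketch of just the changed lines):
# inside the Query= branch: note the () and the reassignment
queryID = queryIDMatch.group(1).strip()
# inside the '>' branch
target = targetMatch.group(1).strip()
# inside the Score branch: strip() removes newlines on either side
trueEvalue = eValue.group(1).strip()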
